Improving the binary classification model
This section builds on the earlier binary classification task and looks to increase the accuracy of that model. The first thing we can do to improve the model is to use more data, 100 times more data in fact! We will download the entire dataset, which is over 4 GB of data in zip files and 40 GB of data when the files are unzipped. Go back to the download link (https://www.dunnhumby.com/sourcefiles), select Let's Get Sort-of-Real again, and download all the files for the Full dataset. There are nine files to download, and the CSV files should be unzipped into the dunnhumby/in folder. Remember to check that the CSV files are in this folder and not in a subfolder. You then need to run the code in Chapter4/prepare_data.R again. When this completes, the predict.csv file should have 390,000 records.
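If you want to sanity-check the setup from within R before going further, a snippet along the following lines will do it. This is a minimal sketch: it assumes your working directory is the parent of the dunnhumby folder and that prepare_data.R writes predict.csv into that folder, so adjust the paths if your layout differs:
# the unzipped CSV files should sit directly in dunnhumby/in,
# not in a subfolder
list.files("dunnhumby/in", pattern = "\\.csv$")
# after re-running Chapter4/prepare_data.R, count the records
# (the path is an assumption; point it at wherever predict.csv is written)
nrow(read.csv("dunnhumby/predict.csv"))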
The code for this section is in the Chapter4/binary_predict2.R script. Since we have 100 times more data, we can build a more complicated model: the new model adds an extra layer and more nodes to the hidden layers. We have decreased the amount of regularization and the learning rate, and we have increased the number of epochs. Here is the code in Chapter4/binary_predict2.R that constructs and trains the deep learning model. We have not included the boilerplate code that loads and prepares the data, as that has not changed from the original script:
# hyper-parameters
num_hidden <- c(256, 128, 64, 32)
drop_out <- c(0.2, 0.2, 0.1, 0.1)
wd <- 0.0
lr <- 0.03
num_epochs <- 50
activ <- "relu"
# create our model architecture
# using the hyper-parameters defined above
data <- mx.symbol.Variable("data")
fc1 <- mx.symbol.FullyConnected(data, name="fc1", num_hidden=num_hidden[1])
act1 <- mx.symbol.Activation(fc1, name="activ1", act_type=activ)
drop1 <- mx.symbol.Dropout(data=act1, p=drop_out[1])
fc2 <- mx.symbol.FullyConnected(drop1, name="fc2", num_hidden=num_hidden[2])
act2 <- mx.symbol.Activation(fc2, name="activ2", act_type=activ)
drop2 <- mx.symbol.Dropout(data=act2, p=drop_out[2])
fc3 <- mx.symbol.FullyConnected(drop2, name="fc3", num_hidden=num_hidden[3])
act3 <- mx.symbol.Activation(fc3, name="activ3", act_type=activ)
drop3 <- mx.symbol.Dropout(data=act3, p=drop_out[3])
fc4 <- mx.symbol.FullyConnected(drop3, name="fc4", num_hidden=num_hidden[4])
act4 <- mx.symbol.Activation(fc4, name="activ4", act_type=activ)
drop4 <- mx.symbol.Dropout(data=act4, p=drop_out[4])
# output layer: two nodes, one per class, with a softmax
fc5 <- mx.symbol.FullyConnected(drop4, name="fc5", num_hidden=2)
softmax <- mx.symbol.SoftmaxOutput(fc5, name="sm")
# run on cpu, change to 'devices <- mx.gpu()'
# if you have a suitable GPU card
devices <- mx.cpu()
mx.set.seed(0)
tic <- proc.time()
# This actually trains the model
model <- mx.model.FeedForward.create(softmax, X = train_X, y = train_Y,
                                     ctx = devices, num.round = num_epochs,
                                     learning.rate = lr, momentum = 0.9,
                                     eval.metric = mx.metric.accuracy,
                                     initializer = mx.init.uniform(0.1),
                                     wd = wd,
                                     epoch.end.callback = mx.callback.log.train.metric(1))
print(proc.time() - tic)
   user  system elapsed
1919.75 1124.94  871.31
# predict returns class probabilities with one column per test case
pr <- predict(model, test_X)
pred.label <- max.col(t(pr)) - 1
# confusion matrix of actual versus predicted classes
t <- table(data.frame(cbind(testData[,"Y_categ"]$Y_categ, pred.label)),
           dnn=c("Actual", "Predicted"))
acc <- round(100.0*sum(diag(t))/length(test), 2)
print(t)
      Predicted
Actual     0     1
     0 10714  4756
     1  3870 19649
print(sprintf(" Deep Learning Model accuracy = %1.2f%%",acc))
[1] " Deep Learning Model accuracy = 77.88%"
The accuracy has increased from 77.16% for the earlier model to 77.88% for this model. This may not seem significant, but consider that the full dataset has almost 390,000 rows: an increase in accuracy of 0.72% represents about 2,808 customers that are now classified correctly. If each of these customers is worth $50, that is an additional $140,000 in revenue.
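As a quick sanity check, the arithmetic behind that estimate is straightforward:
# value of the accuracy gain, using the figures quoted above
n_customers <- 390000
acc_gain <- 0.7788 - 0.7716              # 0.72 percentage points
extra_correct <- n_customers * acc_gain  # about 2,808 customers
extra_correct * 50                       # about $140,000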
In general, as you add more data, your model should become more complicated in order to generalize across all the patterns in the data. We will cover more of this in Chapter 6, Tuning and Optimizing Models, but I would encourage you to experiment with the code in Chapter4/binary_predict.R. Try changing the hyper-parameters or adding more layers. Even a small improvement of 0.1-0.2% in accuracy is significant. If you manage to get over 78% accuracy on this dataset, consider it a good achievement.
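As a concrete starting point, one simple experiment is to swap in a different set of hyper-parameters and rerun the training code above. The values below are illustrative guesses, not tuned settings:
# an illustrative variation: wider layers, more dropout,
# a little weight decay, and a lower learning rate
num_hidden <- c(512, 256, 128, 64)
drop_out <- c(0.3, 0.3, 0.2, 0.2)
wd <- 0.0001
lr <- 0.01
num_epochs <- 50
activ <- "relu"
# then rebuild the symbols and call mx.model.FeedForward.create as before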
If you want to explore further, there are other methods to investigate. These involve making changes in how the data for the model is created. If you really want to stretch yourself, here are a few more ideas you can try:
- Our current features are a combination of department codes and weeks; we use the PROD_CODE_40 field as the department code. This field has only nine unique values, so for every week, only nine fields represent that data. If you use PROD_CODE_30, PROD_CODE_20, or PROD_CODE_10 instead, you will create many more features.
- In a similar manner, rather than using a combination of department codes and weeks, you could try department codes and days. This might create too many features, but I would consider doing it for the last 14 days before the cut-off date.
- Experiment with different methods of preparing the data. We use a log scale, which works well for our binary classification task but is not the best method for a regression task, as it does not produce normally distributed data. Try applying z-scaling and min-max standardization to the data. If you do this, you must ensure that the same transformation is applied correctly to the test data before evaluating the model (see the first sketch after this list).
- The training data uses the sales amount. You could change this to item quantities, or to the number of transactions an item appears in.
- You could create new features. One potentially powerful example would be to create fields based on the day of the week or the day of the month; for instance, we could create features for the spend amounts and the number of visits for each day of the week (see the second sketch after this list).
- We could create features based on the average size of a shopping basket, how frequently a customer visits, and so on.
- We could try a different model architecture that can take advantage of time-series data.
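To illustrate the scaling idea from the list above, here is a minimal sketch of z-scaling and min-max standardization. The important detail is that the scaling parameters are computed from the training data only and are then reused on the test data. It assumes train_X and test_X are the numeric matrices from the existing script; constant columns (with a standard deviation or range of zero) would need special handling to avoid division by zero:
# z-scaling: center and scale with training-set statistics only
col_means <- apply(train_X, 2, mean)
col_sds <- apply(train_X, 2, sd)
train_X_z <- sweep(sweep(train_X, 2, col_means, "-"), 2, col_sds, "/")
test_X_z <- sweep(sweep(test_X, 2, col_means, "-"), 2, col_sds, "/")
# min-max standardization to [0, 1], again with training-set ranges
col_mins <- apply(train_X, 2, min)
col_rngs <- apply(train_X, 2, max) - col_mins
train_X_mm <- sweep(sweep(train_X, 2, col_mins, "-"), 2, col_rngs, "/")
test_X_mm <- sweep(sweep(test_X, 2, col_mins, "-"), 2, col_rngs, "/")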
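The day-of-week idea could be sketched along the following lines. This assumes the raw transactions have been loaded into a data frame called trans (a hypothetical name) containing the CUST_CODE, SHOP_WEEKDAY, and SPEND fields from the source files:
# total spend per customer per day of week
spend_dow <- aggregate(SPEND ~ CUST_CODE + SHOP_WEEKDAY,
                       data = trans, FUN = sum)
# pivot to one row per customer with one spend column per weekday
dow_features <- reshape(spend_dow, idvar = "CUST_CODE",
                        timevar = "SHOP_WEEKDAY", direction = "wide")
dow_features[is.na(dow_features)] <- 0  # no spend on that weekday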
These are all things I would try if I were given this task as a work assignment. In traditional machine learning, adding more features often leads to problems, as most traditional machine learning algorithms struggle with high-dimensional data. Deep learning models can handle these cases, so there is usually no harm in adding more features.