The binary classification model
The code from the previous section creates a new file called predict.csv in the dunnhumby folder. This dataset has a single row for each customer, with a 0/1 field indicating whether they visited in the last two weeks and predictor variables based on sales data from before those two weeks. Now we can proceed to build some machine learning models. The Chapter4/binary_predict.R file contains the code for our first prediction task, binary classification. The first part of the code loads the data and creates an array of predictor variables by including all columns except the customer ID, the binary classification response variable, and the regression response variable. The feature columns are all numeric fields with heavily right-skewed distributions, so we apply a log transformation to them. We add 0.01 before taking the log so that zero values do not produce an infinite result (log(0) = -Inf).
The following plot shows the data before transformation, on the left, and the data after transformation, on the right:
The large bar on the left of the second plot is where the original field was zero (log(0 + 0.01) = -4.6). The following code loads the data, performs the log transformation, and creates the preceding plots:
library(readr)
library(ggplot2)

set.seed(42)
fileName <- "../dunnhumby/predict.csv"
dfData <- read_csv(fileName,
                   col_types = cols(
                     .default = col_double(),
                     CUST_CODE = col_character(),
                     Y_categ = col_integer())
                   )
nobs <- nrow(dfData)
# 90/10 train/test split
train <- sample(nobs, 0.9*nobs)
test <- setdiff(seq_len(nobs), train)
predictorCols <- colnames(dfData)[!(colnames(dfData) %in% c("CUST_CODE","Y_numeric","Y_categ"))]

# data is right-skewed, apply log transformation
qplot(dfData$Y_numeric, geom="histogram", binwidth=10,
      main="Y value distribution", xlab="Spend") +
  theme(plot.title = element_text(hjust = 0.5))
dfData[, c("Y_numeric",predictorCols)] <- log(0.01+dfData[, c("Y_numeric",predictorCols)])
qplot(dfData$Y_numeric, geom="histogram", binwidth=0.5,
      main="log(Y) value distribution", xlab="Spend") +
  theme(plot.title = element_text(hjust = 0.5))

trainData <- dfData[train, c(predictorCols)]
testData <- dfData[test, c(predictorCols)]
trainData$Y_categ <- dfData[train, "Y_categ"]$Y_categ
testData$Y_categ <- dfData[test, "Y_categ"]$Y_categ
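Before fitting any models, it is worth a quick sanity check that the train/test split and the class balance look reasonable. The following lines are a small illustrative addition using the objects created above, not part of the original script:

# quick sanity check of the split sizes and class balance (illustrative only)
nrow(trainData)
nrow(testData)
prop.table(table(trainData$Y_categ))
prop.table(table(testData$Y_categ))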
Before we train a deep learning model, we train three machine learning models on the data as a benchmark: a logistic regression model, a Random Forest model, and an XGBoost model. The following code trains and evaluates these three models:
# Logistic regression model
logReg <- glm(Y_categ ~ ., data=trainData, family=binomial(link="logit"))
pr <- as.vector(ifelse(predict(logReg, type="response",
                               testData) > 0.5, "1", "0"))
# Generate the confusion matrix showing counts.
t <- table(dfData[test, c(predictorCols, "Y_categ")]$"Y_categ", pr,
           dnn=c("Actual", "Predicted"))
acc <- round(100.0*sum(diag(t))/length(test), 2)
print(t)
      Predicted
Actual   0   1
     0 130  42
     1  48 174
print(sprintf(" Logistic regression accuracy = %1.2f%%",acc))
[1] " Logistic regression accuracy = 77.16%"
rm(t,pr,acc)
# Random Forest model
rf <- randomForest::randomForest(as.factor(Y_categ) ~ .,
                                 data=trainData,
                                 na.action=randomForest::na.roughfix)
pr <- predict(rf, newdata=testData, type="class")
# Generate the confusion matrix showing counts.
t <- table(dfData[test, c(predictorCols, "Y_categ")]$Y_categ, pr,
           dnn=c("Actual", "Predicted"))
acc <- round(100.0*sum(diag(t))/length(test), 2)
print(t)
      Predicted
Actual   0   1
     0 124  48
     1  30 192
print(sprintf(" Random Forest accuracy = %1.2f%%",acc))
[1] " Random Forest accuracy = 80.20%"
rm(t,pr,acc)
# XGBoost model
library(xgboost)
xgb <- xgboost(data=data.matrix(trainData[,predictorCols]),
               label=trainData[,"Y_categ"]$Y_categ,
               nrounds=75, objective="binary:logistic")
pr <- as.vector(ifelse(
  predict(xgb, data.matrix(testData[, predictorCols])) > 0.5, "1", "0"))
t <- table(dfData[test, c(predictorCols, "Y_categ")]$"Y_categ", pr,
           dnn=c("Actual", "Predicted"))
acc <- round(100.0*sum(diag(t))/length(test), 2)
print(t)
      Predicted
Actual   0   1
     0 125  47
     1  44 178
print(sprintf(" XGBoost accuracy = %1.2f%%",acc))
[1] " XGBoost accuracy = 76.90%"
rm(t,pr,acc)
We create logistic regression, Random Forest, and XGBoost models for a number of reasons. Firstly, most of the work was in preparing the data, so training them is trivial. Secondly, they give us a benchmark to compare our deep learning model against. Thirdly, if there were a problem in the data-preparation tasks, these machine learning algorithms would highlight it more quickly, because they train much faster than a deep learning model. In this case we only have a few thousand records, so these algorithms run easily on the full data. If the data were too large for them, I would consider taking a smaller sample and running the benchmark tasks on that smaller sample. There are many machine learning algorithms to choose from, but I used these as benchmarks for the following reasons:
- Logistic regression is a basic model and is always a good benchmark to use
- Random Forest is known to train well using the default parameters and is robust to overfitting and to correlated variables (which we have here; a quick check is sketched after this list)
- XGBoost is consistently rated as one of the best-performing machine learning algorithms
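As a quick way to see the correlation mentioned in the second bullet point, you can look at the pairwise correlations between some of the (log-transformed) predictor columns. The following lines are a small illustrative addition, not part of the original script; the exact column names depend on the generated predict.csv file:

# pairwise correlations between some predictor columns (illustrative only)
someCols <- head(predictorCols, 10)
corMat <- cor(trainData[, someCols], use="pairwise.complete.obs")
summary(corMat[upper.tri(corMat)])   # distribution of the pairwise correlations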
All three algorithms achieve similar accuracy; the highest is the Random Forest model, at 80.2%. We now know that this dataset is suitable for prediction tasks, and we have a benchmark to compare our deep learning model against.
Now we will build a deep learning model using MXNet:
require(mxnet)
# MXNet expects matrices
train_X <- data.matrix(trainData[, predictorCols])
test_X <- data.matrix(testData[, predictorCols])
train_Y <- trainData$Y_categ
# hyper-parameters
num_hidden <- c(128,64,32)
drop_out <- c(0.2,0.2,0.2)
wd <- 0.00001
lr <- 0.03
num_epochs <- 40
activ <- "relu"
# create our model architecture
# using the hyper-parameters defined above
data <- mx.symbol.Variable("data")
fc1 <- mx.symbol.FullyConnected(data, name="fc1", num_hidden=num_hidden[1])
act1 <- mx.symbol.Activation(fc1, name="activ1", act_type=activ)
drop1 <- mx.symbol.Dropout(data=act1,p=drop_out[1])
fc2 <- mx.symbol.FullyConnected(drop1, name="fc2", num_hidden=num_hidden[2])
act2 <- mx.symbol.Activation(fc2, name="activ2", act_type=activ)
drop2 <- mx.symbol.Dropout(data=act2,p=drop_out[2])
fc3 <- mx.symbol.FullyConnected(drop2, name="fc3", num_hidden=num_hidden[3])
act3 <- mx.symbol.Activation(fc3, name="activ3", act_type=activ)
drop3 <- mx.symbol.Dropout(data=act3,p=drop_out[3])
fc4 <- mx.symbol.FullyConnected(drop3, name="fc4", num_hidden=2)
softmax <- mx.symbol.SoftmaxOutput(fc4, name="sm")
# run on cpu, change to 'devices <- mx.gpu()'
# if you have a suitable GPU card
devices <- mx.cpu()
mx.set.seed(0)
tic <- proc.time()
# This actually trains the model
model <- mx.model.FeedForward.create(softmax, X = train_X, y = train_Y,
                                     ctx = devices, num.round = num_epochs,
                                     learning.rate = lr, momentum = 0.9,
                                     eval.metric = mx.metric.accuracy,
                                     initializer = mx.init.uniform(0.1),
                                     wd = wd,
                                     epoch.end.callback = mx.callback.log.train.metric(1))
print(proc.time() - tic)
   user  system elapsed
   9.23    4.65    4.37
pr <- predict(model, test_X)
pred.label <- max.col(t(pr)) - 1
t <- table(data.frame(cbind(testData[,"Y_categ"]$Y_categ, pred.label)),
           dnn=c("Actual", "Predicted"))
acc <- round(100.0*sum(diag(t))/length(test), 2)
print(t)
      Predicted
Actual   0   1
     0 136  36
     1  54 168
print(sprintf(" Deep Learning Model accuracy = %1.2f%%",acc))
[1] " Deep Learning Model accuracy = 77.16%"
rm(t,pr,acc)
rm(data,fc1,act1,fc2,act2,fc3,act3,fc4,softmax,model)
The deep learning model achieved 77.16% accuracy on the test data, which is beaten only by the Random Forest model. This shows that a deep learning model can be competitive with the best machine learning algorithms. It also shows that deep learning models do not always beat other machine learning algorithms on classification tasks. We used these models to provide a benchmark so that we would know whether our deep learning model was getting decent results; they give us confidence that our deep learning model is competitive.
Our deep learning model uses 20% dropout in each layer and weight decay for regularization. Without dropout, the model overfit significantly. This is probably because the features are highly correlated: our columns are the spend in various departments, and if one column is the spend on a type of bread and another is the spend on a type of milk, these tend to move together, because a customer who has more transactions and spends more overall is likely to buy both.
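If you want to see this overfitting for yourself, one option is to train the same architecture twice, once with the dropout rates used above and once with dropout switched off, and compare training accuracy with test accuracy. The following sketch does that using two small hypothetical helper functions (build_net and accuracy); it reuses the hyper-parameters, data matrices, and devices defined earlier and is not part of the original script:

# illustrative experiment (not in the original script): train the same
# architecture with and without dropout and compare train/test accuracy
build_net <- function(drop) {
  net <- mx.symbol.Variable("data")
  for (i in seq_along(num_hidden)) {
    net <- mx.symbol.FullyConnected(net, num_hidden=num_hidden[i])
    net <- mx.symbol.Activation(net, act_type=activ)
    net <- mx.symbol.Dropout(net, p=drop[i])
  }
  net <- mx.symbol.FullyConnected(net, num_hidden=2)
  mx.symbol.SoftmaxOutput(net, name="sm")
}

accuracy <- function(model, X, y) {
  pr <- predict(model, X)
  mean((max.col(t(pr)) - 1) == y)
}

for (drop in list(c(0.2,0.2,0.2), c(0.0,0.0,0.0))) {
  mx.set.seed(0)
  m <- mx.model.FeedForward.create(build_net(drop), X = train_X, y = train_Y,
                                   ctx = devices, num.round = num_epochs,
                                   learning.rate = lr, momentum = 0.9,
                                   eval.metric = mx.metric.accuracy,
                                   initializer = mx.init.uniform(0.1),
                                   wd = wd,
                                   epoch.end.callback = mx.callback.log.train.metric(1))
  print(sprintf("dropout = %0.1f: train accuracy = %1.4f, test accuracy = %1.4f",
                drop[1], accuracy(m, train_X, train_Y),
                accuracy(m, test_X, testData$Y_categ)))
}

A large gap between the training and test accuracy for the no-dropout run, and a smaller gap for the run with dropout, is the pattern described above.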