Dataset creation
The data we use in this chapter can be downloaded from its original source online or from GitHub at this link: https://github.com/PacktPublishing/Advanced-Machine-Learning-with-R/tree/master/Chapter05.
I found this data on a website dedicated to providing datasets for support vector machine analysis. Follow this link to find numerous sets for testing your learning methods: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
The authors ask that users cite their work, a request I'm happy to honor:
Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1--27:27, 2011
The data we're using is named a5a; it's the training portion, consisting of 6,414 observations. This is a dataset large enough to facilitate learning without causing computational speed issues. Also, when doing KNN or SVM, you need to center and scale your data, or normalize it to the [0, 1] range, if the input features are on different scales. This data's input features take only two values, 0 or 1, so we can forgo any normalization efforts.
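For datasets that do need it, here's a minimal sketch of both approaches using base R functions on made-up toy data (the toy values and column names are purely illustrative, not from a5a):

```r
> # toy data with features on very different scales (illustrative)
> toy <- data.frame(age = c(25, 40, 33, 58),
+                   income = c(30000, 85000, 52000, 61000))
> # center to mean 0 and scale to unit variance
> toy_scaled <- as.data.frame(scale(toy))
> # or min-max normalize each feature to the [0, 1] range
> toy_norm <- as.data.frame(lapply(toy, function(v) (v - min(v)) / (max(v) - min(v))))
```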
I'll show you how to load this data into R, and you can replicate that process on any data you desire to use.
While we're at it, we may as well load all of the packages needed for this chapter:
> library(magrittr)
> install.packages("ggthemes")
> install.packages("caret")
> install.packages("classifierplots")
> install.packages("DataExplorer")
> install.packages("e1071")
> install.packages("InformationValue")
> install.packages("kknn")
> install.packages("Matrix")
> install.packages("Metrics")
> install.packages("plm")
> install.packages("ROCR")
> install.packages("tidyverse")
> options(scipen=999)
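As an aside, if you already have some of these packages installed, a common pattern (a sketch of my own, not the chapter's code) installs only what's missing:

```r
> pkgs <- c("ggthemes", "caret", "classifierplots", "DataExplorer",
+           "e1071", "InformationValue", "kknn", "Matrix", "Metrics",
+           "plm", "ROCR", "tidyverse")
> # install only the packages not already present
> install.packages(setdiff(pkgs, rownames(installed.packages())))
```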
It's a simple matter to access this data using R's download.file() function. You need to provide the link and give the file a name:
> download.file('https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a5a', 'chap5')
What's rather interesting now is that you can put this downloaded file into a usable format with a function from the e1071 library created explicitly for data in this sparse format:
> df <- e1071::read.matrix.csr("chap5")
The df object is now a list containing the input features and the response labels, the latter structured as a factor with two levels (-1 and +1). This list is what's saved on GitHub as an R data file, like this:
> saveRDS(df, file = "chapter05")
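A quick sanity check of df's structure can't hurt; based on the code that follows, str() at the top level should show the two components we'll use next, the sparse feature matrix in $x and the factor of labels in $y:

```r
> # top-level structure only; components are accessed as df$x and df$y
> str(df, max.level = 1)
```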
Let's look at how to turn this list into something usable, starting by loading it into your environment:
> df <- readRDS("chapter05")
We'll create the classification labels in an object called y, and turn -1 into 0, and +1 into 1:
> y <- df$y
> y <- ifelse(y == "+1", 1, 0)
> table(y)
y
0 1
4845 1569
The table shows us that just under 25% of the labels are considered an event. What event? It doesn't matter for our purposes, so we can move on and produce a dataframe of the predictors called x. I tried a number of ways to put the sparse matrix into a dataframe, and the following code, using a function from the Matrix package, seems the easiest:
> x <- Matrix::as.matrix(df$x)
> x <- as.data.frame(x)
> dim(x)
[1] 6414 122
We now have our dataframe of 6,414 observations and 122 input features. Next, we'll create train/test sets and explore the features.
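As a preview of that step, a stratified split with caret might look something like the following sketch (the 70/30 proportion and the seed are my own illustrative choices, not necessarily those used later):

```r
> set.seed(123)                                    # illustrative seed
> idx <- caret::createDataPartition(y, p = 0.7, list = FALSE)
> train_x <- x[idx, ]; train_y <- y[idx]           # 70% for training
> test_x  <- x[-idx, ]; test_y  <- y[-idx]         # 30% held out for testing
```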