Creating the model file using scikit-learn
This section explains how we are going to create the random forest model using scikit-learn and convert it into the .mlmodel file format that is compatible with Core ML. We are going to use the Breast Cancer dataset to create the model. The following is a Python program that builds a simple random forest model using scikit-learn and the Breast Cancer dataset; Core ML Tools then converts it into a Core ML-compatible model file. Let's go through the program in detail.
First, we need to import the required packages:
# importing required packages
import numpy as np
NumPy is the fundamental package for scientific computing with Python. It contains a powerful N-dimensional array object. In this program, NumPy arrays are used to store the dataset, which has 30 feature dimensions:
import pandas as pd
from pandas.core import series
Here, we are using pandas (https://pandas.pydata.org/pandas-docs/stable/10min.html), which is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Using pandas, we can create a data frame. You can think of a pandas DataFrame as an Excel sheet: every column has a heading, and the rows below it hold the data.
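To make this analogy concrete, here is a tiny, standalone illustration (it is not part of the main program, and the column names and values are only examples):
# a small DataFrame: named columns with rows of data, much like an Excel sheet
example = pd.DataFrame({'mean radius': [17.99, 20.57],
                        'mean area': [1001.0, 1326.0]})
print(example)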
Now, let's move on to understand the program written for solving the machine learning problem at hand:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import sklearn.datasets as ds
from sklearn.model_selection import train_test_split
The preceding lines import the sklearn packages. Now, we will load one of the built-in datasets that ship with the sklearn package:
dataset = ds.load_breast_cancer()
The preceding line loads the Breast Cancer dataset from the sklearn dataset package:
cancerdata = pd.DataFrame(dataset.data)
This creates a DataFrame from the data present in the dataset. Think of it as an Excel sheet with rows and columns, where each column has a heading:
cancerdata.columns = dataset.feature_names
The preceding line adds the column headings to the DataFrame. The following piece of code keeps only the feature columns we are interested in and drops the rest:
for i in range(0, len(dataset.feature_names)):
    if dataset.feature_names[i] in ['mean concave points', 'mean area',
                                    'mean radius', 'mean perimeter',
                                    'mean concavity']:
        continue
    else:
        cancerdata = cancerdata.drop(dataset.feature_names[i], axis=1)
The preceding lines will delete all the columns other than the following:
- Mean concave points
- Mean area
- Mean radius
- Mean perimeter
- Mean concavity
To reduce the number of feature columns in the dataset, we are deleting the columns that have less impact on the model. An equivalent, more concise way of doing this is shown in the sketch below.
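As a side note, the same reduction can be written more concisely by selecting just the required columns instead of dropping the others. The following is a minimal sketch, assuming the same cancerdata DataFrame, with a print statement to verify that only the five columns remain:
# equivalent, more concise alternative to the drop loop above
selected_features = ['mean radius', 'mean perimeter', 'mean area',
                     'mean concavity', 'mean concave points']
cancerdata = cancerdata[selected_features]

# verify that only the five selected columns remain
print(list(cancerdata.columns))
Next, we save the reduced dataset to a CSV file: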
cancerdata.to_csv("myfile.csv")
This line will save the data to a CSV file; you can open it in Excel to find out what is present in the dataset:
cancer_types = dataset.target_names
When you examine the dataset in Excel, you will see that the diagnosis contains the value 0 or 1, where 0 is malignant and 1 is benign. To change these numeric values into the actual class names, we write the following piece of code:
cancer_names = []
# getting all the corresponding cancer types in name (string) format
for i in range(len(dataset.target)):
    cancer_names.append(cancer_types[dataset.target[i]])
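The same mapping can also be written as a list comprehension; the following is just an equivalent sketch of the loop above:
# equivalent one-liner for mapping the numeric targets to their names
cancer_names = [cancer_types[t] for t in dataset.target]
Now we split the data into training and test sets: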
x_train, x_test, y_train, y_test = train_test_split(cancerdata, cancer_names, test_size=0.3, random_state=5)
This line of code splits the dataset into two parts, one for training and one for testing, and saves them in the corresponding variables defined for that purpose; test_size=0.3 reserves 30% of the rows for the test set.
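As a quick sanity check (not part of the original listing), you can print the sizes of the two splits; with test_size=0.3, roughly 70% of the 569 rows end up in the training set:
# verify the 70/30 split between training and test data
print(len(x_train), len(x_test))
Next, we create the classifier: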
classifier = RandomForestClassifier()
The preceding line creates the random forest classifier:
classifier.fit(x_train, y_train)
This code will feed the training data and train the model:
# testing the model with the test data
print(classifier.predict(x_test))
The preceding line will print the predicted cancer types for the testing data to the console, as shown here:
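Once the model is trained and tested, the final step promised at the beginning of this section is converting it into a .mlmodel file. The following is a minimal sketch of that step: the accuracy check uses the accuracy_score imported earlier, while the coremltools package, the output name type, and the file name cancermodel.mlmodel are assumptions for illustration:
# a minimal sketch: evaluating the model and converting it to Core ML
# (assumes the coremltools package is installed: pip install coremltools)
import coremltools

# accuracy on the held-out test data, using the accuracy_score imported earlier
print(accuracy_score(y_test, classifier.predict(x_test)))

# convert the trained random forest to a Core ML model; the feature names
# are the five columns used for training, and the output name and file
# name are illustrative assumptions
model = coremltools.converters.sklearn.convert(
    classifier,
    input_features=['mean radius', 'mean perimeter', 'mean area',
                    'mean concavity', 'mean concave points'],
    output_feature_names='type')
model.save('cancermodel.mlmodel')
The generated .mlmodel file can then be added to an Xcode project and used through Core ML.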