Creating the model file using scikit-learn
This section explains how we are going to create the random forest model using scikit-learn and convert it into the .mlmodel file format that is compatible with Core ML. We are going to use the Breast Cancer dataset to create the model. The following is a Python program that builds a simple random forest model using scikit-learn and the Breast Cancer dataset; Core ML Tools then converts it into a Core ML-compatible model file. Let's go through the program in detail.
First, we need to import the required packages:
# importing required packages
import numpy as np
NumPy is the fundamental package for scientific computing with Python. It contains a powerful N-dimensional array object. In this program, NumPy arrays are used to store the dataset, which has 30 feature dimensions:
import pandas as pd
from pandas.core import series
Here, we are using pandas (https://pandas.pydata.org/pandas-docs/stable/10min.html), which is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Using pandas, we can create a data frame. You can think of a pandas DataFrame as an Excel sheet: every column has a heading, and the rows below it hold the data.
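To make this analogy concrete, here is a tiny, standalone illustration (it is not part of the main program, and the column names and values are only examples):
# a small DataFrame: named columns with rows of data, much like an Excel sheet
example = pd.DataFrame({'mean radius': [17.99, 20.57],
                        'mean area': [1001.0, 1326.0]})
print(example)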
Now, let's move on to understand the program written for solving the machine learning problem at hand:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import sklearn.datasets as ds
from sklearn.model_selection import train_test_split
The preceding lines import the sklearn packages. Now, we will load one of the built-in datasets that ship with the sklearn package:
dataset = ds.load_breast_cancer()
The preceding line loads the Breast Cancer dataset from the sklearn dataset package:
cancerdata = pd.DataFrame(dataset.data)
This creates a DataFrame from the data present in the dataset. Think of it as an Excel sheet with rows and columns, where each column has a heading:
cancerdata.columns = dataset.feature_names
The preceding line adds the column headings to the DataFrame. The following piece of code keeps only the feature columns we are interested in and drops the rest:
for i in range(0, len(dataset.feature_names)):
    if dataset.feature_names[i] in ['mean concave points', 'mean area',
                                    'mean radius', 'mean perimeter',
                                    'mean concavity']:
        continue
    else:
        cancerdata = cancerdata.drop(dataset.feature_names[i], axis=1)
The preceding lines will delete all the columns other than the following:
- Mean concave points
- Mean area
- Mean radius
- Mean perimeter
- Mean concavity
To reduce the number of feature columns in the dataset, we are deleting the columns that have less impact on the model. An equivalent, more concise way of doing this is shown in the sketch below.
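As a side note, the same reduction can be written more concisely by selecting just the required columns instead of dropping the others. The following is a minimal sketch, assuming the same cancerdata DataFrame, with a print statement to verify that only the five columns remain:
# equivalent, more concise alternative to the drop loop above
selected_features = ['mean radius', 'mean perimeter', 'mean area',
                     'mean concavity', 'mean concave points']
cancerdata = cancerdata[selected_features]

# verify that only the five selected columns remain
print(list(cancerdata.columns))
Next, we save the reduced dataset to a CSV file: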
cancerdata.to_csv("myfile.csv")
This line will save the data to a CSV file; you can open it in Excel to find out what is present in the dataset:
cancer_types = dataset.target_names
When you examine the dataset in Excel, you will see that the diagnosis contains the value 0 or 1, where 0 is malignant and 1 is benign. To change these numeric values into the actual class names, we write the following piece of code:
cancer_names = []
# getting all the corresponding cancer types in name (string) format
for i in range(len(dataset.target)):
    cancer_names.append(cancer_types[dataset.target[i]])
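The same mapping can also be written as a list comprehension; the following is just an equivalent sketch of the loop above:
# equivalent one-liner for mapping the numeric targets to their names
cancer_names = [cancer_types[t] for t in dataset.target]
Now we split the data into training and test sets: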
x_train, x_test, y_train, y_test = train_test_split(cancerdata, cancer_names, test_size=0.3, random_state=5)
This line of code splits the dataset into two parts, one for training and one for testing, and saves them in the corresponding variables defined for that purpose; test_size=0.3 reserves 30% of the rows for the test set.
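As a quick sanity check (not part of the original listing), you can print the sizes of the two splits; with test_size=0.3, roughly 70% of the 569 rows end up in the training set:
# verify the 70/30 split between training and test data
print(len(x_train), len(x_test))
Next, we create the classifier: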
classifier = RandomForestClassifier()
The preceding line creates the random forest classifier:
classifier.fit(x_train, y_train)
This code will feed the training data and train the model:
# testing the model with the test data
print(classifier.predict(x_test))
The preceding line will print the predicted cancer types for the testing data to the console, as shown here:
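Once the model is trained and tested, the final step promised at the beginning of this section is converting it into a .mlmodel file. The following is a minimal sketch of that step: the accuracy check uses the accuracy_score imported earlier, while the coremltools package, the output name type, and the file name cancermodel.mlmodel are assumptions for illustration:
# a minimal sketch: evaluating the model and converting it to Core ML
# (assumes the coremltools package is installed: pip install coremltools)
import coremltools

# accuracy on the held-out test data, using the accuracy_score imported earlier
print(accuracy_score(y_test, classifier.predict(x_test)))

# convert the trained random forest to a Core ML model; the feature names
# are the five columns used for training, and the output name and file
# name are illustrative assumptions
model = coremltools.converters.sklearn.convert(
    classifier,
    input_features=['mean radius', 'mean perimeter', 'mean area',
                    'mean concavity', 'mean concave points'],
    output_feature_names='type')
model.save('cancermodel.mlmodel')
The generated .mlmodel file can then be added to an Xcode project and used through Core ML.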