Data Preprocessing
Data preprocessing is a critical step in developing ML solutions, as it helps make sure that the model is not trained on biased data. It can improve a model's performance, and it is often the reason why the same algorithm applied to the same data problem works better for a programmer who has done an outstanding job of preprocessing the dataset.
For the computer to be able to understand the data proficiently, it is necessary to not only feed the data in a standardized way but also make sure that the data does not contain outliers or noisy data, or even missing entries. This is important because failing to do so might result in the algorithm making assumptions that are not true to the data. This will cause the model to train at a slower pace and to be less accurate due to misleading interpretations of data.
Moreover, data preprocessing does not end there. Models do not work the same way, and each one makes different assumptions. This means that we need to preprocess the data in terms of the model that is going to be used. For example, some models accept only numerical data, whereas others work with nominal and numerical data.
To achieve better results during data preprocessing, a good practice is to transform (preprocess) the data in different ways and then test the different transformations in different models. That way, you will be able to select the right transformation for the right model. It is worth mentioning that data preprocessing is likely to help any data problem and any ML algorithm, considering that just by standardizing the dataset, a better training speed is achieved.
Messy Data
Data that is missing information or that contains outliers or noise is considered to be messy data. Failing to perform any preprocessing to transform the data can lead to poorly created models of the data, due to the introduction of bias and information loss. Some of the issues with data that should be avoided will be explained here.
Missing Values
Both the features and instances of a dataset can have missing values. Features where a few instances have values, as well as instances where there are no values for any feature, are considered missing data:
The preceding image displays an instance (Instance 8) with no values for any of the features, which makes it useless, and a feature (Feature 8) with seven missing values out of the 10 instances, which means that the feature cannot be used to find patterns among the instances, considering that most of them don't have a value for the feature.
Conventionally, a feature missing more than 5 to 10% of its values is considered to be missing data (also known as a feature with high absence rate), and so it needs to be dealt with. On the other hand, all instances that have missing values for all features should be eliminated as they do not provide any information to the model and, on the contrary, may end up introducing bias.
When dealing with a feature with a high absence rate, it is recommended to either eliminate it or fill it with values. The most popular ways to replace the missing values are as follows:
- Mean imputation: Replacing missing values with the mean or median of the feature's available values
- Regression imputation: Replacing missing values with the predicted values that have been obtained from a regression function
Note
A regression function refers to the statistical model that's used to estimate a relationship between a dependent variable and one or more independent variables. A regression function can be linear, logistic, polynomial, and so on.
While mean imputation is a simpler approach to implement, it may introduce bias as it evens out all the instances. On the other hand, even though the regression approach matches the data to its predicted value, it may end up overfitting the model (that is, creating models that learn the training data too well and are not fit to deal with new unseen data) as all the values that are introduced follow a function.
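As a rough illustration of the two imputation approaches, the following sketch fills the gaps in a small, made-up DataFrame. The column names (fare and age) and the values are assumptions chosen for the example, and a straight line fitted with NumPy's polyfit() stands in for the regression function; any other regression model could be used instead.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): "age" has two missing values, "fare" is complete.
df = pd.DataFrame({
    "fare": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "age": [22.0, np.nan, 30.0, 34.0, np.nan, 42.0],
})

# Mean imputation: every missing entry receives the same value.
age_mean = df["age"].fillna(df["age"].mean())

# Regression imputation: fit age ~ fare on the complete rows,
# then predict age for the rows where it is missing.
known = df["age"].notna()
slope, intercept = np.polyfit(df.loc[known, "fare"], df.loc[known, "age"], deg=1)
age_reg = df["age"].copy()
age_reg.loc[~known] = slope * df.loc[~known, "fare"] + intercept

print(age_mean.tolist())
print(age_reg.round(2).tolist())
```

Note how mean imputation assigns the same value to both gaps, whereas the regression approach assigns values that lie exactly on the fitted line, which is the behavior behind the bias and overfitting concerns described above.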
Lastly, when the missing values are found in a text feature such as gender, the best course of action would be to either eliminate them or replace them with a class labeled as uncategorized or something similar. This is mainly because it is not possible to apply either mean or regression imputation to text.
Labeling missing values with a new category (uncategorized) is mostly done when eliminating them would remove an important part of the dataset, and hence would not be an appropriate course of action. In this case, even though the new label may have an effect on the model, depending on the rationale that's used to label the missing values, leaving them empty would be an even worse alternative as it would cause the model to make assumptions on its own.
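The two options for a text feature can be sketched with pandas as follows. The gender values are made up for the example, and the label uncategorized is simply the name suggested in the text; any other placeholder would work the same way.

```python
import numpy as np
import pandas as pd

# Hypothetical text feature with two missing entries.
gender = pd.Series(["female", "male", np.nan, "female", np.nan], name="gender")

# Option 1: eliminate the entries that have no value.
dropped = gender.dropna()

# Option 2: keep every entry and label the gaps with a new category.
labeled = gender.fillna("uncategorized")

print(len(gender), len(dropped))
print(labeled.value_counts())
```

Here, the first option discards 40% of the entries, which is the kind of situation where adding a new category is usually the more reasonable choice.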
Note
To learn more about how to detect and handle missing values, visit the following page: https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4.
Outliers
Outliers are values that are far from the mean. This means that if the values from a feature follow a Gaussian distribution, the outliers are located at the tails.
Note
A Gaussian distribution (also known as a normal distribution) has a bell-shaped curve, given that there is an equal number of values above and below the mean.
Outliers can be global or local. The former group represents those values that are far from the entire set of values for a feature. For example, when analyzing data from all members of a neighborhood, a global outlier would be a person who is 180 years old (as shown in the following diagram (A)). The latter, on the other hand, represents values that are far from a subgroup of values of that feature. For the same example that we saw previously, a local outlier would be a college student who is 70 years old (B), which would normally differ from other college students in that neighborhood:
Considering both examples that have been given, outliers do not evaluate whether the value is possible. While a person aged 180 years is not plausible, a 70-year-old college student might be a possibility, yet both are categorized as outliers as they can both affect the performance of the model.
A straightforward approach to detecting outliers consists of visualizing the data to determine whether it follows a Gaussian distribution and, if it does, classifying as outliers those values that fall more than a chosen number of standard deviations (typically between three and six) away from the mean. Nevertheless, there is no exact rule for determining an outlier, and the decision to select the number of standard deviations is subjective and will vary from problem to problem.
For example, if the dataset is reduced by 40% by setting three standard deviations as the parameter to rule out values, it would be appropriate to change the number of standard deviations to four.
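The effect of the chosen number of standard deviations can be checked before committing to it. The following sketch uses synthetic ages (an assumption for illustration, not data from this chapter) and a hypothetical helper named sigma_bounds() to compare how many values are flagged with three and with four standard deviations:

```python
import numpy as np
import pandas as pd

def sigma_bounds(values, n_std=3):
    """Return the (lower, upper) thresholds at n_std standard deviations."""
    mean, std = values.mean(), values.std()
    return mean - n_std * std, mean + n_std * std

# Synthetic ages (hypothetical data): 200 plausible values plus one entry of 180.
rng = np.random.default_rng(0)
ages = pd.Series(np.append(rng.normal(40, 10, 200), 180.0))

flagged = {}
for n_std in (3, 4):
    low, high = sigma_bounds(ages, n_std)
    # Values outside the [low, high] interval are treated as outliers.
    flagged[n_std] = ages[(ages < low) | (ages > high)]
    share = len(flagged[n_std]) / len(ages)
    print(f"{n_std} std: keep [{low:.1f}, {high:.1f}], flagged {share:.1%}")
```

If the flagged share turns out to be unreasonably large for the problem at hand, the threshold can be widened and the comparison repeated.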
On the other hand, when dealing with text features, detecting outliers becomes even trickier as there are no standard deviations to use. In this case, counting the occurrences of each class value would help to determine whether a certain class is indispensable or not. For instance, in clothing sizes, having a size XXS that represents less than 5% of the entire dataset might not be necessary.
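Counting the occurrences of each class can be done with the pandas value_counts() method. In the following sketch, the clothing sizes and the 5% cut-off are assumptions taken from the example in the text rather than fixed rules:

```python
import pandas as pd

# Hypothetical clothing sizes for 100 instances.
sizes = pd.Series(["S"] * 30 + ["M"] * 45 + ["L"] * 22 + ["XXS"] * 3)

# Share of the dataset that each class represents.
share = sizes.value_counts(normalize=True)

# Classes below the chosen cut-off are candidates for removal or merging.
rare = share[share < 0.05].index.tolist()

print(share)
print(rare)
```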
Once the outliers have been detected, there are three common ways to handle them:
- Delete the outlier: For outliers that are true values, it is best to completely delete them to avoid skewing the analysis. This may also be a good idea for outliers that are mistakes when there are too many of them to analyze each one and assign it a new value.
- Define a top: Defining a top may also be useful for true values. For instance, if you realize that all values above a certain threshold behave the same way, you can consider capping those values at the threshold.
- Assign a new value: If the outlier is clearly a mistake, you can assign a new value using one of the techniques that we discussed for missing values (mean or regression imputation).
The decision to use each of the preceding approaches depends on the outlier type and number. Most of the time, if the number of outliers represents a small proportion of the total size of the dataset, there is no point in treating the outlier in any way other than deleting it.
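The three approaches listed above can be sketched on a small, made-up series. The threshold of 10 is a hypothetical cut-off chosen for the example; in practice, it would come from the detection step.

```python
import pandas as pd

# Hypothetical feature with one value that is far from the rest.
values = pd.Series([2, 3, 2, 4, 3, 2, 3, 40])
threshold = 10  # assumed cut-off for this example

is_outlier = values > threshold

# 1. Delete the outlier.
deleted = values[~is_outlier]

# 2. Define a top: cap every value at the threshold.
capped = values.clip(upper=threshold)

# 3. Assign a new value: here, the mean of the remaining values.
replaced = values.astype(float).mask(is_outlier, values[~is_outlier].mean())

print(deleted.tolist())
print(capped.tolist())
print(replaced.round(2).tolist())
```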
Note
Noisy data corresponds to values that are not correct or possible. This includes numerical (outliers that are mistakes) and nominal values (for example, a person's gender misspelled as "fimale"). Like outliers, noisy data can be treated by deleting the values completely or by assigning them a new value.
Exercise 1.02: Dealing with Messy Data
In this exercise, we will be using the tips dataset from seaborn as an example to demonstrate how to deal with messy data. Follow these steps to complete this exercise:
- Open a Jupyter Notebook to implement this exercise.
- Import all the required elements. Next, load the tips dataset and store it in a variable called tips. Use the following code:
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
tips = sns.load_dataset('tips')
- Next, create a variable called size to store the values of that feature from the dataset. As this dataset does not contain any missing data, we will convert the top 16 values of the size variable into missing values. Print out the top 20 values of the size variable:
size = tips["size"]
size.loc[:15] = np.nan
size.head(20)
Note
A warning may appear at this point, saying A value is trying to be set on a copy of a slice from a DataFrame. This occurs because size is a slice of the tips dataset, and by making a change in the slice, the dataset is also changed. This is okay as the purpose of this exercise is to modify the dataset by modifying the different features that it contains.
The preceding code snippet creates the size variable as a slice of the dataset, then converts the top 16 values of the variable into Not a Number (NaN), which is the representation of a missing value. Finally, it prints the top 20 values of the variable.
The output will appear as follows:
As you can see, the feature contains the NaN values that we introduced.
- Check the shape of the size variable:
size.shape
The output is as follows:
(244,)
- Now, count the number of NaN values to determine how to handle them. Use the isnull() function to find the NaN values, and use the sum() function to sum them all:
size.isnull().sum()
The output is as follows:
16
The NaN values account for 6.56% of the variable, which can be calculated by dividing the number of missing values by the length of the feature (16/244). Although this is not high enough to consider removing the entire feature, there is a need to handle the missing values.
- Let's choose the mean imputation methodology to replace the missing values. To do so, compute the mean of the available values, as follows:
mean = size.mean()
mean = round(mean)
print(mean)
The mean comes out as 3.
Note
The mean value (2.55) was rounded to its nearest integer since the size feature is a measure of the number of persons attending a restaurant.
- Replace all missing values with the mean. Use the fillna() function, which takes every missing value and replaces it with the value that is defined inside the parentheses. To check that the values have been replaced, print the first 20 values again:
size.fillna(mean, inplace=True)
size.head(20)
Note
When inplace is set to True, the original object (here, the size series) is modified. Failing to set the parameter to True will leave it unmodified, and fillna() will instead return a new object with the values replaced. Accordingly, by setting inplace to True, the NaN values are replaced with the mean.
The printed output is as follows:
As shown in the preceding screenshot, the value of the top instances has changed from NaN to 3, which is the mean that was calculated previously.
- Use Matplotlib to graph a histogram of the size variable. Use Matplotlib's hist() function, as per the following code:
plt.hist(size)
plt.show()
The histogram should look as follows. As we can see, its distribution is Gaussian-like:
- Discover the outliers in the data. Let's use three standard deviations as the measure to calculate the minimum and maximum values.
As we discussed previously, the min value is determined by calculating the mean of all of the values and subtracting three standard deviations from it. Use the following code to set the min value and store it in a variable named min_val:
min_val = size.mean() - (3 * size.std())
print(min_val)
The min value is around -0.1974. According to the min value, there are no outliers at the left tail of the Gaussian distribution. This makes sense, given that the distribution is tilted slightly to the left.
Opposite to the min value, for the max value, the standard deviations are added to the mean to calculate the higher threshold. Calculate the max value, as shown in the following code, and store it in a variable named max_val:
max_val = size.mean() + (3 * size.std())
print(max_val)
The max value, which comes to around 5.3695, determines that instances with a size above approximately 5.37 represent outliers. As you can see in the preceding diagram, this also makes sense as those instances are far away from the bell of the Gaussian distribution.
- Count the number of instances that are above the maximum value to decide how to handle them, as per the instructions given here.
Using indexing, obtain the values in size that are above the max threshold and store them in a variable called outliers. Then, count the outliers using count():
outliers = size[size > max_val]
outliers.count()
The output shows that there are 4 outliers.
- Print out the outliers and check that the correct values were stored, as follows:
print(outliers)
The output is as follows:
As the number of outliers is small, and they correspond to true outliers, they can be deleted.
Note
For this exercise, we will be deleting the instances from the size variable to understand the complete procedure of dealing with outliers. However, later, the deletion of outliers will be handled while considering all of the features so that we can delete the entire instance, not just the size values.
- Redefine the values stored in size by using indexing to include only values below the max threshold. Then, print the shape of size:
size = size[size <= max_val]
size.shape
The output is as follows:
(240,)
As you can see, the shape of size (calculated in Step 4) has been reduced by four, which was the number of outliers.
Note
To access the source code for this specific section, please refer to https://packt.live/30Egk0o.
You can also run this example online at https://packt.live/3d321ow. You must execute the entire Notebook in order to get the desired result.
You have successfully cleaned a Pandas series. This process serves as a guide for cleaning a dataset later on.
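As a compact recap of the procedure, the following sketch wraps the same steps (mean imputation followed by removal of values beyond three standard deviations) in a function that drops entire instances rather than single values. The function name clean_feature() and the toy data are hypothetical, and the approach is only one of several reasonable ways to organize these steps.

```python
import numpy as np
import pandas as pd

def clean_feature(df, column, n_std=3):
    """Return a copy of df with `column` mean-imputed and outlier rows removed."""
    cleaned = df.copy()  # work on a copy so that the input is left untouched
    # Mean imputation, rounded because the feature is a count.
    fill_value = round(cleaned[column].mean())
    cleaned[column] = cleaned[column].fillna(fill_value)
    # Thresholds at n_std standard deviations from the mean.
    mean, std = cleaned[column].mean(), cleaned[column].std()
    keep = cleaned[column].between(mean - n_std * std, mean + n_std * std)
    # Boolean indexing removes the whole instance, not just one value.
    return cleaned[keep]

# Toy data (hypothetical): two missing values and one extreme party size.
toy = pd.DataFrame({
    "size": [2] * 10 + [3] * 6 + [np.nan, np.nan, 4, 20],
    "tip": np.linspace(1.0, 5.0, 20),
})

cleaned = clean_feature(toy, "size")
print(toy.shape, cleaned.shape)
```

Working on a copy of the DataFrame also avoids the warning about setting values on a slice that was mentioned earlier in the exercise.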
To summarize, we have discussed the importance of preprocessing data, as failing to do so may introduce bias in the model, which affects the training time of the model and its performance. Some of the main forms of messy data are missing values, outliers, and noise.
Missing values, as their name suggests, are those values that are left empty or null. When dealing with many missing values, it is important to handle them by deleting them or by assigning new values. Two ways to assign new values were also discussed: mean imputation and regression imputation.
Outliers are values that fall far from the mean of all the values of a feature. One way to detect outliers is by selecting all the values that fall outside the mean plus or minus a chosen number of standard deviations (typically between three and six). Outliers may be mistakes (values that are not possible) or true values, and they should be handled differently. While true outliers may be deleted or topped, mistakes should be replaced with other values when possible.
Finally, noisy data corresponds to values that are, regardless of their proximity to the mean, mistakes or typos in the data. They can be of numeric, ordinal, or nominal types.
Note
Please remember that numeric data is always represented by numbers that can be measured, nominal data refers to text data that does not follow a rank, and ordinal data refers to text data that follows a rank or order.
Dealing with Categorical Features
Categorical features are features that comprise discrete values typically belonging to a finite set of categories. Categorical data can be nominal or ordinal. Nominal refers to categories that do not follow a specific order, such as music genre or city names, whereas ordinal refers to categories with a sense of order, such as clothing sizes or level of education.
Feature Engineering
Even though improvements in many ML algorithms have enabled the algorithms to understand categorical data types such as text, the process of transforming them into numeric values facilitates the training process of the model, which results in faster running times and better performance. This is mainly due to the elimination of semantics available in each category, as well as the fact that the conversion into numeric values allows you to scale all of the features of the dataset equally, as will be explained in subsequent sections of this chapter.
How does it work? Feature engineering generates a label encoding that assigns a numeric value to each category; this value will then replace the category in the dataset. For example, a variable called genre with the classes pop, rock, and country can be converted as follows:
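A minimal sketch of this conversion, using scikit-learn's LabelEncoder class on the genre example, is shown next. The list of values is made up for illustration. Note that LabelEncoder assigns the numbers according to the alphabetical order of the classes, not the order in which they appear.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values of the genre variable.
genre = ["pop", "rock", "country", "rock", "pop"]

enc = LabelEncoder()
codes = enc.fit_transform(genre)

# classes_ holds the sorted categories; the position of each one is its code.
mapping = {label: code for code, label in enumerate(enc.classes_)}

print(mapping)
print(codes.tolist())
```

The fitted encoder also provides an inverse_transform() method, which converts the numeric values back into the original categories when they are needed for interpretation.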
Exercise 1.03: Applying Feature Engineering to Text Data
In this exercise, we will be converting the text features of the tips dataset into numerical data.
Note
Use the same Jupyter Notebook that you created for the previous exercise.
Follow these steps to complete this exercise:
- Import scikit-learn's LabelEncoder() class, as well as the pandas library, as follows:
from sklearn.preprocessing import LabelEncoder
import pandas as pd
- Convert each of the text features into numeric values using the class that was imported previously (LabelEncoder):
enc = LabelEncoder()
tips["sex"] = enc.fit_transform(tips['sex'].astype('str'))
tips["smoker"] = enc.fit_transform(tips['smoker'].astype('str'))
tips["day"] = enc.fit_transform(tips['day'].astype('str'))
tips["time"] = enc.fit_transform(tips['time'].astype('str'))
As per the preceding code snippet, the first step is to instantiate the LabelEncoder class by typing in the first line of code. Second, for each of the categorical features, we use the built-in fit_transform() method from the class, which will assign a numeric value to each category and output the result.
- Print out the top values of the tips dataset:
tips.head()
The output is as follows:
As you can see, the text categories of the categorical features have been converted into numeric values.
Note
To access the source code for this specific section, please refer to https://packt.live/30GWJgb.
You can also run this example online at https://packt.live/3e2oaVu. You must execute the entire Notebook in order to get the desired result.
You have successfully converted text data into numeric values.
While improvements in ML have made dealing with text features easier for some algorithms, it is best to convert them into numeric values. This is mainly important as it eliminates the complexity of dealing with semantics, not to mention that it gives us the flexibility to change from model to model, without any limitations.
Text data conversion is done via feature engineering, where every text category is assigned a numeric value that replaces it. Furthermore, even though this can be done manually, there are powerful built-in classes and methods that facilitate this process. One example of this is the use of scikit-learn's LabelEncoder class.
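For ordinal features, the manual route can be worth considering, because an automatic encoding follows the alphabetical order of the categories rather than their rank. The following sketch compares both on a made-up clothing size feature; the dictionary named order is an assumption that encodes the intended ranking.

```python
import pandas as pd

# Hypothetical ordinal feature.
sizes = pd.Series(["M", "S", "XL", "L", "S"])

# Manual encoding: the numeric values follow the rank of the categories.
order = {"S": 0, "M": 1, "L": 2, "XL": 3}
encoded = sizes.map(order)

# Alphabetical encoding for comparison: L=0, M=1, S=2, XL=3.
alphabetical = sizes.astype("category").cat.codes

print(encoded.tolist())
print(alphabetical.tolist())
```

In the alphabetical version, S receives a higher value than L, so a model that treats the numbers as magnitudes would see the sizes in the wrong order.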
Rescaling Data
Rescaling data is important because even though the data may be fed to a model using different scales for each feature, the lack of homogeneity can cause the algorithm to lose its ability to discover patterns from the data due to the assumptions it has to make to understand it, thereby slowing down the training process and negatively affecting the model's performance.
Data rescaling helps the model run faster, as it does not have to compensate for the differences in scale present in the dataset. Moreover, a model trained over equally scaled data can weigh all features on comparable terms, which allows the algorithm to generalize to all features and not just to those with higher values, irrespective of their meaning.
An example of a dataset with different scales is one that contains different features, one measured in kilograms, another measuring temperature, and another counting the number of children. Even though the values of each attribute are true, the scale of each one of them highly differs from that of the other. For example, while the values in kilograms can go higher than 100, the children count will typically not go higher than 10.
Two of the most popular ways to rescale data are data normalization and data standardization. There is no rule on selecting the methodology to transform data to scale it, as all datasets behave differently. The best practice is to transform the data using two or three rescaling methodologies and test the algorithms in each one of them in order to choose the one that best fits the data based on its performance.
Rescaling methodologies are to be used individually. When testing different rescaling methodologies, the transformation of data should be done independently. Each transformation can be tested over a model, and the best suited one should be chosen for further steps.
Normalization: Data normalization in ML consists of rescaling the values of all features so that they lie in a range between 0 and 1. This serves the purpose of equating attributes of different scales.
The following equation allows you to normalize the values of a feature:
$$z_i = \frac{x_i - \min(x)}{\max(x) - \min(x)}$$
Here, z_i corresponds to the ith normalized value, x_i to the ith original value, and x represents all values.
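Standardization: This is a rescaling technique that transforms the data so that it has a mean equal to 0 and a standard deviation equal to 1. The shape of the distribution is not changed, so the result is Gaussian only if the original data was.
One simple way of standardizing a feature is shown in the following equation:
$$z_i = \frac{x_i - \operatorname{mean}(x)}{\operatorname{std}(x)}$$
Here, z_i corresponds to the ith standardized value, x_i to the ith original value, and x represents all values.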
Exercise 1.04: Normalizing and Standardizing Data
This exercise covers the normalization and standardization of data, using the tips dataset as an example.
Note
Use the same Jupyter Notebook that you created for the previous exercise.
Follow these steps to complete this exercise:
- Using the tips variable, which contains the entire dataset, normalize the data using the normalization formula and store it in a new variable called tips_normalized. Print out the top 10 values:
tips_normalized = (tips - tips.min())/(tips.max()-tips.min())
tips_normalized.head(10)
The output is as follows:
As shown in the preceding screenshot, all of the values have been converted into their equivalents in a range between 0 and 1. By performing normalization for all of the features, the model will be trained on features of the same scale.
- Again, using the tips variable, standardize the data using the formula for standardization and store it in a variable called tips_standardized. Print out the top 10 values:
tips_standardized = (tips - tips.mean())/tips.std()
tips_standardized.head(10)
The output is as follows:
Compared to normalization, in standardization, the values are centered around zero, with negative values for entries below the mean and positive values for entries above it.
Note
To access the source code for this specific section, please refer to https://packt.live/30FKsbD.
You can also run this example online at https://packt.live/3e3cW2O. You must execute the entire Notebook in order to get the desired result.
You have successfully applied rescaling methods to your data.
In conclusion, we have covered the final step in data preprocessing, which consists of rescaling data. This process was done in a dataset with features of different scales, with the objective of homogenizing the way data is represented to facilitate the comprehension of the data by the model.
Failing to rescale data will cause the model to train at a slower pace and may negatively affect the performance of the model.
Two methodologies for data rescaling were explained in this topic: normalization and standardization. On one hand, normalization rescales the data to a range from 0 to 1. On the other hand, standardization rescales the data so that it has a mean of 0 and a standard deviation of 1.
Given that there is no rule for selecting the appropriate rescaling methodology, the recommended course of action is to transform the data using two or three rescaling methodologies independently, and then train the model with each transformation to evaluate the methodology that behaves the best.
Activity 1.02: Pre-processing an Entire Dataset
You are continuing to work for the safety department at a cruise company. As you did great work selecting the ideal target feature to develop the study, the department has decided to commission you for preprocessing the dataset as well. For this purpose, you need to use all the techniques you learned about previously to preprocess the dataset and get it ready for model training. The following steps serve to guide you in that direction:
- Import seaborn and the LabelEncoder class from scikit-learn. Next, load the Titanic dataset and create the features matrix, including the following features: sex, age, fare, class, embark_town, and alone.
Note
For this activity, the features matrix has been created using only six features since some of the other features were redundant for this study. For example, there is no need to keep both class and pclass.
- Check for missing values and outliers in all the features of the features matrix (X). Choose a methodology to handle them.
- Convert all text features into their numeric representations.
- Rescale your data, either by normalizing or standardizing it.
Note
The solution for this activity can be found on page 211.
Expected Output: Results may vary, depending on the choices you make. However, you must be left with a dataset with no missing values, outliers, or text features, and with the data rescaled.