
Introduction
What is data, and what are we doing with it?
A simple answer is that we plot our data as points on paper, look at them, and search for simple explanations that approximate the data well. The simple geometric line of F=ma (force being proportional to acceleration) explained a lot of noisy data for hundreds of years. I sometimes think of data science as data compression.
Sometimes I think of artificial intelligence when a machine is trained on only win-lose outcomes (from games of checkers, for example): in such a case, it is never taught explicit directions on how to play in order to win.
This chapter deals with the pre-processing of data in scikit-learn. Some questions you can ask about your dataset are as follows:
- Are there missing values in your dataset?
- Are there outliers (points far away from the others) in your set?
- What are the variables in the data like? Are they continuous quantities or categories?
- What do the continuous variable distributions look like? Can any of the variables in your dataset be described by normal distributions (bell-shaped curves)?
- Can any continuous variables be turned into categorical variables for simplicity? (This tends to work when a variable takes on only a few particular values rather than a continuous range of values.)
- What are the units of the variables involved? Will you mix variables with different units in the machine learning algorithm you choose to use?
These questions can have simple or complex answers. Thankfully, you will ask them many times, even of the same dataset, and after working through these recipes you will have some practice at answering such pre-processing questions. The sketch below shows how several of them can be asked in code.
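Here is a minimal sketch of how some of these questions can be explored; the DataFrame, column names, and the three-standard-deviation outlier rule are illustrative assumptions, not data from this chapter's recipes:

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "height_cm": rng.normal(170, 10, size=200),                # continuous, roughly bell-shaped
    "n_children": rng.integers(0, 4, size=200).astype(float),  # few distinct values: a candidate category
    "city": rng.choice(["Austin", "Boston"], size=200),        # already categorical
})
df.loc[rng.choice(200, size=5, replace=False), "height_cm"] = np.nan  # inject missing values

# Are there missing values?
print(df.isna().sum())

# Are there outliers? A crude check: points more than 3 standard deviations from the mean.
z = (df["height_cm"] - df["height_cm"].mean()) / df["height_cm"].std()
print((z.abs() > 3).sum(), "possible outliers in height_cm")

# What does the continuous distribution look like?
print(df["height_cm"].describe())

# Can a continuous variable be treated as categorical? Count its distinct values.
print(df["n_children"].nunique(), "distinct values in n_children")
```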
Additionally, we will see pipelines: a great organizational tool to make sure we perform the same operations on both the training and testing sets without errors and with relatively little work. We will also see regression examples: stochastic gradient descent (SGD) and Gaussian processes.
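As a preview of why pipelines help, here is a minimal sketch on made-up data (the dataset and parameters are illustrative assumptions, not the chapter's recipes) that chains a scaler with an SGD regressor, so that exactly the same scaling is applied to the training and testing sets:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split

# Made-up regression data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit() learns the scaler's parameters and the regressor's weights on the
# training data only; score() reuses the fitted scaler on the test data.
pipe = make_pipeline(StandardScaler(), SGDRegressor(random_state=0))
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))  # R^2 on the held-out test set
```

The same pattern works with a Gaussian process regressor in place of SGD; the point is that the pre-processing step travels with the model.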