Applied Supervised Learning with Python

Summary Statistics and Central Values

To find out what our data really looks like, we use a technique known as data profiling: the process of examining the data available from an existing information source (for example, a database or a file) and collecting statistics or informative summaries about that data. The goal is to understand your data well and to identify any challenges it may pose early in the project. This is done by summarizing the dataset and assessing its structure, content, and quality.

Data profiling includes collecting descriptive statistics and data types. Here are a few commands that are commonly used to get a summary of a dataset:

data.info(): This command tells us how many non-null values there are in each column, along with the data type of the values (non-numeric types are represented as object types).
data.describe(): This gives us basic summary statistics for all the numerical columns in the DataFrame: the count of non-null values, the mean and standard deviation, the minimum and maximum, and the 25th, 50th, and 75th percentiles. By default, string-type features are not included in the summary.
data.head() and data.tail(): These commands display the first five and last five rows of the DataFrame, respectively. While the previous commands give us a general idea of the dataset, it is a good idea to take a closer look at the actual data itself, which can be done using these commands.
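
As a rough illustration of these commands, the sketch below builds a small made-up DataFrame and calls each of them in turn. The column names and values are hypothetical and are not taken from the earthquake dataset used later in this chapter.

```python
import pandas as pd

# Hypothetical toy data for illustration only (not the earthquake dataset).
data = pd.DataFrame({
    "magnitude": [5.1, 6.3, None, 7.0, 5.8, 6.1],
    "depth_km": [10.0, 35.0, 12.5, 600.0, 33.0, 20.0],
    "region": ["A", "B", "A", None, "C", "B"],
})

# Non-null counts and data types for every column (printed to the console).
data.info()

# Summary statistics; by default only the numerical columns are included.
summary = data.describe()
print(summary)

# The first five and last five rows of the DataFrame.
first_rows = data.head()
last_rows = data.tail()
print(first_rows)
print(last_rows)
```

Because the magnitude column contains one missing value, its count in the describe() output is one lower than the number of rows, which is a quick way to spot incomplete columns.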

Standard Deviation

The standard deviation measures how widely the values of x are spread around their mean.

For a set of numerical values, x_i, the standard deviation is given by:

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2}$$

Figure 2.1: Standard deviation equation

Here, σ is the standard deviation, N is the number of data points, and μ is the mean.

Say we have a set of 10 values, x = [0,1,1,2,3,4,2,2,0,1]. The mean, μ, is the sum of these values divided by 10, that is, μ = 16/10 = 1.6:

Figure 2.2: Mean square values for x

The squared deviations from the mean sum to 14.4, so the standard deviation is sqrt(14.4/10) = sqrt(1.44) = 1.2.
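
The arithmetic above can be checked with a few lines of plain Python. This is only a sketch of the calculation described in the text (dividing by N, the population form); the variable names are chosen here for illustration.

```python
import math
import statistics

x = [0, 1, 1, 2, 3, 4, 2, 2, 0, 1]
n = len(x)

# Mean: the sum of the values divided by the number of values.
mean = sum(x) / n

# Squared deviation of each value from the mean.
squared_deviations = [(value - mean) ** 2 for value in x]

# Population standard deviation: divide by N, then take the square root.
std_dev = math.sqrt(sum(squared_deviations) / n)

# Cross-check with the standard library's population standard deviation.
std_dev_check = statistics.pstdev(x)

print(mean, sum(squared_deviations), std_dev, std_dev_check)
```

Note that the std value reported by pandas' describe() uses the sample form (dividing by N - 1 rather than N), so for the same values it would report roughly 1.26 rather than 1.2.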

Percentiles

For a set of values, the nth percentile is the value that is greater than n% of the values in the set. For example, the fiftieth percentile is the value in the dataset that has as many values greater than it as less than it. The fiftieth percentile of a dataset is also known as its median, and the twenty-fifth and seventy-fifth percentiles are also known as the lower and upper quartiles.

Say we have the same set of 10 values as earlier, x = [0,1,1,2,3,4,2,2,0,1]. Let's first sort this list of values. Upon sorting, we have x = [0,0,1,1,1,2,2,2,3,4]. To find the twenty-fifth percentile, let's first calculate the index at which the value occurs: i = (p/100) * n, where p = 25 and n = 10. Then, i = 2.5.

Since i is not an integer, we round it up to 3 and take the third element in the list as the twenty-fifth percentile. The twenty-fifth percentile in the given list would then be 1, which is the third element in our sorted list.
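
The procedure above can be sketched as a small function. The name nearest_rank_percentile is hypothetical, and the handling of an integer i (taking the element at that position) is an assumption, since the text only covers the non-integer case.

```python
import math

def nearest_rank_percentile(values, p):
    """Return the p-th percentile using the round-up rule described above."""
    ordered = sorted(values)
    # Position in the sorted list (1-based), before rounding.
    i = (p / 100) * len(ordered)
    # Round up when i is not an integer; assumed: use i itself when it is.
    rank = max(1, math.ceil(i))
    return ordered[rank - 1]

x = [0, 1, 1, 2, 3, 4, 2, 2, 0, 1]
lower_quartile = nearest_rank_percentile(x, 25)
upper_quartile = nearest_rank_percentile(x, 75)
print(lower_quartile, upper_quartile)
```

Libraries such as pandas and NumPy interpolate between neighboring values by default, so the percentiles they report can differ slightly from this round-up rule on other datasets.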

Exercise 11: Summary Statistics of Our Dataset

In this exercise, we will use the summary statistics functions we read about previously to get a basic idea of our dataset:

Read the earthquakes data into a pandas DataFrame named data, and use the dtyp dictionary that we read using the json library in the previous exercise to specify the data type of each column in the CSV:
data = pd.read_csv('earthquake_data.csv', dtype=dtyp)
Use the data.info() function to get an overview of the dataset:
data.info()
The output will be as follows:

Figure 2.3: Overview of the dataset
Print the first five and last five rows of the dataset. The first five rows are printed as follows:
data.head()
The output will be as shown here:

Figure 2.4: The first five rows
The last five rows are printed as follows:
data.tail()
The output will be as shown here:

Figure 2.5: The last five rows
We can see in these outputs that there are 28 columns, but not all of them are displayed. Only the first 10 and last 10 columns are shown, with the ellipsis indicating that the columns in between are omitted.
Use data.describe() to find the summary statistics of the dataset. Run data.describe().T:
data.describe().T
Here, .T indicates that we're taking the transpose of the DataFrame to which it is applied, that is, turning the columns into rows and vice versa. Applying it to the output of describe() makes the result easier to read, because each row in the transposed DataFrame now corresponds to the statistics for a single feature.
We should get an output like this:

Figure 2.6: Summary statistics

Notice here that the describe() function only shows the statistics for columns with numerical values. This is because statistics such as the mean, standard deviation, and percentiles cannot be calculated for non-numerical values.
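
If a summary of the non-numerical columns is also wanted, describe() accepts an include argument. The sketch below uses a small made-up DataFrame (hypothetical column names and values, not the earthquake data) to contrast the default behavior with include="all".

```python
import pandas as pd

# Hypothetical toy data for illustration only.
data = pd.DataFrame({
    "magnitude": [5.1, 6.3, 7.0, 5.8],
    "region": ["A", "B", "A", "C"],
})

# Default: only numerical columns, one row per feature after transposing.
numeric_summary = data.describe().T
print(numeric_summary)

# include="all" adds non-numerical columns, summarized by count,
# number of unique values, most frequent value, and its frequency.
full_summary = data.describe(include="all").T
print(full_summary)
```

For the non-numerical column, the numerical statistics (mean, std, and so on) appear as NaN in the combined output, which reflects the point made above.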