Python Data Analysis(Second Edition)
上QQ阅读APP看书,第一时间看更新

Handling missing values

We regularly encounter empty fields in data records. It's best that we accept this and learn how to handle this kind of issue in a robust manner. Real data can not only have gaps-it can also have wrong values, because of faulty measuring equipment, for example. In Pandas, missing numerical values will be designated as NaN, objects as None, and the datetime64 objects as NaT. The outcome of arithmetic operations with NaN values is also NaN. Descriptive statistics methods, such as summation and average, behave differently. As we observed in an earlier example, in such a case, NaN values are treated as zero values. However, if all the values are NaN during, say, summation, the sum returned is still NaN. In aggregation operations, NaN values in the column that we group are ignored. We will again load the WHO_first9cols.csv file into a DataFrame. Remember that this file contains empty fields. Let's only select the first three rows, including the headers of the Country and Net primary school enrolment ratio male (%) columns as follows:

df = df[['Country', df.columns[-2]]][:2] 
print("New df\n", df) 

We get a DataFrame with two NaN values:

New df
       Country  Net primary school enrolment ratio male (%)
0  Afghanistan                                          NaN
1      Albania                                           94

[2 rows x 2 columns]

The Pandas isnull() function checks for missing values as follows:

print("Null Values\n", pd.isnull(df)) 

The output for our DataFrame is as follows:

Null Values
  Country Net primary school enrolment ratio male (%)
0   False                                        True
1   False                                       False

To count the number of NaN values for each column, we can sum the Boolean values returned by isnull(). This works because, during summation, True values are considered as ones and False values are treated as zeros:

Total Null Values
Country                                        0
Net primary school enrolment ratio male (%)    1
dtype: int64

Likewise, we can check with the DataFrame notnull() method for any non-missing values that are present:

print("Not Null Values\n", df.notnull()) 

The result of the notnull() method is the opposite of the isnull() function:

Not Null Values
  Country Net primary school enrolment ratio male (%)
0    True                                       False
1    True                                        True

When we double values in a DataFrame that has NaN values, the product will still contain NaN values, since doubling is an arithmetic operation:

print("Last Column Doubled\n", 2 * df[df.columns[-1]]) 

We double the last column, which contains numerical values (doubling string values repeats the string):

Last Column Doubled
0    NaN
1    188
Name: Net primary school enrolment ratio male (%), dtype: float64

If we add a NaN value, however, the NaN value wins:

print("Last Column plus NaN\n", df[df.columns[-1]] + np.nan) 

As you can see, the NaN values declared total victory:

Last Column plus NaN
0   NaN1   NaN
Name: Net primary school enrolment ratio male (%), dtype: float64

Replace the missing values with a scalar value. For example, replace 0 with the fillna() method (we can't always replace missing values with zeros, but sometimes this is good enough):

print("Zero filled\n", df.fillna(0)) 

The effect of the preceding line is to replace the NaN value with 0:

Zero filled
       Country  Net primary school enrolment ratio male (%)
0  Afghanistan                                            0
1      Albania                                           94

The code for this section is in the ch-03.ipynb file of this book's code bundle: