Handling missing values
We regularly encounter empty fields in data records. It's best that we accept this and learn how to handle this kind of issue in a robust manner. Real data can not only have gaps-it can also have wrong values, because of faulty measuring equipment, for example. In Pandas, missing numerical values will be designated as NaN
, objects as None
, and the datetime64
objects as NaT
. The outcome of arithmetic operations with NaN
values is also NaN
. Descriptive statistics methods, such as summation and average, behave differently. As we observed in an earlier example, in such a case, NaN
values are treated as zero values. However, if all the values are NaN
during, say, summation, the sum returned is still NaN
. In aggregation operations, NaN
values in the column that we group are ignored. We will again load the WHO_first9cols.csv
file into a DataFrame. Remember that this file contains empty fields. Let's only select the first three rows, including the headers of the Country
and Net primary school enrolment ratio male (%)
columns as follows:
df = df[['Country', df.columns[-2]]][:2] print("New df\n", df)
We get a DataFrame with two NaN values:
New df Country Net primary school enrolment ratio male (%) 0 Afghanistan NaN 1 Albania 94 [2 rows x 2 columns]
The Pandas isnull()
function checks for missing values as follows:
print("Null Values\n", pd.isnull(df))
The output for our DataFrame is as follows:
Null Values Country Net primary school enrolment ratio male (%) 0 False True 1 False False
To count the number of NaN
values for each column, we can sum the Boolean values returned by isnull()
. This works because, during summation, True
values are considered as ones and False
values are treated as zeros:
Total Null Values Country 0 Net primary school enrolment ratio male (%) 1 dtype: int64
Likewise, we can check with the DataFrame notnull()
method for any non-missing values that are present:
print("Not Null Values\n", df.notnull())
The result of the notnull()
method is the opposite of the isnull()
function:
Not Null Values Country Net primary school enrolment ratio male (%) 0 True False 1 True True
When we double values in a DataFrame that has NaN
values, the product will still contain NaN
values, since doubling is an arithmetic operation:
print("Last Column Doubled\n", 2 * df[df.columns[-1]])
We double the last column, which contains numerical values (doubling string values repeats the string):
Last Column Doubled 0 NaN 1 188 Name: Net primary school enrolment ratio male (%), dtype: float64
If we add a NaN
value, however, the NaN
value wins:
print("Last Column plus NaN\n", df[df.columns[-1]] + np.nan)
As you can see, the NaN
values declared total victory:
Last Column plus NaN 0 NaN1 NaN Name: Net primary school enrolment ratio male (%), dtype: float64
Replace the missing values with a scalar value. For example, replace 0
with the fillna()
method (we can't always replace missing values with zeros, but sometimes this is good enough):
print("Zero filled\n", df.fillna(0))
The effect of the preceding line is to replace the NaN value with 0:
Zero filled Country Net primary school enrolment ratio male (%) 0 Afghanistan 0 1 Albania 94
The code for this section is in the ch-03.ipynb
file of this book's code bundle: