Grouping the data – aggregation, filtering, and transformation
In this section, you will learn how to aggregate data over categorical variables. This is a very common practice when the data consists of categorical variables. This analysis enables us to conduct a category-wise analysis and take further decisions regarding the modelling.
To illustrate the concepts of grouping and aggregating data better, let's create a simple dummy data frame that has a rich mix of both numerical and categorical variables. Let's use whatever we have explored till now about random numbers to create this data frame, as shown in the following snippet:
import numpy as np import pandas as pd a=['Male','Female'] b=['Rich','Poor','Middle Class'] gender=[] seb=[] for i in range(1,101): gender.append(np.random.choice(a)) seb.append(np.random.choice(b)) height=30*np.random.randn(100)+155 weight=20*np.random.randn(100)+60 age=10*np.random.randn(100)+35 income=1500*np.random.randn(100)+15000 df=pd.DataFrame({'Gender':gender,'Height':height,'Weight':weight,'Age':age,'Income':income,'Socio-Eco':seb}) df.head()
The output data frame df
looks something as follows:
Fig. 3.22: The resulting dummy data frame df containing 6 columns
As we can see from the preceding code snippet, the shape of the data frame is 100x6.
Grouping can be done over a categorical variable using the groupby
function. The column name of the categorical variable needs to be specified for this. Suppose that we wish to group the data frame based on the Gender
variable. This can be done by writing the following:
df.groupby('Gender')
If you run the preceding snippet on your IDE, you will get the following output indicating that a groupby
object has been created:
Fig. 3.23: Prompt showing that the groupby object has been created
The groupby
function doesn't split the original data frame into several groups, instead it creates a groupby
object that has two attributes, which are name
and group
.
These attributes can be accessed by following the name of the groupby
object with '.'
, followed by the name of the attribute. For example, to access the group attribute, one can write the following:
grouped = df.groupby('Gender') grouped.groups
The following is the output:
Fig. 3.24: Two groups based on gender
The numbers indicate the row numbers that belong to that particular group.
One important feature of these attributes is that they are iterable, and the same operation can be applied to each group just by looping. This comes in very handy when the number of groups are large and one needs results of the operation separately for each group.
Let's perform a simple operation to illustrate this. Let's try to print the name and groups in the groupby
object that we just created. This can be done as follows:
grouped=df.groupby('Gender') for names,groups in grouped: print names print groups
This prints the name of the group followed by the entire data for this group. The output looks something like the following:
Fig. 3.25.1: Name and the data in the group with gender female
Here is the second group as a part of the output:
Fig. 3.25.2: Name and the data in the group with gender male
A single group can be selected by writing the following:
grouped.get_group('Female')
This would generate only the first of the two groups, as shown in the preceding screenshot.
A data frame can be grouped over more than one categorical variable as well. As in this case, the data frame can be grouped over both Gender
and Soci-Eco
by writing something like the following:
grouped=df.groupby(['Gender','Socio-Eco'])
This should create six groups from a combination of two categories of Gender
and three categories of the Socio-Eco
variable. This can be checked by checking the length of the groupby
object as follows:
len(grouped)
It indeed returns six. To look at how these groups look, let's run the same iteration on the group attributes as we did earlier:
grouped=df.groupby(['Gender','Socio-Eco']) for names,groups in grouped: print names print groups
The code gives six groups' names and their entire data as the output. There would be six of such groups in total.
The first group looks like the following:
Fig. 3.26.1: Name and the data in the group with gender female and Socio_Eco Middle Class
The second group looks like the following:
Fig. 3.26.2: Name and the data in the group with gender female and Socio_Eco Middle Class
Aggregation
There are various aggregations that are possible on a data frame, such as sum
, mean
, describe
, size
, and so on. The aggregation basically means applying a function to all the groups all at once and getting a result from that particular group.
Let's see the sum
function. We just need to write the following code snippet to see how it works:
grouped=df.groupby(['Gender','Socio-Eco']) grouped.sum()
We gets the following table as the result:
Fig. 3.27: Sum of each column for different groups
To get the number of rows in each group (or calculate the size of each group), we can write something similar to the following code snippet:
grouped=df.groupby(['Gender','Socio-Eco']) grouped.size()
This results in a table, as shown in the following screenshot:
Fig. 3.28: Size of each group
One can use the describe
function to get the summary statistics for each group separately. The syntax is exactly the same as it is for the earlier two functions:
grouped=df.groupby(['Gender','Socio-Eco']) grouped.describe()
This output looks similar to the following table:
Fig. 3.29: All the summary statistics of each column for different groups
The groupby
objects behave similar to an inpidual data frame, in the sense that one can select columns from these groupby
objects just as we do from the data frames:
grouped=df.groupby(['Gender','Socio-Eco']) grouped_income=grouped['Income']
One can apply different functions to different columns. The aggregate
method used to do this is shown in the following snippet. With the following snippet, one can calculate sum
of Income
, mean
of Age
, and standard deviation of Height
, as shown:
grouped=df.groupby(['Gender','Socio-Eco']) grouped.aggregate({'Income':np.sum,'Age':np.mean,'Height':np.std})
The output of the preceding snippet looks similar to the following table:
Fig. 3.30: Selected summary statistics of selected columns for different groups
We can also define a function using the lambda
method of defining a calculation in Python. Suppose you don't want the mean of age but the ratio of mean and standard deviation for height. You can define the formula for this ratio using the lambda
method, illustrated as follows:
grouped=df.groupby(['Gender','Socio-Eco']) grouped.aggregate({'Age':np.mean,'Height':lambda x:np.mean(x)/np.std(x)})
Rather than applying different functions to different columns, one can apply several functions to all the columns at the same time, as shown:
grouped.aggregate([np.sum, np.mean, np.std])
The output of the code snippet contains the result of all the three functions applied on all the columns of the groupby
object, as seen in the following screenshot:
Fig. 3.31: More than one selected summary statistics of selected columns for different groups
Filtering
One important operation that can be applied on the groupby
objects is filter. We can filter elements based on the properties of groups. Suppose we want to choose elements from the Age
column that are a part of the group wherein the sum of Age
is greater than 700
. This filtering can be done by writing the following snippet:
grouped['Age'].filter(lambda x:x.sum()>700)
The output contains the row numbers that are part of the group where the sum of Age
is greater than 700
. The output is, as follows:
Fig. 3.32: The rows left after filtering it for elements, which are part of groups, where the sum of ages is greater than 700
Transformation
One can use the transform
method to mathematically transform all the elements in a numerical column. Suppose, we wish to calculate the standard normal values for all the elements in the numerical columns of our data frame; this can be done in a manner as shown:
zscore = lambda x: (x - x.mean()) / x.std() grouped.transform(zscore)
The output contains standard normal values for all the numerical columns in the data frame, as shown in the following screenshot:
Fig. 3.33: Result of applying a lambda defined function on the columns of groups
The transform
method comes in handy in a lot of situations. For example, it can be used to fill the missing values with the mean of the non-missing values, as shown:
f = lambda x: x.fillna(x.mean() grouped.transform(f)
Miscellaneous operations
In many situations, one needs to select the nth row of each group of a groupby
object, most often the first and the last row. This can be easily done once the groupby
object is created. Let's see how:
- The first row of each group can be selected by writing the following code snippet:
grouped.head(1)
- While the last row of each group can be selected by writing the following code snippet:
grouped.tail(1)
The result of the former, is as shown:
Fig. 3.34: First few rows of the grouped element
In general, we can use the nth
function to get the nth row from a group, as illustrated:
grouped=df.groupby('Gender') grouped.nth(1)
This gives the following result:
Fig. 3.35: First rows of each group
One can use any number (of course, less than the number of rows in each group) as the argument for the nth
function.
It is always a good practice to sort the data frame for the relevant columns before creating the groupby
object from the data frame. Suppose, you want to look at the youngest male and female members of this data frame.
This can be done by sorting the data frame, creating a groupby
object, and then taking the first element of each group:
df1=df.sort(['Age','Income']) grouped=df1.groupby('Gender') grouped.head(1)
The output has two rows containing the details of the two youngest members from the two groups:
Fig. 3.36: Sorting by the age column before grouping by gender and then selecting the first row from each group can give you the oldest/youngest guy in the group
The oldest members can be identified in the same way by typing grouped.tail(1)
.