Similarity and Data Standardization
For a clustering algorithm to find groups of customers, it needs some measure of what it means for two customers to be similar or different. In this section, we will learn how to think about the similarity between two data points and how to standardize data to prepare it for clustering.
Determining Similarity
To use clustering for customer segmentation (grouping customers together with other customers who have similar traits), you first have to decide what "similar" means; in other words, you need to be specific about which traits make customers similar. The customer traits you use should be those most closely related to the kind of marketing campaigns you would like to run.
Ideally, each feature you choose should have roughly equal importance in terms of how well it captures something meaningful about the customer, and each should be relevant to your goal. For example, segmenting customers based on the flavor of toothpaste they tend to buy may not make sense if you want to design marketing strategies for selling cars.
Customer behavior, such as how they have responded to marketing campaigns in the past, is often the most relevant kind of data. However, in the absence of this, it's often useful to segment customers based on other traits, such as their income and age.
Standardizing Data
To group customers based on continuous variables, we first need to rescale those variables so that they are on similar scales. Take age and income, for instance. These are on very different scales. It's not uncommon to see incomes that differ by $10,000, but it would be very odd to see two people whose ages differed by 10,000 years. Therefore, we need to be explicit about how large a change in one variable counts as equivalent to a given change in another in terms of customer similarity. For example, we might say that a difference of 10 years of age between two customers makes them as different for our purposes as an income difference of $10,000. However, making these kinds of determinations manually for each variable would be difficult. This is why we typically standardize the data, which puts every variable on a common scale.
One way to standardize variables for clustering is to calculate their z-scores, which is done in two steps:
- The first step is to subtract the mean of the data from each data point. This centers the data around 0, which makes it easier to look at and interpret, although centering is not strictly required for clustering.
- The second step is to divide each centered value by the standard deviation of the data.
The standard deviation is a measure of how spread out our points are. It is calculated from the differences between each data point and the mean of the data: roughly, it is the square root of the average squared difference. Data such as income, where the points can be spread out by many thousands, will have a much larger standard deviation than data such as age, where the differences between the data points tend to be much smaller. The following formula is used for calculating the standardized value of a data point:
Figure 3.1: The standardization equation
Here, z_i is the ith standardized value, x_i is the ith original value, mean(x) is the mean of all x values, and std(x) is the standard deviation of the x values.
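Written out, the standardization equation can be reconstructed from these definitions as follows. This is a reconstruction rather than a copy of the original figure; x_i denotes the ith original value and z_i its standardized counterpart:

```latex
z_i = \frac{x_i - \mathrm{mean}(x)}{\mathrm{std}(x)}
```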
In this example, by dividing all of our ages by the standard deviation of the ages, we transform the data such that the standard deviation is equal to 1. When we do the same thing with the income, the standard deviation of the income will also be equal to 1. Therefore, a difference of 1 between two customers on either of these measures would indicate a similar level of difference between them.
The downside of performing this kind of transformation is that the data becomes harder to interpret. We all have an intuitive sense of what $10,000 or 10 years means, but it's harder to think of what one standard deviation's worth of income means. However, we do this to help the machine learning algorithm, as it doesn't have the same intuitive understanding of the data that we do.
Note
Because standardization depends on both the mean and standard deviation of your data, the standardization is specific to your population's data. For example, if all of your customers are seniors, one year will account for a larger distance after standardization than if your customers included all age groups, since the standard deviation in age in your population would be much lower.
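The point made in this note can be illustrated with a small sketch. The two lists of ages below are hypothetical values invented for illustration, and the helper name z_scores is likewise an assumption; the sketch only uses Python's standard statistics module:

```python
import statistics

def z_scores(values):
    """Return the z-score of each value: (value - mean) / standard deviation."""
    mean = statistics.mean(values)
    # Sample standard deviation (divides by n - 1), the same default pandas uses.
    std = statistics.stdev(values)
    return [(value - mean) / std for value in values]

# Hypothetical populations; both end with customers aged 71 and 72.
senior_ages = [65, 66, 67, 68, 69, 70, 71, 72]
all_ages = [20, 28, 35, 41, 49, 56, 71, 72]

z_seniors = z_scores(senior_ages)
z_all = z_scores(all_ages)

# Standardized distance between the 71-year-old and the 72-year-old
# within each population.
one_year_seniors = z_seniors[-1] - z_seniors[-2]
one_year_all = z_all[-1] - z_all[-2]

print(f"One year among seniors only: {one_year_seniors:.3f}")
print(f"One year among all ages:     {one_year_all:.3f}")
```

The same one-year gap comes out larger in the seniors-only population because that population's standard deviation in age is smaller.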
Exercise 10: Standardizing Age and Income Data of Customers
In this exercise, you will deal with data pertaining to the ages and incomes of customers and learn how to standardize this data using z-scoring:
- First, you need to import the packages that you will be using, namely NumPy and pandas. NumPy is a widely used package for scientific computing, which we will use to create random data. pandas is a package that allows data to be stored and accessed using DataFrames, which are data structures that have rows and columns that make dealing with data much easier. Use the following code:
import numpy as np
import pandas as pd
- Create some data to stand in as your age and income data. Here, we have created each with different scales to better simulate what income and age data might really look like:
np.random.seed(100)
df = pd.DataFrame()
df['income'] = np.random.normal(50000, scale=10000, size=100)
df['age'] = np.random.normal(40, scale=10, size=100)
df = df.astype(int)
We first set NumPy's random seed so that everyone generates the same data. Then we create a DataFrame, df, to hold the data. We use the np.random.normal function to create normally distributed data on different scales: 100 values each for income and age, which we store in our DataFrame. Finally, we convert the values to int so that we end up with whole numbers.
- Use the head function to look at the first five rows of the data, as follows:
df.head()
The data should look like this:
Figure 3.2: The printed output of the first five rows of the data
- We can calculate the standard deviation of both columns simultaneously using the std function, which will return the standard deviation for all columns in our DataFrame:
df.std()
You should see the following output:
Figure 3.3: The standard deviation of the two columns
- Similarly, use the mean function to calculate the means of the two columns, as follows:
df.mean()
You should get the following values for income and age:
Figure 3.4: The mean of the two columns
- Next, you need to standardize the variables using their standard deviation and mean. Use the following snippet:
df['z_income'] = (df['income'] - df['income'].mean())/df['income'].std()
df['z_age'] = (df['age'] - df['age'].mean())/df['age'].std()
This will create two new columns, z_income and z_age, in our DataFrame, which contain the standardized values of income and age.
- Use the head function on the DataFrame again to look at the original data and their standardized values:
df.head()
Your output should look as follows:
Figure 3.5: The first five rows of the DataFrame after the standardized columns have been created
Note
The standardized columns should have a mix of small positive and negative values. They represent the number of standard deviations the original data point was from the mean (with positive being above the mean and negative being below it).
- Similarly, use the std function on the DataFrame to look at the standard deviations:
df.std()
Note that the standard deviations of the standardized columns should be 1, as you will observe in the screenshot here:
Figure 3.6: The standard deviations of each column in the DataFrame
- Finally, use the mean function on the DataFrame to look at the mean of all columns:
df.mean()
The means of the standardized columns should be very close to 0 (though not exactly 0 due to floating point precision), as shown here:
Figure 3.7: The mean values of each column in the DataFrame
Congratulations! You've successfully standardized the age and income data of customers. If you were to use this data for clustering, you would use the z_income and z_age columns of the DataFrame.
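When there are many columns, standardizing them one at a time becomes repetitive. The following is a minimal sketch of one way to standardize every column at once; the function name standardize and the variable names are assumptions made for illustration, and the data is regenerated with the same parameters as in the exercise:

```python
import numpy as np
import pandas as pd

def standardize(frame):
    """Return a new DataFrame with every column converted to z-scores."""
    # The subtraction and division are applied column by column.
    # DataFrame.std() uses the sample standard deviation (ddof=1) by default.
    return (frame - frame.mean()) / frame.std()

# Recreate data like the exercise's: income and age on different scales.
np.random.seed(100)
customers = pd.DataFrame({
    "income": np.random.normal(50000, scale=10000, size=100),
    "age": np.random.normal(40, scale=10, size=100),
}).astype(int)

# Prefix the standardized columns so they match the naming used above.
z = standardize(customers).add_prefix("z_")

print(z.std())   # each standard deviation should be 1, up to rounding
print(z.mean())  # each mean should be very close to 0
```

Note that NumPy's np.std defaults to the population standard deviation (ddof=0), so standardizing with NumPy directly gives slightly different values unless ddof=1 is passed.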
Calculating Distance
Once the data is standardized, we need to calculate the similarity between customers. Typically, this is done by calculating the distance between the customers in the feature space. In a two-dimensional scatterplot, the Euclidean distance between two customers is just the distance between their points, as you can see in the following plot:
Figure 3.8: A plot showing the Euclidean distance between two points
In the preceding plot, the length of the red line is the Euclidean distance between the two points. The larger this distance, the less similar the customers are. This is easier to think about in two dimensions, but the math for calculating Euclidean distance applies just as well to multiple dimensions.
For two data points, p and q, the distance between them is calculated as follows:
Figure 3.9: Equation for calculating the Euclidean distance between two points
Here, p = (p1, p2, ..., pn), q = (q1, q2, ..., qn), and n is the number of features.
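For reference, the standard Euclidean distance formula, reconstructed here rather than copied from the original figure, sums the squared difference in each of the n features of the points p and q and takes the square root:

```latex
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
```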
We can therefore find the distance between customers regardless of how many features/dimensions we want to use.
Note
This section describes finding the Euclidean distance between two points, which is the most common type of distance metric to use for clustering. Another common distance metric is the Manhattan distance.
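As a rough sketch of how these two metrics extend to any number of features, the functions below compute both distances for points given as equal-length sequences. The function names and the two example customers are hypothetical, chosen only for illustration:

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of features."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of features")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan_distance(p, q):
    """Sum of the absolute differences in each feature."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of features")
    return sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical standardized values of three features for two customers.
customer_a = (0.0, -1.0, 0.5)
customer_b = (3.0, 3.0, 0.5)

print(euclidean_distance(customer_a, customer_b))  # 5.0
print(manhattan_distance(customer_a, customer_b))  # 7.0
```

In Python 3.8 and later, math.dist(p, q) computes the same Euclidean distance directly.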
Exercise 11: Calculating Distance Between Three Customers
In this exercise, you will calculate the distance between three customers to learn how distance is calculated as well as the importance of standardization. For this, you need to calculate the distance between data points, both before and after standardization. The following is the data regarding the customers:
Figure 3.10: Table showing incomes and ages of three customers
- First, import the math package, as shown:
import math
- Next, create two lists, ages and incomes, holding the ages and incomes of the three customers. Use the following values:
ages = [40, 40, 30]
incomes = [40000, 30000, 40000]
- Calculate the distance between the first and the second customer using the following snippet:
math.sqrt((ages[0] - ages[1])**2 + (incomes[0] - incomes[1])**2)
The result should be 10000.0.
- Now calculate the distance between the first and third customer using the following snippet:
math.sqrt((ages[0] - ages[2])**2 + (incomes[0] - incomes[2])**2)
The result should be 10.0.
Note that this distance is much smaller in comparison to that obtained in the previous step because the difference between these two customers comes from their ages, where the absolute difference is much smaller than the difference in incomes between the first two customers.
- Now standardize the ages and incomes using the approximate mean and standard deviation from the previous exercise (for age, 40 and 10; for income, 50,000 and 10,000; these are the values that data was generated with). Use the snippet given here:
z_ages = [(age - 40)/10 for age in ages]
z_incomes = [(income - 50000)/10000 for income in incomes]
- Calculate the distance between the standardized scores of the first and second customer:
math.sqrt((z_ages[0] - z_ages[1])**2 + (z_incomes[0] - z_incomes[1])**2)
The result should be 1.0.
- Also, calculate the distance between the standardized scores of the first and third customer.
math.sqrt((z_ages[0] - z_ages[2])**2 + (z_incomes[0] - z_incomes[2])**2)
The result should again be 1.0.
As you can see, the distances are now equivalent, because the second customer's income is one standard deviation away from the first customer's while having the same age, and the third customer's age is one standard deviation away from the first customer's while having the same income.
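To compare every pair of customers at once, the steps of this exercise can be wrapped in a small helper. This is only a sketch: the function name distance and the loop over pairs are additions for illustration, while the data and standardization constants are the ones used above:

```python
import math
from itertools import combinations

ages = [40, 40, 30]
incomes = [40000, 30000, 40000]

# Standardize with the same constants as in the exercise.
z_ages = [(age - 40) / 10 for age in ages]
z_incomes = [(income - 50000) / 10000 for income in incomes]

def distance(i, j, xs, ys):
    """Euclidean distance between customers i and j over two features."""
    return math.sqrt((xs[i] - xs[j]) ** 2 + (ys[i] - ys[j]) ** 2)

# Compare each pair of customers before and after standardization.
for i, j in combinations(range(len(ages)), 2):
    raw = distance(i, j, ages, incomes)
    standardized = distance(i, j, z_ages, z_incomes)
    print(f"Customers {i + 1} and {j + 1}: "
          f"raw = {raw:.2f}, standardized = {standardized:.2f}")
```

The second and third customers differ by one standard deviation in both age and income, so their standardized distance is the square root of 2, or about 1.41.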
Activity 3: Loading, Standardizing, and Calculating Distance with a Dataset
For this activity, you have been provided with a dataset named customer_interactions.csv (https://github.com/TrainingByPackt/Data-Science-for-Marketing-Analytics/blob/master/Lesson03/customer_interactions.csv) that contains data regarding the amount spent by customers on your products and the number of times they have interacted with your business (for instance, by visiting your website). You've been asked to calculate how similar the first two customers in the dataset are to each other based on how frequently they interact with the business and their yearly spend on your business. Execute the following steps to complete this activity:
- Load the data from the customer_interactions.csv file into a pandas DataFrame, and look at the first five rows of data. You should see the following values:
Figure 3.11: The first few rows of the data in the customer_interactions.csv file
- Calculate the Euclidean distance between the first two data points in the DataFrame.
The output should be close to 437.07.
- Calculate the standardized values of the variables and store them in new columns named z_spend and z_interactions. Your DataFrame should now look like this:
Figure 3.12: The first few rows of the data after new columns are created for the standardized variables
- Calculate the distance between the first two data points using the standardized values.
You should get a final value that is close to 1.47.
Note
The solution for this activity can be found on page 333.
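As a starting point, the general shape of the computation might look like the following sketch. It is not the book's solution: the column names spend and interactions are assumptions inferred from the z_spend and z_interactions names above, the helper distance_between_rows is hypothetical, and the small DataFrame is invented stand-in data used in place of the real file:

```python
import math
import pandas as pd

def distance_between_rows(frame, columns, first=0, second=1):
    """Euclidean distance between two rows (by position) over the given columns."""
    a = frame[columns].iloc[first]
    b = frame[columns].iloc[second]
    return math.sqrt(((a - b) ** 2).sum())

# Invented stand-in data. For the activity you would load the real file
# instead, for example: df = pd.read_csv("customer_interactions.csv")
# The column names "spend" and "interactions" are assumptions.
df = pd.DataFrame({
    "spend": [1200.0, 800.0, 950.0, 400.0],
    "interactions": [10, 4, 7, 2],
})

# Distance on the raw values is dominated by the larger-scale column.
raw_distance = distance_between_rows(df, ["spend", "interactions"])

# Add standardized columns, then measure the distance again.
for column in ["spend", "interactions"]:
    df["z_" + column] = (df[column] - df[column].mean()) / df[column].std()

z_distance = distance_between_rows(df, ["z_spend", "z_interactions"])

print(f"Raw distance: {raw_distance:.2f}")
print(f"Standardized distance: {z_distance:.2f}")
```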