
The relationship between two continuous variables
Do you think that there is a relationship between women's heights and their weights? If you said yes, congratulations, you're right!
We can verify this assertion by using the data in R's built-in dataset, women, which holds the height and weight of 15 American women from ages 30 to 39.
> head(women)
  height weight
1     58    115
2     59    117
3     60    120
4     61    123
5     62    126
6     63    129
> nrow(women)
[1] 15
Specifically, this relationship is referred to as a positive relationship, because as one of the variables increases, we expect the other variable to increase as well.
The most typical visual representation of the relationship between two continuous variables is a scatterplot.
A scatterplot is displayed as a group of points whose position along the x-axis is established by one variable, and the position along the y-axis is established by the other. When there is a positive relationship, the dots, for the most part, start in the lower-left corner and extend to the upper-right corner, as shown in the following figure. When there is a negative relationship, the dots start in the upper-left corner and extend to the lower-right one. When there is no relationship, it will look as if the dots are all over the place.

Figure 3.4: Scatterplot of women's heights and weights
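A scatterplot like this one can be generated with base R's plot function (a minimal sketch; the book's figure may have been drawn with different styling):

> plot(women$height, women$weight, xlab="height (in)", ylab="weight (lbs)")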
The more the dots look like they form a straight line, the stronger the relationship between the two continuous variables is said to be; the more diffuse the points, the weaker the relationship. The dots in the preceding figure look almost exactly like a straight line; this is pretty much as strong a relationship as they come.
These kinds of relationships are colloquially referred to as correlations.
Covariance
As always, visualizations are great—necessary, even—but on most occasions, we are going to quantify these correlations and summarize them with numbers.
The simplest measure of correlation that is widely used is the covariance. For each pair of values from the two variables, the differences from their respective means are taken, and those differences are multiplied together. If both values are above their respective means, the product is positive; if both values are below their means, the product is also positive, because the product of two negative numbers is positive. Only when one value is above its mean and the other is below its mean is the product negative. Summing these products and dividing by the degrees of freedom gives the sample covariance:

$$\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

Remember, in sample statistics we divide by the degrees of freedom, n - 1, and not the sample size. Note that this also means that the covariance is only defined for two vectors that have the same length.
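To make the arithmetic concrete, here is a quick sketch that computes this quantity by hand, using the women data from earlier:

> x <- women$height
> y <- women$weight
> # sum of the products of the deviations from the means,
> # divided by the degrees of freedom (n - 1)
> sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)
[1] 69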
We can find the covariance between two variables in R using the cov function. Let's find the covariance between the heights and weights in the dataset, women:
> cov(women$weight, women$height)
[1] 69
> # the order we put the two columns in
> # the arguments doesn't matter
> cov(women$height, women$weight)
[1] 69
The covariance is positive, which denotes a positive relationship between the two variables.
The covariance, by itself, is difficult to interpret. It is especially difficult to interpret in this case, because the measurements use different scales: inches and pounds. It is also heavily dependent on the variability in each variable.
Consider what happens when we take the covariance of the weights in pounds and the heights in centimeters.
> # there are 2.54 centimeters in each inch
> # changing the units to centimeters increases
> # the variability within the height variable
> cov(women$height*2.54, women$weight)
[1] 175.26
Semantically speaking, the relationship hasn't changed, so why should the covariance?
Correlation coefficients
A solution to this quirk of covariance is to use Pearson's correlation coefficient instead. Outside its colloquial context, when the word correlation is uttered—especially by analysts, statisticians, or scientists—it usually refers to Pearson's correlation.
Pearson's correlation coefficient is different from covariance in that instead of using the sum of the products of the deviations from the mean in the numerator, it uses the sum of the products of the number of standard deviations away from the mean. These number-of-standard-deviations-from-the-mean are called z-scores. If a value has a z-score of 1.5, it is 1.5 standard deviations above the mean; if a value has a z-score of -2, then it is 2 standard deviations below the mean.
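For example, we can compute z-scores by hand by subtracting the mean and dividing by the standard deviation; base R's scale function performs the same standardization (a quick sketch to check):

> # a value's z-score is its distance from the mean, in standard deviations
> z.heights <- (women$height - mean(women$height)) / sd(women$height)
> # scale() does the same centering and scaling by default
> all.equal(z.heights, as.vector(scale(women$height)))
[1] TRUE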
Pearson's correlation coefficient is usually denoted by r and its equation is given as follows:

$$r = \frac{\mathrm{cov}(X, Y)}{s_X \, s_Y}$$

which is the covariance divided by the product of the two variables' standard deviations.
An important consequence of using standardized z-scores instead of the raw magnitudes of the distances from the mean is that changing the variability in one variable does not change the correlation coefficient. Now you can meaningfully compare values that use two different scales or even two different distributions. The correlation between weight/height-in-inches and weight/height-in-centimeters will now be identical, because multiplication by 2.54 will not change the z-scores of the heights.
> cor(women$height, women$weight)
[1] 0.9954948
> cor(women$height*2.54, women$weight)
[1] 0.9954948
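We can also check the definition directly: dividing the covariance by the product of the two standard deviations reproduces the value that cor returns (a quick sanity check, not how you would normally compute it):

> cov(women$height, women$weight) / (sd(women$height) * sd(women$weight))
[1] 0.9954948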
Another important and helpful consequence of this standardization is that the measure of correlation will always range from -1 to 1. A Pearson correlation coefficient of 1 denotes a perfectly positive (linear) relationship, an r of -1 denotes a perfectly negative (linear) relationship, and an r of 0 denotes no (linear) relationship.
Why the linear qualification in parentheses, though?
Intuitively, the correlation coefficient shows how well two variables are described by the straight line that fits the data most closely; this is called a regression or trend line. If there is a strong relationship between two variables, but the relationship is not linear, it cannot be represented accurately by Pearson's r. For example, the correlation between the numbers 1 to 100 and those same numbers plus 100 is 1 (because the relationship is perfectly linear), but a cubic relationship is not:
> xs <- 1:100
> cor(xs, xs+100)
[1] 1
> cor(xs, xs^3)
[1] 0.917552
It is still about 0.92, which is an extremely strong correlation, but not the 1 that you should expect from a perfect correlation.
So Pearson's r assumes a linear relationship between two variables. There are, however, other correlation coefficients that are more tolerant of non-linear relationships. Probably the most common of these is Spearman's rank correlation coefficient, also called Spearman's rho.
Spearman's rho is calculated by taking the Pearson correlation not of the values, but of their ranks.
Note
What's a rank?
When you assign ranks to a vector of numbers, the lowest number gets 1, the second lowest gets 2, and so on. The highest datum in the vector gets a rank that is equal to the number of elements in that vector.
In rankings, the magnitude of the difference between the elements' values is disregarded. Consider a race to a finish line involving three cars. Let's say that the winner finished at a speed three times that of the car in second place, and the car in second place beat the car in third place by only a few seconds. The driver of the car that came in first has good reason to be proud of herself, but her rank, 1st place, says nothing about how thoroughly she wiped the floor with the other two drivers.
Try using R's rank function on the vector c(8, 6, 7, 5, 3, 0, 9). Now try it on the vector c(8, 6, 7, 5, 3, -100, 99999). The rankings are the same, right?
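If you try it yourself, you should see something like this at the console:

> rank(c(8, 6, 7, 5, 3, 0, 9))
[1] 6 4 5 3 2 1 7
> rank(c(8, 6, 7, 5, 3, -100, 99999))
[1] 6 4 5 3 2 1 7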
When we use ranks instead, the pair with the lowest value on both the x and the y axes will be c(1, 1), even if one variable is a non-linear but monotonic function of the other (cubed, squared, logarithmic, and so on). The correlations that we just tested will both have Spearman rhos of 1, because cubing a value will not change its rank.
> xs <- 1:100
> cor(xs, xs+100, method="spearman")
[1] 1
> cor(xs, xs^3, method="spearman")
[1] 1
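Since Spearman's rho is just Pearson's r applied to the ranks, we can also reproduce that result by hand (a quick check):

> cor(rank(xs), rank(xs^3))
[1] 1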

Figure 3.5: Scatterplot of y=x + 100 with regression line. r and rho are both 1

Figure 3.6: Scatterplot of y = x^3 with regression line. r is .92, but rho is 1
Let's use what we've learned so far to investigate the correlation between the weight of a car and the number of miles it gets to the gallon. Do you predict a negative relationship (the heavier the car, the lower the miles per gallon)?
> cor(mtcars$wt, mtcars$mpg)
[1] -0.8676594

Figure 3.7: Scatterplot of the relationship between the weight of a car and its miles per gallon
That is a strong negative relationship. Note, though, that in the preceding figure the data points are more diffuse and spread further from the regression line than in the other plots; this indicates a somewhat weaker relationship than we have seen thus far.
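Scatterplots like the preceding ones can be produced with base R's plot function, and the regression (trend) line can be overlaid with abline and lm. A minimal sketch, which may differ in styling from the book's figures:

> plot(mtcars$wt, mtcars$mpg, xlab="weight (1000 lbs)", ylab="miles per gallon")
> # fit a simple linear model and draw its regression line
> abline(lm(mpg ~ wt, data=mtcars))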
For an even weaker relationship, check out the correlation between wind speed and temperature in the airquality dataset, as depicted in the following image:
> cor(airquality$Temp, airquality$Wind)
[1] -0.4579879
> cor(airquality$Temp, airquality$Wind, method="spearman")
[1] -0.4465408

Figure 3.8: Scatterplot of the relationship between wind speed and temperature
Comparing multiple correlations
Armed with our new standardized coefficients, we can now effectively compare the correlations between different pairs of variables directly.
In data analysis, it is common to compare the correlations between all the numeric variables in a single dataset. We can do this with the iris dataset using the following R code snippet:
> # have to drop 5th column (species is not numeric)
> iris.nospecies <- iris[, -5]
> cor(iris.nospecies)
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
This produces a correlation matrix (when it is done with the covariance, it is called a covariance matrix). It is square (it has the same number of rows and columns) and symmetric, which means that the matrix is identical to its transposition (the matrix with its axes flipped). It is symmetric because each pair of variables appears twice, once on either side of the diagonal. The diagonal is all 1s, because every variable is perfectly correlated with itself. Which are the most highly (positively) correlated pairs of variables? What about the most negatively correlated?
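If you would like R to answer those questions for you, one possible approach (just a sketch using base R) is to blank out the diagonal and then look up the largest and smallest remaining entries:

> iris.cors <- cor(iris.nospecies)
> diag(iris.cors) <- NA   # ignore each variable's perfect correlation with itself
> # row and column indices of the strongest positive correlation
> which(iris.cors == max(iris.cors, na.rm=TRUE), arr.ind=TRUE)
> # ...and of the strongest negative correlation
> which(iris.cors == min(iris.cors, na.rm=TRUE), arr.ind=TRUE)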