
Frequency distributions
A common way of describing univariate data is with a frequency distribution. We've already seen an example of a frequency distribution when we looked at the preferences for soy ice cream at the end of the last chapter. For each flavor of ice cream (categorical variable), it depicted the count or frequency of the occurrences in the underlying data set.
To demonstrate examples of other frequency distributions, we need to find some data. Fortunately, for the convenience of useRs everywhere, R comes preloaded with almost one hundred datasets. You can view a full list if you execute help(package="datasets"). There are also hundreds more available from add-on packages.
The first dataset that we are going to use is mtcars, which contains data on the design and performance of 32 automobiles, extracted from the 1974 Motor Trend US magazine. (To find out more information about this dataset, execute ?mtcars.)
Take a look at the first few lines of this dataset using the head function:
> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
Check out the carb column, which represents the number of carburetors; by now you should recognize this as a discrete numeric variable, though we can (and will!) treat this as a categorical variable for now.
Running the carb vector through the unique function yields the distinct values that this vector contains:
> unique(mtcars$carb)
[1] 4 1 2 3 6 8
We can see that there must be repeats in the carb vector, but how many? An easy way to perform a frequency tabulation in R is to use the table function:
> table(mtcars$carb)

 1  2  3  4  6  8 
 7 10  3 10  1  1 
From the result of the preceding function, we can tell that there are 10 cars with 2 carburetors and 10 with 4, and there is one car each with 6 and 8 carburetors. The value with the most occurrences in a dataset (in this example, the carb column is our whole dataset) is called the mode. In this case, there are two such values, 2 and 4, so this dataset is bimodal. (There is a package in R, called modeest, to find modes easily.)
Frequency distributions are more often depicted as a chart or plot than as a table of numbers. When the univariate data is categorical, it is commonly represented as a bar chart, as shown in Figure 2.1:
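One simple way to draw such a bar chart is with base R's barplot function (a sketch; the figures in this book may well have been produced with a different plotting system):

```r
# Tabulate the carburetor counts, then draw one bar per
# distinct carb value, with bar height equal to its frequency
carb.freq <- table(mtcars$carb)
barplot(carb.freq,
        xlab = "number of carburetors",
        ylab = "frequency",
        main = "Frequency distribution of carburetors")
```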
The other dataset that we are going to use to demonstrate a frequency distribution of a continuous variable is the airquality dataset, which holds the daily air quality measurements from May to September in New York. Take a look at it using the head and str functions. The univariate data that we will be using is the Temp column, which contains the temperature data in degrees Fahrenheit.
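For example (console output omitted here):

```r
head(airquality)  # first six rows: Ozone, Solar.R, Wind, Temp, Month, Day
str(airquality)   # a data frame of 153 observations of 6 variables
```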

Figure 2.1: Frequency distribution of number of carburetors in mtcars dataset
It would be useless to take the same approach to frequency tabulation as we did in the case of the car carburetors. If we did so, we would have a table containing the frequencies for each of the 40 unique temperatures—and there would be far more if the temperature wasn't rounded to the nearest degree. Additionally, who cares that there was one occurrence of 63 degrees and two occurrences of 64? I sure don't! What we do care about is the approximate temperature.
Our first step towards building a frequency distribution of the temperature data is to bin the data, which is to say, we divide the range of values of the vector into a series of smaller intervals. This binning is a method of discretizing a continuous variable. We then count the number of values that fall into each interval.
Choosing the size of bins to use is tricky. If there are too many bins, we run into the same problem as we did with the raw data and have an unwieldy number of columns in our frequency tabulation. If we make too few, however, we lose resolution and may lose important information. Choosing the right number of bins is more art than science, but there are certain commonly used heuristics that often produce sensible results.
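For instance, base R ships with a few of these heuristics in the grDevices package (loaded by default); you can ask each one how many bins it would suggest for the temperature data:

```r
# Three common bin-count heuristics built into base R
nclass.Sturges(airquality$Temp)  # Sturges' rule
nclass.scott(airquality$Temp)    # Scott's rule
nclass.FD(airquality$Temp)       # Freedman-Diaconis rule
```

Sturges' rule, for example, suggests 9 bins for this vector.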
We can have R construct n equally spaced bins for us by using the cut function which, in its simplest use case, takes a vector of data and the number of bins to create:
> cut(airquality$Temp, 9)
We can then feed this result into the table function for a far more manageable frequency tabulation:
> table(cut(airquality$Temp, 9))

  (56,60.6] (60.6,65.1] (65.1,69.7] (69.7,74.2] (74.2,78.8] 
          8          10          14          16          26 
(78.8,83.3] (83.3,87.9] (87.9,92.4]   (92.4,97] 
         35          22          15           7 
Rad!
Remember when we used a bar chart to visualize the frequency distributions of categorical data? The common method for visualizing the distribution of discretized continuous data is a histogram, as seen in Figure 2.2:
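One way to draw such a histogram is with base R's hist function (again, a sketch; the original figure may have been produced differently):

```r
# Histogram of the daily temperatures, requesting 9 bins.
# Note that hist treats 'breaks' as a suggestion and may adjust
# the bin boundaries to "pretty" values.
hist(airquality$Temp,
     breaks = 9,
     xlab = "temperature (degrees Fahrenheit)",
     main = "Daily temperature measurements from May to September in NYC")
```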

Figure 2.2: Daily temperature measurements from May to September in NYC