R Programming By Example
上QQ阅读APP看书,第一时间看更新

Vectors

The fundamental data type in R is the vector, which is an ordered collection of values. The first thing you should know is that unlike other languages, single values for numbers, strings, and logicals, are special cases of vectors (vectors of length one), which means that there's no concept of scalars in R. A vector is a one-dimensional data structure and all of its elements are of the same data type.

The simplest way to create a vector is with the c() function, which stands for combine, and coerces all of its arguments into a single type. The coercion will happen from simpler types into more complex types. That is, if we create a vector which contains logicals, numerics, and characters, as the following example shows, our resulting vector will only contain characters, which are the more complex of the three types. If we create a vector that contains logicals and numerics, our resulting vector will be numeric, again because it's the more complex type.

Vectors can be named or unnamed. Unnamed vector elements can only be accessed through positional references, while named vectors can be accessed through positional references as well as name references. In the example below, the y vector is a named vector, where each element is named with a letter from A to I. This means that in the case of x, we can only access elements using their position (the first position is considered as 1 instead of the 0 used in other languages), but in the case of y, we may also use the names we assigned.

Also note that the special values we mentioned before, that is NA, NULL, NaN, and Inf, will be coerced into characters if that's the more complex type, except NA, which stays the same. In case coercion is happening toward numerics, they all stay the same since they are valid numeric values. Finally, if we want to know the length of a vector, simply call the length() function upon it:

x <- c(TRUE, FALSE, -1, 0, 1, "A", "B", NA, NULL, NaN, Inf)
x
#> [1] "TRUE" "FALSE" "-1" "0" "1" "A" "B" NA
#> [9] "NaN" "Inf" x[1]
#> [1] "TRUE" x[5]
#> [1] "1" y <- c(A=TRUE, B=FALSE, C=-1, D=0, E=1, F=NA, G=NULL, H=NaN, I=Inf) y
#> A B C D E F H I
#> 1 0 -1 0 1 NA NaN Inf y[1]
#> A
#> 1
y["A"]
#> A
#> 1
y[5]
#> E
#> 1
y["E"]
#> E
#> 1
length(x)
#> [1] 10
length(y)
#> [1] 8

Furthermore, we can select sets or ranges of elements using vectors with index numbers for the values we want to retrieve. For example, using the selector c(1, 2) would retrieve the first two elements of the vector, while using the c(1, 3, 5) would return the first, third, and fifth elements. The : function (yes, it's a function even though we don't normally use the function-like syntax we have seen so far in other examples to call it), is often used as a shortcut to create range selectors. For example, the 1:5 syntax means that we want a vector with elements 1 through 5, which would be equivalent to explicitly using c(1, 2, 3, 4, 5). Furthermore, if we send a vector of logicals, which must have the same length as the vector we want to retrieve values from, each of the logical values will be associated to the corresponding position in the vector we want to retrieve from, and if the corresponding logical is TRUE, the value will be retrieved, but if it's FALSE, it won't be. All of these selection methods are shown in the following example:

x[c(1, 2, 3, 4, 5)]
#> [1] "TRUE" "FALSE" "-1" "0" "1"
x[1:5] #> [1] "TRUE" "FALSE" "-1" "0" "1"
x[c(1, 3, 5)]
#> [1] "TRUE" "-1" "1"
x[c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE,
FALSE, TRUE, FALSE, TRUE)]
#> [1] "TRUE" "-1" "1" "B" "NaN" NA

Next we will talk about operation among vectors. In the case of numeric vectors, we can apply operations element-to-element by simply using operators as we normally would. In this case, R will match the elements of the two vectors pairwise and return a vector. The following example shows how two vectors are added, subtracted, multiplied, and pided in an element-to-element way. Furthermore, since we are working with vectors of the same length, we may want to get their dot-product (if you don't know what a dot-product is, you may take a look at https://en.wikipedia.org/wiki/Dot_product), which we can do using the %*% operator, which performs matrix-like multiplications, in this case vector-to-vector:

x <- c(1, 2, 3, 4)
y <- c(5, 6, 7, 8)
x + y
#> [1] 6 8 10 12
x - y
#> [1] -4 -4 -4 -4
x * y
#> [1] 5 12 21 32
x / y
#> [1] 0.2000 0.3333 0.4286 0.5000
x %*% y
#> [,1]
#> [1,] 70

If you want to combine multiple vectors into a single one, you can simply use the c() recursively on them, and it will flatten them for you automatically. Let's say we want to combine the x and y into the z such that the y elements appear first. Furthermore, suppose that after we do we want to sort them, so we apply the sort() function on z:

z <- c(y, x)
z
#> [1] 5 6 7 8 1 2 3 4
sort(z)
#> [1] 1 2 3 4 5 6 7 8

A common source for confusion is how R deals with vectors of different lengths. If we apply an element-to-element operation, as the ones we covered earlier, but using vectors of different lengths, we may expect R to throw an error, as is the case in other languages. However, it does not. Instead, it repeats vector elements in order until they all have the same length. The following example shows three vectors, each of different lengths, and the result of adding them together.

The way R is configured by default, you will actually get a warning message to let you know that the vectors you operated on were not of the same length, but since R can be configured to avoid showing warnings, you should not rely on them:

c(1, 2) + c(3, 4, 5) + c(6, 7, 8, 9)
#> Warning in c(1, 2) + c(3, 4, 5):
longer object length is not a multiple of

#> shorter object length
#> Warning in c(1, 2) + c(3, 4, 5) + c(6, 7, 8, 9):
longer object length is

#> not a multiple of shorter object length
#> [1] 10 13 14 13

The first thing that may come to mind is that the first vector is expanded into c(1, 2, 1, 2), the second vector is expanded into c(3, 4, 5, 3), and the third one is kept as is, since it's the largest one. Then if we add these vectors together, the result would be c(10, 13, 14, 14). However, as you can see in the example, the result actually is c(10, 13, 14, 13). So, what are we missing? The source of confusion is that R does this step by step, meaning that it will first perform the addition c(1, 2) + c(3, 4, 5), which after being expanded is c(1, 2, 1) + c(3, 4, 5) and results in c(4, 6, 6), then given this result, the next step that R performs is c(4, 6, 6) + c(6, 7, 8, 9), which after being expanded is c(4, 6, 6, 4) + c(6, 7, 8, 9), and that's where the result we get comes from. It can be confusing at first, but just remember to imagine the operations step by step.

Finally, we will briefly mention a very powerful feature in R, known as vectorization. Vectorization means that you apply an operation to a vector at once, instead of independently doing so to each of its elements. This is a feature you should get to know quite well. Programming without it is considered to be bad R code, and not just for syntactic reasons, but because vectorized code takes advantage of many internal optimizations in R, which results in much faster code. We will show different ways of vectorizing code in Chapter 9, Implementing An Efficient Simple Moving Average, and in this chapter, we will see an example, followed by a couple more in following sections.

Even though the phrase vectorized code may seem scary or magical at first, in reality, R makes it quite simple to implement in some cases. For example, we can square each of the elements in the x vector by using the x symbol as if it were a single number. R is intelligent enough to understand that we want to apply the operation to each of the elements in the vector. Many functions in R can be applied using this technique:

x^2
#> [1] 1 4 9 16

We will see more examples that really showcase how vectorization can shine in the following section about functions, where we will see how to apply vectorized operations even when the operations depend on other parameters.