Mastering Machine Learning with R (Second Edition)

Interaction terms

Interaction terms are similarly easy to code in R. Two features interact if the effect of one feature on the prediction depends on the value of the other feature. This follows the formulation Y = B0 + B1x1 + B2x2 + B3x1x2 + e, where x1 and x2 are the two features and B3 is the coefficient of their product. An example is available in the MASS package with the Boston dataset. The response is the median home value, which is medv in the output. We will use two features: the percentage of the population with a lower socioeconomic status, which is termed lstat, and the proportion of owner-occupied units built before 1940, a proxy for the age of the housing, which is termed age in the following output:

    > library(MASS)
    > data(Boston)
    > str(Boston)
    'data.frame': 506 obs. of 14 variables:
     $ crim   : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
     $ zn     : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
     $ indus  : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
     $ chas   : int 0 0 0 0 0 0 0 0 0 0 ...
     $ nox    : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
     $ rm     : num 6.58 6.42 7.18 7 7.15 ...
     $ age    : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
     $ dis    : num 4.09 4.97 4.97 6.06 6.06 ...
     $ rad    : int 1 2 2 3 3 3 5 5 5 5 ...
     $ tax    : num 296 242 242 222 222 222 311 311 311 311 ...
     $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
     $ black  : num 397 397 393 395 397 ...
     $ lstat  : num 4.98 9.14 4.03 2.94 5.33 ...
     $ medv   : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
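
Before fitting the Boston model, the interaction formulation described earlier can be made concrete with simulated data. The following is a minimal sketch and is not part of the original example: the names x1, x2, sim.fit, and slope.x1, as well as the coefficient values, are assumptions chosen purely for illustration.

```r
# Hypothetical simulated data; names and coefficient values are illustrative only
set.seed(123)
n  <- 500
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 10)

# Assumed true model: Y = B0 + B1*x1 + B2*x2 + B3*x1*x2 + e
y <- 2 + 1.5 * x1 - 0.5 * x2 + 0.3 * x1 * x2 + rnorm(n, sd = 1)

# In a formula, x1 * x2 requests both main effects and the x1:x2 interaction
sim.fit <- lm(y ~ x1 * x2)
round(coef(sim.fit), 2)

# With an interaction, the estimated slope of x1 depends on the value of x2:
# slope = B1 + B3 * x2
slope.x1 <- function(x2.value) {
  unname(coef(sim.fit)["x1"] + coef(sim.fit)["x1:x2"] * x2.value)
}
slope.x1(c(0, 5, 10))
```

Because the simulated interaction coefficient is positive, the estimated slope of x1 should grow as x2 increases, which is what it means for the effect of one feature to depend on the other.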

Using feature1 * feature2 in the formula passed to the lm() function puts both features, as well as their interaction term, in the model, as follows:

    > value.fit <- lm(medv ~ lstat * age, data = Boston)
    > summary(value.fit)

    Call:
    lm(formula = medv ~ lstat * age, data = Boston)

    Residuals:
        Min      1Q  Median      3Q     Max
    -15.806  -4.045  -1.333   2.085  27.552

    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept) 36.0885359  1.4698355  24.553  < 2e-16 ***
    lstat       -1.3921168  0.1674555  -8.313 8.78e-16 ***
    age         -0.0007209  0.0198792  -0.036   0.9711
    lstat:age    0.0041560  0.0018518   2.244   0.0252 *
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 6.149 on 502 degrees of freedom
    Multiple R-squared: 0.5557, Adjusted R-squared: 0.5531
    F-statistic: 209.3 on 3 and 502 DF, p-value: < 2.2e-16

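Examining the output, we can see that, while the socioeconomic status is a highly predictive feature, the main effect of age is not statistically significant. However, the two features have a significant interaction (p-value of 0.0252): the positive lstat:age coefficient means that the negative association between lstat and the home value weakens as age increases.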
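
The interpretation above can be checked directly from the fitted coefficients. The following is a minimal sketch under stated assumptions: long.fit and lstat.slope are hypothetical helper names introduced here, and the age values passed to lstat.slope() are arbitrary illustrative choices rather than values taken from the text.

```r
library(MASS)  # provides the Boston data

value.fit <- lm(medv ~ lstat * age, data = Boston)

# The * operator is shorthand for both main effects plus the interaction,
# so spelling the terms out should give the same coefficients
long.fit <- lm(medv ~ lstat + age + lstat:age, data = Boston)
all.equal(coef(value.fit), coef(long.fit))

# Estimated slope of lstat at a given value of age:
# coefficient of lstat + coefficient of lstat:age * age
b <- coef(value.fit)
lstat.slope <- function(age.value) {
  unname(b["lstat"] + b["lstat:age"] * age.value)
}

# Arbitrary illustrative age values
lstat.slope(c(10, 50, 90))
```

The slope of lstat stays negative across these values but moves toward zero as age rises, which is the sense in which the interaction is positive.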