Modeling and evaluation – MARS
How would you like a modeling technique that provides all of the following:
- Offers the flexibility to build linear and nonlinear models for both regression and classification
- Can support variable interaction terms
- Is simple to understand and explain
- Requires little data processing
- Handles all types of data: numeric and categorical
- Performs well on unseen data; that is, it handles the bias-variance trade-off well
If that all sounds appealing, then I cannot recommend the use of MARS models enough. I've found them to perform exceptionally well. In fact, in a past classification problem of mine, they outperformed both a random forest and boosted trees on test/validation data.
Understanding MARS is quite simple:
- First, just start with a linear or generalized linear model like we discussed previously.
- Then, to capture any nonlinear relationship, a hinge function is added. These hinges are simply points (knots) in an input feature at which the coefficient changes. For example, say we start with a simple linear model of two input features, variables 1 and 2, each on a scale of 1 to 10.
- Now, let's see how a hinge function for variable 2, with a knot at 5.5, could come into play.
We read the hinge function as the maximum of either 0 or variable 2 minus 5.5. So, whenever variable 2 has a value greater than 5.5, the amount by which it exceeds 5.5 is multiplied by the coefficient; otherwise, the term is zero. The method will accommodate multiple hinges for each variable and also interaction terms; a brief sketch of the idea follows.
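To make the hinge concrete, here's a small illustrative sketch in R; the intercept and coefficient values are invented purely for illustration and are not from this chapter's data:
# The MARS hinge function: max(0, x - knot)
hinge <- function(x, knot) pmax(0, x - knot)
variable1 <- 7
variable2 <- 8
# A plain linear model: one slope per feature over its entire range
y_linear <- 10 + 1.5 * variable1 + 3.0 * variable2
# The same model with a hinge on variable 2 at 5.5: the 3.0 coefficient now applies
# only to the amount by which variable 2 exceeds 5.5 (here, 8 - 5.5 = 2.5);
# below 5.5 that term contributes zero
y_hinged <- 10 + 1.5 * variable1 + 3.0 * hinge(variable2, 5.5)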
The other interesting thing about MARS is its automatic variable selection. This can be done with cross-validation, but the default is a forward pass, much like forward selection, followed by a backward pass to prune the model, since the model produced by the forward pass is likely to overfit the data. This backward pass prunes input features and removes hinges based on Generalized Cross-Validation (GCV):
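For reference, the GCV criterion as computed by the earth package takes the following form (this is a paraphrase of the package's documentation rather than a formula reproduced from this chapter):
GCV = RSS / (N x (1 - enp/N)^2)
Here, RSS is the residual sum of squares on the training data, N is the number of observations, and enp is the effective number of parameters, which earth computes as the number of model terms plus penalty x (number of terms - 1) / 2.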
In the earth package in R, the penalty is 2 for an additive model and 3 for a multiplicative model, that is, one with interaction terms. There are quite a few parameters you can tune in earth; in the example, I'll demonstrate a practical and straightforward way to implement the methodology. If you so desire, you can learn more about its flexibility in the excellent vignette on the earth package by Stephen Milborrow, available at http://www.milbo.org/doc/earth-notes.pdf.
I'll specify model selection by five-fold cross-validation (pmethod = 'cv' and nfold = 5), an additive model with no interactions (degree = 1), only one hinge per input feature (minspan = -1), and a maximum of 25 terms retained after pruning (nprune = 25). The code is as follows:
> set.seed(1988)
> earth_fit <-
    earth::earth(
      x = train_reduced[, -96],
      y = train_reduced[, 96],
      pmethod = 'cv',
      nfold = 5,
      degree = 1,
      minspan = -1,
      nprune = 25
    )
The summary() of earth_fit is quite lengthy, so here's an abbreviated version:
> summary(earth_fit)
Selected 20 of 26 terms, and 13 of 95 predictors using pmethod="cv"
Termination condition: RSq changed by less than 0.001 at 26 terms
Importance: OverallQual, X1stFlrSF, X2ndFlrSF, yearsOld, ...
Number of terms at each degree of interaction: 1 19 (additive model)
GRSq 0.9052 RSq 0.9113 mean.oof.RSq 0.8979 (sd 0.0115)
What we can discern is that only 13 features were selected, for a total of 20 terms, including hinged features. mean.oof.RSq is the average of the out-of-fold R-squared values (0.8979), and the full-model R-squared (RSq) is 0.9113; GRSq is the same measure computed from GCV rather than RSS. You can call feature importance as well:
> earth::evimp(earth_fit)
nsubsets gcv rss
OverallQual 19 100.0 100.0
X1stFlrSF 17 49.7 50.0
X2ndFlrSF 16 42.7 43.0
yearsOld 14 33.8 34.1
OverallCond 13 28.0 28.4
BsmtFinSF1 11 22.6 23.1
LotArea 10 19.1 19.7
Fireplaces 7 12.7 13.4
yearsGarage_isNA 6 10.9 11.6
CentralAir_lev_x_Y 4 7.9 8.5
Functional_lev_x_Typ 3 6.3 6.9
Condition1_lev_x_Norm 2 5.1 5.6
ScreenPorch 1 3.4 3.8
We see the feature name; nsubsets, which is the number of model subsets from the pruning pass that include the feature (reported even though we selected the final model by cross-validation); and the gcv and rss columns, which show the decrease in the respective criterion attributable to the feature, scaled from 0 to 100. Notice that the feature we created, yearsGarage_isNA, was selected by the model. You could puzzle out the hinge functions from the model summary, but there's an excellent visual for seeing the various piecewise linear functions:
> plotmo::plotmo(earth_fit, nrug = TRUE, rug.col = "red")
The output of the preceding code is as follows:
Notice in the plot that LotArea contains a hinge. Initially, as the size of the property increases, the predicted value rises rather dramatically; then, at a certain point, a new slope applies from there to the maximum observed value. Contrast that with OverallCond, which has a single slope coefficient over all possible values. This is an excellent example of how MARS can capture linear and nonlinear relationships in a piecewise fashion.
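If you'd rather read the knot locations as numbers than eyeball them from the plot, they are stored on the fitted model object; this is a hedged sketch based on the components documented for earth objects (cuts holds the knot value for each model term and predictor, and dirs the hinge directions):
> head(earth_fit$cuts) # knot (cut point) for each model term and predictor
> head(earth_fit$dirs) # hinge directions (0 means the predictor isn't in that term)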
Now, we must see how it performs out of sample:
> earth_pred <- predict(earth_fit, tested)
> caret::postResample(earth_pred, test_logy)
RMSE Rsquared MAE
0.12363 0.90120 0.08986
This is a superior RMSE to what we saw with simple linear regression! I'm curious how the residuals look on the test set:
> earth_residTest <- test_logy - earth_pred
> car::qqPlot(earth_residTest)
The output of the preceding code is as follows:
We still see a heavy-tailed distribution of the residuals. What this tells me is that we may have to resort to quantile regression (out of scope here) or create separate models for specific cuts of the response. Another option is to build an ensemble of models, but that's the subject of a later chapter.
Now, the issue I have here is that we predicted the natural log of sale price. How do we reverse the transformation to get the actual sale price? I hear you saying, just take the exponent, correct? Well, maybe, or maybe not! I learned this painful lesson through experience, suffering the wrath of a PhD econometrician who pointed out that simply exponentiating the predictions can lead to severe bias.
This is because the expected value of the response (sale price) is the exponent of the predicted value multiplied by the expected value of the exponentiated error term. Simply taking the exponent ignores that error term, and unless the errors behave exactly as assumed, you end up with bias. The solution is Duan's smearing estimator, which I'll address with a custom function in the next section.
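As a preview of the idea (the chapter's own custom function comes in the next section), here's a minimal sketch of Duan's smearing estimator in R; the function and argument names are mine, and it assumes you have predictions and residuals on the log scale:
# Minimal, illustrative sketch of Duan's smearing estimator (names are hypothetical;
# the next section implements the version actually used in this chapter)
# log_pred:  predictions on the natural-log scale
# log_resid: training residuals on the natural-log scale
duan_smear <- function(log_pred, log_resid) {
  smear_factor <- mean(exp(log_resid)) # average of the exponentiated residuals
  exp(log_pred) * smear_factor         # bias-corrected back-transformation
}
Naively taking exp(log_pred) amounts to assuming the smearing factor is exactly 1; the factor above corrects the back-transformation on average without assuming normally distributed errors.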
Should you desire to amuse yourself with the math behind all this, you can get started with Duan's paper:
Duan, N. (1983), Smearing Estimate: A Nonparametric Retransformation Method, Journal of the American Statistical Association