Selecting variables using single-antecedent Association Rules (IBM SPSS Modeler Cookbook)

In this recipe we will identify and select variables to include as model inputs using the Apriori Association Rules node. We will select the top 24 predictors based on Association Rules variable selection. We will use the same KDD Cup 1998 data set, but this version of the data was prepared with the stream Recipe - variable selection apriori data prep.str to create quintile versions of the continuous variables. The target variable flags the top quintile of donation amounts, that is, TARGET_D between $20 and $200.

Getting ready

This recipe uses the datafile cup98lrn_reduced_vars3_apriori.sav and the stream Recipe - variable selection apriori.str.

You will need a copy of Microsoft Excel to visualize the list of rules.

How to do it...

To identify and select variables to include as model inputs using the Apriori Association Rules node:

1. Open the stream Recipe - variable selection apriori.str by navigating to File | Open Stream.
2. Make sure the datafile points to the correct path to the file cup98lrn_reduced_vars3_apriori.sav.
3. Open the Type node named APRIORI Types. Notice that only Nominal and Flag variables are used. The variable set to Target should be the target variable TARGET_D_TILE5_1.
4. Open the Apriori node and look at the options. Note that the Minimum antecedent support is set to 10 percent, the Confidence percent is set to 1 percent, and the number of antecedents is set to 1.
5. Build the Association Rules model by clicking on Run.
6. Open the generated model. In the show/hide criteria drop-down menu, add Instances and Lift to the report as shown in the following screenshot. If the list is no longer sorted by Confidence or Lift, click on the sort by arrow to the right of the Confidence % text until the sort order is descending.
7. Export the rules by navigating to File | Export HTML | Model and save the file as associationrules.html.
8. Identify rules of interest, such as the 12 rules with the highest confidence and the 12 rules with the lowest confidence. A sample list is shown in the following screenshot. Make a note of the variables in these rules so you can include them as inputs.
9. In the Modeler stream, connect a Type node to the right of the APRIORI Types node. Double-click on the new Type node, set the variables that were selected in step 8 to Input, and set all other variables that were formerly inputs to None.
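
The manual selection in the steps above (the highest-confidence and lowest-confidence rules) can also be sketched in a few lines of Python once the rules have been copied out of the exported report. This is only an illustration under assumptions: the `Rule` record, the helper names, the field names, and the numbers are all made up, and the sketch does not parse associationrules.html.

```python
from typing import NamedTuple

class Rule(NamedTuple):
    antecedent: str    # written as "FIELD = value" (hypothetical field names)
    confidence: float  # Confidence %, as listed in the rule report
    lift: float

def pick_extremes(rules, n=12):
    """Return the n highest- and n lowest-confidence rules, without duplicates."""
    ranked = sorted(rules, key=lambda r: r.confidence, reverse=True)
    top = ranked[:n]
    tail = ranked[max(len(ranked) - n, 0):]
    bottom = [r for r in tail if r not in top]
    return top, bottom

def variables_in(rules):
    """Extract distinct field names from antecedents written as 'FIELD = value'."""
    names = []
    for r in rules:
        field = r.antecedent.split("=")[0].strip()
        if field not in names:
            names.append(field)
    return names

# Made-up rules for illustration only; not taken from the recipe's output.
rules = [
    Rule("FIELD_A = 5", 9.1, 1.82),
    Rule("FIELD_B = 1", 6.4, 1.28),
    Rule("FIELD_C = 3", 5.1, 1.02),
    Rule("FIELD_A = 1", 2.9, 0.58),
]
top, bottom = pick_extremes(rules, n=1)
selected = variables_in(top + bottom)
```

Because several rules can share one field (one rule per category), the list of selected variables may be shorter than the list of selected rules.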

How it works...

The Association Rules model with only one antecedent is simply a convenient way to show the relationship between every categorical variable identified as Input and the Target variable. The figure of merit for this relationship is Confidence %: of the records that match the input variable value in the rule, the percentage that also have the Target variable value True.
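
To make the measures concrete, here is a minimal sketch of how the instances, support, confidence, and lift of a single-antecedent rule can be computed from raw records. The function name, the field names `X` and `T`, and the ten records are assumptions for illustration; the definitions used are the standard association-rule ones, with support measured on the antecedent.

```python
def rule_metrics(records, antecedent, target):
    """records: list of dicts; antecedent and target: (field, value) pairs."""
    n = len(records)
    a_field, a_value = antecedent
    t_field, t_value = target
    n_ante = sum(1 for r in records if r[a_field] == a_value)
    n_target = sum(1 for r in records if r[t_field] == t_value)
    n_both = sum(1 for r in records
                 if r[a_field] == a_value and r[t_field] == t_value)
    if n_ante == 0 or n_target == 0:
        raise ValueError("antecedent and target must each match at least one record")
    return {
        "instances": n_ante,                         # records matching the antecedent
        "support": 100.0 * n_ante / n,               # antecedent support %
        "confidence": 100.0 * n_both / n_ante,       # % of those that hit the target
        "lift": (n_both / n_ante) / (n_target / n),  # confidence relative to base rate
    }

# Ten made-up records: X is a binned input, T is the target flag.
records = (
    [{"X": 5, "T": True}] * 2 + [{"X": 5, "T": False}] * 2 +
    [{"X": 1, "T": True}] * 2 + [{"X": 1, "T": False}] * 4
)
metrics = rule_metrics(records, antecedent=("X", 5), target=("T", True))
```

In this toy data the rule X = 5 covers 4 of 10 records, half of which hit the target, against an overall target rate of 40 percent, so the lift is above 1.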

Association rules require input and target variables to be categorical; in Modeler, these are the Nominal, Ordinal, or Flag variables. The data set analyzed in this recipe contained binned versions of continuous variables so that they could be assessed in addition to the variables that are nominal in their original state.
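
The binning idea can be illustrated with a rank-based quintile assignment. This is a rough sketch, not the method used by the data preparation stream: the function name and the amounts are made up, and Modeler's own binning may treat ties and tile boundaries differently.

```python
def quintile_bins(values, n_tiles=5):
    """Assign each value a tile number 1..n_tiles by rank; tied values share a tile."""
    ordered = sorted(values)
    n = len(values)
    first_rank = {}
    for rank, v in enumerate(ordered):
        first_rank.setdefault(v, rank)  # ties take the lowest rank among them
    return [first_rank[v] * n_tiles // n + 1 for v in values]

# Made-up amounts for illustration only.
amounts = [30, 10, 50, 20, 40, 100, 90, 80, 70, 60]
tiles = quintile_bins(amounts)
is_top = [t == 5 for t in tiles]  # flag for the top quintile
```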

Once the rules relating each input variable value to the target are listed, one can choose to remove those fields with little relationship to the target, namely those whose lift is close to 1. Those with lift values well above or below 1 have some relationship to the target, pointing either to the high-valued donors (donated $20-$200) or to those who are not high-valued donors. The Select label in the previous screenshot was applied when the lift value was greater than 1.125 or less than 0.7. This selection criterion is subjective.
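
The lift rule described above is easy to express as a filter. The sketch below uses the same thresholds (greater than 1.125 or less than 0.7) as defaults; the function name and the rules are hypothetical, and the thresholds remain a subjective choice.

```python
def select_by_lift(rules, upper=1.125, lower=0.7):
    """Keep (antecedent, lift) pairs whose lift is far enough from 1."""
    return [(a, lift) for a, lift in rules if lift > upper or lift < lower]

# Made-up (antecedent, lift) pairs for illustration only.
rules = [
    ("FIELD_A = 5", 1.82),
    ("FIELD_B = 1", 1.10),
    ("FIELD_C = 3", 0.95),
    ("FIELD_A = 1", 0.58),
]
kept = select_by_lift(rules)
```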

As a side note, the outcome of the last four recipes could be combined to determine which fields are consistently relevant across all methods.

There's more...

Note that the list in the previous screenshot only includes those variables or categories with greater than 10 percent support; this in itself reduces the number of variables. Try reducing the Minimum antecedent support setting in the Apriori node from 10 percent to 1 percent and see how many more variables show up in the list.

Association Rules do not provide a significance test to help assess the relationship between each input and the target variable. A chi-square test can be computed in Excel, or one can use the CHAID modeling node to obtain the chi-square statistic.
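
As an alternative to a spreadsheet, the chi-square test for one rule can be sketched in plain Python from the 2x2 table of antecedent by target. This is Pearson's chi-square without a continuity correction; the function name and the counts are illustrative, and the p-value formula applies only to one degree of freedom.

```python
import math

def chi_square_2x2(a, b, c, d):
    """
    Pearson chi-square for a 2x2 table (no continuity correction).
    a: antecedent true,  target true
    b: antecedent true,  target false
    c: antecedent false, target true
    d: antecedent false, target false
    Returns (chi2, p_value).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be positive")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Made-up counts for illustration only.
chi2, p_value = chi_square_2x2(20, 30, 30, 20)
```

With many rules tested at once, some small p-values will appear by chance, so the result is better treated as a ranking aid than as a strict cutoff.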

One can also expand the search for variables by adjusting the number of antecedents to two, thereby finding all pairwise combinations of inputs. This can sometimes be valuable because variables that are not good predictors on their own can sometimes be good predictors in combination with other variables.
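
The following sketch illustrates why two antecedents can reveal what single antecedents miss. It computes the confidence of each two-field value combination directly from records; the function name, field names, and data are made up, and the code only mimics the idea rather than the Apriori node itself.

```python
from itertools import combinations

def pair_confidences(records, fields, target_field, min_support_pct=10.0):
    """Confidence % of a True target for each two-field value combination
    that meets the minimum support."""
    n = len(records)
    results = {}
    for f1, f2 in combinations(fields, 2):
        counts = {}  # (value1, value2) -> [instances, target hits]
        for r in records:
            tally = counts.setdefault((r[f1], r[f2]), [0, 0])
            tally[0] += 1
            tally[1] += 1 if r[target_field] else 0
        for (v1, v2), (instances, hits) in counts.items():
            if 100.0 * instances / n >= min_support_pct:
                results[((f1, v1), (f2, v2))] = 100.0 * hits / instances
    return results

# Made-up data: the target is True only when A and B differ, so neither
# field predicts the target on its own (each value gives 50% confidence).
records = [
    {"A": 0, "B": 0, "T": False},
    {"A": 0, "B": 1, "T": True},
    {"A": 1, "B": 0, "T": True},
    {"A": 1, "B": 1, "T": False},
]
pairs = pair_confidences(records, ["A", "B"], "T")
```

The number of field pairs grows as n(n-1)/2 for n input fields, so the rule list becomes much longer than in the single-antecedent case.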

As with the correlation matrix variable selection, selecting or removing a large number of variables may be tedious and prone to error, so writing a CLEM script to customize the Type node or Filter node can help.
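
The bookkeeping part of such a script can be sketched independently of Modeler. The function below only builds a field-to-role mapping that mirrors the final step of the recipe (selected fields stay Input, other former inputs become None); the function name and field names are hypothetical, and applying the mapping to a Type node is left to whatever scripting interface is available.

```python
def build_roles(current_roles, selected):
    """current_roles: dict of field -> role. Returns a new dict in which
    former inputs that were not selected are set to "None"."""
    missing = [f for f in selected if f not in current_roles]
    if missing:
        raise KeyError(f"selected fields not found: {missing}")
    keep = set(selected)
    new_roles = {}
    for field, role in current_roles.items():
        if role == "Input":
            new_roles[field] = "Input" if field in keep else "None"
        else:
            new_roles[field] = role  # Target and other roles are left untouched
    return new_roles

# Hypothetical input fields for illustration only.
current = {
    "TARGET_D_TILE5_1": "Target",
    "FIELD_A": "Input",
    "FIELD_B": "Input",
    "FIELD_C": "Input",
}
new_roles = build_roles(current, selected=["FIELD_A", "FIELD_C"])
```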