Evaluating the use of sampling for speed
Modern data mining practice is somewhat different from the ideal. Data miners certainly do develop valuable models that are used in the business and many have massive resources of data to mine, even more data than might have been foreseen a generation ago. But not all data miners meet the profile of a business user, someone whose primary work responsibility is not data analysis and who is not trained in, or concerned with, statistical methods. Nor does the modern data miner shy away from sampling.
In practice, it has been difficult to make discoveries and build models quickly when working with massive quantities of data. Although data mining tools may be designed to streamline the process, each operation still takes longer to complete on a large amount of data than it would on a smaller quantity. This is where sampling can be extremely useful.
Getting ready
We will start with a blank stream and will be using the cup98lrn reduced vars2.txt data set.
How to do it...
To evaluate the use of sampling for speed:
1. Build a stream with a Source node, a Type node, and a Table node, then force instantiation by running the Table node.
2. Force TARGET_B to be a flag and make it the target. Set TARGET_D to None.
3. Add a Partition node downstream of the Type node.
4. Add a Feature Selection Modeling node and run it. (It will act like a filter, but it is critical not to trust it unless the data is clean.)
5. Add an Auto Classifier node and edit it. Choose to use 9 models (the default is 3).
6. If you run the stream at this stage, be prepared for a potentially long wait. The results are shown in the How it works... section of this recipe.
7. Add a Sample node set to 10 percent between the Source node and the Type node.
8. Cache the Sample node and force execution by running a table off the Sample node.
9. Run the Auto Classifier and make note of the duration. (A test run on a newer machine took about 1 minute on the sampled data versus 13 minutes on the complete data.) A conceptual sketch of this timing comparison appears after these steps.
10. Add SVM and KNN to the Auto Classifier and re-run, again noting the duration. (A test run on the complete data using all 11 classifiers was manually halted after 3.5 hours.)
11. Take action to save your cache for future sessions:
- Either right-click on the Sample node and save the cache
- Or write the sampled data out to an external file
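The recipe itself is performed entirely in the Modeler interface, but the core idea of steps 6 through 9 can be sketched outside Modeler. The following Python snippet (pandas and scikit-learn, with an assumed tab delimiter, zero-filled numeric predictors, and three arbitrary stand-in classifiers rather than the Auto Classifier's own candidates) times the same models on the complete data and on a 10 percent sample:

```python
import time
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

# Load the data set named in the recipe (the tab delimiter is an assumption).
df = pd.read_csv("cup98lrn reduced vars2.txt", sep="\t")

# Keep the sketch simple: numeric predictors only, missing values zero-filled.
X_full = df.drop(columns=["TARGET_B", "TARGET_D"]).select_dtypes("number").fillna(0)
y_full = df["TARGET_B"]

# The 10 percent random sample from step 7.
sample_idx = df.sample(frac=0.10, random_state=42).index
X_samp, y_samp = X_full.loc[sample_idx], y_full.loc[sample_idx]

# A few stand-ins for the Auto Classifier's candidate models.
models = {
    "Logistic": LogisticRegression(max_iter=200),
    "Decision tree": DecisionTreeClassifier(max_depth=5),
    "Neural net": MLPClassifier(max_iter=200),
}

for name, model in models.items():
    for label, X, y in [("complete data", X_full, y_full), ("10% sample", X_samp, y_samp)]:
        start = time.perf_counter()
        model.fit(X, y)
        print(f"{name:14s} | {label:13s} | {time.perf_counter() - start:6.1f} s")
```

The absolute timings will differ from the 1-minute-versus-13-minute figures quoted above; the point is only that the ratio consistently favors the sample.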
How it works...
This recipe is a demonstration of sorts. These steps are indeed the steps in sampling for speed, but with experience your instincts will tell you in advance that a model (or set of models) is going to be time-consuming. There is no need to run the stream in step 6 because we already know it will take a long time. It is critical to remember that you are not in the modeling phase at this stage; you are merely planning. Notice that the Auto Classifier deselects SVM and KNN by default because they are computationally expensive. It would be imprudent to be so reluctant to sample that you actually reduced the number of classifiers you considered. (A test run on complete data using all 11 classifiers was manually halted after 3.5 hours, but even a run on the sample failed; unclean data is rougher on some algorithms than others.)
It is also critical not to trust the Feature Selection node to choose the best variables. We are simply using its ability to temporarily filter out variables that need so much cleaning that they would cause the classifiers to fail. You won't get an early assessment of your data if the Auto Classifier turns red and fails to run.
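If you want a feel for that screening idea outside Modeler, a rough stand-in (not the Feature Selection node's actual rules; the thresholds here are arbitrary assumptions) is to set aside variables that are too incomplete or effectively constant before an exploratory model run:

```python
import pandas as pd

def screen_columns(df: pd.DataFrame, max_missing: float = 0.5, min_unique: int = 2) -> list:
    """Keep only columns clean enough for a quick exploratory model run."""
    keep = []
    for col in df.columns:
        if df[col].isna().mean() > max_missing:        # too incomplete to use yet
            continue
        if df[col].nunique(dropna=True) < min_unique:  # effectively a constant
            continue
        keep.append(col)
    return keep

# Usage: model on the screened columns now; return to the excluded ones during cleaning.
# usable = screen_columns(df)
```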
Why not just select the most recent data or the current month? That can actually be quite effective, but it carries risks as well. If the target variable is affected by seasonality, it is probably better to take a random sample of an entire year than to select a single month.
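A small, hypothetical illustration of that seasonality risk (the file name and the GIFT_DATE column are invented for the example): a 10 percent sample drawn across a full year covers every month, while the most recent month obviously does not.

```python
import pandas as pd

df = pd.read_csv("donations.csv")                      # hypothetical file with a date field
df["GIFT_DATE"] = pd.to_datetime(df["GIFT_DATE"])      # hypothetical column name

# Option A: latest month only -- fast, but exposed to seasonal bias.
latest_month = df[df["GIFT_DATE"] >= df["GIFT_DATE"].max() - pd.DateOffset(months=1)]

# Option B: a random 10 percent drawn across the whole year -- all seasons represented.
one_year = df[df["GIFT_DATE"] >= df["GIFT_DATE"].max() - pd.DateOffset(years=1)]
spread_sample = one_year.sample(frac=0.10, random_state=42)

# Month-by-month counts make the seasonal coverage (or lack of it) obvious.
print(latest_month["GIFT_DATE"].dt.month.value_counts().sort_index())
print(spread_sample["GIFT_DATE"].dt.month.value_counts().sort_index())
```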
This recipe will not be effective unless you pay careful attention to the caching of the Sample node. When you turn this feature on, the node shows an icon with the appearance of a white piece of paper. Once the icon turns green, the data has been stored. If you don't force the cache to fill before you model, the stream performs the randomization and the modeling in the same step, and you won't notice an increase in speed until the second time you run the model.
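The same caching principle can be sketched outside Modeler (the cache file name is an assumption): draw the sample once, store it, and let every later run read the stored copy instead of repeating the randomization over the full file.

```python
import os
import pandas as pd

CACHE = "cup98_sample_10pct.csv"                        # assumed cache file name

if os.path.exists(CACHE):
    sample = pd.read_csv(CACHE)                         # later runs: read the stored sample
else:
    full = pd.read_csv("cup98lrn reduced vars2.txt", sep="\t")
    sample = full.sample(frac=0.10, random_state=42)    # first run: draw the sample once
    sample.to_csv(CACHE, index=False)
```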
The bottom line is that, when you are doing initial exploration of the data, it is often appropriate to do univariate and bivariate analyses on all of your data; that is, to use Distribution nodes and Data Audit nodes, because they run quickly even on large files. You also generally use all of your data when you merge. But it doesn't always make sense to run experimental, exploratory multivariate models on the entire data set when a random sample will give similar results. Running all the data tends to change your behavior in a negative way: you will avoid computationally expensive algorithms, or you will avoid tuning the model properly, or both.
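As a rough analogue of those quick univariate and bivariate checks (pandas stand-ins, not the Modeler nodes themselves), summaries like these are cheap enough to run on the complete file:

```python
import pandas as pd

df = pd.read_csv("cup98lrn reduced vars2.txt", sep="\t")

audit = df.describe(include="all").T                       # rough stand-in for a Data Audit node
target_dist = df["TARGET_B"].value_counts(normalize=True)  # rough stand-in for a Distribution node
print(audit.head(20))
print(target_dist)
```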
Note that the results on the sampled data (shown in the following screenshot) and the results on the complete data (shown in the previous screenshot) are similar. At first glance they may not look alike, but if you scratch the surface you will find that the sample tells you what you need to know at this stage. The CHAID model on the complete data uses more variables, but that is a consequence of CHAID's stopping rules. It is noteworthy that the accuracy with 13 variables is nearly the same as with only four. What have you learned? Merely that those four variables are probably worth a closer look, and that it might be a good idea to run CHAID interactively to better understand what is going on. It would also be a useful exercise to compare the top variables in each of these models. In short, you've learned which clean variables show promise, but the potential of the variables that still need cleaning remains a complete mystery.
It is critical always to test and validate against unbalanced data. Modeler automatically uses unbalanced data for the test. However, if you have taken a simple random sample, you have effectively removed the unsampled records from the data flowing through the stream. Validation, unlike modeling, is fast, so you almost always want complete data when you validate. Typically the most recent month makes a great dress rehearsal: on most projects, the most recent month did not exist when the project began, so it makes a perfect test. Run all of that most recent month, unbalanced and complete, as a validation.
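A hedged sketch of that dress rehearsal, outside Modeler (the MAIL_DATE field is invented for the example, and the classifier is an arbitrary stand-in): build on a sample of the history, then score the complete, unbalanced, most recent month.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

df = pd.read_csv("cup98lrn reduced vars2.txt", sep="\t")
df["MAIL_DATE"] = pd.to_datetime(df["MAIL_DATE"])        # hypothetical date field

cutoff = df["MAIL_DATE"].max() - pd.DateOffset(months=1)
history = df[df["MAIL_DATE"] < cutoff]
holdout = df[df["MAIL_DATE"] >= cutoff]                   # complete and unbalanced

predictors = history.select_dtypes("number").columns.difference(["TARGET_B", "TARGET_D"])
train = history.sample(frac=0.10, random_state=42)        # sampled (or balanced) for speed

model = DecisionTreeClassifier(max_depth=5)
model.fit(train[predictors].fillna(0), train["TARGET_B"])

scores = model.predict_proba(holdout[predictors].fillna(0))[:, 1]
print("Most-recent-month AUC:", roc_auc_score(holdout["TARGET_B"], scores))
```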
There's more...
Remember that sampling need not always be simple random sampling (the kind we demonstrate here). Balancing is a kind of sampling, and building models with and without new donors is a variation on the theme. The Sample node also supports complex sampling, which is not covered here but is a topic in its own right.
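For instance, balancing can be expressed as sampling. A minimal sketch, assuming TARGET_B is coded 0/1, that reduces the majority class until the two outcomes are equally represented (Modeler's Balance node can also boost rather than reduce):

```python
import pandas as pd

df = pd.read_csv("cup98lrn reduced vars2.txt", sep="\t")

donors = df[df["TARGET_B"] == 1]
non_donors = df[df["TARGET_B"] == 0].sample(n=len(donors), random_state=42)
balanced = pd.concat([donors, non_donors]).sample(frac=1.0, random_state=42)  # shuffle rows

print(balanced["TARGET_B"].value_counts())
```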
It is also important not to get too excited that a handful of variables seem to show promise. It is only a handful, and there is a long road ahead at this stage. The emphasis would immediately turn to cleaning the data and saving some of the variables filtered out by the Feature Selection node. Many variables were dropped because they need attention, not because they hold no value.
Why bother with all of this when we won't use these models as the final model?
- It is disheartening to spend weeks cleaning data with little sense of where you stand.
- It is not a bad idea to spend more time on the top three classifiers and less time on the bottom three. While this is common sense, be forewarned that when you rerun on clean data the ranking may change dramatically.
- As you add more and more clean variables to the models, it can be useful (and rewarding) to find new variables continually joining the top ten. During this lengthy process it would be pointless to run algorithms that take hours; after all, you are still shoulder-deep in data prep at that point.
After the lengthy process of data prep draws to a close and you enter the modeling phase, you may decide to increase the percentage of your sample or abandon sampling altogether. After all, at that stage you will have clean data and will have narrowed your modeling approach to your "semi-finalists". Why not just let it run overnight?
See also
- The Using an empty aggregate to evaluate sample size recipe in Chapter 1, Data Understanding
- The Evaluating the need to sample from the initial data recipe in Chapter 1, Data Understanding
- The Using a full data model/partial data model approach to address missing data recipe in Chapter 3, Data Preparation – Clean
- The Speeding up merge with caching and optimization settings recipe in Chapter 5, Data Preparation – Integrate and Format
- The How (and why) to validate as well as test recipe in Chapter 7, Modeling – Assessment, Evaluation, Deployment, and Monitoring