Running a Statistics node on anti-join to evaluate the potential missing data
There is typically some data loss when various data tables are integrated. Although we won't discuss data integration until a later chapter, it is important to gauge what (and how much) is lost at this stage. Financial variables are usually aggregated in very different ways for the financial planner and the data miner. It is critical that the data miner periodically translates his or her data back into a form that middle and senior management will recognize, so that the two groups can communicate better.
The data miner deals with transactions and individual customer data, the language of individual rows of data. The manager speaks, generally, the language of spreadsheets: regions, product lines, months rolled up into aggregated cells in Excel.
On one project, we discovered that a small percentage of missing rows represented a much larger fraction of revenue than average. We quickly revisited our decision to drop those rows. Dropping them had seemed the right decision; they were just bad IDs, weren't they? Well, it is never that simple. There are few accidents in data. That experience produced a lesson:
Always include a revenue assessment in your decisions even when revenue is neither your input nor your target.
In this recipe, we will learn a simple trick for assessing these financial variables whenever there is potential for data loss.
Getting ready
We will start with a blank stream, and will be using the retail Transactions file and the Products_Missing file.
How to do it...
To evaluate potential missing data when integrating data (a code sketch of the equivalent logic follows these steps):
- Build a stream with both of the Source nodes, two Type nodes, and a Merge node.
- Perform an anti-join and make a note of the record count.
- Run a Statistics node and request: Count, Sum, Mean, and Median.
- Set Products_Missing to be the first input and run the Merge node, making a note of the record count. Since there is only one record, we will not run a Statistics node.
- Reverse the inputs and repeat the merge, again making note of the record count.
- Re-run the Statistics node.
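SPSS Modeler performs these joins and summaries visually, but the underlying logic is easy to express in code, which can help when documenting or double-checking the result. The following is a minimal pandas sketch of the same idea, not the recipe's actual stream; the file names (transactions.csv, products_missing.csv), the ProductID key, and the Revenue field are hypothetical stand-ins for whatever the real data uses.

```python
import pandas as pd

# Hypothetical file and column names for illustration only; the recipe's
# retail Transactions and Products_Missing files may use different fields.
transactions = pd.read_csv("transactions.csv")    # one row per transaction
products = pd.read_csv("products_missing.csv")    # product lookup with gaps

# Inner join: transactions that DO match a product record.
matched = transactions.merge(products, on="ProductID", how="inner")

# Anti-join: transactions with no matching product record.
# A left join with an indicator column makes the anti-join explicit.
left = transactions.merge(products, on="ProductID", how="left", indicator=True)
unmatched = left[left["_merge"] == "left_only"].drop(columns="_merge")

# Record counts, as noted in the Merge node steps above.
print("matched rows:  ", len(matched))
print("unmatched rows:", len(unmatched))

# The Statistics node equivalent: Count, Sum, Mean, and Median of revenue
# on each side of the join.
for label, frame in [("matched", matched), ("unmatched", unmatched)]:
    rev = frame["Revenue"]
    print(label, "count =", rev.count(), "sum =", rev.sum(),
          "mean =", rev.mean(), "median =", rev.median())
```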
How it works...
Years ago, on a project, we discovered that 5 percent of the data (data that happened to be missing) represented more than 20 percent of the revenue. We expect, or perhaps more accurately we hope, that missing data will not derail us, but sometimes it threatens the whole project. This recipe is about choosing your battles: identifying when missing data is a particularly serious problem, and when we can move on to other aspects of the project.
Here, the bad news is that there is a substantial amount of missing data in the products file. How could this ever occur? The novice might be surprised, but it occurs frequently. Perhaps the company just acquired a smaller retailer and the acquired vendor's historical transactional data does not match up properly.
There is good news, however. The similar means and medians of the inner join and the anti-join suggest that the data are missing at random rather than systematically. It would be a much bigger problem if our usual customers (maybe even our best customers) were the ones that were missing. That is not usually the case, although it sometimes takes detective work to figure out why. There is one additional bit of bad news: the total amount tied up in the missing data points is not trivial. While it is dwarfed by the nearly 90 million that we can analyze, the missing 8 million might be large enough to warrant extensive data cleaning. One would now move to diagnose the problem and, if it can be addressed at a reasonable cost, fix it.
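To turn the earlier lesson into a routine check, it helps to compare the share of rows lost to the join with the share of revenue they carry, and to compare the mean and median on each side. The sketch below is a small, hypothetical helper in the same pandas vein as the earlier one; it assumes you already have the revenue values for the matched and unmatched rows as two Series.

```python
import pandas as pd

def revenue_risk_report(matched_rev: pd.Series, unmatched_rev: pd.Series) -> None:
    """Compare the share of rows lost to the join with the share of revenue lost."""
    row_share = len(unmatched_rev) / (len(matched_rev) + len(unmatched_rev))
    revenue_share = unmatched_rev.sum() / (matched_rev.sum() + unmatched_rev.sum())
    print(f"missing rows:    {row_share:.1%} of all rows")
    print(f"missing revenue: {revenue_share:.1%} of all revenue")
    # A revenue share far above the row share, or a mean/median that differs
    # sharply between the two sides, suggests the data are not missing at
    # random and warrant further investigation.
    print(f"matched mean/median:   {matched_rev.mean():.2f} / {matched_rev.median():.2f}")
    print(f"unmatched mean/median: {unmatched_rev.mean():.2f} / {unmatched_rev.median():.2f}")

# Example usage with the frames from the earlier sketch:
# revenue_risk_report(matched["Revenue"], unmatched["Revenue"])
```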
See also
- The Using an @NULL multiple Derive to explore missing data recipe in Chapter 1, Data Understanding
- The Creating an Outlier report to give to SMEs recipe in Chapter 1, Data Understanding
- The Using a full data model/partial data model approach to address missing data recipe in Chapter 3, Data Preparation – Clean