Generating test datasets
Generating numerical test data is easy with Java. It boils down to using a java.util.Random object to generate random numbers.
This program generates the following CSV file of eight rows and five columns of random decimal values.
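The generator program itself is not reproduced here, but a minimal sketch (the class and method names are illustrative, not taken from the original listing) could look like this:

```java
import java.util.Locale;
import java.util.Random;

public class GenerateTestData {
    // Build one CSV line of random decimals in [0, 1), rounded to five places.
    // Locale.US forces a decimal point, so the commas remain unambiguous separators.
    static String csvLine(Random random, int columns) {
        StringBuilder sb = new StringBuilder();
        for (int j = 0; j < columns; j++) {
            if (j > 0) sb.append(',');
            sb.append(String.format(Locale.US, "%.5f", random.nextDouble()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Random random = new Random();          // seed this for reproducible output
        for (int i = 0; i < 8; i++) {          // eight rows
            System.out.println(csvLine(random, 5));  // five columns each
        }
    }
}
```

Writing the output to a file instead of System.out is a one-line change with a java.io.PrintWriter.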
Metadata
Metadata is data about data. For example, the preceding generated file could be described as eight lines of comma-separated decimal numbers, five per line. That's metadata. It's the kind of information you would need, for example, to write a program to read that file.
That example is quite simple: the data is unstructured and the values are all the same type. Metadata about structured data must also describe that structure.
The metadata of a dataset may be included in the same file as the data itself. The preceding example could be modified by adding a header line that names each of the five columns.
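A hypothetical header for the five-column file might look like this (the column names are invented for illustration; any names that describe the fields would do):

```
c1,c2,c3,c4,c5
```

A program reading the file would then skip or parse this first line before reading the data rows.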
Data cleaning
Data cleaning, also called data cleansing or data scrubbing, is the process of finding and then either correcting or deleting corrupted data values from a dataset. The source of the corrupt data is often careless data entry or transcription.
Various kinds of software tools are available to assist in the cleaning process. For example, Microsoft Excel includes a CLEAN()
function for removing nonprintable characters from a text file. Most statistical systems, such as R and SAS, include a variety of more general cleaning functions.
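The same effect is easy to get in Java. This sketch is similar in spirit to, but not identical with, Excel's CLEAN() function; note that the \p{Cntrl} class also matches tabs and newlines:

```java
public class CleanText {
    // Remove nonprintable (control) characters from a string.
    // \p{Cntrl} matches the ASCII control range, including \t and \n.
    static String clean(String s) {
        return s.replaceAll("\\p{Cntrl}", "");
    }

    public static void main(String[] args) {
        System.out.println(clean("bad\u0007data\u0000here")); // prints "baddatahere"
    }
}
```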
Spellcheckers provide one kind of data cleaning that most writers have used, but they won't help against errors like from for form and there for their.
Statistical outliers are also rather easy to spot, for example, in our Excel Countries table if the population of Brazil appeared as 2.10 instead of 2.01E+08.
Programmatic constraints can help prevent the entry of erroneous data. For example, certain variables can be required to have only values from a specified set, such as the ISO standard two-letter abbreviations for countries (CN for China, FR for France, and so on). Similarly, text data that is expected to fit pre-determined formats, such as phone numbers and email addresses, can be checked automatically during input.
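Such constraints can be sketched as simple validation methods. The patterns below are deliberately simplified, illustrative formats (a US-style phone number and a loose email shape), not standards-compliant validators:

```java
import java.util.Set;
import java.util.regex.Pattern;

public class InputConstraints {
    // A small illustrative subset of the ISO 3166-1 alpha-2 country codes.
    static final Set<String> COUNTRY_CODES = Set.of("CN", "FR", "US", "BR");

    // Hypothetical formats: 555-867-5309 style phone, and a loose email check.
    static final Pattern PHONE = Pattern.compile("\\d{3}-\\d{3}-\\d{4}");
    static final Pattern EMAIL = Pattern.compile("[^@\\s]+@[^@\\s]+\\.[^@\\s]+");

    static boolean isValidCountryCode(String s) { return COUNTRY_CODES.contains(s); }
    static boolean isValidPhone(String s) { return PHONE.matcher(s).matches(); }
    static boolean isValidEmail(String s) { return EMAIL.matcher(s).matches(); }

    public static void main(String[] args) {
        System.out.println(isValidCountryCode("FR"));      // true
        System.out.println(isValidPhone("5558675309"));    // false: missing hyphens
        System.out.println(isValidEmail("not-an-email"));  // false: no @ or domain
    }
}
```

Rejecting a value at entry time is far cheaper than hunting it down during cleaning later.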
An essential factor in the data cleaning process is to avoid conflicts of interest. If a researcher is collecting data to support a preconceived theory, any replacement of raw data should be done only in the most transparent and justifiable ways. For example, in testing a new drug, a pharmaceutical laboratory would maintain public logs of any data cleaning.
Data scaling
Data scaling is performed on numeric data to render it more meaningful. It is also called data normalization. It amounts to applying a mathematical function to all the values in one field of the dataset.
The data for Moore's Law provides a good example. Figure 2-11 shows a few dozen data points plotted. They show the number of transistors used in microprocessors at various dates from 1971 to 2011. The transistor count ranges from 2,300 to 2,600,000,000. That data could not be shown if a linear scale were used for the transistor count field, because most of the points would pile on top of each other at the lower end of the scale. The fact is, of course, that the number of transistors has increased exponentially, not linearly. Therefore, only a logarithmic scale works for visualizing the data. In other words, for each data point (x, y), the point (x, log y) is plotted.
Microsoft Excel also supports this kind of scaling; for example, a chart axis can be switched to a logarithmic scale.
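The transformation itself is a one-line function applied across a field. Here is a sketch using log base 10 (the transistor counts are just a few illustrative values in the range the text describes):

```java
public class LogScaling {
    // Replace each value y with log10(y), so exponential growth plots as a line.
    static double[] logScale(double[] values) {
        double[] scaled = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            scaled[i] = Math.log10(values[i]);
        }
        return scaled;
    }

    public static void main(String[] args) {
        double[] transistors = {2_300, 29_000, 275_000, 3_100_000, 2_600_000_000.0};
        for (double y : logScale(transistors)) {
            System.out.printf("%.3f%n", y);  // values now span roughly 3.4 to 9.4
        }
    }
}
```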
Data filtering
Filtering usually refers to the selection of a subset of a dataset. The selection would be made based upon some condition(s) on its data fields. For example, in a Countries dataset, we might want to select those landlocked countries whose land area exceeds 1,000,000 sq. km.
As a concrete example, consider the Countries dataset shown in Figure 2-12:
For efficient processing, we first define the Country class shown in Listing 2-14. The constructor at lines 20-27 reads the four fields for the new Country object from the next line of the file being scanned by the specified Scanner object. The overridden toString() method at lines 29-33 returns a String object formatted like each line from the input file.
Listing 2-15 shows the main program to filter the data.
The readDataset() method at lines 28-41 uses the custom constructor at line 34 to read all the data from the specified file into a HashSet object, which is named dataset at line 19. The actual filtering is done at line 21. That loop prints only those countries that are landlocked and have an area of at least 1,000,000 sq. km., as shown in Figure 2-13.
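Listings 2-14 and 2-15 are not reproduced here, but the same filter can be sketched as follows (the field names and sample rows are illustrative, not the book's data file):

```java
import java.util.ArrayList;
import java.util.List;

public class FilterCountries {
    // Minimal stand-in for the Country class of Listing 2-14.
    static class Country {
        final String name;
        final boolean landlocked;
        final long area;  // in sq. km.
        Country(String name, boolean landlocked, long area) {
            this.name = name; this.landlocked = landlocked; this.area = area;
        }
    }

    // The filter condition from the text: landlocked, at least 1,000,000 sq. km.
    static boolean passes(Country c) {
        return c.landlocked && c.area >= 1_000_000;
    }

    public static void main(String[] args) {
        List<Country> dataset = new ArrayList<>();
        dataset.add(new Country("Bolivia", true, 1_098_581));
        dataset.add(new Country("Brazil", false, 8_515_767));
        dataset.add(new Country("Kazakhstan", true, 2_724_900));
        for (Country c : dataset) {
            if (passes(c)) System.out.println(c.name);  // Bolivia, Kazakhstan
        }
    }
}
```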
In Microsoft Excel, you can filter data by selecting Data | Filter or Data | Advanced | Advanced Filter.
Another type of data filtering is the process of detecting and removing noise from a dataset. In this context, noise refers to any sort of independent random transmission interference that corrupts the data. The term comes from the phenomenon of background noise in an audio recording. Similar phenomena occur with image files and video recordings. The methods for this kind of filtering are more advanced.
Sorting
Sometimes it is useful to sort or re-sort tabular data that is otherwise ready for processing. For example, the Countries.dat file in Figure 2-1 is already sorted on its name field. Suppose that you want to sort the data on the population field instead.
One way to do that in Java is to use a TreeMap (instead of a HashMap), as shown in Listing 2-16. The dataset object, instantiated at line 17, specifies Integer for the key type and String for the value type in the map. That is because we want to sort on the population field, which has an integer type.

The TreeMap data structure keeps the data sorted according to the ordering of its key field. Thus, when it is printed at line 29, the output is in increasing order of population.
Of course, in any map data structure, the key field values must be unique. So this wouldn't work very well if two countries had the same population.
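Listing 2-16 is not reproduced here, but the idea can be sketched like this (the population figures are rounded, illustrative values):

```java
import java.util.Map;
import java.util.TreeMap;

public class SortByPopulation {
    public static void main(String[] args) {
        // Key = population, value = country name.
        // A TreeMap keeps its entries sorted in ascending key order.
        // Caveat from the text: a duplicate population would overwrite an entry.
        Map<Integer, String> dataset = new TreeMap<>();
        dataset.put(201_000_000, "Brazil");
        dataset.put(11_000_000, "Bolivia");
        dataset.put(18_000_000, "Kazakhstan");

        // Iteration follows key order: Bolivia, Kazakhstan, Brazil.
        for (Map.Entry<Integer, String> e : dataset.entrySet()) {
            System.out.println(e.getValue() + "\t" + e.getKey());
        }
    }
}
```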
A more general approach would be to define a DataPoint class that implements the java.util.Comparable interface, comparing the objects by their values in the column to be sorted. Then the complete dataset could be loaded into an ArrayList and sorted simply by applying the sort() method in the Collections class, as Collections.sort(list).
In Microsoft Excel, you can sort a column of data by selecting Data | Sort from the main menu.
Merging
Another preprocessing task is merging several sorted files into a single sorted file. Listing 2-18 shows a Java program that implements this task. It is run on the two countries files shown in Figure 2-12 and Figure 2-13. Notice that they are sorted by population:
To merge these two files, we define a Java class to represent each data point, as shown in Listing 2-17. This is the same class as in Listing 2-14, but with two more methods added, at lines 30-38.
By implementing the java.util.Comparable interface (at line 10), Country objects can be compared. The compareTo() method (at lines 34-38) will return a negative integer if the population of the implicit argument (this) is less than the population of the explicit argument. This allows us to order the Country objects according to their population size.

The isNull() method at lines 30-32 is used only to determine when the end of the input file has been reached.
The program in Listing 2-18 compares a Country object from each of the two files at line 24 and then prints the one with the smaller population to the output file at line 27 or line 30. When the scanning of one of the two files has finished, one of the Country objects will have null fields, thus stopping the while loop at line 24. Then one of the two remaining while loops will finish scanning the other file.
For example, if one file contains a million records and the other file has a record whose population field is maximal, then a million useless (null) objects would be created. This reveals another good reason for using Java for file processing: in Java, the space used by objects that have no references is automatically returned to the heap of available memory. In a programming language that does not implement this garbage collection protocol, the program would likely crash by exceeding memory limitations.
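The core two-way merge can be sketched independently of the file I/O. Here it runs over integer keys rather than Country objects, and uses list indexes instead of the book's null-sentinel technique:

```java
import java.util.ArrayList;
import java.util.List;

public class MergeSorted {
    // Merge two lists that are already sorted in ascending order.
    static List<Integer> merge(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        // Repeatedly emit the smaller of the two current front elements.
        while (i < a.size() && j < b.size()) {
            if (a.get(i) <= b.get(j)) out.add(a.get(i++));
            else out.add(b.get(j++));
        }
        // One list is exhausted; copy whatever remains of the other.
        while (i < a.size()) out.add(a.get(i++));
        while (j < b.size()) out.add(b.get(j++));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(merge(List.of(1, 4, 9), List.of(2, 3, 10)));
        // [1, 2, 3, 4, 9, 10]
    }
}
```

The trailing copy loops play the same role as the two leftover while loops in the book's program.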
Hashing
Hashing is the process of assigning identification numbers to data objects. The term hash is used to suggest a random scrambling of the numbers, like the common dish of leftover meat, potatoes, onions, and spices.
A good hash function has these two properties:
- Uniqueness: No two distinct objects have the same hash code (in practice, collisions can occur, but a good hash function makes them rare)
- Randomness: The hash codes seem to be uniformly distributed
Java automatically assigns a hash code to each object that is instantiated. This is yet another good reason to use Java for data analysis. The hash code of an object obj is given by obj.hashCode(). For example, in the merging program in Listing 2-18, add this at line 24:

System.out.println(country1.hashCode());

You will get a value such as 685,325,104 for the hash code of the Paraguay object. (The default hashCode() is based on the object's identity, so the exact value varies from run to run.)
Java computes the hash codes for its objects from the hash codes of the object's contents. For example, the hash code for the string AB is 2081, which is 31*65 + 66, that is, 31 times the hash code for A plus the hash code for B. (65 and 66 are the Unicode values for the characters A and B.)
Of course, hash codes are used to implement hash tables. The original idea was to store a collection of objects in an array, a[], where object x would be stored at index i = h mod n, where h is the hash code for x and n is the size of the array. For example, if n = 255, then the Paraguay object would be stored in a[109], because 685,325,104 mod 255 = 109.
Recall that mod means remainder. For example, 25 mod 7 = 4 because 25 = 3*7 + 4.
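In Java the index computation is one line. Note that Math.floorMod is safer than the % operator here, because hash codes can be negative and % would then yield a negative index:

```java
public class HashIndex {
    // Map a hash code h to an array index in the range [0, n).
    static int index(int h, int n) {
        return Math.floorMod(h, n);  // always nonnegative, unlike h % n
    }

    public static void main(String[] args) {
        System.out.println(index(685_325_104, 255));  // 109, as in the text
        System.out.println(25 % 7);                   // 4, since 25 = 3*7 + 4
        System.out.println(index(-1, 255));           // 254; (-1 % 255) would be -1
    }
}
```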