
Generating test datasets

Generating numerical test data is easy with Java. It boils down to using a java.util.Random object to generate random numbers.


Listing 2-13 Generating random numeric data
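A minimal sketch of such a generator (the output file name Output.dat and the six-decimal formatting are assumptions):

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Random;

public class GenerateTestData {
    public static void main(String[] args) throws IOException {
        Random random = new Random();
        try (PrintWriter out = new PrintWriter(new FileWriter("Output.dat"))) {
            for (int row = 0; row < 8; row++) {               // eight rows
                for (int col = 0; col < 5; col++) {           // five columns per row
                    out.printf("%.6f", random.nextDouble());  // random decimal in [0, 1)
                    if (col < 4) {
                        out.print(",");                       // comma separators
                    }
                }
                out.println();
            }
        }
    }
}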

This program generates the following CSV file of eight rows and five columns of random decimal values.


Figure 2-9 Test data file

Metadata

Metadata is data about data. For example, the preceding generated file could be described as eight lines of comma-separated decimal numbers, five per line. That's metadata. It's the kind of information you would need, for example, to write a program to read that file.

That example is quite simple: the data is unstructured and the values are all the same type. Metadata about structured data must also describe that structure.

The metadata of a dataset may be included in the same file as the data itself. The preceding example could be modified with a header line like this:


Figure 2-10 Test data file fragment with metadata in header

Note

When reading a data file in Java, you can scan past header lines by using the Scanner object's nextLine() method, as shown at line 32 in Listing 2-15.
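For example (a minimal sketch, assuming the modified file begins with a single header line):

import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class SkipHeader {
    public static void main(String[] args) throws FileNotFoundException {
        Scanner in = new Scanner(new File("Output.dat"));
        in.nextLine();                          // read and discard the header line
        while (in.hasNextLine()) {
            System.out.println(in.nextLine());  // process each data line
        }
        in.close();
    }
}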

Data cleaning

Data cleaning, also called data cleansing or data scrubbing, is the process of finding and then either correcting or deleting corrupted data values from a dataset. The source of the corrupt data is often careless data entry or transcription.

Various kinds of software tools are available to assist in the cleaning process. For example, Microsoft Excel includes a CLEAN() function for removing nonprintable characters from text. Most statistical systems, such as R and SAS, include a variety of more general cleaning functions.

Spellcheckers provide one kind of data cleaning that most writers have used, but they won't catch substitution errors such as from for form or there for their, where the wrong word is itself spelled correctly.

Statistical outliers are also rather easy to spot. For example, in our Excel Countries table, a population of 2.10 for Brazil, instead of 2.01E+08, would stand out immediately.

Programmatic constraints can help prevent the entry of erroneous data. For example, certain variables can be required to have only values from a specified set, such as the ISO standard two-letter abbreviations for countries (CN for China, FR for France, and so on). Similarly, text data that is expected to fit pre-determined formats, such as phone numbers and email addresses, can be checked automatically during input.
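A sketch of such checks (the abbreviated code set and the email pattern here are illustrative assumptions):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

public class InputConstraints {
    static final Set<String> ISO_CODES =
            new HashSet<>(Arrays.asList("CN", "FR", "BR", "US"));   // abbreviated list
    static final Pattern EMAIL = Pattern.compile("[^@\\s]+@[^@\\s]+\\.[^@\\s]+");

    static boolean isValidCountryCode(String code) {
        return ISO_CODES.contains(code);           // value must come from the specified set
    }

    static boolean isValidEmail(String address) {
        return EMAIL.matcher(address).matches();   // value must fit the expected format
    }

    public static void main(String[] args) {
        System.out.println(isValidCountryCode("FR"));         // true
        System.out.println(isValidEmail("not-an-address"));   // false
    }
}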

An essential factor in the data cleaning process is to avoid conflicts of interest. If a researcher is collecting data to support a preconceived theory, any replacement of raw data should be done only in the most transparent and justifiable ways. For example, in testing a new drug, a pharmaceutical laboratory would maintain public logs of any data cleaning.

Data scaling

Data scaling is performed on numeric data to render it more meaningful. It is also called data normalization. It amounts to applying a mathematical function to all the values in one field of the dataset.

The data for Moore's Law provides a good example. Figure 2-11 shows a few dozen data points plotted. They show the number of transistors used in microprocessors at various dates from 1971 to 2011. The transistor count ranges from 2,300 to 2,600,000,000. That data could not be shown if a linear scale were used for the transistor count field, because most of the points would pile on top of each other at the lower end of the scale. The fact is, of course, that the number of transistors has increased exponentially, not linearly. Therefore, only a logarithmic scale works for visualizing the data. In other words, for each data point (x, y), the point (x, log y) is plotted.


Figure 2-11 Moore's law
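In Java, that transformation is a single call to Math.log10 per value; a sketch using the endpoints quoted above:

double[] counts = {2300, 2600000000.0};   // endpoints of the transistor-count range
for (double y : counts) {
    System.out.printf("%.0f -> %.2f%n", y, Math.log10(y));   // 2300 -> 3.36, 2600000000 -> 9.41
}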

Microsoft Excel provides scaling functions, such as LOG10(), that can be applied to an entire column of values.

Data filtering

Filtering usually refers to the selection of a subset of a dataset, based upon one or more conditions on its data fields. For example, in a Countries dataset, we might want to select those landlocked countries whose land area exceeds 1,000,000 sq. km.

Consider the Countries dataset shown in Figure 2-12:


Figure 2-12 Data on Countries


Listing 2-14 A class for data about Countries

For efficient processing, we first define the Country class shown in Listing 2-14. The constructor at lines 20-27 reads the four fields for the new Country object from the next line of the file being scanned by the specified Scanner object. The overridden toString() method at lines 29-33 returns a String object formatted like each line of the input file.
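A sketch of that class (the field names and input format are assumptions; the line numbers cited above refer to the original listing):

import java.util.Scanner;

public class Country {
    protected String name;
    protected int population;
    protected int area;
    protected boolean landlocked;

    public Country(Scanner in) {
        this.name = in.next();                  // read the four fields, in order
        this.population = in.nextInt();
        this.area = in.nextInt();
        this.landlocked = in.nextBoolean();
    }

    @Override
    public String toString() {
        return String.format("%-16s%12d%12d%8b", name, population, area, landlocked);
    }
}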

Listing 2-15 shows the main program to filter the data.


Listing 2-15 Program to filter input data

The readDataset() method at lines 28-41 uses the custom constructor at line 34 to read all the data from the specified file into a HashSet object, which is named dataset at line 19. The actual filtering is done at line 21. That loop prints only those countries that are landlocked and have an area of at least 1,000,000 sq. km., as shown in Figure 2-13.
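A sketch of that program, consistent with the Country sketch above (the file name is an assumption; the line numbers cited refer to the original listing):

import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;

public class FilterCountries {
    public static void main(String[] args) throws FileNotFoundException {
        Set<Country> dataset = readDataset("Countries.dat");
        for (Country country : dataset) {
            if (country.landlocked && country.area >= 1000000) {
                System.out.println(country);    // print only the selected records
            }
        }
    }

    static Set<Country> readDataset(String fileName) throws FileNotFoundException {
        Set<Country> dataset = new HashSet<>();
        Scanner in = new Scanner(new File(fileName));
        while (in.hasNext()) {
            dataset.add(new Country(in));       // custom constructor reads one record
        }
        in.close();
        return dataset;
    }
}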


Figure 2-13 Filtered data

In Microsoft Excel, you can filter data by selecting Data | Filter or Data | Advanced | Advanced Filter.

Another type of data filtering is the process of detecting and removing noise from a dataset. In this context, noise refers to any sort of independent random transmission interference that corrupts the data. The term comes from the phenomenon of background noise in an audio recording. Similar phenomena occur with image files and video recordings. The methods for this kind of filtering are more advanced.

Sorting

Sometimes it is useful to sort or re-sort tabular data that is otherwise ready for processing. For example, the Countries.dat file in Figure 2-1 is already sorted on its name field. Suppose that you want to sort the data on the population field instead.

One way to do that in Java is to use a TreeMap (instead of a HashMap), as shown in Listing 2-16. The dataset object, instantiated at line 17, specifies Integer for the key type and String for the value type in the map. That is because we want to sort on the population field, which has an integer type.


Listing 2-16 Re-sorting data by different fields

The TreeMap data structure keeps the data sorted according to the ordering of its key field. Thus, when it is printed at line 29, the output is in increasing order of population.
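A sketch of that program (the choice of the country name as the map value is an assumption):

import java.io.File;
import java.io.FileNotFoundException;
import java.util.Map;
import java.util.Scanner;
import java.util.TreeMap;

public class SortByPopulation {
    public static void main(String[] args) throws FileNotFoundException {
        Map<Integer, String> dataset = new TreeMap<>();   // keeps its keys in increasing order
        Scanner in = new Scanner(new File("Countries.dat"));
        while (in.hasNext()) {
            String name = in.next();
            int population = in.nextInt();
            in.nextInt();                                 // skip the area field
            in.nextBoolean();                             // skip the landlocked field
            dataset.put(population, name);                // key = population, value = name
        }
        in.close();
        for (Map.Entry<Integer, String> entry : dataset.entrySet()) {
            System.out.printf("%,12d  %s%n", entry.getKey(), entry.getValue());
        }
    }
}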

Of course, in any map data structure, the key field values must be unique. So this wouldn't work very well if two countries had the same population.

A more general approach would be to define a DataPoint class that implements the java.util.Comparable interface, comparing the objects by their values in the column to be sorted. Then the complete dataset could be loaded into an ArrayList and sorted simply by applying the sort() method in the Collections class, as Collections.sort(list).
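For example (a sketch; the DataPoint fields are illustrative):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class DataPoint implements Comparable<DataPoint> {
    String name;
    int value;                                  // the column to be sorted on

    DataPoint(String name, int value) {
        this.name = name;
        this.value = value;
    }

    @Override
    public int compareTo(DataPoint that) {
        return Integer.compare(this.value, that.value);
    }

    public static void main(String[] args) {
        List<DataPoint> list = new ArrayList<>();
        list.add(new DataPoint("B", 2));
        list.add(new DataPoint("A", 3));
        list.add(new DataPoint("C", 1));
        Collections.sort(list);                 // now ordered C, B, A by value
    }
}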

In Microsoft Excel, you can sort a column of data by selecting Data | Sort from the main menu.

Merging

Another preprocessing task is merging several sorted files into a single sorted file. Listing 2-18 shows a Java program that implements this task. It is run on the two country files shown in Figure 2-12 and Figure 2-13. Notice that they are sorted by population:


Figure 2-12 African countries


Figure 2-13 South American countries

To merge these two files, we define a Java class to represent each data point, as shown in Listing 2-17. This is the same class as in Listing 2-14, but with two more methods added, at lines 30-38.


Listing 2-17 Country class

By implementing the java.util.Comparable interface (at line 10), Country objects can be compared. The compareTo() method (at lines 34-38) will return a negative integer if the population of the implicit argument (this) is less than the population of the explicit argument. This allows us to order the Country objects according to their population size.

The isNull() method at lines 30-32 is used only to determine when the end of the input file has been reached.
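A sketch of the class with those additions (the convention of leaving the fields null at end of file is inferred from the description above):

import java.util.Scanner;

public class Country implements Comparable<Country> {
    protected String name;
    protected int population;
    protected int area;
    protected boolean landlocked;

    public Country(Scanner in) {
        if (in.hasNext()) {                 // at end of file, leave the fields null/zero
            this.name = in.next();
            this.population = in.nextInt();
            this.area = in.nextInt();
            this.landlocked = in.nextBoolean();
        }
    }

    public boolean isNull() {
        return name == null;                // true only after the input is exhausted
    }

    @Override
    public int compareTo(Country that) {
        return Integer.compare(this.population, that.population);
    }

    @Override
    public String toString() {
        return String.format("%-16s%12d%12d%8b", name, population, area, landlocked);
    }
}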


Listing 2-18 Program to merge two sorted files

The program in Listing 2-18 compares a Country object from each of the two files at line 24 and then prints the one with the smaller population to the output file at line 27 or line 30. When the scanning of one of the two files has finished, one of the Country objects will have null fields, thus stopping the while loop at line 24. Then one of the two remaining while loops finishes scanning the other file.
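A sketch of the merge (the file names are assumptions; the line numbers cited refer to the original listing):

import java.io.File;
import java.io.FileNotFoundException;
import java.io.PrintWriter;
import java.util.Scanner;

public class MergeFiles {
    public static void main(String[] args) throws FileNotFoundException {
        Scanner in1 = new Scanner(new File("African.dat"));
        Scanner in2 = new Scanner(new File("SouthAmerican.dat"));
        PrintWriter out = new PrintWriter("Merged.dat");
        Country country1 = new Country(in1);
        Country country2 = new Country(in2);
        while (!country1.isNull() && !country2.isNull()) {
            if (country1.compareTo(country2) < 0) {   // country1 has the smaller population
                out.println(country1);
                country1 = new Country(in1);
            } else {
                out.println(country2);
                country2 = new Country(in2);
            }
        }
        while (!country1.isNull()) {                  // finish scanning the first file
            out.println(country1);
            country1 = new Country(in1);
        }
        while (!country2.isNull()) {                  // finish scanning the second file
            out.println(country2);
            country2 = new Country(in2);
        }
        out.close();
        in1.close();
        in2.close();
    }
}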


Figure 2-14 Merged files

Note

This program could generate a very large number of unused Country objects.

For example, if one file contains a million records and the other file has a record whose population field is maximal, then a million useless (null) objects would be created. This reveals another good reason for using Java for file processing. In Java, the space used by objects that have no references is automatically returned to the heap of available memory. In a programming language that does not implement this garbage collection protocol, the program would likely crash for exceeding memory limitations.

Hashing

Hashing is the process of assigning identification numbers to data objects. The term hash is used to suggest a random scrambling of the numbers, like the common dish of leftover meat, potatoes, onions, and spices.

A good hash function has these two properties:

  • Uniqueness: No two distinct objects have the same hash code
  • Randomness: The hash codes seem to be uniformly distributed

Java automatically assigns a hash code to each object that is instantiated. This is yet another good reason to use Java for data analysis. The hash code of an object, obj, is given by obj.hashCode(). For example, in the merging program in Listing 2-18, add this at line 24:

System.out.println(country1.hashCode());

You will get an integer such as 685,325,104 for the hash code of the Paraguay object. (The exact value varies from run to run, because the default hash code is based on the object's identity, not its contents.)

Java computes the hash codes for its objects from the hash codes of the object's contents. For example, the hash code for the string AB is 2081, which is 31*65 + 66—that is, 31 times the hash code for A plus the hash code for B. (Those are the Unicode values for the characters A and B.)
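You can verify this directly:

System.out.println("AB".hashCode());               // prints 2081
System.out.println(31 * (int) 'A' + (int) 'B');    // also prints 2081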

Of course, hash codes are used to implement hash tables. The original idea was to store a collection of objects in an array, a[], where object x would be stored at index i = h mod n, where h is the hash code for x and n is the size of the array. For example, if n = 255, then the Paraguay object would be stored in a[109], because 685,325,104 mod 255 = 109.

Recall that mod means remainder. For example, 25 mod 7 = 4 because 25 = 3 × 7 + 4.
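In Java, the index computation looks like this (a sketch; Math.floorMod is used because hash codes can be negative, in which case the % operator would give a negative result):

int h = 685325104;              // hash code of the Paraguay object
int n = 255;                    // size of the array
int i = Math.floorMod(h, n);    // i == 109, the storage index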