Java Data Analysis
File formats

The Countries.dat data file in the preceding example is a flat file—an ordinary text file with no special structure or formatting. It is the simplest kind of data file.

Another simple, common format for data files is the comma-separated values (CSV) file. It is also a text file, but it uses commas instead of blanks to separate the data values. Here is the same data as before, in CSV format:

Figure 2-4 A CSV data file

Note

In this example, we have added a header line that identifies the columns by name: Country and Population.

For Java to process this correctly, we must tell the Scanner object to use the comma as a delimiter. This is done at line 18, right after the input object is instantiated:

Listing 2-3 A program for reading CSV data

The regular expression ,|\\s means comma or any white space. In Java regular expressions, \s matches any single white-space character (blank, tab, newline, and so on). Inside a Java string literal, the backslash itself must be escaped with a second backslash, like this: \\s. The pipe character | means "or" in regular expressions.
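As a minimal sketch of the delimiter technique (not the book's Listing 2-3, which is not reproduced here), the following reads CSV text from a string instead of a file. The class name CsvScannerSketch, the method parse(), and the sample rows are illustrative assumptions.

```java
import java.util.Map;
import java.util.Scanner;
import java.util.TreeMap;

class CsvScannerSketch {

    // Parses "Country,Population" CSV text into a map sorted by country name.
    static Map<String, Integer> parse(String csv) {
        Map<String, Integer> populations = new TreeMap<>();
        try (Scanner input = new Scanner(csv)) {
            // Split tokens on a comma or on any single white-space character.
            input.useDelimiter(",|\\s");
            // Skip the two header tokens: Country and Population.
            input.next();
            input.next();
            while (input.hasNext()) {
                String country = input.next();
                int population = input.nextInt();
                populations.put(country, population);
            }
        }
        return populations;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows, not the contents of the book's data file.
        String csv = "Country,Population\nCountryA,1200\nCountryB,3400000\n";
        System.out.println(parse(csv));
    }
}
```

Because \\s matches exactly one character, two consecutive delimiters (for example, a Windows \r\n line ending) would produce an empty token between them; a pattern such as ,|\\s+ is one way to tolerate that.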

Here is the output:

Figure 2-5 Output from the CSV program

The format code %-10s means to print the string in a 10-column field, left-justified. The format code %,12d means to print the decimal integer in a 12-column field, right-justified, with a comma preceding each triple of digits (for readability).
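The two format codes can be tried in isolation with a short sketch like the one below. The class name FormatCodeSketch and the method formatRow() are hypothetical, and the population passed in main() is a rounded illustrative value. Locale.US is passed explicitly, on the assumption that a comma is wanted as the grouping separator regardless of the machine's default locale.

```java
import java.util.Locale;

class FormatCodeSketch {

    // Formats one row: the name left-justified in 10 columns,
    // the number right-justified in 12 columns with digit grouping.
    static String formatRow(String country, int population) {
        return String.format(Locale.US, "%-10s%,12d", country, population);
    }

    public static void main(String[] args) {
        // Brackets make the padding at both ends of the 22-column row visible.
        System.out.println("[" + formatRow("Brazil", 201000000) + "]");
    }
}
```

Each row is always 22 characters wide as long as the name fits in 10 columns and the grouped number fits in 12.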

Microsoft Excel data

The best way to read from and write to Microsoft Excel data files is to use the POI open source API library from the Apache Software Foundation. You can download the library here: https://poi.apache.org/download.html. Choose the current poi-bin zip file.

This section shows two Java programs for copying data back and forth between a Map data structure and an Excel workbook file. Instead of a HashMap, we use a TreeMap, just to show how the latter keeps the data points in order by their key values.
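The ordering behavior of TreeMap can be seen in a small sketch such as the following. The class name TreeMapOrderSketch, the method sortedKeys(), and the sample values are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TreeMapOrderSketch {

    // Copies any map into a TreeMap and returns its keys in iteration order.
    // A TreeMap iterates in ascending key order, here alphabetical.
    static List<String> sortedKeys(Map<String, Integer> data) {
        return new ArrayList<>(new TreeMap<>(data).keySet());
    }

    public static void main(String[] args) {
        Map<String, Integer> data = new HashMap<>();
        data.put("Peru", 3);      // placeholder values, not real populations
        data.put("Brazil", 1);
        data.put("Chile", 2);
        System.out.println(sortedKeys(data));
    }
}
```

A HashMap makes no promise about iteration order, which is why the TreeMap is the better choice when the output should be sorted by key.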

The first program is named FromMapToExcel.java. Here is its main() method:

Listing 2-4 The FromMapToExcel program

The load() method at line 23 loads the data from the Countries data file shown in Figure 2-1 into the map. The print() method at line 24 prints the contents of the map. The storeXL() method at line 25 creates the Excel workbook as Countries.xls in our data folder, creates a worksheet named countries in that workbook, and then stores the map data into that worksheet.

The resulting Excel workbook and worksheet are shown in Figure 2-6.

Notice that the data is the same as in the file shown in Figure 2-1. The only difference is that, since the population of Brazil is over 100,000,000, Excel displays it as rounded and in exponential notation: 2.01E+08.

The code for the load() method is the same as that in lines 15-26 of Listing 2-1, without line 16.

Here is the code for the print() method:

Listing 2-5 The print() method of the FromMapToExcel program

Figure 2-6 Excel workbook created by FromMapToExcel program

In Listing 2-5, line 45 extracts the set of keys (countries) from the map. Then, for each of these, we get the corresponding population at line 47 and print them together at line 48.

Here is the code for the storeXL() method:

Listing 2-6 The storeXL() method of the FromMapToExcel program

Lines 60-63 instantiate the out, workbook, worksheet, and countries objects. Then each iteration of the for loop loads one row into the worksheet object. That code is rather self-evident.

The next program loads a Java map structure from an Excel table, reversing the action of the previous program.

Listing 2-7 The FromExcelToMap program

It simply calls this loadXL() method and then prints the resulting map:

Listing 2-8 The loadXL() method of the FromExcelToMap program

The loop at lines 37-45 iterates once for each row of the Excel worksheet. Each iteration gets each of the two cells in that row and then puts the data pair into the map at line 44.

Note

The code at lines 34-35 instantiates HSSFWorkbook and HSSFSheet objects. This and the code at lines 38-39 require the import of three classes from the external package org.apache.poi.hssf.usermodel; specifically, these three import statements:

import org.apache.poi.hssf.usermodel.HSSFRow;
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;

The Java archive (POI 3.16) can be downloaded from https://poi.apache.org/download.html. See the Appendix for instructions on installing it in NetBeans.

XML and JSON data

Excel is a good visual environment for editing data. But, as the preceding examples suggest, it's not so good for processing structured data, especially when that data is transmitted automatically, as with web servers.

As an object-oriented language, Java works well with structured data, such as lists, tables, and graphs. But, as a programming language, Java is not meant to store data internally. That's what files, spreadsheets, and databases are for.

The notion of a standardized file format for machine-readable, structured data goes back to the 1960s. The idea was to nest the data in blocks, each of which would be marked up by identifying it with opening and closing tags. The tags would essentially define the grammar for that structure.

This was how the Generalized Markup Language (GML), and then the Standard Generalized Markup Language (SGML), were developed at IBM. SGML was widely used by the military, aerospace, industrial publishing, and technical reference industries.

The Extensible Markup Language (XML) derived from SGML in the 1990s, mainly to satisfy the new demands of data transmission on the World Wide Web. Here is an example of an XML file:

Figure 2-7 An XML data file

This shows three <book> objects, each with a different number of fields. Note that each field begins with an opening tag and ends with a matching closing tag. For example, the field <year>2017</year> has opening tag <year> and closing tag </year>.
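To see the opening and closing tags at work, here is a minimal sketch that uses the DOM parser shipped with the JDK to pull one field out of a small XML fragment. The class name XmlFieldSketch and the method firstField() are hypothetical, and the fragment in main() is an assumed reconstruction of a single <book> element based on the program output shown later in this section, not a copy of Figure 2-7.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XmlFieldSketch {

    // Returns the text between the first <tag> and its matching </tag>,
    // or null if the document has no such element.
    static String firstField(String xml, String tag) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList matches = doc.getElementsByTagName(tag);
            return matches.getLength() == 0 ? null : matches.item(0).getTextContent();
        } catch (Exception e) {
            // Thrown, for example, when an opening tag has no matching closing tag.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<book><title>Data Analysis with Java</title>"
                + "<author>John R. Hubbard</author><year>2017</year></book>";
        System.out.println(firstField(xml, "year"));
    }
}
```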

XML has been very popular as a data transmission format because it is simple, flexible, and easy to process.

The JavaScript Object Notation (JSON) format was developed in the early 2000s, shortly after the popularity of the scripting language JavaScript began to rise. It uses the best ideas of XML, with modifications to accommodate easy management with Java (and JavaScript) programs. Although the J in JSON stands for JavaScript, JSON can be used with any programming language.

There are two popular Java API libraries for JSON: javax.json and org.json. Also, Google has a GSON version in com.google.gson. We will use the official Java EE version, javax.json.

JSON is a data-exchange format: a grammar for text files meant to convey data between automated information systems. Like XML, it plays the role that commas play in CSV files, marking where each data value begins and ends. But, unlike CSV, JSON works very well with structured data.

In JSON files, all data is presented in name-value pairs, like this:

"firstName" : "John"
"age" : 54
"likesIceCream": true

These pairs are then nested, to form structured data, the same as in XML.

Figure 2-8 shows a JSON data file with the same structured data as the XML file in Figure 2-7.

Figure 2-8 A JSON data file

The root object is a name-value pair with books for its name. The value for that JSON object is a JSON array with three elements, each a JSON object representing a book. Notice that the structure of each of those objects is slightly different. For example, the first two have an authors field, which is another JSON array, while the third has a scalar author field instead. Also, the last two have no edition field, and the last one has no isbn field.

Each pair of braces {} defines a JSON object. The outermost pair of braces defines the JSON file itself. Between its braces, a JSON object holds name-value pairs separated by commas; each pair consists of a string, a colon, and a JSON value. A JSON array is a pair of brackets [], between which is a comma-separated sequence of JSON values. Finally, a JSON value is either a string, a number, a JSON object, a JSON array, true, false, or null. As usual, null means unknown.

JSON can be used in HTML pages by including this in the <head> section:

<script src="js/libs/json2.js"></script>

If you know the structure of your JSON file in advance, then you can read it with a JsonReader object, as shown in Listing 2-10. Otherwise, use a JsonParser object, as shown in Listing 2-11.

A parser is an object that can read tokens in an input stream and identify their types. For example, in the JSON file shown in Figure 2-8, the first three tokens are {, books, and [. Their types are START_OBJECT, KEY_NAME, and START_ARRAY. This can be seen from the output in Listing 2-9. Note that the JSON parser calls a token an event.

Listing 2-9 Identifying JSON event types

By identifying the tokens this way, we can decide what to do with them. If it's a START_OBJECT, then the next token must be a KEY_NAME (or an END_OBJECT, if the object is empty). If it's a KEY_NAME, then the next token must be either a key value, a START_OBJECT, or a START_ARRAY. And if that's a START_ARRAY, then the next token must be either an array element (a value, another START_ARRAY, or a START_OBJECT) or an END_ARRAY.

This is called parsing. The objective is two-fold: extract both the key values (the actual data) and the overall data structure of the dataset.
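The idea of classifying tokens by type can be illustrated without any external library. The sketch below is a deliberately simplified, hand-written tokenizer, not the javax.json parser used in the book's listings; it only borrows that parser's event names. The class name JsonEventSketch and the method events() are hypothetical, and the code assumes strings without escaped quotes and whole-number values only.

```java
import java.util.ArrayList;
import java.util.List;

class JsonEventSketch {

    // Lists the event type of each token in a small JSON text.
    // Simplifications: no escaped quotes in strings, integers only.
    static List<String> events(String json) {
        List<String> events = new ArrayList<>();
        int i = 0;
        while (i < json.length()) {
            char c = json.charAt(i);
            if (c == '{') { events.add("START_OBJECT"); i++; }
            else if (c == '}') { events.add("END_OBJECT"); i++; }
            else if (c == '[') { events.add("START_ARRAY"); i++; }
            else if (c == ']') { events.add("END_ARRAY"); i++; }
            else if (c == '"') {
                int close = json.indexOf('"', i + 1);   // closing quote
                if (close < 0) {
                    throw new IllegalArgumentException("Unterminated string");
                }
                i = close + 1;
                // A string followed by a colon is a name; otherwise it is a value.
                int next = i;
                while (next < json.length()
                        && Character.isWhitespace(json.charAt(next))) {
                    next++;
                }
                boolean isKey = next < json.length() && json.charAt(next) == ':';
                events.add(isKey ? "KEY_NAME" : "VALUE_STRING");
            }
            else if (Character.isDigit(c) || c == '-') {
                i++;
                while (i < json.length() && Character.isDigit(json.charAt(i))) {
                    i++;
                }
                events.add("VALUE_NUMBER");
            }
            else if (json.startsWith("true", i)) { events.add("VALUE_TRUE"); i += 4; }
            else if (json.startsWith("false", i)) { events.add("VALUE_FALSE"); i += 5; }
            else if (json.startsWith("null", i)) { events.add("VALUE_NULL"); i += 4; }
            else { i++; }   // white space, commas, and colons only separate tokens
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(events("{\"books\":[{\"year\":2017}]}"));
    }
}
```

A real parser also tracks nesting and rejects sequences that break the grammar; this sketch only labels tokens, which is the first half of the parsing job described above.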

Listing 2-10 Parsing JSON files

Here is the getMap() method:

Listing 2-11 A getMap() method for parsing JSON files

And here is the getList() method:

Listing 2-12 A getList() method for parsing JSON files

Note

The actual data, both names and values, are obtained from the file by the methods parser.getString() and parser.getInt().

Here is unformatted output from the program, just for testing purposes:

{books=[{year=2004, isbn=0-13-093374-0, publisher=Prentice Hall, title=Data Structures with Java, authors=[John R. Hubbard, Anita Huray]}, {year=2006, isbn=0-321-34980-6, edition=4, publisher=Addison Wesley, title=The Java Programming Language, authors=[Ken Arnold, James Gosling, David Holmes]}, {year=2017, author=John R. Hubbard, publisher=Packt, title=Data Analysis with Java}]}

Note

The default way that Java prints key-value pairs is, for example, year=2004, where year is the key and 2004 is the value.
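This default format comes from the toString() method that Java's map classes inherit, and it nests naturally when a value is itself a list or a map. The following minimal sketch rebuilds the third book from the output above. The class name MapToStringSketch and the method sampleBook() are hypothetical, and the use of LinkedHashMap (to keep the fields in insertion order) is an assumption, not necessarily the map type used in the book's program.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class MapToStringSketch {

    // Builds a map holding the fields of one book, in insertion order.
    static Map<String, Object> sampleBook() {
        Map<String, Object> book = new LinkedHashMap<>();
        book.put("year", 2017);
        book.put("author", "John R. Hubbard");
        book.put("publisher", "Packt");
        book.put("title", "Data Analysis with Java");
        return book;
    }

    public static void main(String[] args) {
        // println calls toString(), which renders each pair as key=value
        // and wraps the comma-separated pairs in braces.
        System.out.println(sampleBook());
    }
}
```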

To run Java programs like this, you can download the file javax.json-1.0.4.jar from:

https://mvnrepository.com/artifact/org.glassfish/javax.json/1.0.4

Click on Download (BUNDLE).

See the Appendix for instructions on how to install this archive in NetBeans.

Copy the downloaded jar file (javax.json-1.0.4.jar) into a convenient folder (for example, Library/Java/Extensions/ on a Mac). If you are using NetBeans, choose Tools | Libraries to load it into the IDE. Then right-click on the project icon, select Properties and then Libraries, choose Add JAR/Folder, and navigate to and select your javax.json-1.0.4.jar file.