File formats
The Countries.dat
data file in the preceding example is a flat file—an ordinary text file with no special structure or formatting. It is the simplest kind of data file.
Another simple, common format for data files is the comma separated values (CSV) file. It is also a text file, but uses commas instead of blanks to separate the data values. Here is the same data as before, in CSV format:
For Java to process this correctly, we must tell the Scanner
object to use the comma as a delimiter. This is done at line 18, right after the input
object is instantiated:
The regular expression ,|\\s
means a comma or any single white space character. In Java regular expressions, white space (blanks, tabs, newlines, and so on) is denoted by '\s'
. When used in a string, the backslash character itself must be escaped with another preceding backslash, like this: \\s
. The pipe character |
means "or" in regular expressions.
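As a minimal sketch of the delimiter setting described above, the following program reads CSV text from a string instead of a file. The class name CsvScannerSketch, the parse() helper, and the sample records are made up for illustration; they are not taken from the listing. Because the pattern matches only one character at a time, a Windows-style \r\n line ending would produce an empty token, so the sample uses plain \n line endings.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Scanner;

public class CsvScannerSketch {

    // Reads name,value records from CSV text into a map, keeping input order.
    static Map<String, Integer> parse(String csv) {
        Map<String, Integer> map = new LinkedHashMap<>();
        try (Scanner input = new Scanner(csv)) {
            // A token ends at a comma or at any single white space character.
            input.useDelimiter(",|\\s");
            while (input.hasNext()) {
                String name = input.next();
                int value = input.nextInt();
                map.put(name, value);
            }
        }
        return map;
    }

    public static void main(String[] args) {
        // Hypothetical sample records, not the contents of the book's data file.
        String csv = "Alpha,1000\nBeta,2500000\n";
        System.out.println(parse(csv));
    }
}
```

Note that this delimiter also splits on blanks inside a field, so a name containing a space would be read as two tokens.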
Here is the output:
The format code %-10s
means to print the string in a 10-column field, left-justified. The format code %,12d
means to print the decimal integer in a 12-column field, right-justified, with a comma preceding each triple of digits (for readability).
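The two format codes can be tried in isolation. This is a small sketch, with a hypothetical class name and made-up values; it passes Locale.US explicitly (an assumption, not something the listing is known to do) so that the grouping separator is a comma on any machine.

```java
import java.util.Locale;

public class FormatCodeSketch {

    // Formats one row: the name left-justified in 10 columns, the number
    // right-justified in 12 columns with grouping commas.
    static String formatRow(String name, int number) {
        // Locale.US is given explicitly so the grouping separator is a comma.
        return String.format(Locale.US, "%-10s%,12d", name, number);
    }

    public static void main(String[] args) {
        // Hypothetical values, chosen only to show the column layout.
        // The brackets make the padding at both ends visible.
        System.out.println("[" + formatRow("Alpha", 1234567) + "]");
    }
}
```

Each row produced this way is 22 characters wide: 10 for the name field plus 12 for the number field.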
Microsoft Excel data
The best way to read from and write to Microsoft Excel data files is to use the POI open source API library from the Apache Software Foundation. You can download the library here: https://poi.apache.org/download.html. Choose the current poi-bin zip file.
This section shows two Java programs for copying data back and forth between a Map
data structure and an Excel workbook file. Instead of a HashMap
, we use a TreeMap
, just to show how the latter keeps the data points in order by their key values.
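The ordering behavior is easy to check on its own. In this sketch, the class name, the sortedKeys() helper, and the entries are hypothetical; it copies an unordered HashMap into a TreeMap and lists the keys in iteration order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TreeMapOrderSketch {

    // Copies the entries into a TreeMap and returns its keys in iteration order.
    static List<String> sortedKeys(Map<String, Integer> unordered) {
        Map<String, Integer> sorted = new TreeMap<>(unordered);
        // A TreeMap iterates in ascending key order, whatever the insertion order was.
        return new ArrayList<>(sorted.keySet());
    }

    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<>();
        // Hypothetical entries, inserted out of alphabetical order.
        map.put("Gamma", 30);
        map.put("Alpha", 10);
        map.put("Beta", 20);
        System.out.println(sortedKeys(map));
    }
}
```

A HashMap makes no promise about iteration order, which is why the TreeMap is the better choice when the output should be sorted by key.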
The first program is named FromMapToExcel.java
. Here is its main()
method:
The load()
method at line 23 loads the data from the Countries
data file shown in Figure 2-1 into the map. The print()
method at line 24 prints the contents of the map. The storeXL()
method at line 25 creates the Excel workbook as Countries.xls
in our data
folder, creates a worksheet named countries in that workbook, and then stores the map data into that worksheet.
The resulting Excel workbook and worksheet are shown in Figure 2-6.
Notice that the data is the same as in the file shown in Figure 2-1. The only difference is that, since the population of Brazil is over 100,000,000, Excel displays it as rounded and in exponential notation: 2.01E+08
.
The code for the load()
method is the same as that in Lines 15-26 of Listing 2-1, without line 16.
Here is the code for the print()
method:
In Listing 2-5, line 45 extracts the set of keys (countries) from the map. Then, for each of these, we get the corresponding population at line 47 and print them together at line 48.
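The key-set loop described above might look roughly like the following sketch. It is not the book's listing: the class name and the render() helper are hypothetical, the data is made up, and reusing the %-10s and %,12d format codes here is an assumption. The method returns the text instead of printing it directly so that it is easy to inspect.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class PrintMapSketch {

    // Builds one formatted line per entry, visiting keys in the map's own order.
    static String render(Map<String, Integer> map) {
        StringBuilder sb = new StringBuilder();
        Set<String> countries = map.keySet();        // extract the set of keys
        for (String country : countries) {
            int population = map.get(country);       // look up the value for this key
            sb.append(String.format(Locale.US, "%-10s%,12d%n", country, population));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> map = new TreeMap<>();
        // Hypothetical entries, not the contents of the book's data file.
        map.put("Beta", 2500000);
        map.put("Alpha", 1000);
        System.out.print(render(map));
    }
}
```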
Here is the code for the storeXL()
method:
Lines 60-63 instantiate the out
, workbook
, worksheet
, and countries
objects. Then each iteration of the for
loop loads one row into the worksheet
object. That code is rather self-evident.
The next program loads a Java map structure from an Excel table, reversing the action of the previous program.
It simply calls this loadXL()
method and then prints the resulting map:
The loop at lines 37-45 iterates once for each row of the Excel worksheet. Each iteration gets each of the two cells in that row and then puts the data pair into the map at line 44.
The code at lines 34-35 instantiates HSSFWorkbook
and HSSFSheet
objects. This and the code at lines 38-39 require the import of three classes from the external package org.apache.poi.hssf.usermodel
; specifically, these three import statements:
import org.apache.poi.hssf.usermodel.HSSFRow; import org.apache.poi.hssf.usermodel.HSSFSheet; import org.apache.poi.hssf.usermodel.HSSFWorkbook;
The Java archive (POI 3.16) can be downloaded from https://poi.apache.org/download.html. See Appendix for instructions on installing it in NetBeans.
XML and JSON data
Excel is a good visual environment for editing data. But, as the preceding examples suggest, it's not so good for processing structured data, especially when that data is transmitted automatically, as with web servers.
As an object-oriented language, Java works well with structured data, such as lists, tables, and graphs. But, as a programming language, Java is not meant to store data internally. That's what files, spreadsheets, and databases are for.
The notion of a standardized file format for machine-readable, structured data goes back to the 1960s. The idea was to nest the data in blocks, each of which would be marked up by identifying it with opening and closing tags. The tags would essentially define the grammar for that structure.
This was how the Generalized Markup Language (GML), and then the Standard Generalized Markup Language (SGML), were developed at IBM. SGML was widely used by the military, aerospace, industrial publishing, and technical reference industries.
The Extensible Markup Language (XML) derived from SGML in the 1990s, mainly to satisfy the new demands of data transmission on the World Wide Web. Here is an example of an XML file:
This shows three <book>
objects, each with a different number of fields. Note that each field begins with an opening tag and ends with a matching closing tag. For example, the field <year>2017</year>
has opening tag <year>
and closing tag </year>
.
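To see how a program can pick a field out of such tagged text, here is a small sketch that uses the DOM parser bundled with the JDK. The class name, the firstField() helper, and the hand-written one-book fragment are assumptions for illustration; only the <year> tag is confirmed by the text, and the <title> tag name is a guess at the file's markup.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XmlFieldSketch {

    // Returns the text of the first element with the given tag name,
    // or null if the document has no such element.
    static String firstField(String xml, String tag) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName(tag);
            return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
        } catch (Exception e) {
            // Malformed XML or a parser configuration problem.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // A hand-written fragment; the <title> tag name is an assumption.
        String xml = "<book><title>Data Analysis with Java</title>"
                   + "<year>2017</year></book>";
        System.out.println(firstField(xml, "year"));
    }
}
```

Asking for a tag that is absent, such as isbn in this fragment, returns null, which mirrors the way the book records differ in which fields they carry.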
XML has been very popular as a data transmission protocol because it is simple, flexible, and easy to process.
The JavaScript Object Notation (JSON) format was developed in the early 2000s, shortly after the popularity of the scripting language JavaScript began to rise. It uses the best ideas of XML, with modifications to accommodate easy management with Java (and JavaScript) programs. Although the J in JSON stands for JavaScript, JSON can be used with any programming language.
There are two popular Java API libraries for JSON: javax.json
and org.json
. Also, Google has a GSON version in com.google.gson
. We will use the official Java EE version, javax.json
.
JSON is a data-exchange format—a grammar for text files meant to convey data between automated information systems. Like XML, it is used the way commas are used in CSV files. But, unlike CSV files, JSON works very well with structured data.
In JSON files, all data is presented in name-value pairs, like this:
"firstName" : "John"
"age" : 54
"likesIceCream": true
These pairs are then nested, to form structured data, the same as in XML.
Figure 2-8 shows a JSON data file with the same structured data as the XML file in Figure 2-7.
The root object is a name-value pair with books
for its name. The value for that JSON object is a JSON array with three elements, each a JSON object representing a book. Notice that the structure of each of those objects is slightly different. For example, the first two have an authors
field, which is another JSON array, while the third has a scalar author
field instead. Also, the last two have no edition
field, and the last one has no isbn
field.
Each pair of braces {}
defines a JSON object. The outermost pair of braces defines the JSON file itself. Each JSON object is a pair of braces enclosing zero or more name-value pairs separated by commas, where each pair consists of a string, a colon, and a JSON value. A JSON array is a pair of brackets [] enclosing a comma-separated sequence of JSON values. Finally, a JSON value is either a string, a number, a JSON object, a JSON array, true
, false
, or null
. As usual, null
means unknown.
JSON can be used in HTML pages by including this in the <head>
section:
<script src="js/libs/json2.js"></script>
If you know the structure of your JSON file in advance, then you can read it with a JsonReader
object, as shown in Listing 2-10. Otherwise, use a JsonParser
object, as shown in Listing 2-11.
A parser is an object that can read tokens in an input stream and identify their types. For example, in the JSON file shown in Figure 2-8, the first three tokens are {
, books
, and [
. Their types are START_OBJECT, KEY_NAME,
and START_ARRAY
. This can be seen from the output in Listing 2-9. Note that the JSON parser calls a token an event.
By identifying the tokens this way, we can decide what to do with them. If it's a START_OBJECT
, then the next token must be a KEY_NAME
. If it's a KEY_NAME
, then the next token must be either a key value, a START_OBJECT
or a START_ARRAY
. And if that's a START_ARRAY
, then the next token must be a value (as in the authors arrays), another START_ARRAY
or another START_OBJECT
.
This is called parsing. The objective is two-fold: extract both the key values (the actual data) and the overall data structure of the dataset.
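The token-order rules can be written down as a small lookup. This is a toy sketch only: the Event enum below is defined locally to mirror the event names used in the text, it is not the javax.json API, and the class and method names are hypothetical. It also lets a plain value follow START_ARRAY, as happens in the authors arrays of the books data, and it leaves out the closing tokens and empty objects or arrays.

```java
import java.util.EnumSet;
import java.util.Set;

public class TokenOrderSketch {

    // Local stand-ins for the parser's event types. VALUE lumps together
    // strings, numbers, true, false, and null.
    enum Event { START_OBJECT, KEY_NAME, VALUE, START_ARRAY }

    // Returns the event types that may come next, following the rules in the text.
    static Set<Event> allowedAfter(Event current) {
        switch (current) {
            case START_OBJECT:
                return EnumSet.of(Event.KEY_NAME);
            case KEY_NAME:
                return EnumSet.of(Event.VALUE, Event.START_OBJECT, Event.START_ARRAY);
            case START_ARRAY:
                return EnumSet.of(Event.VALUE, Event.START_ARRAY, Event.START_OBJECT);
            default:
                // The text gives no rule for what follows a value, so none is encoded.
                return EnumSet.noneOf(Event.class);
        }
    }

    public static void main(String[] args) {
        // The first three tokens of the books file:  {  books  [
        Event[] tokens = { Event.START_OBJECT, Event.KEY_NAME, Event.START_ARRAY };
        for (int i = 0; i + 1 < tokens.length; i++) {
            boolean ok = allowedAfter(tokens[i]).contains(tokens[i + 1]);
            System.out.println(tokens[i] + " -> " + tokens[i + 1] + " allowed: " + ok);
        }
    }
}
```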
And here is the getList()
method:
Here is unformatted output from the program, just for testing purposes:
{books=[{year=2004, isbn=0-13-093374-0, publisher=Prentice Hall, title=Data Structures with Java, authors=[John R. Hubbard, Anita Huray]}, {year=2006, isbn=0-321-34980-6, edition=4, publisher=Addison Wesley, title=The Java Programming Language, authors=[Ken Arnold, James Gosling, David Holmes]}, {year=2017, author=John R. Hubbard, publisher=Packt, title=Data Analysis with Java}]}
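The braces, brackets, and equals signs in that output are what Java's standard map and list classes produce from toString(). The following sketch assumes that the parsed data ends up in ordinary java.util maps and lists, which is a guess about the listings. It rebuilds only the third book by hand, with a hypothetical class name, and uses LinkedHashMap so that the field order is predictable.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NestedMapSketch {

    // Builds a map holding a list of maps, containing only the third book.
    static Map<String, Object> build() {
        Map<String, Object> book = new LinkedHashMap<>();
        book.put("year", 2017);
        book.put("author", "John R. Hubbard");
        book.put("publisher", "Packt");
        book.put("title", "Data Analysis with Java");

        List<Object> books = new ArrayList<>();
        books.add(book);

        Map<String, Object> root = new LinkedHashMap<>();
        root.put("books", books);
        return root;
    }

    public static void main(String[] args) {
        // Maps print as {key=value, ...} and lists as [element, ...].
        System.out.println(build());
    }
}
```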
To run Java programs like this, you can download the file javax.json-1.0.4.jar
from:
https://mvnrepository.com/artifact/org.glassfish/javax.json/1.0.4.
Click on Download (BUNDLE).
See Appendix for instructions on how to install this archive in NetBeans.
Copy the downloaded jar file (javax.json-1.0.4.jar
) into a convenient folder (for example, Library/Java/Extensions/
on a Mac). If you are using NetBeans, choose Tools | Libraries to load it into the IDE, and then right-click on the project icon and select Properties and then Libraries; choose Add JAR/Folder and then navigate to and select your javax.json-1.0.4.jar file.