Pentaho Data Integration Beginner's Guide（Second Edition）

上QQ阅读APP看书，第一时间看更新

Time for action – calculating football match statistics by grouping data

Let's continue working with the football matches file. Suppose that you want to take that information to obtain some statistics, for example, the maximum number of goals per match in a given day. To do this, follow these instructions:

Create a new transformation, give it a name and description, and save it.
By using a Text file input step, read the matches.txt file, just like you did it in the previous section.
Do a preview just to confirm that the step is well configured.
Add a Sort rows step to the transformation, and sort the fields by region and match_date in ascending order.
Expand the Statistics category of steps, and drag a Group by step to the canvas. Create a hop from the Sort rows step to this new step.
Edit the Group by step and fill in the configuration window as shown in the following screenshot:
When you click on the OK button, a window appears to warn you that this step needs the input to be sorted on the specified keys, in this case, the region and match_date fields. Click on I understand, and don't worry because you already sorted the data in the previous step.
Add a final Dummy step.
Select the Dummy and the Sort rows steps, left-click on one and holding down the Ctrl key —cmd key on Mac — left-click the other.
Click on the Preview button. You will see this:
Click on Quick Launch.
The following window appears:
Double-click on the Sort rows option. A window appears with the data coming out of the Sort rows step.
Double-click on the Dummy option. A window appears with the data coming out of the Dummy step.
If you rearrange the preview windows, you can see both preview windows at a time, and understand better what happened with the numbers. The following would be the data shown in the windows:

What just happened?

You opened a file with results from several matches, and got some statistics from it.

After reading the file, you ordered the data by region and match date with a Sort rows step, and then you ran some statistical calculations:

First, you grouped the rows by region and match date. You did this by typing or selecting region and match_date in the upper grid of the Group by step.
Then, for every combination of region and match date, you calculated some statistics. You did the calculations by adding rows in the lower grid of the step, one for every statistic you needed.

Let's see how it works. As the Group by step was preceded by a Sort rows step, the rows came to the step already ordered. When the rows arrive at the Group by step, Kettle creates groups based on the fields indicated in the upper grid; in this case, the region and match_date fields. The following screenshot shows this idea:

Then, for every group, the fields that you put in the lower grid are calculated. Let's see, for example, the group for the region usa and match date 07-09. There are 12 rows in this group. For these rows, Kettle calculated the following:

Goals (total): The total number of goals converted in the region usa on 07-09. There were 17 (0+3+2+1+2+2+1+0+3+1+0+2) goals.
Maximum: The maximum number of goals converted by a team in a match. The maximum among the numbers in the preceding bullet point is 3.
Teams: The number of teams that played on a day in a region—12.

The same calculations were made for every group. You can verify the details by looking at the preview windows or the preceding screenshot.

Look at the Step Metrics tab in the Execution Results area of the following screenshot:

Note that 242 rows enter the Group by step, and only 10 came out of that step towards the Dummy step. That is because after the grouping, you no longer have the details of the matches. The output of the Group by step is your new data now: one row for every group created.

Group by Step

The Group by step allows you to create groups of rows and calculate new fields over those groups.

In order to define the groups, you have to specify which field or fields are the keys. For every combination of values for those fields, Kettle builds a new group.

In the previous section, you grouped by two fields: region and match_date. Then for every pair (region, match_date), Kettle created a different group, generating a new row.

The Group by step operates on consecutive rows. The step traverses the dataset and each time the value for any of the grouping field changes, it creates a new group. The step works in this way, even if the data is not sorted by the grouping field.

Note

As you probably don't know how the data is ordered, it is safer and recommended that you sort the data by using a Sort rows step just before using a Group by step.

Once you have defined the groups, you are free to specify new fields to be calculated for every group. Every new field is defined as an aggregate function over some of the existent fields.

Let's review some of the fields that you created in the previous section:

The Goals (total) field is the result of applying the Sum function over the field named goals.
The Maximum field is the result of applying the Maximum function over the field goals.
The Number of Teams field is the result of applying the Number of Values (N) function over the field team.

Note

Note that for a given region and date, it's supposed that a team only played a single match. That is, it will only appear once in the rows for that group. Therefore, you can safely use this function. If you can't make that assumption, you can use the Number of Distinct Values (N) function instead, which would count 1 for each team, even if it appears more than once.

Finally, you have the option to calculate aggregate functions over the whole dataset. You do this by leaving the upper grid blank. Following with the same example, you could calculate the number of teams that played and the average number of goals converted by a team in a football match. This is how you do it:

This is what you get:

In any case, as a result of the Group by step, you will no longer have the detailed rows, unless you check the Include all rows? checkbox.

Numeric fields

As you know, there are several data types in Kettle. Among This section has been removed as it is badly phrased and the sentence makes sense without it the most used are String, Date, and Number. There is not much to say about the String fields, and we already discussed the Date type in Chapter 2, Getting Started with Transformations. Now it's time to talk about numeric fields, which are present in almost all Kettle transformations. In the preceding transformation, we read a file with a numeric field. As discussed in the Time for action – sorting information about matches with the Sort rows step section, if you don't intend to use the field for math, you don't need to define it as a numeric field. In the Time for action – Calculating football matches statistics by grouping data section, however, the numeric field represented goals, and you wanted to do some calculations based on the values. Therefore, we defined it as an integer, but didn't provide a format. When your input sources have more elaborate fields, for example, numbers with separators, dollar signs, and so on—see as an example the transformations about projects in Chapter 2, Getting Started with Transformations—you should specify a format to tell Kettle how to interpret the number. If you don't, Kettle will do its best to interpret the number, but this could lead to unexpected results.

On the other hand, when writing fields to an output destination, you have the option of specifying the format in which you want the number to be written. The same occurs when you change the metadata from a numeric field to a String: you have the option of specifying the format to use for doing the conversion.

There are several formats you may apply to a numeric field. The format is basically a combination of predefined symbols, each with a special meaning.

These format conventions are not Kettle specific, but Java standard. The following are the most used symbols:

If you don't specify a format for your numbers, you may still provide a length and precision. Length is the total number of significant figures, while precision is the number of floating point digits.

If you neither specify the format nor length or precision, Kettle behaves as follows: while reading, it does its best to interpret the incoming number; when writing, it sends the data as it comes, without applying any format.

For a complete reference on number formats, you can check the Sun Java API documentation at http://java.sun.com/javase/7/docs/api/java/text/DecimalFormat.html.

Have a go hero – formatting 99.55

Create a transformation to see for yourself the different formats for the number 99.55. Test the formats shown in the preceding table, and try some other options as well.

To test this, you will need a dataset with a single row and a single field: the number. You can generate it with a Generate rows step.

Pop quiz — formatting output fields

Recall the transformation that you created in the first section of this chapter. You did not do any math, so you were free to read the goals field as a numeric or as a String field. Suppose that you define the field as a Number and, after sorting the data, you want to send it back to a file. How do you define the field in the Text file output step if you want to keep the same look and feel it had in the input? (You may choose more than one option):

As a Number. In the format, you put #.
As a String. In the format, you put #.
As a String. You leave the format blank.

Have a go hero – listing the languages spoken by a country

Read the file containing the information on countries, which you used in Chapter 3, Manipulating Real-world Data. Build a file where each row has two columns: the name of a country, and the list of languages spoken in that country.

Tip

Us a Group by step, and as aggregate function use the option Concatenate strings separated by ,.