
Time for action – generating a range of dates and inspecting the data as it is being created
In this section, you will generate a rowset with one row per date in a date range.
Tip
As you progress, feel free to preview the data that is being generated even if you're not told to do so. This will help you understand what is going on. Testing each step as you move forward makes it easier to debug and craft a functional transformation.
- Create a new transformation.
- From the Input group of steps, drag to the canvas a Generate Rows step.
- Double-click on the step and fill in the grid as shown in the following screenshot:
- Note that you have to change the default value for the Limit textbox. Close the window.
- From the Transform category of steps, add a Calculator step, and create a hop that goes from the Generate Rows step to this one.
- Double-click on the Calculator step and add a field named diff_dates as the difference between end_date and start_date. That is, configure it exactly the same way as you did in the previous section.
- Do a preview. You should see a single row with three fields: the start date, the end date, and a field with the number of days between both.
- Now add a Clone row step. You will find it inside the Utility group of steps.
- Create a hop from the Calculator step towards this new step.
- Edit the Clone row step.
- Select the Nr clone in field? option to enable the Nr Clone field textbox. In this textbox, type diff_dates.
- Now select the Add clone num to option to enable the Clone num field textbox. In this textbox, type delta.
- Run a preview. You should see the following:
- Add another Calculator step, and create a hop from the Clone row step to this one.
- Edit the new step, and add a field named a_single_date. As Calculation select Date A + B Days. As Field A select start_date, and as Field B select delta. Finally, as Value type select Date. For the rest of the columns, leave the default values.
- Now add two Select values steps. You will find them in the Transform branch of the Steps tree.
- Link the steps as shown in the following screenshot. When you create the hop leaving the first Select values step you will be prompted for the kind of hop. Select Main output of step:
- Edit the first of the Select values steps and select the Meta-data tab.
- Under Fieldname type or select a_single_date. As Type select String. As Format type or select MM/dd/yy.
- Close the window, and edit the second Select values step.
- In the Select & Alter tab (which appears selected by default), under Fieldname type a_single_date.
- Close the window and save your work.
- Select the first of the Select values steps, and do a preview. You should see this:
- Try a preview on the second Select values step. You should only see the last column: a_single_date.
Tip
If you don't obtain the same results, check carefully that you followed the steps exactly as explained. If you hit errors in the middle of the section, you know how to deal with them. Take your time, read the log, fix the errors, and resume your work.
Now that you have an idea of what the transformation does, do the following modifications:
- Edit the Generate Rows step and change the date range. As end_date, type 2023-12-31.
- From the Utility group of steps, drag to the work area a Delay row step.
- Drag the step onto the hop between the Clone row step and the second Calculator step until the width of the hop changes:
- A window will appear asking you if you want to split the hop. Click on Yes. The hop will be split in two: one from the Clone row step to the Delay row step, and a second one from this step to the Calculator step.
Tip
You can configure Kettle to split the hops automatically. You can do it by selecting the Don’t ask again? checkbox in this same window, or by navigating to the Tools | Options… menu.
- Double-click on the Delay row step, and configure it using the following information: as Timeout, type 500, and in the drop-down list select Milliseconds. Close the window.
- Save the transformation, and run it. You will see that it runs at a slower pace. This delay was deliberately introduced by the Delay row step so that you can try the next steps.
- Click on the second Calculator step. A pop-up window will show up, describing the execution results of this step in real time. Control-click two more steps: the Generate Rows step and the Clone row step. You will see something similar to the following screenshot:
- Now let's inspect the data itself. Right-click on the second Calculator step and navigate to Sniff test during execution | Sniff test output rows. A window will appear showing the data as it's being generated.
- Now do the same with the second Select values step. A new window appears, also showing the data in real time, but this time you will only see one column.
What just happened?
In this section, you basically did two things. First, you created a transformation and had the opportunity to learn some new steps. Then, you ran the transformation and inspected the data as the transformation was being executed. Let's explain these two tasks in detail.
The transformation that you created generated a dataset with all the dates within a given range. How did you do it? First of all, you created a dataset with a single row holding two fields: the start and end dates. But you needed as many rows as there are dates between those reference dates. You did the trick with two steps: first, a Calculator step that calculated the difference between the dates, and then a Clone row step that not only cloned the single row as many times as needed (the diff_dates field), but also numbered those cloned rows (the delta field). Then you used that delta field to create the desired field by adding start_date and delta.
After having your date field, you used two Select values steps: the first to convert the date to a string with the format MM/dd/yy, and the second to keep just this field, discarding all the others.
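If it helps to see the same logic spelled out, here is a minimal Java sketch of what the transformation computes, assuming a short illustrative range. The class and variable names are invented for the example; in Kettle, all of this is done by the steps themselves, not by code you write:

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;

public class DateRangeSketch {
    public static void main(String[] args) throws Exception {
        SimpleDateFormat mask = new SimpleDateFormat("yyyy-MM-dd");
        Date startDate = mask.parse("2013-01-01");   // start_date field
        Date endDate   = mask.parse("2013-01-10");   // end_date field (a short range for the example)

        // Calculator step: diff_dates = number of days between end_date and start_date
        // (rounding guards against daylight-saving offsets)
        long diffDates = Math.round((endDate.getTime() - startDate.getTime()) / 86400000.0);

        // Clone row step: one row per delta value (0 .. diff_dates),
        // then the second Calculator step: a_single_date = start_date + delta days
        SimpleDateFormat output = new SimpleDateFormat("MM/dd/yy");
        for (int delta = 0; delta <= diffDates; delta++) {
            Calendar cal = Calendar.getInstance();
            cal.setTime(startDate);
            cal.add(Calendar.DAY_OF_MONTH, delta);
            // First Select values step: Date -> String with the MM/dd/yy mask;
            // second Select values step: keep only this column
            System.out.println(output.format(cal.getTime()));
        }
    }
}
```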
After creating the transformation, you added a Delay row step to deliberately delay each row of data by 500 milliseconds. Without it, the transformation would run so fast that it wouldn't allow you to do the sniff testing. Sniff testing is the ability to see the rows that are coming into or going out of a step in real time. While the transformation was running, you experimented with this feature by sniffing the output rows. In the same way, you could have selected the Sniff test input rows option to see the incoming rows of data.
Note
Note that sniff testing slows down the transformation and its use is recommended just for debugging purposes.
Finally, in the Execution Results window, it's worth noting two columns that we hadn't mentioned before:

As you put a delay of 500 milliseconds on each row, it's reasonable to see a speed of 2 rows per second (1,000 ms / 500 ms per row).
Adding or modifying fields by using different PDI steps
As you saw in the last transformation, once the data is created in the first step, it travels from step to step through the hops that link those steps. A hop's function is just to direct data from an output buffer to an input one. The real manipulation of data, as well as the modification of the stream by adding or removing fields, occurs in the steps. In this section, as well as in the first section of this chapter, you used the Calculator step to create new fields and add them to your dataset. The Calculator step is one of the many steps that PDI offers to create new fields by combining existing ones. Usually, you will find these steps under the Transform category of the Steps tree. The following table describes some of the most commonly used steps. The examples reference the first transformation you created in this chapter:

Any of these steps, when added to your transformation, is executed for every row in the stream. It takes the row, identifies the fields it needs to do its task, calculates the new field(s), and adds them to the dataset.
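To make this per-row model concrete, here is a tiny, self-contained Java sketch. It is not the PDI API, and the field names (price_usd, price_eur) are invented, but it shows the pattern every one of these steps follows: read the fields it needs from the row, compute the new field, and append it before passing the row on:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PerRowSketch {
    // A Calculator-like operation: reads one field, adds a new one to the row
    static Map<String, Object> addEuroPrice(Map<String, Object> row) {
        double usd = (Double) row.get("price_usd");  // field the step needs
        row.put("price_eur", usd * 0.9);             // hypothetical new field
        return row;                                  // row travels on to the next step
    }

    public static void main(String[] args) {
        // Rows are modeled as simple maps just for illustration
        List<Map<String, Object>> stream = new ArrayList<>();
        Map<String, Object> row = new HashMap<>();
        row.put("product", "notebook");
        row.put("price_usd", 10.0);
        stream.add(row);

        stream.replaceAll(PerRowSketch::addEuroPrice);  // executed once per row
        System.out.println(stream);
    }
}
```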
Note
For more details on a particular step, don't hesitate to visit the Wiki page for steps at:
http://wiki.pentaho.com/display/EAI/Pentaho+Data+Integration+Steps.
The Select values step
Despite being classified as a member of the Transform category, just like most of the steps mentioned in the previous section, the Select values step is a very particular step and deserves a separate explanation. This step allows you to select, rename, and delete fields, or change the metadata of a field. The configuration window of the step has three tabs:
- Select & Alter
- Remove
- Meta-data
Note
You may use only one of the Select values step tabs at a time. Kettle will not prevent you from filling in more than one tab, but that could lead to unexpected behavior.
The Select & Alter tab, which appears selected by default, lets you specify the fields that you want to keep. You used it in the previous section to keep just the last created field: the date field. This tab can also be used to rename the fields or reorder them. You may have noticed that each time you add a new field, it is placed at the end of the list of fields. If for any reason you want to move it to another place, this is the tab for doing that.
The Remove tab is used to discard unwanted fields. It is convenient if you want to remove just a few fields. For removing many, it's easier to use the Select & Alter tab and specify not the fields to remove, but the fields to keep.
Note
Removing fields by using this tab is expensive from a performance point of view. Please don't use it unless needed!
Finally, the Meta-data tab is used when you want to change the definition of a field. In the earlier section, you used it in the first Select values step. In this case, you changed the metadata of the field named a_single_date. The field was of the Date type, and you changed it to a String. You also told Kettle to convert the date using MM/dd/yy as the format. For example, the date January 31st, 2013 will be converted to the string 01/31/13. In the next section, you will learn more about date formats.
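Since Kettle's date masks follow the conventions of Java's SimpleDateFormat class (referenced later in this chapter), you can reproduce this exact conversion with a couple of lines of Java. This is only an illustration of what the Meta-data tab does for you; no code is needed in Kettle:

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;

public class MetadataConversionSketch {
    public static void main(String[] args) {
        // Build January 31st, 2013 (months are zero-based in Calendar)
        Calendar cal = Calendar.getInstance();
        cal.set(2013, Calendar.JANUARY, 31);

        // Date -> String using the MM/dd/yy mask, as the Meta-data tab does
        SimpleDateFormat mask = new SimpleDateFormat("MM/dd/yy");
        System.out.println(mask.format(cal.getTime()));   // prints 01/31/13
    }
}
```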
Getting fields
The Select values step is just one of several steps that contain field information. In these cases, the grids are usually accompanied by a Get Fields button. The Get Fields button is a facility to avoid typing. When you press that button, Kettle fills the grid with all the available fields.
Tip
Every time you see a Get Fields button, think of it as a shortcut to avoid typing. Kettle will bring the available fields into the grid, and you will only have to review the information brought and make minimal changes.
The name of the button is not necessarily Get Fields.
Note
In the case of the Select values step, depending on the selected tab, the name of the button changes to Get fields to select, Get fields to remove, or Get fields to change, but the purpose of the button is the same in all cases.
Date fields
As we've already said, every field in a Kettle dataset must have a data type. Among the available data types, namely Number (float), String, Date, Boolean, Integer, and BigNumber, Date is one of the most used.
Look at the way you defined the start and end dates in the earlier section. You told Kettle that these were Date fields, with the format yyyy-MM-dd. What does that mean? To Kettle, it means that when you provide a value for that field, it has to interpret it as a date, where the first four positions represent the year, then there is a hyphen, then two positions for the month, another hyphen, and finally two positions for the day. This way, Kettle knows how to interpret, for example, the string 2013-01-01 that you typed as the start date. Something similar occurred with the date fields in the Data Grid step you created in the previous section.
Generally speaking, when a Date field is created, like the fields in the example, you have to specify the format of the date so that Kettle can recognize the different components of the date in the values. There are several formats that may be defined for a date, all of them combinations of letters that represent date or time components.
Note
These format conventions are not Kettle specific, but are based on the SimpleDateFormat class in the Java library.
The following table shows the main letters used for specifying date formats:

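As a small illustration of how such a mask drives the parsing, the following sketch uses SimpleDateFormat, the Java class these masks are based on, to read the text 2013-01-01. A mask that doesn't match the text fails, which is exactly why the Format column has to agree with the values you type:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class DateMaskSketch {
    public static void main(String[] args) throws Exception {
        // String -> Date: the mask tells Kettle (and Java) how to read the text
        SimpleDateFormat mask = new SimpleDateFormat("yyyy-MM-dd");
        Date startDate = mask.parse("2013-01-01");
        System.out.println(startDate);   // Tue Jan 01 00:00:00 ... 2013

        // With a mask that doesn't match the incoming text, parsing fails
        SimpleDateFormat wrongMask = new SimpleDateFormat("MM/dd/yyyy");
        try {
            wrongMask.parse("2013-01-01");
        } catch (ParseException e) {
            System.out.println("The mask must match the text: " + e.getMessage());
        }
    }
}
```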
There is also the opposite case: a Date type converted to a String type, such as that in the first Select values step of the earlier section. In this case, the format doesn't indicate how to interpret a given text, but which format to apply to the date when it is converted to a string. In other words, it indicates how the final string should look. It's worth mentioning a couple of things about this conversion. Let's explain this taking as an example the date January 31st, 2012:
- A format does not have to have all the pieces of a date. As an example, your format could be simply yyyy. With this format, your full date will be converted to a string with just four positions representing the year of the original date. In the given example, the date field will be converted to the string 2012.
- In case you don't specify a format, Kettle sets a default format. In the given example, this default will be 2012/01/31 00:00:00.000 (both cases are illustrated in the sketch after the following note).
Note
As we've already said, there are more combinations to define the format of a Date field. For a complete reference, check the Sun Java API documentation, located at: http://java.sun.com/javase/7/docs/api/java/text/SimpleDateFormat.html
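Here is a short sketch of the two cases above, using SimpleDateFormat once more. The yyyy/MM/dd HH:mm:ss.SSS mask is used only to reproduce the default output shown above; it is an illustrative assumption, not a setting you have to type in Kettle:

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;

public class PartialAndDefaultMaskSketch {
    public static void main(String[] args) {
        Calendar cal = Calendar.getInstance();
        cal.clear();                                 // zero out the time part
        cal.set(2012, Calendar.JANUARY, 31);         // January 31st, 2012

        // A mask with only some pieces of the date
        System.out.println(new SimpleDateFormat("yyyy").format(cal.getTime()));
        // prints: 2012

        // A mask that reproduces the default output mentioned above
        System.out.println(
            new SimpleDateFormat("yyyy/MM/dd HH:mm:ss.SSS").format(cal.getTime()));
        // prints: 2012/01/31 00:00:00.000
    }
}
```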
Pop quiz – generating data with PDI
For each of the following rowsets, can the data be generated with:
- A Data Grid step.
- A Generate Rows step.
- Any of the above options.

Have a go hero – experiencing different PDI steps
Taking as a starting point the transformation that you created in the first section, try implementing each of the examples provided in the section named Adding or modifying fields by using different PDI steps.
Have a go hero – generating a rowset with dates
Create a transformation that generates the following dataset:

The dataset has only one field of type String. There is one row for each Sunday, starting from January 06, 2013 and ending at December 29, 2013. The transformation in the last section can serve as a model.
The following are a few hints that will help you:
- For knowing how many rows to generate, do a little more math in the Calculator (a sketch of this math follows the list)
- For generating the dates, combine the use of the Clone row and Add sequence steps
- For generating the message, change the metadata of the date to a string using the proper format (if necessary, visit the format documentation), and then construct the string by using either a UDJE or a Calculator step
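If you want to check your row-count math, here is a rough sketch of just that part, under the assumption (true for this exercise) that both endpoint dates are Sundays; the rest of the transformation is left to you:

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class SundayCountSketch {
    public static void main(String[] args) throws Exception {
        SimpleDateFormat mask = new SimpleDateFormat("yyyy-MM-dd");
        Date firstSunday = mask.parse("2013-01-06");
        Date lastSunday  = mask.parse("2013-12-29");

        // Days between the two Sundays (rounded to absorb daylight-saving offsets),
        // divided by 7 to count the weeks, plus 1 to include the first Sunday itself
        long diffDays = Math.round((lastSunday.getTime() - firstSunday.getTime()) / 86400000.0);
        long sundays  = diffDays / 7 + 1;
        System.out.println(sundays);   // 52 rows to generate
    }
}
```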