data:image/s3,"s3://crabby-images/287e0/287e0937c40f2a687ad3f965bbe13cf62e3b6286" alt="Pentaho Data Integration Beginner's Guide(Second Edition)"
Time for action – creating a simple transformation and getting familiar with the design process
In this exercise, you will create a simple transformation that takes a list of projects with the start and end dates, and calculates the time that it took to complete each project.
- Start Spoon.
- From the main menu, navigate to File | New | Transformation.
- Expand the Input branch of the Steps tree. Remember that the Steps tree is located in the Design tab to the left of the work area.
- Drag-and-drop the Data Grid icon on the canvas.
- Double-click on the Data Grid icon and enter
projects
in the Step name field. - Fill in the grid as shown in the following screenshot:
- Click on the Data tab, and fill the grid as shown in the following screenshot:
- Click on Preview and in the small window that appears, click on OK.
- Oops! It seems that something went wrong. The following ERROR window appears:
- Great! Now you know what the error was: Kettle tried to convert the string
N/A
to a date. You can easily fix it: delete theN/A
value, leaving the cell empty. - Try the preview again. This time you should see a preview window with the six rows of data you typed into the grid. Then close the window.
- Now expand the Transform branch of steps. Look for the Calculator step and drag-and-drop it to the work area.
- Create a hop from the Data Grid step towards the Calculator step. Remember that you can do it using the mouseover assistance toolbar.
Note
Don't miss this step! If you do, the fields will not be available in the next dialog window.
- Double-click on the Calculator step and fill in the grid as shown in the following screenshot:
- Click on OK to close the window.
- Now add a new step: Number range. This step is also inside the Transform branch of steps.
Tip
If you have difficulty in finding a step, you can type the search criteria in the textbox on top of the Steps tree. Kettle will filter and show only the steps that match your search.
- Link the Calculator step to the Number range step with a new hop. Make sure that the arrow goes from the Calculator step towards the Number range step, and not the other way.
- Double-click on the Number range step and fill in the grid as shown in the following screenshot. Then click on OK:
- Finally, from the Scripting branch, add a User Defined Java Expression step.
- Create a hop from the Number range step towards this new step. When you create the hop, you will be prompted for the kind of hop. Select Main output of step:
Tip
If you unintentionally select the wrong option, don't worry. Right-click on the hop and a contextual menu will appear. Select Delete hop, and create the hop again.
- Double-click on the User Defined Java Expression step or UDJE for short, and fill in the grid as shown in the following screenshot:
- Click on OK to close the window. Your final transformation should look like the following screenshot:
- Now, let's do a preview at each step to see what the output in each case is. Let's start with the Calculator step. Select it and run a preview. You already know how to do it: click on the Preview icon in the transformation toolbar and then click on Quick Launch. You'll see the following screenshot:
- Something was wrong with the UDJE step! Well, we don't know what the error was, but don't worry. Let's do this, step-by-step. Click on the hop that leaves the Number range step to disable it. It will become light gray.
- Select the Calculator step and try the preview again. As you disabled the hop, the steps beyond it will not be executed and you will not have the error, but a grid with the following results:
- Now close the preview window, select the Number range step, and do a new preview. Again, you should see a grid with the results, but this time with a new column named
performance
. - Now it's time to see what was wrong with the User Defined Java Expression step. Enable the hop that you had disabled. You can do that by clicking on it again.
- Select the UDJE step and try to run a preview. You will see an error window telling you that there weren't any rows to display. Close the preview window and switch to the Logging tab. In the logging table, you will see the error. The message, however, will not be very verbose. You'll just see:
Errors detected!
- Let's try another option. Before we do that, save the transformation.
- Now run the transformation. You can do it by clicking on the Run icon on the transformation toolbar, or by pressing F9.
- Click on Launch.
- This time, the error is clear:
Please specify a String type to parse [java.lang.String] for field [duration] as a result of formula [(diff_dates == null)?"unknown":diff_dates + " days"]
- We forgot to specify the data types for the fields defined in the UDJE step. Fix the error by editing the step, and selecting
String
from theValue type
column for both fields:duration
andmessage
. - Close the window, make sure the UDJE step is selected, and run a final preview. The error should have disappeared and the window should display the final data:
What just happened?
You created a very simple transformation that performed some calculations on a set of dummy data.
- A Data Grid step allowed you to create the starting set of data to work with. In the Meta tab, you defined three fields: a string named
project_name
, and two dates,start_date
andend_date
. Then, in the Data tab, you were prompted to fill in a grid with values for those three fields. You filled in the grid with six rows of values. A handy Preview icon allowed you to see the defined dataset. - After that Data Grid step, you used a Calculator step. With this step, you created a new field:
diff_dates
. This field was calculated as Date A - Date B (in days). As you must have guessed, this function expected two parameters. You provided those parameters in theField A
andField B
columns of the configuration window. You also told Kettle that the new field should be an integer. In this case, you only created one field, but you can define several fields in the same Calculator step. - After that, you used a Number range step. This step simply creates a new field,
performance
, based on the value of a field coming from a previous step:diff_dates
. - Finally, you used a UDJE to create some informative messages:
duration
,performance
, andmessage
. As in the Calculator step, this step also allows you to create a new field per row. The main difference between both kind of steps is that while the Calculator step has a list of predefined formulas, the UDJE allows you to write your own expressions using Java code.
As you designed the transformation, you experimented with the preview functionality that allows you to get an idea of how the data is being transformed. In all cases, you were able to see the Execution Results window with the details of what was going on.
Besides that, you learned how to deal with errors that may appear as you create a transformation and test your work.
There is something important to note about the preview functionality you experimented with in this section.
Note
When you select a step for previewing, the objective is to preview the data as it comes out from that step. The whole transformation is executed unless you disable a part of it.
That is why you disabled the last step while running the preview of the Calculator step. By disabling it, you avoid the error that appeared the first time you ran the preview.
Note
Don't feel intimidated if you don't understand completely how the used steps work. By now, the objective is not to fully dominate Kettle steps, but to understand how to interact with Spoon. There will be more opportunities throughout the book to learn about the use of the steps in detail.
Getting familiar with editing features
Editing transformations with Spoon can be very time-consuming if you're not familiar with the editing facilities that the software offers. In this section, you will learn a bit more about two editing features that you already faced in the last section: using the mouseover assistance toolbar and editing grids.
Using the mouseover assistance toolbar
The mouseover assistance toolbar , as shown in the following screenshot, is a tiny toolbar that assists you when you position the mouse cursor over a step. You have already used some of its functionality. Here you have the full list of options.
data:image/s3,"s3://crabby-images/407be/407bed7e37af43f0c80b66166fe0ccf49708d791" alt=""
The following table explains each button in this toolbar:
data:image/s3,"s3://crabby-images/4d18f/4d18f1ef56c580d268898b5d42802d50c020cd01" alt=""
Depending on the kind of source step, you might be prompted for the kind of hop to create. For now, just select the Main output of step option just as you did in the section when created the last hop.
Working with grids
Grids are tables used in many instances in Spoon to enter or display information. You already edited grids in the configuration window of the Data Grid, Calculator, Number range, and UDJE steps.
Grids can be used for entering different kinds of data. No matter what kind of grid you are editing, there is always a contextual menu that you may access by right-clicking on a row. That menu offers editing options such as copy, paste, or move rows of the grid.
Tip
When the number of rows in the grid are more, use shortcuts! Most of the editing options of a grid have shortcuts that make editing easier and quicker.
You'll find a full list of shortcuts for editing grids in Appendix D, Spoon Shortcuts.
Understanding the Kettle rowset
Transformations deal with datasets, that is, data presented in a tabular form, where:
- Each column represents a field. A field has a name and a data type. The data type can be any of the common data types—
Number
(float),String
,Date
,Boolean
,Integer
, andBig number
—or can also be of typeSerializable
orBinary
.Note
In PDI 5, you will see two new and very interesting data types:
Internet Address
andTimestamp
. - Each row corresponds to a given member of the dataset. All rows in a dataset have the same structure, that is, all rows have the same fields, in the same order. A field in a row may be null, but it has to be present.
A Kettle dataset is called rowset. The following screenshot is an example of rowset. It is the result of the preview in the Calculator step:
data:image/s3,"s3://crabby-images/6e613/6e613d40f536a6c9f5d62caec1bc636583b0a9e6" alt=""
In this case, you have four columns representing the four fields of your rowset: project_name
, start_date
, end_date
, and diff_dates
. You also have six rows of data, one for each project.
As we've already said, besides a name, each field has a data type. If you move the mouse cursor over a column title and leave it there for a second, you will see a small pop up telling you the data type of that field. For a full detail of the structure of the dataset, there is another option: select the Calculator step and press the space bar. A window named Step fields and their origin will appear:
data:image/s3,"s3://crabby-images/be47c/be47c5387ea8ec4597b49ecdaee578d7ff88f345" alt=""
Tip
Alternatively, you could open this window from the contextual menu available in the mouseover assistance toolbar, or by right-clicking on the step. In the menu you have to select the Show output fields option.
In this window you don't only see the name and type of the fields, but also some extra columns, for example, the mask and the length of each field.
As the name of the option suggests, this is the description of the fields that leaves the step towards the following step.
If you selected Show input fields instead, you would see the metadata of the incoming data, that is, data that left the previous step.
One of the columns in these windows is Step origin
. This column gives the name of the step where each field was created or modified. It's easy to compare the input fields against the output fields of a step. For example, in the Calculator step you created the field diff_dates
. This field appears in the output fields of the step but not in the input list, as expected.
Looking at the results in the Execution Results pane
The Execution Results pane shows you what is happening while you preview or run a transformation. This pane is located below the canvas. If not immediately visible, it will appear when a transformation is previewed or run.
Tip
If you don't see this pane, you can open it by clicking on the last icon in the transformation toolbar.
The Logging tab
The Logging tab shows the execution of your transformation, step-by-step. By default, the level of the logging detail is Basic logging but you can choose among the following options: Nothing at all, Error logging only, Minimal logging, Basic logging, Detailed logging, Debugging, or Rowlevel (very detailed). This is how you change the log level:
- If you run a transformation, just select the proper option in drop-down list available in the Execute a transformation window besides the Log level label.
- If you preview a transformation, instead of clicking on Quick Launch, select Configure. The Execute a transformation window appears allowing you to choose the desired level.
You should choose the option depending on the level of detail that you want to see, from Nothing at all to Rowlevel (very detailed), which is the most detailed level of log. In most situations, however, you will be fine with the default value.
The Step Metrics tab
The Step Metrics tab shows, for each step of the transformation, the executed operations and several status and information columns. For us, the most relevant columns in this tab are:
data:image/s3,"s3://crabby-images/685fc/685fc61a0ea6778396331b382f558b5c2f22f20c" alt=""
Recall what you did when you were designing your transformation. When you first did a preview on the Calculator step, you got an error. Go back to the Time for action – creating a simple transformation and getting familiar with the design process section in this chapter and look at the screenshot that depicts that error. The line for the UDJE step in the Execution Results window shows one error. The Active
column shows that the transformation has been stopped. You can also see that the icon for that step changed: a red square indicates the situation.
When you ran the second preview on the Calculator step, you got the following results: in the Data Grid step the number of written rows is six which is the same as the number of read rows for the Calculator step. This step in turn writes 6 rows that travel toward the Number range step. The Number range step reads and also writes 6 rows, but the rows go nowhere because the hop that leaves the step was disabled. As a consequence, the next step, the UDJE, doesn't even appear in the window.
Finally, as we didn't work with files or databases, the Input
and Output
rows are zero in all cases.
Have a go hero – calculating the achieved percentage of work
Take as starting point the transformation you created in the earlier section, and implement the following:
- Add a new field named
estimated
. The field must be a number, and will represent the number of days that you estimated for finishing the project.Tip
Add the field in the Meta tab of the Data Grid step, and then the values in the Data tab. Do a preview to see that the field has been added as expected.
- Create a new field named
achieved
as the division between the amount of days taken to implement the project and the estimated time.Tip
You can use the same Calculator step that you used for calculating the
diff_dates
field. - Calculate the performance with a different criterion. Instead of deciding the performance based on the duration, calculate it based on the achieved percentage.
Tip
In order to craft a true percentage, provide proper values for the
Length
andPrecision
columns of the new field, as length gives the total number of significant figures, and precision provides the number of floating point digits. For detailed examples on the use of these properties, you can take a look at the section Numeric Fields in Chapter 4, Filtering, Searching, and Other Useful Operations. - Modify the messages that were created in the last step, so they show the new information.
Have a go hero - calculating the achieved percentage of work (second version)
Modify the transformation of the previous section. This time, instead of using the Calculator step, do the math with UDJE.
The following are a few hints that will help you with the UDJE step:
- Here you have a Java expression for calculating the difference (in days) between
dateA
anddateB
:(dateB.getTime() - dateA.getTime())/ (1000 * 60 * 60 * 24)
- In UDJE, the expressions cannot reference a field defined in the same step. You should use two different UDJE steps for doing that.