Pentaho Data Integration Beginner's Guide(Second Edition)
上QQ阅读APP看书,第一时间看更新

Time for action – browsing new features of PDI by copying a dataset

Before starting, let's introduce the Pentaho BI Platform Tracking site. At the Tracking site, you can see the current Pentaho roadmap and browse their issue-tracking system. The PDI page for that site is http://jira.pentaho.com/browse/PDI.

In this exercise, you will export the list of proposed new features for PDI, from the site and generate detailed and summarized files from that information.

Access to the main Pentaho Tracking site page is at: http://jira.pentaho.com.

Note

At this point, you may want to create a user ID. Logging is not mandatory, but beneficial if you want to create new issues or comment on existing ones.

  1. In the menu at the top of the screen, select Issues. A list of issues will be displayed.
  2. At the top, you will have several drop-down listboxes for filtering. Use them to select the following filters:
    • Project: Pentaho Data Integration - Kettle
    • Issue Type: New Feature
    • Status: Open
  3. As you select the filters, they are automatically applied and you can see the list of issues that match the filters:
  4. Above the list of search criteria, click on the Views icon and a list of options will be displayed. Among the options, select Excel (Current fields) to export the list to an Excel file.
  5. Save the file to the folder of your choice.
    Tip

    The Excel file exported from the JIRA website is a Microsoft Excel 97-2003 Worksheet. PDI does not recognize this version of worksheets. So, before proceeding, open the file with Excel or Calc and convert it to Excel 97/2000/XP.

  6. Create a transformation.
  7. Read the file by using a Microsoft Excel Input step. After providing the filename, click on the Sheets tab and fill it as shown in the following screenshot, so it skips the header rows and the first column:
  8. Click on the Fields tab and fill in the grid by clicking on the Get fields from header row... button.
  9. Click on Preview rows just to be sure that you are reading the file properly. You should see all the contents of the Excel file except the first column and the three heading lines.
  10. Click on OK.
  11. Add a Filter rows step to drop the rows where the Summary field is null. That is, the filter will be Summary IS NOT NULL.
  12. After the Filter rows step, add a Value Mapper step. When asked for the kind of hop, select Main output of step. Then fill the Value Mapper configuration window as shown in the following screenshot:
  13. After the Value Mapper step, add a Sort rows step and order the rows by priority_order (ascending), Summary (ascending).
  14. Select this last step and do a preview. You will see this:
    Note

    Take into account that the issues you see may not match the ones shown here as you derived your own source data from the JIRA system, and it changes all the time.

So far, you read a file with JIRA issues and after applying minor transformations, you got the dataset shown previously. Now it's time to effectively generate the detailed and summarized files as promised at the beginning of the section.

  1. After the Sort rows step, add a Microsoft Excel Output step, and configure it to send the priority_order and Summary fields to an Excel file named new_features.xls.
  2. Drag to the canvas a Group by step.
  3. Create a new hop from the Sort rows step to the Group by step.
  4. A warning window appears asking you to decide whether to copy or distribute rows. Click on Copy.
  5. The hops leaving the Sort rows step change to show you the decision you made. So far, you have this:
  6. Configure the Group by steps as shown in the following screenshot:
  7. Add a new Microsoft Excel Output step to the canvas, and create a hop from the Group by step to this new step.
  8. Configure the Microsoft Excel Output step to send the fields Priority and Quantity to an Excel file named new_features_summarized.xls.
  9. Save the transformation, and then run it.
  10. Verify that both files new_features.xls and new_features_summarized.xls have been created. The first file should look like this:
  11. And the second file should look like this:

What just happened?

After exporting an Excel file with the PDI new features from the JIRA site, you read the file and created two Excel files: one with a list of the issues and another with a summary of the list.

The first steps of the transformation are well known: read a file, filter null rows, map a field, and sort.

Note

Note that the mapping creates a new field to give an order to the Priority field, so that the more severe issues are first in the list while the minor priorities remain at the end of the list.

You linked the Sort rows step to two different steps. This caused PDI to ask you what to do with the rows leaving the step. By clicking on Copy, you told PDI to create a copy of the dataset. After that, two identical copies left the Sort rows step, each to a different destination step.

From the moment you copied the dataset, those copies became independent, each following its own way. The first copy was sent to a detailed Excel file. The other copy was used to create a summary of the fields, which then was sent to another Excel file.

Copying rows

At any place in a transformation, you may decide to split the main stream into two or more streams. When you do so, you have to decide what to do with the data that leaves the last step: copy or distribute.

To copy means that the whole dataset is copied to each of the destination steps. Once the rows are sent to those steps, each follows its own way.

When you copy, the hops that leave the step from which you are copying change visually to indicate the copy action.

In the section, you created two copies of the main dataset. You could have created more than two, like in this example:

When you split the stream into two or more streams, you can do whatever you want with each one as if they had never been the same stream. The transformations you apply to any of those output streams will not modify the data in the others.

Note

You should not assume a particular order in the execution of steps, due to its asynchronous nature. As the steps are executed in parallel and all the output streams receive the rows in sync, you don't have control over the order in which they are executed.

Have a go hero – recalculating statistics

Recall the Have a go heroSelecting the most popular of the official languages exercise Calculate some statistics about the spoken from Chapter 4, Filtering, Searching, and Performing Other Useful Operations with Data. You were told to create two transformations that calculated different statistics taking as starting point the same source data. Now, create a single transformation that does both tasks.

Distributing rows

As previously said, when you split a stream, you can copy or distribute the rows. You already saw that copy is about creating copies of the whole dataset and sending each of them to each output stream. To distribute instead means that the rows of the dataset are distributed among the destination steps. Let's see how it works through a modified exercise.