Pentaho Data Integration Beginner's Guide(Second Edition)
上QQ阅读APP看书,第一时间看更新

Time for action – refining the counting task by filtering even more

This is the second tutorial on filtering. As discussed in the previous tutorial, we have a plain file and want to know what kind of information is present in it. In the previous section, we listed and counted the words in the file. Now, we will apply some extra filters in order to refine our work.

  1. Open the transformation from the previous section.
  2. Add a Calculator step, link it to the last step, and calculate the new field len_word representing the length of the words. To do this use the calculator function Return the length of a string A. As Field A type or select word, and as Type select Integer.
  3. Expand the Flow category and drag another Filter rows step to the canvas.
  4. Link it to the Calculator step and edit it.
  5. Click on the <field> textbox and select counter.
  6. Click on the = sign, and select >.
  7. Click on the <value> textbox. A small window appears.
  8. In the Value textbox of the little window type 2.
  9. Click on OK.
  10. Position the mouse cursor over the icon at the upper right-hand corner of the window. When the text Add condition shows up, click on the icon, as shown in the following screenshot:
  11. A new blank condition is shown below the one you created.
  12. Click on null = [] and create the following condition: len_word>3, in the same way that you created the condition counter>2.
  13. Click on OK.
  14. The final condition looks like this:
  15. Close the window.
  16. Add one more Filter rows step to the transformation and create a hop from the last step toward this one.
  17. Configure the step in this way: at the left side of the condition select word, as comparator select IN LIST, and at the right side of the condition, inside the value textbox, type a;an;and;the;that;this;there;these.
  18. Click on the upper-left square above the condition, and the word NOT will appear. The condition will look as shown in the following screenshot:
  19. Add a Sort rows step, and sort the rows by counter descending.
  20. Add a Dummy step at the end of the transformation.
  21. With the Dummy step selected, preview the transformation. This is what you should see now:

What just happened?

This section was the second part of the two devoted to learning how to filter data. In the first part, you read a file and counted the words in it. In this section, you discarded the rows where the word was too short (length less than 4), or appeared just once (counter less than 3), or was too common (compared to a list you typed).

Once you applied all of those filters, you sorted the rows descending by the number of times a word appeared in the file, so you could see which the most frequent words were.

Scrolling down the preview window to skip some prepositions, pronouns, and other common words that have nothing to do with a specific subject, you found words such as shells, strata, formation, South, elevation, porphyritic, Valley, tertiary, calcareous, plain, North and rocks. If you had to guess, you would say that this was a book or article about Geology, and you would be right. The text taken for this exercise was from the book Geological Observations on South America by Charles Darwin.

Filtering rows using the Filter rows step

The Filter rows step allows you to filter rows based on conditions and comparisons.

The step checks the condition for every row, then applies a filter letting only the rows for which the condition is true pass. The other rows are lost.

Note

If you want to keep those rows, there is a way. You just will learn how to do it later in the book.

In the last two tutorials, you used the Filter rows step several times, so you already have an idea of how it works. Let's review it:

When you edit a Filter rows step, you have to enter a condition. This condition may involve one field, such as word IS NOT NULL. In this case, only the rows where the words are neither null nor with empty values will pass. The condition may involve one field and a constant value such as counter > 2. This filter allows only the rows with a word that is more than twice in the file to pass. Finally, the condition may have two fields, such as line CONTAINS word.

You can also combine conditions, as follows:

counter > 2
AND 
len_word>3

or even create sub-conditions, such as:

   (
   counter > 2
AND 
   len_word>3
   )
OR
   (word in list geology; sun)

In this example, the condition let the word geology pass even if it appears only once. It also let the word sun pass, despite its length.

When editing conditions, you always have a contextual menu which allows you to add and delete sub-conditions, change the order of existent conditions, and more.

Maybe you wonder what the Send 'true' data to step: and Send 'false' data to step: textboxes are for. Be patient, you will learn how to use them in Chapter 5, Controlling the Flow of Data.

As an alternative to this step, there is another step for the same purpose: it's the Java Filter step. You also find it in the Flow category. This step can be useful when your conditions are too complicated, and it becomes difficult or impossible to create them in a regular Filter rows step. This is how you use it: in the Java Filter configuration window, instead of creating the condition interactively, you write a Java expression that evaluates to true or false. As an example, you can replace the second Filter row step in the section with a Java Filter step, and in the Condition (Java expression) textbox type counter>2 && len_word > 3. The result would have been the same as with the Filter row step.

Have a go hero – playing with filters

Now it is your turn to try filtering rows. Modify the transformation you just created in the following way: add a sub-condition to avoid excluding some words, just like the one in the preceding example, word in list geology; sun. Change the list of words and test the filter to see that the results are as expected.