上QQ阅读APP看书，第一时间看更新

Chapter 3. Implementing SCD – Using Designer Screen Wizards

Slowly Changing Dimensions (SCD), as the name suggests, allows you to maintain changes in the Dimension table in Datawarehouse. Before you read this chapter, make sure that you have a complete understanding of data warehousing concepts, especially SCD. Also, make sure you know the SCD1, SCD2, and SCD3 types. For your reference, we have described each SCD in detail in this chapter. For more details, refer to The Data Warehouse Toolkit, Ralph Kimball, Wiley India Private Limited. Before we move ahead with the implementation of the SCD in Informatica Power Center, let's discuss the different types of SCDs.

Note that we are talking about the general SCDs in our discussion, that is, SCD1, SCD2, and SCD3. Apart from these, there will always be Hybrid SCDs, which you will come across as well. Hybrid SCDs are nothing but a combination of multiple SCDs that serve your complex business requirements.

The various types of SCDs are described as follows:

Type 1 Dimension mapping (SCD1): This keeps only the current data and does not maintain historical data.

Note

Use SCD1 mapping when you do not want to keep the history of the previous data.
Type 2 Dimension/Version Number mapping (SCD2): This keeps current as well as historical data in the table. SCD2 allows you to insert new records and changed records using a new column (PM_VERSION_NUMBER) by maintaining the version number in the table to track the changes. We use a new PM_PRIMARYKEY column to maintain the history.

Note

Use SCD2 mapping when you want to keep the full history of the dimension data and track the progression of changes using a version number.
Type 2 Dimension/Flag mapping: This keeps the current as well as historical data in the table. SCD2 allows you to insert new records and changed records using a new column (PM_CURRENT_FLAG) by maintaining the flag in the table to track the changes. We use a new PRIMARY_KEY column to maintain the history.

Note

Use SCD2 mapping when you want to keep the full history of dimension data and track the progression of changes using a flag.
Type 2 Dimension/Effective date range mapping: This keeps current as well as historical data in the table. SCD2 allows you to insert new records and changed records using two new columns (PM_BEGIN_DATE and PM_END_DATE) by maintaining the date range in the table to track the changes. We use a new PRIMARY_KEY column to maintain the history.

Note

Use SCD2 mapping when you want to keep the full history of dimension data and track the progression of changes using Start Date and End Date.
Type 3 Dimension mapping: This keeps the current as well as historical data in the table. We maintain only partial history by adding a new PM_PREV_COLUMN_NAME column; that is, we do not maintain full history.

Note

Use SCD3 mapping when you wish to maintain only partial history.

Let's take an example to understand the different SCDs.

Consider that there is a LOCATION column in the EMPLOYEE table and you wish to track the changes in the location of the employees. Consider a record for the 1001 employee ID that is present in your EMPLOYEE dimension table. STEVE was initially working in India and then was shifted to USA. We want to maintain the history in the LOCATION field.

Your datawarehouse table should reflect the current status of STEVE. To implement this, we have different types of SCDs.

Take a look at the following table of type SCD1:

As you can see, INDIA will be replaced with USA, so we end up having only current data, and we lose historical data. Now, if STEVE is again shifted to JAPAN, the LOCATION data will be replaced from USA to JAPAN, as follows:

The advantage of SCD1 is that we do not consume a lot of space to maintain the data; the disadvantage is we don't have the historical data.

Take a look at the following table of type SCD2, where we have added the version number:

As you can see, we are maintaining the full history by adding new records to maintain the history of the previous records. We add two new columns in the table, that is, PM_PRIMARYKEY to handle the issues of duplicate records in the primary key in the EMPLOYEE_ID (supposed to be the primary key) column, and PM_VERSION_NUMBER to understand the current and historical records.

The following SCD2 table has the flag column added to it:

As you can see, we are maintaining the full history by adding new records to maintain the history of the previous records. We add two new columns in the table, that is, PM_PRIMARYKEY to handle the issues of duplicate records in the primary key in the EMPLOYEE_ID column, and PM_CURRENT_FLAG to understand the current and history record.

Again, if STEVE is shifted, the data would look like this:

The following table of type SCD2 shows you the data range added to it:

As you can see, we are maintaining the full history by adding new records to maintain the history of the previous records. We add three new columns in the table, that is, PM_PRIMARYKEY to handle the issues of duplicate records in the primary key in the EMPLOYEE_ID column and PM_BEGIN_DATE and PM_END_DATE to understand the versions in the data.

The advantage of SCD2 is that you have the complete history of the data, which is a must for data warehouses, whereas the disadvantage of SCD2 is that it consumes a lot of space.

Take a look at the following SCD3 table:

As you can see, we are maintaining the history by adding a new column to maintain the history. An optional PM_PRIMARYKEY column can be added to maintain the primary key constraints. We add a new PM_PREV_LOCATION column in the table to store the changes in the data. As you can see, we added a new column to store data as against SCD2, where we added rows to maintain the history.

If STEVE is now shifted to JAPAN, the data changes to:

As you can see, we lost INDIA from the data warehouse, and that is why we say we are maintaining partial history.

Note

To implement SCD3, decide how many versions of a particular column you wish to maintain. Based on this, the columns will be added in the table.

SCD3 is best when you are not interested in maintaining the complete history but are interested in maintaining only partial history. The drawback of SCD3 is that it doesn't store the full history.

At this point, you should be very clear with the different types of SCD. We need to practically implement these concepts in Informatica PowerCenter. Informatica PowerCenter provides a utility called wizard to implement the SCD. Using this wizard, you can easily implement any SCD. In the upcoming chapters, we will learn how to use the wizard to implement SCD1, SCD2, and SCD3.

Before you proceed to the next section, make sure you have a proper understanding of the transformations in Informatica PowerCenter. You should be clear about the source qualifier, expression, filter, router, lookup, update strategy, and sequence generator transformations. The wizard creates a mapping using all these transformations to implement the SCD functionality.

When we implement SCD, there will be some new records that will need to be loaded into the target table, and there will be some existing records for which we need to maintain the history.

Note

The record that appears for the first time in the table will be referred to as the NEW record, and the record for which we need to maintain the history will be referred to as the CHANGED record. Based on the comparison of the source data with the target data, we will decide which one is the NEW record and which one is the CHANGED record.

To start, we will use a sample file as our source and an Oracle table as our target in order to implement SCDs. Before we implement an SCD, let's talk about the logic that will serve our purpose, and then we will fine-tune the logic for each type of SCD:

Extract all records from the source.
Look up at the target table and cache all the data.
Compare the source data with the target data to flag the NEW and CHANGED records.
Filter the data based on the NEW and CHANGED flags.
Generate the primary key for every new row inserted into the table.
Load the NEW record into the table and update the existing record, if needed.

Based on the specific SCD, the preceding logic will be modified to a certain extent.