Chapter 3. Implementing SCD – Using Designer Screen Wizards
Slowly Changing Dimensions (SCD), as the name suggests, allows you to maintain changes in the Dimension
table in Datawarehouse
. Before you read this chapter, make sure that you have a complete understanding of data warehousing concepts, especially SCD. Also, make sure you know the SCD1, SCD2, and SCD3 types. For your reference, we have described each SCD in detail in this chapter. For more details, refer to The Data Warehouse Toolkit, Ralph Kimball, Wiley India Private Limited. Before we move ahead with the implementation of the SCD in Informatica Power Center, let's discuss the different types of SCDs.
Note that we are talking about the general SCDs in our discussion, that is, SCD1, SCD2, and SCD3. Apart from these, there will always be Hybrid SCDs, which you will come across as well. Hybrid SCDs are nothing but a combination of multiple SCDs that serve your complex business requirements.
The various types of SCDs are described as follows:
- Type 1 Dimension mapping (SCD1): This keeps only the current data and does not maintain historical data.
- Type 2 Dimension/Version Number mapping (SCD2): This keeps current as well as historical data in the table. SCD2 allows you to insert new records and changed records using a new column (
PM_VERSION_NUMBER
) by maintaining the version number in the table to track the changes. We use a newPM_PRIMARYKEY
column to maintain the history. - Type 2 Dimension/Flag mapping: This keeps the current as well as historical data in the table. SCD2 allows you to insert new records and changed records using a new column (
PM_CURRENT_FLAG
) by maintaining the flag in the table to track the changes. We use a newPRIMARY_KEY
column to maintain the history. - Type 2 Dimension/Effective date range mapping: This keeps current as well as historical data in the table. SCD2 allows you to insert new records and changed records using two new columns (
PM_BEGIN_DATE
andPM_END_DATE
) by maintaining the date range in the table to track the changes. We use a newPRIMARY_KEY
column to maintain the history. - Type 3 Dimension mapping: This keeps the current as well as historical data in the table. We maintain only partial history by adding a new
PM_PREV_COLUMN_NAME
column; that is, we do not maintain full history.
Let's take an example to understand the different SCDs.
Consider that there is a LOCATION
column in the EMPLOYEE
table and you wish to track the changes in the location of the employees. Consider a record for the 1001
employee ID that is present in your EMPLOYEE
dimension table. STEVE
was initially working in India and then was shifted to USA. We want to maintain the history in the LOCATION
field.
Your datawarehouse
table should reflect the current status of STEVE
. To implement this, we have different types of SCDs.
Take a look at the following table of type SCD1:
As you can see, INDIA
will be replaced with USA
, so we end up having only current data, and we lose historical data. Now, if STEVE
is again shifted to JAPAN
, the LOCATION
data will be replaced from USA
to JAPAN
, as follows:
The advantage of SCD1 is that we do not consume a lot of space to maintain the data; the disadvantage is we don't have the historical data.
Take a look at the following table of type SCD2, where we have added the version number:
As you can see, we are maintaining the full history by adding new records to maintain the history of the previous records. We add two new columns in the table, that is, PM_PRIMARYKEY
to handle the issues of duplicate records in the primary key in the EMPLOYEE_ID
(supposed to be the primary key) column, and PM_VERSION_NUMBER
to understand the current and historical records.
The following SCD2 table has the flag column added to it:
As you can see, we are maintaining the full history by adding new records to maintain the history of the previous records. We add two new columns in the table, that is, PM_PRIMARYKEY
to handle the issues of duplicate records in the primary key in the EMPLOYEE_ID
column, and PM_CURRENT_FLAG
to understand the current and history record.
Again, if STEVE
is shifted, the data would look like this:
The following table of type SCD2 shows you the data range added to it:
As you can see, we are maintaining the full history by adding new records to maintain the history of the previous records. We add three new columns in the table, that is, PM_PRIMARYKEY
to handle the issues of duplicate records in the primary key in the EMPLOYEE_ID
column and PM_BEGIN_DATE
and PM_END_DATE
to understand the versions in the data.
The advantage of SCD2 is that you have the complete history of the data, which is a must for data warehouses, whereas the disadvantage of SCD2 is that it consumes a lot of space.
Take a look at the following SCD3 table:
As you can see, we are maintaining the history by adding a new column to maintain the history. An optional PM_PRIMARYKEY
column can be added to maintain the primary key constraints. We add a new PM_PREV_LOCATION
column in the table to store the changes in the data. As you can see, we added a new column to store data as against SCD2, where we added rows to maintain the history.
If STEVE
is now shifted to JAPAN
, the data changes to:
As you can see, we lost INDIA
from the data warehouse, and that is why we say we are maintaining partial history.
SCD3 is best when you are not interested in maintaining the complete history but are interested in maintaining only partial history. The drawback of SCD3 is that it doesn't store the full history.
At this point, you should be very clear with the different types of SCD. We need to practically implement these concepts in Informatica PowerCenter. Informatica PowerCenter provides a utility called wizard to implement the SCD. Using this wizard, you can easily implement any SCD. In the upcoming chapters, we will learn how to use the wizard to implement SCD1, SCD2, and SCD3.
Before you proceed to the next section, make sure you have a proper understanding of the transformations in Informatica PowerCenter. You should be clear about the source qualifier, expression, filter, router, lookup, update strategy, and sequence generator transformations. The wizard creates a mapping using all these transformations to implement the SCD functionality.
When we implement SCD, there will be some new records that will need to be loaded into the target table, and there will be some existing records for which we need to maintain the history.
Note
The record that appears for the first time in the table will be referred to as the NEW
record, and the record for which we need to maintain the history will be referred to as the CHANGED
record. Based on the comparison of the source data with the target data, we will decide which one is the NEW
record and which one is the CHANGED
record.
To start, we will use a sample file as our source and an Oracle table as our target in order to implement SCDs. Before we implement an SCD, let's talk about the logic that will serve our purpose, and then we will fine-tune the logic for each type of SCD:
- Extract all records from the source.
- Look up at the target table and cache all the data.
- Compare the source data with the target data to flag the
NEW
andCHANGED
records. - Filter the data based on the
NEW
andCHANGED
flags. - Generate the primary key for every new row inserted into the table.
- Load the
NEW
record into the table and update the existing record, if needed.
Based on the specific SCD, the preceding logic will be modified to a certain extent.