Thursday, April 30, 2015

Window function: Rows between Preceding and Preceding in Teradata

As continuity of earlier post Maintain Historical and Current data in data Warehouse, I am going to write how to implement it in Data Warehouse.

First we create a table that holds Customer information, and assume the table ‘CUSTOMER’ is part of OLTP database. And table doesn’t have any column with END Date. It has Effective date(EFF_DT).

create volatile table Customer
Customer_id int,
CUSTOMER_NAME Varchar(120),
Address VArchar(200),
EFF_DT date


And Insert data into the table.

INSERT INTO Customer VALUES ( 101, 'PAUL','Bondhev 50', date '2015-01-03' );
INSERT INTO Customer VALUES ( 101,'PAUL','Bondhev 51', date '2015-03-01'  );
INSERT INTO Customer VALUES ( 101,'PAULA','Bondhev 51', date '2015-04-01');
INSERT INTO Customer VALUES ( 101, 'PAULA','Bondhev 52', date '2015-04-02');
INSERT INTO Customer VALUES ( 101, 'PAULAS','Bondhev 52', date '2015-04-10' );
INSERT INTO Customer VALUES ( 102, 'MADS',Borupvang 52', date '2015-01-03' );
INSERT INTO Customer VALUES ( 103, 'LARS',Hellerup 52', date '2015-04-01' );

INSERT INTO Customer VALUES ( 202, 'DAN','Oxford street 30', date '2014-04-10' );
INSERT INTO Customer VALUES ( 202, 'DANS','Montix road 52', date '2014-04-10');
INSERT INTO Customer VALUES ( 202,'DANS','boulevard 52', date '2014-04-10' );

INSERT INTO Customer VALUES ( 202, 'DANSA','boulevard 52', date '2014-04-20' );

Data look like below table:

Fig 1: Customer table in OLTP database

As we discussed in earlier post that when we load data from OLTP database to data warehouse then we add END_DT at the time of loading data.

From the above data, customer id 101 has more than one entry and the customer changes his name and address a few times.  To add END DATE you need to do following:

  Find out max EFF_DT and set that as previous row's  END_DT

e.g. Customer id 101 data is arranged in descending order like below, lowest EFF_DT is 03-01-2015.

Fig 2: End date manages in Data Warehouse

We need to look at EFF_DT of previous row which is 01-03-2015 and make that as END_DT for the row.

And we do same way for other rows to find out END_DT.

SQL Code is like below to implement above:

                                Fig 3: Preceding and Preceding to generate End Date

“ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING” is used in an ordered analytical function to tell it to include only preceding row in the partition in the calculation being performed.

If you look at the above data (Fig: 2), most updated row EFF_DT =10-04-2015 doesn’t have any end date. Since the window function (Fig: 3) we have used to get previous row which did not find any previous row. So it return NULL.

We can use COALESEC to replace NULL and put unlimited date ‘9999-12-31’ to present as active row.

Select CUSTOMER.*,
COALESCE(MAX(EFF_DT) OVER ( PARTITION BY Customer_id  ORDER BY EFF_DT desc ROWS BETWEEN 1 preceding AND 1 preceding ),date '9999-12-31') as END_DT


Final output should look like below:

Fig 4: Latest row updated with ever ending date

Sunday, April 19, 2015

Maintain Historical and Current data in data Warehouse

When you want to keep history in Data Warehouse then need to manipulate END_DT along with EFF_DT in the table to find out active rows and inactive rows.
For example source table Customer look like below:

                                       Fig: 1.1 Source table ‘Customer’

The above source table doesn’t have any end date but when you will get the date from source to target you need to add end date. As we can see from the above data, we have two customers and each customer has many rows, sometimes they changes address and sometimes name.

When you get data into target table only one row should  have active for one customer by saying end_date as ‘9999-12-31’ and all other rows for the customer will have end_date as previous row’s EFF_DT. The data should look like below:

Fig: 1.2 How End_date works in  ‘Customer’ table

Why we care about End date (END_DT) in data warehouse?

In data warehouse initially we take full load and then we take delta (each day’s data) every day.  You may ask, can’t we do exact copy from source table?  Do we need to add End Date (END_DT)??!!
And simple answer of your question will be YES and one of the reasons of having data warehouse is isolate active data from the historical data by putting END_DT.  

Take an example, if you want to make query and get the latest attributes of customer_id=101 then how will you find the latest information of the customer. If you make query like below :

Select * from Customer
where customer_id=101;

--Result: you will find all five rows like above

And to get latest information of the customer ‘PAUL’, please run below query:

Select * from Customer
Where customer_id=101 and date between EFF_DT and END_DT


Bondhev 52

             Fig 1.3: Find out the latest information of the customer

Without having END_DT your data warehouse will behave same like as OLTP database where you have to make complex query to find out the latest information.

Now we will find out how we can populate END_DT from EFF_DT by using Window function: Rows between 1 preceding and 1 preceding (Coming in next post.)