Saturday, December 19, 2015

How to change the CSV format from comma (,) delimited to pipe (|) delimited?

By default, when you save an Excel file as .csv, it is saved as a comma-separated file.
For example, suppose you have data in Excel that looks like this:

Fig 1: Original format of data

Let's save this in .csv format:

Fig 2: Save as Comma separated file

After saving, the file's data will look like this:

Fig 3: Data after converting into CSV 

But what if you need the file pipe delimited instead of comma delimited? Follow the steps below to save the file as pipe delimited.

If you are using Windows 10, open Settings and you will see the screen below:


Fig 4: Settings in Windows 10

Click the "Time and Language" option and you will see the screen below, where you need to click "Additional time & regional settings".


Fig 5: Related Setting area
You will then see the following option:

Fig 6: The next option to click
Click "Change date, time and number formats" under the Region section; the window below (left) will appear. Then click "Additional settings" and the window on the right will pop up.

Fig 7: Additional settings
Finally, change the List separator from comma (,) to pipe (|).

Fig 8: List separator
Now, if you save the same Excel file as CSV, it will look like this:

Goal achieved. However, remember to change the setting back if you later want your data comma delimited.

Thursday, November 12, 2015

MS SQL: How to insert data into a table from Excel

In one of my previous posts I wrote about how to insert data from an Excel/CSV file into a Teradata table (URL: http://pauldhip.blogspot.dk/2015/02/insert-data-from-excel-to-terdata-table.html). This time, I am going to explain how to do the same in MS SQL.

Let's say we have a table in the database named Customer_dim, which has the following columns:

CREATE TABLE [dbo].[Customer_dim](
[Customer_id] [nchar](10) NULL,
[Customer_First_Name] [nchar](10) NULL,
[Customer_Last_Name] [nchar](10) NULL,
[Martial_Status] [nchar](10) NULL
) ON [PRIMARY]

The table needs to be filled with data from Excel. To do so, follow these steps:


Step 1: Save MS excel file into CSV
The Excel file should have the same number of columns as the table.

Fig 1: Data in the excel

Now save the Excel data as a .csv file.
Fig 2: Save as .csv

Step 2: Write code to insert data from CSV

BULK INSERT [dbo].[Customer_dim] FROM 'C:\BI Projects\Blog\Dim_Customer_CSV.csv'
WITH (
    CHECK_CONSTRAINTS,
    DATAFILETYPE='widechar',
    FIELDTERMINATOR='|',
    ROWTERMINATOR='\n',
    KEEPIDENTITY,
    TABLOCK
);

Since my CSV file is saved pipe delimited, the SQL code above has FIELDTERMINATOR='|'. Change that to match your CSV format, and change the CSV file location accordingly.
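Note that a CSV saved from Excel often includes a header row. BULK INSERT has a FIRSTROW option for that case; below is a variant of the statement above under that assumption (same file path as before):

-- Variant of the statement above for a CSV whose first line is a header row
BULK INSERT [dbo].[Customer_dim]
FROM 'C:\BI Projects\Blog\Dim_Customer_CSV.csv'
WITH (
    FIRSTROW = 2,               -- skip the header line
    DATAFILETYPE = 'widechar',
    FIELDTERMINATOR = '|',
    ROWTERMINATOR = '\n',
    TABLOCK
);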

After running the SQL code above, you should be able to see the records below inserted into the table.

Fig 3: Result set after inserting the data



Sunday, October 18, 2015

Teradata: How to check that DDLs (table structures) are in sync among different environments (Development/Test/Production)?

The title of this post is best explained with an example. Suppose your project involves populating data from source to target for ten (10) new tables. You then have to create the DDL for those ten tables in at least three different environments (Development/Test/Production). You start in the development environment, develop the jobs and populate data into the target area; once everything works, you move the DDL from Development to Test. If a tester or the business finds that you need to add a new column or change a data type, you have to make that change. Moreover, if your go-live date is close, the same DDL that went into Test may also be deployed to Production. This means that for any change caught by a tester or by you, you now need to change the SQL in three different environments to keep them consistent. Of course, you follow a change order to keep track of that, but as a developer you will want to verify that the table structures in all three environments look the same.

If your database is Teradata, you can compare tables, their columns, and the columns' data types across environments, so you can quickly find any inconsistency between them.

The code below will help you find the differences between the Test and Production environments.

select
upper(trim(tablename)) Table_name,
upper(trim(columnname)) Column_name,
upper(trim(ColumnFormat)) Column_format
from dbc.columns
where upper(trim(tablename)) in ('CUSTMER', 'CUST_ADDREESS', 'GEOGRAPHY', 'ADDRESS_X_GEO', 'CUSTOMER_X_PRODUCT', 'CUSTOMER_X_SALES')
and upper(trim(databasename))='DWH_T'
except
select
upper(trim(tablename)) Table_name,
upper(trim(columnname)) Column_name,
upper(trim(ColumnFormat)) Column_format
from dbc.columns
where upper(trim(tablename)) in ('CUSTMER', 'CUST_ADDREESS', 'GEOGRAPHY', 'ADDRESS_X_GEO', 'CUSTOMER_X_PRODUCT', 'CUSTOMER_X_SALES')
and upper(trim(databasename))='DWH_P';
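Note that EXCEPT is one-directional: the query above only reports rows that exist in Test but not in Production. If you also want to see what exists only in Production, a sketch like the one below labels the differences from both directions (add your IN (...) table list to each WHERE clause, as in the query above):

-- Columns present only in Test, followed by columns present only in Production
SELECT 'Only in DWH_T' AS Found_In, Table_name, Column_name, Column_format
FROM (
  SELECT UPPER(TRIM(tablename)) AS Table_name,
         UPPER(TRIM(columnname)) AS Column_name,
         UPPER(TRIM(ColumnFormat)) AS Column_format
  FROM dbc.columns
  WHERE UPPER(TRIM(databasename)) = 'DWH_T'
  EXCEPT
  SELECT UPPER(TRIM(tablename)), UPPER(TRIM(columnname)), UPPER(TRIM(ColumnFormat))
  FROM dbc.columns
  WHERE UPPER(TRIM(databasename)) = 'DWH_P'
) t
UNION ALL
SELECT 'Only in DWH_P', Table_name, Column_name, Column_format
FROM (
  SELECT UPPER(TRIM(tablename)) AS Table_name,
         UPPER(TRIM(columnname)) AS Column_name,
         UPPER(TRIM(ColumnFormat)) AS Column_format
  FROM dbc.columns
  WHERE UPPER(TRIM(databasename)) = 'DWH_P'
  EXCEPT
  SELECT UPPER(TRIM(tablename)), UPPER(TRIM(columnname)), UPPER(TRIM(ColumnFormat))
  FROM dbc.columns
  WHERE UPPER(TRIM(databasename)) = 'DWH_T'
) p;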


It is sometimes challenging to verify that the table structures (DDL) for a particular project are the same across the data warehouse's environments, and an inconsistent table structure, especially between Test and Production, can lead to ETL job failures in Production.

As a precaution, running the code above lets you verify whether your Production DDL differs from the Test DDL against which your ETL job ran fine, so you can fix the DDL before the jobs fail in Production.

Saturday, September 26, 2015

Developing jobs in SAS DI Studio: What is SQL Pass-Through and why would you use it?

If you are an ETL developer using SAS DI Studio as your ETL tool, this post may interest you. The SQL Pass-Through Facility is part of the SQL procedure; it enables a developer to send DBMS-specific statements to a DBMS and retrieve data from the database.
When you use the Pass-Through Facility, you can write the database's own SQL syntax instead of SAS SQL syntax.

What is SQL Pass Through?
When you build a job in SAS DI Studio and use PROC SQL to operate on the database, your SQL statements are sent directly to the database, all the processing happens there, and only the final result is brought back to SAS. To activate this feature, set the SAS DI Studio property SQL Pass Through to 'Yes'.

Fig 1: How to activate SQL Pass through



Why would you use SQL Pass-Through?

1) It's handy when you use database-specific functions like QUALIFY, RANK, etc. (see the sketch after this list).
2) If your SQL query contains calculations, they run in the database, which is faster.
3) Very little processing time is spent on the SAS side.
4) If you are using stored procedures or SQL macros in your ETL job, SQL Pass-Through will surely pay off.
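To make point 1 concrete, here is a minimal sketch of a Teradata-only query (the table and column names are illustrative). QUALIFY is not part of SAS SQL, so a query like this only works when it is passed through to the database:

-- Latest order per customer; QUALIFY filters directly on the window function
SELECT customer_id, order_dt, order_amount
FROM order_fact
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id
                           ORDER BY order_dt DESC) = 1;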




Thursday, August 20, 2015

PERIOD: A useful data type in Teradata

Let's start with an example where a table is defined with a column of type PERIOD. The PERIOD data type in Teradata can hold two dates or two timestamps, and it can play a big role in easing your date and time calculations. The example below shows university students' enrollment history with sample data.

We create a table where the column 'Enrolled_period' is defined as a PERIOD type.


CREATE MULTISET TABLE person_University_period (
  person_id        INTEGER                          NOT NULL,
  University_Name  VARCHAR(40)                      NOT NULL,
  Enrolled_period  PERIOD(DATE) FORMAT 'YYYY-MM-DD' NULL
)
PRIMARY INDEX (person_id);

And insert some dummy data into the table:

Insert into TD8245.person_University_period
Values(24568,'Gothenburg University', PERIOD(DATE '2007-08-09',DATE '2009-08-25'));

Insert into TD8245.person_University_period
Values(24569,'Gothenburg University', PERIOD(DATE '2009-07-09',DATE '2011-06-25'));


Data looks like below:

Fig 1: Students' enrollment data

From the table above we can see that Enrolled_period holds a start date and an end date. Let's look at the different types of operations you can perform on the Enrolled_period field.

For example, you may want to know for how many months, years, or hours a student has been enrolled at the university. You don't need any formula to find that out; PERIOD can derive months, years, or days from the 'Enrolled_period' column. The syntax below finds the number of months the students have studied.

Select INTERVAL(Enrolled_period) MONTH as enrolled_month from person_University_period



Fig 2: Number of months studied by students


In the same way you can calculate the interval in days, hours, minutes, and seconds.

From the 'Enrolled_period' column you may need to extract the start date; the syntax goes like this:

select begin(Enrolled_period) as Enrolled_dt from person_University_period


And to find the end date, the SQL is:

select END(Enrolled_period) as Enrolled_dt from person_University_period
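PERIOD columns also work with the OVERLAPS predicate, which is handy for range questions. For example, a sketch against the same table to find everyone who was enrolled at some point during 2009:

-- Students whose enrollment period overlaps calendar year 2009
SELECT person_id, University_Name
FROM person_University_period
WHERE Enrolled_period OVERLAPS PERIOD(DATE '2009-01-01', DATE '2010-01-01');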


The PERIOD type is also helpful in banking operations; e.g., when you open a bank account, there should be an account opening date and a closing date. Generally we create two fields in the database table to serve this purpose: an effective date and an end date.

Instead of using two fields in the table, we can use one column that holds both the start and end dates, and calculations on the two dates or timestamps become easier.

As an example:
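The post doesn't show the Account table's DDL, so here is a minimal sketch it might use (the column names and types are assumed from the INSERT that follows):

-- Assumed DDL for the Account table; names and types are illustrative
CREATE MULTISET TABLE Account (
  Account_number INTEGER NOT NULL,
  Account_Name   VARCHAR(40),
  Account_Period PERIOD(DATE) FORMAT 'YYYY-MM-DD'
)
PRIMARY INDEX (Account_number);

With a table like that in place, the insert looks like this: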

INSERT INTO Account
VALUES (
  21222250,
  'Paul Sandal',
  PERIOD(
    CURRENT_DATE,
    UNTIL_CHANGED
  )
);

Result will look like below:

Account_number  Account_Name  Account_Period
21222250        Paul Sandal   ('2015-08-20', '9999-12-31')

Fig 3: Date with CURRENT_DATE and UNTIL_CHANGED

Here, UNTIL_CHANGED sets the end date to '9999-12-31'. If the account is later updated for any reason, the date '9999-12-31' is replaced with the date on which the account was updated.

Saturday, August 1, 2015

Teradata DBC : Easy way to get DDL of views, tables and macros!!

As a database developer you may need to back up the DDL of the views, macros, and tables in a particular database you are working on. Or, when your development work is done, you may need to move table/macro structures (DDL) from the development environment to Test, and from Test to Production.

In those cases, how do you move DDL from the Development environment to Test, and from Test to Production?

You could copy the DDL from the database one object at a time, paste it into a document, and forward the document to the DBA to execute in the Test or Production environment.

Copying the DDL of a table manually looks like this:

Fig 1: Manually copy DDL one by one


If you have twenty (20) tables, 10 macros, and 5 views, you have to repeat the action above 35 times, which is time-consuming, error-prone, and tedious.

So what is the better way to do it?

You can use DBC to get the DDL of the views, macros, and tables in a particular database with the SQL below:

select * from dbc.tables where databasename='Customer_Service'
order by TableKind desc;


The result looks like this:

Fig 2: DBC result set

In the result set above (Fig 2), you will find a column named RequestText that holds the DDL of every object in the database.

To find the DDL for tables, the SQL looks like this:

Select RequestText from dbc.tables where tablekind='T' and databasename='Customer_Service'

tablekind='T' means table
tablekind='V' means view
tablekind='M' means macro

By changing the tablekind value you can get the DDL for the different object types in the database.
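Putting it all together, a single query can return the DDL for tables, views, and macros at once, labeled by object type:

-- DDL text for every table, view and macro in one result set
SELECT DatabaseName,
       TableName,
       CASE TableKind
         WHEN 'T' THEN 'Table'
         WHEN 'V' THEN 'View'
         WHEN 'M' THEN 'Macro'
       END AS Object_Type,
       RequestText AS DDL_Text
FROM dbc.tables
WHERE DatabaseName = 'Customer_Service'
  AND TableKind IN ('T', 'V', 'M')
ORDER BY TableKind DESC, TableName;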

Saturday, July 11, 2015

SET and MULTISET Table in Teradata

After moving into the Teradata world, I discovered a few challenges; one of them was not knowing about the SET and MULTISET concepts in Teradata. My story goes like this:

I was building an ETL job that joined three different tables and inserted the result into a target table. The SELECT over the three joined tables returned 8,000 rows; however, after inserting them into the target table I found only around 5,000 rows, which puzzled me.

After some investigation, I found that the Teradata table looked like this:

CREATE TABLE INTEREST_MAP
(
     INT_ID        INTEGER,        -- the original post omitted the data types;
     BANK_NAME     VARCHAR(40),    -- the types shown here are illustrative only
     INT_RATE      DECIMAL(5,2),
     EFF_DT        DATE,
     RATING_SCALE  VARCHAR(10)
)
PRIMARY INDEX (INT_ID);

 Fig 1: DDL without MULTISET


If the table is created this way, it cannot accept duplicate rows. My SELECT statement returned 8,000 rows, of which 3,000 were duplicates that could not be inserted into the table.

What is the solution?

The table DDL should look like the one below so that it can accept duplicate rows; MULTISET has been added to the DDL.

CREATE MULTISET TABLE INTEREST_MAP
(
     INT_ID        INTEGER,        -- illustrative data types, as in Fig 1
     BANK_NAME     VARCHAR(40),
     INT_RATE      DECIMAL(5,2),
     EFF_DT        DATE,
     RATING_SCALE  VARCHAR(10)
)
PRIMARY INDEX (INT_ID);

Fig 2: DDL with MULTISET

Now I am able to insert all 8,000 rows into the table. Lesson learned: without MULTISET in your DDL you will not be able to insert duplicate rows.
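A tiny demonstration of the difference (the values are illustrative only):

-- Against the MULTISET table above, both inserts succeed and the table keeps
-- two identical rows. Against the default SET table, the second single-row
-- insert fails with a duplicate-row error, and an INSERT...SELECT silently
-- discards duplicates -- which is exactly how 8,000 selected rows became ~5,000.
INSERT INTO INTEREST_MAP VALUES (1, 'Nordea', 2.15, DATE '2015-07-01', 'A');
INSERT INTO INTEREST_MAP VALUES (1, 'Nordea', 2.15, DATE '2015-07-01', 'A');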

NOTE: If you have a unique index in your DDL, MULTISET is overruled by the unique index and duplicates are still rejected.

Now we know MULTISET; what is SET?

When you work in Teradata mode, tables are SET by default. Since I was working in Teradata mode, the DDL in Fig 1 behaves as a SET table even though SET is not mentioned explicitly.

And SET doesn't allow duplicate rows to be inserted into the table.

Sunday, June 14, 2015

Teradata: Can you use both DDL and DML in a macro?

As a database developer you work with stored procedures and macros; which one to choose depends on your problem. However, this post is not about comparing the two. Rather, it describes a scenario in which a macro causes an error that you may stumble over until you know the macro's limitation.

Example scenario: you need to develop a macro that creates a table, inserts data into it, and finally returns the rows from the table.

So your SQL statements will look like this:

-- DDL
Create table Customer(
cust_id int,
Cust_First_Name varchar(40),
Cust_Last_Name varchar(40));

-- DML
insert into Customer
values(1,'dipu','Paul');

-- DML
Select * from Customer;

If you run the SQL above (without creating a macro), it will create the table named 'Customer', insert data into it, and finally retrieve the rows from the table.

Let's wrap this code in a macro, which looks like this:

Create MACRO Customer_Operation AS(
create table Customer(
cust_id int,
Cust_First_Name varchar(40),
Cust_Last_Name varchar(40));

insert into Customer
values(1,'dipu','Paul');

select * from Customer;
);

You wrapped the same code that ran successfully earlier, but now it returns the error below:


Fig: Error 3576: Data definition not valid unless solitary.

The error is raised because the macro mixes DDL and DML, which a Teradata macro does not support.

So keep in mind while working with macros: "You cannot mix DDL and DML requests within the same macro."
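A common workaround, sketched below, is to run the DDL on its own first (a macro may contain a DDL statement only if it is the macro's sole statement) and keep the macro DML-only:

-- Step 1: execute the DDL separately
Create table Customer(
cust_id int,
Cust_First_Name varchar(40),
Cust_Last_Name varchar(40));

-- Step 2: the macro now contains only DML, so it compiles and runs
Create MACRO Customer_Operation AS(
insert into Customer
values(1,'dipu','Paul');

select * from Customer;
);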

Friday, May 29, 2015

What do you need to know about DBC in Teradata?

As a database developer you may think DBC is something only the database administrator needs to care about, not you.

However, knowing about DBC can save you time with queries like: which tables exist in your database? What are the column names in a table, or the existing views in a database? Sometimes you may also need to anonymize some columns in a few tables, which you can do using DBC.

You could say: I can see all the table names by browsing the database and copy-paste them. Yes, you can, but that is a time-consuming, manual process. Wouldn't it be nice to get all the table names by writing two lines of code?

Let's say you want to find all the table names in your database; the SQL looks like this:

Select * from dbc.tables where tablekind='T' and databasename='STUDENT103'

which returns the result below (part of the full result):
Fig 1: Table list of the database

To find the view names in the database, the SQL looks like this:

Select * from dbc.tables where tablekind='V' and databasename='STUDENT103'

Part of the result set looks like this:
Fig 2: View list of the database (this database contains only one view)



To get all the column names of a table in the database, write the SQL below:

SELECT DatabaseName,TableName,ColumnName
FROM DBC.Columns
WHERE DatabaseName='STUDENT103'
AND TableName='DEPARTMENT';

The result of the code above looks like this:

Fig 3: List of columns in a table


Using DBC you can get not only column, table, and view names but also more attributes, such as who created the table or view, when it was created, and so on.
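For example, dbc.tables also exposes the CreatorName and CreateTimeStamp columns:

-- Who created each object in the database, and when
SELECT TableName, TableKind, CreatorName, CreateTimeStamp
FROM dbc.tables
WHERE DatabaseName = 'STUDENT103'
ORDER BY CreateTimeStamp DESC;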

Note: DBC is the superuser on a Teradata database.


Wednesday, May 13, 2015

A newly developed job in SAS DI Studio: why is the target table not found after check-in?

When you start working with a new tool, you may find it buggy (or so you assume; in most cases it's actually you who doesn't know all the features of the tool). In a few cases, though, the tool has a known issue or doesn't report error messages properly.

If you are new to SAS DI, I am sure you will find this post interesting.

Let's start with an ETL job that gets data from an external file and loads it into a Teradata table.



Fig 1: An example job that gets data from an external file and loads it into a Teradata table


When the job is completed, we check it in so that other developers can review it and it can be deployed to Production.

The Check In window looks like this:


                                        Fig 2:  Check In Window



After clicking Check In, you go through the three steps below to check in the job.

Fig 3: Title and description entered during check-in

  Fig 4: Verify the objects that you are checking in

Fig 5: Summary of Check In Wizard

After completing the check-in, when you try to open the job you will see the error below:
Fig 6: Missing Objects

From the information message alone you can't tell what has changed, until you observe that your job is missing two objects: the external source file and the target Teradata table.

But you checked in the job, so why are those objects missing?

Well, checking in the job alone doesn't check in all its objects automatically. You must also check in the source file and the target table; then you will find the job as it should be.

Tip: It's always a good idea to check in, even if the job is not complete, and to check in all the objects related to the job.

Thursday, April 30, 2015

Window function: Rows between Preceding and Preceding in Teradata

As a continuation of the earlier post Maintain Historical and Current data in data Warehouse, I am going to show how to implement it in a data warehouse.

First we create a table that holds customer information; assume the table 'CUSTOMER' is part of an OLTP database. The table doesn't have an end-date column, only an effective date (EFF_DT).

create volatile table Customer
(
Customer_id int,
CUSTOMER_NAME Varchar(120),
Address Varchar(200),
EFF_DT date
) ON COMMIT PRESERVE ROWS;

And Insert data into the table.

INSERT INTO Customer VALUES ( 101, 'PAUL','Bondhev 50', date '2015-01-03' );
INSERT INTO Customer VALUES ( 101, 'PAUL','Bondhev 51', date '2015-03-01' );
INSERT INTO Customer VALUES ( 101, 'PAULA','Bondhev 51', date '2015-04-01' );
INSERT INTO Customer VALUES ( 101, 'PAULA','Bondhev 52', date '2015-04-02' );
INSERT INTO Customer VALUES ( 101, 'PAULAS','Bondhev 52', date '2015-04-10' );
INSERT INTO Customer VALUES ( 102, 'MADS','Borupvang 52', date '2015-01-03' );
INSERT INTO Customer VALUES ( 103, 'LARS','Hellerup 52', date '2015-04-01' );

INSERT INTO Customer VALUES ( 202, 'DAN','Oxford street 30', date '2014-04-10' );
INSERT INTO Customer VALUES ( 202, 'DANS','Montix road 52', date '2014-04-10' );
INSERT INTO Customer VALUES ( 202, 'DANS','boulevard 52', date '2014-04-10' );
INSERT INTO Customer VALUES ( 202, 'DANSA','boulevard 52', date '2014-04-20' );

The data looks like the table below:


Fig 1: Customer table in OLTP database

As discussed in the earlier post, when we load data from the OLTP database into the data warehouse, we add END_DT at load time.

In the data above, customer id 101 has more than one entry; the customer changed his name and address a few times. To add the end date you need to do the following:

For each customer, take a row's EFF_DT and set it as the END_DT of the chronologically previous row.

E.g., customer id 101's data is arranged in descending order below; the lowest EFF_DT is 03-01-2015.


Fig 2: How the end date is managed in the data warehouse

We look at the EFF_DT of the preceding row, which is 01-03-2015, and make that the END_DT of the current row.

And we do the same for the other rows to find their END_DT.


The SQL code to implement the above looks like this:

                                Fig 3: Preceding and Preceding to generate End Date

"ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING" is used in an ordered analytical function to tell it to include only the immediately preceding row of the partition in the calculation being performed.

If you look at the data above (Fig 2), the most recent row (EFF_DT = 10-04-2015) doesn't have an end date: the window function (Fig 3) looks for a preceding row, finds none, and therefore returns NULL.

We can use COALESCE to replace the NULL with the open-ended date '9999-12-31', marking the row as the active one.

SELECT CUSTOMER.*,
       COALESCE(MAX(EFF_DT) OVER (PARTITION BY Customer_id
                                  ORDER BY EFF_DT DESC
                                  ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING),
                DATE '9999-12-31') AS END_DT
FROM CUSTOMER;

The final output should look like this:


Fig 4: Latest row updated with the never-ending date
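If you then want to materialize this result in a warehouse table, a sketch could look like the following (the target table name Customer_DWH is assumed; it has the same columns as CUSTOMER plus END_DT):

-- Load the warehouse table with the derived END_DT in one pass
INSERT INTO Customer_DWH (Customer_id, CUSTOMER_NAME, Address, EFF_DT, END_DT)
SELECT CUSTOMER.*,
       COALESCE(MAX(EFF_DT) OVER (PARTITION BY Customer_id
                                  ORDER BY EFF_DT DESC
                                  ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING),
                DATE '9999-12-31')
FROM CUSTOMER;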


Sunday, April 19, 2015

Maintain Historical and Current data in data Warehouse

When you want to keep history in a data warehouse, you need to maintain an END_DT alongside the EFF_DT in the table to distinguish active rows from inactive ones.
For example, the source table Customer looks like this:

                                     
                                       Fig: 1.1 Source table ‘Customer’

The source table above doesn't have an end date, but when you move the data from source to target you need to add one. As we can see from the data, we have two customers, each with many rows; sometimes they change address and sometimes name.

When the data reaches the target table, only one row per customer should be active, marked with an end date of '9999-12-31'; each of the customer's other rows gets as its end date the EFF_DT of the row that superseded it. The data should look like this:


 
Fig 1.2: How End_date works in the 'Customer' table

Why do we care about the end date (END_DT) in a data warehouse?

In a data warehouse, we initially take a full load and then take a delta (each day's data) every day. You may ask: can't we keep an exact copy of the source table? Do we really need to add an end date (END_DT)?
The simple answer is YES; one of the reasons for having a data warehouse is to isolate active data from historical data, which END_DT does.

Take an example: if you want to query the latest attributes of customer_id = 101, how will you find the customer's latest information? If you write a query like this:

Select * from Customer
where customer_id=101;

--Result: you will get all five rows shown above

To get the latest information for the customer 'PAUL', run the query below:

Select * from Customer
Where customer_id=101 and date between EFF_DT and END_DT;  -- DATE evaluates to the current date in Teradata

Result:

Customer_id  CUSTOMER_NAME  Address     EFF_DT      END_DT
101          PAULAS         Bondhev 52  10-04-2015  31-12-9999

             Fig 1.3: Find out the latest information of the customer
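The same pattern generalizes: to get the currently active row for every customer, simply drop the customer_id filter:

-- Current snapshot of the dimension: one active row per customer
SELECT *
FROM Customer
WHERE DATE BETWEEN EFF_DT AND END_DT;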

Without END_DT, your data warehouse behaves just like the OLTP database, where you have to write complex queries to find the latest information.


Next we will find out how to populate END_DT from EFF_DT using the window function ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING (coming in the next post).