Tuesday, November 18, 2014

Data Import failure in the midway: What to DO?

If you work with data warehouse, sometime you may get task to insert data into a table from a .csv file. It's simple task, isn't it?? Just use import functionality from Terdadata SQL Assistant and insert the records. (How to import data from .csv to teradata table is not scope of this article).

But it's not always true that it can straight forward. Considering a task you got from business where you received a .csv file with 50,000 rows and 15 columns in it. And  this data you have to insert into a table (not more than a table). You didn't think much and press F5 to run the query and it's started getting data into the table. You feel happy :) But your happiness will be blurred if you see an error message saying "Duplicate row error in...."after inserting 10,000 records.

Now you will start thinking "I should have think through the situation before pressing the F5 key, I am stupid bla. bla." No worries, human being make error, so you do!!

We will find out the way out :) Assume your table already had few hundred thousands rows and you inserted just now 10,000 and then it failed. Smart way to get rid off the problem is to find out the deviation of the records and insert the missing rows.

What will be steps to fix the problem?

We start with an example to explain the steps involved to solve the problem, say table name is Customer and the received  .csv file from business called 'Customer_from_Mars.csv"

Step 1: Insert the data into a Temporary Table
Take the .csv file and insert data from .csv to a temporary table called "TMP_TBL_Customer_Mars".

Step 2: Find out the deviation
To find out the deviation you can use MINUS operator. The Code will look like below:

--Select the all customers from .csv file that you have inserted into temp table

Select TMP_CUST_TBL.Cust_ID
,TMP_CUST_TBL.First_Name
,TMP_CUST_TBL.PLANET_ID
,TMP_CUST_TBL.XtraField1
,TMP_CUST_TBL.XtraField2
,TMP_CUST_TBL.XtraField3
,TMP_CUST_TBL.XtraField4
,TMP_CUST_TBL.XtraField5
 TMP_CUST_TBL.XtraField6 
FROM TMP_TBL_Customer_Mars  as TMP_CUST_TBL

MINUS
--Find out the customer which already inserted before the failure occurred.
Select TMP_CUST_MARS.Cust_ID
,TMP_CUST_MARS.First_Name
,TMP_CUST_MARS.PLANET_ID
,TMP_CUST_MARS.XtraField1
,TMP_CUST_MARS.XtraField2
,TMP_CUST_MARS.XtraField3
,TMP_CUST_MARS.XtraField4
,TMP_CUST_MARS.XtraField5
TMP_CUST_MARS.XtraField6
FROM CUSTOMER AS CUST
JOIN
TMP_TBL_Customer_Mars AS TMP_CUST_MARS
ON
CUST.cust_id=TMP_CUST_MAR.Cust_ID
AND CUST.PLANET_ID=TMP_CUST_MAR.PLANET_ID

At first you are selecting whole data from .csv file that you have stored in temp table and find out the already inserted record by joining with main table and finally, use MINUS operator between them to get the records those did not insert.

Step 3: Insert the missing data to the table
Now you are ready to insert the rest of the data into the table "Customer"

Insert into CUSTOMER

Select TMP_CUST_TBL.Cust_ID
,TMP_CUST_TBL.First_Name
,TMP_CUST_TBL.PLANET_ID
,TMP_CUST_TBL.XtraField1
,TMP_CUST_TBL.XtraField2
,TMP_CUST_TBL.XtraField3
,TMP_CUST_TBL.XtraField4
,TMP_CUST_TBL.XtraField5
 TMP_CUST_TBL.XtraField6 
FROM TMP_TBL_Customer_Mars  as TMP_CUST_TBL

MINUS
--Find out the customer which already inserted before the failure occurred.
Select TMP_CUST_MARS.Cust_ID
,TMP_CUST_MARS.First_Name
,TMP_CUST_MARS.PLANET_ID
,TMP_CUST_MARS.XtraField1
,TMP_CUST_MARS.XtraField2
,TMP_CUST_MARS.XtraField3
,TMP_CUST_MARS.XtraField4
,TMP_CUST_MARS.XtraField5
TMP_CUST_MARS.XtraField6
FROM CUSTOMER AS CUST
JOIN
TMP_TBL_Customer_Mars AS TMP_CUST_MARS
ON
CUST.cust_id=TMP_CUST_MAR.Cust_ID
AND CUST.PLANET_ID=TMP_CUST_MAR.PLANET_ID


No comments: