Wednesday, December 25, 2019

Environment variable setup in SSIS Catalogue

If your SSIS solution is built with the project deployment model (not the package deployment model) and you would like to introduce environment variables, then this post is for you.

Let's start with a few steps before going into environment variables. In general, after building the SSIS solution you get an .ispac file, which you take from DEV to the UAT/TEST/PREPROD/PROD environment. You then deploy the .ispac file under the SSIS catalogue by following the wizard. For the step-by-step process of deploying an SSIS package to the catalogue, please follow: Deploying SSIS package to Catalogue

After deployment, you will find the packages under the SSIS catalogue, as in figure 01 below:


Fig 01: SSIS Catalogue


As the next step you will create a job (or jobs) to run the packages. Now, if you don't create any environment variables, the job will look like the one below:

Fig 2: Job Properties without Environment variable

Will this job execute without environment variables? Surely it will. But the connection string still retains the development server information. Since the solution has already been moved to the UAT or Test environment, you would need to change the connection string manually. If environment variables have been defined, they can be assigned to the job, so the same job (process) can run against different environments (UAT/Pre-PROD/PROD) simply by pointing it at the right environment.

A job with an environment variable looks like figure 3 below:


Fig 3: Job with Environment variable

With this setup, you can choose from the different environments before running the job, and each environment can hold different source and target databases as well as different source and destination file locations. Now let's see how to create the environment variables.


Step 1: Create the environment
Under SSISDB, you will find the Environments folder as in the picture below; right-click it and follow the steps.

Fig 4: Create environment variable


Step 2: Property setup for the environment variables

In this step, you define the connection strings, source/destination file paths, etc.

Fig 5: setup properties for environment
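If you prefer scripting over the SSMS dialogs, the SSISDB catalog exposes stored procedures for the same steps. Below is a minimal sketch; the folder name 'MyFolder', the variable name and the connection string value are placeholders, so adjust them to your own catalog.

-- Create the environment under an existing catalog folder
EXEC SSISDB.catalog.create_environment
     @folder_name = N'MyFolder',
     @environment_name = N'UAT';

-- Add a variable that will hold the UAT connection string
EXEC SSISDB.catalog.create_environment_variable
     @folder_name = N'MyFolder',
     @environment_name = N'UAT',
     @variable_name = N'SourceConnectionString',
     @data_type = N'String',
     @sensitive = 0,
     @value = N'Data Source=UATServer;Initial Catalog=StagingDB;Integrated Security=SSPI;',
     @description = N'Source connection string for UAT';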



Step 3: Connect the environment to the project. Under the SSISDB catalogue, choose the project and then right-click to find 'Configure'.

Fig 6: Configure environment variable



Then add a reference to the project: References -> Add, and the environment can be chosen.

Fig 7: adding environment variable reference to the project
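Step 3 can be scripted the same way; a hedged sketch using the same placeholder names ('MyFolder', 'MyProject', 'SourceConnectionString') as above:

DECLARE @reference_id BIGINT;

-- Reference the UAT environment from the project
EXEC SSISDB.catalog.create_environment_reference
     @folder_name = N'MyFolder',
     @project_name = N'MyProject',
     @environment_name = N'UAT',
     @reference_type = 'R',            -- 'R' = relative (same folder), 'A' = absolute
     @reference_id = @reference_id OUTPUT;

-- Bind a project parameter to the environment variable ('R' = referenced value)
EXEC SSISDB.catalog.set_object_parameter_value
     @object_type = 20,                -- 20 = project, 30 = package
     @folder_name = N'MyFolder',
     @project_name = N'MyProject',
     @parameter_name = N'SourceConnectionString',
     @parameter_value = N'SourceConnectionString',  -- name of the environment variable
     @value_type = 'R';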



Step 4: From steps 1 to 3 we worked with SSISDB. Now let's look at the job under SQL Server Agent -> Jobs, where you will have the option to choose an environment.

Fig 8: Job Properties
From the job properties, click on Steps and you will find the environment to choose from.



In this example only a UAT environment has been created, but Pre-PROD and PROD environments can be created as needed, with the source and target databases assigned accordingly. Those environments can then be assigned to the job.
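If you ever need to start a package with that environment reference from T-SQL rather than through the Agent UI, the catalog procedures accept the reference id as well; a sketch with the same placeholder folder/project/package names:

DECLARE @execution_id BIGINT, @ref_id BIGINT;

-- Look up the environment reference created in step 3
SELECT @ref_id = er.reference_id
FROM SSISDB.catalog.environment_references er
JOIN SSISDB.catalog.projects p ON p.project_id = er.project_id
WHERE p.name = N'MyProject' AND er.environment_name = N'UAT';

EXEC SSISDB.catalog.create_execution
     @folder_name = N'MyFolder',
     @project_name = N'MyProject',
     @package_name = N'MyPackage.dtsx',
     @reference_id = @ref_id,
     @execution_id = @execution_id OUTPUT;

EXEC SSISDB.catalog.start_execution @execution_id;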

Saturday, November 16, 2019

SQL Server: Easy way to add Columns to Multiple tables

When you work on a Data Warehouse/BI project, or any project that requires a database, and you find yourself needing to add one or more identical columns to tables you have already built, this post will help you get there quicker and error free.

For example, a BI project you are working on may require adding a load timestamp to all the target tables, which was somehow missed in the design phase, or you find out later in the project that you need one or two particular columns in all the tables.

Boring task, right? Going through all the tables and changing them one by one is time consuming and error prone. For example, say you have 50 tables and you need to add one column called 'UpdatedAt'. How do we do it with a SQL script?

You list all the table names and run the query below, replacing the table names (e.g. 'testTable1', 'testTable2') with your own.
 
DECLARE @tbl_name  VARCHAR(100)
DECLARE @tsql_code VARCHAR(500)

-- Cursor over the tables that need the new columns
DECLARE cur_table CURSOR LOCAL FOR
    SELECT name
    FROM sys.tables
    WHERE name IN ('testTable1', 'testTable2')
    ORDER BY 1
    FOR READ ONLY

OPEN cur_table
FETCH cur_table INTO @tbl_name

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Build and run the ALTER TABLE statement for the current table
    SELECT @tsql_code = 'ALTER TABLE ' + @tbl_name + ' ADD Col1 VARCHAR(10), Col2 VARCHAR(10)'
    EXEC (@tsql_code)
    FETCH cur_table INTO @tbl_name
END

CLOSE cur_table
DEALLOCATE cur_table


However, the above code may not work, since the schema name is missing; you need to include the schema name along with the table name, as in the example below:

DECLARE @tbl_name  VARCHAR(100)
DECLARE @tsql_code VARCHAR(500)

-- Cursor over schema-qualified (and bracket-quoted) table names
DECLARE cur_table CURSOR LOCAL FOR
    SELECT '[' + SCHEMA_NAME(schema_id) + '].[' + name + ']'
    FROM sys.tables
    WHERE SCHEMA_NAME(schema_id) + '.' + name IN ('dbo.Currency', 'dbo.Sales')
    ORDER BY 1
    FOR READ ONLY

OPEN cur_table
FETCH cur_table INTO @tbl_name

WHILE @@FETCH_STATUS = 0
BEGIN
    SELECT @tsql_code = 'ALTER TABLE ' + @tbl_name + ' ADD UpdatedAt [DATETIME]'
    EXEC (@tsql_code)
    FETCH cur_table INTO @tbl_name
END

CLOSE cur_table
DEALLOCATE cur_table
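If the script might be re-run, it is worth guarding against adding the column twice. As a small, hedged variation using the same cursor as above, the loop body can first check whether the column already exists:

-- Inside the WHILE loop, replace the ALTER/EXEC lines with:
IF NOT EXISTS (SELECT 1 FROM sys.columns
               WHERE object_id = OBJECT_ID(@tbl_name) AND name = 'UpdatedAt')
BEGIN
    SELECT @tsql_code = 'ALTER TABLE ' + @tbl_name + ' ADD UpdatedAt [DATETIME]'
    EXEC (@tsql_code)
END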


Hope this will make your life easier.

Sunday, October 20, 2019

SSIS: Execute SQL Task Limitation

I would like to share an experience I had recently. In general, when you work with SSIS and use the Execute SQL Task, you tend to build the query in SQL Server Management Studio (SSMS) and, once it works fine, copy the code into the Execute SQL Task. However, when I copied the code, pasted it into the Execute SQL Task and ran the package, it failed at that task, and I then figured out that only part of the SQL code had been pasted.

Fig 1: Execute SQL Task


The SQL code I copied was around 1,600 lines, but the paste only carried 891 lines into the Execute SQL Task and the rest were discarded. It seems the Execute SQL Task has a limitation; out of curiosity I took the pasted part of the code from the task, counted it, and found it held about 30,987 characters including spaces (see fig 2 below).

Fig 2: Execute SQL Task limitation


How do you solve it?
Well, it's easy: wrap the SQL code in either a view or a stored procedure and then call that from your Execute SQL Task.
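As a hedged sketch (the procedure name is made up), the long query can live in a stored procedure, and the Execute SQL Task's SQLStatement then becomes a one-line call:

-- In the database: wrap the long query once
CREATE PROCEDURE dbo.usp_MyLongReport      -- hypothetical name
AS
BEGIN
    SET NOCOUNT ON;
    -- ... the ~1,600 lines of SQL go here ...
    SELECT 1 AS Placeholder;               -- placeholder body for this sketch
END
GO

-- In the Execute SQL Task, the SQLStatement property becomes just:
EXEC dbo.usp_MyLongReport;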


Sunday, September 15, 2019

Referential Integrity in Data warehouse Design?

I was involved in a project where the team asked about implementing referential integrity in the data warehouse design. I was surprised, since I had never used such a design for a Data Warehouse (DW).

If someone asks you to put referential integrity into your data warehouse project, how do you explain to them that it's not required for a DW project?

If you design control/config tables to maintain logs or control the flow of your ETL runs, then you may apply referential integrity to those tables only; but for the tables that take the data load, don't go with referential integrity, because it will kill the performance.

So reason no. 1: poor performance while loading.

If your DW tables (dimensions and facts) have referential integrity enforced, every insert or update takes extra time to check it. That is why the standard for data warehouse design is not to maintain referential integrity in the database.

Let's explain why you don't need it.

When you do dimensional modelling, you don't need database-enforced referential integrity because the ETL does it for you. The ETL process is the source of all key values that are inserted into the star or snowflake model: the dimension tables carry surrogate keys, and while loading the facts the natural keys are replaced with those surrogate keys looked up from the dimensions. Any data that would violate referential integrity is caught by this process.
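To make that concrete, here is a hedged sketch of a fact load (all table and column names are made up) in which the surrogate-key lookup itself does the referential check: rows whose natural key has no match in the dimension never reach the fact table and can be routed to an error table instead.

-- Resolve surrogate keys from the dimension during the fact load
INSERT INTO dw.FactSales (CustomerKey, SalesAmount, LoadDate)
SELECT d.CustomerKey, s.SalesAmount, GETDATE()
FROM   staging.Sales s
JOIN   dw.DimCustomer d ON d.CustomerNaturalKey = s.CustomerID;

-- Rows that would have violated referential integrity are caught here
INSERT INTO etl.RejectedSales (CustomerID, SalesAmount, RejectReason)
SELECT s.CustomerID, s.SalesAmount, 'No matching customer in DimCustomer'
FROM   staging.Sales s
LEFT JOIN dw.DimCustomer d ON d.CustomerNaturalKey = s.CustomerID
WHERE  d.CustomerKey IS NULL;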

In summary, if your ETL is built correctly, then you don't need referential integrity in the data warehouse design.


Saturday, August 24, 2019

Using table variables in SSIS packages

When do you need them and how do you use them?

As you may know, a table variable is similar to a temp table in Microsoft SQL Server; however, a table variable generally uses fewer resources. I am going to explain when a table variable can be useful during SSIS package development.

Scenario 1: When you need to store data like an array and use it later in your SSIS package

For example, in the scenario below I only had to insert/update data for the periods that exist in the source data, and then make the next operation depend on whether the value exists or not.

DECLARE @periodACD TABLE (periodACD CHAR(2))   -- holds the distinct periods found in the source

INSERT INTO @periodACD
SELECT DISTINCT [PERIOD]
FROM [ETL].[SourceTable];

-- Branch on whether a given period exists in the source
SELECT CASE WHEN periodACD = '4' THEN 'Insert or update for April' END  /*You can use any transformation or calculation in the THEN clause.*/
FROM @periodACD
WHERE periodACD = '4'


Scenario 2: Keeping the inserted or updated row count for logging purposes while you use SQL MERGE

For example, if you are using MERGE to update a dimension and want to capture the number of rows affected, you can use a table variable to store the output of the MERGE.

DECLARE @RowCount TABLE (InsertedID INT)   -- Declare table variable to capture MERGE output

MERGE [Dim].[Table] AS Target
USING [Staging].[Data] AS Source
    ON Target.SourceID = Source.SourceID   -- Business key
WHEN MATCHED
    AND (   ISNULL(Source.SourceName, '') <> ISNULL(Target.SourceName, '')
         OR ISNULL(Source.SourceDate, 0)  <> ISNULL(Target.SourceDate, 0)
         OR ISNULL(Source.SourceBit, '')  <> ISNULL(Target.SourceBit, ''))
THEN
    UPDATE SET Target.SourceName = Source.SourceName,
               Target.SourceDate = Source.SourceDate,
               Target.SourceBit  = Source.SourceBit

OUTPUT inserted.SourceID AS InsertedID INTO @RowCount;   -- in a MERGE, the inserted pseudo-table carries both inserted and updated rows

SELECT COUNT(*)
FROM @RowCount
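The count can then go to wherever you keep your ETL logging. As a hedged sketch (the log table is hypothetical), the final SELECT could feed an audit table instead of just returning the number:

-- Hypothetical audit table for load statistics
INSERT INTO etl.PackageRunLog (PackageName, AffectedRows, LoggedAt)
SELECT 'Load_DimTable', COUNT(*), GETDATE()
FROM @RowCount;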

Saturday, July 20, 2019

New to Git and GitHub? How does Azure DevOps connect with Git? Must know.


Those who have been using TFS for a long time now know that Git is a 'thing' they can't avoid. When you start with Git, please don't expect it to be exactly like TFS. When I started with TFS 10 years back it was super easy to get going; for Git, start with the mindset that the command line is required.

As soon as you get to know Git, another name comes along, 'GitHub', and the confusion grows. Many people think Git and GitHub are the same, but they're not. Git is the version-control software and GitHub is your storage with sharing capability. If you don't want to share your code with anyone but still want to maintain versions, you don't need GitHub (which is unlikely, though). Moreover, as an alternative to GitHub you can use Azure DevOps.

Those of you used to working with TFS may already know about Azure DevOps, which gives you the option to use either Git or TFVC as your version-control tool. In that scenario, you don't need GitHub. The picture below depicts how Git relates to Azure DevOps and GitHub.

Fig 1: Git, GitHub and Azure Devops

Let's start with beginner steps to work with Git.

Step 1: Install Git on your PC.
Follow the link below, choose the right platform, download the file and install it.
https://git-scm.com/downloads

Step 2: Open the Git command window, which looks like the one below.

Fig 2: Installed Git software

Now, say you have a solution or files that you would like to add to Git and work on with other colleagues.

Adding files to Git means the folder where you keep the files/code should be recognized by Git. To achieve that, you need to do the following.

For example, your code is located under C:\Code and you want to make this folder a Git repository.

After opening Git Bash, run the commands below:

1. cd C:\
2. cd Code
3. git init

Now the folder is ready to use as a Git repository, and Git confirms it:

'Initialized empty Git repository in C:/Code/.git/'

And you will find a hidden .git folder created inside C:/Code/.

Now you can copy all your files into C:/Code; let's add a file to the local Git repository:

4. git add Testcode.txt

and commit the code to the local repository:

5.  git commit -m 'added test file'


Now your solution (all your code files) is under a Git repository. If you work on your own and only want version control so you can always go back to earlier code, you are already done at this stage.

Step 3:
Let's get to the next step, where you push this code to either GitHub or Azure DevOps. In my case I uploaded the code to Azure DevOps.

How do you do this?

Log in to Azure DevOps: https://dev.azure.com/

Create a repository and choose 'Git' as the type.

Fig 3: Create repo under Azure DevOps
After finishing the repository creation, you will see a window like the one below, where you need to copy the command-line code.



Now go back to your 'Git Bash' command window and run two more commands:

6. git remote add origin https://d*****@dev.azure.com/dpaul0628/DpaulGithub/_git/LoveGit
7. git push -u origin --all

Now if you log in to Azure DevOps, you will see that the local files you committed earlier have been uploaded.

Saturday, June 29, 2019

Creating scripts along with data for database objects - why it's required, and an easy way to accomplish it

Sometimes you may have a table that has been populated by business or tech people to support the solution you are developing. This type of table can be called a master data table, since it is not derived from any source system. So whenever you need to move the table DDL/script to a higher environment (from DEV to UAT or PROD), you need to move not only the DDL but also the data in that table.

Now let's find out how to do it.

There are a few ways to do it, but my preferred way is the Generate Scripts wizard in SQL Server Management Studio; we mostly use it to generate schema scripts, however it can also script out the data.

Step 1: Select the database and find the Generate Scripts option
Fig 1: Finding out generate scripts for the database
Step 2: Follow the wizard
Fig 2: Go through the wizard
Step 3: Choose the database objects, i.e. the tables you would like to move to the higher environment, and click Next.
Fig 3: Choose your database object
Step 4: Before you generate the script, you also need to include the data, so click the 'Advanced' button.
Fig 4: Advanced scripting options
Step 5: Change 'Types of data to script' to 'Schema and data'
Fig 5: Get schema and data together
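The generated file will contain the table definition followed by one INSERT per row, roughly like this hedged sketch for a hypothetical master data table:

CREATE TABLE [dbo].[CountryCode](          -- hypothetical master data table
    [CountryID]   INT         NOT NULL,
    [CountryName] VARCHAR(50) NOT NULL
);
GO
INSERT [dbo].[CountryCode] ([CountryID], [CountryName]) VALUES (1, 'Denmark');
INSERT [dbo].[CountryCode] ([CountryID], [CountryName]) VALUES (2, 'Sweden');
GO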




You are almost done: after completing the few additional steps in the wizard, the script will be ready with DDL and data. Now you can move the saved file to the upper environment and run the script, and you will find the table and its data together.



Sunday, May 19, 2019

When and why is unpivot required for your Data Warehouse (DW) project?

You may be involved in a Data Warehouse (DW) project where you find that the source system has many repeated columns in a table which can be reduced to a table with just a few columns. It's not only about reducing the number of columns; you also want to identify the attributes that can be aggregated for reporting purposes, which in the DW world become your fact attributes. Let's start with an example.

In medical data, a hospital can have different types of emergency beds; e.g. the table below holds bed availability data for the different bed types.

Fig 1: Hospital Bed data

Let's create a table with the above data; use the SQL below to populate it.

CREATE TABLE #HospitalBed (TranID int, SubmissionDate datetime, GenAcuteCom int,
    ShortstayCrisisComp int, MoodAnxietyComp int, EatDisorderComp int)

INSERT INTO #HospitalBed
SELECT 101, GETUTCDATE(), 10, 3, 9, 4
UNION
SELECT 102, GETUTCDATE(), 5, 6, 2, 8
UNION
SELECT 104, GETUTCDATE(), 6, 3, 9, 1
UNION
SELECT 105, GETUTCDATE(), 5, 6, 2, 8



To look at the data: SELECT * FROM #HospitalBed

Fig 2: Data from Table

Now, to fit this data into a data warehouse model, we can clearly see that the columns need to be turned into rows. The different bed types should go under one column called 'BedType', and the values of those columns under another column, which we can call 'ComplementBed'. How do we do it? SQL Server has an UNPIVOT operator, and we are going to use it to produce the expected data set.

SELECT u.TranID, u.BedType, u.ComplementBed
FROM #HospitalBed s
UNPIVOT
(
    ComplementBed
    FOR BedType IN (GenAcuteCom, ShortstayCrisisComp, MoodAnxietyComp, EatDisorderComp)
) u;

Fig 3: After unpivot

As you can see, after applying UNPIVOT you get more rows but fewer columns.

But what is the benefit? It makes the dimensional model simpler: all the bed types can be kept in a dimension table, and the fact table holds the key of the bed type along with the ComplementBed value. Figure 4 below shows how a dimensional model can eventually be established for the source table mentioned earlier (Fig 1).

Fig 4: Sample Dimension and Fact table 
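As a hedged sketch of what Fig 4 represents (table and column names are illustrative), the dimension and fact tables could look like this:

CREATE TABLE dw.DimBedType (
    BedTypeKey  INT IDENTITY(1,1) PRIMARY KEY,
    BedTypeName VARCHAR(50) NOT NULL       -- e.g. GenAcuteCom, ShortstayCrisisComp, ...
);

CREATE TABLE dw.FactBedAvailability (
    TranID         INT      NOT NULL,
    BedTypeKey     INT      NOT NULL,      -- key from DimBedType
    ComplementBed  INT      NOT NULL,      -- the unpivoted measure
    SubmissionDate DATETIME NOT NULL
);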


And consider that the source table we started with has only 4 different bed types; in reality it could be 20 columns or more. The dimensional model would become overly complicated, and you might not even be able to build a proper dimensional model without transposing the columns into rows in a scenario like the one above.

Sunday, April 21, 2019

Misleading popup message in SSIS solution can give you trouble!!

SQL Server Data Tools (2017) is smart and tries to help you in many ways, but sometimes a performance-improvement popup message can lead you into other issues.

I would like to share a recent experience. I was working with SSIS packages and, at the end of the day, committed and pushed the code to GitHub successfully. But the very next day, when I opened the solution, I surprisingly got an incompatibility error (Fig 1) and was not able to open the project.

Fig 1: Solution incompatibility error.
I was really surprised and tried to remember what I had done wrong that might have caused the issue. The 'incompatible' project error (Fig 1) generally happens when there is a Visual Studio version mismatch, but I had not done any update! My mind was stuck for a while, and then I recalled that I had clicked through a popup message in SSDT the previous day, though I couldn't even remember what I had answered.

The popup message was the one below (Fig 2), and I had evidently clicked 'Disable this extension'.

Fig 2: Misleading pop up
The popup message in SSDT also says 'Disabling it may improve your experience' (Fig 2). I did not pay close attention and trusted the part saying it might improve performance, which is misleading, and this is what caused the solution to become incompatible.

How to solve this issue?

There are two ways you can solve it:

1) Repair the software or 2) Enable the extension

1) Repair the software

You can run the SQL Server Data Tools installer (SSDT-Setup-ENU.exe) and click Repair as in figure 3 below, then complete the few steps in the wizard. When the repair is done, open the solution again and the compatibility error will be gone.
Fig 3: Repair SSDT

2) Enable the extension
Please open SQL Server Data Tools and go to Tools -> Extensions and Updates. You will find that the 'Microsoft Integration Services Projects' extension is disabled (Fig 4), so you need to enable it.


Fig 4: Extensions and Updates





Saturday, March 16, 2019

Unions in an OLE DB Source can kill performance in SSIS. How to handle this?

If you write SQL with many UNIONs in an OLE DB Source transformation, you will find the SSIS package running slowly.

For example, you may require many SQL UNIONs to populate the data set, like below:
SELECT x,y,z,z1,z2
FROM tblA

UNION
SELECT x,y,z,z1,z2
FROM tblB

UNION
SELECT x,y,z,z1,z2
FROM tblC

UNION
SELECT x,y,z,z1,z2
FROM tblD

UNION
SELECT x,y,z,z1,z2
FROM tblN


Fig 1: OLEDB source 

Instead of running the above SQL in a single OLE DB Source, you can use one OLE DB Source per table and combine them with a Union All transformation, which is usually much faster. (Note that T-SQL UNION also removes duplicates, whereas the Union All transformation does not, so add de-duplication downstream if you need it.)

Fig 2: Union All Transformation



Saturday, February 16, 2019

SQL Analytical Functions: Teradata vs. MS SQL Syntax

A few years back, I posted about Teradata analytical functions, where I described how they work and why we need to use them: https://pauldhip.blogspot.com/2014/12/analytical-function-rownumber-with.html

This post covers the equivalent syntax in MS SQL and how it differs from Teradata.

The SQL code I wrote in Teradata in my earlier post to find the latest version of the data was:
SELECT INDUST_CODE, INDUSTRY_NAME
FROM INDUSTRY
QUALIFY
    ROW_NUMBER() OVER (PARTITION BY INDUST_CODE ORDER BY ACTIVATE_DATE DESC) = 1

Do you think the above syntax will work in MS SQL? Not really! MS SQL doesn't support the QUALIFY clause, so you need to write the query a little differently, and a few extra lines are required to achieve the same result.

Let's translate the above syntax for MS SQL:

SELECT INDUST_CODE, INDUSTRY_NAME, RANKRESULT
FROM (
    SELECT INDUST_CODE, INDUSTRY_NAME,
           RANK() OVER (PARTITION BY INDUST_CODE ORDER BY ACTIVATE_DATE DESC) AS RANKRESULT
           -- ROW_NUMBER() would mirror the Teradata row_number() exactly if ACTIVATE_DATE can tie
    FROM INDUSTRY
) TT
WHERE TT.RANKRESULT = 1

As you can see, in MS SQL we had to use a subquery and a ranking function to achieve the same result we got in Teradata.

Sunday, January 13, 2019

Why Big Data and What is Big Data?

Before we look at what Big Data is, let's start with why Big Data came into the picture.

Big data is generated in multi-petabyte quantities every day. It changes fast and comes in different formats, e.g. audio, video, pictures, text, structured and unstructured data, which are difficult to manage and process using an RDBMS or other traditional technologies. Tech companies like Google and Yahoo! in the early 2000s found it challenging to handle these varied types of data at huge volume with existing technologies, so they started looking for alternative solutions, and that's how big data got here today. You will find more about the history of big data at the end of this post.


Let’s start with, what is Big data?

Is big data a Tool? Language? Solution? Or what? ...

Well, it's a platform that comprises many tools, and fortunately most of them are open source. However, since there are many tools available in the market to solve big data challenges, the next confusion arises: which tools to use when. I will write about this in my next post.

Let's focus on the concept of big data. People think big data is always about huge data, but that's not the case. We can say that to be a candidate for a big data solution, a workload should meet at least one of the three Vs:
 1) Volume 
 2) Velocity and 
 3) Variety

Fig 1: Elements to meet big data challenge

High volume: Social media like Facebook has billions of users; huge amounts of content are created on YouTube every hour; an organization like NASA was generating about 1.73 gigabytes of data every few seconds by the end of 2014; and Maersk vessels send huge volumes of data over the network every minute.

High velocity: The speed of the data matters; you need to capture real-time data from IoT devices, and your mobile devices produce tons of data every day. Some businesses can't wait long, so you may have to capture data in near real time and process it immediately; some, like the retail industry, require real-time data.

High variety: Different types of data are mixed in the same platform; e.g. Wikipedia, Twitter or Facebook have a mix of text, audio, video, images, etc. Regular businesses also receive data in different formats that needs to be transformed into useful output.

So when your organization deals with the above 3 Vs, it's time to consider moving to a big data platform. As Forbes research has shown [1], of the companies that said in 2015 they had no plan to use big data, 11% had already started using it by 2017; and while 17% said they were using big data in 2015, that number increased to 53% in 2017. The research also added that, among all industries, finance and telecom are ahead in adopting big data.

History of big data (literally, how Hadoop was invented):

Data started growing exponentially, and various types of data arrived with great velocity that existing transactional databases could not handle. Hence, many say Google was the first to face the challenge: trying to gain an advantage in its searches, Google wanted to create a catalog of the entire Internet. To be successful, it had to meet the challenges presented by the 3 Vs (as mentioned above) in an innovative way.

Google tackled the big data problem by working with groups of interconnected, inexpensive computers. This was revolutionary, and over a span of a couple of years Google Labs released papers describing parts of their big data solution. From these, Doug Cutting and Mike Cafarella began developing a project at Yahoo!, which was later open sourced as the Apache Foundation project called Hadoop, named after the toy elephant of Mr. Cutting's son.

When people talk about big data, the first name that comes up is 'Hadoop'. Hadoop (a high-availability distributed object-oriented platform) is used for maintaining, scaling, error handling, self-healing and securing large amounts of data, which can be structured or unstructured. As mentioned earlier, if data is huge and varied and needs to be processed instantly, traditional systems are unable to handle it; thus Hadoop comes into the picture.

But please remember, big data is not only Hadoop; there are many other tools in the Hadoop ecosystem that you need to use to solve big data challenges, which I am going to write about in my next post.


[1] https://www.forbes.com/sites/louiscolumbus/2017/12/24/53-of-companies-are-adopting-big-data-analytics/#19852cfd39a1 (accessed on 11th Jan 2019)