Saturday, March 16, 2019

Unions in the OLE DB Source can kill performance in SSIS: how to handle this?

If you write SQL with many UNIONs in an OLE DB Source transformation, you will likely find the SSIS package running slowly.

For example, you may need many SQL unions to populate the data set, like the query below:
SELECT x,y,z,z1,z2
FROM tblA

UNION
SELECT x,y,z,z1,z2
FROM tblB

UNION
SELECT x,y,z,z1,z2
FROM tblC

UNION
SELECT x,y,z,z1,z2
FROM tblD

UNION
SELECT x,y,z,z1,z2
FROM tblN


Fig 1: OLE DB Source

Instead of putting the unioned SQL in a single OLE DB Source, give each table its own OLE DB Source and combine them with the Union All transformation, which is much faster than running the unioned SQL in one source. One caveat: SQL's UNION removes duplicate rows, while the Union All transformation (like SQL's UNION ALL) does not, so add a de-duplication step (for example a Sort with duplicate removal) if duplicates are possible.

Fig 2: Union All Transformation
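To make that concrete, here is a minimal sketch using the same illustrative tables as above: each OLE DB Source runs one plain SELECT, and all the sources feed a single Union All transformation in the data flow.

-- OLE DB Source 1
SELECT x, y, z, z1, z2 FROM tblA

-- OLE DB Source 2
SELECT x, y, z, z1, z2 FROM tblB

-- ...one OLE DB Source per table (tblC, tblD, ..., tblN),
-- all connected to one Union All transformation

This way the data flow can read the tables in parallel instead of waiting for the database engine to materialize one big unioned result set.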



Saturday, February 16, 2019

SQL Analytical Functions: Teradata vs. MS SQL syntax

A few years back, I posted about Teradata's analytical functions, describing how they work and why we need them: https://pauldhip.blogspot.com/2014/12/analytical-function-rownumber-with.html

This post covers the equivalent MS SQL syntax and how it differs from Teradata.

In that earlier post, I used the following Teradata SQL to find the latest version of the data:
SELECT INDUST_CODE, INDUSTRY_NAME
FROM INDUSTRY
QUALIFY ROW_NUMBER() OVER (PARTITION BY INDUST_CODE ORDER BY ACTIVATE_DATE DESC) = 1;

Do you think the above syntax will work in MS SQL? Not really! MS SQL doesn't have a QUALIFY clause, so you need to write it a little differently, and a few extra lines are required to achieve the same result.

Let's translate the above syntax for MS SQL:

SELECT INDUST_CODE, INDUSTRY_NAME, RANKRESULT
FROM (
    SELECT INDUST_CODE,
           INDUSTRY_NAME,
           RANK() OVER (PARTITION BY INDUST_CODE ORDER BY ACTIVATE_DATE DESC) AS RANKRESULT
    FROM INDUSTRY
) TT
WHERE TT.RANKRESULT = 1;

As you can see, in MS SQL we had to use a subquery and a ranking function to achieve the result that QUALIFY gives us directly in Teradata.
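One caveat: RANK() returns ties, so if two rows share the top ACTIVATE_DATE for an INDUST_CODE you will get both back. To guarantee exactly one row per INDUST_CODE, as the Teradata ROW_NUMBER version does, ROW_NUMBER() is the closer match in MS SQL as well. A sketch using a common table expression against the same INDUSTRY table:

WITH ranked AS (
    SELECT INDUST_CODE,
           INDUSTRY_NAME,
           ROW_NUMBER() OVER (PARTITION BY INDUST_CODE
                              ORDER BY ACTIVATE_DATE DESC) AS rn
    FROM INDUSTRY
)
SELECT INDUST_CODE, INDUSTRY_NAME
FROM ranked
WHERE rn = 1;

The CTE plays the same role as the subquery above; it just keeps the outer query a little easier to read.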

Sunday, January 13, 2019

Why Big Data and What is Big Data?

Before we define what big data is, let's start with why big data came into the picture.

Big data is generated in multi-petabyte quantities every day. It changes fast and comes in many formats, e.g. audio, video, images, text, structured and unstructured, which are difficult to manage and process using an RDBMS or other traditional technologies. In the early 2000s, tech companies like Google and Yahoo! found it hard to handle this variety of data at such volume with existing technologies, so they started looking into alternative solutions, and that is how big data got here today. You will find more about the history of big data at the end of this post.


Now, let's move on to: what is big data?

Is big data a tool? A language? A solution? Or something else?

Well, it's a platform comprising many tools, fortunately most of them open source. Since there are so many tools available to solve big data challenges, the next confusion arises: which tool to use when? I will write about that in my next post.

Let's focus on the concept of big data. People think big data is always about huge data, but that's not the case. We can say that, to be a candidate for a big data solution, a problem should involve at least one of the three Vs:
 1) Volume
 2) Velocity
 3) Variety

Fig 1: Elements to meet big data challenge

High Volume: social media platforms like Facebook have billions of users, huge amounts of content are uploaded to YouTube every hour, an organization like NASA was generating about 1.73 gigabytes of data every few seconds by the end of 2014, and Maersk vessels send huge volumes of data over the network every minute.

High Velocity: the speed of the data matters. You need to capture real-time data from IoT devices, and your mobile devices produce tons of data every day. Some businesses can't wait long, so you may have to capture data in near real time and process it immediately; some, like the retail industry, require real-time data.

High Variety: different types of data are mixed in the same platform, e.g. Wikipedia, Twitter and Facebook hold a mix of text, audio, video, images, etc. Regular businesses also receive data in different formats, which needs to be transformed into useful output.

So when your organization deals with the above 3 Vs, it's time to consider moving to a big data platform. As Forbes research shows [1], of the companies that said in 2015 they had no plan to use big data, 11% had started using it by 2017. And while 17% said in 2015 that they were using big data, that number rose to 53% in 2017. The research also adds that, among all industries, finance and telecom are ahead in adopting big data.

History of big data (or, how Hadoop was invented):

Data started growing exponentially, arriving in many types and at a velocity that existing transactional databases could not handle. Many say Google faced the challenge first: to gain an advantage in search, Google wanted to create a catalog of the entire Internet. To succeed, they had to meet the challenges presented by the 3 Vs (mentioned above) in an innovative way.

Google tackled the big data problem by making groups of interconnected, inexpensive computers work together. This was revolutionary. Over a span of a couple of years, Google Labs released papers describing the parts of their big data solution, and from these, Doug Cutting and Mike Cafarella began developing a project, continued at Yahoo!, that was later open sourced as the Apache Foundation project called Hadoop, named after the toy elephant of Mr. Cutting's son.

When people talk about big data, the first name that comes up is 'Hadoop'. Hadoop (sometimes expanded as 'High-availability distributed object-oriented platform') is used for maintaining, scaling, error handling, self-healing and securing data at large scale. The data can be structured or unstructured. As mentioned earlier, when data is huge and varied and needs to be processed quickly, traditional systems cannot handle it; that is where Hadoop comes into the picture.

But please remember, big data is not only Hadoop; there are many other tools in the Hadoop ecosystem that you will need in order to solve big data challenges, which I am going to cover in my next post.


[1] https://www.forbes.com/sites/louiscolumbus/2017/12/24/53-of-companies-are-adopting-big-data-analytics/#19852cfd39a1 (accessed on 11th Jan 2019)

Wednesday, November 7, 2018

SQL: Pay attention when using GROUP BY and CASE WHEN together

When you are working with SQL and have a CASE WHEN expression together with GROUP BY, take a close look at what you are putting inside the CASE WHEN.

For example, a case like the one below is pretty straightforward:
 
SELECT cust_key,
       Cust_site_key,
       MAX(rating_key) AS rating_key
FROM Customer_rating
WHERE year = 2018
GROUP BY 1, 2
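
A side note while we are here: grouping by ordinal position (GROUP BY 1, 2) is accepted by Teradata but not by SQL Server, where you must list the column names instead:

SELECT cust_key,
       Cust_site_key,
       MAX(rating_key) AS rating_key
FROM Customer_rating
WHERE year = 2018
GROUP BY cust_key, Cust_site_key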
 
Now let's look at the code below and see if you can spot any issue with it.

SELECT f.event_key,
       ed.Event_code,
       CASE ed.Event_code
          WHEN 'start'  THEN MIN(f.event_date_key)
          WHEN 'Closed' THEN MAX(f.event_date_key)
          ELSE f.event_date_key
       END AS c
FROM Event_fact f
JOIN event_dim ed
    ON f.event_key = ed.event_key
GROUP BY f.event_key, ed.Event_code


After running the above code you will get the error "Failed 3504: Selected non-aggregate values must be part of the associated group." (that is Teradata's wording; SQL Server raises an equivalent complaint about columns missing from the GROUP BY).

What went wrong? Well, you may think event_date_key is used inside aggregate functions (MIN and MAX), so it doesn't need to appear in the GROUP BY. However, if you look closely you will find that the ELSE part of the CASE expression also uses event_date_key, with no aggregation applied. That means event_date_key must be included in the GROUP BY.

Hence the corrected code is:
SELECT f.event_key,
       ed.Event_code,
       CASE ed.Event_code
          WHEN 'start'  THEN MIN(f.event_date_key)
          WHEN 'Closed' THEN MAX(f.event_date_key)
          ELSE f.event_date_key
       END AS c
FROM Event_fact f
JOIN event_dim ed
    ON f.event_key = ed.event_key
GROUP BY f.event_key, ed.Event_code, f.event_date_key
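
One caveat with this fix: adding event_date_key to the GROUP BY changes the grain of the result, so a single event can now come back as several rows. If you need to stay at one row per event, an alternative sketch (against the same illustrative tables) is to aggregate the ELSE branch as well:

SELECT f.event_key,
       ed.Event_code,
       CASE ed.Event_code
          WHEN 'start'  THEN MIN(f.event_date_key)
          WHEN 'Closed' THEN MAX(f.event_date_key)
          ELSE MAX(f.event_date_key)  -- any single-valued aggregate keeps the grain
       END AS c
FROM Event_fact f
JOIN event_dim ed
    ON f.event_key = ed.event_key
GROUP BY f.event_key, ed.Event_code

Since every reference to event_date_key is now inside an aggregate, nothing extra is needed in the GROUP BY. Which variant is right depends on what the ELSE branch is supposed to mean for your data.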

Sunday, October 14, 2018

SSAS cube: Partition processes successfully but the measure group shows no data

Some days bring you a big surprise you are not prepared for at all! I am going to share one of them.

A cube I was working with processed successfully, but there was no data in the measures, data that had been there just before the last process.

What could have gone wrong?

I had only added one attribute to an existing dimension. I then tried Process Default, Process Full, and processing the partitions one by one with Process Full; every variant processed successfully, but there was still no data under the measure group. When I dragged dimensions into the SSAS cube browser I could see the values, but no rows showed up once I added measures.

Was Google helpful?

Through Google searches, some suggested checking that the source database connection was set correctly, some talked about a full process, and some mentioned adding a slice so the partition processes correctly.

Nothing really helped me.

What steps resolved the issue?

Step 1: Make sure you have the right database connection, that your credentials are valid, and that the impersonation settings are correct (sometimes the cube doesn't show you the real error message).

Step 2: Process all the dimensions, one after another.

Step 3: Since it's a huge data set, process one partition and check the Estimated Size property of the specific partition you just processed.

If the partition's Estimated Size property is not 0 MB, you should have data.

Lesson learned: if the cube measures suddenly show no data after you add an attribute to an existing dimension, process all the dimensions (not only the changed one) and then process the cube. This step is always helpful when your cube behaves strangely without warning :)


Saturday, September 22, 2018

How to resolve it when a connection string change throws errors in your Excel report

Scenario: an Excel report is connected to an SSAS cube in production and is serving business needs. For some reason the SSAS server has been changed, so the connection needs to point to the new server. How will you point the report at the new SSAS server?

Well, you may think it should be pretty straightforward: just edit the connection string, replacing the SSAS server name (as in Fig 1), then hit Refresh All, and it should be ready to use again.

Fig 1: Changing data source from Connection string
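For orientation, the server name lives in the Data Source attribute of the connection string; a hypothetical example (server and catalog names are illustrative):

Provider=MSOLAP;Integrated Security=SSPI;Persist Security Info=True;
Initial Catalog=SalesCubeDB;Data Source=NEW-SSAS-SERVER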



Unfortunately, in many cases this will not work. I did exactly that but ended up with an error like the one below:

Fig 2: Error after changing connection.


Okay, got it. So how do we solve this?

Solution:

Go to the 'Analyse' tab in Excel, as in Fig 3.


Fig 3: Find Analyse tab

Then change the connection using Change Data Source (Fig 4):

Fig 4: Change data source

After you change the data source, hit the 'Refresh All' button; it should now work without any errors.

Sunday, July 15, 2018

SSAS Calculations: How to deploy calculation changes quickly in a cube?

If your SSAS solution is multidimensional, you probably have at least a few calculated measures in your cube. Every time you change the MDX script under the Calculations tab, you need to deploy and process the cube; although only a Process Default is needed for the changes to take effect, you may still spend quite some time processing.

If the cube is big and has many partitions, you can easily spend a good amount of time processing it every time, even when you modify only a tiny part of the MDX script.

To avoid processing the cube every time you change the MDX script, you can use BIDS Helper, which will save development time. BIDS Helper helps in many ways in MS BI development; however, this post covers only the MDX deployment part for SSAS multidimensional cubes.

BIDS Helper is an extension for Visual Studio, so the first step is to download the extension. You will find 'Extensions and Updates' under Tools in Visual Studio, as in Figure 1 below.

Fig 1: Find Extension and update


Then search for BIDS Helper and install it, as in Figure 2:

Fig 2: Search and install BIDS extension


After you complete the installation, restart Visual Studio. When you open your multidimensional cube, you will find new icons under the Calculations tab. The new icon looks like Figure 3 below:

Fig 3: New BIDS helper icon


If you hover over the icon, it displays "Deploy MDX Script". As soon as you click it, the MDX script is deployed to the cube immediately, so you don't need to process the cube at all.