Saturday, January 30, 2021

How to work with SQL stored procedure output parameters in Azure Data Factory

To facilitate native SQL code, we have stored procedure support in Azure Data Factory (ADF). When we work with stored procedures, we mostly use input parameters; however, a stored procedure can also have output parameters. In that case we need to deal with return values, similar to how a value is returned by a function. This article describes how you can work with stored procedure output parameters in ADF.

ADF has the Stored Procedure activity, which can execute any stored procedure you have in a SQL Server database. ADF also has a Lookup activity in which you can use a stored procedure. The example below uses the Lookup activity to execute a stored procedure in a database.

Let's start by creating a stored procedure (SP) with an output parameter:

CREATE PROCEDURE [ETL].[sp_testprocOutParm]
(
    @input  VARCHAR(10),
    @Iambit BIT OUTPUT
)
AS
BEGIN
    IF @input >= '1'
    BEGIN
        SET @Iambit = 1;
        RETURN;
    END
END;

Let's see if the SP returns the expected value. How do we execute the SP in SQL Server Management Studio (SSMS)? Use the syntax below to find out.

DECLARE @Iambit BIT;

EXEC ETL.sp_testprocOutParm '1', @Iambit OUTPUT;
SELECT @Iambit AS Iambit;

You should see an outcome like Fig 1 below.

Fig 1: Execution of output parameter in SSMS

The SP is created in the database and it returns the expected outcome. Now we can move to ADF.

This article assumes you know how to add a Lookup activity to an ADF pipeline. In your pipeline, add the Lookup activity as in figure 1.1. Now go to the Settings of the Lookup activity and choose Stored procedure under the 'Use query' options (as shown in Fig 2). You should then see the stored procedure you just created in the database in the dropdown list. If you are not able to see the stored procedure, it is most likely a problem with your access in ADF. Please remember, even though you can execute the SP in SSMS, this doesn't mean you will have access to the stored procedure from ADF. You may need to talk to your portal admin and find out whether your user has been granted enough permissions to execute the stored procedure.

Fig 2: Connect stored procedure via Lookup in ADF

If you find the stored procedure in the list, you can continue to the next step: import the parameters by clicking the 'Import parameter' button, as shown in Fig 3.

Fig 3: import parameter

'Import parameter' loads all the SP's parameters, both input and output. In this case the single output parameter will be shown. Having an output parameter means you want to return some value from the stored procedure and use it in ADF. In this example, the return value will control the next activities in the pipeline. Let's store the return value in a variable.

Drag and drop the Set Variable activity and connect it with the Lookup, as shown below in Fig 4.

Fig 4: Add Set variable

We need a variable, as shown in Fig 5: give the variable a name and select its type. Since the variable is at pipeline scope, make sure you have selected the pipeline itself, not any activity in the pipeline.

Fig 5: Creating variable

While creating variables, you will find there are three types of variables: String, Boolean, and Array. We have chosen the Boolean type for our variable since the stored procedure returns a Boolean.

Fig 6: Selecting the variable type
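For reference, the variable ends up in the pipeline's JSON definition; a rough sketch of that fragment is below (the variable name is an assumption, chosen to match the column the SP returns):

{
    "variables": {
        "Iambit": {
            "type": "Boolean"
        }
    }
}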

Now let's get back to the pipeline and select the Set Variable activity (as shown in Fig 7). In the Set Variable activity, click the Variables tab. This is where you will find the recently created variable under the Name dropdown. Select it, as shown in Fig 7.

 

Fig 7: Choose the variable

To bind the value returned from the stored procedure to the variable, you need to write an expression under the 'Value' of the variable, as shown in Fig 8.

Fig 8: expression to hold return value

The expression to bind the return value from the stored procedure is:

@activity('Lookup1').output.firstRow.Iambit

Let's explain what the expression means. It takes the first row from the output of the Lookup activity named Lookup1 and reads the column returned by the stored procedure, 'Iambit'. Here, 'Iambit' will return the result 1 (a Boolean TRUE).
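For context, the Lookup activity exposes its result as JSON; in this example, Lookup1's output looks roughly like the sketch below (the exact envelope varies, so treat this as an illustration only):

{
    "firstRow": {
        "Iambit": true
    }
}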

Be cautious: though the variable is Boolean and the stored procedure returns the value as a bit, I got an error in ADF with the above expression. I had to modify the expression to explicitly convert the return value to Boolean, like the code below:

@bool(activity('Lookup1').output.firstRow.Iambit)

In summary, an output parameter in a stored procedure is a great facility when you need it, and it is good to see that it works in ADF. Please note that the example depicted in this article uses a very simple stored procedure; in a business scenario your stored procedure could be a bit more complex. For example, you may have logging tables in the data warehouse, and you may want an activity in ADF to execute only when you pass a particular pipeline name as an input parameter and the stored procedure returns TRUE.
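A minimal sketch of such a procedure is below; the [ETL].[Log] table and its columns are hypothetical, not part of the example above:

CREATE PROCEDURE [ETL].[sp_ShouldPipelineRun]
(
    @PipelineName VARCHAR(100),
    @ShouldRun    BIT OUTPUT
)
AS
BEGIN
    SET @ShouldRun = 0;

    -- Allow the pipeline to proceed only if a successful run is logged for it
    IF EXISTS (SELECT 1
               FROM [ETL].[Log]
               WHERE PipelineName = @PipelineName
                 AND Status = 'Succeeded')
        SET @ShouldRun = 1;
END;

The Lookup and Set Variable pattern shown above works unchanged with a procedure like this; only the procedure name and the bound column name differ.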

Wednesday, December 30, 2020

ADF data flow: can a particular date format generate NULL values?

I am going to share a recent finding in ADF data flow: my source data in .csv had correct dates; however, after I did some transformations and saved the output as .parquet, those dates all got empty values. This blog post describes the issue and the resolution.


The source data in .csv has StartDate and CloseDate columns, as in figure 1.0 below, where the format of the date is MM/dd/yyyy.

Fig 1.0: Date in the source .csv



I used an ADF data flow to clean up/transform the files and saved them in .parquet format. However, in the .parquet file these two date columns, 'StartDate' and 'CloseDate', became empty.

 

Fig 1.1: Dates become empty in .parquet


After looking into the data flow, and specifically at the projection of the source, I found the auto-detected date format 'MM/dd/YYYY', which is the original source date format.


Fig 1.3: Date format auto detected in the data flow

And when I previewed the data, those dates were shown as NULL, which was kind of weird.

   

Fig 1.4: Date shown as NULL in the data preview

How to solve it? 

To fix this issue, go under the projection and change the date format to 'yyyy-MM-dd', as in figure 2.0 below.


Fig 2.0: Change date format

 

Then go and check the preview; it looks good now.

 

Fig 2.1: After changing the date format preview looks perfect


 

Note that I tried other formats in the projection, such as yyyy/MM/dd and so on, but those did not resolve the issue.

Fig 2.2: 


Other Solution?

You can also take another approach: change the format from 'date' to 'string' under the projection.

Fig 3.0: change data type from date to string


And then use a Derived Column activity.

Fig 3.1: get 'derived column' in the data flow


Now use an expression to convert the string into the correct date format: toDate(OpenDate,'yyyy-MM-dd')

Fig 3.2: expression to convert from string to date
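As an aside, toDate also accepts the format the incoming string is actually in. If your string column still holds the original MM/dd/yyyy text, an expression along these lines should parse it (the column name is an assumption taken from the earlier figures):

toDate(StartDate, 'MM/dd/yyyy')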


In summary, if you find any discrepancy in the output file's date columns, look closely at the date format and preview the data in the ADF data flow; then either change the format in the source projection or use a Derived Column activity to fix it.

Saturday, November 14, 2020

How to deal with NULL in ADF dataflow compared with SSIS?

When you are working with ETL/ELT, sometimes you may need to transform NULL into some meaningful value. If you have worked with SSIS, you know how to handle that. This blog post describes how we do it in SSIS and how the very same task can be done in ADF data flow.

Consider a .csv file where the Name column has a NULL value for the 2nd record (figure 1.0).



Fig 1.0: Sample .csv file with NULL record




After connecting the .csv file through a flat file source in an SSIS data flow, we can debug and view the records through the data viewer, which will look like figure 1.1 below.

Fig 1.1: Result in SSIS data flow - data viewer


If you would like to replace the NULL value with a meaningful value, you need to use a Derived Column transformation with an expression.

The SSIS data flow expression language has the REPLACENULL function, which replaces NULL with the value you want.

The expression: REPLACENULL(Name,"Unknown")

The above expression returns 'Unknown' when Name is NULL; otherwise it returns the original value.

Fig 1.2: Expression in SSIS Data flow to replace NULL with 'Unknown'


When it comes to ADF data flow, the expression language is similar to the SSIS expression language, but isNull only gives you true or false. The isNull function takes only one argument; e.g., fig 2.0 below shows it taking the argument Name and returning True (✓) when the value is NULL.


Fig 2.0: ADF dataflow isNull function


Now, let's find out how to transform NULL values into something meaningful in ADF data flow. ADF doesn't have the REPLACENULL function used in SSIS; rather, there are two ways you can replace NULL values in ADF data flow.


Approach 1: Combination of the iif and isNull functions


Expression: iif(isNull(Name), 'Unknown', Name)

The iif function checks the condition isNull(Name); if Name has a NULL value it returns 'Unknown', otherwise the original value is returned.


Fig 2.1: Using iif and isNull in ADF data flow

Approach 2: Using the iifNull function

The smartest solution is to use iifNull, which returns exactly the same result we found via approach 1.

Expression: iifNull(Name, 'Unknown') returns 'Unknown' if Name has a NULL value; otherwise it returns the original value.

Fig 2.2: iifNull function to replace NULL values

In summary, the expressions to replace NULL values are similar in SSIS and ADF data flow; however, the function you need to use is different in the two tools.

Sunday, October 18, 2020

How to handle a CASE statement in Azure Data Factory (ADF) compared to SSIS?

This post describes how you use a CASE WHEN statement in Azure Data Factory (ADF). If you are coming from an SSIS background, you know a piece of SQL will do the task. Let's see how you do it in SSIS and how the very same thing can be achieved in ADF.

Problem statement:

For my simple scenario, if PortfolioTypeCode is either 'Mutual Fund' or 'Pooled Fund' it should return 1, else it should return 0.

 

How do you do it in SSIS?

In SSIS, in the data flow you will have an OLE DB source like fig 1 below:

Fig 1: SSIS OLEDB source

 

Open the OLE DB source and write a SQL command like the one below, and you are done:

SELECT Col1,
       CASE WHEN PortfolioCode IN ('Mutual fund', 'Pooled fund')
            THEN 1
            ELSE 0
       END AS IsFund,
       Col2
FROM Table1

Fig 2: CASE WHEN under SQL command in SSIS

 

How do you implement in ADF?

However, in ADF you need to use expressions to achieve the same thing. ADF has the very same data flow concept as SSIS. In the data flow, after the source dataset is established, you can add a Derived Column activity like fig 3 below:

Fig 3: Adding a Derived Column in the data flow

 

Now you can give the new column a name and then add the expression (Fig 4):

Fig 4: Derived column expression

 

Let's see how the case expression works: it takes 3 arguments: a condition, a true value, and a false value. However, it can also take alternating conditions and values, as described in figure 5:

Fig 5: Case in Expression

 

For my simple scenario, if PortfolioTypeCode is either 'Mutual Fund' or 'Pooled Fund' it should return 1, else it should return 0.

Since you can't have CASE WHEN here, you use case as an expression instead; the code looks like below:

case(PortfolioTypeCode == 'Mutual Fund', 1,
     PortfolioTypeCode == 'Pooled Fund', 1,
     0)
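As a side note, the data flow expression language also has an in() function, so the same logic can arguably be written more compactly; a sketch worth verifying in your own data flow:

iif(in(['Mutual Fund', 'Pooled Fund'], PortfolioTypeCode), 1, 0)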

 

Saturday, September 19, 2020

Azure Data Factory: what is not allowed in the file name while using Data Flow (Spark engine)?

I was working with a pipeline where data needed to move from a source and load into the RAW zone in ADLS. As soon as the file was loaded, it got a date-time stamp appended to the end of the file name. E.g., the source file name is customer.csv, and when it lands in the RAW zone the file name becomes customer_2020-09-28T12:15:32.csv.

How do you add a date-time stamp to the end of the file name?

Adding dynamic content at the sink of the pipeline, like below (Fig 1), will do the task.


Fig 1: Dynamic content to append a timestamp to the file name
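The exact expression is not shown in the screenshot, but dynamic content of this kind typically looks something like the sketch below (the 'customer_' prefix is an assumption for this example):

@concat('customer_', formatDateTime(utcnow(), 'yyyy-MM-ddTHH:mm:ss'), '.csv')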



When I ran the pipeline, it appended the timestamp to the end of the file name and saved the file in blob storage, and I was able to view the data from the file customer_2020-09-28T12:15:32.csv.


Fig 2: Preview of the data



In Azure Data Lake Storage, a file name with ':' did not cause any issue while creating the file, nor while viewing it.

However, as soon as I used that file in the Data Flow activity and debugged the code (when the Apache Spark engine fires), I got the error below: java.lang.IllegalArgumentException: java.net.URISyntaxException: ....


Fig 3: java.lang.IllegalArgumentException

How to resolve the error?

You can't have a file name with ':' in it, which is also true if you try to create a file with ':' in its name in Windows: it throws an error. Interestingly, that's not the case when you create a file with the same name in Azure Data Lake Storage (where HDFS is working behind the scenes).

Fig 4: Error in Windows machine

To resolve this, I updated the format of the timestamp that is appended to the end of each file: instead of using yyyy-MM-ddTHH:mm:ss, I used yyyy-MM-ddTHHmmss, so I get the file name customer_2020-09-28T121532.csv.
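Plugged into the earlier sketch, the corrected dynamic content would then look roughly like this (again, the prefix is an assumption):

@concat('customer_', formatDateTime(utcnow(), 'yyyy-MM-ddTHHmmss'), '.csv')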

Saturday, August 22, 2020

How to solve the Azure Data Factory error: 'Sink dataset filepaths cannot contain a file name'?

I was building a pipeline that takes data from an SFTP folder and sinks it to the Cleansed layer, where the file format is .parquet. As soon as I validated the pipeline, I got the error below, where the error message is: 'Sink dataset filepaths cannot contain a file name, please remove the file name from {DatasetName}'.

You will find error like this:

Fig 1: Sink dataset error


If you get this error, you may be confused at first, since you have a file name in two different places: one in the Sink activity settings and the other in the dataset.

Let me guide you to those settings to fix this error. If you look at the settings of the Sink activity, you will find something like fig 2 below.



Fig 2: Settings of Sink activity



The error message says the file path can't contain a file name, but remember: this is not the dataset, it's the settings of the Sink activity, so you don't need to remove the file name here. If you want to keep the output in a single file, choose 'Output to single file'.

To fix the issue, click the Sink and open the dataset, as in fig 3 below:

Fig 3: open data set


Now you can find the file name under the dataset, as in fig 4 below. Remove the file name, then save and validate.

Fig 4: remove file name from Data set

Now the error will disappear. Lesson learned: read the error message more than once and locate the correct place to fix the issue.







Sunday, July 26, 2020

How to deploy a SQL Server Big Data Cluster by using Azure Data Studio?

There are different ways to deploy a SQL Server 2019 Big Data Cluster; this blog post uses Azure Data Studio.

The very first thing you need is Azure Data Studio. If you don't have it installed, you can find it here: https://docs.microsoft.com/en-us/sql/azure-data-studio/download-azure-data-studio?view=sql-server-ver15


Step 1: After you open Azure Data Studio you will see the UI below:



Fig 1: Azure data studio

Step 2: Choose SQL Server Big Data Cluster


Fig 2: Choose SQL Server Big Data Cluster    

Step 3: Missing prerequisites

  • kubectl
  • Azure CLI
  • azdata
Please install those.

Fig 3: Missing prerequisites
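How these prerequisites are installed changes over time, so treat the shell sketch below as an assumption and check the Microsoft docs for current instructions (the Azure CLI itself is installed from Microsoft's download page):

az aks install-cli
pip3 install -r https://aka.ms/azdata

The first command installs kubectl through the Azure CLI; the second is one pip-based route for azdata that Microsoft has documented, but verify it before relying on it.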



After the prerequisites have been installed you will see the UI below.

Fig 4: After prerequisites are installed



Step 4: Deployment Configuration Profile



Fig 5: Configuration profile


Step 5.A: Azure Settings:
These are the Azure settings where the resource group will be created. Make sure you have enough permission to create a resource group; if you don't have permission, the deployment will fail.

Fig 6: Azure settings

If it fails, the error message will look like below:

Fig 6.1: Error while creating resource group

Step 5.B: Cluster Settings: here you set the cluster name and the credentials, which you will need later to connect to the cluster, so please keep them safe.

Fig 7: SQL Server cluster settings



Step 5.C: Service and storage settings: these are filled in automatically, but please adjust them as per your needs.

Fig 8: Service settings.



Step 6: Script to Notebook

This is the last step of the wizard; as soon as you click 'Script to Notebook', it opens a notebook in Azure Data Studio.
Fig 9: Script to Notebook


However, if this is the first time you are deploying a Big Data Cluster, Python needs to be installed, so you will see the UI below:
Fig 10: install python


When that is done, go and hit 'Run All'.


Fig 11: Execute the script


You may get an error saying that pandas is still missing.


Fig 12: Pandas is missing

So you need to install the pandas package from Azure Data Studio: go to the package manager.
Fig 13: Finding package manager in Azure data studio

Then find pandas under the package manager; as soon as you find it, hit the Install button.

Fig 14: Install pandas
When the installation is done, hit 'Run All' in the notebook again. This time it will start running successfully. One of the steps will log in to the Azure portal; you will be redirected to the portal automatically and will see a UI like below, where you don't need to do anything.

Fig 15: Azure login
When the deployment is over, your Big Data Cluster should be ready to use, so you need to connect to it. You will find all the endpoints after the deployment is completed; please take the endpoint for 'SQL Server Master Instance Front-End'.

Fig 16: Click to connect the Cluster
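If you prefer the command line, the endpoints can also be listed with azdata; a rough sketch, assuming the cluster name used elsewhere in this post and that azdata is on your path:

azdata login -n mssql-cluster
azdata bdc endpoint list -o table

The first command logs in to the cluster namespace; the second prints all the endpoints in a table, including the SQL Server Master Instance Front-End.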

After connecting to the Big Data Cluster, it will look like below:

Fig 17: Connect with big data cluster

If you would like to remove the cluster, you can either delete it by using the azdata command in a command shell:


Delete cluster:

azdata bdc delete -n mssql-cluster

Fig 18: delete cluster



Or you can completely remove the resource group from the Azure portal.