It's all about Data: 2020

Wednesday, December 30, 2020

ADF data flow: can particular date format generate NULL value?

I am going to share recent finding in ADF data flow where my source data in .csv got correct date. However, as long as when I did some transformation and saved in .parquet found those date all got empty values. This blog post will describe the issue and the resolution.

Source data in .csv have startDate and CloseDate like below figure 1.0 where format of the date is MM/dd/yyyy

Fig 1.0: Date in the source .csv

I have used ADF data flow to cleanup/transform the files and saved into .parquet format. However, in the .parquet file these two date columns 'StartDate' and 'CloseDate' become empty.

Fig 1.1: Dates become empty in .parquet

After looking into the dataflow and specific looking at the projection of the source found auto detect date format ‘MM/dd/YYYY’ which is original source date format.

Fig 1.3: Date format auto detected in the data flow

And when previewed the data those date shown as NULL which was kind of weird.

Fig 1.4: Date shown as NULL in the data preview

How to solve it?

To fix this issue, what you need to do is, go under projection and change the date format to ‘yyyy-MM-dd’ as like below figure 2.0

Fig 2.0: Change date format

And you can go and see the preview, it looks good now.

Fig 2.1: After changing the date format preview looks perfect

Note that, I have tried with other format from projection such as yyyy/MM/dd and so on but those did not resolve the issue.

Fig 2.2:

Other Solution?

You can also take other approach, change the format from 'date' to 'string' under the projection

Fig 3.0: change data type from date to string

And then use derive column activity

Fig 3.1: get 'derived column' in the data flow

Now, use expression to convert into the correct date format : toDate(OpenDate,'yyyy-MM-dd')

Fig 3.2: expression to convert from string to date

In summary, if you find any discrepancy at output file's date columns then look closely the date format, preview the data in the ADF data flow then either change the format from source projections or use derived column activities to fix it.

Saturday, November 14, 2020

How to deal with NULL in ADF dataflow compared with SSIS?

When you are working with ETL/ELT, sometimes you may need to transform NULL into something meaningful value. If you worked with SSIS, you know how you handle that. This blog post will describe how do we do in SSIS and how the very same task can be done in ADF Dataflow.

Consider, we have a .csv file where Name columns have NULL value for 2nd record (figure: 1.0)

Fig 1.0: Sample .csv file with NULL record

After connecting the .csv file through flat file source in SSIS data flow, we can debug and view the record through data viewer which will look like below figure 1.1

Fig 1.1: Result in SSIS data flow - data viewer

If you would like to replace the NULL value with meaning value, in that case you need to use derive column activity and use expression.

SSIS data flow expression got REPLACENULL function, which will replace NULL to the expected value that you want.

The expression: REPLACENULL(Name,"Unknown")

The above expression will return 'Unknown' when Name is NULL otherwise it will return the original value.

Fig 1.2: Expression in SSIS Data flow to replace NULL with 'Unknown'

When it comes to ADF data flow regular expression similar like SSIS expression; isNull only give you true or false. And isNull function take only one argument, e.g. below fig 2.1 took the argument Name and return True (✓) if the value is NULL.

Fig 2.0: ADF dataflow isNull function

Now, let's find out how to transform NULL value into something meaningful in ADF data flow. ADF doesn’t have the same function REPLACENULL which used in SSIS, rather there are two ways you can replace the NULL values in ADF dataflow.

Approach 1: Combination of iif and isNULL function

Expression: iif(isNull(Name), 'Unknown', Name)

The function iif will check the condition isNull(Name), if Name have Null value it will return 'Unknown' otherwise original value will be returned.

Fig 2.1: using iif and isNull in ADF dataflow

Approach 2: By using iifNull function

The smartest solution is to use iifNull which will return exactly the same result we found via approach 1.

expression: iifNull(Name, 'Unknown') will return 'Unknown' if Name have NULL values otherwise it will return original value.

Fig 2.2:iifNULL function to replace NULL value

In summary, expression is similar to replace NULL values for both SSIS and ADF data flow, however, the function you need to use is different for two different tools.

Sunday, October 18, 2020

How to handle Case statement in Azure Data Factory (ADF) compare to SSIS?

This post will describe how do you use CASE WHEN statement in Azure data factory(ADF). If you are coming from SSIS background, you know a piece of SQL statement will do the task. However let's see how you do it in SSIS and the very same thing can be achieved in ADF.

Problem statement:

For my simple scenario, In case PortfolioTypeCode is either 'Mutual Fund' or 'Pooled Fund' it should return 1 Else it should return 0.

How do you do in SSIS?

In SSIS, under data flow you will have OLEDB source like below fig 1:

Fig 1: SSIS OLEDB source

And open the OLEDB source and then Write SQL command like below and you are done:

SELECT Col1,
CASE WHEN PortfolioCode IN('Mutual fund','Pooled fund')
THEN 1
ELSE 0
END IsFund,
Col2
From Table1

Fig 2: CASE WHEN under SQL command in SSIS

How do you implement in ADF?

However in ADF, to achieve the same you need to use Expressions. ADF have very same concept of data flow like SSIS. In the data flow, after the source dataset is established you can add 'Derived Column' activity like below Fig 3:

Fig 3: Adding derive column under data flow

Now you can give a new column name and then add the expression (Fig 4):

Fig 4: Derived column expression

Let's see how Case expression works: it takes 3 arguments, those are condition, true and false. However, it can have alternating condition which describe in the figure 5:

Fig 5: Case in Expression

For my simple scenario, If PortfolioTypeCode is either 'Mutual Fund' or 'Pooled Fund' it should return 1 Else it should return 0.

Since you can't have CASE WHEN rather using case as Expression, the code will look like below:

case( PortfolioTypeCode=='Mutual Fund',1,
PortfolioTypeCode=='Pooled Fund',1,0)

Saturday, September 19, 2020

Azure data Factory: What is not allowed in the file name while using Data Flow (spark engine)?

I was working with a pipeline where data needs to move from source and load into RAW zone in ADLS. As soon as loaded the file it got date time stamp appended end of the file. e.g. source file name is: customer.csv and when it's landed to RAW zone, then file name will be : customer_2020-09-28T12:15:32.csv

How do you add date time stamp end of the file?

Adding dynamic content at the sink pipeline like below (Fig 1) will do the task.

Fig 1: dynamic content to

When I run the pipeline it was appending timestamp end of the file and saving in the blob storage And I was able to view the data from the file: customer_2020-09-28T12:15:32.csv

Fig 2: Preview of the data

In the azure data lake storage file name with ':' did not give any issue while creating the file name as well viewing it.

However, as soon as I use that file in the Data Flow activity and debug the code (when apache spark engine fire) then below error : java.lang.illegalArgumentException:java.net.URISyntaxException:....

Fig 3: java.lang.illegalArgumentException

How to resolve the error?

You can't have file with ':', which also true if you try to create a file in Windows OS with ':' in it. It will throw error, however, interestingly that's not the case when you create a file with same name in Azure data lake storage (HDFS is working behind).

Fig 4: Error in Windows machine

To resolve this, I updated the format of timestamp part while adding end of the each file, instead of using yyyy-MM-ddTHH:mm:ss , I have used yyyy-MM-ddTHHmmss. so I get the file name as: customer_2020-09-28T121532.csv

Saturday, August 22, 2020

How to solve Azure data factory Error: Sink dataset filepaths cannot contain a file name ?

I was building a pipeline which was taking data from SFTP folder and sink to Cleansed layer where file format was .parquet. As soon as validated the pipeline got the below error, where the error message is: 'Sink dataset filepath can not contain a file name, please remove the file name from {DatasetName}.

You will find error like this:

Fig 1: Sink dataset error

If you find this error, may get confuse at first since you have file name at two different places. One is at Sink activity settings and other one is at the data set.

Let's guide you to how to find those settings and fix this error. If you look at the Settings for Sink activity, you will find like below fig: 2.

Fig 2: Settings of Sink activity

The error message said filepath can't contain file name but remember this is not data set, it's setting of the Sink activity so you don't need to remove the file name. And if you want to keep it in a single file then choose 'Output to single file'.

So fix the issue you need click Sink and open data set as like below fig 3:

Fig 3: open data set

Now you can find the file name under data set as like below fig 4, Please remove file name, save and validate.

Fig 4: remove file name from Data set

Now the error will disappear. Lesson learned, look at the error message more than once and locate the correct place to fix the issue.

Sunday, July 26, 2020

How to deploy SQL Server big data cluster by using Azure data studio?

There are different ways to deploy SQL server 2019 big data cluster. However, this blog post will use Azure data studio to deploy big data cluster.

The very first thing you need to have is Azure data studio, if you don't have installed Azure data studio then please find it: https://docs.microsoft.com/en-us/sql/azure-data-studio/download-azure-data-studio?view=sql-server-ver15

Step 1: After you open Azure data studio you will find below UI:

Fig 1: Azure data studio

Step 2: Choose SQL server big data cluster

Fig 2: Choose SQL Server Big Data Cluster

Step 3: Missing prerequisite

kubectl
Azure CLI
azdata

Please install those.

Fig 3: Missing prerequisite

After prerequisite have been installed you will find below UI

Fig 4: After prequisite is installed

Step 4: Deployment Configuration Profile

Fig 5: Configuration profile

Step 5.A: Azure Settings:

Azure settings where resource group will be created, make sure you have enough permission to create resource group, if you don't have permission then it will fail.

Fig 5: Azure settings

If failed then the error message will show like below:

Fig 5.1: Error while creating resource group

Step 5.B: Cluster Settings: Your setup cluster name and credential which you will require later to connect the cluster, so please keep it safe.

Fig 7: SQL Server cluster settings

Step 5.C: Service and storage settings: It will fill automatic but please adjust as per your need.

Fig 8: Service settings.

Step 6: Script to Notebook

Last step of the wizard, where as soon as click the 'Script to Notebook' it will open Notebook in Azure studio.

Fig 9: Script to Notebook

However, if this is the first time you are deploying big data cluster then python need to be installed, so you will find below UI:

Fig 10: install python

When you are done then go and hit 'Run All'

Fig 11: Execute the script

You may find error where pandas is still missing

Fig 12: Pandas is missing

So you need to install Pandas package from Azure data studio, Go to Package manager

Fig 13: Finding package manager in Azure data studio

And then find pandas under package manager, as soon as you find them , hit the install button.

Fig 14: Install pandas

When installation is done, then hit the 'Run All' from Notebook again, this time it will successfully start running and you will find , one of the step will login to Azure portal where automatically redirect to the portal and you will find like below UI where you don't need to do anything.

Fig 15: Azure login

When deployment is over , your big data cluster should be ready to use, so you must need to connect that. You will find all the end points after deployment is completed, please take the end point for 'SQL Server Master Instance Front-End'

Fig 16: Click to connect the Cluster

After connecting the big data cluster, it will look like below:

Fig 17: Connect with big data cluster

If you would like to remove the cluster then either you delete it by using azdata command in the command shell.

Delete cluster:

azdata bdc delete -n mssql-cluster

Fig 18: delete cluster

Or you can completely remove the the resource group from Azure portal.

Saturday, June 20, 2020

Adding Active Directory users to Azure SQL databases

If you are working with Azure SQL Database you will find adding user a bit different than general SQL DB, especially; adding Active Directory users. Though it's easy and a few steps, off course if you know the steps. While I was working on it and went through different routes so thought about sharing the easy one which worked like Gem.

There are two authentication approach for Azure SQL DB and SQL Managed Instance.

1) SQL Authentication

2) Azure Active directory Authentication

Fig 1.0: Authentication type in Azure

SQL authentication is straight forward, while you are creating database in Azure, you must have to put SQL server credential which will become your Azure database credential if you don't create database credential separately. As long as you have administrative credential to login to the database then you create all necessary users. However, you can't create Active directory users in Azure by logging as SQL authenticate user.

For example, I have logged in with the SQL admin user and then tried to execute below query to add Active directory user:

CREATE USER [user1@domain.com]

FROM EXTERNAL PROVIDER

WITH DEFAULT_SCHEMA = dbo;

--add user to role(s) for the particular database

ALTER ROLE dbmanager ADD MEMBER [user1@domain.com];

ALTER ROLE loginmanager ADD MEMBER [user1@domain.com];

However, found below error:

Fig 2.0: Error creating AD user

The error message is self explanatory: "Principal 'user1@domain.com' could not be created. Only connections established with Active Directory accounts can create other Active Directory users."

So, how to solve it? Or How to add the AD user to Azure SQL DB ?

Step 1: Please login to you azure portal and find out your SQL server resource and you will find 'Active directory Admin' left side under settings. Please click 'Active Directory Admin' to find out UI like below where click 'Set admin' , now your Active Directory user account become Admin to the SQL server.

Fig 3.0: Active Directory Admin

Step 2: Now either use SSMS or Azure data studio and login with Active directory authentication. Since my organization have Multi factor Authentication (MFA) enables so I have chosen Azure Active Directoty- Universal with MFA, but you can choose either 'Active directory - Integrated' or 'Active Directory - Password.'

Fig 4: Login with AD user

Step 3: And now you can run the query to add Active directory user and give permission as you want, though traditional database level roles like db_datareader, db_datawriter, db_ddladmin, etc. are the same in Azure database, but roles like sysadmin, serveradmin, etc. don’t exist in Azure SQL database. There are two admin roles, dbmanager (similar to dbcreator) that can create and drop databases, and loginmanager (similar to securityadmin) that can create new logins which you can see in the below code: