
Sunday, October 18, 2020

How to handle a CASE WHEN statement in Azure Data Factory (ADF) compared to SSIS?

This post describes how to use a CASE WHEN statement in Azure Data Factory (ADF). If you are coming from an SSIS background, you know a piece of SQL will do the task. Let's see how you do it in SSIS and how the very same thing can be achieved in ADF.

Problem statement:

For my simple scenario: if PortfolioTypeCode is either 'Mutual Fund' or 'Pooled Fund' it should return 1, else it should return 0.

 

How do you do it in SSIS?

In SSIS, under the data flow you will have an OLE DB source, like below (Fig 1):

Fig 1: SSIS OLEDB source

 

Open the OLE DB source, write a SQL command like the one below, and you are done:

SELECT Col1,
       CASE WHEN PortfolioTypeCode IN ('Mutual Fund', 'Pooled Fund')
            THEN 1
            ELSE 0
       END AS IsFund,
       Col2
FROM Table1

Fig 2: CASE WHEN under SQL command in SSIS

 

How do you implement it in ADF?

In ADF, to achieve the same thing you need to use expressions. ADF has the very same concept of a data flow as SSIS. In the data flow, after the source dataset is established, you can add a 'Derived Column' activity, like below (Fig 3):

Fig 3: Adding a derived column in the data flow

 

Now you can give the new column a name and then add the expression (Fig 4):

Fig 4: Derived column expression

 

Let's see how the case expression works: it takes three arguments, namely a condition, a true value, and a false value. However, it can also take alternating conditions and values, as described in Fig 5:

Fig 5: Case in Expression

 

For my simple scenario: if PortfolioTypeCode is either 'Mutual Fund' or 'Pooled Fund' it should return 1, else it should return 0.

Since you can't have CASE WHEN here, you use case as an expression instead, and the code will look like below:

  case(PortfolioTypeCode == 'Mutual Fund', 1,
       PortfolioTypeCode == 'Pooled Fund', 1,
       0)
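To the best of my knowledge the data flow expression language also offers in() and iif() functions, so the same check can be written closer to the SQL version; a sketch:

  iif(in(['Mutual Fund', 'Pooled Fund'], PortfolioTypeCode), 1, 0)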

 

Saturday, September 19, 2020

Azure Data Factory: What is not allowed in a file name when using Data Flow (Spark engine)?

I was working with a pipeline where data needed to move from a source and load into the RAW zone in ADLS. As soon as a file was loaded, a date-time stamp was appended to the end of the file name. For example, the source file name is customer.csv, and when it lands in the RAW zone the file name becomes customer_2020-09-28T12:15:32.csv.

How do you add a date-time stamp to the end of the file name?

Adding dynamic content at the sink of the pipeline, like below (Fig 1), will do the task.
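As a rough sketch (assuming the incoming file name is held in a pipeline parameter I've called FileName, a made-up name for illustration), the dynamic content for the sink file name could look like this:

  @concat(replace(pipeline().parameters.FileName, '.csv', ''),
          '_',
          formatDateTime(utcnow(), 'yyyy-MM-ddTHH:mm:ss'),
          '.csv')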


Fig 1: Dynamic content to append the timestamp



When I ran the pipeline, it appended the timestamp to the end of the file name and saved the file in blob storage, and I was able to view the data from the file customer_2020-09-28T12:15:32.csv.


Fig 2: Preview of the data



In Azure Data Lake Storage, a file name with ':' did not give any issue while creating the file, nor while viewing it.

However, as soon as I used that file in the Data Flow activity and debugged the code (when the Apache Spark engine fires), I got the error below: java.lang.IllegalArgumentException: java.net.URISyntaxException: ...


Fig 3: java.lang.IllegalArgumentException

How to resolve the error?

You can't have a file name with ':' in it, which is also true if you try to create a file with ':' in the Windows OS. Windows will throw an error; interestingly, that's not the case when you create a file with the same name in Azure Data Lake Storage (with HDFS working behind the scenes).

Fig 4: Error in Windows machine

To resolve this, I updated the format of the timestamp part that is appended to the end of each file name: instead of using yyyy-MM-ddTHH:mm:ss, I used yyyy-MM-ddTHHmmss, so I get the file name customer_2020-09-28T121532.csv.
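With the same made-up FileName parameter as in the sketch earlier, only the format string changes:

  @concat(replace(pipeline().parameters.FileName, '.csv', ''),
          '_',
          formatDateTime(utcnow(), 'yyyy-MM-ddTHHmmss'),
          '.csv')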

Saturday, August 22, 2020

How to solve the Azure Data Factory error: 'Sink dataset filepaths cannot contain a file name'?

I was building a pipeline that took data from an SFTP folder and sank it into the Cleansed layer, where the file format was .parquet. As soon as I validated the pipeline, I got the below error: 'Sink dataset filepaths cannot contain a file name, please remove the file name from {DatasetName}'.

You will find an error like this:

Fig 1: Sink dataset error


If you get this error, you may be confused at first, since the file name appears in two different places: one in the Sink activity settings and the other in the dataset.

Let me guide you to those settings so you can fix this error. If you look at the Settings of the Sink activity, you will find something like Fig 2 below.



Fig 2: Settings of Sink activity



The error message said the filepath can't contain a file name, but remember, this is not the dataset; it's the settings of the Sink activity, so you don't need to remove the file name here. And if you want to keep the output in a single file, choose 'Output to single file'.

To fix the issue, you need to click the Sink and open the dataset, as in Fig 3 below:

Fig 3: open data set


Now you can find the file name under the dataset, as in Fig 4 below. Remove the file name, save, and validate.

Fig 4: Remove the file name from the dataset
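For reference, a minimal sketch of what the fixed Parquet dataset JSON could look like (the dataset, linked service, container, and folder names are all made up for illustration); the point is that typeProperties.location keeps folderPath but has no fileName:

  {
    "name": "CleansedCustomerParquet",
    "properties": {
      "type": "Parquet",
      "linkedServiceName": {
        "referenceName": "AzureDataLakeStorage",
        "type": "LinkedServiceReference"
      },
      "typeProperties": {
        "location": {
          "type": "AzureBlobFSLocation",
          "fileSystem": "cleansed",
          "folderPath": "customer"
        }
      }
    }
  }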

Now the error will disappear. Lesson learned: read the error message more than once and locate the correct place to fix the issue.







Sunday, July 26, 2020

How to deploy a SQL Server Big Data Cluster by using Azure Data Studio?

There are different ways to deploy a SQL Server 2019 Big Data Cluster; this blog post will use Azure Data Studio to deploy one.

The very first thing you need is Azure Data Studio. If you don't have it installed, please find it here: https://docs.microsoft.com/en-us/sql/azure-data-studio/download-azure-data-studio?view=sql-server-ver15


Step 1: After you open Azure Data Studio you will find the UI below:



Fig 1: Azure Data Studio

Step 2: Choose SQL Server Big Data Cluster


Fig 2: Choose SQL Server Big Data Cluster    

Step 3: Missing prerequisites

  • kubectl
  • Azure CLI
  • azdata
Please install those; a quick verification sketch follows below.
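Once installed, a quick way to confirm the tools are available (a sketch; note that kubectl can also be installed through the Azure CLI, which is how I show it here):

  # install kubectl by using the Azure CLI (requires the Azure CLI first)
  az aks install-cli

  # verify each prerequisite
  kubectl version --client
  az --version
  azdata --version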

Fig 3: Missing prerequisite



After the prerequisites have been installed, you will find the UI below:

Fig 4: After the prerequisites are installed



Step 4: Deployment Configuration Profile



Fig 5: Configuration profile


Step 5.A: Azure settings:
This is where the resource group will be created. Make sure you have enough permission to create a resource group; if you don't, the deployment will fail.

Fig 6: Azure settings

If it fails, the error message will look like below:

Fig 6.1: Error while creating the resource group

Step 5.B: Cluster settings: here you set up the cluster name and credentials, which you will need later to connect to the cluster, so please keep them safe.

Fig 7: SQL Server cluster settings



Step 5.C: Service and storage settings: these will be filled in automatically, but please adjust them as per your needs.

Fig 8: Service settings.



Step 6: Script to Notebook

This is the last step of the wizard: as soon as you click 'Script to Notebook', it will open a notebook in Azure Data Studio.
Fig 9: Script to Notebook


However, if this is the first time you are deploying a Big Data Cluster, then Python needs to be installed, so you will see the UI below:
Fig 10: install python


When you are done, go and hit 'Run All'.


Fig 11: Execute the script


You may get an error where pandas is still missing:


Fig 12: Pandas is missing

So you need to install the pandas package from Azure Data Studio. Go to the package manager:
Fig 13: Finding package manager in Azure data studio

Then find pandas under the package manager; as soon as you find it, hit the Install button.

Fig 14: Install pandas
When the installation is done, hit 'Run All' in the notebook again. This time it will start running successfully. One of the steps will log in to the Azure portal, automatically redirecting you there, and you will see a UI like the one below, where you don't need to do anything.

Fig 15: Azure login
When the deployment is over, your Big Data Cluster should be ready to use, so you need to connect to it. You will find all the endpoints after the deployment is completed; please take the endpoint for 'SQL Server Master Instance Front-End'.

Fig 16: Click to connect the Cluster
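If you prefer the command line, the endpoints can also be listed with azdata (a sketch, using the cluster name mssql-cluster that also appears in the delete command later in this post):

  # log in to the cluster first
  azdata login -n mssql-cluster

  # list all endpoints, including 'SQL Server Master Instance Front-End'
  azdata bdc endpoint list -o table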

After connecting to the Big Data Cluster, it will look like below:

Fig 17: Connect with big data cluster

If you would like to remove the cluster, you can either delete it by using the azdata command in a command shell:


Delete cluster:

azdata bdc delete -n mssql-cluster

Fig 18: delete cluster



Or you can completely remove the resource group from the Azure portal.

Saturday, June 20, 2020

Adding Active Directory users to Azure SQL databases

If you are working with Azure SQL Database, you will find adding a user a bit different than in a general SQL database, especially adding Active Directory users. It's easy and only a few steps, of course, if you know the steps. While working on it I went through different routes, so I thought about sharing the easy one, which worked like a gem.

There are two authentication approaches for Azure SQL DB and SQL Managed Instance:

1) SQL authentication
2) Azure Active Directory authentication


Fig 1.0: Authentication type in Azure


SQL authentication is straightforward: while you are creating the database in Azure, you must provide SQL Server credentials, which become your Azure database credentials if you don't create database credentials separately. As long as you have administrative credentials to log in to the database, you can create all the necessary users. However, you can't create Active Directory users in Azure while logged in as a SQL-authenticated user.

For example, I logged in with the SQL admin user and then tried to execute the query below to add an Active Directory user:

CREATE USER [user1@domain.com] 
FROM EXTERNAL PROVIDER 
WITH DEFAULT_SCHEMA = dbo;  
  
--add user to role(s) for the particular database
ALTER ROLE dbmanager ADD MEMBER [user1@domain.com]; 
ALTER ROLE loginmanager ADD MEMBER [user1@domain.com]; 

However, found below error:

Fig 2.0: Error creating AD user

The error message is self-explanatory: "Principal 'user1@domain.com' could not be created. Only connections established with Active Directory accounts can create other Active Directory users."

So, how do you solve it? Or, how do you add the AD user to Azure SQL DB?

Step 1: Log in to your Azure portal and find your SQL Server resource; you will find 'Active Directory admin' on the left side under Settings. Click 'Active Directory admin' to get a UI like the one below, and click 'Set admin'. Now your Active Directory user account has become an admin of the SQL Server.

Fig 3.0:  Active Directory Admin

Step 2: Now use either SSMS or Azure Data Studio and log in with Active Directory authentication. Since my organization has Multi-Factor Authentication (MFA) enabled, I chose 'Azure Active Directory - Universal with MFA', but you can choose either 'Active Directory - Integrated' or 'Active Directory - Password'.

Fig 4: Login with AD user


Step 3: Now you can run the query to add the Active Directory user and grant permissions as you want. Traditional database-level roles like db_datareader, db_datawriter, db_ddladmin, etc. are the same in an Azure database, but roles like sysadmin, serveradmin, etc. don't exist in Azure SQL Database. There are two admin roles: dbmanager (similar to dbcreator), which can create and drop databases, and loginmanager (similar to securityadmin), which can create new logins, as you can see in the code below:


CREATE USER [user1@domain.com] 
FROM EXTERNAL PROVIDER 
WITH DEFAULT_SCHEMA = dbo;  
  
-- add user to role(s) for the particular database
ALTER ROLE dbmanager ADD MEMBER [user1@domain.com]; 
ALTER ROLE loginmanager ADD MEMBER [user1@domain.com]; 


As soon as you execute the above code, the AD user gets all the necessary access to perform its activities.



  

Saturday, May 23, 2020

Parameterize the data source in Power BI, and tricks when the source is Salesforce

Parameters in Power BI can be used for different purposes; one useful case is to hold an environment variable. For example, while you are building a Power BI report the source can be connected to the DEV environment, and as soon as the report is ready to release you may want to connect the source to the PRE-PROD/PROD environment. If you use a parameter, this becomes easy to handle.

How do you create a parameter?

Open the Power Query editor, find 'Manage Parameters', and create a new parameter, as in Fig 1.0 below.

Fig 1.0: Find New Parameter

My source is Salesforce, and I have three different environments: QA, UAT, and PROD. First provide the name of the parameter (e.g. my parameter name is Environment), then choose 'Text' as the Type, and for Suggested values choose 'List of values', as in Figure 2.0 below.

Fig 2.0: Creating parameter for different environment


Under 'List of values' you list all your source environments, and you need to set the default and current values from that list. For example, if you would like to use the QA environment, then you need to choose the URL for QA for both the default and the current value.

Fig 3.0: Default and current value


As soon as you click the 'OK' button, you are done. Yes, you are done with the current value you set up; if you are connected to QA, that works fine. And if you later change the current value from QA to UAT, that also works fine when your source is a database, but not, as you might expect, when the source is Salesforce.
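For context, a parameterized Salesforce source in Power Query M might look like the sketch below (the 'Account' table is just an example), with the Environment parameter passed as the login URL:

  let
      // Environment is the text parameter holding the QA/UAT/PROD login URL
      Source = Salesforce.Data(Environment),
      Account = Source{[Name = "Account"]}[Data]
  in
      Account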

What is special about Salesforce as a source?

After changing the source from QA to UAT and refreshing the model, I got the error below:

Fig 3.1: Error after changing the environment


How do you fix it, or what do you do after updating the environment for a Salesforce source?

If you run into the same situation, go to the Power Query editor, then go to the source query for each table, as in Figure 4.0 below, and re-validate it by pressing Enter or the validation button (the tick mark in the figure below). You will find the error disappears; it can now identify the correct environment to connect to.

Fig 4.0: Refresh data source 

Update the environment from the Power BI service:
Since you have created the parameter, you can update the parameter value from the Power BI service as well. First log in to the Power BI portal and find your published dataset, as in Figure 5.0 below.

Fig 5.0: Settings under Datasets

Clicking the ellipsis (...), you will find Settings:

                              
Fig 6.0: Settings menu


Under Settings you will see the parameter, where you can update the environment: replace the UAT URL with the PROD URL and click Apply, which will connect to the PROD environment and refresh your Power BI model.


Sunday, April 26, 2020

Basic differences between SQL and NoSQL.

For around three decades, the relational databases from Oracle, IBM, and Microsoft were the consistent leaders among operational database management systems. However, all of these vendors now offer a NoSQL platform parallel to their traditional database platform; e.g. Microsoft has Azure Cosmos DB as its NoSQL database, and so on.

Journey of NoSQL:
 
The world of NoSQL emerged in 1998, and it gained popularity in the 2010s. In 2015, the popular NoSQL database provider MongoDB appeared in Gartner's Magic Quadrant as a leader for operational database management systems.

Fig 1: Gartner's Magic Quadrant in 2015


Recent trends show that NoSQL is getting more popular in the big data landscape than relational database management systems.



What is NoSQL and why?

NoSQL stands for 'Not only SQL'. A NoSQL database stores both structured and semi-structured data.

NoSQL stores data in one of four categories:
  1. Key-Value storage
  2. Document storage
  3. Wide Column storage
  4. Graph database
The most popular NoSQL database, MongoDB, is document storage; however, it can also be used as a key-value store, which is just a subset of its features.

Today data has become much broader in terms of shape and size. Data is spread around documents, social media, complex web applications, and IoT sources. A traditional SQL database can't handle this data because of its continuously changing behavior: a traditional database needs to know the shape of the data (the schema) beforehand in order to store it, hence it fails to capture continuously evolving data. And thus NoSQL emerged to conquer the world!

Fixed schema vs. No Schema

For a structured database (a traditional SQL database) you need to have the schema ready before you insert data into a table, but for NoSQL you don't need to create a schema first; you can directly insert the data. Though it's better to say 'schema-agnostic' than 'schemaless': e.g. Microsoft's NoSQL database Azure Cosmos DB is known as a schema-agnostic database, which is not bound by schemas but is aware of them.


Relationship vs. No Relationship

Traditional SQL requires relationships to be maintained in order to perform; that's why normalization is there. NoSQL, on the other hand, doesn't require relationships to be maintained: you can embed several tables into one document in a NoSQL database, as the sketch below shows.
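For illustration only (the field names and values here are made up), a customer and their orders, which would be two related tables in a normalized SQL schema, can live in a single NoSQL document:

  {
    "customerId": "C-1001",
    "name": "Jane Smith",
    "orders": [
      { "orderId": "O-5001", "amount": 120.50, "status": "shipped" },
      { "orderId": "O-5002", "amount": 89.99, "status": "pending" }
    ]
  }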


The table below (Fig 2) shows basic differences between SQL and NoSQL databases:
Fig 2: Comparison between SQL and NoSQL



Conclusion: NoSQL denotes 'Not only SQL', meaning a NoSQL database can do what a traditional relational database can do, and more. But these two types of database have different strengths, and you can choose between them as per your business requirements. A way to find out which fits best could be to ask the question: do you know the shape of the data well in advance? If the answer is 'Yes', which means you can define the schema early and build the relationships before the data arrives, then it's good to choose traditional SQL. On the other hand, when you don't know the shape of the data and its behavior changes continuously, go with NoSQL. Nevertheless, some complex enterprise solutions can be built by using both SQL and NoSQL databases to leverage the best of each.