Sunday, July 26, 2020

How to deploy a SQL Server Big Data Cluster by using Azure Data Studio?

There are different ways to deploy a SQL Server 2019 Big Data Cluster; this blog post uses Azure Data Studio to do the deployment.

The very first thing you need is Azure Data Studio. If you don't have it installed yet, you can download it here: https://docs.microsoft.com/en-us/sql/azure-data-studio/download-azure-data-studio?view=sql-server-ver15


Step 1: After you open Azure Data Studio you will see the UI below:



Fig 1: Azure data studio

Step 2: Choose SQL Server Big Data Cluster


Fig 2: Choose SQL Server Big Data Cluster    

Step 3: Missing prerequisites

  • kubectl
  • Azure CLI
  • azdata
Please install any of these that are missing; a quick way to verify them is shown below.
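
If you want to confirm the tools are available before rerunning the wizard, a quick check from a command prompt looks like this (a sketch; the exact version output will vary with your installation):

kubectl version --client    # Kubernetes command-line tool
az --version                # Azure CLI
azdata --version            # azdata CLI, used to deploy and manage the cluster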

Fig 3: Missing prerequisites



After the prerequisites have been installed, you will see the UI below:

Fig 4: After prerequisites are installed



Step 4: Deployment Configuration Profile



Fig 5: Configuration profile


Step 5.A: Azure Settings:
This is where the resource group will be created. Make sure you have sufficient permissions to create a resource group; if you don't, the deployment will fail.

Fig 6: Azure settings

If it fails, the error message will look like the one below:

Fig 6.1: Error while creating resource group

Step 5.B: Cluster Settings: Here you set the cluster name and the credentials you will need later to connect to the cluster, so please keep them safe.

Fig 7: SQL Server cluster settings



Step 5.C: Service and storage settings: These are filled in automatically, but please adjust them to your needs.

Fig 8: Service settings.



Step 6: Script to Notebook

This is the last step of the wizard; as soon as you click 'Script to Notebook', a notebook opens in Azure Data Studio.
Fig 9: Script to Notebook


However, if this is the first time you are deploying a big data cluster, Python needs to be installed, so you will see the UI below:
Fig 10: Install Python


When that is done, go ahead and hit 'Run All'.


Fig 11: Execute the script


You may run into an error because the pandas package is still missing:


Fig 12: Pandas is missing

So you need to install the pandas package from Azure Data Studio. Go to the package manager:
Fig 13: Finding package manager in Azure data studio

Then find pandas under the package manager and hit the Install button.

Fig 14: Install pandas
When the installation is done, hit 'Run All' in the notebook again. This time the notebook runs successfully; one of the steps logs in to the Azure portal, which opens automatically in the browser, and you will see a UI like the one below where you don't need to do anything.

Fig 15: Azure login
When the deployment is over, your big data cluster is ready to use, and the next step is to connect to it. All the endpoints are listed after the deployment completes; take the endpoint for 'SQL Server Master Instance Front-End'.
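
If you need the endpoint list again later, it can also be retrieved from the command line with azdata (a sketch: azdata login prompts for the cluster namespace, mssql-cluster by default, and the credentials you set in step 5.B):

azdata login
azdata bdc endpoint list -o table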

Fig 16: Click to connect the Cluster

After connecting to the big data cluster, it will look like the screen below:

Fig 17: Connect with big data cluster

If you would like to remove the cluster, you can delete it by using the azdata command in a command shell:


Delete cluster:

azdata bdc delete -n mssql-cluster

Fig 18: Delete cluster



Or you can completely remove the resource group from the Azure portal.
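
Since the Azure CLI is already installed as a prerequisite, the resource group can also be removed from the command line; a minimal sketch, assuming a hypothetical resource group named my-bdc-rg:

az group delete --name my-bdc-rg --yes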

Saturday, June 20, 2020

Adding Active Directory users to Azure SQL databases

If you are working with Azure SQL Database, you will find adding users a bit different from a regular SQL Server database, especially adding Active Directory users. It's easy and only a few steps, of course, if you know the steps. While working on it I went through different routes, so I thought I would share the easy one, which worked like a gem.

There are two authentication approaches for Azure SQL DB and SQL Managed Instance:

1) SQL Authentication
2) Azure Active Directory Authentication


Fig 1.0: Authentication type in Azure


SQL authentication is straightforward: while creating the database in Azure, you must provide a SQL server credential, which becomes your Azure database credential if you don't create a separate database credential. As long as you have an administrative credential to log in to the database, you can create all the necessary users. However, you can't create Active Directory users in Azure while logged in as a SQL-authenticated user.

For example, I logged in with the SQL admin user and then tried to execute the query below to add an Active Directory user:

CREATE USER [user1@domain.com] 
FROM EXTERNAL PROVIDER 
WITH DEFAULT_SCHEMA = dbo;  
  
--add user to role(s) for the particular database
ALTER ROLE dbmanager ADD MEMBER [user1@domain.com]; 
ALTER ROLE loginmanager ADD MEMBER [user1@domain.com]; 

However, it returned the error below:

Fig 2.0: Error creating AD user

The error message is self-explanatory: "Principal 'user1@domain.com' could not be created. Only connections established with Active Directory accounts can create other Active Directory users."

So, how do you solve it? How do you add an AD user to Azure SQL DB?

Step 1: Log in to the Azure portal and find your SQL server resource; you will see 'Active Directory admin' on the left side under Settings. Click 'Active Directory admin' to open a UI like the one below, then click 'Set admin'. Your Active Directory user account is now the admin of the SQL server.

Fig 3.0:  Active Directory Admin

Step 2: Now use either SSMS or Azure Data Studio and log in with Active Directory authentication. Since my organization has Multi-Factor Authentication (MFA) enabled, I chose 'Azure Active Directory - Universal with MFA', but you can also choose 'Active Directory - Integrated' or 'Active Directory - Password'.

Fig 4: Login with AD user


Step 3: Now you can run the query to add an Active Directory user and grant whatever permissions you want. Traditional database-level roles like db_datareader, db_datawriter, db_ddladmin, etc. are the same in an Azure database, but roles like sysadmin, serveradmin, etc. don't exist in Azure SQL Database. There are two admin roles: dbmanager (similar to dbcreator), which can create and drop databases, and loginmanager (similar to securityadmin), which can create new logins, as you can see in the code below:


CREATE USER [user1@domain.com] 
FROM EXTERNAL PROVIDER 
WITH DEFAULT_SCHEMA = dbo;  
  
-- add user to role(s) for the particular database
ALTER ROLE dbmanager ADD MEMBER [user1@domain.com]; 
ALTER ROLE loginmanager ADD MEMBER [user1@domain.com]; 


As soon as you execute the above code, the AD user has all the necessary access to perform its activities.
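
If a user should only read and write data rather than administer the database, the same pattern works with the traditional database-level roles mentioned in step 3; a small sketch with a hypothetical second user:

CREATE USER [user2@domain.com]
FROM EXTERNAL PROVIDER
WITH DEFAULT_SCHEMA = dbo;

-- grant read and write access only, instead of the admin roles
ALTER ROLE db_datareader ADD MEMBER [user2@domain.com];
ALTER ROLE db_datawriter ADD MEMBER [user2@domain.com];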



  

Saturday, May 23, 2020

Parameterize data sources in Power BI, and tricks when the source is Salesforce

Parameters in Power BI can be used for different purposes; one useful case is holding an environment variable. For example, while you are building a Power BI report the source can be connected to the DEV environment, and as soon as the report is ready to release you may want to connect the source to the PRE-PROD/PROD environment. If you use a parameter, this becomes easy to handle.

How to create a parameter?

Open the Power Query editor, find 'Manage Parameters', and create a new parameter as shown below in fig 1.0:

Fig 1.0: Find New Parameter

My source is Salesforce, and I have three different environments: QA, UAT, and PROD. First provide the name of the parameter (mine is Environment), then choose 'Text' as the Type, and for Suggested Values choose 'List of values', as in figure 2.0 below.

Fig 2.0: Creating parameter for different environment


Under 'List of values', list all of your source environments; the Default and Current values must then be set from that list. For example, if you would like to use the QA environment, choose the QA URL for both the default and the current value.

Fig 3.0: Default and current value
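
For a Salesforce source, the values in this list are typically the environment login URLs, for example https://test.salesforce.com for a sandbox (QA/UAT) and https://login.salesforce.com for production; the exact URLs depend on how your Salesforce environments are set up.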


As soon as you click the 'OK' button you are done, at least with the current value you set up; if you are connected to QA, that works fine. Changing the current value from QA to UAT also works fine if your source is a database, but not, as you might expect, when the source is Salesforce.

What is special about Salesforce as a source?

After changing the source from QA to UAT and refreshing the model, I got the error below:

Fig 3.1: Error after changing environment


How do you fix it? What do you do after updating the environment for a Salesforce source?

If you run into the same situation, go to the Power Query editor, open the source query for each table as in figure 4.0 below, and re-validate it by pressing Enter or clicking the validation button (the tick mark in the figure). You will see the error disappear; Power BI can now identify the correct environment to connect to.

Fig 4.0: Refresh data source 
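
For reference, the source step you are re-validating looks something like the M expression below (a sketch, assuming the standard Salesforce.Objects connector and the Environment parameter created earlier; the Account table is just an example object):

let
    // Environment is the text parameter holding the Salesforce login URL
    Source = Salesforce.Objects(Environment),
    // navigate to the object your report actually uses
    Account = Source{[Name = "Account"]}[Data]
in
    Account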

Update Environment from Power BI service:
Since you have created the parameter, you can update its value from the Power BI service as well. First log in to the Power BI portal and find your published dataset, as in figure 5.0 below.

Fig 5.0: Settings under Datasets

Clicking the ellipsis (...) you will find Settings.

                              
Fig 6.0: Settings menu


Under Settings you will see the parameter, where you can update the environment: replace the UAT URL with the PROD URL and click Apply, which connects the model to the PROD environment and lets you refresh your Power BI model.


Sunday, April 26, 2020

Basic differences between SQL and NoSQL.

For around three decades, the relational databases from Oracle, IBM, and Microsoft were the consistent leaders in operational database management systems. Currently, however, all of these vendors offer a NoSQL platform in parallel to their traditional database platform; Microsoft, for example, has Azure Cosmos DB as its NoSQL database.

Journey of NoSQL:
 
The world of NoSQL emerged in 1998, and it gained popularity in the 2010s. In 2015, the popular NoSQL database provider MongoDB appeared in Gartner's Magic Quadrant as a leader for operational database management systems.

Fig 1: Gartner's Magic Quadrant in 2015


Recent trends show that NoSQL is becoming more popular in the big data landscape than relational database management systems.



What is NoSQL, and why?

NoSQL stands for 'Not only SQL'. A NoSQL database stores both structured and semi-structured data.

NoSQL stores data in one of four categories:
  1. Key-Value storage
  2. Document storage
  3. Wide Column storage
  4. Graph database
The most popular NoSQL database, MongoDB, is a document store; however, it can also be used as a key-value store, which is just a subset of its features.
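
For orientation, well-known examples in each category are Redis (key-value), MongoDB and Azure Cosmos DB (document), Apache Cassandra (wide column), and Neo4j (graph).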

Today's data has become much broader in shape and size: it is spread across documents, social media, complex web applications, and IoT sources. A traditional SQL database can't handle this data well because its shape changes continuously. A traditional database needs to know the shape of the data (the schema) beforehand in order to store it, so it fails to capture continuously evolving data. And thus NoSQL emerged and conquered the world!

Fixed schema vs. No Schema

For a structured database (a traditional SQL database) you need the schema ready before you insert data into a table, but with NoSQL you don't need to create a schema first; you can insert the data directly, as the sketch below shows. Though it is better to say 'schema agnostic' than 'schemaless': Microsoft's NoSQL database Azure Cosmos DB, for example, is known as a schema-agnostic database, which is not bound by schemas but is aware of them.
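
To make the contrast concrete, here is a minimal T-SQL sketch with a hypothetical Customer table: the relational engine rejects any insert until the schema exists, whereas a document store would accept the record directly.

-- the schema must exist first...
CREATE TABLE Customer
(
    CustomerId INT PRIMARY KEY,
    Name       NVARCHAR(100)
);

-- ...before any data can be inserted
INSERT INTO Customer (CustomerId, Name)
VALUES (1, N'Contoso');

-- a document store would simply accept {"CustomerId": 1, "Name": "Contoso"} with no CREATE TABLE step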


Relationship vs. No Relationship

A traditional SQL database requires relationships to be maintained in order to perform; that's why normalization exists. NoSQL, by contrast, doesn't require relationships to be maintained: you can embed what would be several tables into one document in a NoSQL database.


The table below (fig 2) shows the basic differences between SQL and NoSQL databases:
Fig 2: Comparison between SQL and NoSQL



Conclusion: NoSQL denotes 'Not only SQL', meaning a NoSQL database can do what a traditional relational database does, as well as more. But the two types of database have different strengths, and you can choose between them based on your business requirements. One way to find out which fits best is to ask: do you know the shape of the data well in advance? If the answer is 'yes', meaning you can define the schema and build the relationships before the data arrives, then a traditional SQL database is a good choice. On the other hand, when you don't know the shape of the data and its behavior changes continuously, go with NoSQL. Nevertheless, some complex enterprise solutions are built using both SQL and NoSQL databases to leverage the best of each.




Saturday, March 21, 2020

Finding unused Schemas in SQL database

The fun part of a developer's life is that what is unknown today, you know tomorrow by learning and doing, and by learning the unknown and implementing it you become an expert :) I will share something I learned and developed recently. Long ago I was given a task to find the used schemas in a database, where the solution was pretty straightforward. This time, however, the task was to get the unused schemas. At first I thought this must be fun; I had never done it, but someone must have, so I started with Google and got some clues, though without a proper explanation, so I thought I would write it up.

To find the solution, you need to go through three system catalog views:
1) sys.schemas
2) sys.database_principals
3) sys.objects


The easy way to find unused schemas is this: if a schema_id from sys.schemas is not found in sys.objects, that schema is never used, and both views have the schema_id column. Wow, we know the solution! Well, almost 😊 Why do we need sys.database_principals? A principal's default schema is the schema used when a SQL name does not specify one, so schemas referenced that way must be excluded too. The relationships among the three views can be depicted with the diagram below:

Fig 1: Relationship among tables

That covers the theory; let's see it in action. Below is the SQL code to find the unused schemas in a database:

SELECT SCH.name
FROM sys.schemas SCH
WHERE SCH.schema_id NOT BETWEEN 1 AND 4             -- skip dbo, guest, INFORMATION_SCHEMA, sys
  AND SCH.principal_id NOT BETWEEN 16384 AND 16399  -- skip the fixed db_* role schemas; don't touch these
  AND NOT EXISTS
      ( SELECT 1
        FROM sys.database_principals P
        WHERE P.default_schema_name = SCH.name      -- schema used when a SQL name does not specify one, so exclude it
      )
  AND NOT EXISTS
      ( SELECT 1
        FROM sys.objects O
        WHERE O.[schema_id] = SCH.[schema_id]       -- no object uses the schema, so the schema is unused
      )
ORDER BY SCH.name


Now that you know how to find unused schemas, you can delete them if you want.
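
A minimal sketch of the cleanup, assuming a hypothetical unused schema named Staging returned by the query above (DROP SCHEMA only succeeds when the schema contains no objects, which is exactly what the query guarantees):

DROP SCHEMA [Staging];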

Wednesday, February 19, 2020

What are Linked Services in Azure Data Factory (ADF)?

Suppose you are new to Azure Data Factory and have started to build your very first pipeline, and assume you have already identified the source and target. Your source data may come from an on-premises or cloud database, from blob storage, or from a Web API. Whether you connect to a storage account or an on-premises/cloud database, you require a linked service. A linked service is like a connection string.

For example, if you would like to get data from blob storage and insert it into a cloud database, you will require two linked services: one is the source connection from blob storage to the data factory, and the other is the target connection from the data factory to the database.
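
To see why a linked service is 'like a connection string', here is a minimal sketch of the JSON definition behind a blob storage linked service (the name and placeholder values are illustrative, not from a real factory):

{
    "name": "SourceBlobStorage",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<storage account>;AccountKey=<account key>"
        }
    }
}

The connection details live in typeProperties; a second linked service pointing at the target database would follow the same shape with a different type.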


Creating a linked service is easy, just a few steps through a wizard; however, one of the steps is creating or selecting something called an integration runtime. Since there are three different types of integration runtime (IR), you may be confused about which one to choose.

Let's explain the three different types of integration runtime:

1. Azure
2. Self-Hosted
3. Linked Self-Hosted

Please note that there is one more integration runtime, called the Azure-SSIS Integration Runtime; this IR is required when you lift and shift your SSIS packages to Azure. We will not cover the SSIS IR in this post.

1. Azure
It's the default IR, meaning that if you don't create anything, one is created automatically with the name 'AutoResolveIntegrationRuntime', as in figure 1 below.

Fig 1: AutoResolveIntegrationRuntime

If the data integration capabilities (Data Flow, data movement, and activity dispatch) run over a public network, the Azure IR is used. If an Azure IR is not created automatically, you can create one manually by clicking '+New', as in fig 2 below,


Fig 2: Choose to create IR

and you will see the window below for creating an Azure IR (fig 3).

Fig 3: Choose from different IR

When the IR is created, you can find it running under Data Factory -> Connections:

Fig 4: IR running under Data Factory



2. Self-Hosted:

If the source or target is on a private network, e.g. a database that resides on-premises, you must create a self-hosted IR. There are a few steps to create one; while working with Azure Data Factory, you can create it via the Data Factory UI.
Please follow this link to create a self-hosted IR: https://docs.microsoft.com/en-us/azure/data-factory/create-self-hosted-integration-runtime#create-a-self-hosted-ir-via-azure-data-factory-ui



3. Linked Self-Hosted: 

If you already have a self-hosted IR and need to use it in another data factory, you don't need to create a new self-hosted IR; instead, you can link the existing one to the other data factory.


Saturday, January 18, 2020

Azure Data Factory: How to add and remove code repository to ADF

If you are working with Azure Data Factory (ADF), the very first step of development is to connect a code repository. Well, you can develop without connecting a code repository, but then there is no collaboration or version control.

In general, when you work with ADF and publish, the code is saved in the ADF service, but there is no repository, which means you can't see who saved the code or which version of the code you are working with.
For example, fig 01 below gives me the option to choose ADF mode or to use GitHub. If I choose ADF mode there will be no repository for the project; with GitHub, however, I will have one.


Fig 01: ADF repo or ADF mode

Step 1: Create the source repository

If you have a Visual Studio subscription, you can get a free Azure DevOps account and create a repository there; if not, please go to GitHub (https://github.com/) and create a repository, which is completely free. Creating a repository on GitHub is very simple, just a few clicks.


Step 2: Log in to Azure Data Factory and choose the relevant repository type

Let's log in to Azure Data Factory; after clicking 'Author & Monitor' you get the page for the development work. At the top left of the page, as in fig 02 below, you will find the option 'Set up code repository'.

Fig 02: Set up Code repository

You get the option to choose either Azure DevOps Git or GitHub (see fig 02 above). If you already have a Visual Studio account, you can choose Azure DevOps; otherwise you need to use GitHub.

For example, in my case I logged in to the Azure portal with my personal account, which has no association with a Visual Studio account, so when I chose 'Azure DevOps' nothing showed up (fig 03).
Fig 03: Azure DevOps requires a VS account


Step 3: Settings for the chosen repository type

In the next step, having chosen the repository type 'GitHub', fill in the necessary information as shown in figure 04.


Fig 04: Detail for GitHub

The GitHub repository is now added to the Azure data factory, making it easy to collaborate on and deploy the pipelines.

Fig 05: Repository is connected


Removing Code Repository:

The code repository has been added to ADF; however, what if for some reason you want to remove it?


You need to find 'Repo settings'. Come out of the Author tab and switch to the Home tab to find 'Repo settings', as in figure 06.

Fig 06: Find Git repo


As soon as you click 'Repo settings', you will see the screen below (fig 07) with an option to remove the repository; just click 'Remove Git' and you are done!

Fig 07: Removing Git repo