
Sunday, April 26, 2020

Basic differences between SQL and NoSQL.

For around three decades, relational databases from Oracle, IBM, and Microsoft were the consistent leaders in Operational Database Management Systems. However, all of these vendors now offer a NoSQL platform alongside their traditional database platform, e.g. Microsoft offers Azure Cosmos DB as its NoSQL database, and so on.

Journey of NoSQL:
 
The world of NoSQL emerged in 1998, and it gained popularity in the 2010s. In 2015, the popular NoSQL database provider MongoDB appeared in Gartner's Magic Quadrant as a leader for operational database management systems.

Fig 1: Gartner's Magic Quadrant in 2015


Recent trends show that NoSQL is becoming more popular than relational database management systems in the big data landscape.



What is NoSQL and why?

NoSQL stands for 'Not only SQL'. A NoSQL database stores both structured and semi-structured data.

NoSQL stores data in one of four categories:
  1. Key-Value storage
  2. Document storage
  3. Wide Column storage
  4. Graph database
The most popular NoSQL database, MongoDB, is a document store; however, it can also be used as a key-value store, which is simply a subset of its features.

Today, data has become broader in shape and size. It is spread across documents, social media, complex web applications, and IoT sources. A traditional SQL database struggles with this data because its behavior changes continuously. A traditional database needs to know the shape of the data (the schema) beforehand in order to store it, so it fails to capture continuously evolving data. And thus, NoSQL emerged to conquer the world!

Fixed schema vs. No Schema

For a structured database (a traditional SQL database) you need to have the schema ready before you insert data into a table, but for NoSQL you don't need to create a schema first; you can insert the data directly. Even so, 'schema agnostic' is a better term than 'schemaless'. For example, Microsoft's NoSQL database Azure Cosmos DB is known as a schema-agnostic database: it is not bound by schemas, but it is aware of them.
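As a minimal illustration in T-SQL (the table and column names below are made up for the example), a relational engine needs the schema to exist before any row can be inserted, whereas a document store would simply accept a new document with whatever fields it carries:

-- Relational: the schema must be defined before any data can be stored.
CREATE TABLE dbo.Customer
(
    CustomerId INT          NOT NULL PRIMARY KEY,
    Name       VARCHAR(100) NOT NULL
);

-- Only now does an insert succeed, and it must match the declared columns.
INSERT INTO dbo.Customer (CustomerId, Name)
VALUES (1, 'Contoso');

-- A document store, by contrast, would accept something like
-- { "customerId": 1, "name": "Contoso", "loyaltyTier": "Gold" }
-- without any table or column definition created up front.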


Relationship vs. No Relationship

A traditional SQL database requires relationships to be maintained in order to perform well, which is why normalization exists. A NoSQL database, on the other hand, doesn't require you to maintain relationships: you can embed what would be several tables into a single document in a NoSQL database.
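As a rough sketch of what 'embedding' means, SQL Server itself can flatten two normalized tables into one nested document with FOR JSON AUTO; the Customer and Order tables below are hypothetical:

-- Two normalized tables joined and shaped into a single nested document,
-- which is roughly how a document store would hold the same data.
SELECT c.CustomerId,
       c.Name,
       o.OrderId,
       o.Amount
FROM dbo.Customer AS c
JOIN dbo.[Order]  AS o
    ON o.CustomerId = c.CustomerId
FOR JSON AUTO;

-- Sample shape of the output: one customer with its orders embedded, e.g.
-- [{"CustomerId":1,"Name":"Contoso","o":[{"OrderId":10,"Amount":250.00}]}]

In a NoSQL document database, that nested document would simply be stored as-is instead of being split across related tables.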


The table below (Fig 2) shows basic differences between SQL and NoSQL databases:
Fig 2: Comparison between SQL and NoSQL



Conclusion: NoSQL stands for 'Not only SQL', which means a NoSQL database can do what a traditional relational database can do, and more. But the two types of database have different strengths, and you can choose between them based on your business requirements. One way to find out which fits best is to ask: do you know the shape of your data well in advance? If the answer is 'yes', meaning you can define the schema and build the relationships before the data arrives, then a traditional SQL database is a good choice. On the other hand, if you don't know the shape of the data and its behavior changes continuously, go with NoSQL. Nevertheless, some complex enterprise solutions use both SQL and NoSQL databases to leverage the best of each.




Saturday, March 21, 2020

Finding unused Schemas in SQL database

A fun part of a developer's life is: 'What is unknown today, you know tomorrow by learning and doing.' And by learning the unknown and implementing it, you become an expert :) I will share something I learned and developed recently. A long time ago I got a task to find the used schemas in a database, and the solution was pretty straightforward. This time, however, the task was to find the unused schemas. At first I thought this must be fun; I had never done it, but surely someone had, so I started with Google and got some clues, but no proper explanation, so I decided to write it up.

To find the solution, you need to look at three catalog views:
1) sys.schemas
2) sys.database_principals
3) sys.objects


The easy way to find unused schemas is this: if a schema_id from sys.schemas is not found in sys.objects, that schema has never been used, and both views have the schema_id column. Wow, we know the solution! Yes, almost 😊 Why do we need the database_principals view? It holds each principal's default_schema_name, the schema used when an object name does not specify one, so schemas that serve as someone's default schema need to be excluded as well. The relationships among the three views can be depicted with the diagram below:

Fig 1: Relationship among tables

We've covered the theory; let's see it in action. Below is the SQL code to find the unused schemas in a database:

SELECT SCH.name
FROM sys.schemas SCH
WHERE SCH.schema_id NOT BETWEEN 1 AND 4              -- dbo, guest, INFORMATION_SCHEMA, sys
  AND SCH.principal_id NOT BETWEEN 16384 AND 16399   -- schemas owned by the fixed db_* roles, leave these alone
  AND NOT EXISTS
      ( SELECT 1
        FROM sys.database_principals P
        WHERE P.default_schema_name = SCH.name       -- schema is some principal's default schema, so exclude it
      )
  AND NOT EXISTS
      ( SELECT 1
        FROM sys.objects O
        WHERE O.[schema_id] = SCH.[schema_id]        -- no object uses the schema, so it is unused
      )
ORDER BY SCH.name


Now that you know how to find the unused schemas, you can delete them if you want.
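If you do decide to clean them up, a small variation of the same query can generate the DROP statements for you; this is only a sketch, so review the generated list before executing anything:

-- Generate (but do not automatically run) a DROP SCHEMA statement per unused schema.
SELECT 'DROP SCHEMA ' + QUOTENAME(SCH.name) + ';' AS drop_statement
FROM sys.schemas SCH
WHERE SCH.schema_id NOT BETWEEN 1 AND 4
  AND SCH.principal_id NOT BETWEEN 16384 AND 16399
  AND NOT EXISTS (SELECT 1 FROM sys.database_principals P WHERE P.default_schema_name = SCH.name)
  AND NOT EXISTS (SELECT 1 FROM sys.objects O WHERE O.[schema_id] = SCH.[schema_id])
ORDER BY SCH.name;

Since the filter guarantees no object uses these schemas, DROP SCHEMA will succeed without first moving or dropping objects.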

Wednesday, February 19, 2020

Azure Data Factory (ADF): Important components you need to know.

If you are new to Azure Data Factory and have started to build your very first pipeline, let's assume you have already identified the source and target. Your source data may come from an on-premises or cloud database, from blob storage, or from a Web API. Whether you connect to a storage account or to an on-premises/cloud database, you need a Linked Service. A Linked Service is like a connection string.

For example, if you want to get data from Blob storage and insert it into a cloud database, you will need two linked services: one linked service is the source connection from Blob storage to the data factory, and the other is the target connection from the data factory to the database.
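Under the hood, ADF stores each linked service as a JSON definition. As a rough, trimmed-down sketch (the name and placeholder connection string below are made up), the source linked service for Blob storage would look something like this, and the target would be a similar definition of type 'AzureSqlDatabase':

{
    "name": "SourceBlobLinkedService",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<storage-account>;AccountKey=<key>"
        }
    }
}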


Creating a linked service is easy, just a few steps through a wizard; however, one of the steps when you create a linked service is creating or selecting something called an Integration Runtime. Since there are three different types of Integration Runtime (IR), you may get confused about which one to choose.

Let's explain the three different types of Integration Runtime:

1. Azure
2. Self-Hosted
3. Linked Self-Hosted

Please note that there is one more Integration Runtime, called the Azure-SSIS Integration Runtime, which is required when you lift and shift your SSIS packages to Azure. We will not cover the SSIS IR in this post.

1. Azure
It's the default IR, meaning if you don't create anything it is created automatically and given the name 'AutoResolveIntegrationRuntime', as in figure 1 below.

Fig 1: AutoResolveIntegrationRuntime

If the data integration capabilities (Data Flow, data movement, and activity dispatch) run over a public network, then the Azure IR is used. If the Azure IR is not created automatically, you can create one manually by clicking '+New', as in fig 2 below,


Fig 2: Choose to create IR

and you will see the window below for creating an Azure IR (Fig 3).

Fig 3: Choose from different IR

When an IR is created, you can find it running under Data Factory -> Connections.

Fig 4: IR running under Data Factory



2. Self-Hosted:

If the source or target is in a private network, e.g. a database residing on-premises, then you must create a Self-Hosted IR. There are a few steps to create a Self-Hosted IR; while working in Azure Data Factory, you can create it via the data factory UI.
Please follow this link to create a Self-Hosted IR: https://docs.microsoft.com/en-us/azure/data-factory/create-self-hosted-integration-runtime#create-a-self-hosted-ir-via-azure-data-factory-ui



3. Linked Self-Hosted:

If you already have a Self-Hosted IR and need to use it for another Data Factory, you don't need to create a new Self-Hosted IR; instead you can link the existing Self-Hosted IR to the other data factory.


Saturday, January 18, 2020

Azure Data Factory: How to add and remove code repository to ADF

If you are working with Azure Data Factory (ADF), the very first step of development is to connect to a code repository. Well, you can develop without connecting a code repository, but then there is no collaboration and no version control.

In general, when you work with ADF and publish, the code is saved under the ADF service, but there is no repository, which means you can't see who saved the code or which version of the code you are working with.
For example, Fig 01 below gives me the option to choose between ADF mode and GitHub. If I choose ADF mode there will be no repository for the project; with GitHub I will have that option.


Fig 01: Repo or ADF mode

Step 1: Create the source repository

If you have a Visual Studio subscription you can get a free Azure DevOps account and create the repository there; if you don't, please go to GitHub (https://github.com/) and create a repository, which is completely free. Creating a repository in GitHub is very simple, just a few clicks.


Step 2: Log in to Azure Data Factory and choose the relevant repository type

Log in to Azure Data Factory and, after clicking 'Author and Monitor', you will land on the development page. At the top left of the page, as in Fig 02 below, you will find an option called 'Set up Code repository'.

Fig 02: Set up Code repository

You get the option to choose either Azure DevOps Git or GitHub (see Fig 02 above). If you already have a Visual Studio account you can choose 'Azure DevOps'; otherwise you need to use GitHub.

For example, in my case I logged into the Azure portal with my personal account, which has no association with a Visual Studio account, so when I choose 'Azure DevOps' nothing shows up (Fig 03).
Fig 03: Azure DevOps requires a VS account


Step 3: Settings for the chosen repository type

Let's go to the next step, where you have already chosen the repository type 'GitHub'; now fill in the necessary information as shown in figure 04.


Fig 04: Details for GitHub

The GitHub repository is now added to the Azure Data Factory, making it easy to collaborate and deploy the pipelines.

Fig 05: Repository is connected


Removing the code repository:

The code repository has been added to ADF; however, what if, for any reason, you want to remove it?


You need to find 'repo settings'. Come out of the Author tab and switch to the Home tab to find 'repo settings', as shown in figure 06.

Fig 06: Find Git repo


As soon as you click 'repo settings', you will see the screen below (Fig 07), where there is an option to remove the repository; just click 'Remove Git' and you are done!

Fig 07: Removing Git repo






Wednesday, December 25, 2019

Environment variable setup in SSIS Catalogue

If your SSIS solution is built in project deployment mode (not package deployment mode) and you would like to introduce environment variables, then this post is for you.

Let's start with a few steps before going into environment variables. In general, after building the SSIS solution you get an .ispac file, which you take from DEV to the UAT/TEST/PREPROD/PROD environment. Then, by following the wizard, you deploy the .ispac file under the SSIS catalogue. For a step-by-step guide to deploying an SSIS package to the catalogue, please see: Deploying SSIS package to Catalogue

After deployment, you will find the packages under the SSIS catalogue, as in figure 01 below:


Fig 01: SSIS Catalogue


In the next step you create a job (or jobs) to run the packages. Now, if you don't create any environment variables, the job will look like the figure below:

Fig 2: Job Properties without Environment variable

Will this job execute without environment variables? It surely will. But the connection strings still hold the development server information, while the solution has already been moved to the UAT or Test environment, so you would need to change the connection strings manually. If environment variables have been defined, they can be assigned to the job, so the same job (process) can run for different environments (UAT/Pre-PROD/PROD) by pointing to the right environment.

A job with an environment variable looks like figure 3 below:


Fig 3: Job with Environment variable

With this set up, you can choose between the different environments before running the job, and each environment can hold a different source and target database as well as different source and destination file locations. Now let's see how to create the environment variables.


Step 1: Create the environment variable
Under SSISDB you will find the Environments folder, as in the picture below; right-click it and follow the steps.

Fig 4: Create environment variable


Step 2: Property setup for the environment variable

In this step you define the connection strings, source/destination file paths, and so on.

Fig 5: setup properties for environment
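If you prefer T-SQL over the UI, SSISDB exposes stored procedures for the same two steps. A minimal sketch, where the folder, environment, and variable names are made up and should be replaced with your own:

USE SSISDB;
GO

-- Step 1 equivalent: create the environment inside an existing catalogue folder.
EXEC catalog.create_environment
     @folder_name      = N'MyProjectFolder',   -- hypothetical folder name
     @environment_name = N'UAT';

-- Step 2 equivalent: add a variable (for example a connection string) to the environment.
EXEC catalog.create_environment_variable
     @folder_name      = N'MyProjectFolder',
     @environment_name = N'UAT',
     @variable_name    = N'SourceConnectionString',
     @data_type        = N'String',
     @sensitive        = 0,
     @value            = N'Data Source=uat-sql01;Initial Catalog=StagingDB;Integrated Security=SSPI;',
     @description      = N'Source connection string for UAT';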



Step 3: Connect the environment variable to the project. Under the SSISDB catalogue, choose the project and then right-click to find 'Configure'.

Fig 6: Configure environment variable



Then add a reference to the project: References -> Add, after which the environment can be chosen.

Fig 7: adding environment variable reference to the project
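The same reference can also be created with a stored procedure; a sketch that reuses the made-up names from the earlier snippet ('MyProject' is a hypothetical project name):

-- Step 3 equivalent: reference the 'UAT' environment from the project.
DECLARE @reference_id BIGINT;

EXEC catalog.create_environment_reference
     @folder_name             = N'MyProjectFolder',
     @project_name            = N'MyProject',
     @environment_name        = N'UAT',
     @environment_folder_name = NULL,   -- NULL for a relative reference
     @reference_type          = 'R',    -- 'R' = relative (same folder), 'A' = absolute
     @reference_id            = @reference_id OUTPUT;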



Step 4: In steps 1 to 3 we worked with SSISDB. Now let's look at the job, under SQL Server Agent -> Jobs, where you will have the option to choose the environment.

Fig 8: Job Properties
From the job properties, click on Steps and you will find the environment to choose from.



In this example only a UAT environment has been created, but Pre-PROD and PROD environments can be created as needed, the source and target databases can be assigned accordingly, and those environments can then be assigned to the job.

Saturday, November 16, 2019

SQL Server: Easy way to add Columns to Multiple tables

When you work on a data warehouse/BI project, or any project where a database is required, and you find yourself needing to add one or more identical columns to tables you have already built, this post will help you achieve that faster and error free.

For example, a BI project you are working on may require a load timestamp on all the target tables, which was somehow missed at design time, or you may discover later in the project that you need one or two particular columns on all the tables.

A boring task, right? Going through all the tables and changing them one by one is time consuming and error prone. For example, if you have 50 tables and you need to add one column, say a column called 'UpdatedAt', how do we do it with a SQL script?

You list all the table names and run the query below, replacing the table names, e.g. 'testTable1', 'testTable2', etc.
 
Declare @tbl_name as varchar(100)
Declare @tsql_code varchar(500)

Declare cur_table cursor local for
    select name
    from sys.tables
    where name in ('testTable1', 'testTable2')
    order by 1
    for read only

open cur_table
fetch cur_table into @tbl_name

while @@fetch_status = 0
begin
    select @tsql_code = 'Alter Table ' + @tbl_name + ' Add Col1 Varchar(10), Col2 Varchar(10)'
    exec (@tsql_code)
    fetch cur_table into @tbl_name
end

close cur_table
deallocate cur_table


However, the code above may not work, since the schema name is missing; you need to add the schema name along with the table name, as in the example below:

Declare @tbl_name as varchar(100)
Declare @tsql_code varchar(500)

Declare cur_table cursor local for
    select '[' + SCHEMA_NAME(schema_id) + '].[' + name + ']'
    from sys.tables
    where SCHEMA_NAME(schema_id) + '.' + name in ('dbo.Currency', 'dbo.Sales')
    order by 1
    for read only

open cur_table
fetch cur_table into @tbl_name

while @@fetch_status = 0
begin
    select @tsql_code = 'Alter Table ' + @tbl_name + ' Add UpdatedAt [DATETIME]'
    exec (@tsql_code)
    fetch cur_table into @tbl_name
end

close cur_table
deallocate cur_table
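To double-check the result, a quick query against INFORMATION_SCHEMA.COLUMNS (a small sketch; the column name matches the example above) lists every table that now carries the new column:

-- Verify which tables ended up with the new column.
SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE COLUMN_NAME = 'UpdatedAt'
ORDER BY TABLE_SCHEMA, TABLE_NAME;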


Hope this will make your life easier.

Sunday, October 20, 2019

SSIS: Execute SQL Task Limitation

I would like to share an experience I had recently. In general, when you work with SSIS and use the Execute SQL Task, you tend to build the query in SQL Server Management Studio (SSMS) and, once it works, copy the code into the Execute SQL Task. However, when I copied the code, pasted it into the Execute SQL Task and ran the package, it failed at the Execute SQL Task, and I then figured out that only part of the SQL code had been pasted.

Fig 1: SQL Execute Task


The SQL code I copied was around 1,600 lines, but when pasted only 891 lines made it into the Execute SQL Task and the rest were discarded. That means the Execute SQL Task has a limitation. Out of curiosity, I took the pasted part of the code from the Execute SQL Task, counted it, and found it can hold up to 30,987 characters including spaces (see fig 2 below).

Fig 2: Execute SQL Task limitation


How do you solve it?
It's easy: wrap the SQL code in either a view or a stored procedure and then use that in your Execute SQL Task.
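A minimal sketch of the stored procedure approach, where the procedure name is made up and the body stands in for your long script:

-- Wrap the long script in a stored procedure once, in SSMS...
CREATE PROCEDURE dbo.usp_LoadSalesMart   -- hypothetical name
AS
BEGIN
    SET NOCOUNT ON;
    -- ... the ~1,600 lines of SQL go here ...
END
GO

-- ...then the Execute SQL Task only needs this one short statement:
EXEC dbo.usp_LoadSalesMart;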