Sunday, July 25, 2021

Step by step guideline to install PostgreSQL in Azure cloud and connect through PgAdmin

What is PostgreSQL?

PostgreSQL, also known as Postgres, is a free and open-source relational database management system.  The official PostgreSQL site mentioned, "The World's Most Advanced Open Source Relational Database".  PostgreSQL as Open Source database gained huge popularity in past few years, this article post will focus how to install PostgreSQL in Azure cloud and tools to interact with the database.


Installation of PostgreSQL in the Azure Cloud environment

At first, login to your Azure portal and search for PostgreSQL, You will find different services to choose from, I have chosen “Azure Database for PostgreSQL flexible servers” from the below list as shown in Fig 1. This particular service will allow to add any extension you want to add to your database in future.


Fig 1: PostgresSQL services in Azure Cloud


As soon as you choose the option you will find below figure 2, which will allow to create the postgreSQL flexible server.


    Fig 2: PostgreSQL flexible server


After clicking  "Create Azure Database for PostgreSQL flexible server" as shown in above figure 2, you will have options to choose from four different plans as shown in figure 3. As per your need you can choose from anyone of them. "Single server" was best fit for my requirements since it's enterprise ready, fully managed and I can add geospatial extension to it.


Fig 3: Choose right plan for your database

 
As soon as you hit the 'Single server' as shown above figure 3, you will find details information to fill up as shown in figure 4.

Please follow the below steps, figure (4) indicates each step listed.

1) Choose the right subscription for your resource group
2) Please select resource group where you want to install the database server, if no resource group created then you need to create a resource group. Please find details how to create azure resource group
3) Put the server name for PostgreSQL
4) Choose the location where you would like to install the PostgreSQL, I have chosen Canada Central, however; you can choose which best fit for you.
5) Choose the version of PostgreSQL that you would like deploy in Azure
6) At this step fill up the administrator account information and save this credential; you will need this when you log into the database server.

Fig 4: PostgreSQL deployment config input





After filling up the above information, please click 'Review + Create'. It will take a few minutes to complete the installation and you will find below message when deployment is completed as shown in figure 5.


Fig 5: Deployment is completed


After the deployment if you click Go to Resource (as shown bottom link at Fig 6), you will find out more details about the resource that you just created. We will need these information when connection need to establish from the IDE.

Fig 6: resource details



How to connect PostgreSQL from On-Premise GUI?


PostgreSQL deployment is completed, however; Now we need to find out how to connect this PostgreSQL database server with a Graphical User Interface (GUI) and create any new databases. One of the popular GUI for PostgreSQL is pgAdmin.

Let's start installing pgAdmin to connect the database server and do rest of the operation. Please follow the link to install pgAdmin for Windows. You can choose latest version to of pgAmin, download it and then use wizard to install it.

When you pgAdmin installation is completed, you will find below (Fig 7) if you search for the app.


                             Fig 7: pgAdmin installed in my PC



Now, we are going to use pgAdmin 4 to connect the deployed PostgreSQL database server. Open the app pgAdmin 4 and right click under server as below figure 8 is shown.



Fig 8: Create connection

And then you need to fill up the details to connect PostgreSQL database server which we deployed previously (fig 4). Details are shown in below figure 9, and fill up the information as suggested below:

1. Host name/Address: This is server name which can be found under the resource details (as shown in figure 5.)
2. Port by default should be set 5432, in case it's not then please put 5432.
3. Maintenance database: It's like master database if you are coming from SQL DB experiences, it should fill up automatically, if not then put: postgres
4. User Name: It's admin user name (see figure 4 or 6)
5. Password: The password you entered (fig 4)

As well as, under General tab, please give any name you like for the connection then hit Save button.



Fig 9: connection details need to fill up



Now you are connected your PostgreSQL database server in the Azure Cloud environment from PgAdmin GUI as shown in below figure 10. Everything is set, you can create new database, add new extension to it and whatever operations you want to make. 


Fig 10: PgAdmin GUI connected with PostgreSQL in the Azure Cloud


We learned how to deploy PostgreSQL in the Azure Cloud environment as well as how we can connect the database server from on-premise GUI called PgAdmin.

Sunday, June 6, 2021

Why Power query as a transformation activity in Azure Data factory and SSIS?

This blog post will describe how power Query activity in ADF and SSIS can be useful. As well as, I will share the differences of Power Query activity between SSIS and ADF.

Why Power Query and When to use it?

When data engineer works for transformation pipeline they get different activities like lookup, merge, data conversion etc. in their preferred ETL tool. ETL tools like Azure data factory (ADF) got Dataflow and Databricks to solve complex transformation. In addition, ADF introduced 'Power Query' (previous name data wrangling) as an activity. Please note that, Power query is still in preview for both Azure Data Factory (ADF) and SSIS. 


Fig 1: Power Query in ADF

Despite having many activities in Azure data factory why we need Power Query? Let's share my experience when Power Query have chosen as an activity in the pipeline.  The task was to get data from complex excel files with many calculation and more than 1000 columns which is used by business as an application. Yes! you got it right, it's an excel application, organization still uses excel as an application!!  A few transformed and calculated columns need to go to the modern data warehouse from the excel files. 

In this scenario, thought about what would be the best activity to choose from: DataFlow, Databricks or Power Query? well, I would say all of them may work but Power Query was the best choice.

Let me explain, why? Since the source file is excel and it's got  more than 1000 columns with many calculation inside, It's almost impossible for a Data Engineer to find out how to derive the expected outcome where no mapping or transformation logic is provided. By using Power Query visual transformation, business expert and I were able to work closely and produce the output in a very short period of time. 


Fig 2: Power Query transformation in ADF

 As a Data Engineer, when you work with dataflow or Databricks or any other transformation activity in ETL tool, you follow the documented mapping logic and build the pipeline. It means transformation rules and mappings are predefined. However, when transformation rules are yet to discover then best to start with Power Query. You can simply start with Power BI desktop to work together with business to produce the expected outcome. And when output is verified and accepted then then copy the M Query to ADF Power Query activity or SSIS Power Query source. In fact, now you have the transformation rules in the M Query so if you like to use other transformation activity like dataflow or Databricks you can use that too.


What works in SSIS but not in ADF?

Power Query activity is in Preview for both SSIS and ADF, however; if you choose ADF then you need to convert the source file from .excel to .csv since Power Query for ADF doesn't support .excel as source dataset.


Fig 3: Source dataset for Power Query

However, if you work with Power Query in SSIS then it support excel as source. On Contrary, in SSIS; when you are working with Power Query Source, it doesn't have user interface to make the transformation like ADF. The obvious reason is, you can use Power BI desktop to do the transformation and then copy the M query (Power Query generate M syntax which called M Query) from Power BI and paste it to Power Query Source in SSIS.

Fig 4: Power Query in SSIS



In summary, Power Query in both SSIS and ADF is useful Activity and new feature which still in preview, hence there might be many different scenarios where you want to use Power Query activity, however; this article is based on my experiences with Power Query activity in ADF and SSIS. It's also interesting to know that, The user interface you get under ADF Power Query is identical to Power BI, however, not all M query is supported by ADF Power Query yet.




Sunday, April 25, 2021

Handling SQL DB row-level errors in ADF (Azure data factory) Data Flows

If you are working with ADF (Azure data factory) data flows then you may have noticed there is a new feature released in Nov 2020 which is useful to capture any error while inserting/updating the records to the SQL database.

Fig 1: Error row handling at sink database

For error handling there are two options to choose from:

1) Stop on first error (default)

2) Continue on error


                                                    Fig 2: Error row handling options

By default, ADF pipeline will stop at the error. However, the main purpose of this feature to use option "Continue on error" to catch and log the error so that we can look at later and take action accordingly. 

Let's fill up the settings to catch errors rows, below figures show the settings and will also describe each setup (Please follow the numbering in the figure 3).

1) Error row handling: Since we wanted to catch the error so we have chosen "Continue of error" at Error row handling.

Fig 3: Settings Continue on error

2) Transaction Commit: Choose whether the data flow will be written in a single transaction or in batches, I have chosen single, it means whenever there is failure it will store the record on the other hand batch will store error records when full batch is completed.

3)Output rejected data: You need to make this check mark TRUE to store the error rows. The whole point of error row handling is you want to know the error records; if so, please tick check mark. Though you can avoid this, in that case pipeline will run but if there is any error you will not know which records causes the error.

4) Linked Service: Put the linked service and test the connection

5)Storage folder path: Storage path need to mention here, it's the path where you would like to store the error records in a file.

6)Report success on error: I don't put report on success checkbox to TRUE since I wanted to know if there is a failure.


After the settings, when you run the pipeline and if there is any error in the dataset, it will be stored in your storage folder as you have provided at point no 5 in the settings.


In general, when there is a failure at the time of inserting records to the database it takes sometime to find out the reason of failure. You may have to go through large chunk of dataset and look for miss match of data types or NULL value etc. to find out the root cause. Through this feature the error records will be captured and stored in the storage so you will be able to identify the reason for any error very quickly. And if you would like to ingest those error rows then you can fix those records and re-run the pipeline.

Sunday, March 28, 2021

Step by step guideline to install Jupyter Notebook

Whether you work as a Data Engineer or a Data Scientist, a Jupyter Notebook is a helpful tool. One of the projects I was working required a comparison of two parquet files. This is mainly a schema comparison, not a data comparison. Though the two .parquet were created from two different sources, the outcome should be completely alike, schema wise. At the beginning I was manually comparing them then I thought there must be a tool to do that. Well, that's how I found a Jupyter notebook can be useful to compare two .parquet files' schema.

The Jupyter Notebook can be used for data cleaning and transformation, data visualization, machine learning, statistical modeling and much more. This post will describe the step by step installation process of Jupyter notebook.

Step 1: Install python version 3.7.9

Python is a prerequisite for running a Jupyter notebook, so we need to install python first. Please follow this URL and choose right version to install: https://www.python.org/downloads/.

I have chosen 'Windows x86-64 executable installer' for my Windows 64 bit OS. Please choose the version as per your computer Operating system.

Fig 1: Windows Executable

You can download the executable file and save in any location at your computer.

Now next step is to create a 'Python' folder under the C: drive, we will use this folder as installation location at later step.

Fig 2: Python folder under C

 

Find out the downloaded executable file, I have saved the executable file under Downloads folder (shown in below figure 3). Now double click the executable file to initiate the installation process.

Fig 3: Python Execution file

Make sure to choose 'Customize Installation' and check mark 'Add Python 3.9 to PATH' as shown in figure 4. I followed the customization method to avoid setting up environment variable.

Fig 4: Python Installation wizard

As below figure 5 shown, the Customize installation location, where make sure you put the installation location folder C:\Python\Python39. We have created 'Python' folder in C drive in earlier step (Fig 2)

Fig 5: choose the location

Now hit the Install button. Installation will complete in a minute or two.

Let's test if python installed successfully, open command prompt and type "python". If python is installed correctly then you should able to see the python version number and some key help, as shown below in Fig 6.

Fig 6: Python installed successfully.

Step 2: Install the Jupyter Notebook

Let's move to the next step, which is to install the Jupyter notebook software. Open command prompt and type the below code:

>pip install jupyter
 
Fig 7: Jupyter Notebook installation started

When installation is complete, let's run the Jupyter Notebook web application. To do this, you need to go to a cmd prompt and execute this command, as shown in below figure 8:

Jupyter notebook
 
Fig 8: Opening Jupyter Notebook
 
As soon as you hit the above command button, It will open a browser with jupyter notebook as shown in figure 9.
 
Fig 9: Jupyter Notebook on browser

Now you can create a Notebook by choosing 'New' and choose Python 3 as show in fig 10. This will open a new browser tab where you will write the code.

Fig 10: Open Notebook

Let's write hello world program in the Jupyter notebook. The browser will look like figure 11 if you enter this code:

print('Hello world')

The output is shown after clicking the 'Run' button.

 
Fig 11: Hello world in Jupyter Notebook

Now you can write and run other notebooks.

In this article, we learned how to install python and Jupyter Notebooks and have also written a simple hello world program. There are different ways you can install Jupyter Notebook, but I followed this approach and found simple.

Tuesday, February 23, 2021

How to add local timestamp end of the files in ADF?

This article will describe how to add your local timestamp at the end of the each file in Azure Data Factory (ADF). In general, ADF gets a UTC timestamp, so we need to convert the timestamp from UTC to EST, since our local time zone is EST.

For example, if the input Source file name in SFTP is "_source_Customer.csv", then the expected outcome will be, "_source_Customer_2021-02-12T133751.csv". This means that the pipeline should add '_2021-02-12T133751' to the end of each file. This will work dynamically, which means that any file you pass from the source will have the timestamp added to it by using an ADF regular expression.

Let's set a simple pipeline and explain the scenario in a few steps. In this example, we receive files from an event based trigger and hold the file name in a parameter. The main part of this article is how to append the current date and time to the end of the file name we received. Please note that event based triggers will not be discussed here. If you like to know more about how to create trigger, please follow this link.

Step 1: Add Copy Activity

Create a simple pipeline with at least one Copy activity that connects a source and a sink, similar to what is shown in Fig 1.

 

                                           

                                          Fig 1: Pipeline with Copy Activity
 

Step 2: Adding a parameter to receive the file name 

Before adding the expression, we need to have a parameter in the pipeline that will catch the filename from a trigger. For the time being assume that the trigger will give us the file name and the parameter is to hold the filename.
 
                                                 
                                                              Fig 2: Adding parameter

Step 3: Prepare the sink dataset

In the Copy data activity there is a Sink dataset that needs a parameter. Click on the Sink dataset and when it opens, you will find the view similar to Fig 3.

                                                                         
                                                                                Fig 3: Adding parameter to the dataset
 
To add parameter to the dataset, click New and add the parameter name. Select the type, which should be a string. Now you you will see the sink dataset looks like Fig 4. The value edit box is where you need to add the dynamic content.
                                             
                                                  Fig 4: Dynamic content to add the timestamp

 

Step 4: Setting up a dynamic expression

Now, let's create the dynamic expression. As soon you hit the 'Add dynamic content', shown in Figure 5, you will able to write the expression that will convert the UTC timestamp to EST and then pad end of the file.

                                         
                                                                  Fig 5: expression language
 
We apply a number of functions to the pTriggerFile parameter from Step 1. Let's have closer look at the expression:
@concat(replace(pipeline().parameters.pTriggerFile,'.csv',''), '_', 
formatDateTime(convertTimeZone(utcnow(),'UTC','Eastern Standard Time'),'yyyy-MM-ddTHHmmss'), '.csv')

Find out the explanation of the above expression.

  1. First we need to get the filename from the parameter, pTriggerFile. The value here will be: _source_Customer.csv
  2. Next we use REPLACE() to replace the .csv with empty string : replace(pipeline().parameters.pTriggerFile,'.csv',''). This case, we get:  _source_Customer
  3. We need to get the timestamp. To do that, we convert utcnow() to EST with this function: convertTimeZone(utcnow(),'UTC','Eastern Standard Time')
  4. We want to format the date, and use this: formatDateTime(convertTimeZone(utcnow(),'UTC','Eastern Standard Time'),'yyyy-MM-ddTHHmmss'), which will return a value like: 2021-02-12T133751
  5. We put this all together with @concat(Step1,'_', 'Step4','.csv'), which will return _source_Customer_2021-02-12T133751.csv

We learned how to add local timestamp end of any file, though in this case, the source file was a .csv. However, you can follow the same process for .txt file where you only need to change '.csv' to '.txt' in the expression.

Saturday, January 30, 2021

How to work with SQL Store procedure output parameters in Azure data factory

To facilitate native SQL code we have stored procedure support in Azure Data Factory (ADF). When we work with stored procedures, mostly we use input parameters, however, a stored procedure can also have output parameters. In this case, we need to deal with return values, similar to how a value is returned by a function. This article will describe how you can work with stored procedure output parameters in ADF.

ADF has the SQL Server Stored Procedure Activity, which is used for any stored procedure you have in a SQL Server database. ADF also has a Lookup activity in which you can use a stored procedure. The example will use the Lookup activity to execute a stored procedure in a database.

Let's start by creating a stored procedure (SP) with an output parameter:

CREATE PROCEDURE [ETL].[sp_testprocOutParm]
(
 @input  VARCHAR(10),@Iambit BIT OUTPUT)
AS
BEGIN
IF @input >= '1'
BEGIN
    SET @iambit = 1;
	RETURN;
	END
END;

Let's see if the SP returns the expected value. Well, how do we execute the SP in SQL Server Management Studio (SSMS)? Please write below syntax to find the outcome.

DECLARE @Iambit bit 

EXEC ETL.sp_testprocOutParm '1', @Iambit OUTPUT
SELECT @Iambit Iambit

You should have outcome like below Fig 1.

Fig 1: Execution of output parameter in SSMS

The SP is created in the database and it has returned the expected outcome. Now we need to move to ADF.

This article assume you know how to add Lookup activity to ADF pipeline. In your pipeline get the Lookup activity as like figure 1.1.Now, Go Settings of the Lookup activity and choose stored procedure from theUse query selections (as shown in Figure 2). Then you should able to see the stored procedure under the dropdown list which you just created in the database. If you are not able to see the stored procedure then it most likely a problem with your access in ADF. Please remember, though you can execute the SP in SSMS, this doesn't mean you will have access to the stored procedure from ADF, You will need to talk with your portal admin and find out if your user has been given enough permissions to execute the stored procedure.

Fig 2: Connect stored procedure via Lookup in ADF

If you find out the stored procedure in the list,  you can continue to the next step. The next step is to import parameters by clicking the button, import parameter, as shown in Fig 3.

Fig 3: import parameter

The import parameter will load all the SP parameters, both input and output parameters. In this case the single output parameter will be shown. Having an output parameter means you want to return some value from the stored procedure and use it in ADF. In this example, the return value will control the next activities in the pipeline. Let's store the return value into a variable.

Drag and drop the Set Variable activity and connect it with the with the Lookup, as shown below in Fig 4

Fig 4: Add Set variable

We need a variable as shown in Fig 5, put variable name and select type of the variable. Since variable is at pipeline scope, so make sure you have selected pipeline itself not any activities in the pipeline.

Fig 5: Creating variable

While creating variables, you will find there are three types of variables: string, Boolean and Array. We have chosen Boolean type for the variable since our stored procedure returns Boolean.

Fig 6: Creating variable

Now, let's get to the pipeline and select 'set variable' activity (as shown in Fig 7). In the Set variable activity, please click the variable tab. Here is where you will find the recently created variable under the name drop down. Select it, as shown in Fig 7.

 

Fig 7: Choose the variable

To bind the value returned from the stored procedure to the variable, you need write the expression under the 'value' of the variable, as shown in Fig 8

Fig 8: expression to hold return value

The expression to bind return value from stored procedure is:

@activity('Lookup1').output.firstRow.Iambit

Let's explain what expression means. The above expression will return first row from the lookup activity, named Lookup1, and return the column name from the stored procedure, which is 'Iambit'. Here, 'Iambit' will return the result as 1 (a Boolean TRUE).

Be Cautious: Though the variable is set to be Boolean and the stored procedure returned the value as a bit, I found an error in the ADF with the above expression. I had to modify the expression to explicitly convert the return value to Boolean, like the code below:

@bool(activity('Lookup1').output.firstRow.Iambit)

In summary, output parameter in stored procedure a great facility when in need and happy to see that works in ADF. Please note that, the example we depicted here in this article by using very simple stored procedure, however; in business scenario your stored procedure could be a bit complex. For example, you may have logging tables in the data warehouse and you want any activity in ADF will only execute when you pass particular pipeline name as input parameter as a return if the stored procedure returns TRUE.

Wednesday, December 30, 2020

ADF data flow: can particular date format generate NULL value?

I am going to share recent finding in ADF data flow where my source data in .csv got correct date. However, as long as when I did some transformation and saved in .parquet found those date all got empty values. This blog post will describe the issue and the resolution.


Source data in .csv have startDate and CloseDate like below figure 1.0 where format of the date is MM/dd/yyyy

Fig 1.0: Date in the source .csv



I have used ADF data flow to cleanup/transform the files and saved into .parquet format. However, in the .parquet file these two date columns 'StartDate' and 'CloseDate' become empty.

 

Fig 1.1: Dates become empty in .parquet


After looking into the dataflow and specific looking at the projection of the source found auto detect date format ‘MM/dd/YYYY’ which is original source date format.


Fig 1.3: Date format auto detected in the data flow

 And when previewed the data those date shown as NULL which was kind of weird.

   

Fig 1.4: Date shown as NULL in the data preview

How to solve it? 

To fix this issue, what you need to do is, go under projection and change the date format to ‘yyyy-MM-dd’ as like below figure 2.0


Fig 2.0: Change date format

 

 And you can go and see the preview, it looks good now.

 

Fig 2.1: After changing the date format preview looks perfect


 

 Note that, I have tried with other format from projection such as yyyy/MM/dd and so on but those did not resolve the issue.

Fig 2.2: 


Other Solution?

 You can also take other approach,  change the format from 'date' to 'string'  under the projection

Fig 3.0: change data type from date to string


And then use derive column activity 

Fig 3.1: get 'derived column' in the data flow


Now, use expression to convert into the correct date format : toDate(OpenDate,'yyyy-MM-dd') 

Fig 3.2: expression to convert from string to date


In summary, if you find any discrepancy at output file's date columns then look closely the date format, preview the data in the ADF data flow then either change the format from source projections or use derived column activities to fix it.