Sunday, March 28, 2021

Step by step guideline to install Jupyter Notebook

Either you work as a Data Engineer or Data Scientist; Jupyter Notebook is a helpful tool. One of the project I was working required two .parquet file comparison, it's mainly schema comparison; not the data. Though the two .parquet was creating from two different sources but outcome (which is .parquet) should be completely alike, schema wise. At the beginning I was manually comparing them then I thought there must be a tool to do that. Well, that's how I found Jupyter notebook which can be useful to compare two .parquet schema.

To compare two .parquet files your development environment need to have Jupyter notebook and python as a prerequisite. This post will describe step by step installation process of Jupyter notebook. 

Step 1: Install python version 3.7.9

Python is a prerequisite for Jupyter notebook so at first python need to install.

Please follow the below URL and choose right version to install.

https://www.python.org/downloads/


Fig 1: Windows Executable


Create python folder under C Drive where downloaded file should be resided.

Fig 2: Python folder creation

Open the folder 'Python' and find the execution file as shown in figure fig 3: and double click the execution file to initiate the installation process.

Fig 3: Python Execution file


Make sure to choose 'Customize Installation' and check mark 'Add Python 3.9 to PATH' as shown in figure 4. I followed the customization method to avoid setting up environment variable.



Fig 4: Python Installation wizard


Make sure to choose the folder that you have created earlier (Fig 2)

Fig 5: choose the location

And now hit the Install button. Installation will complete in a minute or two.

Let's test if python installed successfully, open command prompt and write python. If python is installed correctly then you should able to see the python version number and some key help as like below figure 6.


Fig 6: Python installed successfully.



Step 2: Install Jupyter notebook

Let's move to the next step to install Jupyter notebook

Open command prompt and write the below code:

>pip install jupyter 

                            

Fig 7: Jupyter notebook installation started



When installation is done, let's run the Jupyter notebook. To run Jupyter notebook, you need to go to cmd prompt and execute the command: Jupyter notebook  as shown in below figure 8.

Fig 8: Opening Jupyter notebook

It will open a browser with jupyter notebook as shown in figure 9.

Fig 9: Jupyter notebook on browser


Now you can create a Notebook by choosing 'New' and choose Python 3 as show in fig 10. It will open a new browser tab where you will write the code.

Fig 10: Open Notebook

Let's write hello world program in the Jupyter notebook, the browser will look like figure 11 where Code we wrote : print('Hello world') and the output is shown after clicking the 'Run' button.

Fig 11: Hello world in Jupyter Notebook


We learned how to install Jupyter Notebook and also written simple hello world program. And there are different ways you can install Jupyter Notebook, but I followed this approach and found simple.

Tuesday, February 23, 2021

How to add local timestamp end of the files in ADF?

This article will describe how to add your local timestamp at the end of the each file in Azure Data Factory (ADF). In general, ADF gets a UTC timestamp, so we need to convert the timestamp from UTC to EST, since our local time zone is EST.

For example, if the input Source file name in SFTP is "_source_Customer.csv", then the expected outcome will be, "_source_Customer_2021-02-12T133751.csv". This means that the pipeline should add '_2021-02-12T133751' to the end of each file. This will work dynamically, which means that any file you pass from the source will have the timestamp added to it by using an ADF regular expression.

Let's set a simple pipeline and explain the scenario in a few steps. In this example, we receive files from an event based trigger and hold the file name in a parameter. The main part of this article is how to append the current date and time to the end of the file name we received. Please note that event based triggers will not be discussed here. If you like to know more about how to create trigger, please follow this link.

Step 1: Add Copy Activity

Create a simple pipeline with at least one Copy activity that connects a source and a sink, similar to what is shown in Fig 1.

 

                                           

                                          Fig 1: Pipeline with Copy Activity
 

Step 2: Adding a parameter to receive the file name 

Before adding the expression, we need to have a parameter in the pipeline that will catch the filename from a trigger. For the time being assume that the trigger will give us the file name and the parameter is to hold the filename.
 
                                                 
                                                              Fig 2: Adding parameter

Step 3: Prepare the sink dataset

In the Copy data activity there is a Sink dataset that needs a parameter. Click on the Sink dataset and when it opens, you will find the view similar to Fig 3.

                                                                         
                                                                                Fig 3: Adding parameter to the dataset
 
To add parameter to the dataset, click New and add the parameter name. Select the type, which should be a string. Now you you will see the sink dataset looks like Fig 4. The value edit box is where you need to add the dynamic content.
                                             
                                                  Fig 4: Dynamic content to add the timestamp

 

Step 4: Setting up a dynamic expression

Now, let's create the dynamic expression. As soon you hit the 'Add dynamic content', shown in Figure 5, you will able to write the expression that will convert the UTC timestamp to EST and then pad end of the file.

                                         
                                                                  Fig 5: expression language
 
We apply a number of functions to the pTriggerFile parameter from Step 1. Let's have closer look at the expression:
@concat(replace(pipeline().parameters.pTriggerFile,'.csv',''), '_', formatDateTime(convertTimeZone(utcnow(),'UTC','Eastern Standard Time'),'yyyy-MM-ddTHHmmss'), '.csv')

Find out the explanation of the above expression.

  1. First we need to get the filename from the parameter, pTriggerFile. The value here will be: _source_Customer.csv
  2. Next we use REPLACE() to replace the .csv with empty string : replace(pipeline().parameters.pTriggerFile,'.csv',''). This case, we get:  _source_Customer
  3. We need to get the timestamp. To do that, we convert utcnow() to EST with this function: convertTimeZone(utcnow(),'UTC','Eastern Standard Time')
  4. We want to format the date, and use this: formatDateTime(convertTimeZone(utcnow(),'UTC','Eastern Standard Time'),'yyyy-MM-ddTHHmmss'), which will return a value like: 2021-02-12T133751
  5. We put this all together with @concat(Step1,'_', 'Step4','.csv'), which will return _source_Customer_2021-02-12T133751.csv

We learned how to add local timestamp end of any file, though in this case, the source file was a .csv. However, you can follow the same process for .txt file where you only need to change '.csv' to '.txt' in the expression.

Saturday, January 30, 2021

How to work with SQL Store procedure output parameters in Azure data factory

To facilitate native SQL code we have stored procedure support in Azure Data Factory (ADF). When we work with stored procedures, mostly we use input parameters, however, a stored procedure can also have output parameters. In this case, we need to deal with return values, similar to how a value is returned by a function. This article will describe how you can work with stored procedure output parameters in ADF.

ADF has the SQL Server Stored Procedure Activity, which is used for any stored procedure you have in a SQL Server database. ADF also has a Lookup activity in which you can use a stored procedure. The example will use the Lookup activity to execute a stored procedure in a database.

Let's start by creating a stored procedure (SP) with an output parameter:

CREATE PROCEDURE [ETL].[sp_testprocOutParm]
(
 @input  VARCHAR(10),@Iambit BIT OUTPUT)
AS
BEGIN
IF @input >= '1'
BEGIN
    SET @iambit = 1;
	RETURN;
	END
END;

Let's see if the SP returns the expected value. Well, how do we execute the SP in SQL Server Management Studio (SSMS)? Please write below syntax to find the outcome.

DECLARE @Iambit bit 

EXEC ETL.sp_testprocOutParm '1', @Iambit OUTPUT
SELECT @Iambit Iambit

You should have outcome like below Fig 1.

Fig 1: Execution of output parameter in SSMS

The SP is created in the database and it has returned the expected outcome. Now we need to move to ADF.

This article assume you know how to add Lookup activity to ADF pipeline. In your pipeline get the Lookup activity as like figure 1.1.Now, Go Settings of the Lookup activity and choose stored procedure from theUse query selections (as shown in Figure 2). Then you should able to see the stored procedure under the dropdown list which you just created in the database. If you are not able to see the stored procedure then it most likely a problem with your access in ADF. Please remember, though you can execute the SP in SSMS, this doesn't mean you will have access to the stored procedure from ADF, You will need to talk with your portal admin and find out if your user has been given enough permissions to execute the stored procedure.

Fig 2: Connect stored procedure via Lookup in ADF

If you find out the stored procedure in the list,  you can continue to the next step. The next step is to import parameters by clicking the button, import parameter, as shown in Fig 3.

Fig 3: import parameter

The import parameter will load all the SP parameters, both input and output parameters. In this case the single output parameter will be shown. Having an output parameter means you want to return some value from the stored procedure and use it in ADF. In this example, the return value will control the next activities in the pipeline. Let's store the return value into a variable.

Drag and drop the Set Variable activity and connect it with the with the Lookup, as shown below in Fig 4

Fig 4: Add Set variable

We need a variable as shown in Fig 5, put variable name and select type of the variable. Since variable is at pipeline scope, so make sure you have selected pipeline itself not any activities in the pipeline.

Fig 5: Creating variable

While creating variables, you will find there are three types of variables: string, Boolean and Array. We have chosen Boolean type for the variable since our stored procedure returns Boolean.

Fig 6: Creating variable

Now, let's get to the pipeline and select 'set variable' activity (as shown in Fig 7). In the Set variable activity, please click the variable tab. Here is where you will find the recently created variable under the name drop down. Select it, as shown in Fig 7.

 

Fig 7: Choose the variable

To bind the value returned from the stored procedure to the variable, you need write the expression under the 'value' of the variable, as shown in Fig 8

Fig 8: expression to hold return value

The expression to bind return value from stored procedure is:

@activity('Lookup1').output.firstRow.Iambit

Let's explain what expression means. The above expression will return first row from the lookup activity, named Lookup1, and return the column name from the stored procedure, which is 'Iambit'. Here, 'Iambit' will return the result as 1 (a Boolean TRUE).

Be Cautious: Though the variable is set to be Boolean and the stored procedure returned the value as a bit, I found an error in the ADF with the above expression. I had to modify the expression to explicitly convert the return value to Boolean, like the code below:

@bool(activity('Lookup1').output.firstRow.Iambit)

In summary, output parameter in stored procedure a great facility when in need and happy to see that works in ADF. Please note that, the example we depicted here in this article by using very simple stored procedure, however; in business scenario your stored procedure could be a bit complex. For example, you may have logging tables in the data warehouse and you want any activity in ADF will only execute when you pass particular pipeline name as input parameter as a return if the stored procedure returns TRUE.