Sunday, June 6, 2021

Why Power query as a transformation activity in Azure Data factory and SSIS?

This blog post will describe how power Query activity in ADF and SSIS can be useful. As well as, I will share the differences of Power Query activity between SSIS and ADF.

Why Power Query and When to use it?

When data engineer works for transformation pipeline they get different activities like lookup, merge, data conversion etc. in their preferred ETL tool. ETL tools like Azure data factory (ADF) got Dataflow and Databricks to solve complex transformation. In addition, ADF introduced 'Power Query' (previous name data wrangling) as an activity. Please note that, Power query is still in preview for both Azure Data Factory (ADF) and SSIS. 


Fig 1: Power Query in ADF

Despite having many activities in Azure data factory why we need Power Query? Let's share my experience when Power Query have chosen as an activity in the pipeline.  The task was to get data from complex excel files with many calculation and more than 1000 columns which is used by business as an application. Yes! you got it right, it's an excel application, organization still uses excel as an application!!  A few transformed and calculated columns need to go to the modern data warehouse from the excel files. 

In this scenario, thought about what would be the best activity to choose from: DataFlow, Databricks or Power Query? well, I would say all of them may work but Power Query was the best choice.

Let me explain, why? Since the source file is excel and it's got  more than 1000 columns with many calculation inside, It's almost impossible for a Data Engineer to find out how to derive the expected outcome where no mapping or transformation logic is provided. By using Power Query visual transformation, business expert and I were able to work closely and produce the output in a very short period of time. 


Fig 2: Power Query transformation in ADF

 As a Data Engineer, when you work with dataflow or Databricks or any other transformation activity in ETL tool, you follow the documented mapping logic and build the pipeline. It means transformation rules and mappings are predefined. However, when transformation rules are yet to discover then best to start with Power Query. You can simply start with Power BI desktop to work together with business to produce the expected outcome. And when output is verified and accepted then then copy the M Query to ADF Power Query activity or SSIS Power Query source. In fact, now you have the transformation rules in the M Query so if you like to use other transformation activity like dataflow or Databricks you can use that too.


What works in SSIS but not in ADF?

Power Query activity is in Preview for both SSIS and ADF, however; if you choose ADF then you need to convert the source file from .excel to .csv since Power Query for ADF doesn't support .excel as source dataset.


Fig 3: Source dataset for Power Query

However, if you work with Power Query in SSIS then it support excel as source. On Contrary, in SSIS; when you are working with Power Query Source, it doesn't have user interface to make the transformation like ADF. The obvious reason is, you can use Power BI desktop to do the transformation and then copy the M query (Power Query generate M syntax which called M Query) from Power BI and paste it to Power Query Source in SSIS.

Fig 4: Power Query in SSIS



In summary, Power Query in both SSIS and ADF is useful Activity and new feature which still in preview, hence there might be many different scenarios where you want to use Power Query activity, however; this article is based on my experiences with Power Query activity in ADF and SSIS. It's also interesting to know that, The user interface you get under ADF Power Query is identical to Power BI, however, not all M query is supported by ADF Power Query yet.




No comments: