Saturday, August 28, 2021

How to Flatten JSON in Azure Data Factory?

When you work with ETL and the source file is JSON, many documents may get nested attributes in the JSON file. Your requirements will often dictate that you flatten those nested attributes. There are many ways you can flatten the JSON hierarchy, however; I am going to share my experiences with Azure Data Factory (ADF) to flatten JSON.

The ETL process involved taking a JSON source file, flattening it, and storing in an Azure SQL database. The attributes in the JSON files were nested, which required flattening them. The source JSON look like this:

{

"id": "01",

"name": "Tom Hanks",

"age": 20.0,

"email": "th@hollywood.com",

"Cars":

  {

  "make": "Bentley",

  "year": 1973.0,

  "color": "White"

  }

}

The above JSON document has a nested attribute, Cars. We would like to flatten these values that produce a final outcome look like below:

{

"id": "01",

"name": "Tom Hanks",

"age": 20.0,

"email": "th@hollywood.com",

"Cars_make":  "Bentley",

"Cars_year":  "1973.0",

"Cars_color":   "White"

}

How do we do it by using ADF?

Let's create a pipeline that includes the Copy activity, which has the capabilities to flatten the JSON attributes. Let's do that step by step.

First, create a new ADF Pipeline and add a copy activity.

Fig 1: Copy Activity in ADF


Next, we need datasets. You need to have both source and target datasets to move data from one place to another. In this case source is Azure Data Lake Storage (Gen 2). The target is Azure SQL database. The below figure shows the source dataset. We are using a JSON file in Azure Data Lake.

Fig 2: Source dataset

We will insert data into the target after flattening the JSON. the below figure shows the sink dataset, which is an Azure SQL Database.

Fig 3: Sink dataset

Please note that, you will need Linked services to create both the datasets, this article will not go into details about Linked Services, to know details you can look into the Microsoft document.


3. Flattening JSON

After you create source and target dataset, you need to click on mapping as shown below figure 4 and follow the steps:

Fig 4: Flattening JSON

a) At first import schemas
b) Make sure to choose value from Collection Reference
c) Toggle the Advanced Editor
d) Update the columns those you want to flatten

After you have done above, then save it and execute the pipeline. You will find flatten records are inserted to the database as shown in fig 5.


Fig 5: Saved data into the table after flattening


Be cautious

Make sure to choose "Collection Reference" as mentioned 3.b, if you forget to choose that then the mapping will look like below Fig 6:

Fig 6: Without putting collection reference


If you look at the mapping closely from the above figure 6, the nested item in the JSON from source side is: 'result'][0]['Cars']['make'] which means it will only take very first record from the JSON. If you execute the pipeline you will find only one record from JSON file is inserted to the database. So it's important to choose Collection Reference.

In summary, I found Copy Activity in Azure Data Factory make easier to flatten the JSON, you don't need to write any custom code which is super cool.



No comments: