Saturday, September 19, 2020

Azure Data Factory: What is not allowed in a file name when using Data Flow (Spark engine)?

I was working on a pipeline that moves data from a source and loads it into the RAW zone in ADLS. As soon as a file is loaded, a date-time stamp is appended to the end of its name. For example, if the source file name is customer.csv, the file that lands in the RAW zone is named customer_2020-09-28T12:15:32.csv

How do you add a date-time stamp to the end of the file name?

Adding dynamic content to the sink in the pipeline, as shown below (Fig 1), does the job.

Fig 1: dynamic content to

When I ran the pipeline, it appended the timestamp to the end of the file name and saved the file in blob storage. I was also able to view the data in the file customer_2020-09-28T12:15:32.csv

Fig 2: Preview of the data

In Azure Data Lake Storage, a file name containing ':' caused no issue, either when creating the file or when viewing it.

However, as soon as I used that file in a Data Flow activity and debugged it (which is when the Apache Spark engine starts up), I got the error below:

Fig 3: java.lang.IllegalArgumentException

How to resolve the error?

You can't have ':' in the file name. The same is true if you try to create a file with ':' in its name on Windows: it throws an error (Fig 4). Interestingly, that is not the case when you create a file with the same name in Azure Data Lake Storage (which works with HDFS behind the scenes).

Fig 4: Error in Windows machine
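
If files with ':' in their names have already landed, one option is to detect and rename them before Data Flow reads them. The snippet below is only a rough Python sketch of that check, not something taken from the pipeline itself; the helper names needs_rename and remove_colons are made up for illustration, and the actual rename in ADLS would still have to be done with whatever tool you use.

```python
def needs_rename(filename):
    """Return True if the name contains ':', the character that broke the Data Flow run."""
    return ":" in filename


def remove_colons(filename):
    """Return the file name with every ':' removed (hypothetical helper)."""
    return filename.replace(":", "")


landed = "customer_2020-09-28T12:15:32.csv"
if needs_rename(landed):
    # Prints the colon-free name that is safe to pass to Data Flow
    print(remove_colons(landed))
```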

To resolve this, I changed the format of the timestamp appended to the end of each file name: instead of yyyy-MM-ddTHH:mm:ss, I used yyyy-MM-ddTHHmmss, so the file name becomes customer_2020-09-28T121532.csv
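
The same naming rule can be tried outside Data Factory. Here is a minimal Python sketch, assuming you just want to see what the colon-free pattern produces; the function name stamped_name is hypothetical, and the strftime pattern %Y-%m-%dT%H%M%S is my Python equivalent of yyyy-MM-ddTHHmmss, not the expression used in the pipeline.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def stamped_name(filename, when=None):
    """Append a colon-free timestamp between the file's base name and its extension."""
    # Default to the current UTC time when no timestamp is supplied
    when = when or datetime.now(timezone.utc)
    path = PurePosixPath(filename)
    # No ':' between hours, minutes and seconds
    stamp = when.strftime("%Y-%m-%dT%H%M%S")
    return f"{path.stem}_{stamp}{path.suffix}"


print(stamped_name("customer.csv", datetime(2020, 9, 28, 12, 15, 32)))
# prints customer_2020-09-28T121532.csv
```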
