
I have a data lake where each container consists of folders named A, B, C, etc. Inside each folder, there are JSON files that I want to merge into one Parquet file, with some simple data transformations. Hence, there is one Parquet file per folder in the container. The data transformation for folder A is different from that for folders B and C, but all transformations result in a Parquet file.

As time goes on, new containers will be added with the same A, B, C folder structure and names, and JSON files will be updated.

I am considering using Synapse, Databricks, and Azure Data Factory (ADF) for this task. However, the challenging part is creating a trigger that updates the Parquet file when needed. The entire process should be automated, with the data transformation tool monitoring the data lake and deciding when to start the data transformation.

How can I automate this process to ensure that the Parquet files are updated whenever new data is added or existing data is modified? Any guidance on setting up triggers and automating the data transformation would be greatly appreciated.

Could ADF and Azure Functions be an option as well? Cost is something I would consider as important :D

I'm new to all these tools :D


1 Answer


If you are restricted to ADF only, you can try the approach below using two pipelines, six dataflows, and one trigger.

This design involves two scenarios:

Full load of the files:

For the first run, the existing JSON files need to be merged and the Parquet files created.

For this, you need three dataflows, one for each of the three folders.

Follow the below pipeline design:

Get Metadata activity -> Take a Binary dataset without any container or file path and select Child items in the activity, so it returns the list of containers.
For-Each activity -> Pass the child items list to the For-Each activity and enable the Sequential checkbox.
    - Dataflow 1 -> Create a JSON dataset with a dataset parameter for the container name and use it as the source of Dataflow 1. Pass the container name from the For-Each loop to the dataset parameter in the dataflow activity. In the dataflow source settings, give the wildcard file path A/*.json and add your folder A transformations after the source. In the sink, give your Parquet dataset; use dataset parameters for the Parquet file name if needed (see the sketch after this list).
    - Dataflow 2 -> Do the same for folder B: add the folder B transformations and give B/*.json as the wildcard file path.
    - Dataflow 3 -> Do the same for folder C: add the folder C transformations and give C/*.json as the wildcard file path.
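For reference, the merge-and-transform logic that Dataflow 1 performs for folder A could be sketched in PySpark as below. This is only an illustration of the full-load step; the container name, storage account, and `transform_a` function are placeholders, not names from your data lake.

```python
# Rough PySpark equivalent of Dataflow 1 for one container (full load).
# "mycontainer", "mystorageaccount", and transform_a() are placeholders:
# substitute your own container name, storage account, and folder-A transformation.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

container = "mycontainer"
base = f"abfss://{container}@mystorageaccount.dfs.core.windows.net"

def transform_a(df):
    # Example folder-A transformation: drop rows with a null id and add a load timestamp.
    return df.filter(F.col("id").isNotNull()).withColumn("loaded_at", F.current_timestamp())

# Merge every JSON file under folder A into one DataFrame, transform it,
# and write a single Parquet output for that folder.
df_a = spark.read.json(f"{base}/A/*.json")
transform_a(df_a).coalesce(1).write.mode("overwrite").parquet(f"{base}/A_parquet")
```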

For Incremental load:

After the first load, use the below pipeline design.

Create parameters for the folder path and file name in the pipeline and pass the trigger parameters to them. Then create a Storage event trigger scoped to all containers and give .json as the "Blob path ends with" value.

Refer to this SO answer to learn how to pass trigger parameters to pipeline parameters.

Create another set of three dataflows for the incremental load.

Take a Parquet dataset with a dataset parameter for the container and give your folder A Parquet file name. Use this dataset as both Source1 and the sink. Create another JSON dataset with dataset parameters for both the container and the file name, and use it as Source2 in the dataflow. Apply a union transformation (by name) to Source1 and Source2, and write the result to the sink that reuses the Source1 dataset.
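The append-by-union logic that this incremental dataflow performs can be sketched in PySpark as below. The paths are placeholders that would come from the trigger; in ADF this is the Parquet-source plus JSON-source plus union-by-name dataflow described above, not separate code you need to deploy.

```python
# Sketch of the incremental dataflow's logic: read the existing Parquet output,
# read the newly uploaded JSON file, union by column name, and write back.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def append_file(parquet_path: str, new_json_path: str) -> None:
    existing = spark.read.parquet(parquet_path)
    new_rows = spark.read.json(new_json_path)
    # allowMissingColumns (Spark 3.1+) tolerates schema drift between old and new files.
    merged = existing.unionByName(new_rows, allowMissingColumns=True)
    # Write to a staging path first: overwriting the path we are still lazily
    # reading from would delete the source files before the read finishes.
    staging = parquet_path + "_staging"
    merged.coalesce(1).write.mode("overwrite").parquet(staging)
    spark.read.parquet(staging).write.mode("overwrite").parquet(parquet_path)

# Example call with hypothetical paths:
# append_file("abfss://container1@acct.dfs.core.windows.net/A_parquet",
#             "abfss://container1@acct.dfs.core.windows.net/A/new_file.json")
```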

Then follow the below pipeline design.

Set variable -> Extract the folder name from the pipeline parameter and store it in a string variable `folder` using the expression split(pipeline().parameters.folder_path,'/')[1].
Switch activity -> Give the `folder` variable as the switch expression.
 - Case `A` -> Give the case name as `A`.
    - Dataflow A -> Pass the folder name and file name parameters to the dataflow's dataset parameters.
 - Case `B` -> Give the case name as `B`.
    - Dataflow B -> Pass the folder name and file name parameters to the dataflow's dataset parameters.
 - Case `C` -> Give the case name as `C`.
    - Dataflow C -> Pass the folder name and file name parameters to the dataflow's dataset parameters.

For every file uploaded to any of your containers, the above pipeline will extract its folder name and append the file's data to the Parquet file for that folder.
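The Set variable plus Switch pair only models a small piece of routing logic; for clarity, this is what it does, expressed as plain Python (the folder names match your A/B/C layout, but the transformation functions are placeholders for the three dataflows):

```python
# What the Set variable + Switch activities do, expressed as plain Python.
# The trigger hands the pipeline a folder path such as "container1/A"; the
# second path segment decides which folder-specific transformation runs.

def transform_a(folder_path: str, file_name: str) -> None:
    print(f"folder A transformation for {folder_path}/{file_name}")  # placeholder for Dataflow A

def transform_b(folder_path: str, file_name: str) -> None:
    print(f"folder B transformation for {folder_path}/{file_name}")  # placeholder for Dataflow B

def transform_c(folder_path: str, file_name: str) -> None:
    print(f"folder C transformation for {folder_path}/{file_name}")  # placeholder for Dataflow C

ROUTES = {"A": transform_a, "B": transform_b, "C": transform_c}

def route(folder_path: str, file_name: str) -> None:
    # Same idea as split(pipeline().parameters.folder_path,'/')[1] in the Set variable activity.
    folder = folder_path.split("/")[1]
    handler = ROUTES.get(folder)
    if handler is None:
        raise ValueError(f"No transformation defined for folder {folder!r}")
    handler(folder_path, file_name)

# Example: route("container1/A", "new_file.json") runs the folder-A branch.
```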

As you have mentioned that the transformations are simple, you can try the above approach.

If the transformations are complex, then having a Databricks or Synapse notebook for the transformations is the better approach. You can use the ADF storage event trigger for the file upload/modification and call the Databricks or Synapse notebook, passing it the file path.
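As a rough illustration of that alternative, the notebook could accept the folder path and file name from ADF as parameters and do both the transformation and the Parquet append in one place. This is a minimal sketch: the parameter-passing mechanism, storage account, and paths are assumptions about your setup.

```python
# Hedged sketch of the notebook alternative: the ADF storage event trigger calls
# a Databricks/Synapse notebook activity and passes the folder path and file name
# of the JSON file that was just uploaded or modified.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# In Databricks these values would typically come from dbutils.widgets.get("folder_path")
# and dbutils.widgets.get("file_name"); in Synapse, from a parameters cell.
folder_path = "container1/A"      # placeholder value
file_name = "new_file.json"       # placeholder value

account = "mystorageaccount"      # placeholder storage account
container, folder = folder_path.split("/")[0], folder_path.split("/")[1]
base = f"abfss://{container}@{account}.dfs.core.windows.net"

new_rows = spark.read.json(f"{base}/{folder}/{file_name}")

# Apply the folder-specific transformation here (placeholder: pass-through).
transformed = new_rows

# Append to the folder's Parquet output; "append" avoids re-reading the existing
# data, at the cost of accumulating multiple part files over time.
transformed.write.mode("append").parquet(f"{base}/{folder}_parquet")
```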
