ADF dynamic row starting position

We have an excel file which we need to convert to CSV using ADF.

The excel file has data table starting from row 5.

However, there are instances where "Time" (on row 3) is missing and the data starts from row 4.

Is there anyway we can check row index of "ProductID" to determine starting position of the data table?

Asked Mar 10 at 12:16 by developer
  • Have you tried any approach so far? – Rakesh Govindula Commented Mar 10 at 20:33
  • I tried to get cell A4 content and compare with word "ProductID" to check if it starts from row 4 otherwise row 5. Is there any better way to do this? – developer Commented Mar 11 at 9:36

2 Answers

I tried to get cell A4 content and compare with word "ProductID" to check if it starts from row 4 otherwise row 5. Is there any better way to do this?

If you know the header is always in either A4 or A5, you can directly opt for that approach, since a single cell check covers both cases.

But if you don't have control over which row the header lands on, you can try the loop-based approach below to find the required row position.

First, create the Excel dataset with two dataset parameters: a string parameter range for the Range property, and a Boolean parameter header for the "First row as header" property.

Now, in the pipeline, create three variables: counter, temp, and col. (ADF pipeline variables support only String, Boolean, and Array types, so declare all three as String and give counter a default value of '1'.) In this approach, the counter is incremented by 1 on each pass of an Until loop. On every iteration, the counter is used as the row number in the dataset's range parameter, and the first cell of that row is checked against ProductID. If it doesn't match, the next row is checked on the next iteration; if it matches, that row is the header row and the loop ends.

Take an Until activity with the below expression as its condition.

@equals(variables('col'),'ProductID')

Inside the Until activity, take a Set variable activity and copy the current counter value into the temp variable.

Now, take a Lookup activity with the above Excel dataset, with "First row only" unchecked so the output is a value array. Provide the counter-based range for the range parameter and false for the header parameter:

range - @concat('A',string(variables('counter')))
header - false
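
For reference, with "First row only" unchecked and header false, the Lookup output has roughly the shape below (sketched as a Python literal; the row values are made up, and ADF auto-names headerless columns Prop_0, Prop_1, and so on):

# Illustrative Lookup output for a range such as "A5" (values are made up)
lookup_output = {
    "count": 2,
    "value": [
        {"Prop_0": "ProductID", "Prop_1": "Name"},  # first row at the range start
        {"Prop_0": "101", "Prop_1": "Widget"}       # remaining rows follow
    ]
}
first_cell = lookup_output["value"][0]["Prop_0"]    # -> "ProductID"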

Take a Set variable activity and store the first row's first-column value in the col variable:

@activity('Lookup1').output.value[0].Prop_0

In another Set variable activity, increment the counter by assigning @add(variables('temp'),1) to it. (If you declared the variables as Strings, the arithmetic needs explicit conversions, e.g. @string(add(int(variables('temp')),1)).)

After the end of the Until loop, note that counter has already been incremented one step past the header row, so the header row number itself is held in temp. You can use the same Excel dataset in your subsequent activities, passing @concat('A',string(variables('temp'))) for the range parameter and now setting the header parameter to true.
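
For readers who prefer to trace the loop in code, here is the same control flow as a minimal Python sketch (lookup_first_cell is a hypothetical stand-in for the Lookup activity; the variable names match the pipeline variables). It also makes the off-by-one mentioned above visible:

def find_header_row(lookup_first_cell, target="ProductID"):
    counter = 1
    col = None
    while col != target:                             # Until: @equals(variables('col'),'ProductID')
        temp = counter                               # Set variable: temp = counter
        col = lookup_first_cell("A" + str(counter))  # Lookup: range A{counter}, header false
        counter = temp + 1                           # Set variable: counter = temp + 1
    return temp                                      # header row; counter is now one past it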

In case someone wants to do this using a Databricks Notebook activity instead, pass the file path from ADF to the notebook and execute the code below.
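
A minimal sketch of that hand-off, assuming the Notebook activity passes a base parameter named file_path (the parameter name is an assumption; match it to whatever you configure in the pipeline):

%python
# Read the file path passed in from the ADF Notebook activity's base parameters.
# "file_path" is an assumed parameter name; it must match the pipeline's setting.
dbutils.widgets.text("file_path", "")         # declare the widget with an empty default
file_path = dbutils.widgets.get("file_path")  # value supplied by ADF at run time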

Below is the function for loading the Excel data. It scans for the ProductID string to get the header row index, then reads the column headers from that row. Finally, it loads the data from the row after ProductID onwards and returns a dataframe.

%python
import openpyxl
from pyspark.sql.types import StructType, StructField, StringType

def extract_data_from_excel(file_path, sheet_name="Sheet1", start_string="ProductID"):
    try:
        workbook = openpyxl.load_workbook(file_path)  # load the Excel workbook
        sheet = workbook[sheet_name]

        # find the index of the first row that contains start_string
        start_row_index = None
        for row_index, row in enumerate(sheet.iter_rows(values_only=True), start=1):
            if start_string in row:
                start_row_index = row_index
                break

        if start_row_index is None:
            print(f"Error: String '{start_string}' not found in sheet '{sheet_name}'")
            return None

        # extract the headers/column names from the header row
        headers = [cell.value for cell in sheet[start_row_index] if cell.value is not None]

        # extract the data rows; slicing (rather than dropping None cells) keeps values
        # aligned with their columns, and str() matches the all-string schema below
        data = []
        for row_index in range(start_row_index + 1, sheet.max_row + 1):
            cells = sheet[row_index][:len(headers)]
            row_values = [str(cell.value) if cell.value is not None else None for cell in cells]
            if all(value is None for value in row_values):  # skip empty rows
                continue
            data.append(row_values)

        # build a string-typed dataframe ("spark" is the session Databricks provides)
        schema = StructType([StructField(str(header), StringType(), True) for header in headers])
        df = spark.createDataFrame(data, schema)

        return df

    except Exception as e:
        print(f"Error: {e}")
        return None

Then call this function using the code below and write the result to a CSV file.

file_path = "/dbfs/mnt/Test/Prod1.xlsx"    # /dbfs FUSE path so openpyxl can read it; change to yours
file_output_path = "/mnt/Test/Prod1.csv"   # Spark (dbfs:/mnt/...) path for the writer; change to yours

result_df = extract_data_from_excel(file_path)
# result_df.show()  # optional preview

if result_df is not None:
    result_df.write.csv(file_output_path, header=True, mode="overwrite")

Once you have the data in a correct dataframe, you have multiple ways to load it; I chose CSV so that an ADF Copy activity can load the data into the destination.
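
One caveat: df.write.csv() produces a folder of part files rather than a single file. If the downstream Copy activity expects exactly one CSV, you can coalesce to a single partition first; a sketch, reusing result_df from above (the output path is illustrative):

%python
# coalesce(1) funnels all data through one task, so use it only for modest outputs.
(result_df
    .coalesce(1)
    .write
    .option("header", True)
    .mode("overwrite")
    .csv("/mnt/Test/Prod1_single"))  # Spark path (dbfs:/mnt/...), not the /dbfs FUSE path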

This process requires a Databricks cluster to start up, so it will not be cheap compared to the ADF-only solution above. But for large loads, with thousands of files containing millions of rows, this solution might perform well.
