ADF dynamic row starting position

We have an excel file which we need to convert to CSV using ADF.

The excel file has data table starting from row 5.

However, there are instances where "Time" (on row 3) is missing and the data starts from row 4.

Is there anyway we can check row index of "ProductID" to determine starting position of the data table?

Asked Mar 10 at 12:16 by developer
  • Have you tried any approach so far? – Rakesh Govindula Commented Mar 10 at 20:33
  • I tried to get cell A4 content and compare with word "ProductID" to check if it starts from row 4 otherwise row 5. Is there any better way to do this? – developer Commented Mar 11 at 9:36

2 Answers

I tried to get cell A4 content and compare with word "ProductID" to check if it starts from row 4 otherwise row 5. Is there any better way to do this?

If you know the header is always in either A4 or A5, you can directly opt for that approach, since a single cell check covers both cases.

But if you don't have control over which row the header lands on, you can try the loop-based approach below to find the required row position.

First, create the Excel dataset with two dataset parameters: a string parameter range for the Range property, and a Boolean parameter header for the "First row as header" property.

Now, in the pipeline, create three variables: counter, temp, and col. (ADF pipeline variables support only String, Boolean, and Array types, so declare all three as String and give counter a default value of '1'.) In this approach, the counter is incremented by 1 on each pass of an Until loop. On every iteration, the counter is used as the row number in the dataset's range parameter, and the first cell of that row is checked against ProductID. If it doesn't match, the next row is checked on the next iteration; if it matches, that row is the header row and the loop ends.

Take an Until activity with the below expression as its condition.

@equals(variables('col'),'ProductID')

Inside the Until activity, take a Set variable activity and copy the current counter value into the temp variable.

Now, take a Lookup activity with the above Excel dataset, with "First row only" unchecked so the output is a value array. Provide the counter-based range for the range parameter and false for the header parameter:

range - @concat('A',string(variables('counter')))
header - false
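
For reference, with "First row only" unchecked and header false, the Lookup output has roughly the shape below (sketched as a Python literal; the row values are made up, and ADF auto-names headerless columns Prop_0, Prop_1, and so on):

# Illustrative Lookup output for a range such as "A5" (values are made up)
lookup_output = {
    "count": 2,
    "value": [
        {"Prop_0": "ProductID", "Prop_1": "Name"},  # first row at the range start
        {"Prop_0": "101", "Prop_1": "Widget"}       # remaining rows follow
    ]
}
first_cell = lookup_output["value"][0]["Prop_0"]    # -> "ProductID"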

Take a Set variable activity and store the first row's first-column value in the col variable:

@activity('Lookup1').output.value[0].Prop_0

In another Set variable activity, increment the counter by assigning @add(variables('temp'),1) to it. (If you declared the variables as Strings, the arithmetic needs explicit conversions, e.g. @string(add(int(variables('temp')),1)).)

After the end of the Until loop, note that counter has already been incremented one step past the header row, so the header row number itself is held in temp. You can use the same Excel dataset in your subsequent activities, passing @concat('A',string(variables('temp'))) for the range parameter and now setting the header parameter to true.
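
For readers who prefer to trace the loop in code, here is the same control flow as a minimal Python sketch (lookup_first_cell is a hypothetical stand-in for the Lookup activity; the variable names match the pipeline variables). It also makes the off-by-one mentioned above visible:

def find_header_row(lookup_first_cell, target="ProductID"):
    counter = 1
    col = None
    while col != target:                             # Until: @equals(variables('col'),'ProductID')
        temp = counter                               # Set variable: temp = counter
        col = lookup_first_cell("A" + str(counter))  # Lookup: range A{counter}, header false
        counter = temp + 1                           # Set variable: counter = temp + 1
    return temp                                      # header row; counter is now one past it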

In case someone wants to do this using a Databricks Notebook activity instead, pass the file path from ADF to the notebook and execute the code below.
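
A minimal sketch of that hand-off, assuming the Notebook activity passes a base parameter named file_path (the parameter name is an assumption; match it to whatever you configure in the pipeline):

%python
# Read the file path passed in from the ADF Notebook activity's base parameters.
# "file_path" is an assumed parameter name; it must match the pipeline's setting.
dbutils.widgets.text("file_path", "")         # declare the widget with an empty default
file_path = dbutils.widgets.get("file_path")  # value supplied by ADF at run time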

Below is the function for loading the Excel data. It scans for the ProductID string to get the header row index, then reads the column headers from that row. Finally, it loads the data from the row after ProductID onwards and returns a dataframe.

%python
import openpyxl
from pyspark.sql.types import StructType, StructField, StringType

def extract_data_from_excel(file_path, sheet_name="Sheet1", start_string="ProductID"):
    try:
        workbook = openpyxl.load_workbook(file_path)  # load the Excel workbook
        sheet = workbook[sheet_name]

        # find the index of the first row that contains start_string
        start_row_index = None
        for row_index, row in enumerate(sheet.iter_rows(values_only=True), start=1):
            if start_string in row:
                start_row_index = row_index
                break

        if start_row_index is None:
            print(f"Error: String '{start_string}' not found in sheet '{sheet_name}'")
            return None

        # extract the headers/column names from the header row
        headers = [cell.value for cell in sheet[start_row_index] if cell.value is not None]

        # extract the data rows; slicing (rather than dropping None cells) keeps values
        # aligned with their columns, and str() matches the all-string schema below
        data = []
        for row_index in range(start_row_index + 1, sheet.max_row + 1):
            cells = sheet[row_index][:len(headers)]
            row_values = [str(cell.value) if cell.value is not None else None for cell in cells]
            if all(value is None for value in row_values):  # skip empty rows
                continue
            data.append(row_values)

        # build a string-typed dataframe ("spark" is the session Databricks provides)
        schema = StructType([StructField(str(header), StringType(), True) for header in headers])
        df = spark.createDataFrame(data, schema)

        return df

    except Exception as e:
        print(f"Error: {e}")
        return None

Then call this function using the code below and write the result to a CSV file.

file_path = "/dbfs/mnt/Test/Prod1.xlsx"    # /dbfs FUSE path so openpyxl can read it; change to yours
file_output_path = "/mnt/Test/Prod1.csv"   # Spark (dbfs:/mnt/...) path for the writer; change to yours

result_df = extract_data_from_excel(file_path)
# result_df.show()  # optional preview

if result_df is not None:
    result_df.write.csv(file_output_path, header=True, mode="overwrite")

Once you have the data in a correct dataframe, you have multiple ways to load it; I chose CSV so that an ADF Copy activity can load the data into the destination.
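
One caveat: df.write.csv() produces a folder of part files rather than a single file. If the downstream Copy activity expects exactly one CSV, you can coalesce to a single partition first; a sketch, reusing result_df from above (the output path is illustrative):

%python
# coalesce(1) funnels all data through one task, so use it only for modest outputs.
(result_df
    .coalesce(1)
    .write
    .option("header", True)
    .mode("overwrite")
    .csv("/mnt/Test/Prod1_single"))  # Spark path (dbfs:/mnt/...), not the /dbfs FUSE path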

This process requires a Databricks cluster to start up, so it will not be cheap compared to the ADF-only solution above. But for large loads, with thousands of files containing millions of rows, this solution might perform well.
