We have an Excel file which we need to convert to CSV using ADF.
The Excel file's data table starts at row 5.
However, there are instances where "Time" (on row 3) is missing and the data table starts at row 4.
Is there any way we can check the row index of "ProductID" to determine the starting position of the data table?
Comments:
- Have you tried any approach so far? – Rakesh Govindula
- I tried to get cell A4's content and compare it with the word "ProductID" to check whether the table starts at row 4; otherwise row 5. Is there any better way to do this? – developer
2 Answers
> I tried to get cell A4's content and compare it with the word "ProductID" to check whether the table starts at row 4; otherwise row 5. Is there any better way to do this?
If you know the correct header is always in either A4 or A5, you can go with that approach directly, since it handles both cases. But if you don't have control over where the header row lands, you can try the approach below, which uses a loop to find the required row position.
First, create the Excel dataset with two dataset parameters: a string parameter range for the Range property, and a Boolean parameter header for the "First row as header" property.
Now create the following pipeline variables: counter (the candidate header row), temp (a scratch value used for incrementing), and col (the first-column value read in each iteration). In this approach, the counter variable is incremented by 1 from an initial value of 1 inside an Until loop. On each iteration, the counter builds the Range of the Excel dataset and the first-row value is checked against ProductID. If it does not match, the next counter value is checked in the following iteration; if it matches, the current counter value is the correct header row and the loop ends.
Take an Until activity with the below expression as its condition:
@equals(variables('col'),'ProductID')
Inside the Until activity, take a Set variable activity and assign the counter variable's value to the temp variable.
Now take a Lookup activity that uses the above Excel dataset. Pass the counter variable for the range parameter and false for the header parameter:
range - @concat('A',string(variables('counter')))
header - false
Take another Set variable activity and store the first row's first-column value in the col variable:
@activity('Lookup1').output.value[0].Prop_0
In another Set variable activity, assign the expression @add(variables('temp'),1) to the counter variable, which increments the counter.
After the Until loop ends, the correct header row is stored in the counter variable. You can use the same Excel dataset in your subsequent activities with the counter variable in the range parameter, now changing the header parameter value to true.
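For reference, here is a minimal plain-Python sketch of the control flow this loop implements (this is not ADF code; read_first_cell is a hypothetical stand-in for the Lookup activity reading range A{row} with header set to false):

def find_header_row(read_first_cell, target="ProductID", max_rows=50):
    # Walk down column A until the header string is found; the matching
    # row number plays the role of the counter variable in the pipeline.
    for counter in range(1, max_rows + 1):
        if read_first_cell(f"A{counter}") == target:
            return counter
    raise ValueError(f"'{target}' not found in the first {max_rows} rows")

The pipeline spreads this logic across the temp, col, and counter Set variable activities, but the effect is the same.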
In case someone wants to do this with a Databricks Notebook activity instead, pass the file path from ADF to the notebook and execute the code below.
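To receive that parameter in the notebook, here is a minimal sketch (assuming the Notebook activity passes a base parameter named file_path; match the name to whatever your pipeline actually sends):

%python
# Read the base parameter passed from the ADF Notebook activity.
# "file_path" is an assumed parameter name; use your pipeline's name.
dbutils.widgets.text("file_path", "")         # declare the widget with an empty default
file_path = dbutils.widgets.get("file_path")  # fetch the value passed by ADF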
Below is the function that loads the Excel data. It scans for the ProductID string to get its row index, then picks up the other column headers from that row. Finally, it loads the data from the ProductID row onwards and returns a dataframe.
%python
# In a Databricks notebook, `spark` (the SparkSession) is already available,
# so there is no need to import or construct one here.
import openpyxl
from pyspark.sql.types import StructType, StructField, StringType

def extract_data_from_excel(file_path, sheet_name="Sheet1", start_string="ProductID"):
    try:
        workbook = openpyxl.load_workbook(file_path)  # load the Excel file
        sheet = workbook[sheet_name]

        # Find the index of the first row containing start_string
        start_row_index = None
        for row_index, row in enumerate(sheet.iter_rows(values_only=True), start=1):
            if start_string in row:
                start_row_index = row_index
                break
        if start_row_index is None:
            print(f"Error: String '{start_string}' not found in sheet '{sheet_name}'")
            return None

        # Extract the headers/column names from the header row
        headers = [cell.value for cell in sheet[start_row_index] if cell.value is not None]

        # Extract the data rows below the header row
        data = []
        for row_number in range(start_row_index + 1, sheet.max_row + 1):
            # Slice to the header width so a blank cell in the middle of a
            # row cannot shift the remaining values out of their columns.
            cells = sheet[row_number][:len(headers)]
            # Cast to str because the schema declares every column as StringType
            row_values = [str(cell.value) if cell.value is not None else None for cell in cells]
            if all(value is None for value in row_values):  # skip empty rows
                continue
            data.append(row_values)

        schema = StructType([StructField(header, StringType(), True) for header in headers])
        rdd = spark.sparkContext.parallelize(data)
        df = spark.createDataFrame(rdd, schema)
        return df
    except Exception as e:
        print(f"Error: {e}")
        return None
Then call the function using the code below and write the result to a CSV file:
file_path = "/dbfs/mnt/Test/Prod1.xlsx"   # local (FUSE) path for openpyxl; change to your file path
file_output_path = "/mnt/Test/Prod1.csv"  # DBFS path for the Spark writer; change to your file path

result_df = extract_data_from_excel(file_path)
# result_df.show()  # optional: preview the dataframe
if result_df is not None:  # the function returns None on failure
    result_df.write.csv(file_output_path, header=True, mode="overwrite")
Once you have the data in a correct dataframe, you have multiple ways to load it; I chose CSV so an ADF Copy activity can load the data into the destination.
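Note that Spark's write.csv produces a directory of part files rather than a single Prod1.csv. If the downstream Copy activity expects one file, one option (a sketch, reusing the paths above) is to collapse the dataframe to a single partition before writing:

%python
# One partition means Spark emits a single part file inside the output
# directory; only suitable when the data fits comfortably on one worker.
result_df.coalesce(1).write.csv(file_output_path, header=True, mode="overwrite")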
(The original answer includes screenshots of the input file and the resulting output dataframe.)
This process requires starting a Databricks cluster, so it will not be cheap compared to the other solution, which uses only ADF. But for large loads, with thousands of files containing millions of rows, this solution might perform well.