Failed to load XML file using library: com.databricks.spark.xml - Stack Overflow
I'm trying to load an XML file using Databricks. My environment is Azure Databricks:
14.3 LTS (includes Apache Spark 3.5.0, Scala 2.12)
Here is my code where it fails:
# Load the specified XML file
single_file_df = (
    spark.read.format("com.databricks.spark.xml")
    .option("rowTag", "Tag")
    .load(specific_file_path)
)

# Show a sample of the data
single_file_df.show(truncate=False)
The error is:
Py4JJavaError: An error occurred while calling o457.load.
: Failure to initialize configuration for storage account [REDACTED].dfs.core.windows.net: Invalid configuration value detected for fs.azure.account.keyInvalid configuration value detected for fs.azure.account.key
File <command-1483821906694786>, line 4
1 # Load the specified XML file
2 single_file_df = (
3 spark.read.format("com.databricks.spark.xml")
----> 4 .option("rowTag", "Tag").load(specific_file_path)
5 )
7 # # Show a sample of the data
8 single_file_df.show(truncate=False)
What I've checked so far:
The connection: I can list the container, and I also managed to read the file raw with this code:
simple_df = spark.read.text(specific_file_path)
simple_df.show(truncate=False)
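For reference, the listing check mentioned above would look roughly like this (a sketch using dbutils; input_path is defined in the complete code below):
# List the container contents to confirm connectivity (dbutils.fs.ls returns FileInfo objects)
files = dbutils.fs.ls(input_path)
for f in files:
    print(f.path)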
The library installed on the cluster: com.databricks:spark-xml_2.12:0.15.0. I've already installed the latest version.
Permission to read the blob data: I already have all the relevant roles (Storage Blob Data Owner, Storage Blob Data Contributor, Storage Blob Data Reader).
I also reviewed this GitHub repo:
https://github.com/devlace/azure-databricks-storage?tab=readme-ov-file
but didn't find the answer there.
What are my options then? It seems to have something to do with the library com.databricks:spark-xml_2.12:0.15.0, but I don't understand the error.
I don't want to mount the storage or use blob libraries (SDK connections); I want to use as little plain Python as possible and stick to PySpark.
Your help would be greatly appreciated.
Here is my complete code:
# Define the Azure Key Vault scope
scope = 'your_scope_here'
# Retrieve storage account names and access keys securely from Key Vault
dl_storage_account = dbutils.secrets.get(scope=scope, key="dl_storage_account_name_key")
dl_storage_account_access_key = dbutils.secrets.get(scope=scope, key="dl_storage_account_access_key_key")
blob_storage_account = dbutils.secrets.get(scope=scope, key="blob_storage_account_name_key")
blob_storage_account_access_key = dbutils.secrets.get(scope=scope, key="blob_storage_account_access_key_key")
# Set Spark configurations for accessing Azure storage
spark.conf.set(f"fs.azure.account.key.{blob_storage_account}.blob.core.windows", blob_storage_account_access_key)
spark.conf.set(f"fs.azure.account.key.{dl_storage_account}.dfs.core.windows", dl_storage_account_access_key)
# Define the input path for Azure Data Lake (redacted)
input_path = f"abfss://container-name@{dl_storage_account}.dfs.core.windows/path/to/directory/"
from pyspark.sql import SparkSession
from pyspark.sql.functions import schema_of_xml, expr, col, current_timestamp, lit, explode_outer, input_file_name, regexp_extract, concat_ws
from pyspark.sql.types import NullType
# Define the specific file path for Azure Data Lake (redacted)
specific_file_path = f"abfss://container-name@{dl_storage_account}.dfs.core.windows/path/to/directory/file-name.xml"
# Load the specified XML file
single_file_df = (
    spark.read.format("com.databricks.spark.xml")
    .option("rowTag", "Tag")
    .load(specific_file_path)
)

# Show a sample of the data
single_file_df.show(truncate=False)
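For reference, a quick way to sanity-check the account-key setting is to print the exact configuration key the ABFS driver will resolve; the error above is what typically surfaces when that key name is incomplete or its value is empty. A sketch, assuming the standard key-name pattern:
# Hypothetical check: confirm the config key exists and its value is non-empty
key_name = f"fs.azure.account.key.{dl_storage_account}.dfs.core.windows.net"
value = spark.conf.get(key_name, None)
print(key_name, "->", "set" if value else "MISSING or empty")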
Comment: Have you tried granting the Azure Databricks app the Storage Blob Data Contributor role? – Dileep Raj Narayan Thumula, Jan 16 at 9:29
1 Answer
I have tried the below approach:
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<YOUR CLIENT ID>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="Key-vault-secret-dbx02", key="secretKV2"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<YOUR TENANT ID>/oauth2/token"
}

try:
    # Source is the ADLS container to mount (redacted in the original post)
    dbutils.fs.mount(
        source="abfss://<container>@<storage-account>.dfs.core.windows.net",
        mount_point="/mnt/raw_agent02",
        extra_configs=configs
    )
    print("Mount successful!")
except Exception as e:
    print(f"Error mounting storage: {e}")
In the above code I have mounted ADLS with Azure Databricks. See the Databricks documentation to know more about how to mount ADLS on Azure Databricks.
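Since the question asks to avoid mounts, note that roughly the same OAuth setup can instead be applied per session with spark.conf.set, after which abfss:// paths are readable directly. A minimal sketch, assuming a service principal; the account name, scope, and key names are placeholders:
# Per-session OAuth config scoped to one storage account (no mount required)
storage_account = "<storage-account>"
spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", "<YOUR CLIENT ID>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net",
               dbutils.secrets.get(scope="<your-scope>", key="<your-secret-key>"))
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               "https://login.microsoftonline.com/<YOUR TENANT ID>/oauth2/token")
# Reads can then target abfss://<container>@<storage-account>.dfs.core.windows.net/... directly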
Also grant the Azure Databricks app the Key Vault Administrator and Storage Blob Data Contributor roles.
Once the mount is successful, go to your Cluster > Libraries > Maven Central and install the library below on the cluster:
com.databricks:spark-xml_2.12:0.18.0
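As an aside, Databricks Runtime 14.3 LTS also ships native XML file support, so the built-in "xml" format may work without installing spark-xml at all; worth verifying on your runtime. A short sketch, reusing the rowTag from the question:
# Built-in XML reader on DBR 14.3+ (no external library needed, assuming native XML support)
native_df = (
    spark.read.format("xml")
    .option("rowTag", "Tag")
    .load(specific_file_path)
)
native_df.show(truncate=False)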
Then try reading the XML file from ADLS:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType

# Define the expected schema of the XML rows
schema = StructType([
    StructField("ID", IntegerType(), True),
    StructField("Name", StringType(), True),
    StructField("Department", StringType(), True),
    StructField("Designation", StringType(), True),
    StructField("Salary", IntegerType(), True),
    StructField("JoiningDate", StringType(), True)
])

file_path = "/mnt/raw_agent02/new/sampledilip.xml"

try:
    df = spark.read.format("com.databricks.spark.xml") \
        .option("rowTag", "Employee") \
        .schema(schema) \
        .load(file_path)
    print("File loaded successfully!")

    df = df.withColumn("JoiningDate", df["JoiningDate"].cast(DateType()))
    df.printSchema()
    df.show()
except Exception as e:
    print(f"Error reading XML file: {e}")
In the above code I am defining the schema for the XML file, providing the path to the XML file in the mounted container, and reading the XML file into a DataFrame.
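If you would rather not hand-write the schema, spark-xml can also infer it from the data, at the cost of an extra pass over the file. A minimal sketch reusing the same path and rowTag:
# Schema inference: omit .schema(...) and let the reader derive column types
inferred_df = (
    spark.read.format("com.databricks.spark.xml")
    .option("rowTag", "Employee")
    .load(file_path)
)
inferred_df.printSchema()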
Results:
File loaded successfully!
root
|-- ID: integer (nullable = true)
|-- Name: string (nullable = true)
|-- Department: string (nullable = true)
|-- Designation: string (nullable = true)
|-- Salary: integer (nullable = true)
|-- JoiningDate: date (nullable = true)
+---+-------------+----------+-----------+------+-----------+
| ID| Name|Department|Designation|Salary|JoiningDate|
+---+-------------+----------+-----------+------+-----------+
|101| John Doe| HR| Manager| 75000| 2018-05-15|
|102| Jane Smith| Finance| Analyst| 68000| 2019-07-20|
|103|Emily Johnson| IT| Developer| 80000| 2020-03-10|
|104|Michael Brown| Marketing| Executive| 55000| 2021-08-01|
+---+-------------+----------+-----------+------+-----------+