I'm trying to load an XML file using Databricks. My environment is Azure Databricks:

14.3 LTS (includes Apache Spark 3.5.0, Scala 2.12)

Here is my code where it fails:

# Load the specified XML file
single_file_df = (
    spark.read.format("com.databricks.spark.xml")
    .option("rowTag", "Tag").load(specific_file_path)
)

# Show a sample of the data
single_file_df.show(truncate=False)

The error is:

Py4JJavaError: An error occurred while calling o457.load.
: Failure to initialize configuration for storage account [REDACTED].dfs.core.windows.net: Invalid configuration value detected for fs.azure.account.key

File <command-1483821906694786>, line 4
      1 # Load the specified XML file
      2 single_file_df = (
      3     spark.read.format("com.databricks.spark.xml")
----> 4     .option("rowTag", "Tag").load(specific_file_path)
      5 )
      7 # Show a sample of the data
      8 single_file_df.show(truncate=False)

What I've checked so far:

  1. The connection: I can list the container, and I also managed to read the file as raw text with this code:

    simple_df = spark.read.text(specific_file_path)
    simple_df.show(truncate=False)
    
  2. The library installed on the cluster: com.databricks:spark-xml_2.12:0.15.0. I've already installed the latest.

  3. Permission to read the blob data: I already have all of these roles:

    Storage Blob Data Owner
    Storage Blob Data Contributor
    Storage Blob Data Reader
    
  4. I also reviewed this GitHub repo: https://github.com/devlace/azure-databricks-storage?tab=readme-ov-file, but didn't find the answer there (one more sanity check is sketched below).
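
One more sanity check that stays in PySpark (a sketch; `dl_storage_account` is defined in the full code further down, and I'm only confirming the key is present in the session configuration, not diagnosing the reader):

key_name = f"fs.azure.account.key.{dl_storage_account}.dfs.core.windows.net"
# Avoid printing the secret itself; just confirm it is set in the session conf.
print("account key set:", spark.conf.get(key_name, None) is not None)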

What are my options then?

It seems to have something to do with the library com.databricks:spark-xml_2.12:0.15.0, but I don't understand the error.

I don't want to mount or use blob libraries (SDK connections); I want to use as little plain Python as possible and stick to PySpark.

Your help would be greatly appreciated.

Here is my complete code:

# Define the Azure Key Vault scope
scope = 'your_scope_here'

# Retrieve storage account names and access keys securely from Key Vault
dl_storage_account = dbutils.secrets.get(scope=scope, key="dl_storage_account_name_key")
dl_storage_account_access_key = dbutils.secrets.get(scope=scope, key="dl_storage_account_access_key_key")
blob_storage_account = dbutils.secrets.get(scope=scope, key="blob_storage_account_name_key")
blob_storage_account_access_key = dbutils.secrets.get(scope=scope, key="blob_storage_account_access_key_key")

# Set Spark configurations for accessing Azure storage
spark.conf.set(f"fs.azure.account.key.{blob_storage_account}.blob.core.windows.net", blob_storage_account_access_key)
spark.conf.set(f"fs.azure.account.key.{dl_storage_account}.dfs.core.windows.net", dl_storage_account_access_key)

# Define the input path for Azure Data Lake (redacted)
input_path = f"abfss://container-name@{dl_storage_account}.dfs.core.windows.net/path/to/directory/"

from pyspark.sql import SparkSession
from pyspark.sql.functions import schema_of_xml, expr, col, current_timestamp, lit, explode_outer, input_file_name, regexp_extract, concat_ws
from pyspark.sql.types import NullType

# Define the specific file path for Azure Data Lake (redacted)
specific_file_path = f"abfss://container-name@{dl_storage_account}.dfs.core.windows.net/path/to/directory/file-name.xml"

# Load the specified XML file
single_file_df = (
    spark.read.format("com.databricks.spark.xml")
    .option("rowTag", "Tag").load(specific_file_path)
)

# Show a sample of the data
single_file_df.show(truncate=False)
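
One option I can still try while staying in PySpark (a sketch reusing the variables above; I am assuming, without having confirmed it, that the spark-xml reader may resolve credentials from the Hadoop configuration rather than the session conf):

# Sketch: also register the account key on the underlying Hadoop configuration,
# in case the third-party reader looks there instead of the session conf.
# Note: spark._jsc is an internal PySpark handle.
spark._jsc.hadoopConfiguration().set(
    f"fs.azure.account.key.{dl_storage_account}.dfs.core.windows.net",
    dl_storage_account_access_key,
)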
asked Jan 16 at 9:22 by dexon
  • Comment: Have you tried granting Azure Databricks the Storage Blob Data Contributor role? – Dileep Raj Narayan Thumula, Jan 16 at 9:29

1 Answer


I have tried the below approach:

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<YOUR CLIENT ID>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="Key-vault-secret-dbx02", key="secretKV2"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<YOUR TENANT ID>/oauth2/token"
}
try:
    dbutils.fs.mount(
        source="abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
        mount_point="/mnt/raw_agent02",
        extra_configs=configs
    )
    print("Mount successful!")
except Exception as e:
    print(f"Error mounting storage: {e}")

In the above code I have mounted ADLS on Azure Databricks; see the Azure Databricks documentation for more on mounting ADLS.
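
A quick way to confirm the mount is usable (a minimal check, using the mount point from the snippet above):

# List the mounted directory to verify the mount works end to end
for f in dbutils.fs.ls("/mnt/raw_agent02"):
    print(f.path)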

Also grant the Azure Databricks app the Key Vault Administrator and Storage Blob Data Contributor roles.

Once the mount is successful, go to your Cluster > Libraries > Maven Central.

Use the below library and install it on the cluster:

com.databricks:spark-xml_2.12:0.18.0

Then try reading the XML file from ADLS:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType
schema = StructType([
    StructField("ID", IntegerType(), True),
    StructField("Name", StringType(), True),
    StructField("Department", StringType(), True),
    StructField("Designation", StringType(), True),
    StructField("Salary", IntegerType(), True),
    StructField("JoiningDate", StringType(), True) 
])
file_path = "/mnt/raw_agent02/new/sampledilip.xml"  
try:
    df = spark.read.format("com.databricks.spark.xml") \
        .option("rowTag", "Employee") \
        .schema(schema) \
        .load(file_path)

    print("File loaded successfully!")

    
    df = df.withColumn("JoiningDate", df["JoiningDate"].cast(DateType()))


    df.printSchema()
    df.show()

except Exception as e:
    print(f"Error reading XML file: {e}")

In the above code, I define the schema for the XML file, point to the XML file in the mounted container, and read it into a DataFrame.

Results:

File loaded successfully!
root
 |-- ID: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Department: string (nullable = true)
 |-- Designation: string (nullable = true)
 |-- Salary: integer (nullable = true)
 |-- JoiningDate: date (nullable = true)

+---+-------------+----------+-----------+------+-----------+
| ID|         Name|Department|Designation|Salary|JoiningDate|
+---+-------------+----------+-----------+------+-----------+
|101|     John Doe|        HR|    Manager| 75000| 2018-05-15|
|102|   Jane Smith|   Finance|    Analyst| 68000| 2019-07-20|
|103|Emily Johnson|        IT|  Developer| 80000| 2020-03-10|
|104|Michael Brown| Marketing|  Executive| 55000| 2021-08-01|
+---+-------------+----------+-----------+------+-----------+
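
As a side note: Databricks Runtime 14.3 (the runtime in the question) also ships a native XML reader in Public Preview, so the external spark-xml library may not be needed at all. A sketch, assuming the native reader's "xml" format name and the same rowTag option and file path as above:

# Sketch: native XML reader on Databricks Runtime 14.3+ (no external library)
native_df = (
    spark.read.format("xml")
    .option("rowTag", "Employee")
    .load("/mnt/raw_agent02/new/sampledilip.xml")
)
native_df.show()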
