I am working with a Parquet file that contains a complicated nested column structure. Specifically, I have a column where the structure is as follows:
{
  "dynamodb": {
    "NewImage": {
      "DiscountData": {
        "M": {
          "TEST1": {"N": "0"},
          "TEST2": {"N": "0"},
          "TEST3": {"N": "0"},
          "TEST4": {"N": "0"},
          "TEST5": {"N": "1"}
        }
      }
    }
  }
}
The key challenge is the M field. It is dynamic: it can contain varying keys (TEST1, TEST2, etc.) depending on the dataset. I am processing multiple Parquet files, and the structure of M may change across files.
I attempted to let Spark infer the schema, but I ran into this issue:
AnalysisException: Found duplicate column(s) in the data schema
To make it consistent, I want to either:
- treat the entire content of M as a string (e.g., a JSON string), or
- convert M into a MapType where both keys and values are strings.
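To make the target shapes concrete, here is a plain-Python sketch (not Spark) of the two representations I am after; the sample keys are just placeholders from the example above:

```python
import json

# A stand-in for one file's dynamic M payload.
m = {
    "TEST1": {"N": "0"},
    "TEST2": {"N": "0"},
    "TEST5": {"N": "1"},
}

# Option 1: the whole M payload as a single JSON string.
m_as_string = json.dumps(m, sort_keys=True)

# Option 2: M as a flat map of string -> string, keeping only the "N" value.
m_as_map = {key: value["N"] for key, value in m.items()}
```

Either shape would be stable no matter which TESTn keys a given file happens to contain.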
Here’s the schema I tried to define:
from pyspark.sql.types import StructType, StructField, StringType, MapType

schema = StructType([
    StructField("dynamodb", StructType([
        StructField("NewImage", StructType([
            StructField("DiscountData", StructType([
                StructField("M", MapType(StringType(), StringType()), True)
            ]), True)
        ]), True)
    ]), True)
])
This is what I am using to create the DataFrame:
from pyspark.sql.functions import input_file_name

df_disscountcoderule = spark.read \
    .schema(schema) \
    .option("mergeSchema", "true") \
    .option("recursiveFileLookup", "true") \
    .parquet(f"s3://{source_bucket}/{source_discountcodesrules_prefix}") \
    .filter(input_file_name().endswith(".parquet")) \
    .filter(~input_file_name().contains("processing-failed"))
However, I get the following error when applying this schema:
java.lang.ClassCastException: class org.apache.spark.sql.types.MapType cannot be cast to class org.apache.spark.sql.types.StructType (org.apache.spark.sql.types.MapType and org.apache.spark.sql.types.StructType are in unnamed module of loader 'app')
My questions:
- Is there a way to create a static schema that can handle this dynamic structure?
- How can I convert the M field into a JSON string or a MapType for consistent processing?
I appreciate any guidance or alternative approaches!