I'm experimenting with Apache Iceberg and trying to understand how column renaming works. In my scenario I'm working with an existing data lake of Parquet files stored in AWS S3. My goal is to create Iceberg tables from the existing files, without having to move or rewrite any data.
All the information I can find regarding column renaming seems to suggest that it should just work. Using the Iceberg Java SDK:
icebergTable.updateSchema()
    .renameColumn("old_name", "new_name")
    .commit();
When I do this and then query existing data (where the column is stored as 'old_name' in the Parquet files), every value returned for column 'new_name' is null. I was expecting the original 'old_name' values to be mapped into the 'new_name' column.
Is this expectation valid? Am I missing something about how Iceberg column renaming works?
Edit: additional detail (for posterity - I'm not sure anyone will have an answer for this)
I've narrowed down the issue a bit further, and it appears to apply only to data in Parquet files that were originally created by something other than Iceberg (e.g. Parquet files written using the standard Apache Parquet libs). These files can be added to an Iceberg table using the appendFile() function (https://iceberg.apache.org/javadoc/1.6.1/org/apache/iceberg/AppendFiles.html#appendFile(org.apache.iceberg.DataFile)). Data created this way, then appended to an Iceberg table, does not appear to properly track column renames.
Interestingly, a Parquet file that was originally created by Iceberg can also be appended to another Iceberg table in the same way, and that data does properly track column renames, even if the file was copied and/or moved from its original Iceberg location. So it seems there's something unique about the Parquet files created by Iceberg that allows them to track column renames.
Asked Nov 22, 2024 at 0:30 by pedorro; edited Nov 25, 2024 at 15:29.
Comments:
- Please provide a minimal reproducible example (stackoverflow.com/help/minimal-reproducible-example). Why is the post tagged with 'parquet'? If it is not relevant, please remove the tag. – ticktalk, Nov 22, 2024 at 1:03
- I think there is substantial setup involved in providing a reproducible example, including an existing data lake in S3. I guess my primary question is whether my expectation about how column renaming works is valid. I'm not necessarily looking for help debugging my code, but more for general advice on how Iceberg works. – pedorro, Nov 22, 2024 at 15:12
1 Answer
So it turns out I was missing an important piece during Iceberg table creation. Iceberg tracks columns by field ID rather than by name, so a rename only changes the name attached to an ID. Since the Parquet files that I'm adding were created outside of the Iceberg ecosystem, they lack the key metadata that Iceberg adds when it writes Parquet: namely the field-id.
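As a rough illustration of why the field-id matters, here is a toy Java model. It is not Iceberg's reader code: the class `FieldIdResolutionSketch` and the method `readColumn` are hypothetical names, and the fallback for files without IDs (look up the column's current name) is an assumption chosen only to reproduce the nulls described in the question. Iceberg's real fallback logic may differ.

```java
import java.util.Map;

/** Toy model of ID-based column resolution; NOT Iceberg's actual reader. */
final class FieldIdResolutionSketch {

    /**
     * Returns the value of one table column from one row of a data file.
     *
     * @param tableFieldId  field ID of the column in the table schema
     * @param currentName   the column's current name in the table schema
     * @param fileFieldIds  column name -> field ID as stored in the file;
     *                      empty when the writer stored no IDs
     * @param fileRow       column name -> value as stored in the file
     */
    static Object readColumn(int tableFieldId, String currentName,
                             Map<String, Integer> fileFieldIds,
                             Map<String, Object> fileRow) {
        if (!fileFieldIds.isEmpty()) {
            // File carries IDs: match on ID, so the stored name is irrelevant.
            for (Map.Entry<String, Integer> e : fileFieldIds.entrySet()) {
                if (e.getValue() == tableFieldId) {
                    return fileRow.get(e.getKey());
                }
            }
            return null;
        }
        // ASSUMPTION for this sketch: with no IDs, only the current name is tried.
        return fileRow.get(currentName);
    }

    public static void main(String[] args) {
        Map<String, Object> row = Map.of("old_name", "alice");

        // After renameColumn("old_name", "new_name"), field ID 1 is named "new_name".
        Object fromIcebergFile = readColumn(1, "new_name", Map.of("old_name", 1), row);
        Object fromExternalFile = readColumn(1, "new_name", Map.of(), row);

        System.out.println(fromIcebergFile);   // file with IDs still resolves
        System.out.println(fromExternalFile);  // file without IDs finds nothing
    }
}
```

In this model the file that stores IDs keeps returning its value after the rename, while the file without IDs has nothing to match against and yields null, which mirrors the difference between Iceberg-written and externally written files observed in the question.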
To support adding externally created Parquet files, Iceberg provides the ability to define this metadata on the table itself. This is done with the schema.name-mapping.default property (described here: https://iceberg.apache.org/spec/#name-mapping-serialization). It sounds like this property should be used any time non-Iceberg Parquet files are included in an Iceberg table.
Specifically, in this case, the table is created using the following code:
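import org.apache.iceberg.TableProperties

// Map the name stored in the existing Parquet files to the table's field ID.
val nameMapping = """[
  {"field-id": 1, "names": ["old_name"]}
]"""

catalog.buildTable(tableId, schema)
    .withPartitionSpec(partitionSpec)
    // TableProperties.DEFAULT_NAME_MAPPING is the schema.name-mapping.default key.
    .withProperties(mapOf(TableProperties.DEFAULT_NAME_MAPPING to nameMapping))
    .create()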
With this additional information defined on the table, the column rename as done in the original question now works.
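To make the effect of the name mapping concrete, here is a small Java sketch. It is a simplified model under stated assumptions, not Iceberg's implementation: `NameMappingSketch`, `idForFileColumn`, and `readColumn` are hypothetical names, and the mapping is held as a plain `Map` from field ID to the list of names rather than parsed from the JSON property.

```java
import java.util.List;
import java.util.Map;

/** Toy model of name-mapping lookup; NOT Iceberg's actual implementation. */
final class NameMappingSketch {

    /** Returns the field ID that a file column name maps to, or null if unmapped. */
    static Integer idForFileColumn(Map<Integer, List<String>> nameMapping,
                                   String fileColumn) {
        for (Map.Entry<Integer, List<String>> e : nameMapping.entrySet()) {
            if (e.getValue().contains(fileColumn)) {
                return e.getKey();
            }
        }
        return null;
    }

    /** Reads the table column with the given field ID from a file that stores no IDs. */
    static Object readColumn(int tableFieldId,
                             Map<Integer, List<String>> nameMapping,
                             Map<String, Object> fileRow) {
        for (Map.Entry<String, Object> column : fileRow.entrySet()) {
            Integer id = idForFileColumn(nameMapping, column.getKey());
            if (id != null && id == tableFieldId) {
                return column.getValue();
            }
        }
        return null; // no file column maps to this field ID
    }

    public static void main(String[] args) {
        // Same content as the JSON above: field ID 1 is known as "old_name" in files.
        Map<Integer, List<String>> nameMapping = Map.of(1, List.of("old_name"));
        Map<String, Object> externalRow = Map.of("old_name", "alice");

        // The table column may now be called "new_name"; only its ID (1) is used.
        System.out.println(readColumn(1, nameMapping, externalRow));
        // Without a mapping, the external file's column cannot be tied to ID 1.
        System.out.println(readColumn(1, Map.of(), externalRow));
    }
}
```

The point of the sketch is that the mapping ties the name stored in the file to a field ID, so the table column's current name plays no part in the lookup.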
Tagged: parquet. Original title: Renamed column is returning null from existing data (Stack Overflow).