I'm experimenting with Apache Iceberg, and trying to understand how column renaming works. In my scenario I'm working with an existing datalake of Parquet files stored in AWS S3. My goal is to create Iceberg tables using the existing files, without having to move or re-write any data.

All the information I can find regarding column renaming seems to suggest that it should just work. Using the Iceberg Java SDK:

icebergTable.updateSchema()
    .renameColumn("old_name", "new_name")
    .commit();

When I do this though, and query existing data (where the column is stored as 'old_name' in the Parquet files), I get all null values returned for column 'new_name'. I was expecting the original 'old_name' values to be mapped into the 'new_name' column.

Is this expectation valid? Am I missing something about how Iceberg column renaming works?


Edit: additional detail (for posterity - I'm not sure anyone will have an answer for this)

I've narrowed down the issue a bit further, and it appears to apply only to data in Parquet files that were originally created by something other than Iceberg (e.g. Parquet files written using the standard Apache Parquet libs). These files can be added to an Iceberg table using the appendFile() function (https://iceberg.apache.org/javadoc/1.6.1/org/apache/iceberg/AppendFiles.html#appendFile(org.apache.iceberg.DataFile)). Data created this way, then appended to an Iceberg table, does not appear to properly track column renames.

Interestingly, a Parquet file that was originally created by Iceberg can also be appended to another Iceberg table in the same way, and that data does properly track column renames, even if the file was copied and/or moved from its original Iceberg location. So it seems there's something unique about the Parquet files created by Iceberg that allows them to track column renames.

asked Nov 22, 2024 at 0:30 by pedorro; edited Nov 25, 2024 at 15:29
  • please provide a stackoverflow.com/help/minimal-reproducible-example . why is the post tagged with 'parquet' , if not relevant please remove – ticktalk Commented Nov 22, 2024 at 1:03
  • I think there is substantial setup involved with providing a reproducible example, including an existing datalake in S3. I guess my primary question is whether the expectation, about how column renaming works, is valid. I'm not necessarily looking for help debugging my code, but more just general advice on how Iceberg works. – pedorro Commented Nov 22, 2024 at 15:12

1 Answer

So it turns out I was missing an important piece during Iceberg table creation. Since the parquet files that I'm adding are created outside of the Iceberg ecosystem, they lack some key metadata that Iceberg adds when it writes Parquet: namely the field-id.

To support adding externally created Parquet files, Iceberg provides the ability to define this metadata on the table itself. This is done with the schema.name-mapping.default property (described here: https://iceberg.apache.org/spec/#name-mapping-serialization). It sounds like this property should be used any time non-Iceberg Parquet files are included in an Iceberg table.

Specifically, in this case, the table is created using the following code:

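import org.apache.iceberg.TableProperties

// Map each Iceberg field ID to the column name(s) used in the existing Parquet files.
// The field-id must match the ID of the corresponding column in `schema`.
val nameMapping = """[
    {"field-id": 1, "names": ["old_name"]}
]"""
catalog.buildTable(tableId, schema)
    .withPartitionSpec(partitionSpec)
    // Sets the schema.name-mapping.default table property.
    .withProperties(mapOf(TableProperties.DEFAULT_NAME_MAPPING to nameMapping))
    .create()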

With this additional information defined on the table, the column rename as done in the original question now works.
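
For anyone trying to picture why the field-id matters, here is a toy model in plain Java. It has no Iceberg dependency, every class and method name in it is hypothetical, and it is not Iceberg's actual reader code. It only illustrates the idea: a column is looked up by field ID when the file carries IDs, via the name mapping when the table supplies one, and otherwise (an assumption on my part, chosen because it matches the nulls I saw) by the column's current name.

```java
import java.util.List;
import java.util.Map;

// Toy model only: none of these types exist in Iceberg.
class FieldIdResolutionSketch {

    /** Hypothetical stand-in for a Parquet data file. */
    static final class DataFile {
        final Map<String, String> columns;           // column name in the file -> value
        final Map<Integer, String> embeddedFieldIds; // field ID -> file column name; empty if written outside Iceberg

        DataFile(Map<String, String> columns, Map<Integer, String> embeddedFieldIds) {
            this.columns = columns;
            this.embeddedFieldIds = embeddedFieldIds;
        }
    }

    /**
     * Resolves one table column against a file.
     * tableFieldId and tableColumnName come from the current table schema;
     * nameMapping (field ID -> names that field may have in files) may be null.
     */
    static String read(DataFile file, int tableFieldId, String tableColumnName,
                       Map<Integer, List<String>> nameMapping) {
        // Case 1: the file carries field IDs, so column names are irrelevant.
        if (!file.embeddedFieldIds.isEmpty()) {
            String fileColumn = file.embeddedFieldIds.get(tableFieldId);
            return fileColumn == null ? null : file.columns.get(fileColumn);
        }
        // Case 2: no IDs in the file, but the table supplies a name mapping.
        if (nameMapping != null) {
            for (String candidate : nameMapping.getOrDefault(tableFieldId, List.of())) {
                if (file.columns.containsKey(candidate)) {
                    return file.columns.get(candidate);
                }
            }
            return null;
        }
        // Case 3 (assumed behaviour): fall back to the column's current name.
        return file.columns.get(tableColumnName);
    }

    public static void main(String[] args) {
        DataFile external = new DataFile(Map.of("old_name", "v1"), Map.of());
        DataFile icebergWritten = new DataFile(Map.of("old_name", "v1"), Map.of(1, "old_name"));
        Map<Integer, List<String>> mapping = Map.of(1, List.of("old_name"));

        // After the rename, the table schema says field 1 is called "new_name".
        System.out.println(read(external, 1, "new_name", null));       // prints: null
        System.out.println(read(external, 1, "new_name", mapping));    // prints: v1
        System.out.println(read(icebergWritten, 1, "new_name", null)); // prints: v1
    }
}
```

The three calls in main correspond to the three situations in this thread: an external file with no mapping (nulls after the rename), the same file once the table has a name mapping, and a file written by Iceberg.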

Tags: parquet. Original question: "Renamed column is returning null from existing data" (Stack Overflow)