Context
Recently, we have been planning to use Pulumi to manage all of our existing AWS Glue Data Catalog tables, which are in Iceberg format. I wrote an earlier post about what Iceberg is and what features it offers: https://stonefishy.github.io/2020/05/23/what-is-apache-iceberg/
Because our catalog table schemas keep changing with requirements, we needed a technical POC to check whether Pulumi can also update the Iceberg metadata schema when it manages AWS Glue Iceberg tables.
Create Glue Iceberg Table
We’re using Pulumi to manage our AWS cloud infrastructure. Before creating a Glue table, a Glue database is needed.
import pulumi
The code above creates a Glue database named pulumi_database_test. The next step is to create a Glue table in Iceberg format.
import pulumi
The important thing to note here is that we need to set open_table_format_input with iceberg_input, and set metadata_operation to CREATE, because we want to create a new Iceberg table with a new schema.
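As a rough illustration of those settings, a database and table definition could look like the sketch below, written against the pulumi_aws provider. It is not the original program: the column types are assumptions inferred from the INSERT statement used later in this post, and the S3 location keeps the same placeholder bucket (s3://xxx) that appears in the metadata path.

```python
import pulumi
import pulumi_aws as aws

# Glue database that holds the table (name taken from this post).
database = aws.glue.CatalogDatabase(
    "pulumi_database_test",
    name="pulumi_database_test",
)

# Assumption: column types are inferred from the values inserted later
# ('1a', '2a', true, '4a'); the original definitions are not shown.
columns = [
    aws.glue.CatalogTableStorageDescriptorColumnArgs(name="test1", type="string"),
    aws.glue.CatalogTableStorageDescriptorColumnArgs(name="test2", type="string"),
    aws.glue.CatalogTableStorageDescriptorColumnArgs(name="test3", type="boolean"),
    aws.glue.CatalogTableStorageDescriptorColumnArgs(name="test4", type="string"),
]

table = aws.glue.CatalogTable(
    "pulumi_external_table_test",
    name="pulumi_external_table_test",
    database_name=database.name,
    table_type="EXTERNAL_TABLE",
    # Ask Glue to create the Iceberg metadata together with the table.
    open_table_format_input=aws.glue.CatalogTableOpenTableFormatInputArgs(
        iceberg_input=aws.glue.CatalogTableOpenTableFormatInputIcebergInputArgs(
            metadata_operation="CREATE",
        ),
    ),
    storage_descriptor=aws.glue.CatalogTableStorageDescriptorArgs(
        # Placeholder bucket, as in the rest of this post.
        location="s3://xxx/pulumi_external_table_test",
        columns=columns,
    ),
)
```

Keeping the columns in one list makes the later schema change a one-line addition.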
Below is a screenshot of the created Glue Iceberg table. You can see that the 4 fields are added to the schema and the table format is Apache Iceberg.
Next, let’s check the important file: the Apache Iceberg metadata file, which is located in s3://xxx/pulumi_external_table_test/metadata/. Download the JSON file 00006-fd122b03-a7aa-42cf-8fec-001535a9fcf5.metadata.json from S3. The 4 fields are defined in the metadata JSON file, which is good: the metadata JSON is created along with the Glue table.
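To check the fields without reading the JSON by hand, a short script can list them. The sketch below assumes the file follows the Iceberg table metadata layout (a "schemas" list plus "current-schema-id", with a fallback to the older single "schema" key). The inline sample and the function name current_schema_fields are made up for illustration and are not the real file’s contents.

```python
import json


def current_schema_fields(metadata: dict) -> list[tuple[str, str]]:
    """Return (name, type) pairs for the table's current Iceberg schema."""
    schemas = metadata.get("schemas")
    if schemas:
        # Newer metadata files keep every schema version and point at the current one.
        current_id = metadata.get("current-schema-id", 0)
        schema = next(s for s in schemas if s.get("schema-id", 0) == current_id)
    else:
        # Older metadata files carry a single top-level schema.
        schema = metadata["schema"]
    return [(field["name"], str(field["type"])) for field in schema["fields"]]


# Illustrative sample only; a real metadata file has many more keys.
SAMPLE_METADATA = json.loads("""
{
  "format-version": 2,
  "current-schema-id": 0,
  "schemas": [
    {
      "type": "struct",
      "schema-id": 0,
      "fields": [
        {"id": 1, "name": "test1", "required": false, "type": "string"},
        {"id": 2, "name": "test2", "required": false, "type": "string"},
        {"id": 3, "name": "test3", "required": false, "type": "boolean"},
        {"id": 4, "name": "test4", "required": false, "type": "string"}
      ]
    }
  ]
}
""")

if __name__ == "__main__":
    for name, type_ in current_schema_fields(SAMPLE_METADATA):
        print(f"{name}: {type_}")
```

For the downloaded file, load it with json.load on the open file and pass the resulting dictionary to the same function.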
Insert new data into the Glue Iceberg table
Let’s use AWS Athena to insert a test row into the table.
INSERT INTO pulumi_database_test.pulumi_external_table_test(test1,test2,test3,test4) VALUES('1a', '2a', true, '4a')
The data is inserted successfully, and we can use a SELECT statement to query it.
In an Iceberg table, we can insert, update, and delete data as well.
Update Glue Iceberg table schema
Let’s add a new field, test5, to the Glue Iceberg table, based on the previous code.
import pulumi
Run the pulumi up command to update the Glue table schema.
After that, we can see that the Glue table schema has been updated with the new field test5.
Let’s insert new data into the table, including the new field test5, by running the following in AWS Athena.
INSERT INTO pulumi_database_test.pulumi_external_table_test(test1,test2,test3,test4,test5) VALUES('1b', '2b', true, '4b', '5b')
Athena reports the following error:
COLUMN_NOT_FOUND: Insert column name does not exist in target table: test5. If a data manifest file was generated at 's3://xxxxxx/4c346103-60d2-45ea-9813-d7060bd5efe9/Unsaved/2024/10/09/37f67a67-4604-43cb-b113-af351c363a51-manifest.csv', you may need to manually clean the data from locations specified in the manifest. Athena will not delete data in your account.
When we check the Apache Iceberg metadata file again, the new field test5 has not been added to it. The Glue table schema and the Iceberg metadata are now out of sync, which is why the insert with the new field failed.
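This kind of mismatch between the Glue schema and the Iceberg metadata can be detected with a simple comparison. The sketch below takes the column names Glue reports and the field names from the metadata JSON, both passed in as plain lists (hypothetical inputs; fetching them is left out), and returns the columns that are missing from the Iceberg schema. The function name is an illustrative choice.

```python
def columns_missing_from_iceberg(glue_columns, iceberg_fields):
    """Return Glue column names that are absent from the Iceberg schema, in Glue order."""
    known = set(iceberg_fields)
    return [column for column in glue_columns if column not in known]


if __name__ == "__main__":
    # Values mirror the situation described in this post.
    glue_columns = ["test1", "test2", "test3", "test4", "test5"]
    iceberg_fields = ["test1", "test2", "test3", "test4"]
    print(columns_missing_from_iceberg(glue_columns, iceberg_fields))
```

An empty result means every Glue column also exists in the Iceberg metadata.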
Conclusion
According to the Pulumi documentation, the metadata_operation of iceberg_input in open_table_format_input only supports the value CREATE. It seems that the Iceberg metadata file can only be created when the Glue table is created.
This looks like a Pulumi issue: the Iceberg metadata file is not updated when the Glue table schema is updated. I’ve raised an issue with Pulumi: https://github.com/pulumi/pulumi/issues/17516. Hopefully it will be fixed soon.
Meanwhile, I found that Terraform has the same issue: it also cannot update the Iceberg metadata file when the Glue table schema is updated. The Terraform issue is here: https://github.com/hashicorp/terraform-provider-aws/issues/36641.