

tl;dr: How do I use SparkSession.newSession with changes to the SQL config?

I'm using PySpark within AWS Glue, creating a Glue 5 notebook.

I'd like to have two different SparkSessions with different SQL configs (two different warehouses). Everything is Iceberg.

I can easily set up a session that works fine doing something like this:

from pyspark.sql import SparkSession

warehouse_path = "s3://some_s3_bucket/path"
spark = SparkSession.builder \
    .config("spark.sql.warehouse.dir", warehouse_path) \
    .config("spark.sql.catalog.glue_catalog.warehouse", warehouse_path) \
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "") \
    .config("", "") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.parquet.compression.codec", "gzip") \
    .getOrCreate()

So, to have two different sessions, with different warehouse paths, I attempt to do something like this:

spark = SparkSession.builder \
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "") \
    .config("", "") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.parquet.compression.codec", "gzip") \
    .getOrCreate()

warehouse_path_1 = "s3://s3_bucket_1/path"
spark_session_1 = spark.newSession()
spark_session_1.conf.set("spark.sql.warehouse.dir", warehouse_path_1)
spark_session_1.conf.set("spark.sql.catalog.glue_catalog.warehouse", warehouse_path_1)

warehouse_path_2 = "s3://s3_bucket_2/path"
spark_session_2 = spark.newSession()
spark_session_2.conf.set("spark.sql.warehouse.dir", warehouse_path_2)
spark_session_2.conf.set("spark.sql.catalog.glue_catalog.warehouse", warehouse_path_2)

(I also tried the same thing with all of the SQL confs set on the child sessions, not just the changed ones, with the same results.)

I end up with this error (or a similar one for whichever SQL conf I try to change first):

AnalysisException: Cannot modify the value of a static config: spark.sql.warehouse.dir
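
For anyone trying to reproduce this, here is a rough sketch of how one could ask a session which of these keys it reports as changeable at runtime, instead of finding out by hitting the exception. It relies on `RuntimeConfig.isModifiable`, which `spark.conf` exposes in PySpark; the helper names `split_by_modifiability`, `KEYS_TO_CHECK`, and `report` are made up for illustration. I'm not claiming what it returns for the catalog keys, only that it separates the keys the session says are modifiable from the rest.

```python
def split_by_modifiability(conf, keys):
    """Partition config keys by whether `conf` reports them as runtime-modifiable.

    `conf` is anything exposing isModifiable(key) -> bool, for example
    `spark.conf` (a pyspark.sql.conf.RuntimeConfig) on a live session.
    """
    modifiable, not_modifiable = [], []
    for key in keys:
        (modifiable if conf.isModifiable(key) else not_modifiable).append(key)
    return modifiable, not_modifiable


# Keys taken from the snippets above.
KEYS_TO_CHECK = [
    "spark.sql.warehouse.dir",
    "spark.sql.catalog.glue_catalog.warehouse",
]


def report(spark):
    """Print the split for a fresh child session (hypothetical helper).

    Needs an already-running SparkSession, so it is defined but not called here.
    """
    child = spark.newSession()
    ok, blocked = split_by_modifiability(child.conf, KEYS_TO_CHECK)
    print("reported as modifiable at runtime:", ok)
    print("not reported as modifiable:", blocked)
```

In the notebook this would be called as `report(spark)` after the base session has been created.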

On the one hand, I understand that the SQLConf is static, but if you look at the docs for newSession, they say (emphasis mine):

Returns a new SparkSession as new session, that has separate SQLConf, registered temporary views and UDFs, but shared SparkContext and table cache.

So, if it has a "separate SQLConf", how can I actually set it up with different SQL options?
