I set up a Docker environment containing Spark, PostgreSQL, and Jupyter Notebook, all attached to the same Docker network. The idea is to process large datasets with Spark and then save the results to PostgreSQL.
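For context, the step I am ultimately working towards is writing the processed DataFrame into PostgreSQL over JDBC, roughly like the sketch below (result_df, the table name, and the assumption that the PostgreSQL JDBC driver is on Spark's classpath are placeholders on my side, not working code yet):

    # Sketch of the intended write to the "postgres" service defined in docker-compose
    result_df.write \
        .format("jdbc") \
        .option("url", "jdbc:postgresql://postgres:5432/airflow") \
        .option("dbtable", "etl_results") \
        .option("user", "airflow") \
        .option("password", "airflow") \
        .option("driver", "org.postgresql.Driver") \
        .mode("append") \
        .save()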
My docker-compose.yml is as follows:
version: '3.8'

services:
  spark-master:
    image: bitnami/spark:3.4.1
    container_name: spark-master
    command: bin/spark-class org.apache.spark.deploy.master.Master
    environment:
      - SPARK_MODE=master
      - SPARK_MASTER_HOST=spark-master
    ports:
      - "7077:7077"   # RPC port for workers
      - "8080:8080"   # Web UI for Spark Master
    networks:
      - spark-network

  spark-worker:
    image: bitnami/spark:3.4.1
    container_name: spark-worker
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_CORES=2
      - SPARK_WORKER_MEMORY=4g
      - SPARK_EXECUTOR_MEMORY=3g
      - SPARK_DRIVER_MEMORY=1g
      - SPARK_RPC_MESSAGE_MAX_SIZE=512
      - SPARK_NETWORK_TIMEOUT=300s
      - SPARK_EXECUTOR_HEARTBEAT_INTERVAL=30s
    depends_on:
      - spark-master
    ports:
      - "8081:8081"   # Web UI for Spark Worker
    networks:
      - spark-network

  postgres:
    image: postgres:13
    container_name: postgres
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    ports:
      - "5432:5432"   # Expose PostgreSQL for Airflow and pgAdmin
    volumes:
      - postgres_data:/var/lib/postgresql/data
    networks:
      - spark-network

  pgadmin:
    image: dpage/pgadmin4
    container_name: pgadmin
    depends_on:
      - postgres
    environment:
      PGADMIN_DEFAULT_EMAIL: [email protected]
      PGADMIN_DEFAULT_PASSWORD: admin
    ports:
      - "5050:80"   # pgAdmin Web UI
    volumes:
      - pgadmin_data:/var/lib/pgadmin
    networks:
      - spark-network

  jupyter:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: jupyter
    ports:
      - "8888:8888"
    volumes:
      - ./notebooks:/home/jovyan/work
    networks:
      - spark-network
    environment:
      - SPARK_MASTER_HOST=spark-master
      - SPARK_MASTER=spark://spark-master:7077
      - SPARK_DRIVER_HOST=jupyter   # Use the Jupyter container hostname
      - SPARK_DRIVER_PORT=4040      # Set a static driver port
      - PYSPARK_PYTHON=python3
      - PYSPARK_DRIVER_PYTHON=jupyter
      - PYSPARK_DRIVER_PYTHON_OPTS="notebook --ip=0.0.0.0 --allow-root"
    command: >
      start-notebook.sh --NotebookApp.token='' --NotebookApp.password=''

volumes:
  postgres_data:
  pgadmin_data:
  spark-logs:

networks:
  spark-network:
    driver: bridge
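For completeness, the wiring I have in mind for picking up SPARK_DRIVER_HOST / SPARK_DRIVER_PORT inside the notebook looks roughly like this (a sketch of the intent, not the code I actually ran, which is shown further down):

    import os
    from pyspark.sql import SparkSession

    # Sketch: pass the Jupyter container hostname and the static driver port
    # from the environment into the session config.
    spark = SparkSession.builder \
        .master(os.environ.get("SPARK_MASTER", "spark://spark-master:7077")) \
        .appName("JupyterETL") \
        .config("spark.driver.host", os.environ.get("SPARK_DRIVER_HOST", "jupyter")) \
        .config("spark.driver.port", os.environ.get("SPARK_DRIVER_PORT", "4040")) \
        .getOrCreate()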
Everything starts fine, and I can access the Spark Master and Worker UIs. When I test Spark in Jupyter by printing its version, it works immediately. However, when I run the following code:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("spark://spark-master:7077") \
    .appName("JupyterETL") \
    .getOrCreate()

data = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
data.show()
It runs indefinitely in Jupyter. When I check the Spark UI, it shows that the worker has picked up the job I submitted from Jupyter. However, the container logs show no meaningful progress (e.g. no stage ever completes), even after 10 minutes, although this should be a trivial task.
Could you kindly help identify any issues with my docker-compose.yml or suggest what might be going wrong? Thank you!