I'm trying to run a Dataflow job using a Flex Template in Docker. Here is what I have:


FROM python:3.11-slim

COPY --from=apache/beam_python3.11_sdk:2.54.0 /opt/apache/beam /opt/apache/beam

COPY --from=gcr.io/dataflow-templates-base/python311-template-launcher-base:20230622_RC00 /opt/google/dataflow/python_template_launcher /opt/google/dataflow/python_template_launcher

# Install necessary dependencies for Java
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    wget \
    ca-certificates \
    openjdk-17-jdk && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Set JAVA_HOME environment variable (optional, but recommended)
ENV JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
ENV PATH="${JAVA_HOME}/bin:${PATH}"

# Verify Java installation (optional, but good practice)
RUN java -version

# Location to store the pipeline artifacts.
ARG WORKDIR=/template
WORKDIR ${WORKDIR}

COPY main.py .
COPY setup.py .
COPY requirements.txt .

# Installing an exhaustive list of dependencies from requirements.txt
# helps to ensure that every time the Docker container image is built,
# the Python dependencies stay the same. Using `--no-cache-dir` reduces image size.
RUN pip install --no-cache-dir -r requirements.txt

# Installing the pipeline package makes all modules encompassing the pipeline
# available via import statements and installs necessary dependencies.
# Editable installation allows picking up later changes to the pipeline code
# for example during local experimentation within the container.
RUN pip install -e .

# For more information, see the Dataflow Flex Templates documentation.
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/main.py"

# Because this image will be used as a custom SDK container image, and it already
# installs the dependencies from the requirements.txt, we can omit
# the FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE directive here
# to reduce pipeline submission time.
# Similarly, since we already installed the pipeline package,
# we don't have to specify the FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py" configuration option.

# Optionally, verify that dependencies are not conflicting.
# A conflict may or may not be significant for your pipeline.
RUN pip check

# Optionally, list all installed dependencies.
# The output can be used to seed requirements.txt for reproducible builds.
RUN pip freeze

# Set the entrypoint to Apache Beam SDK launcher, which allows this image
# to be used as an SDK container image.
ENTRYPOINT ["/opt/apache/beam/boot"]
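
Since the comments in the Dockerfile say this image also serves as the custom SDK container image, my understanding from the Dataflow docs (treat this as an assumption on my part) is that the same image has to be handed to the workers through the sdk_container_image pipeline option, roughly like this sketch:

# Sketch only: pass the image built above to the workers as the custom SDK container.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    # ... the rest of my options ...
    sdk_container_image="us-east1-docker.pkg.dev/<PROJECT_ID>/dataflow-abc-repo/sql-server-to-bq-2:latest",
)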

Below my apache beam code:


import argparse
import logging

import apache_beam as beam
from apache_beam.io.jdbc import ReadFromJdbc
from apache_beam.options.pipeline_options import GoogleCloudOptions
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import StandardOptions



def run(argv=None):
    
    """Runs the Apache Beam pipeline to read a CSV file from GCS and write it to BigQuery with specified schema, adding a timestamp column.

    Args:
        input_file_path: The GCS path to the CSV file (e.g., gs://your-bucket/path/to/data.csv).
        project_id: The ID of the Google Cloud project.
        dataset: The ID of the BigQuery dataset.
        table: The ID of the BigQuery table.
        network: The VPC network to use.
        subnetwork: The subnetwork to use.
        table_schema: The JSON string containing the BigQuery schema definition.
    """
    try:
        
        parser = argparse.ArgumentParser(description="Dataflow Flex Template")
        
        parser.add_argument("--project", required=True, help="GCP project.")
        parser.add_argument("--region", required=True, help="GCP region.")
        parser.add_argument("--staging_location", required=True, help="GCS staging location.")
        parser.add_argument("--temp_location", required=True, help="GCS temp location.")
        parser.add_argument("--network", help="VPC Network Name")
        parser.add_argument("--subnetwork", help="Subnetwork URL")
        parser.add_argument("--service_account_email", help="Service account email")
        parser.add_argument("--runner", default="DataflowRunner", help="Dataflow Runner")

        # Changed this to parse_known_args
        args, pipeline_args = parser.parse_known_args(argv)

        # Print argument values for debugging
        print(f"Project: {args.project}")
        print(f"Region: {args.region}")
        print(f"Staging Location: {args.staging_location}")
        print(f"Temp Location: {args.temp_location}")
        print(f"Network: {argswork}")
        print(f"Subnetwork: {args.subnetwork}")
        print(f"Service Account Email: {args.service_account_email}")
        
        
        pipeline_options = PipelineOptions(
            pipeline_args,
            network=args.network,
            subnetwork=args.subnetwork,
            save_main_session=True,
            use_public_ips=False,
        )
        google_cloud_options = pipeline_options.view_as(GoogleCloudOptions)
        google_cloud_options.project = args.project
        google_cloud_options.region = args.region
        google_cloud_options.staging_location = args.staging_location
        google_cloud_options.temp_location = args.temp_location
        google_cloud_options.service_account_email = args.service_account_email

        standard_options = pipeline_options.view_as(StandardOptions)
        standard_options.runner = args.runner
        
        db_user = "sqlserver" #Retrieve from environment variable
        db_password = "root"#Retrieve from environment variable
        
        jdbc_url = "jdbc:sqlserver://XXX:XXX;databaseName=atra;trustServerCertificate=true;encrypt=false;"
        query = "SELECT TOP 10 * FROM production.brands"
       

        with beam.Pipeline(options=pipeline_options) as pipeline:
            data = pipeline | "Read from SQL Server" >> ReadFromJdbc(
                table_name='brands',
                driver_class_name='com.microsoft.sqlserver.jdbc.SQLServerDriver',
                jdbc_url=jdbc_url,
                username=db_user,
                password=db_password,
                query=query,
                fetch_size=1000,  # Adjust fetch size as needed
                classpath=["gs://XXX/jars/mssql-jdbc-12.8.1.jre11.jar"]
            )
            
            

            data | "Print Data" >> beam.Map(print)


    except Exception as e:
        logging.exception(f"Error: {e}")
        raise  # re-raise so the Flex Template launcher reports the failure


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()
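
From what I can tell, ReadFromJdbc is a cross-language transform, so Beam has to start a Java expansion service while the pipeline graph is being built, which is presumably why it looks for a java or docker executable at that point. Assuming the expansion_service parameter works the way the Beam docs describe (I have not verified this), the transform could instead be pointed at an expansion service that is already running; the host and port below are placeholders:

# Sketch, not working code: point ReadFromJdbc at an already-running expansion service.
data = pipeline | "Read from SQL Server" >> ReadFromJdbc(
    table_name='brands',
    driver_class_name='com.microsoft.sqlserver.jdbc.SQLServerDriver',
    jdbc_url=jdbc_url,
    username=db_user,
    password=db_password,
    expansion_service='localhost:8097',  # placeholder address of a running expansion service
)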

Below is how I build the image that goes into Artifact Registry:

 
gcloud auth configure-docker us-east1-docker.pkg.dev

export PROJECT_ID=$(gcloud config get-value project)
export TEMPLATE_PATH="gs://XXXXX/templates/sqlserver-to-bq.json"

gcloud dataflow flex-template build $TEMPLATE_PATH \
  --image-gcr-path "us-east1-docker.pkg.dev/$PROJECT_ID/dataflow-abc-repo/sql-server-to-bq-2:latest" \
  --sdk-language "PYTHON" \
  --flex-template-base-image "PYTHON3" \
  --py-path "." \
  --worker-region us-east1 \
  --service-account-email [email protected] \
  --env "FLEX_TEMPLATE_PYTHON_PY_FILE=main.py" \
  --env "FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE=requirements.txt"

But somehow I keep getting the error message below from Dataflow:

ERROR:root:Error: Cannot start an expansion service since neither Java nor Docker executables are available in the system.

I have tried other images and installed Java, but the error is always the same!

The Apache Beam code works just fine when I run it in Cloud Shell, where Java is installed. The Google Cloud network/subnetwork setup is also working fine from the Dataflow workers: I ran a different job on Dataflow to check whether it could reach the SQL Server instance, and that worked.

Could someone please point me in the right direction, or just share a Dockerfile and the gcloud command to build it correctly? I have already spent a few days on this with no progress!
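
For debugging, a small check like this could go at the top of run() to log whether the process that expands the pipeline actually sees Java or Docker on its PATH (standard library only, nothing extra to install):

# Sketch: log whether java/docker are visible to the launcher process.
import logging
import shutil

logging.info("java on PATH: %s", shutil.which("java"))
logging.info("docker on PATH: %s", shutil.which("docker"))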
