I have built a complex Flink job for data enrichment. The job consumes messages from external systems via Kafka, and the goal is to enrich these messages with reference data stored in a PostgreSQL database. After enrichment, additional validation and transformation are performed before the enriched data is sent to Kafka sinks.

The setup includes:

  1. Three Input Kafka Streams:

    • Each with different types of raw data requiring enrichment and transformation.
  2. Six Reference Tables in PostgreSQL:

    • The tables are updated daily, and the data is read using the PostgreSQL CDC connector.
  3. Enrichment Process:

    • The reference tables are joined and converted to DataStreams.
    • Kafka streams are enriched using broadcast streams and processed using BroadcastProcessFunction.

Here’s how I implemented this job:


Step 1: PostgreSQL CDC Table Creation

I use the PostgreSQL CDC connector to create tables for the reference data:

tableEnv.executeSql(
    "CREATE TABLE meters (" +
    "  id BIGINT," +
    "  res_spec_id BIGINT," +
    "  serial STRING" +
    ") WITH (" +
    "  'connector' = 'postgres-cdc'," +
    "  'hostname' = 'XXXX'," +
    "  'port' = '5432'," +
    "  'username' = 'XXXX'," +
    "  'password' = 'XXXX'," +
    "  'database-name' = 'name'," +
    "  'schema-name' = 'schema'," +
    "  'decoding.plugin.name' = 'pgoutput'," +
    "  'table-name' = 'meters'," +
    "  'slot.name' = 'meters_slot'," +
    "  'debezium.publication.name' = 'flink_publication'," +
    "  'scan.incremental.snapshot.enabled' = 'true'" +
    ")"
);

This is repeated for all six tables.
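Since only the table name, the column list, and the slot name change between the six statements, one way I could avoid repeating the DDL would be to generate it from a small helper like the following (just a sketch, not what I currently have -- the createCdcTableDdl method and its parameters are hypothetical):

public static String createCdcTableDdl(String tableName, String columns) {
    // Hypothetical helper: builds the CDC DDL for one reference table and
    // derives the replication slot name from the table name.
    return "CREATE TABLE " + tableName + " (" + columns + ") WITH (" +
           "  'connector' = 'postgres-cdc'," +
           "  'hostname' = 'XXXX'," +
           "  'port' = '5432'," +
           "  'username' = 'XXXX'," +
           "  'password' = 'XXXX'," +
           "  'database-name' = 'name'," +
           "  'schema-name' = 'schema'," +
           "  'decoding.plugin.name' = 'pgoutput'," +
           "  'table-name' = '" + tableName + "'," +
           "  'slot.name' = '" + tableName + "_slot'," +
           "  'debezium.publication.name' = 'flink_publication'," +
           "  'scan.incremental.snapshot.enabled' = 'true'" +
           ")";
}

// One call per reference table instead of six copies of the DDL:
tableEnv.executeSql(createCdcTableDdl("meters", "id BIGINT, res_spec_id BIGINT, serial STRING"));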


Step 2: Joining Reference Tables

The reference tables are joined to enrich the data. For example:

public static Table createMeterTable(StreamTableEnvironment tableEnv) {
    return tableEnv.sqlQuery(
        "SELECT * FROM meters mm " +
        "JOIN meter_chars mc ON mm.id = mc.meter_id " +
        "JOIN resource_specs rs ON rs.id = mm.res_spec_id"
    );
}

public static Table createUpTable(StreamTableEnvironment tableEnv) {
    return tableEnv.sqlQuery(
        "SELECT * FROM usage_points up " +
        "JOIN up_chars uc ON up.id = uc.usage_point_id"
    );
}

Step 3: Converting to DataStreams

After creating the joined tables, I convert them to changelog DataStreams:

public DataStream<Row> getMeterStream(StreamTableEnvironment tableEnv) {
    Table meterTable = createMeterTable(tableEnv);
    return tableEnv.toChangelogStream(meterTable);
}

Step 4: Broadcasting Reference Data

Broadcast streams are created from the joined reference data:

public BroadcastStream<Row> getMeterBroadcastStream(DataStream<Row> meterStream) {
    MapStateDescriptor<MeterKey, List<MeterDeployment>> meterStateDescriptor = 
        EnrichmentBroadcastStateConfig.createMeterStateDescriptor();
    return meterStream.broadcast(meterStateDescriptor);
}

public static MapStateDescriptor<MeterKey, List<MeterDeployment>> createMeterStateDescriptor() {
    return new MapStateDescriptor<>(
        "meterStateDescriptor",
        TypeInformation.of(new TypeHint<MeterKey>() {}),
        TypeInformation.of(new TypeHint<List<MeterDeployment>>() {})
    );
}

Step 5: Enriching Kafka Streams

The input Kafka streams are connected to broadcast streams using BroadcastProcessFunction:

public SingleOutputStreamOperator<ReadingBlockDto> enrichBlockWithMeters(
        DataStream<RawBlock> rawBlocks,
        BroadcastStream<Row> meterBroadcast,
        OutputTag<RawBlock> rejectedTag) {
    MapStateDescriptor<MeterKey, List<MeterDeployment>> meterStateDescriptor =
        EnrichmentBroadcastStateConfig.createMeterStateDescriptor();
    return rawBlocks.connect(meterBroadcast)
                    .process(new BlockMeterEnrichmentFunction(meterStateDescriptor, rejectedTag));
}

Step 6: Enrichment Function

Here’s an example of the BroadcastProcessFunction used for enrichment:

public class BlockMeterEnrichmentFunction extends BroadcastProcessFunction<RawBlock, Row, ReadingBlockDto> {
    private final MapStateDescriptor<MeterKey, List<MeterDeployment>> meterStateDescriptor;
    private final OutputTag<RawBlock> rejectedTag;

    public BlockMeterEnrichmentFunction(MapStateDescriptor<MeterKey, List<MeterDeployment>> meterStateDescriptor, OutputTag<RawBlock> rejectedTag) {
        this.meterStateDescriptor = meterStateDescriptor;
        this.rejectedTag = rejectedTag;
    }

    @Override
    public void processElement(RawBlock rawBlock, ReadOnlyContext ctx, Collector<ReadingBlockDto> collector) throws Exception {
        ReadOnlyBroadcastState<MeterKey, List<MeterDeployment>> meterState = ctx.getBroadcastState(meterStateDescriptor);

        List<MeterDeployment> meterDeployments = meterState.get(new MeterKey(rawBlock.getSerial(), rawBlock.getVendorId()));
        // Perform enrichment logic...
    }

    @Override
    public void processBroadcastElement(Row row, Context ctx, Collector<ReadingBlockDto> out) throws Exception {
        MeterDeployment meterDeployment = MeterDeployment.builder() // parse columns
                .build();

        BroadcastState<MeterKey, List<MeterDeployment>> broadcastState = ctx.getBroadcastState(meterStateDescriptor);
        MeterKey key = new MeterKey(meterDeployment.getSerial(), meterDeployment.getVendorId());

        List<MeterDeployment> meterDeployments = broadcastState.get(key);
        if (meterDeployments == null) {
            meterDeployments = new ArrayList<>();
        }

        switch (row.getKind()) {
            case DELETE:
                meterDeployments.remove(meterDeployment);
                if (meterDeployments.isEmpty()) {
                    broadcastState.remove(key);
                } else {
                    broadcastState.put(key, meterDeployments);
                }
                break;
            case UPDATE_BEFORE:
                meterDeployments.remove(meterDeployment);
                broadcastState.put(key, meterDeployments);
                break;
            case UPDATE_AFTER:
            case INSERT:
                meterDeployments.add(meterDeployment);
                broadcastState.put(key, meterDeployments);
                break;
        }
    }
}

Additional Context

  • Flink Version: 1.20
  • CDC Version: 3.2
  • Flink Operator: 1.10
  • State Backend: RocksDB
  • Running on Kubernetes.

Questions

1. Is There a Better Way to Handle This?

I use the Table API because I need the PostgreSQL CDC connector. However:

  • The BroadcastProcessFunction logic is duplicated for every Kafka stream, which seems inefficient.
  • Is there an alternative approach to the broadcast pattern for repeatedly joining the same reference data with different streams?

2. Memory Management

I am using RocksDB as the state backend, but I’m unclear about where certain components are stored:

  • a) Where are the CDC tables created with tableEnv.executeSql stored? Are they in memory, and if so, which memory?
  • b) Do joins between the CDC tables require separate memory, or are they performed lazily?
  • c) Where is the broadcast state stored? Is it in RocksDB or somewhere else?

3. RocksDB Resource Management

  • Since RocksDB is my state backend, is its usage limited by the memory and disk allocated to my TaskManager pods in Kubernetes?
  • Can I inspect or monitor RocksDB state usage (e.g., disk, memory, or compaction metrics), or is it essentially a black box?

I would appreciate any guidance on improving this architecture or insights into Flink’s state and memory management. Thank you!

1 Answer

There's a lot to potentially react to here; I'm going to limit myself to sharing some relevant information.

2a) Where are the CDC tables created with tableEnv.executeSql stored? Are they in memory, and if so, which memory?

Those tables are stored in Postgres. Flink tables are nothing more than metadata describing data stored externally to Flink.

2b) Do joins between the CDC tables require separate memory, or are they performed lazily?

Flink will materialize in its state backend whatever it needs to retain from the records streaming in from those tables to produce the desired results. In this case, these so-called regular joins will need to permanently store in RocksDB every record from both sides of the join -- this is the most expensive type of join. I've made a video about streaming joins to explain this in more detail.
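To make the laziness concrete -- a small illustration using the tables from your Step 2, assuming env is the StreamExecutionEnvironment your StreamTableEnvironment was created from:

// Registering the CDC tables and defining the join are both lazy: they only build
// catalog metadata and a query plan.
Table joined = tableEnv.sqlQuery(
    "SELECT * FROM meters mm JOIN meter_chars mc ON mm.id = mc.meter_id");

// Still nothing has been read from Postgres. Records start flowing -- and join state
// starts accumulating in RocksDB -- only once the pipeline is actually executed:
DataStream<Row> joinedStream = tableEnv.toChangelogStream(joined);
// ... connect the enrichment functions and sinks ...
env.execute("enrichment-job");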

2c) Where is the broadcast state stored? Is it in RocksDB or somewhere else?

Flink always stores broadcast state on the heap -- so it's in memory. And each parallel instance of the job will checkpoint its own copy of the broadcast state. Broadcast state should only be used for relatively small state that cannot be key-partitioned.
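If the reference data can be partitioned by the enrichment key (serial + vendor), an alternative is to key both streams by that key and keep the reference rows in keyed MapState, which does live in RocksDB. This is only a rough sketch: it reuses your types but assumes a getId() accessor on MeterDeployment, and it leaves out the row-kind handling and the question of blocks that arrive before their reference rows:

public class KeyedMeterEnrichmentFunction
        extends KeyedCoProcessFunction<MeterKey, RawBlock, MeterDeployment, ReadingBlockDto> {

    // Keyed state, scoped to the current MeterKey and stored in the RocksDB state backend.
    private transient MapState<Long, MeterDeployment> deployments;

    @Override
    public void open(Configuration parameters) {
        deployments = getRuntimeContext().getMapState(
            new MapStateDescriptor<>("deployments", Long.class, MeterDeployment.class));
    }

    @Override
    public void processElement1(RawBlock block, Context ctx, Collector<ReadingBlockDto> out) throws Exception {
        // Only the deployments stored for this block's MeterKey are visible here.
        for (MeterDeployment deployment : deployments.values()) {
            // Perform enrichment logic...
        }
    }

    @Override
    public void processElement2(MeterDeployment deployment, Context ctx, Collector<ReadingBlockDto> out) throws Exception {
        // Upsert on INSERT/UPDATE_AFTER; a full version would also handle DELETE and
        // UPDATE_BEFORE, as in your BroadcastProcessFunction.
        deployments.put(deployment.getId(), deployment);
    }
}

Both inputs would be keyed by the same MeterKey before connect(). The trade-off versus broadcast state is that a block arriving before its reference rows has nothing to join against, so such records need to be buffered in state or routed to a side output.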

3) RocksDB resource management

RocksDB keeps its working state on the local disks of the task managers, with an in-memory cache held in the task managers' off-heap (managed) memory -- so yes, both its memory and its disk usage are bounded by the resources you give the TaskManager pods.

There is an extensive set of metrics available.
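For example, the RocksDB native metrics can be switched on with the standard state.backend.rocksdb.metrics.* options. They are normally set in the cluster configuration (e.g. the flinkConfiguration of your FlinkDeployment); the Java form below is just for illustration, and the selection of metrics is only an example:

Configuration conf = new Configuration();
// Size of the live data and SST files on the TaskManager's local disk.
conf.setString("state.backend.rocksdb.metrics.estimate-live-data-size", "true");
conf.setString("state.backend.rocksdb.metrics.total-sst-files-size", "true");
// Memory used by the block cache and memtables.
conf.setString("state.backend.rocksdb.metrics.block-cache-usage", "true");
conf.setString("state.backend.rocksdb.metrics.cur-size-all-mem-tables", "true");
// Compaction activity.
conf.setString("state.backend.rocksdb.metrics.num-running-compactions", "true");
conf.setString("state.backend.rocksdb.metrics.compaction-pending", "true");
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);

These show up as regular Flink task manager metrics, so they can be exported to whatever metrics backend you already scrape.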

1) Is there a better way to handle this?

Without taking the time to carefully consider your requirements, I'll just share a couple of pointers, but in general I would look for a pure Table API solution, avoiding regular joins and broadcast state (if feasible). Maybe the approach to enrichment described here would work, and be less expensive. Or maybe you can use temporal joins (described in the video linked above).
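For the temporal-join route the query would look roughly like this. It assumes the Kafka input is also registered as a table (called raw_blocks here, with an event-time column block_time and a watermark) and that meters is a versioned table whose declared primary key includes serial -- none of that is in your current job, so the names are placeholders:

Table enriched = tableEnv.sqlQuery(
    "SELECT b.*, m.res_spec_id " +
    "FROM raw_blocks AS b " +
    "JOIN meters FOR SYSTEM_TIME AS OF b.block_time AS m " +
    "ON b.serial = m.serial"
);

Unlike the regular joins in Step 2, an event-time temporal join only retains the reference-row versions needed around the current watermark, so its state stays bounded, and the whole pipeline can remain in the Table API rather than dropping down to broadcast state.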
