I am trying to understand how Google Cloud Dataflow incurs costs when reading a file with beam.io.ReadFromText. From my understanding, reads from a Google Cloud Storage bucket are charged per 1,000 operations. What I am trying to understand is: if I had, hypothetically, 10 billion rows in a file (or any kind of small rows, just a lot of them), would a simple filtering job incur charges in the millions of dollars, or does Dataflow only make a single request to stage the target file into a "free" environment (or does it batch the reads in some way)?

Edit: I am referring to the Class B operations performed on the Cloud Storage bucket. Talking to a Google Cloud representative, it sounds like the file is read in chunks, but not even they could say roughly how many requests that would amount to.
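
For concreteness, the kind of pipeline I have in mind is roughly this (the bucket paths and filter condition are just placeholders):

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/huge-file.txt")    # placeholder path
        | "Filter" >> beam.Filter(lambda line: "keep-me" in line)           # placeholder condition
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output/filtered")  # placeholder output
    )
```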


1 Answer


The cost of a Dataflow job using beam.io.ReadFromText to process a 10-billion-row file isn't simply a matter of "charges per 1000 operations." The pricing depends on several interacting factors.

Dataflow doesn't stage the entire file into a "free" environment. Processing happens in parallel across multiple workers, and the data is read in chunks rather than as a single monolithic operation (and certainly not one request per row). Processing 10 billion rows will still consume significant compute resources, though: Dataflow charges for the time your worker nodes run and for the resources (vCPU, memory, and persistent disk) they consume, and that compute time, not the Cloud Storage operations, is usually the dominant cost.
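
To see why the Class B side stays small, note that the number of GET requests scales with the file size divided by the read-chunk size, not with the number of rows. The chunk size and per-operation price below are assumptions purely for illustration (the actual buffer size depends on the connector and SDK version, and pricing changes), so treat this as a back-of-envelope sketch:

```python
# Back-of-envelope estimate of GCS Class B requests for a chunked sequential read.
# All numbers are assumptions for illustration, not documented values.

file_size_bytes = 500 * 1024**3   # assume the 10-billion-row file is ~500 GiB
chunk_size_bytes = 16 * 1024**2   # assumed read-buffer size (~16 MiB); SDK-dependent
price_per_1000_class_b = 0.0004   # assumed USD per 1,000 Class B ops; check current pricing

get_requests = file_size_bytes / chunk_size_bytes            # ~32,000 range GETs
class_b_cost = get_requests / 1000 * price_per_1000_class_b  # roughly a cent, not millions

print(f"~{get_requests:,.0f} GET requests, ~${class_b_cost:.4f} in Class B charges")
```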

It's unlikely to be in the millions of dollars for a simple filter if the job is appropriately configured for parallel processing and uses a cost-effective worker type.
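
If you want to keep the compute side under control, the main knobs are the autoscaling cap and the worker machine type. A minimal sketch of the relevant pipeline options follows; the project, region, bucket, and machine type are placeholders, and the exact machine-type flag name can vary by SDK version:

```python
# Illustrative Dataflow pipeline options for cost control; values are placeholders.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                  # placeholder
    region="us-central1",                  # placeholder
    temp_location="gs://my-bucket/tmp",    # placeholder
    max_num_workers=32,                    # cap autoscaling so costs stay bounded
    machine_type="e2-standard-4",          # cost-effective worker type (flag name may vary by SDK version)
)
```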

Tags: apache-beam, How does Dataflow charge for read operations from Cloud Storage, Stack Overflow