
Using Python streamz and dask, I want to distribute the data from text files, as they are generated, to worker threads, which will then process every new line appended to those files.

from streamz import Stream

source = Stream.filenames(r'C:\Documents\DataFiles\*.txt')
# recv = source.scatter().map(print)
recv = source.map(Stream.from_textfile)  # emits one per-file stream, but nothing consumes them
recv.start()

In the above code, how do I get at the data coming from the streamz text-file streams? I can access each text file individually and build DataFrames from it like this, but I cannot join these per-file streams into one, nor use dask.distributed to scatter the work.

import pandas as pd
from streamz import Stream

recv = Stream.from_textfile(r'C:\Users\DataFiles\stream_output.txt')
dataframe = recv.partition(10).map(pd.DataFrame)  # one DataFrame per 10 lines
recv.start()

Tags: dask | Use Python streamz to parallel process many realtime updating files | Stack Overflow