I have a zarr dataset on disk, which I open with xarray using:
import xarray as xr
import numpy as np
import dask.distributed as dd
# setup dask
cluster = dd.LocalCluster()
client = dd.Client(cluster)
# open dataset
ds = xr.open_zarr('...')
Once open, the dataset looks like this (the repr is not reproduced here). Note that the window dimension represents time and the clm dimension represents spectral coefficients.
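For reference, a minimal synthetic dataset with the same layout could be built as follows; the variable name, sizes, and chunking here are made up for illustration and are not from the actual data:

import numpy as np
import xarray as xr

num_window, num_clm = 1000, 2080  # hypothetical sizes
ds = xr.Dataset(
    {"f": (("window", "clm"), np.zeros((num_window, num_clm)))},
).chunk({"window": 10})  # chunked along the time-like window dimension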
Now I want to manipulate this dataset, for example by interpolating it into grid space, which I do as follows:
def spec_to_grid_numpy(f_clm):
    # first fold the clm coefficients
    f_clm_folded = ...
    # then apply the Legendre transform
    leg_f = np.einsum(...)
    # finally apply the HFFT
    f = np.fft.hfft(...)
    return f
ds_grid = xr.apply_ufunc(
    spec_to_grid_numpy,
    ds,
    input_core_dims=[['clm']],
    output_core_dims=[['latitude', 'longitude']],
    dask='parallelized',
    output_dtypes=['float64'],
    dask_gufunc_kwargs=dict(
        output_sizes=dict(latitude=num_lat, longitude=num_lon),
    ),
)
ds_grid.to_zarr('...')
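As a variant of the last line (assuming the same elided store path), to_zarr accepts compute=False and returns a dask Delayed, which makes it possible to inspect the graph before triggering the actual write:

delayed_store = ds_grid.to_zarr('...', compute=False)  # nothing is written yet
delayed_store.compute()  # trigger the actual write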
However, when I run this code, I get the following warning:
/libre/farchia/programs/miniforge3/envs/ptifs/lib/python3.10/site-packages/distributed/client.py:3361: UserWarning: Sending large graph of size 22.31 MiB.
This may cause some slowdown.
Consider loading the data with Dask directly or using futures or delayed objects to embed the data into the graph without repetition.
See also .html#load-data-with-dask for more information.
warnings.warn(
This raises several questions.
- Should I worry about this warning? Except for the fact that the process takes a few dozen seconds to start, everything runs as expected and the result seems correct overall.
- Is 22.31 MiB really large for a task graph? Since the transformation is applied independently to each chunk of data, everything is embarrassingly parallel, so why is the task graph "so" large? Is there a way to reduce it, for example by "loading the data with Dask directly or using futures or delayed objects" as the warning suggests? (See the sketch after this list for one way to inspect the graph.)
- What will happen if I increase the size of the dataset even further (for example, if I multiply the size of the window dimension by 10)? Will dask still be able to handle the dataset, or will I have to manually split it into several subsets along the window dimension and process each subset independently?
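Regarding the second question, a rough way to check how many tasks the graph contains and how large it serializes is sketched below; "f" is a hypothetical variable name, and the pickled size is only a crude proxy for what the scheduler actually receives:

import pickle

graph = ds_grid["f"].data.__dask_graph__()  # low-level dask graph of one variable
print(len(graph), "tasks in the graph")
# distributed uses its own serialization, and some graphs may need cloudpickle
# instead of pickle, so treat this number as an approximation only
print(len(pickle.dumps(graph)) / 2**20, "MiB when pickled")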
Edit: I have plotted the task graph for a subset of the data (the plot itself is not reproduced here).
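One possible way to produce such a plot (the variable name and slice bounds below are made up) is dask's built-in visualizer, which requires graphviz:

subset = ds_grid["f"].isel(window=slice(0, 4))  # small subset along window
subset.data.visualize(filename="task_graph.png")  # render the task graph to a file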