I have a zarr dataset on disk, which I open with xarray using:

import xarray as xr
import numpy as np
import dask.distributed as dd

# setup dask
cluster = dd.LocalCluster()
client = dd.Client(cluster)

# open dataset
ds = xr.open_zarr('...')

Once open, the dataset looks like this:

Note that the window dimension represents time and the clm dimension represents spectral coefficients. Now, I want to manipulate this dataset, for example interpolate it into grid space, which I do as follows:

def spec_to_grid_numpy(f_clm):
    # first fold the clm coefficients
    f_clm_folded = ...
    # then apply Legendre transform
    leg_f = np.einsum(...)
    # finally apply HFFT
    f = np.fft.hfft(...)
    return f

ds_grid = xr.apply_ufunc(
    spec_to_grid_numpy,
    ds,
    input_core_dims=[['clm']],
    output_core_dims=[['latitude', 'longitude']],
    dask='parallelized',
    output_dtypes=['float64'],
    dask_gufunc_kwargs=dict(
        output_sizes=dict(latitude=num_lat, longitude=num_lon),
    ),
)

ds_grid.to_zarr('...')
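For reference, the same `apply_ufunc` pattern can be reproduced on a tiny synthetic dataset with a stub transform standing in for the real spectral code (all names, sizes, and the stub itself below are made up for illustration):

```python
import numpy as np
import xarray as xr

# Hypothetical grid sizes for illustration only.
num_lat, num_lon = 4, 8

def spec_to_grid_stub(f_clm):
    # Stand-in for the real transform: consumes the trailing `clm` axis
    # and produces a (latitude, longitude) grid per input element.
    out = np.broadcast_to(
        f_clm.sum(axis=-1)[..., np.newaxis, np.newaxis],
        f_clm.shape[:-1] + (num_lat, num_lon),
    )
    return np.ascontiguousarray(out)

# Small in-memory dataset, chunked along `window` like the real one.
ds = xr.Dataset(
    {"f": (("window", "clm"), np.arange(12.0).reshape(3, 4))}
).chunk({"window": 1})

ds_grid = xr.apply_ufunc(
    spec_to_grid_stub,
    ds,
    input_core_dims=[["clm"]],
    output_core_dims=[["latitude", "longitude"]],
    dask="parallelized",
    output_dtypes=["float64"],
    dask_gufunc_kwargs=dict(output_sizes=dict(latitude=num_lat, longitude=num_lon)),
)
print(ds_grid["f"].shape)  # (3, 4, 8): broadcast dims first, then the new core dims
```

Each chunk is transformed independently, so the computation itself is embarrassingly parallel, exactly as in the real pipeline.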

However, when I run this code, I get the following warning:

/libre/farchia/programs/miniforge3/envs/ptifs/lib/python3.10/site-packages/distributed/client.py:3361: UserWarning: Sending large graph of size 22.31 MiB.
This may cause some slowdown.
Consider loading the data with Dask directly or using futures or delayed objects to embed the data into the graph without repetition.
See also .html#load-data-with-dask for more information.
  warnings.warn(
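One way to see where the bytes come from, without a cluster, is to pickle the dask collection itself. This is only a rough proxy for what the client ships to the scheduler, and the dataset below is synthetic: it is built from an in-memory array, so the source data ends up embedded in the graph, which is one common cause of this warning (data opened with `open_zarr` is represented by load tasks instead, but each chunk still contributes tasks and metadata):

```python
import pickle

import numpy as np
import xarray as xr

# Synthetic stand-in: an in-memory array chunked into many small pieces.
ds = xr.Dataset(
    {"f": (("window", "clm"), np.zeros((1000, 50)))}
).chunk({"window": 1})

arr = ds["f"].data  # the underlying dask array
print(arr.numblocks)                   # (1000, 1): at least one task per chunk
print(len(pickle.dumps(arr)) / 2**20)  # rough serialized graph size in MiB
```

Fewer, larger chunks shrink the graph; so does keeping source data out of the graph (e.g. loading it lazily from disk rather than from in-memory arrays).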

This raises several questions.

  1. Should I worry about this warning? Apart from the fact that the process takes a few dozen seconds to start, everything runs as expected and the result seems correct overall.
  2. Is 22.31 MiB really large for a task graph? Considering that the transformation is applied independently to each chunk of data, everything is embarrassingly parallel, so why is the task graph "so" large? Is there a way to reduce it? For example by "loading the data with Dask directly or using futures or delayed objects" as suggested by the warning?
  3. What will happen if I increase the size of the dataset even further (for example, if I multiply the size of the window dimension by 10)? Will Dask still be able to handle the dataset, or will I have to split it manually into several subsets along the window dimension and process each subset independently?

Edit: I have plotted the task graph for a subset of the data:

Tags: python, xarray. Original title: "Sending large graph": what is the significance of this warning? (Stack Overflow)