We use Bazel to build partition and disk images. The individual partitions are built across multiple build steps and assembled into a disk image in the end.
The images are stored as sparse files in the filesystem and their disk size is 1-10% of their logical size.
For example: 100GB size but only 2GB on disk.
When Bazel computes the digest of the images, it cannot take advantage of the sparse nature of the files and reads through the giant holes, which is very slow. Creating the images is pretty fast, so artifact caching is not even needed.
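For reference, this is what "sparse" means here (a minimal standalone sketch, not part of our build; the path disk.img is made up): the logical size reported by stat is much larger than the blocks actually allocated on disk.

```
import os

def sparseness(path):
    """Return (logical_size, bytes_allocated_on_disk) for a file."""
    st = os.stat(path)
    # st_blocks counts 512-byte units regardless of the filesystem block size
    return st.st_size, st.st_blocks * 512

logical, on_disk = sparseness("disk.img")  # hypothetical image path
print(f"logical: {logical / 2**30:.1f} GiB, on disk: {on_disk / 2**30:.1f} GiB")
```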
What I tried:
- Implementing a custom digest function and using --unix_digest_hash_attribute_name (see the xattr sketch after this list).
- I was told that the calculated digest value must match what --digest_function returns, so we can't implement a custom digest function.
- While DigestFunction in remote_execution.proto does have a few digest functions that could easily be implemented to work optimally with sparse files, these are not available in the hashFunctionRegistry; only BLAKE3, SHA1 and SHA256 are available.
- Creating the output in a directory and marking it as a TreeArtifact. Bazel still traverses the entire output directory and calculates the digest for each file.
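To expand on the first item above: since the value exposed through the xattr has to equal what --digest_function would compute anyway, one option that stays inside that constraint is to compute the regular SHA-256 ourselves, but sparse-aware, and store it in the attribute that --unix_digest_hash_attribute_name reads. The sketch below is only an illustration of that idea: it assumes Linux with SEEK_DATA/SEEK_HOLE support, the attribute name user.checksum.sha256 is hypothetical (it has to match whatever is passed to the flag), and whether Bazel expects a hex string or raw bytes there is something to verify against the docs.

```
import hashlib
import os

ZERO = bytes(1 << 20)  # reusable 1 MiB zero buffer for hashing holes


def sha256_sparse(path):
    """SHA-256 of the full logical content; holes are hashed without reading them."""
    h = hashlib.sha256()
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                data_start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError:
                data_start = size  # no data after offset: trailing hole
            hole = data_start - offset
            while hole > 0:  # feed zeros for the hole without touching the file
                n = min(hole, len(ZERO))
                h.update(ZERO[:n])
                hole -= n
            if data_start >= size:
                break
            data_end = os.lseek(fd, data_start, os.SEEK_HOLE)
            os.lseek(fd, data_start, os.SEEK_SET)
            remaining = data_end - data_start
            while remaining > 0:  # hash the real data extents
                chunk = os.read(fd, min(remaining, 1 << 20))
                if not chunk:
                    break
                h.update(chunk)
                remaining -= len(chunk)
            offset = data_end
        return h.hexdigest()
    finally:
        os.close(fd)


# Hypothetical attribute name; it must line up with what is passed to
# --unix_digest_hash_attribute_name while running with --digest_function=sha256.
os.setxattr("disk.img", "user.checksum.sha256",
            sha256_sparse("disk.img").encode("ascii"))
```

The result is byte-for-byte the same digest Bazel would compute by reading the whole file; the only saving is skipping the reads over the holes, so this does not conflict with the requirement that the value matches --digest_function.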
What I'd like to avoid:
- Creating and maintaining a custom BLAKE3 implementation.
- Merging our build steps. We'd still like to take advantage of Bazel's action parallelism and build images in parallel.
Current workaround:
We create a tar file after each build step and untar it when we want to work with the contents in the next build step. This makes the digest calculation faster, but it adds code complexity and unnecessary work to our builds. In addition, artifacts are somewhat more cumbersome to inspect because they need to be untarred first.
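For what it's worth, if we keep this workaround, GNU tar's sparse handling keeps the intermediate archives roughly as small as the allocated data. A rough sketch of that (assuming GNU tar on PATH; partition.img, partition.tar and out/ are placeholders), driven from Python purely for illustration:

```
import subprocess

# Pack a sparse image; --sparse records hole maps instead of runs of zeros,
# so the archive size tracks the allocated data, not the logical size.
subprocess.run(["tar", "--sparse", "-cf", "partition.tar", "partition.img"], check=True)

# On extraction GNU tar recreates the holes, so the unpacked image is sparse
# again and the next build step can work with it as before.
subprocess.run(["tar", "-xf", "partition.tar", "-C", "out/"], check=True)
```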
- Have you checked Bazel's bug tracker and perhaps filed a feature request? – Ulrich Eckhardt Commented Feb 12 at 10:35
- I did: github.com/bazelbuild/bazel/issues/25265 – David Frank Commented Feb 12 at 11:51
1 Answer
The current solutions for "big blob" digests in Bazel are:
- Use a modern CPU with SHA extensions, which accelerate SHA256 in hardware.
- Use the blake3 digest function with startup --digest_function=blake3 (your remote cache / remote execution server needs to support blake3 as well).
- Use a FUSE filesystem that provides the digest via a special xattr (rare to find one in the wild).
I'm not entirely clear on what a "sparse file" is in this context. If it's https://wiki.archlinux.org/title/Sparse_file, then it won't matter much, because Bazel would still need to read and hash the whole file. If you resize the "hole" in your file from 98GB to 1GB, you would expect the digest to change as a result.
I think tar-ing the sparse file effectively replaces the 98GB hole with a "manifest" that says: chunks n through m of the file are all zeroes (empty blocks). sha256/blake3 then hashes that manifest instead of the blocks of zeroes, which is faster. So your current workaround sounds like a good solution to me.
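A quick way to see this effect (a standalone sketch, not from the answer; it assumes GNU tar on Linux, and all file names are made up): build a mostly-hole image, archive it with --sparse, and compare sizes. The archive stays close to the allocated data, so hashing it reads far fewer bytes than hashing the raw image.

```
import os
import subprocess

# Create a 10 GiB sparse file with only 1 MiB of real data at the start.
with open("demo.img", "wb") as f:
    f.write(os.urandom(1 << 20))
    f.truncate(10 << 30)  # extend with a hole; no blocks are allocated for it

subprocess.run(["tar", "--sparse", "-cf", "demo.tar", "demo.img"], check=True)

st = os.stat("demo.img")
print("image logical size:", st.st_size)          # ~10 GiB
print("image on disk     :", st.st_blocks * 512)  # ~1 MiB
print("sparse tar size   :", os.path.getsize("demo.tar"))  # close to the data size
```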