admin管理员组文章数量:1410689
My goal is to create images using gray codes, an example would be this:
It is all modulo 64 groups in gray codes in polar form.
Now of course I know of the simple mapping n ^ (n >> 1)
from binary, but I had found more efficient ways to generate gray codes directly than using said mapping. But since binary codes are related I will post code that generates binary codes as well.
I want a function that generates all n-bit gray codes in the form np.zeros((1 << n, n), dtype=bool)
. I want it as efficient as possible, and it has to be implemented in numpy
and only implemented in numpy
, no other libraries are allowed.
Why do I disallow other libraries? Because I have installed scipy
, PIL
, cv2
, matplotlib
, numba
... all of them require different versions of numpy
and updating one breaks the dependency of another, and all of them provide a number of methods, it is a huge learning curve to know how to use them well. I am currently trying to familiarize myself with numpy
, so I invented this challenge to make myself learn.
I have implemented a bunch of different methods, they all work correctly, I have rigorously tested them, but none of them strikes me as efficient. So far, I have found that np.unpackbits
is the most efficient method to get binary bits of a number, but it only works with np.uint8
, that is easy to solve, just using .view(np.uint8)
, but the output is in mixed endianness and that is somewhat trickier to solve.
But even if I use np.unpackbits
, converting it from binary to gray code is less efficient than generating gray codes directly.
And according to my tests, np.concatenate(arrs)
is more efficient than np.vstack(arrs)
,
np.concatenate(arrs, axis=-1)
beats np.hstack(arrs)
, and np.concatenate(arrs).reshape((w, h)).T
beats np.dstack(arrs)
. And somehow initializing an array and then broadcasting to individual columns using a loop can be more efficient than using np.concatenate
.
And using numpy
broadcasting to get a & b
column-wise in which a
is a 1d array and b
is a 1d array to get binary decomposition of a
can be much less efficient than just looping through the columns and apply &
column by column. In particular, (a & b[:, None]).T.astype(bool)
is much more efficient than (a[:, None] & b).astype(bool)
.
Code
import numpy as np
lo = 1
hi = 8
UINT_BITS = {}
for dtype in (np.uint8, np.uint16, np.uint32, np.uint64):
for i in range(lo, hi + 1):
UINT_BITS[i] = dtype
lo = hi + 1
hi <<= 1
def get_dtype(n: int) -> np.uint8 | np.uint16 | np.uint32 | np.uint64:
if dtype := UINT_BITS.get(n):
return dtype
raise ValueError(f"Argument {n} is not a valid bit width")
def validate(n: int) -> None:
if not (n and isinstance(n, int)):
raise ValueError(f"Argument {n} is not a valid bit width")
def binary_codes_0(n: int) -> np.ndarray:
validate(n)
count = 1 << n
rect = np.zeros((count, n), dtype=bool)
r = 1
for i in range(n - 1, -1, -1):
count >>= 1
rect[:, i] = np.tile(
np.concatenate([np.zeros(r, dtype=bool), np.ones(r, dtype=bool)]), count
)
r <<= 1
return rect
def binary_codes_1(n: int) -> np.ndarray:
validate(n)
r = total = 1 << n
return (
np.concatenate(
[
np.tile(
np.concatenate(
[np.zeros((r := r >> 1), dtype=bool), np.ones(r, dtype=bool)]
),
1 << i,
)
for i in range(n)
]
)
.reshape((n, total))
.T
)
def binary_codes_2(n: int) -> np.ndarray:
validate(n)
chunks = np.array([(0,), (1,)], dtype=bool)
l = 2
for _ in range(n - 1):
chunks = np.concatenate(
[
np.concatenate([np.zeros((l, 1), dtype=bool), chunks], axis=-1),
np.concatenate([np.ones((l, 1), dtype=bool), chunks], axis=-1),
]
)
l <<= 1
return chunks
def binary_codes_3(n: int) -> np.ndarray:
validate(n)
rect = np.zeros([2] * n + [n], dtype=bool)
for i, a in enumerate(np.ix_(*[(0, 1)] * n)):
rect[..., i] = a
return rect.reshape(-1, n)
def binary_codes_4(n: int) -> np.ndarray:
numbers = np.arange(1 << n, dtype=get_dtype(n))
return (
np.concatenate([(numbers & 1 << i).astype(bool) for i in range(n - 1, -1, -1)])
.reshape(n, 1 << n)
.T
)
def binary_codes_5(n: int) -> np.ndarray:
numbers = np.arange((count := 1 << n), dtype=get_dtype(n))
result = np.zeros((count, n), dtype=bool)
mask = count
for i in range(n):
result[:, i] = numbers & (mask := mask >> 1)
return result
def binary_codes_6(n: int) -> np.ndarray:
return np.unpackbits(
np.arange(1 << n, dtype=get_dtype(n))[:, None].view(np.uint8),
axis=1,
bitorder="little",
count=n,
)[:, ::-1]
def binary_codes_7(n: int) -> np.ndarray:
validate(n)
return np.array(np.meshgrid(*[(0, 1)] * n, indexing="ij")).reshape((n, 1 << n)).T
def gray_codes_0(n: int) -> np.ndarray:
numbers = np.arange((count := 1 << n), dtype=get_dtype(n))
gray = numbers ^ (numbers >> 1)
return (
np.concatenate([(gray & 1 << i).astype(bool) for i in range(n - 1, -1, -1)])
.reshape((n, count))
.T
)
def gray_codes_1(n: int) -> np.ndarray:
numbers = np.arange((count := 1 << n), dtype=get_dtype(n))
gray = numbers ^ (numbers >> 1)
result = np.zeros((count, n), dtype=bool)
for i in range(n):
result[:, i] = gray & (count := count >> 1)
return result
def gray_codes_2(n: int) -> np.ndarray:
validate(n)
binary = binary_codes_6(n)
shifted = np.roll(binary, 1, axis=-1)
shifted[:, 0] = 0
return binary ^ shifted
def gray_codes_3(n: int) -> np.ndarray:
validate(n)
gray = np.array([(0,), (1,)], dtype=bool)
l = 2
for _ in range(n - 1):
gray = np.concatenate(
[
np.concatenate([np.zeros((l, 1), dtype=bool), gray], axis=-1),
np.concatenate([np.ones((l, 1), dtype=bool), gray[::-1]], axis=-1),
]
)
l <<= 1
return gray
Testing
import numpy as np
zeros = np.zeros(524288, dtype=bool)
ones = np.ones(524288, dtype=bool)
zeros1 = np.zeros((524288, 32), dtype=bool)
ones1 = np.ones((524288, 32), dtype=bool)
million = [list(range(i*4096, i*4096+4096)) for i in range(256)]
numbers = np.arange(1 << 16, dtype=np.uint64)
mask = np.array([1 << i for i in range(15, -1, -1)], dtype=np.uint64)
In [3]: %timeit (numbers & mask[:, None]).T.astype(bool)
4.1 ms ± 97.6 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [4]: %timeit (numbers[:, None] & mask).astype(bool)
6.1 ms ± 423 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [5]: %timeit binary_codes_5(16)
2.02 ms ± 19.5 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [6]: %timeit binary_codes_4(16)
2.32 ms ± 27.2 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [7]: %timeit np.hstack([zeros, ones])
312 μs ± 12.2 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [8]: %timeit np.concatenate([zeros, ones])
307 μs ± 9.97 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [9]: %timeit np.vstack([zeros, ones])
315 μs ± 11.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [10]: %timeit np.hstack([zeros1, ones1])
19.8 ms ± 800 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [11]: %timeit np.concatenate([zeros1, ones1], axis=-1)
18.1 ms ± 265 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [12]: %timeit np.concatenate([zeros1, ones1])
9.73 ms ± 413 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [13]: %timeit np.vstack([zeros1, ones1])
10.3 ms ± 229 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [14]: %timeit np.dstack(million)[0]
78.7 ms ± 973 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [15]: %timeit np.concatenate(million).reshape((256, 4096)).T
69.9 ms ± 251 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [16]: %timeit binary_codes_0(16)
2.32 ms ± 18 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [17]: %timeit binary_codes_1(16)
6.37 ms ± 182 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [18]: %timeit binary_codes_2(16)
1.46 ms ± 28 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [19]: %timeit binary_codes_3(16)
1.64 ms ± 29.5 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [20]: %timeit binary_codes_6(16)
1.12 ms ± 9.71 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [21]: %timeit gray_codes_0(16)
2.12 ms ± 25.1 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [22]: %timeit gray_codes_1(16)
2.17 ms ± 29 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [23]: %timeit gray_codes_2(16)
4.51 ms ± 151 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [24]: %timeit gray_codes_3(16)
1.46 ms ± 19.7 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Is there a more efficient way to generate all n-bit gray codes?
I have figured out how to use np.meshgrid
to do Cartesian product, and it is much slower than expected. I have edited the code above to include it.
In [82]: %timeit binary_codes_7(16)
6.96 ms ± 249 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [83]: %timeit binary_codes_5(16)
1.74 ms ± 36.4 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [84]: %timeit binary_codes_3(16)
1.65 ms ± 15.8 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [85]: %timeit np.meshgrid(*[(0, 1)] * 16, indexing="ij")
4.33 ms ± 49.5 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [86]: np.all(np.array(np.meshgrid(*[(0, 1)] * 5, indexing="ij")).reshape((5, 32)).T == binary_codes_3(5))
Out[86]: np.True_
Now I have implemented everything I can think of.
At this point I have realized this problem is extremely simple, we don't even need any bit operations at all.
In both binary and gray codes, each cell can only have two values, zero and one. It can only be one or zero.
Now if we have np.zeros((1 << n, n), dtype=bool)
our job is half way done. Exactly half of the cells have the correct value: zero. We just have to flip the ones.
If we look at the sequences row-wise, there isn't much we can do; but if we look at the columns, it just repeats. There are groups of ones with equal length separated by groups of zeros with the same length.
We can just create a 1d array as a binary mask to flip everything on for each column except those gaps. Job done. The question is, how?
The rightmost column in binary is straightforward, just do arr[:, -1][1::2] = 1
. But what about the second last column? It needs to be (0, 0, 1, 1)
repeat, in other words every other pair of cells are ones, I know the indices of the start and end points, it needs to be on in [range(2, 4), range(6, 8), range(10, 12)...]
but what is the simplest way to tell the computer to flip those cells? And the third last column, the bands of ones are [range(4, 8), range(12, 16), range(20, 24)...]
, how do I flip those cells?
Surprisingly I haven't found a good answer, or perhaps unsurprising, since Google is useless, but I did find this: Indexing in NumPy: Access every other group of values.
And no, this is not a duplicate, because doing reshape
then ravel
for each column would be terribly inefficient, and that doesn't create a boolean mask for indexing the array, it creates a smaller array...
Currently I can do this:
arr = np.zeros((16, 4), dtype=bool)
l = 1
for i in (3, 2, 1, 0):
l2 = l * 2
for a, b in zip(range(l, 16, l2), range(l2,17,l2)):
arr[:, i][a:b] = 1
l = l2
But this supposedly is slow (I haven't benchmarked this), however if this is implemented in numpy
then I think this would be the most efficient algorithm for this type of sequences. The question is, how to implement this?
My goal is to create images using gray codes, an example would be this:
It is all modulo 64 groups in gray codes in polar form.
Now of course I know of the simple mapping n ^ (n >> 1)
from binary, but I had found more efficient ways to generate gray codes directly than using said mapping. But since binary codes are related I will post code that generates binary codes as well.
I want a function that generates all n-bit gray codes in the form np.zeros((1 << n, n), dtype=bool)
. I want it as efficient as possible, and it has to be implemented in numpy
and only implemented in numpy
, no other libraries are allowed.
Why do I disallow other libraries? Because I have installed scipy
, PIL
, cv2
, matplotlib
, numba
... all of them require different versions of numpy
and updating one breaks the dependency of another, and all of them provide a number of methods, it is a huge learning curve to know how to use them well. I am currently trying to familiarize myself with numpy
, so I invented this challenge to make myself learn.
I have implemented a bunch of different methods, they all work correctly, I have rigorously tested them, but none of them strikes me as efficient. So far, I have found that np.unpackbits
is the most efficient method to get binary bits of a number, but it only works with np.uint8
, that is easy to solve, just using .view(np.uint8)
, but the output is in mixed endianness and that is somewhat trickier to solve.
But even if I use np.unpackbits
, converting it from binary to gray code is less efficient than generating gray codes directly.
And according to my tests, np.concatenate(arrs)
is more efficient than np.vstack(arrs)
,
np.concatenate(arrs, axis=-1)
beats np.hstack(arrs)
, and np.concatenate(arrs).reshape((w, h)).T
beats np.dstack(arrs)
. And somehow initializing an array and then broadcasting to individual columns using a loop can be more efficient than using np.concatenate
.
And using numpy
broadcasting to get a & b
column-wise in which a
is a 1d array and b
is a 1d array to get binary decomposition of a
can be much less efficient than just looping through the columns and apply &
column by column. In particular, (a & b[:, None]).T.astype(bool)
is much more efficient than (a[:, None] & b).astype(bool)
.
Code
import numpy as np
lo = 1
hi = 8
UINT_BITS = {}
for dtype in (np.uint8, np.uint16, np.uint32, np.uint64):
for i in range(lo, hi + 1):
UINT_BITS[i] = dtype
lo = hi + 1
hi <<= 1
def get_dtype(n: int) -> np.uint8 | np.uint16 | np.uint32 | np.uint64:
if dtype := UINT_BITS.get(n):
return dtype
raise ValueError(f"Argument {n} is not a valid bit width")
def validate(n: int) -> None:
if not (n and isinstance(n, int)):
raise ValueError(f"Argument {n} is not a valid bit width")
def binary_codes_0(n: int) -> np.ndarray:
validate(n)
count = 1 << n
rect = np.zeros((count, n), dtype=bool)
r = 1
for i in range(n - 1, -1, -1):
count >>= 1
rect[:, i] = np.tile(
np.concatenate([np.zeros(r, dtype=bool), np.ones(r, dtype=bool)]), count
)
r <<= 1
return rect
def binary_codes_1(n: int) -> np.ndarray:
validate(n)
r = total = 1 << n
return (
np.concatenate(
[
np.tile(
np.concatenate(
[np.zeros((r := r >> 1), dtype=bool), np.ones(r, dtype=bool)]
),
1 << i,
)
for i in range(n)
]
)
.reshape((n, total))
.T
)
def binary_codes_2(n: int) -> np.ndarray:
validate(n)
chunks = np.array([(0,), (1,)], dtype=bool)
l = 2
for _ in range(n - 1):
chunks = np.concatenate(
[
np.concatenate([np.zeros((l, 1), dtype=bool), chunks], axis=-1),
np.concatenate([np.ones((l, 1), dtype=bool), chunks], axis=-1),
]
)
l <<= 1
return chunks
def binary_codes_3(n: int) -> np.ndarray:
validate(n)
rect = np.zeros([2] * n + [n], dtype=bool)
for i, a in enumerate(np.ix_(*[(0, 1)] * n)):
rect[..., i] = a
return rect.reshape(-1, n)
def binary_codes_4(n: int) -> np.ndarray:
numbers = np.arange(1 << n, dtype=get_dtype(n))
return (
np.concatenate([(numbers & 1 << i).astype(bool) for i in range(n - 1, -1, -1)])
.reshape(n, 1 << n)
.T
)
def binary_codes_5(n: int) -> np.ndarray:
numbers = np.arange((count := 1 << n), dtype=get_dtype(n))
result = np.zeros((count, n), dtype=bool)
mask = count
for i in range(n):
result[:, i] = numbers & (mask := mask >> 1)
return result
def binary_codes_6(n: int) -> np.ndarray:
return np.unpackbits(
np.arange(1 << n, dtype=get_dtype(n))[:, None].view(np.uint8),
axis=1,
bitorder="little",
count=n,
)[:, ::-1]
def binary_codes_7(n: int) -> np.ndarray:
validate(n)
return np.array(np.meshgrid(*[(0, 1)] * n, indexing="ij")).reshape((n, 1 << n)).T
def gray_codes_0(n: int) -> np.ndarray:
numbers = np.arange((count := 1 << n), dtype=get_dtype(n))
gray = numbers ^ (numbers >> 1)
return (
np.concatenate([(gray & 1 << i).astype(bool) for i in range(n - 1, -1, -1)])
.reshape((n, count))
.T
)
def gray_codes_1(n: int) -> np.ndarray:
numbers = np.arange((count := 1 << n), dtype=get_dtype(n))
gray = numbers ^ (numbers >> 1)
result = np.zeros((count, n), dtype=bool)
for i in range(n):
result[:, i] = gray & (count := count >> 1)
return result
def gray_codes_2(n: int) -> np.ndarray:
validate(n)
binary = binary_codes_6(n)
shifted = np.roll(binary, 1, axis=-1)
shifted[:, 0] = 0
return binary ^ shifted
def gray_codes_3(n: int) -> np.ndarray:
validate(n)
gray = np.array([(0,), (1,)], dtype=bool)
l = 2
for _ in range(n - 1):
gray = np.concatenate(
[
np.concatenate([np.zeros((l, 1), dtype=bool), gray], axis=-1),
np.concatenate([np.ones((l, 1), dtype=bool), gray[::-1]], axis=-1),
]
)
l <<= 1
return gray
Testing
import numpy as np
zeros = np.zeros(524288, dtype=bool)
ones = np.ones(524288, dtype=bool)
zeros1 = np.zeros((524288, 32), dtype=bool)
ones1 = np.ones((524288, 32), dtype=bool)
million = [list(range(i*4096, i*4096+4096)) for i in range(256)]
numbers = np.arange(1 << 16, dtype=np.uint64)
mask = np.array([1 << i for i in range(15, -1, -1)], dtype=np.uint64)
In [3]: %timeit (numbers & mask[:, None]).T.astype(bool)
4.1 ms ± 97.6 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [4]: %timeit (numbers[:, None] & mask).astype(bool)
6.1 ms ± 423 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [5]: %timeit binary_codes_5(16)
2.02 ms ± 19.5 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [6]: %timeit binary_codes_4(16)
2.32 ms ± 27.2 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [7]: %timeit np.hstack([zeros, ones])
312 μs ± 12.2 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [8]: %timeit np.concatenate([zeros, ones])
307 μs ± 9.97 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [9]: %timeit np.vstack([zeros, ones])
315 μs ± 11.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [10]: %timeit np.hstack([zeros1, ones1])
19.8 ms ± 800 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [11]: %timeit np.concatenate([zeros1, ones1], axis=-1)
18.1 ms ± 265 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [12]: %timeit np.concatenate([zeros1, ones1])
9.73 ms ± 413 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [13]: %timeit np.vstack([zeros1, ones1])
10.3 ms ± 229 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [14]: %timeit np.dstack(million)[0]
78.7 ms ± 973 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [15]: %timeit np.concatenate(million).reshape((256, 4096)).T
69.9 ms ± 251 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [16]: %timeit binary_codes_0(16)
2.32 ms ± 18 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [17]: %timeit binary_codes_1(16)
6.37 ms ± 182 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [18]: %timeit binary_codes_2(16)
1.46 ms ± 28 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [19]: %timeit binary_codes_3(16)
1.64 ms ± 29.5 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [20]: %timeit binary_codes_6(16)
1.12 ms ± 9.71 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [21]: %timeit gray_codes_0(16)
2.12 ms ± 25.1 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [22]: %timeit gray_codes_1(16)
2.17 ms ± 29 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [23]: %timeit gray_codes_2(16)
4.51 ms ± 151 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [24]: %timeit gray_codes_3(16)
1.46 ms ± 19.7 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Is there a more efficient way to generate all n-bit gray codes?
I have figured out how to use np.meshgrid
to do Cartesian product, and it is much slower than expected. I have edited the code above to include it.
In [82]: %timeit binary_codes_7(16)
6.96 ms ± 249 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [83]: %timeit binary_codes_5(16)
1.74 ms ± 36.4 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [84]: %timeit binary_codes_3(16)
1.65 ms ± 15.8 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [85]: %timeit np.meshgrid(*[(0, 1)] * 16, indexing="ij")
4.33 ms ± 49.5 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [86]: np.all(np.array(np.meshgrid(*[(0, 1)] * 5, indexing="ij")).reshape((5, 32)).T == binary_codes_3(5))
Out[86]: np.True_
Now I have implemented everything I can think of.
At this point I have realized this problem is extremely simple, we don't even need any bit operations at all.
In both binary and gray codes, each cell can only have two values, zero and one. It can only be one or zero.
Now if we have np.zeros((1 << n, n), dtype=bool)
our job is half way done. Exactly half of the cells have the correct value: zero. We just have to flip the ones.
If we look at the sequences row-wise, there isn't much we can do; but if we look at the columns, it just repeats. There are groups of ones with equal length separated by groups of zeros with the same length.
We can just create a 1d array as a binary mask to flip everything on for each column except those gaps. Job done. The question is, how?
The rightmost column in binary is straightforward, just do arr[:, -1][1::2] = 1
. But what about the second last column? It needs to be (0, 0, 1, 1)
repeat, in other words every other pair of cells are ones, I know the indices of the start and end points, it needs to be on in [range(2, 4), range(6, 8), range(10, 12)...]
but what is the simplest way to tell the computer to flip those cells? And the third last column, the bands of ones are [range(4, 8), range(12, 16), range(20, 24)...]
, how do I flip those cells?
Surprisingly I haven't found a good answer, or perhaps unsurprising, since Google is useless, but I did find this: Indexing in NumPy: Access every other group of values.
And no, this is not a duplicate, because doing reshape
then ravel
for each column would be terribly inefficient, and that doesn't create a boolean mask for indexing the array, it creates a smaller array...
Currently I can do this:
arr = np.zeros((16, 4), dtype=bool)
l = 1
for i in (3, 2, 1, 0):
l2 = l * 2
for a, b in zip(range(l, 16, l2), range(l2,17,l2)):
arr[:, i][a:b] = 1
l = l2
But this supposedly is slow (I haven't benchmarked this), however if this is implemented in numpy
then I think this would be the most efficient algorithm for this type of sequences. The question is, how to implement this?
4 Answers
Reset to default 1This seems 3-4 times faster than your fastest gray_codes
. Fills the array by reverse-copying blocks and setting streaks of 1s.
def g(n):
a = np.zeros((n, 2**n), dtype=bool)
m, M = 1, 2
while n:
a[n:, m:M] = a[n:, m-1::-1]
n -= 1
a[n, m:M] = 1
m, M = M, 2*M
return a.T
Attempt This Online!
This answer analyze the performance of each implementation and provide faster solutions.
Profiling
First of all, I cannot reproduce the trends in the performance results. The performance of a Numpy code is dependent of the version of Numpy, the operating system (actually more specifically the compiler used to build Numpy), the target machine (mainly the CPU and the DRAM) and also the version of CPython in some critical cases. It is important to provide such information if you can (especially the version of Numpy). Here are results on my machine (i5-9600KF CPU with a ~40 GB/s RAM) using Numpy 2.1.3 (from PIP) and CPython 3.11 on Linux:
# mean ± std. dev. of 7 runs, 1,000 loops each
%timeit binary_codes_0(16) # 559 μs ± 228 ns per loop
%timeit binary_codes_1(16) # 159 μs ± 148 ns per loop
%timeit binary_codes_2(16) # 685 μs ± 137 ns per loop
%timeit binary_codes_3(16) # 876 μs ± 592 ns per loop
%timeit binary_codes_6(16) # 312 μs ± 161 ns per loop
%timeit gray_codes_0(16) # 211 μs ± 80 ns per loop
%timeit gray_codes_1(16) # 573 μs ± 523 ns per loop
%timeit gray_codes_2(16) # 1613 ms ± 10 μs per loop
%timeit gray_codes_3(16) # 691 μs ± 897 ns per loop
Analysis & low-level profiling
We can see that gray_codes_0
is very fast compared to other functions.
gray_codes_1
is slower but it can be optimized easily. Indeed, data is not stored contiguously which is not efficient. You can swap the axis so to make significantly faster contiguous stores. The array can then be transposed (lazily). We can optimize this even more. I provide an optimized code below.
gray_codes_2
is slow because it spends most of its time operating on strided arrays due to the -1
axis provided to np.roll
. Strided memory accesses are inherently slow, especially on large arrays. Moreover, it calls binary_codes_6
using unpackbits
which is not yet optimized to benefit from SIMD units of CPUs (which is critical for performance). Since unpackbits
is not particularly fast (and even slower than gray_codes_0
), there is certainly no way to make this function much faster than the fastest one (beside waiting for this Numpy function to be optimized).
gray_codes_3
is rather slow because it spend most of its time into memory operations which is not surprising regarding its implementation.
Faster implementation
Starting from gray_codes_1
code, we can apply the above optimizations. We can also call np.empty
instead of np.zeros
(a bit faster on some machines). Here is the resulting implementation:
def JR_gray_codes_0(n: int) -> np.ndarray:
numbers = np.arange((count := 1 << n), dtype=get_dtype(n))
gray = numbers ^ (numbers >> 1)
result = np.empty((n, count), dtype=bool)
for i in range(n):
result[i] = gray & (count := count >> 1)
return result.T
We can then remove the loop and use an idiomatic Numpy broadcasting-based solution to make this a bit faster:
def JR_gray_codes_1(n: int) -> np.ndarray:
dt = get_dtype(n)
numbers = np.arange((count := 1 << n), dtype=dt)
gray = numbers ^ (numbers >> 1)
counts = np.array([count := count >> 1 for i in range(n)], dtype=dt)
return ((gray[None,:] & counts[:,None]) != 0).T
We can certainly optimize this further by operating on 8-bit data rather than 16-bit ones. This should be about twice faster. However, this require to operate on chunks since numbers
and gray
cannot fit in 8-bit integers for large n
values.
Preallocating the output array can significantly improve performance of the computations (especially on Windows -- see the notes below) due to issues with page faults. This assume you know the size of the output ahead of time and this is often useful when the code is executed in a loop (so certainly not well suited here). Note that the pre-allocated array can be bigger than needed so it can be reused for multiple value of n
. Here is an example just to see how much time is spent in page-faults:
n = 16
# Pre-allocation of the array manually filled with zeros
# (do not use np.zeros instead since it will not actually fill the memory).
tmp1 = np.full((n, 1<<n), 0, dtype=get_dtype(n))
tmp2 = np.full((n, 1<<n), False, dtype=bool)
def JR_gray_codes_2() -> np.ndarray:
global tmp1, tmp2
dt = get_dtype(n)
numbers = np.arange((count := 1 << n), dtype=dt)
gray = numbers ^ (numbers >> 1)
counts = np.array([count := count >> 1 for i in range(n)], dtype=dt)
np.bitwise_and(gray[None,:], counts[:,None], tmp1)
return np.not_equal(tmp1, 0, tmp2).T
Final benchmark
Here is the performance of the provided solution (still on the same platform/environment):
# mean ± std. dev. of 7 runs, 1,000 loops each
%timeit JR_gray_codes_0(16) # 167 μs ± 379 ns per loop
%timeit JR_gray_codes_1(16) # 145 μs ± 347 ns per loop
%timeit no_comment(16) # 317 μs ± 3.05 μs per loop
We can see that JR_gray_codes_1
is the fastest implementation on my machine. It spends ~75% of its time in both the &
and !=
operations (the later is not optimally implemented in Numpy currently though still rather fast). It is 45% faster than the fastest solution so far (which is gray_codes_0
).
I might implement the optimisation about 8-bit integers later which should be 3 times faster than gray_codes_0
.
Here are results on Windows 10 (on the same machine):
%timeit gray_codes_0(16) # 1.1 ms ± 5.31 μs per loop
%timeit gray_codes_1(16) # 1.13 ms ± 5.99 μs per loop
%timeit gray_codes_2(16) # 2.75 ms ± 25.3 μs per loop
%timeit gray_codes_3(16) # 948 μs ± 1.68 μs per loop
%timeit JR_gray_codes_0(16) # 1.04 ms ± 2.01 μs per loop
%timeit JR_gray_codes_1(16) # 1.11 ms ± 2.53 μs per loop
%timeit JR_gray_codes_2() # 544 μs ± 10.5 μs per loop
%timeit no_comment(16) # 753 μs ± 2.53 μs per loop
Please note that results are significantly slower than on Linux. See the note below to understand why.
Important notes
The timings of the functions gray_codes_0
, gray_codes_2
and gray_codes_3
(not other functions) are severely impacted by a weird HUGE effect on my machine. As a result, I thing the timings of such functions are unreliable. Here is a standalone script to reproduce the issue. I reproduced it with 2 Numpy versions (2.1.3 & 2.2.3) and 2 CPython versions (3.11.11 & 3.13.2) on Linux 6.12.12 (libc v2.40-6). When the functions are slow (default use-case), most of the time is spent in the Linux kernel (more specifically in memory paging kernel functions and IRQ) so this is certainly a problem related to page faults. I think it is due to a pathological case of the system allocator (which is dependent of the platform). A similar issue is described in this post.
While the version of Numpy and CPython do impact performance results, it is rather small compared to this huge effect. I reproduced the effect with both Numpy 2.1.3 and Numpy 2.2.3, and both CPython 3.11.11 and CPython 3.13.2.
In the above profiling benchmark, I choose to pick the fastest implementation since page-faults/issues are a side effect of temporary Numpy arrays that may only happen on some platform and that can be avoided using in-place operations or by using another allocator.
On Windows, note that while I cannot reproduce the effect with the same script, a similar effect happens if I run the code several times (for example gray_codes_0
is 50% faster the second time the script is executed). The order of the function also seems to impact the timings on Windows...
Moreover, on Windows, most of the functions are also much slower because of 2 issues:
- more expensive page faults and more frequent page faults (the system allocator is less conservative than on Linux)
- Numpy is typically compiled with the MSVC compiler which fails to optimise several basic Numpy functions. For example booleans operations tends to be much slower.
In this case, on the same machine, I get 1.1 ms on Windows 10 versus 0.145 ms for JR_gray_codes_1
on Linux... This is 7.5 times slower! JR_gray_codes_2
takes 0.54 ms which is significantly better and faster than all other implementations. Note that using WSL mostly fix this performance issue. You can mitigate the issue by preallocating arrays so to avoid page faults. Preallocation does not solve inefficient implementation of some Numpy function on Windows though. Recompiling your own Numpy module on Windows with Clang-CL instead of using pre-compiled PIP packages should solve this.
If anyone is interested in how to create the polar plots from the generated gray codes, here is some code for completeness (thanks OP for an interesting exercise on numpy
and gray codes). Set an oversampling
for small bit counts to make the plot smooth.
def plot_gray_codes(gc, oversampling=1):
import matplotlib.pyplot as plt
fig, ax = plt.subplots(subplot_kw={'projection': 'polar'})
# Set an oversampling factor to make the polar plot smoother for small bit counts
gc = np.repeat(gc, oversampling, axis=0) if oversampling > 1 else gc
for i, j in zip(*np.where(gc)):
ax.fill_between([i * 2 * np.pi / gc.shape[0], (i + 1) * 2 * np.pi / gc.shape[0]], [j, j], [j+1, j+1], color='black')
ax.set_aspect('equal')
plt.axis('off')
plt.tight_layout()
plt.show()
I have written some new solutions and according to my tests the fastest of them have better performance than the solutions from the other answers on my computer.
First, some details about my computer: CPU Intel(R) Core(TM) i5-4430 CPU @ 3.00GHz, RAM Kingston DDR3 8GiB * 2, OS Windows 11 Pro 23H2 x64, CPython 3.12.6 x64, NumPy 2.0.2, I am using IPython 8.27.0 on Windows Terminal.
I have tried to implement my idea mentioned in my latest edit to the question, and surprisingly I have actually beaten all submitted solutions and even np.unpackbits
...
First I will describe the ways I have found to generate sequence of integers with periodic gaps.
Say you want to generate a list of numbers that includes every other group of width
numbers, for example, if the width
is 5, if we include the first five natural numbers, we get 0, 1, 2, 3, 4, then we skip the next five numbers and we append 10, 11, 12, 13, 14 to the list, then we skip another five numbers and append 20, 21, 22, 23, 24...
Each group of included numbers has width
of 5, and they all start at a multiple of 5. I have identified that the groups that are included have a floor division of zero when divided by 5.
So if I want every other group of five integers starting at 7 ending at 100 I can do this:
In [35]: nums = np.arange(100)
In [36]: ~((nums - 7) // 5) & 1
Out[36]:
array([1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0,
0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0,
0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1,
1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1])
But this is inefficient (and the first 2 bits should be off). NumPy arrays can be added, so I should instead do this:
In [37]: np.arange(7, 100, 10)[:, None] + np.arange(5)
Out[37]:
array([[ 7, 8, 9, 10, 11],
[ 17, 18, 19, 20, 21],
[ 27, 28, 29, 30, 31],
[ 37, 38, 39, 40, 41],
[ 47, 48, 49, 50, 51],
[ 57, 58, 59, 60, 61],
[ 67, 68, 69, 70, 71],
[ 77, 78, 79, 80, 81],
[ 87, 88, 89, 90, 91],
[ 97, 98, 99, 100, 101]])
But both are inefficient:
In [38]: %timeit column = np.arange(65536, dtype=np.uint16); ~((column - 64) >> 7) & 1
219 μs ± 15.9 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [39]: %timeit np.arange(64, 65536, 256, dtype=np.uint16)[:, None] + np.arange(128)
61.2 μs ± 2.97 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [40]: %timeit np.tile(np.concatenate([np.zeros(64, dtype=bool), np.ones(64, dtype=bool)]), 512)
17.9 μs ± 662 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
Though as you can see we need to use np.concatenate
if the starting position of the tiling isn't a multiple of the tile's length.
I have implemented 5 new functions, binary_codes_8
implements the above idea, though it isn't efficient:
def binary_codes_8(n: int) -> np.ndarray:
dtype = get_dtype(n)
result = np.zeros(((length := 1 << n), n), dtype=bool)
width = 1
for i in range(n - 1, -1, -1):
result[:, i][
np.arange(width, length, (step := width << 1), dtype=dtype)[:, None]
+ np.arange(width)
] = 1
width = step
return result
In [41]: %timeit binary_codes_6(16)
1.14 ms ± 37.6 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [42]: %timeit binary_codes_8(16)
2.53 ms ± 41.6 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Instead I have a new idea, instead of generating a two dimensional array, we can just join the columns head to tail to a one dimensional array, this is to make memory access contiguous and to avoid broadcasting assignments per iteration.
def binary_codes_9(n: int) -> np.ndarray:
validate(n)
places = 1 << np.arange(n)
return (
np.concatenate(
[
np.tile(
np.concatenate([np.zeros(a, dtype=bool), np.ones(a, dtype=bool)]), b
)
for a, b in zip(places[::-1], places)
]
)
.reshape((n, 1 << n))
.T
)
In [43]: %timeit binary_codes_9(16)
910 μs ± 26.9 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
As you can see, it actually beats the np.unpackbits
based solution.
gray_codes_4
implements the same idea as binary_codes_8
, so it is also inefficient:
def gray_codes_4(n: int) -> np.ndarray:
dtype = get_dtype(n)
result = np.zeros(((length := 1 << n), n), dtype=bool)
width = 2
start = 1
for i in range(n - 1, 0, -1):
result[:, i][
np.arange(start, length, (step := width << 1), dtype=dtype)[:, None]
+ np.arange(width)
] = 1
width = step
start <<= 1
result[:, 0][length >> 1 :] = 1
return result
In [44]: %timeit gray_codes_4(16)
2.52 ms ± 161 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So I tried to implement the idea of binary_codes_9
to gray_codes_5
:
def gray_codes_5(n: int) -> np.ndarray:
m = n - 1
half = 1 << m
column = np.arange(1 << n, dtype=get_dtype(n))
offsets = (1 << np.arange(m - 1, -1, -1, dtype=np.uint8)).tolist()
return (
np.concatenate(
[
np.concatenate([np.zeros(half, dtype=bool), np.ones(half, dtype=bool)]),
*(~((column - a) >> b) & 1 for a, b in zip(offsets, range(m, 0, -1))),
]
)
.reshape((n, 1 << n))
.T
)
In [45]: %timeit gray_codes_5(16)
3.67 ms ± 60.2 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
It is somehow slower, but the idea is sound, just that the way each column is generated is inefficient.
So I tried again, and this time it even beats binary_codes_9
:
def gray_codes_6(n: int) -> np.ndarray:
validate(n)
if n == 1:
return np.array([(0,), (1,)], dtype=bool)
half = 1 << (n - 1)
places = (1 << np.arange(n - 1, -1, -1)).tolist()
return (
np.concatenate(
[
np.zeros(half, dtype=bool),
np.ones(half, dtype=bool),
*(
np.tile(
np.concatenate(
[
np.zeros(b, dtype=bool),
np.ones(a, dtype=bool),
np.zeros(b, dtype=bool),
]
),
1 << i,
)
for i, (a, b) in enumerate(zip(places, places[1:]))
),
]
)
.reshape((n, 1 << n))
.T
)
In [46]: %timeit gray_codes_6(16)
759 μs ± 19.8 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
This is the benchmark of all the functions on my IPython interpreter:
In [7]: %timeit binary_codes_0(16)
1.62 ms ± 58.8 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [8]: %timeit binary_codes_1(16)
829 μs ± 9.4 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [9]: %timeit binary_codes_2(16)
1.9 ms ± 67.8 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [10]: %timeit binary_codes_3(16)
1.55 ms ± 9.63 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [11]: %timeit binary_codes_4(16)
1.66 ms ± 40.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [12]: %timeit binary_codes_5(16)
1.8 ms ± 54.5 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [13]: %timeit binary_codes_6(16)
1.11 ms ± 22.3 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [14]: %timeit binary_codes_7(16)
7.01 ms ± 46.5 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [15]: %timeit binary_codes_8(16)
2.5 ms ± 57.8 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [16]: %timeit binary_codes_9(16)
887 μs ± 5.43 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [17]: %timeit gray_codes_0(16)
1.65 ms ± 11.9 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [18]: %timeit gray_codes_1(16)
1.79 ms ± 9.98 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [19]: %timeit gray_codes_2(16)
3.97 ms ± 28.4 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [20]: %timeit gray_codes_3(16)
1.9 ms ± 49.6 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [21]: %timeit gray_codes_4(16)
2.38 ms ± 33.4 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [22]: %timeit gray_codes_5(16)
3.95 ms ± 19.2 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [23]: %timeit gray_codes_6(16)
718 μs ± 4.91 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [24]: %timeit JR_gray_codes_1(16)
1.42 ms ± 10.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [25]: %timeit gray_codes_nocomment(16)
1.03 ms ± 4.88 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [26]: %timeit JR_gray_codes_2()
809 μs ± 12.9 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
As you can see, my solutions beat all functions from the other answers.
But how do they scale with larger inputs? To find out, I reused code from my last answer:
import colorsys
import matplotlib.pyplot as plt
import numpy as np
import timeit
from math import ceil
from scipy.interpolate import make_interp_spline
def measure_command(func, *args, **kwargs):
elapsed = timeit.timeit(lambda: func(*args, **kwargs), number=5)
once = elapsed / 5
if elapsed >= 1:
return int(1e9 * once + 0.5)
times = min(1024, ceil(1 / once))
return int(
1e9 * timeit.timeit(lambda: func(*args, **kwargs), number=times) / times + 0.5
)
def test(func):
return [measure_command(func, i) for i in range(1, 21)]
bin6 = binary_codes_9(6)
gra6 = gray_codes_6(6)
execution_times = {}
to_test = [
*((f"binary_codes_{i}", bin6) for i in range(10)),
*((f"gray_codes_{i}", gra6) for i in range(7)),
("JR_gray_codes_1", gra6),
("gray_codes_nocomment", gra6),
]
for func_name, correct in to_test:
func = globals()[func_name]
assert np.array_equal(func(6), correct)
execution_times[func_name] = test(func)
cost_per_item = {
k: [e / (1 << i) for i, e in enumerate(v, start=1)]
for k, v in execution_times.items()
}
largest_execution = sorted(
[(k, v[-1]) for k, v in execution_times.items()], key=lambda x: x[1]
)
average_execution = sorted(
[(k, sum(v) / 20) for k, v in cost_per_item.items()], key=lambda x: x[1]
)
columns = [
sorted([v[i] for v in execution_times.values()], reverse=True) for i in range(20)
]
overall_performance = [
(k, [columns[i].index(e) for i, e in enumerate(v)])
for k, v in execution_times.items()
]
overall_execution = sorted(
[(a, sum(b) / 36) for a, b in overall_performance], key=lambda x: -x[1]
)
execution_times_1.json
In [14]: average_execution
Out[14]:
[('binary_codes_6', 206.31875624656678),
('gray_codes_nocomment', 292.46834869384764),
('binary_codes_5', 326.2715059280396),
('binary_codes_4', 425.4694920539856),
('gray_codes_1', 432.8871788024902),
('gray_codes_4', 440.75454263687135),
('binary_codes_3', 486.1538872718811),
('JR_gray_codes_1', 505.06243762969973),
('gray_codes_0', 518.2342618465424),
('gray_codes_3', 560.9797175884247),
('binary_codes_2', 585.7214835166931),
('binary_codes_8', 593.8069385528564),
('gray_codes_5', 1012.6498884677887),
('gray_codes_6', 1102.2803171157836),
('binary_codes_0', 1102.6035027503967),
('gray_codes_2', 1152.2696633338928),
('binary_codes_1', 1207.228157234192),
('binary_codes_7', 1289.8271428585053),
('binary_codes_9', 1667.4837736606598)]
In [15]: [(a, b / 1e9) for a, b in largest_execution]
Out[15]:
[('gray_codes_6', 0.01664672),
('binary_codes_9', 0.017008553),
('gray_codes_nocomment', 0.0200915),
('binary_codes_1', 0.025319672),
('binary_codes_6', 0.027002585),
('gray_codes_3', 0.038912479),
('binary_codes_2', 0.045456482),
('JR_gray_codes_1', 0.053382224),
('binary_codes_3', 0.054410716),
('binary_codes_4', 0.0555085),
('gray_codes_0', 0.058621065),
('binary_codes_0', 0.0718396),
('binary_codes_5', 0.084661),
('binary_codes_8', 0.085592108),
('gray_codes_1', 0.088250008),
('gray_codes_4', 0.091165908),
('gray_codes_2', 0.093191058),
('binary_codes_7', 0.104509167),
('gray_codes_5', 0.146622829)]
In [16]: overall_execution
Out[16]:
[('binary_codes_6', 9.055555555555555),
('gray_codes_nocomment', 8.88888888888889),
('JR_gray_codes_1', 7.5),
('binary_codes_5', 6.972222222222222),
('binary_codes_4', 6.527777777777778),
('gray_codes_1', 6.0),
('binary_codes_3', 5.916666666666667),
('gray_codes_0', 5.805555555555555),
('gray_codes_3', 4.888888888888889),
('gray_codes_6', 4.555555555555555),
('gray_codes_4', 4.25),
('binary_codes_1', 4.083333333333333),
('binary_codes_2', 4.0),
('binary_codes_9', 3.7222222222222223),
('binary_codes_8', 3.6944444444444446),
('binary_codes_0', 3.361111111111111),
('gray_codes_2', 2.611111111111111),
('gray_codes_5', 2.2777777777777777),
('binary_codes_7', 0.8888888888888888)]
This is puzzling, clearly my new functions gray_codes_6
, binary_codes_9
perform the best for the largest input (20, it means the output will have 1048576 rows), but according to my metrics they somehow score poorly...
Just to sanity check:
In [17]: %timeit binary_codes_1(16)
908 μs ± 24.2 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [18]: %timeit binary_codes_6(16)
1.12 ms ± 8.18 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [19]: %timeit binary_codes_9(16)
925 μs ± 12.4 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [20]: %timeit binary_codes_1(20)
17.3 ms ± 205 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [21]: %timeit binary_codes_6(20)
28.6 ms ± 233 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [22]: %timeit binary_codes_9(20)
23 ms ± 753 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [23]: %timeit gray_codes_6(16)
854 μs ± 23.5 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [24]: %timeit JR_gray_codes_1(16)
1.69 ms ± 34 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [25]: %timeit gray_codes_nocomment(16)
1.11 ms ± 31 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [26]: %timeit gray_codes_6(20)
15.9 ms ± 959 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [27]: %timeit gray_codes_nocomment(20)
20.5 ms ± 640 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [28]: %timeit JR_gray_codes_1(20)
48.5 ms ± 4.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Indeed, they performed better for larger inputs.
So I tried to graph the performance, the graphs are kind of busy, a lot goings on, but I saw something:
ymax = -1e309
for v in execution_times.values():
ymax = max(max(v), ymax)
rymax = ceil(ymax)
fig, ax = plt.subplots()
ax.axis([0, 20, 0, ymax])
ax.set_xticks(range(1, 21))
ax.set_xticklabels([f"{1<<i}" for i in range(1, 21)])
for k, v in execution_times.items():
x = np.linspace(1, 20, 256)
plt.plot(x, make_interp_spline(range(1, 21), v)(x), label=k)
plt.legend()
plt.show()
fig, ax = plt.subplots()
ax.axis([0, 20, 0, 19])
ax.set_xticks(range(1, 21))
ax.set_yticks(range(20))
ax.set_xticklabels([f"{1<<i}" for i in range(1, 21)])
for k, v in overall_performance:
plt.plot(range(1, 21), v, label=k)
plt.legend()
plt.show()
My smart functions have flatter curves than the inefficient ones, but they are slower for small inputs than the inefficient ones, but the inefficient ones grow much faster than my smart functions.
In my cost_per_item
, in order to get a sensible average of the different execution times for different inputs, I divided the execution time for each input by the corresponding output size, so I get some average number...
But this is wrong, the functions don't scale linearly, the bigger the input, the harder it is to finish the execution in record time.
And in overall_execution
the scoring is also wrong, we care about how does a function scale for ever larger inputs, so if a function completes faster for smaller inputs, it should have less importance:
In [35]: overall_execution1 = overall_execution = sorted(
...: [(a, sum(c << i for i, c in enumerate(b))) for a, b in overall_performance], key=lambda x: -x[1]
...: )
In [36]: overall_execution1
Out[36]:
[('gray_codes_6', 18462984),
('binary_codes_9', 17880641),
('gray_codes_nocomment', 16642805),
('binary_codes_1', 15627986),
('binary_codes_6', 14995433),
('gray_codes_3', 13092872),
('binary_codes_3', 11226678),
('binary_codes_2', 11189287),
('JR_gray_codes_1', 10861503),
('binary_codes_4', 9194493),
('binary_codes_0', 9184718),
('gray_codes_0', 8130547),
('binary_codes_5', 6373604),
('binary_codes_8', 5019700),
('gray_codes_1', 4305913),
('gray_codes_4', 3413060),
('gray_codes_2', 2567962),
('binary_codes_7', 925733),
('gray_codes_5', 210406)]
In [37]: average_execution1 = sorted(
...: [(k, sum(v) / 20) for k, v in execution_times.items()], key=lambda x: x[1]
...: )
In [38]: [(a, round(b / 1e9, 6)) for a, b in average_execution1]
Out[38]:
[('binary_codes_9', 0.001746),
('gray_codes_6', 0.001789),
('gray_codes_nocomment', 0.001967),
('binary_codes_1', 0.002462),
('binary_codes_6', 0.002567),
('gray_codes_3', 0.003661),
('binary_codes_2', 0.004313),
('binary_codes_3', 0.004558),
('JR_gray_codes_1', 0.004716),
('binary_codes_4', 0.005205),
('gray_codes_0', 0.005425),
('binary_codes_0', 0.005544),
('binary_codes_5', 0.007468),
('binary_codes_8', 0.007723),
('gray_codes_1', 0.007831),
('gray_codes_4', 0.008079),
('gray_codes_2', 0.0086),
('binary_codes_7', 0.009952),
('gray_codes_5', 0.013721)]
According to this new metric, it is undeniable that my binary_codes_9
and gray_codes_6
are much faster than the solutions provided by the other answers.
And I selected the fastest 9 functions to graph as proof:
In the last image, the vertical axis signifies how many functions have been beaten by the function corresponding to the graph.
My functions aren't beaten, but since the code from nocomment beats my original solutions I will accept that answer.
本文标签: pythonWhat is the fastest way to generate all nbit gray codes using NumPyStack Overflow
版权声明:本文标题:python - What is the fastest way to generate all n-bit gray codes using NumPy? - Stack Overflow 内容由网友自发贡献,该文观点仅代表作者本人, 转载请联系作者并注明出处:http://www.betaflare.com/web/1744792468a2625372.html, 本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌抄袭侵权/违法违规的内容,一经查实,本站将立刻删除。
发表评论