
I have a dict filled with lists. The dict holds a parsed ".MDB" table: the keys are the columns of that table, and the elements of each key's list are the rows of that column. I need to filter that data. By filtering I mean getting a new dict with all the same keys, but keeping only the rows that satisfy my criteria. For example, the new dict should only contain rows for which the value of the 'index' key equals "13".

I wrote a function that does exactly what I need:

from copy import deepcopy

def filter_d(d, key, condition):
    new_d = deepcopy(d)
    # Walk backwards so deletions don't shift the indices still to be visited
    for i in range(len(new_d[key]) - 1, -1, -1):
        if new_d[key][i] != condition:
            for k in new_d:
                del new_d[k][i]
    return new_d

And I call it like that (in this example I filter dict "extindex" by 2 parameters):

extindex_nu = filter_d(filter_d(extindex, 'idblank', temperature_index), 'index', 13)

This function works fine, but it seems to be extremely slow, and I have to filter thousands of rows hundreds of times. Are there more efficient solutions?

I've only been finding something called "dict comprehension", but I'm not sure it solves my problem. If it does, I'd like some suggestions on its usage.


asked Jan 23 at 13:31 by Eugene Petrov; edited Jan 24 at 6:44 by Gustav
  • Can you give a minimal example of your dictionary and temperature_index? You should provide enough data in the question for someone else to run your code. – Guy Commented Jan 23 at 13:47
  • Are you strictly limited to Python and its standard library, or might you use external packages? – Daweo Commented Jan 23 at 13:57

1 Answer


There are several reasons why your code is so slow:

  • filter_d is called once for every column you want to filter, and each call iterates through the entire dataset;
  • The for loops are nested;
  • del is called many times, and deleting an element from a list is O(n) each time.
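To see the del point in isolation, here is a minimal standalone sketch (not from the answer) comparing in-place deletion with rebuilding the list; the second form does the same work in a single pass:

```python
n = 100_000
data = list(range(n))

# Deleting rows one by one: every del shifts the tail of the list,
# so the whole loop is roughly O(n^2) when many rows are removed
def delete_in_place():
    lst = data.copy()
    for i in range(len(lst) - 1, -1, -1):
        if lst[i] % 2:
            del lst[i]
    return lst

# Building a fresh list touches each element once: O(n) total
def rebuild():
    return [x for x in data if x % 2 == 0]
```

Timing these two on your machine should show the rebuild version winning by a wide margin as n grows.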

Below you will find different suggestions to improve your code.

  1. Original code
from copy import deepcopy
import time

import numpy as np


# Dummy data
data = {
    f"column{idx}": np.random.randint(2, size=(1000000,)).tolist()
    for idx in range(10)
}

def filter_d(d, key, condition):
    new_d = deepcopy(d)
    for i in range(len(new_d[key]) - 1, -1, -1):
        if new_d[key][i] != condition:
            for k in new_d:
                del new_d[k][i]
    return new_d


start = time.time()
filter_d(filter_d(data, "column0", 1), "column3", 0)
print(f"{time.time() - start:.4f}s")
>>> 3.5714s

Let's analyze the complexity of your code. Let us assume your data has k columns and n rows. Then, the copy is in O(nk) and your nested loop is in O(kn²) because each del is in O(n). Overall the code is in O(kn²). This ignores the fact that you have to call this function twice.
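Since the question mentions dict comprehensions, here is a sketch of how one gets you to O(nk) while keeping the original signature: compute the indices of the rows to keep once, then rebuild every column from them (filter_d_mask is a name chosen here, not from the answer):

```python
def filter_d_mask(d, key, condition):
    # Indices of the rows that satisfy the condition, computed once in O(n)
    keep = [i for i, v in enumerate(d[key]) if v == condition]
    # Rebuild every column from those indices: O(nk) overall
    return {k: [col[i] for i in keep] for k, col in d.items()}

# Same call pattern as the original:
# filter_d_mask(filter_d_mask(extindex, 'idblank', temperature_index), 'index', 13)
```

The outer `{k: ... for k, col in d.items()}` is the dict comprehension; the inner list comprehension replaces the repeated del.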

  2. First proposition using the same signature
from collections import defaultdict

def filter_d2(d, key, condition):
    # Convert the dictionary of lists into an iterator of dictionaries
    data_iter = (
        {k: v[i] for k, v in d.items()}
        for i in range(len(next(iter(d.values()))))
    )

    # Filter the data
    filtered_iter = filter(lambda row: row[key] == condition, data_iter)

    # Convert to the original structure
    new_d = defaultdict(list)

    for row in filtered_iter:
        for k, v in row.items():
            new_d[k].append(v)

    return new_d

start = time.time()
filter_d2(filter_d2(data, "column0", 1), "column3", 0)
print(f"{time.time() - start:.4f}s")
>>> 0.1278s

This method changes the data into another structure that is easier to iterate over (roughly, a list of dictionaries). Then, it performs the filtering using the filter Python built-in. Finally, it converts the data back into the original structure.

Let's analyze the complexity of the second code. The conversion of data structure is in O(nk), the filtering in O(n) and the conversion to the original structure is in O(nk). Overall the code is in O(nk). This ignores the fact that this function must be called twice.

  3. Second proposition using a lambda function to filter the rows
def filter_d3(d, cond):
    # Convert the dictionary of lists into an iterator of dictionaries
    data_iter = (
        {k: v[i] for k, v in d.items()}
        for i in range(len(next(iter(d.values()))))
    )

    # Filter the data
    filtered_iter = filter(cond, data_iter)

    # Convert to the original structure
    new_d = defaultdict(list)

    for row in filtered_iter:
        for k, v in row.items():
            new_d[k].append(v)

    return new_d

start = time.time()
filter_d3(data, lambda row: row["column0"] == 1 and row["column3"] == 0)
print(f"{time.time() - start:.4f}s")
>>> 0.0633s

The complexity of this code is also O(nk), but you only have to call it once, roughly halving the runtime.
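If external packages are an option (numpy is already used above to build the dummy data), a vectorized boolean mask avoids the Python-level row loop entirely. This is a sketch assuming the columns hold comparable values such as numbers; filter_d_np is a name chosen here:

```python
import numpy as np

def filter_d_np(d, cond):
    # Convert each column to an array once
    arrays = {k: np.asarray(v) for k, v in d.items()}
    # cond receives the dict of arrays and returns one boolean mask over rows
    mask = cond(arrays)
    # Index every column with the same mask and convert back to lists
    return {k: a[mask].tolist() for k, a in arrays.items()}

# Same double filter as above, in a single vectorized pass:
# filter_d_np(data, lambda a: (a["column0"] == 1) & (a["column3"] == 0))
```

The per-row work then happens in compiled numpy code instead of a Python loop, which is typically much faster than the generator-based versions on datasets of this size.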
