admin管理员组

文章数量:1241126

Sorry if the title is confusing.

I'm pretty familiar with Pandas and think I have a solid idea of how I would do this there. Pretty much just brute-force iteration and index-based assignment for the new columns. I recently learned about Polars, though, and want to try it for the parallelization/speed and to stay fresh and up to date on my data skills. This is my first foray, and it's not been going great.

I have a dataframe, and one column of this frame is basically a tag list. Each cell in that column is a list of relevant tags. What I want to do is scan through those lists, row by row, and add a column by the name of a more general tag if the existing tag is in the cell.

For example, say I have a dataframe that looks like this:

Index Person Food Provided
1 Billy Apple, Hot dog
2 Suzy Celery, brownies

Sorry if the title is confusing.

I'm pretty familiar with Pandas and think I have a solid idea of how I would do this there. Pretty much just brute-force iteration and index-based assignment for the new columns. I recently learned about Polars, though, and want to try it for the parallelization/speed and to stay fresh and up to date on my data skills. This is my first foray, and it's not been going great.

I have a dataframe, and one column of this frame is basically a tag list. Each cell in that column is a list of relevant tags. What I want to do is scan through those lists, row by row, and add a column by the name of a more general tag if the existing tag is in the cell.

For example, say I have a dataframe that looks like this:

Index Person Food Provided
1 Billy Apple, Hot dog
2 Suzy Celery, brownies

Then I also have a dictionary that looks like this: foodTypes_dict = {'Apple':'Fruit', 'Hot dog':'Meat', 'Celery':'Vegetable', 'brownies':'Dessert'}

I would like to create a new column based on the food type that has a simple X or True or something if the "Food Provided" list contains the dict key.

Something like:

Index Person Food Provided Fruit Vegetable Meat Dessert
1 Billy Apple, Hot dog X None X None
2 Suzy Celery, brownies None X None X

I've tried:

for key in foodTypes_dict.keys():
    my_df.with_columns((pl.col("Food Provided").str.contains(key)).alias(foodTypes_dict[key]))

This has finally gotten me away from syntax errors, which I was encountering with everything else I tried. It doesn't, however, seem to actually be working at all. Essentially, it doesn't seem to create any new columns whatsoever. I tried adding a my_df.glimpse() call during each iteration of the for loop, but the dataframe dimensions don't change. I do not get any syntax errors or otherwise. I am using Jupyter Notebook which can suppress some of them, but the cell runs and finishes nearly instantly, just not close to the expected output.

Any help would be appreciated. Thanks!

Share Improve this question edited yesterday jqurious 21.4k4 gold badges20 silver badges39 bronze badges asked yesterday Sparky ParkySparky Parky 356 bronze badges
Add a comment  | 

3 Answers 3

Reset to default 3

First of all, the large majority of Polars' DataFrame operations are not in place, so you must re-assign to the variable if updating in a loop.

Next, for the "Food Provided" column, you should use Polars' list data type. This works natively with Polars' other operations and prevents substring-like issues (e.g., pineapple vs apple, etc) arising from string containment checks.

The list data type also makes is super easy to check if a particular value is in there.

Here's a solution that produces your expected output

my_df = pl.DataFrame({
    "index": [1, 2],
    "person": ["Billy", "Sally"],
    "Food Provided": [["Apple", "Hot dog"], ["Celery", "brownies"]]
})

food_types = {
    "Apple": "Fruit",
    "Celery": "Vegetable",
    "Hot dog": "Meat",
    "brownies": "Dessert"
}

my_df.with_columns(
    # When food is contained in the list of food provided
    pl.when(pl.col("Food Provided").list.contains(food))
    # Then a literal "X"
    .then(pl.lit("X"))
    # Implicit "None" by leaving out the "otherwise" block
    # Set the column name as the food type
    .alias(food_type)
    for food, food_type in food_types.items()
)
shape: (2, 7)
┌───────┬────────┬────────────────────────┬───────┬───────────┬──────┬─────────┐
│ index ┆ person ┆ Food Provided          ┆ Fruit ┆ Vegetable ┆ Meat ┆ Dessert │
│ ---   ┆ ---    ┆ ---                    ┆ ---   ┆ ---       ┆ ---  ┆ ---     │
│ i64   ┆ str    ┆ list[str]              ┆ str   ┆ str       ┆ str  ┆ str     │
╞═══════╪════════╪════════════════════════╪═══════╪═══════════╪══════╪═════════╡
│ 1     ┆ Billy  ┆ ["Apple", "Hot dog"]   ┆ X     ┆ null      ┆ X    ┆ null    │
│ 2     ┆ Sally  ┆ ["Celery", "brownies"] ┆ null  ┆ X         ┆ null ┆ X       │
└───────┴────────┴────────────────────────┴───────┴───────────┴──────┴─────────┘

Be sure to re-assign the result to a variable once you are done with the transformations.

Do note that this code will break once there is more than one food of a given type in the food_types dict. This is because Polars does not allow duplicate column names, which the code would create. At this point, consider switching the food type to be the key of the dict and have a list of foods as the values

EDIT: Here is a solution with the food_types dict having the food types as keys, and a list of values. any_horizontal returns true when any condition in the for food in foods loop is true.

my_df = pl.DataFrame({
    "index": [1, 2, 3],
    "person": ["Billy", "Sally", "Bob"],
    "Food Provided": [["Apple", "Hot dog"], ["Celery", "brownies"], ["Spinach"]],
})

food_types = {
    "Fruit": ["Apple"],
    "Vegetable": ["Celery", "Spinach"],
    "Meat": ["Hot dog", "Chicken"],
    "Dessert": ["brownies", "Cake"],
}


my_df.with_columns(
      # When any food in the foods list is contained in "Food Provided" column
      pl.when(pl.any_horizontal(
          pl.col("Food Provided").list.contains(food) for food in foods
      ))
      .then(pl.lit("X"))
      .alias(food_type)
      for food_type, foods in food_types.items()
  )

That is a fair bit of Python looping, so here's another option. It uses replace to do a join-like operation and then pivots the food type. If you know all the possible food types ahead of time, you can avoid pivot (available in eager API only) and do a "lazy pivot" as described in the last example of the DataFrame.pivot docs

my_df = pl.DataFrame({
    "index": [1, 2, 3],
    "person": ["Billy", "Sally", "Bob"],
    "Food Provided": [["Apple", "Hot dog"], ["Celery", "brownies"], ["Spinach"]],
})

food_types = {
    "Apple": "Fruit",
    "Celery": "Vegetable",
    "Hot dog": "Meat",
    "brownies": "Dessert",
    "Cake": "Dessert",
    "Chicken": "Meat",
    "Spinach": "Vegetable",
}


(
    my_df
    .with_columns(
        food_types=pl.col("Food Provided").list.eval(pl.element().replace(food_types)),
        # The value to use when the food type is pivoted
        value=pl.lit("X"),
    )
    .explode("food_types")
    .pivot("food_types", index=["index", "person", "Food Provided"])
)

All return the expected output.

You'd still need to iterate over the dictionary entries, but the desired output can be achieved using

  • pl.Expr.str.split to split the string column into a list of items,
  • pl.Expr.list.contains to check whether the list contains a particular item.
(
    df
    .with_columns(
        pl.col("Food Provided").str.split(", ").list.contains(item).alias(category)
        for item, category in foodTypes_dict.items()
    )
)

If you build a dict of lists instead.

foodTypes_dict = {"Apple":"Fruit", "Hot dog":"Meat", "Celery":"Vegetable", "brownies":"Dessert"}

other = {}
for k, v in foodTypes_dict.items():
    other.setdefault(v, []).append(k) # or collections.defaultdict
{'Fruit': ['Apple'],
 'Meat': ['Hot dog'],
 'Vegetable': ['Celery'],
 'Dessert': ['brownies']}

You could then check the .list.set_intersection() has a .list.len() greater than 0.

df = pl.from_repr("""
┌───────┬────────┬──────────────────┐
│ Index ┆ Person ┆ Food Provided    │
│ ---   ┆ ---    ┆ ---              │
│ i64   ┆ str    ┆ str              │
╞═══════╪════════╪══════════════════╡
│ 1     ┆ Billy  ┆ Apple, Hot dog   │
│ 2     ┆ Suzy   ┆ Celery, brownies │
└───────┴────────┴──────────────────┘
""")
df.with_columns(
    (pl.col("Food Provided")
       .str.split(", ")
       .list.set_intersection(foods).list.len() > 0).alias(name) 
    for name, foods in other.items()
)
shape: (2, 7)
┌───────┬────────┬──────────────────┬───────┬───────┬───────────┬─────────┐
│ Index ┆ Person ┆ Food Provided    ┆ Fruit ┆ Meat  ┆ Vegetable ┆ Dessert │
│ ---   ┆ ---    ┆ ---              ┆ ---   ┆ ---   ┆ ---       ┆ ---     │
│ i64   ┆ str    ┆ str              ┆ bool  ┆ bool  ┆ bool      ┆ bool    │
╞═══════╪════════╪══════════════════╪═══════╪═══════╪═══════════╪═════════╡
│ 1     ┆ Billy  ┆ Apple, Hot dog   ┆ true  ┆ true  ┆ false     ┆ false   │
│ 2     ┆ Suzy   ┆ Celery, brownies ┆ false ┆ false ┆ true      ┆ true    │
└───────┴────────┴──────────────────┴───────┴───────┴───────────┴─────────┘

As has been mentioned, .with_columns() returns a new frame. (which you are discarding)

Returns: A new DataFrame with the columns added.

Similar to how .assign() is used in Pandas - you need to actually store the result.

df = df.with_columns(...)

本文标签: