Python Polars - creating new columns based on the key-value pair of a dict matched to a string in an existing column

Sorry if the title is confusing.

I'm pretty familiar with Pandas and think I have a solid idea of how I would do this there. Pretty much just brute-force iteration and index-based assignment for the new columns. I recently learned about Polars, though, and want to try it for the parallelization/speed and to stay fresh and up to date on my data skills. This is my first foray, and it's not been going great.

I have a dataframe, and one column of this frame is basically a tag list. Each cell in that column is a list of relevant tags. What I want to do is scan through those lists, row by row, and add a column by the name of a more general tag if the existing tag is in the cell.

For example, say I have a dataframe that looks like this:

Index	Person	Food Provided
1	Billy	Apple, Hot dog
2	Suzy	Celery, brownies

Then I also have a dictionary that looks like this: foodTypes_dict = {'Apple':'Fruit', 'Hot dog':'Meat', 'Celery':'Vegetable', 'brownies':'Dessert'}

I would like to create a new column based on the food type that has a simple X or True or something if the "Food Provided" list contains the dict key.

Something like:

Index	Person	Food Provided	Fruit	Vegetable	Meat	Dessert
1	Billy	Apple, Hot dog	X	None	X	None
2	Suzy	Celery, brownies	None	X	None	X

I've tried:

for key in foodTypes_dict.keys():
    my_df.with_columns((pl.col("Food Provided").str.contains(key)).alias(foodTypes_dict[key]))

This has finally gotten me away from syntax errors, which I was encountering with everything else I tried. It doesn't, however, seem to actually be working at all. Essentially, it doesn't seem to create any new columns whatsoever. I tried adding a my_df.glimpse() call during each iteration of the for loop, but the dataframe dimensions don't change. I do not get any syntax errors or otherwise. I am using Jupyter Notebook which can suppress some of them, but the cell runs and finishes nearly instantly, just not close to the expected output.

Any help would be appreciated. Thanks!

Asked by Sparky Parky; edited by jqurious.

3 Answers

First of all, the large majority of Polars' DataFrame operations are not in place, so you must re-assign to the variable if updating in a loop.

Next, for the "Food Provided" column, you should use Polars' list data type. This works natively with Polars' other operations and prevents substring-like issues (e.g., pineapple vs apple, etc) arising from string containment checks.

The list data type also makes it super easy to check if a particular value is in there.

Here's a solution that produces your expected output:

import polars as pl

my_df = pl.DataFrame({
    "index": [1, 2],
    "person": ["Billy", "Sally"],
    "Food Provided": [["Apple", "Hot dog"], ["Celery", "brownies"]]
})

food_types = {
    "Apple": "Fruit",
    "Celery": "Vegetable",
    "Hot dog": "Meat",
    "brownies": "Dessert"
}

my_df.with_columns(
    # When food is contained in the list of food provided
    pl.when(pl.col("Food Provided").list.contains(food))
    # Then a literal "X"
    .then(pl.lit("X"))
    # Implicit "None" by leaving out the "otherwise" block
    # Set the column name as the food type
    .alias(food_type)
    for food, food_type in food_types.items()
)

shape: (2, 7)
┌───────┬────────┬────────────────────────┬───────┬───────────┬──────┬─────────┐
│ index ┆ person ┆ Food Provided          ┆ Fruit ┆ Vegetable ┆ Meat ┆ Dessert │
│ ---   ┆ ---    ┆ ---                    ┆ ---   ┆ ---       ┆ ---  ┆ ---     │
│ i64   ┆ str    ┆ list[str]              ┆ str   ┆ str       ┆ str  ┆ str     │
╞═══════╪════════╪════════════════════════╪═══════╪═══════════╪══════╪═════════╡
│ 1     ┆ Billy  ┆ ["Apple", "Hot dog"]   ┆ X     ┆ null      ┆ X    ┆ null    │
│ 2     ┆ Sally  ┆ ["Celery", "brownies"] ┆ null  ┆ X         ┆ null ┆ X       │
└───────┴────────┴────────────────────────┴───────┴───────────┴──────┴─────────┘

Be sure to re-assign the result to a variable once you are done with the transformations.

Do note that this code will break once there is more than one food of a given type in the food_types dict. This is because Polars does not allow duplicate column names, which the code would create. At this point, consider switching the food type to be the key of the dict and have a list of foods as the values.
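One way to spot this situation before building the columns is to count how often each food type appears among the dict values. This is only a sketch in plain Python: the "Banana" entry is made up for the illustration, and `duplicated_types` is a hypothetical name.

```python
from collections import Counter

food_types = {
    "Apple": "Fruit",
    "Banana": "Fruit",  # made-up second fruit, which would duplicate the "Fruit" column
    "Celery": "Vegetable",
    "Hot dog": "Meat",
}

# Food types that occur more than once would each produce a duplicate column name
type_counts = Counter(food_types.values())
duplicated_types = [food_type for food_type, n in type_counts.items() if n > 1]

print(duplicated_types)
```

If the list is non-empty, the one-column-per-dict-entry approach is not safe for that mapping.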

EDIT: Here is a solution with the food_types dict having the food types as keys, and a list of values. any_horizontal returns true when any condition in the for food in foods loop is true.

my_df = pl.DataFrame({
    "index": [1, 2, 3],
    "person": ["Billy", "Sally", "Bob"],
    "Food Provided": [["Apple", "Hot dog"], ["Celery", "brownies"], ["Spinach"]],
})

food_types = {
    "Fruit": ["Apple"],
    "Vegetable": ["Celery", "Spinach"],
    "Meat": ["Hot dog", "Chicken"],
    "Dessert": ["brownies", "Cake"],
}


my_df.with_columns(
    # When any food in the foods list is contained in "Food Provided" column
    pl.when(pl.any_horizontal(
        pl.col("Food Provided").list.contains(food) for food in foods
    ))
    .then(pl.lit("X"))
    .alias(food_type)
    for food_type, foods in food_types.items()
)

That is a fair bit of Python looping, so here's another option. It uses replace to do a join-like operation and then pivots the food type. If you know all the possible food types ahead of time, you can avoid pivot (available in the eager API only) and do a "lazy pivot" as described in the last example of the DataFrame.pivot docs.

my_df = pl.DataFrame({
    "index": [1, 2, 3],
    "person": ["Billy", "Sally", "Bob"],
    "Food Provided": [["Apple", "Hot dog"], ["Celery", "brownies"], ["Spinach"]],
})

food_types = {
    "Apple": "Fruit",
    "Celery": "Vegetable",
    "Hot dog": "Meat",
    "brownies": "Dessert",
    "Cake": "Dessert",
    "Chicken": "Meat",
    "Spinach": "Vegetable",
}


(
    my_df
    .with_columns(
        food_types=pl.col("Food Provided").list.eval(pl.element().replace(food_types)),
        # The value to use when the food type is pivoted
        value=pl.lit("X"),
    )
    .explode("food_types")
    .pivot("food_types", index=["index", "person", "Food Provided"])
)

All return the expected output.

You'd still need to iterate over the dictionary entries, but the desired output can be achieved using

pl.Expr.str.split to split the string column into a list of items,
pl.Expr.list.contains to check whether the list contains a particular item.

(
    df
    .with_columns(
        pl.col("Food Provided").str.split(", ").list.contains(item).alias(category)
        for item, category in foodTypes_dict.items()
    )
)

You could instead build a dict of lists:

foodTypes_dict = {"Apple":"Fruit", "Hot dog":"Meat", "Celery":"Vegetable", "brownies":"Dessert"}

other = {}
for k, v in foodTypes_dict.items():
    other.setdefault(v, []).append(k) # or collections.defaultdict

{'Fruit': ['Apple'],
 'Meat': ['Hot dog'],
 'Vegetable': ['Celery'],
 'Dessert': ['brownies']}
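The collections.defaultdict variant hinted at in the comment above could look like this sketch. The "Spinach" entry is added here only to show two foods landing under the same type.

```python
from collections import defaultdict

foodTypes_dict = {
    "Apple": "Fruit",
    "Hot dog": "Meat",
    "Celery": "Vegetable",
    "brownies": "Dessert",
    "Spinach": "Vegetable",  # extra entry for illustration, not in the question
}

# Missing keys start out as an empty list, so no setdefault call is needed
grouped = defaultdict(list)
for food, food_type in foodTypes_dict.items():
    grouped[food_type].append(food)

other = dict(grouped)
print(other)
```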

You could then check whether the result of .list.set_intersection() has a .list.len() greater than 0.

df = pl.from_repr("""
┌───────┬────────┬──────────────────┐
│ Index ┆ Person ┆ Food Provided    │
│ ---   ┆ ---    ┆ ---              │
│ i64   ┆ str    ┆ str              │
╞═══════╪════════╪══════════════════╡
│ 1     ┆ Billy  ┆ Apple, Hot dog   │
│ 2     ┆ Suzy   ┆ Celery, brownies │
└───────┴────────┴──────────────────┘
""")

df.with_columns(
    (pl.col("Food Provided")
       .str.split(", ")
       .list.set_intersection(foods).list.len() > 0).alias(name) 
    for name, foods in other.items()
)

shape: (2, 7)
┌───────┬────────┬──────────────────┬───────┬───────┬───────────┬─────────┐
│ Index ┆ Person ┆ Food Provided    ┆ Fruit ┆ Meat  ┆ Vegetable ┆ Dessert │
│ ---   ┆ ---    ┆ ---              ┆ ---   ┆ ---   ┆ ---       ┆ ---     │
│ i64   ┆ str    ┆ str              ┆ bool  ┆ bool  ┆ bool      ┆ bool    │
╞═══════╪════════╪══════════════════╪═══════╪═══════╪═══════════╪═════════╡
│ 1     ┆ Billy  ┆ Apple, Hot dog   ┆ true  ┆ true  ┆ false     ┆ false   │
│ 2     ┆ Suzy   ┆ Celery, brownies ┆ false ┆ false ┆ true      ┆ true    │
└───────┴────────┴──────────────────┴───────┴───────┴───────────┴─────────┘

As has been mentioned, .with_columns() returns a new frame (which you are discarding).

Returns: A new DataFrame with the columns added.

Similar to how .assign() is used in Pandas - you need to actually store the result.

df = df.with_columns(...)
