admin管理员组文章数量:1279124
So I got two csv which I load as polars frames:
left:
track_name,type,yield,group
8CEB45v1,corn,0.146957,A
A188v2,corn,0.86308,A
B73v6,corn,0.326076,A
CI6621v1,sweetcorn,0.0357792,A
CML103v1,sweetcorn,0.510464,A
right:
track_name,type,yield,group
8CEB45v1,corn,0.999,A
B1234,pepper,1,B
B1235,pepper,2,B
my code so far:
import polars as pl
left = pl.read_csv("left.csv")
right = pl.read_csv("right.csv")
matching_columns = list(set(left.columns) & set(right.columns)) # I do this since I want to join sometimes frame which does not have a 100 % column match. In that case I want to simply add the new columns to the outer frame.
outer = left.join(
right,
on=matching_columns,
how="outer",
coalesce=True,
maintain_order="left",
)
outer
my result:
shape: (8, 4)
┌────────────┬───────────┬───────────┬───────┐
│ track_name ┆ type ┆ yield ┆ group │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ str │
╞════════════╪═══════════╪═══════════╪═══════╡
│ 8CEB45v1 ┆ corn ┆ 0.146957 ┆ A │
│ A188v2 ┆ corn ┆ 0.86308 ┆ A │
│ B73v6 ┆ corn ┆ 0.326076 ┆ A │
│ CI6621v1 ┆ sweetcorn ┆ 0.0357792 ┆ A │
│ CML103v1 ┆ sweetcorn ┆ 0.510464 ┆ A │
│ B1234 ┆ pepper ┆ 1.0 ┆ B │
│ B1235 ┆ pepper ┆ 2.0 ┆ B │
│ 8CEB45v1 ┆ corn ┆ 0.999 ┆ A │
└────────────┴───────────┴───────────┴───────┘
my desired output: (yield
of 8CEB45v1
from right
updates value of left
)
shape: (7, 4)
┌────────────┬───────────┬───────────┬───────┐
│ track_name ┆ type ┆ yield ┆ group │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ str │
╞════════════╪═══════════╪═══════════╪═══════╡
│ 8CEB45v1 ┆ corn ┆ 0.999 ┆ A │
│ A188v2 ┆ corn ┆ 0.86308 ┆ A │
│ B73v6 ┆ corn ┆ 0.326076 ┆ A │
│ CI6621v1 ┆ sweetcorn ┆ 0.0357792 ┆ A │
│ CML103v1 ┆ sweetcorn ┆ 0.510464 ┆ A │
│ B1234 ┆ pepper ┆ 1.0 ┆ B │
│ B1235 ┆ pepper ┆ 2.0 ┆ B │
└────────────┴───────────┴───────────┴───────┘
So I got two csv which I load as polars frames:
left:
track_name,type,yield,group
8CEB45v1,corn,0.146957,A
A188v2,corn,0.86308,A
B73v6,corn,0.326076,A
CI6621v1,sweetcorn,0.0357792,A
CML103v1,sweetcorn,0.510464,A
right:
track_name,type,yield,group
8CEB45v1,corn,0.999,A
B1234,pepper,1,B
B1235,pepper,2,B
my code so far:
import polars as pl
left = pl.read_csv("left.csv")
right = pl.read_csv("right.csv")
matching_columns = list(set(left.columns) & set(right.columns)) # I do this since I want to join sometimes frame which does not have a 100 % column match. In that case I want to simply add the new columns to the outer frame.
outer = left.join(
right,
on=matching_columns,
how="outer",
coalesce=True,
maintain_order="left",
)
outer
my result:
shape: (8, 4)
┌────────────┬───────────┬───────────┬───────┐
│ track_name ┆ type ┆ yield ┆ group │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ str │
╞════════════╪═══════════╪═══════════╪═══════╡
│ 8CEB45v1 ┆ corn ┆ 0.146957 ┆ A │
│ A188v2 ┆ corn ┆ 0.86308 ┆ A │
│ B73v6 ┆ corn ┆ 0.326076 ┆ A │
│ CI6621v1 ┆ sweetcorn ┆ 0.0357792 ┆ A │
│ CML103v1 ┆ sweetcorn ┆ 0.510464 ┆ A │
│ B1234 ┆ pepper ┆ 1.0 ┆ B │
│ B1235 ┆ pepper ┆ 2.0 ┆ B │
│ 8CEB45v1 ┆ corn ┆ 0.999 ┆ A │
└────────────┴───────────┴───────────┴───────┘
my desired output: (yield
of 8CEB45v1
from right
updates value of left
)
shape: (7, 4)
┌────────────┬───────────┬───────────┬───────┐
│ track_name ┆ type ┆ yield ┆ group │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ str │
╞════════════╪═══════════╪═══════════╪═══════╡
│ 8CEB45v1 ┆ corn ┆ 0.999 ┆ A │
│ A188v2 ┆ corn ┆ 0.86308 ┆ A │
│ B73v6 ┆ corn ┆ 0.326076 ┆ A │
│ CI6621v1 ┆ sweetcorn ┆ 0.0357792 ┆ A │
│ CML103v1 ┆ sweetcorn ┆ 0.510464 ┆ A │
│ B1234 ┆ pepper ┆ 1.0 ┆ B │
│ B1235 ┆ pepper ┆ 2.0 ┆ B │
└────────────┴───────────┴───────────┴───────┘
Share
Improve this question
edited Feb 24 at 17:53
jqurious
21.6k4 gold badges20 silver badges39 bronze badges
asked Feb 24 at 17:51
Pm740Pm740
4233 silver badges14 bronze badges
1
|
1 Answer
Reset to default 1The thing you're doing wrong is including yield
in matching_columns
. You don't want to match by it, you want it as a value.
One idea to reconcile that would be
matching_columns = list(set(left.select(pl.col(pl.String)).columns) &
set(right.select(pl.col(pl.String)).columns))
Alternatively, you could start with your way but then remove f64 columns. It really just depends on your data.
matching_columns = set(left.columns) & set(right.columns)
matching_columns -= (
set(left.select(matching_columns).select(pl.col(pl.Float64)).columns)
)
Once you have your matching_columns
established, you can use the built in update
:
left.update(right, on=matching_columns, how="full")
shape: (7, 4)
┌────────────┬───────────┬───────────┬───────┐
│ track_name ┆ type ┆ yield ┆ group │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ str │
╞════════════╪═══════════╪═══════════╪═══════╡
│ 8CEB45v1 ┆ corn ┆ 0.999 ┆ A │
│ A188v2 ┆ corn ┆ 0.86308 ┆ A │
│ B73v6 ┆ corn ┆ 0.326076 ┆ A │
│ CI6621v1 ┆ sweetcorn ┆ 0.0357792 ┆ A │
│ CML103v1 ┆ sweetcorn ┆ 0.510464 ┆ A │
│ B1235 ┆ pepper ┆ 2.0 ┆ B │
│ B1234 ┆ pepper ┆ 1.0 ┆ B │
└────────────┴───────────┴───────────┴───────┘
Note: you don't have to convert the python set into a list, polars is happy to accept a set as input.
Polars will accept a set at runtime but the type checker will complain about because polars has Sequence as the annotation. If you're using join
with on
then this won't really matter. If you're using left_on
and right_on
then it might get you in trouble because sets don't maintain order and the order of each of those inputs is how they're used.
Response to comment.
If you have extra columns in right
that you want the output to keep then you're better off doing the update
yourself with a join and a coalesce. I say that because the update
function is doing that under the hood but is dropping those extra columns. It's probably worth a feature request that update provide an option to keep extra columns in right.
Anyways, here's the code
New setup
left = pl.DataFrame(
[
pl.Series('track_name',['8CEB45v1','A188v2','B73v6','CI6621v1','CML103v1']),
pl.Series('type',['corn','corn','corn','sweetcorn', 'sweetcorn']),
pl.Series('yield', [0.146957,0.86308,0.326076,0.0357792,0.510464]),
pl.Series('group',['A','A','A','A','A']),
]
)
right = pl.DataFrame(
[
pl.Series('track_name',['8CEB45v1','B1234','B1235']),
pl.Series('type',['corn','pepper','pepper']),
pl.Series('yield',[0.999,1.0,2.0]),
pl.Series('group',['A','B','B']),
pl.Series("fruit",["apple","banana","carrot"])
]
)
the work
common_columns = set(left.columns) & set(right.columns)
join_columns = common_columns - set(left.select(pl.col(pl.Float64)).columns)
update_columns = common_columns - join_columns
extra_right_columns = set(right.columns) - common_columns
(
left
.join(right, on=list(join_columns), how="full", coalesce=True)
.select(
*join_columns,
*[pl.coalesce(f"{name}_right", name).alias(name)
for name in update_columns],
*extra_right_columns
)
)
shape: (7, 5)
┌───────────┬───────┬────────────┬───────────┬────────┐
│ type ┆ group ┆ track_name ┆ yield ┆ fruit │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ f64 ┆ str │
╞═══════════╪═══════╪════════════╪═══════════╪════════╡
│ corn ┆ A ┆ 8CEB45v1 ┆ 0.999 ┆ apple │
│ corn ┆ A ┆ A188v2 ┆ 0.86308 ┆ null │
│ corn ┆ A ┆ B73v6 ┆ 0.326076 ┆ null │
│ sweetcorn ┆ A ┆ CI6621v1 ┆ 0.0357792 ┆ null │
│ sweetcorn ┆ A ┆ CML103v1 ┆ 0.510464 ┆ null │
│ pepper ┆ B ┆ B1235 ┆ 2.0 ┆ carrot │
│ pepper ┆ B ┆ B1234 ┆ 1.0 ┆ banana │
└───────────┴───────┴────────────┴───────────┴────────┘
版权声明:本文标题:python - How to outer joinmerge two frames with polars while updating left with right values? - Stack Overflow 内容由网友自发贡献,该文观点仅代表作者本人, 转载请联系作者并注明出处:http://www.betaflare.com/web/1741249697a2365562.html, 本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌抄袭侵权/违法违规的内容,一经查实,本站将立刻删除。
8CEB45v1
had multiple entries inleft
that matchright
? – jqurious Commented Feb 25 at 14:27