admin管理员组

文章数量:1279124

So I got two csv which I load as polars frames:

left:

track_name,type,yield,group
8CEB45v1,corn,0.146957,A
A188v2,corn,0.86308,A
B73v6,corn,0.326076,A
CI6621v1,sweetcorn,0.0357792,A
CML103v1,sweetcorn,0.510464,A

right:

track_name,type,yield,group
8CEB45v1,corn,0.999,A
B1234,pepper,1,B
B1235,pepper,2,B

my code so far:

import polars as pl

left = pl.read_csv("left.csv")
right = pl.read_csv("right.csv")

matching_columns = list(set(left.columns) & set(right.columns))  # I do this since I want to join sometimes frame which does not have a 100 % column match. In that case I want to simply add the new columns to the outer frame.

outer = left.join(
            right,
            on=matching_columns,
            how="outer",
            coalesce=True,
            maintain_order="left",
        )
outer

my result:

shape: (8, 4)
┌────────────┬───────────┬───────────┬───────┐
│ track_name ┆ type      ┆ yield     ┆ group │
│ ---        ┆ ---       ┆ ---       ┆ ---   │
│ str        ┆ str       ┆ f64       ┆ str   │
╞════════════╪═══════════╪═══════════╪═══════╡
│ 8CEB45v1   ┆ corn      ┆ 0.146957  ┆ A     │
│ A188v2     ┆ corn      ┆ 0.86308   ┆ A     │
│ B73v6      ┆ corn      ┆ 0.326076  ┆ A     │
│ CI6621v1   ┆ sweetcorn ┆ 0.0357792 ┆ A     │
│ CML103v1   ┆ sweetcorn ┆ 0.510464  ┆ A     │
│ B1234      ┆ pepper    ┆ 1.0       ┆ B     │
│ B1235      ┆ pepper    ┆ 2.0       ┆ B     │
│ 8CEB45v1   ┆ corn      ┆ 0.999     ┆ A     │
└────────────┴───────────┴───────────┴───────┘

my desired output: (yield of 8CEB45v1 from right updates value of left)

shape: (7, 4)
┌────────────┬───────────┬───────────┬───────┐
│ track_name ┆ type      ┆ yield     ┆ group │
│ ---        ┆ ---       ┆ ---       ┆ ---   │
│ str        ┆ str       ┆ f64       ┆ str   │
╞════════════╪═══════════╪═══════════╪═══════╡
│ 8CEB45v1   ┆ corn      ┆ 0.999     ┆ A     │
│ A188v2     ┆ corn      ┆ 0.86308   ┆ A     │
│ B73v6      ┆ corn      ┆ 0.326076  ┆ A     │
│ CI6621v1   ┆ sweetcorn ┆ 0.0357792 ┆ A     │
│ CML103v1   ┆ sweetcorn ┆ 0.510464  ┆ A     │
│ B1234      ┆ pepper    ┆ 1.0       ┆ B     │
│ B1235      ┆ pepper    ┆ 2.0       ┆ B     │
└────────────┴───────────┴───────────┴───────┘

So I got two csv which I load as polars frames:

left:

track_name,type,yield,group
8CEB45v1,corn,0.146957,A
A188v2,corn,0.86308,A
B73v6,corn,0.326076,A
CI6621v1,sweetcorn,0.0357792,A
CML103v1,sweetcorn,0.510464,A

right:

track_name,type,yield,group
8CEB45v1,corn,0.999,A
B1234,pepper,1,B
B1235,pepper,2,B

my code so far:

import polars as pl

left = pl.read_csv("left.csv")
right = pl.read_csv("right.csv")

matching_columns = list(set(left.columns) & set(right.columns))  # I do this since I want to join sometimes frame which does not have a 100 % column match. In that case I want to simply add the new columns to the outer frame.

outer = left.join(
            right,
            on=matching_columns,
            how="outer",
            coalesce=True,
            maintain_order="left",
        )
outer

my result:

shape: (8, 4)
┌────────────┬───────────┬───────────┬───────┐
│ track_name ┆ type      ┆ yield     ┆ group │
│ ---        ┆ ---       ┆ ---       ┆ ---   │
│ str        ┆ str       ┆ f64       ┆ str   │
╞════════════╪═══════════╪═══════════╪═══════╡
│ 8CEB45v1   ┆ corn      ┆ 0.146957  ┆ A     │
│ A188v2     ┆ corn      ┆ 0.86308   ┆ A     │
│ B73v6      ┆ corn      ┆ 0.326076  ┆ A     │
│ CI6621v1   ┆ sweetcorn ┆ 0.0357792 ┆ A     │
│ CML103v1   ┆ sweetcorn ┆ 0.510464  ┆ A     │
│ B1234      ┆ pepper    ┆ 1.0       ┆ B     │
│ B1235      ┆ pepper    ┆ 2.0       ┆ B     │
│ 8CEB45v1   ┆ corn      ┆ 0.999     ┆ A     │
└────────────┴───────────┴───────────┴───────┘

my desired output: (yield of 8CEB45v1 from right updates value of left)

shape: (7, 4)
┌────────────┬───────────┬───────────┬───────┐
│ track_name ┆ type      ┆ yield     ┆ group │
│ ---        ┆ ---       ┆ ---       ┆ ---   │
│ str        ┆ str       ┆ f64       ┆ str   │
╞════════════╪═══════════╪═══════════╪═══════╡
│ 8CEB45v1   ┆ corn      ┆ 0.999     ┆ A     │
│ A188v2     ┆ corn      ┆ 0.86308   ┆ A     │
│ B73v6      ┆ corn      ┆ 0.326076  ┆ A     │
│ CI6621v1   ┆ sweetcorn ┆ 0.0357792 ┆ A     │
│ CML103v1   ┆ sweetcorn ┆ 0.510464  ┆ A     │
│ B1234      ┆ pepper    ┆ 1.0       ┆ B     │
│ B1235      ┆ pepper    ┆ 2.0       ┆ B     │
└────────────┴───────────┴───────────┴───────┘
Share Improve this question edited Feb 24 at 17:53 jqurious 21.6k4 gold badges20 silver badges39 bronze badges asked Feb 24 at 17:51 Pm740Pm740 4233 silver badges14 bronze badges 1
  • There is a request for a similar "outer update" github/pola-rs/polars/issues/17509 + their question stackoverflow/questions/78707931 - But in this case, how do we know what columns should be updated? What if 8CEB45v1 had multiple entries in left that match right? – jqurious Commented Feb 25 at 14:27
Add a comment  | 

1 Answer 1

Reset to default 1

The thing you're doing wrong is including yield in matching_columns. You don't want to match by it, you want it as a value.

One idea to reconcile that would be

matching_columns = list(set(left.select(pl.col(pl.String)).columns) & 
                        set(right.select(pl.col(pl.String)).columns))

Alternatively, you could start with your way but then remove f64 columns. It really just depends on your data.

matching_columns = set(left.columns) & set(right.columns)
matching_columns -= (
set(left.select(matching_columns).select(pl.col(pl.Float64)).columns)
)

Once you have your matching_columns established, you can use the built in update:

left.update(right, on=matching_columns, how="full")

shape: (7, 4)
┌────────────┬───────────┬───────────┬───────┐
│ track_name ┆ type      ┆ yield     ┆ group │
│ ---        ┆ ---       ┆ ---       ┆ ---   │
│ str        ┆ str       ┆ f64       ┆ str   │
╞════════════╪═══════════╪═══════════╪═══════╡
│ 8CEB45v1   ┆ corn      ┆ 0.999     ┆ A     │
│ A188v2     ┆ corn      ┆ 0.86308   ┆ A     │
│ B73v6      ┆ corn      ┆ 0.326076  ┆ A     │
│ CI6621v1   ┆ sweetcorn ┆ 0.0357792 ┆ A     │
│ CML103v1   ┆ sweetcorn ┆ 0.510464  ┆ A     │
│ B1235      ┆ pepper    ┆ 2.0       ┆ B     │
│ B1234      ┆ pepper    ┆ 1.0       ┆ B     │
└────────────┴───────────┴───────────┴───────┘

Note: you don't have to convert the python set into a list, polars is happy to accept a set as input. Polars will accept a set at runtime but the type checker will complain about because polars has Sequence as the annotation. If you're using join with on then this won't really matter. If you're using left_on and right_on then it might get you in trouble because sets don't maintain order and the order of each of those inputs is how they're used.

Response to comment.

If you have extra columns in right that you want the output to keep then you're better off doing the update yourself with a join and a coalesce. I say that because the update function is doing that under the hood but is dropping those extra columns. It's probably worth a feature request that update provide an option to keep extra columns in right.

Anyways, here's the code

New setup

left = pl.DataFrame(
    [
        pl.Series('track_name',['8CEB45v1','A188v2','B73v6','CI6621v1','CML103v1']),
        pl.Series('type',['corn','corn','corn','sweetcorn', 'sweetcorn']),
        pl.Series('yield', [0.146957,0.86308,0.326076,0.0357792,0.510464]),
        pl.Series('group',['A','A','A','A','A']),
    ]
)


right = pl.DataFrame(
    [
        pl.Series('track_name',['8CEB45v1','B1234','B1235']),
        pl.Series('type',['corn','pepper','pepper']),
        pl.Series('yield',[0.999,1.0,2.0]),
        pl.Series('group',['A','B','B']),
        pl.Series("fruit",["apple","banana","carrot"])
    ]
)

the work

common_columns = set(left.columns) & set(right.columns)

join_columns = common_columns - set(left.select(pl.col(pl.Float64)).columns)

update_columns = common_columns - join_columns

extra_right_columns = set(right.columns) - common_columns

(
    left
    .join(right, on=list(join_columns), how="full", coalesce=True)
    .select(
        *join_columns,
        *[pl.coalesce(f"{name}_right", name).alias(name)
        for name in update_columns],
        *extra_right_columns
    )
)
shape: (7, 5)
┌───────────┬───────┬────────────┬───────────┬────────┐
│ type      ┆ group ┆ track_name ┆ yield     ┆ fruit  │
│ ---       ┆ ---   ┆ ---        ┆ ---       ┆ ---    │
│ str       ┆ str   ┆ str        ┆ f64       ┆ str    │
╞═══════════╪═══════╪════════════╪═══════════╪════════╡
│ corn      ┆ A     ┆ 8CEB45v1   ┆ 0.999     ┆ apple  │
│ corn      ┆ A     ┆ A188v2     ┆ 0.86308   ┆ null   │
│ corn      ┆ A     ┆ B73v6      ┆ 0.326076  ┆ null   │
│ sweetcorn ┆ A     ┆ CI6621v1   ┆ 0.0357792 ┆ null   │
│ sweetcorn ┆ A     ┆ CML103v1   ┆ 0.510464  ┆ null   │
│ pepper    ┆ B     ┆ B1235      ┆ 2.0       ┆ carrot │
│ pepper    ┆ B     ┆ B1234      ┆ 1.0       ┆ banana │
└───────────┴───────┴────────────┴───────────┴────────┘

本文标签: pythonHow to outer joinmerge two frames with polars while updating left with right valuesStack Overflow