admin管理员组

文章数量:1122832

I'm trying to replicate the repeat_by and explode functions from Polars using dbt with a Redshift database but am having difficulty finding an equivalent solution.

Here's the sample Polars DataFrame code for context:

NOTE: I don't need replicate column, if that's an issue, then don't worry about it

import polars as pl

# Sample DataFrame
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "city": ["New York", "Los Angeles", "Chicago"],
    "replicate": [2, 3, 1]
})

Polars DataFrame output:

shape: (3, 4)
┌─────────┬─────┬─────────────┬───────────┐
│ name    ┆ age ┆ city        ┆ replicate │
│ ---     ┆ --- ┆ ---         ┆ ---       │
│ str     ┆ i64 ┆ str         ┆ i64       │
╞═════════╪═════╪═════════════╪═══════════╡
│ Alice   ┆ 25  ┆ New York    ┆ 2         │
│ Bob     ┆ 30  ┆ Los Angeles ┆ 3         │
│ Charlie ┆ 35  ┆ Chicago     ┆ 1         │
└─────────┴─────┴─────────────┴───────────┘

Using the repeat_by and explode functions:

df.select(pl.all().repeat_by('replicate').explode())

Result:

shape: (6, 4)
┌─────────┬─────┬─────────────┬───────────┐
│ name    ┆ age ┆ city        ┆ replicate │
│ ---     ┆ --- ┆ ---         ┆ ---       │
│ str     ┆ i64 ┆ str         ┆ i64       │
╞═════════╪═════╪═════════════╪═══════════╡
│ Alice   ┆ 25  ┆ New York    ┆ 2         │
│ Alice   ┆ 25  ┆ New York    ┆ 2         │
│ Bob     ┆ 30  ┆ Los Angeles ┆ 3         │
│ Bob     ┆ 30  ┆ Los Angeles ┆ 3         │
│ Bob     ┆ 30  ┆ Los Angeles ┆ 3         │
│ Charlie ┆ 35  ┆ Chicago     ┆ 1         │
└─────────┴─────┴─────────────┴───────────┘

I need help finding a way to achieve the same result in dbt with Redshift.

I'm trying to replicate the repeat_by and explode functions from Polars using dbt with a Redshift database but am having difficulty finding an equivalent solution.

Here's the sample Polars DataFrame code for context:

NOTE: I don't need replicate column, if that's an issue, then don't worry about it

import polars as pl

# Sample DataFrame
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "city": ["New York", "Los Angeles", "Chicago"],
    "replicate": [2, 3, 1]
})

Polars DataFrame output:

shape: (3, 4)
┌─────────┬─────┬─────────────┬───────────┐
│ name    ┆ age ┆ city        ┆ replicate │
│ ---     ┆ --- ┆ ---         ┆ ---       │
│ str     ┆ i64 ┆ str         ┆ i64       │
╞═════════╪═════╪═════════════╪═══════════╡
│ Alice   ┆ 25  ┆ New York    ┆ 2         │
│ Bob     ┆ 30  ┆ Los Angeles ┆ 3         │
│ Charlie ┆ 35  ┆ Chicago     ┆ 1         │
└─────────┴─────┴─────────────┴───────────┘

Using the repeat_by and explode functions:

df.select(pl.all().repeat_by('replicate').explode())

Result:

shape: (6, 4)
┌─────────┬─────┬─────────────┬───────────┐
│ name    ┆ age ┆ city        ┆ replicate │
│ ---     ┆ --- ┆ ---         ┆ ---       │
│ str     ┆ i64 ┆ str         ┆ i64       │
╞═════════╪═════╪═════════════╪═══════════╡
│ Alice   ┆ 25  ┆ New York    ┆ 2         │
│ Alice   ┆ 25  ┆ New York    ┆ 2         │
│ Bob     ┆ 30  ┆ Los Angeles ┆ 3         │
│ Bob     ┆ 30  ┆ Los Angeles ┆ 3         │
│ Bob     ┆ 30  ┆ Los Angeles ┆ 3         │
│ Charlie ┆ 35  ┆ Chicago     ┆ 1         │
└─────────┴─────┴─────────────┴───────────┘

I need help finding a way to achieve the same result in dbt with Redshift.

Share Improve this question edited Nov 24, 2024 at 18:23 pppery 3,78425 gold badges37 silver badges50 bronze badges asked Nov 21, 2024 at 19:43 Akmal SolievAkmal Soliev 7428 silver badges24 bronze badges
Add a comment  | 

1 Answer 1

Reset to default 2

I used JOIN and a CROSS JOIN since you mentioned generate_series is not supported.

I tested this in Snowflake , since I dont have redshift. I guess the syntax might not be the exact same, but still should work.

let me know if it works for you.

SET tbl_count = (SELECT max(replicate) FROM my_table);

WITH numbers AS (
    SELECT 
        row_number() OVER ( order by 1) AS num
    FROM 
        (SELECT 1 FROM my_table LIMIT $tbl_count ) )
        ,
repeated_data AS (
    SELECT 
        t.name,
        t.age,
        t.city,
        t.replicate
    FROM 
        my_table t
    CROSS JOIN numbers
    WHERE numbers.num <= t.replicate
)
SELECT 
    name, 
    age, 
    city,
    replicate

FROM 
    repeated_data
ORDER BY name, age;

Trying with Recursion since using Limit variable is tricky in Redshift

WITH RECURSIVE numbers AS (
    -- Base case: start from 1
    SELECT 1 AS num
    UNION ALL
    -- Recursive case: increment the number by 1 each time
    SELECT num + 1
    FROM numbers
    WHERE num < (SELECT max(replicate) FROM my_table) -- Stop when num reaches max(replicate)
),
repeated_data AS (
    SELECT 
        t.name,
        t.age,
        t.city,
        t.replicate
    FROM 
        my_table t
    CROSS JOIN numbers
    WHERE numbers.num <= t.replicate
)
SELECT 
    name, 
    age, 
    city,
    replicate
FROM 
    repeated_data
ORDER BY 
    name, age;

Output :

本文标签: sqlExplode values in dbtStack Overflow