admin管理员组

文章数量:1388822

Problem statement: I am working on a reporting dashboard in Tableau where each record belongs to one status_category. However, I need to display a common aggregated count (processed_count) across all categories without causing row explosion.

For example, if 1,200 transactions were processed, I want that number to appear in every category (eg - "Pending," "In Progress," etc.), even for records that didn’t have a processed status.

Approaches attempted:

  • Extracted distinct status categories
  • Extracted distinct grouping dimensions (that are used as filters in my dashboard)
  • CROSS JOINed them to ensure every status appears for each unique combination of the grouping columns
  • LEFT JOINed the result back to the main dataset

Issue: this worked correctly, but caused massive row explosion

  • Original dataset: 600k rows
  • After applying this approach: 2.4M rows

This impacts performance in our database and visualization tool by making queries inefficient and introducing unnecessary data bloat.

Sample output:

Region Department Segment Cycle_Month Status_Category Processed_Count Record_Count
North Finance Large Jan Processed in Current Cycle 1,200 500
North Finance Large Jan Pending 1,200 300
North Finance Large Jan In Progress 1,200 400

Problem statement: I am working on a reporting dashboard in Tableau where each record belongs to one status_category. However, I need to display a common aggregated count (processed_count) across all categories without causing row explosion.

For example, if 1,200 transactions were processed, I want that number to appear in every category (eg - "Pending," "In Progress," etc.), even for records that didn’t have a processed status.

Approaches attempted:

  • Extracted distinct status categories
  • Extracted distinct grouping dimensions (that are used as filters in my dashboard)
  • CROSS JOINed them to ensure every status appears for each unique combination of the grouping columns
  • LEFT JOINed the result back to the main dataset

Issue: this worked correctly, but caused massive row explosion

  • Original dataset: 600k rows
  • After applying this approach: 2.4M rows

This impacts performance in our database and visualization tool by making queries inefficient and introducing unnecessary data bloat.

Sample output:

Region Department Segment Cycle_Month Status_Category Processed_Count Record_Count
North Finance Large Jan Processed in Current Cycle 1,200 500
North Finance Large Jan Pending 1,200 300
North Finance Large Jan In Progress 1,200 400

Screenshot

Query:

-- Aggregate Data by Dimensions
WITH aggregated_data AS 
(
    SELECT
        pd.cycle_month,
        pd.region_id,
        pd.region_name,
        pd.department,
        pd.segment,
        pd.team_id,
        pd.team_name,
        pd.division_id,
        pd.division_name,
        pd.manager_id,
        pd.manager_name,
        pd.status_category,
        COUNT(DISTINCT pd.record_id) AS record_count
    FROM processed_data_table pd
    GROUP BY ALL
),

-- Extract Unique Status Categories
distinct_statuses AS (
    SELECT DISTINCT status_category
    FROM aggregated_data
),

-- Extract Unique Dimension Combinations (excluding status_category)
distinct_dimensions AS (
    SELECT DISTINCT
        cycle_month,
        region_id,
        region_name,
        department,
        segment,
        team_id,
        team_name,
        division_id,
        division_name,
        manager_id,
        manager_name
    FROM aggregated_data
),

-- Create a Full Status-Dimension Combination
status_dimension_mapping AS (
    SELECT
        d.*,  
        s.status_category
    FROM distinct_dimensions d
    CROSS JOIN distinct_statuses s
),

-- Merge Actual Record Counts While Retaining All Statuses
data_with_defaults AS (
    SELECT
        sdm.*,
        COALESCE(ad.record_count, 0) AS record_count
    FROM status_dimension_mapping sdm
    LEFT JOIN aggregated_data ad
        ON sdm.cycle_month = ad.cycle_month
        AND sdm.region_id = ad.region_id
        AND sdm.department = ad.department
        AND sdm.segment = ad.segment
        AND sdm.team_id = ad.team_id
        AND sdm.division_id = ad.division_id
        AND sdm.manager_id = ad.manager_id
        AND sdm.status_category = ad.status_category
),

-- Assign Processed Count Across All Statuses
output AS (
    SELECT
        *,
        MAX(CASE WHEN status_category = 'Processed in Current Cycle' THEN record_count END)
            OVER (PARTITION BY cycle_month, region_id, department, segment,
                             team_id, division_id, manager_id
                             )
            AS processed_count
    FROM data_with_defaults
)
SELECT * FROM output;
Share Improve this question edited Mar 15 at 8:29 marc_s 756k184 gold badges1.4k silver badges1.5k bronze badges asked Mar 15 at 0:22 blue thunderblue thunder 32 bronze badges 1
  • Please share sample input, expected sample output.minimal-reproducible-example – samhita Commented Mar 15 at 9:46
Add a comment  | 

1 Answer 1

Reset to default 0

This is based on assumption due to limited information about sample input data and columns.

Based on my understanding you need a view showing count of each status for a group of dimension,alongside count of processed status for the same group of dimensions.

If the understanding correct, you can just find count of each status category and find count of processed category like below .

SELECT
    pd.cycle_month,
    pd.region_id,
    pd.region_name,
    pd.department,
    pd.segment,
    pd.team_id,
    pd.team_name,
    pd.division_id,
    pd.division_name,
    pd.manager_id,
    pd.manager_name,
    pd.status_category,
    COUNT(DISTINCT pd.record_id) AS status_count,
    MAX(CASE WHEN pd.status_category = 'Processed' THEN COUNT(DISTINCT pd.record_id) END)
        OVER (PARTITION BY pd.cycle_month, pd.region_id, pd.department, pd.segment, 
                             pd.team_id, pd.division_id, pd.manager_id) AS processed_count
FROM
    processed_data_table pd
GROUP BY
    pd.cycle_month, pd.region_id, pd.region_name, pd.department, pd.segment,
    pd.team_id, pd.team_name, pd.division_id, pd.division_name,
    pd.manager_id, pd.manager_name, pd.status_category;

本文标签: