It is clear from this question that OR conditions in SQL joins are likely to be inefficient because they force nested loops and cannot be optimised as a hash or merge join. The recommended solution is to replace them with UNION statements.

The query I am trying to optimise currently contains a series of OR conditions of the form: a.col = b.col OR a.col IS NULL. For a million rows, we are estimating runtimes measured in hours.

However, as there could be a dozen of these conditions, converting this into the equivalent UNION statements is likely to be cumbersome - with each OR condition doubling the number of subqueries that have to be UNIONed. So I am keen to find a better alternative.
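To illustrate the growth, here is a short Python sketch (purely illustrative, just enumerating branches) of the predicates a full UNION rewrite would need: each nullable column doubles the count.

```python
from itertools import product

def union_branches(cols):
    """Enumerate the join predicates a UNION rewrite would need:
    every nullable column doubles the count, since each branch must
    pick either 'a.col = b.col' or 'a.col IS NULL'."""
    branches = []
    for choices in product([False, True], repeat=len(cols)):
        preds = [
            f"a.{c} IS NULL" if is_null else f"a.{c} = b.{c}"
            for c, is_null in zip(cols, choices)
        ]
        branches.append(" AND ".join(preds))
    return branches

print(len(union_branches(["col1", "col2", "col3"])))  # 8 branches
print(len(union_branches([f"col{i}" for i in range(1, 13)])))  # 4096 branches for a dozen columns
```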

An example of what I am currently working with:

SELECT  *
FROM #table_a AS a
LEFT JOIN #table_b AS b
ON  (a.col1 = b.col1 OR a.col1 IS NULL)
AND  (a.col2 = b.col2 OR a.col2 IS NULL)
AND  (a.col3 = b.col3 OR a.col3 IS NULL)

I have considered using COALESCE in the join, something like the following. This removes OR from the condition, but I expect that the need to calculate the COALESCE value will also prevent the join from being optimised.

SELECT  *
FROM #table_a AS a
LEFT JOIN #table_b AS b
ON  COALESCE(a.col1, b.col1, -1) = COALESCE(b.col1, -1)
AND  COALESCE(a.col2, b.col2, -1) = COALESCE(b.col2, -1)
AND  COALESCE(a.col3, b.col3, -1) = COALESCE(b.col3, -1)

How can I approach optimising this join pattern?


A few clarifying details in response to comments:

  • While the current query allows for the case where all a cols are null, in most instances I would expect that every a col contains some nulls and every a row contains some nulls, but that no a col or row is completely null.
  • There are a couple of things I might use this pattern for. One is to join subtotals to their supertotal. For example, table_b might contain the counts for each combination of age group, region, and country of birth. Then when table_a has nulls in the age column, that row would contain the total for a region and a country of birth - over all age groups.
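To make the intended semantics concrete: a NULL in an a column should act as a wildcard. In Python terms (None standing in for NULL; column values are illustrative):

```python
def matches(a_row, b_row):
    """A None (NULL) in an a-column acts as a wildcard: such a row
    should join to every b value in that column, i.e. it represents
    the subtotal over that dimension."""
    return all(a is None or a == b for a, b in zip(a_row, b_row))

# (age_group, region, country): None in age_group = total over all age groups
print(matches((None, "North", "NZ"), (18, "North", "NZ")))  # True
print(matches((None, "North", "NZ"), (65, "North", "NZ")))  # True
print(matches((25, "North", "NZ"), (18, "North", "NZ")))    # False
```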

  • 2 I don't believe there is any way around that issue - you are correct that coalesce causes its own issues. – Dale K Commented yesterday
  • 1 Do you really have potential cross join (ie. all a cols are null)? Somehow i doubt it – siggemannen Commented yesterday
  • Yes, COALESCE prevents the use of indexes. There's no way to avoid null checking either. NULL in SQL itself means the value is UNKNOWN. Not missing or default, it's completely unknown. When you compare two unknown values in any way, the result itself is UNKNOWN. – Panagiotis Kanavos Commented yesterday
  • 1 What's the actual logic in this query? Joining with multiple "optional" columns? Trying to match a logic table with NULLs as wildcards? What's the percentage of NULL rows and how many rows are in each table? 1M rows is NOT a lot of data, assuming the columns are indexed. Using NULL as the wildcard isn't the best idea though. In a possibly similar situation I used a logic table with -1 as the wildcard. In another case, I used the original logic table to calculate all possible options and thus avoid the OR default clause entirely – Panagiotis Kanavos Commented yesterday
  • Assuming #table_a contains the "values to search for" (one or few rows) and table #table_b contains the "data to search" (many rows), your best bet may be to use dynamic SQL on the server side or LINQ on the client side. Both can just omit any search filter conditions where the search value is null. This allows SQL Server to efficiently build an execution plan that makes optimal use of available indexes. If you have multiple "values to search for" rows with a heterogenous mix of null and non-null search values, it may be best to loop to match one at a time before combining results. – T N Commented 10 hours ago

3 Answers


For the following answer, I have renamed your tables as SearchCriteria and DataToSearch for improved readability.

I am assuming that SearchCriteria contains one or a few rows, while DataToSearch contains many rows. It is also assumed that appropriate indexes are defined on the DataToSearch table to support reasonably efficient lookups for various combinations of criteria.

You can generate dynamic SQL (one query per criteria row) that applies just the relevant filter conditions. Non-null values generate a filter condition that is added to the WHERE clause. Null search values are skipped. The results from all of the generated queries can then be combined using UNION ALL.

DECLARE @NL CHAR = CHAR(10) -- newline
DECLARE @UnionAll NVARCHAR(100) = @NL + 'UNION ALL' + @NL

DECLARE @sql NVARCHAR(MAX) = (
    SELECT STRING_AGG(Q.Sql, @UnionAll) WITHIN GROUP(ORDER BY S.search_id)
    --SELECT *
    FROM SearchCriteria S
    CROSS APPLY (
        SELECT CONCAT(
            CAST('' AS NVARCHAR(MAX)),
            'SELECT ', S.search_id, ' AS search_id, D.data_id, D.col1, D.col2, D.col3',
            @NL, 'FROM DataToSearch D',
            @NL, 'WHERE 1 = 1',
            -- The following lines will either generate an "AND condition" line
            -- or null (no condition) for cases where the search value is null.
            -- Use the following for exact numeric values (excluding real/float)
            -- If used for text values, the following is open to SQL Injection
            (@NL + 'AND D.col1 = ' + CAST(S.col1 AS VARCHAR(30))),
            (@NL + 'AND D.col2 = ' + CAST(S.col2 AS VARCHAR(30))),
            (@NL + 'AND D.col3 = ' + CAST(S.col3 AS VARCHAR(30))),
            -- Use the following for text values (limit 128 characters)
            --(@NL + 'AND D.col1 = N' + QUOTENAME(S.col1, '''')),
            --(@NL + 'AND D.col2 = N' + QUOTENAME(S.col2, '''')),
            --(@NL + 'AND D.col3 = N' + QUOTENAME(S.col3, '''')),
            -- Use the following for text values (possibly longer than 128 characters)
            --(@NL + 'AND D.col1 = N''' + REPLACE(S.col1, '''', '''''') + ''''),
            --(@NL + 'AND D.col2 = N''' + REPLACE(S.col2, '''', '''''') + ''''),
            --(@NL + 'AND D.col3 = N''' + REPLACE(S.col3, '''', '''''') + ''''),
            -- Use an appropriate mix of the above, if the columns have mixed types
            ''
        ) AS Sql
    ) Q
)

PRINT @sql
EXEC sp_executesql @sql

When a criteria column is null, the generated AND column = value expression will also be null, and the CONCAT() function will quietly skip that snippet.

Extreme care must be taken when injecting values into dynamic SQL to ensure that they are properly quoted and sanitized. Integer and other exact numeric values are safe as-is, but text values must be carefully quoted and any embedded quotes doubled up. (For example, "O'Malley" becomes N'O''Malley'.) If you have any floating point values (SQL REAL or FLOAT types), you would need to cast them using CONVERT(VARCHAR(30), value, 3) to ensure maximum precision is preserved.
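As a sketch of the quoting rule (Python used for illustration; quote_sql_literal is a hypothetical helper, not part of any library):

```python
def quote_sql_literal(value):
    """Quote a text value for embedding in dynamic SQL: double any
    embedded single quotes, then wrap in N'...'. Returns None for a
    null input so a CONCAT-style builder drops the whole condition."""
    if value is None:
        return None
    return "N'" + str(value).replace("'", "''") + "'"

print(quote_sql_literal("O'Malley"))  # N'O''Malley'
print(quote_sql_literal(None))       # None
```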

A safer alternative is to use a cursor loop to generate and execute multiple parameterized queries that each add their results to a temp #Results table. The final results are then selected from that temp table.

CREATE TABLE #Results (
    search_id INT,
    data_id INT,
    col1 INT,
    col2 INT,
    col3 INT
)

DECLARE CSR CURSOR FAST_FORWARD FOR
    SELECT *
    FROM SearchCriteria
    ORDER BY search_id
OPEN CSR

DECLARE @search_id INT,  @col1 INT, @col2 INT, @col3 INT

FETCH NEXT FROM CSR INTO @search_id,  @col1, @col2, @col3
WHILE @@FETCH_STATUS = 0
BEGIN
    DECLARE @sql NVARCHAR(MAX) = CONCAT(
        CAST('' AS NVARCHAR(MAX))
        , 'INSERT INTO #Results'
        , CHAR(10) + 'SELECT @search_id, D.data_id, D.col1, D.col2, D.col3'
        , CHAR(10) + 'FROM DataToSearch D'
        , CHAR(10) + 'WHERE 1 = 1'
        , CASE WHEN @col1 IS NOT NULL THEN CHAR(10) + 'AND D.col1 = @col1' END
        , CASE WHEN @col2 IS NOT NULL THEN CHAR(10) + 'AND D.col2 = @col2' END
        , CASE WHEN @col3 IS NOT NULL THEN CHAR(10) + 'AND D.col3 = @col3' END
    )

    DECLARE @params NVARCHAR(MAX) = '@search_id INT,  @col1 INT, @col2 INT, @col3 INT'

    PRINT @sql
    EXEC sp_executesql @sql, @params, @search_id, @col1, @col2, @col3

    FETCH NEXT FROM CSR INTO @search_id, @col1, @col2, @col3
END

CLOSE CSR
DEALLOCATE CSR

SELECT *
FROM #Results

DROP TABLE #Results

Both of the above should perform index seeks for nearly all cases, depending on index availability. Cursor and temp table overhead in the second case should be minimal.

UNION ALL in the above code can be changed to a simple UNION if you wish to eliminate duplicate results. (The search_id select item should also then be removed from the SQL for this to be effective.)

Given the following SearchCriteria data:

search_id  col1  col2  col3
1          1     2     3
2          4     5     null
3          null  6     7

The following Dynamic SQL is generated:

SELECT 1 AS search_id, D.data_id, D.col1, D.col2, D.col3
FROM DataToSearch D
WHERE 1 = 1
AND D.col1 = 1
AND D.col2 = 2
AND D.col3 = 3
UNION ALL
SELECT 2 AS search_id, D.data_id, D.col1, D.col2, D.col3
FROM DataToSearch D
WHERE 1 = 1
AND D.col1 = 4
AND D.col2 = 5
UNION ALL
SELECT 3 AS search_id, D.data_id, D.col1, D.col2, D.col3
FROM DataToSearch D
WHERE 1 = 1
AND D.col2 = 6
AND D.col3 = 7

The second version above generates the following sequence of queries, executing each using appropriate parameter values:

INSERT INTO #Results
SELECT @search_id, D.data_id, D.col1, D.col2, D.col3
FROM DataToSearch D
WHERE 1 = 1
AND D.col1 = @col1
AND D.col2 = @col2
AND D.col3 = @col3

INSERT INTO #Results
SELECT @search_id, D.data_id, D.col1, D.col2, D.col3
FROM DataToSearch D
WHERE 1 = 1
AND D.col1 = @col1
AND D.col2 = @col2

INSERT INTO #Results
SELECT @search_id, D.data_id, D.col1, D.col2, D.col3
FROM DataToSearch D
WHERE 1 = 1
AND D.col2 = @col2
AND D.col3 = @col3

Using some sample data-to-search that includes a mix of null and non-null values, the following results are obtained:

search_id  data_id  col1  col2  col3
1          313      1     2     3
2          46       4     5     null
2          146      4     5     1
2          246      4     5     2
2          346      4     5     3
...        ...      ...   ...   ...
3          707      null  6     7
3          717      1     6     7
3          727      2     6     7
3          737      3     6     7
...        ...      ...   ...   ...

See this db<>fiddle for a demo of both techniques, including the generated execution plans.

The originally posted queries also produce the same results, but at a much greater cost caused by multiple table scans.

Bonus:

If you have the option to code this up as a LINQ query, the following C# code might work (untested):

var query = SearchCriteria
    .Select(sc => {
        var innerQuery = context.DataToSearch.AsQueryable();
        if (sc.col1 != null)
        {
            innerQuery = innerQuery.Where(d => d.col1 == sc.col1);
        }
        if (sc.col2 != null)
        {
            innerQuery = innerQuery.Where(d => d.col2 == sc.col2);
        }
        if (sc.col3 != null)
        {
            innerQuery = innerQuery.Where(d => d.col3 == sc.col3);
        }
        return innerQuery;
    })
    .Aggregate((l, r) => l.Concat(r));  // Use .Union() to dedup

If you wish to dedup the results, the .Concat() can be changed to .Union() above. (I think this will be executed on the SQL server side. If not, you might need to define a comparator so that actual values, not references, are compared.)

You can filter the rows with NULL values before the join

SELECT *
FROM #table_a AS a
LEFT JOIN #table_b AS b
ON a.col1 = b.col1 AND a.col2 = b.col2 AND a.col3 = b.col3
WHERE a.col1 IS NOT NULL AND a.col2 IS NOT NULL AND a.col3 IS NOT NULL;

By filtering out rows where col1, col2, and col3 are NULL, you reduce the number of rows that need to be processed in the join operation.

Also ensure col1, col2, and col3 are indexed in both #table_a and #table_b

You can use ISNULL

SELECT  *
FROM #table_a AS a
LEFT JOIN #table_b AS b
ON  ISNULL(a.col1,b.col1) = b.col1
AND ISNULL(a.col2,b.col2) = b.col2
AND ISNULL(a.col3,b.col3) = b.col3
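A quick Python sanity check (None standing in for NULL; each predicate reduced to whether a row pair joins) suggests the ISNULL form matches the original OR form whenever b's value is non-null, and shows the one divergence:

```python
def original_pred(a, b):
    """(a.col = b.col OR a.col IS NULL), reduced to: does the pair join?
    An UNKNOWN comparison (any NULL operand) does not join."""
    return a is None or (b is not None and a == b)

def isnull_pred(a, b):
    """ISNULL(a.col, b.col) = b.col, with NULL = NULL evaluating to
    UNKNOWN, i.e. no join."""
    left = b if a is None else a
    return left is not None and b is not None and left == b

# Equivalent whenever b's value is non-null...
for a in (None, 1, 2):
    for b in (1, 2):
        assert original_pred(a, b) == isnull_pred(a, b)

# ...but they diverge when both sides are NULL:
print(original_pred(None, None), isnull_pred(None, None))  # True False
```

So this rewrite is only safe if the b columns are known to be non-null.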
