It is clear from this question that OR conditions in SQL joins are likely to be inefficient because they require nested loops and cannot be optimised as a hash or merge join. The recommended solution is to replace them with UNION statements.
The query I am trying to optimise currently contains a series of OR conditions of the form a.col = b.col OR a.col IS NULL. For a million rows, we are estimating runtimes measured in hours.
However, as there could be a dozen of these conditions, converting this into the equivalent UNION statements is likely to be cumbersome, with each OR condition doubling the number of subqueries that have to be UNIONed. So I am keen to find a better alternative.
An example of what I am currently working with:
SELECT *
FROM #table_a AS a
LEFT JOIN #table_b AS b
ON (a.col1 = b.col1 OR a.col1 IS NULL)
AND (a.col2 = b.col2 OR a.col2 IS NULL)
AND (a.col3 = b.col3 OR a.col3 IS NULL)
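To make the doubling concrete, here is a sketch of the OR-to-UNION rewrite for a single nullable join column, using SQLite via Python to stand in for T-SQL (tables and data are made up). Each additional "OR a.colN IS NULL" condition would double the number of UNION ALL branches again.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE table_a (id INTEGER, col1 INTEGER);
    CREATE TABLE table_b (col1 INTEGER, total INTEGER);
    INSERT INTO table_a VALUES (1, 10), (2, NULL);   -- NULL means "match all"
    INSERT INTO table_b VALUES (10, 100), (20, 200);
""")

# Original pattern: OR condition inside the join
or_join = con.execute("""
    SELECT a.id, b.total FROM table_a a
    JOIN table_b b ON (a.col1 = b.col1 OR a.col1 IS NULL)
    ORDER BY a.id, b.total
""").fetchall()

# Rewrite: one UNION ALL branch per side of the OR
union_join = con.execute("""
    SELECT a.id, b.total FROM table_a a
    JOIN table_b b ON a.col1 = b.col1       -- branch 1: plain equality
    UNION ALL
    SELECT a.id, b.total FROM table_a a
    JOIN table_b b ON a.col1 IS NULL        -- branch 2: NULL matches everything
    ORDER BY a.id, b.total
""").fetchall()

print(or_join == union_join)
```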
I have considered using COALESCE in the join, something like the following. This removes OR from the condition, but I expect the need to calculate the COALESCE value is also going to prevent the join from being optimised.
SELECT *
FROM #table_a AS a
LEFT JOIN #table_b AS b
ON COALESCE(a.col1, b.col1, -1) = COALESCE(b.col1, -1)
AND COALESCE(a.col2, b.col2, -1) = COALESCE(b.col2, -1)
AND COALESCE(a.col3, b.col3, -1) = COALESCE(b.col3, -1)
How can I approach optimising this join pattern?
A few clarifying details in response to comments:
- The current query allows for the case where all a columns are null. In most instances I would expect that every a column contains some nulls and every a row contains some nulls, but that no a column or row is completely null.
- There are a couple of things I might use this pattern for. One is to join subtotals to their supertotal. For example, table_b might contain the counts for each combination of age group, region, and country of birth. Then when table_a has nulls in the age column, that row would contain the total for a region and a country of birth, over all age groups.
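A small illustration of that subtotal/supertotal use case, again in SQLite via Python with made-up data: a NULL age_group in table_a means "all age groups", so that row picks up every matching region row and its SUM becomes the supertotal.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE table_a (age_group TEXT, region TEXT);   -- NULL = all ages
    CREATE TABLE table_b (age_group TEXT, region TEXT, n INTEGER);
    INSERT INTO table_a VALUES ('young', 'north'), (NULL, 'north');
    INSERT INTO table_b VALUES ('young', 'north', 3), ('old', 'north', 5);
""")

totals = con.execute("""
    SELECT a.age_group, a.region, SUM(b.n) AS total
    FROM table_a a
    LEFT JOIN table_b b
      ON (a.age_group = b.age_group OR a.age_group IS NULL)
     AND a.region = b.region
    GROUP BY a.age_group, a.region
    ORDER BY a.age_group
""").fetchall()

# NULL age_group row sums over all age groups: 3 + 5 = 8
print(totals)
```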
3 Answers
For the following answer, I have renamed your tables as SearchCriteria and DataToSearch for improved readability.
I am assuming that SearchCriteria contains one or a few rows, while DataToSearch contains many rows. It is also assumed that appropriate indexes are defined on the DataToSearch table to support reasonably efficient lookups for various combinations of criteria.
You can generate dynamic SQL (one query per criteria row) that applies just the relevant filter conditions. Non-null values generate a filter condition that is added to the WHERE clause. Null search values are skipped. The results from all of the generated queries can then be combined using UNION ALL.
DECLARE @NL CHAR = CHAR(10) -- newline
DECLARE @UnionAll NVARCHAR(100) = @NL + 'UNION ALL' + @NL
DECLARE @sql NVARCHAR(MAX) = (
SELECT STRING_AGG(Q.Sql, @UnionAll) WITHIN GROUP(ORDER BY S.search_id)
--SELECT *
FROM SearchCriteria S
CROSS APPLY (
SELECT CONCAT(
CAST('' AS NVARCHAR(MAX)),
'SELECT ', S.search_id, ' AS search_id, D.data_id, D.col1, D.col2, D.col3',
@NL, 'FROM DataToSearch D',
@NL, 'WHERE 1 = 1',
-- The following lines will each either generate an "AND <condition>" line
-- or NULL (no condition) for cases where the search value is null.
-- Use the following for exact numeric values (excluding real/float)
-- If used for text values, the following is open to SQL Injection
(@NL + 'AND D.col1 = ' + CAST(S.col1 AS VARCHAR(30))),
(@NL + 'AND D.col2 = ' + CAST(S.col2 AS VARCHAR(30))),
(@NL + 'AND D.col3 = ' + CAST(S.col3 AS VARCHAR(30))),
-- Use the following for text values (limit 128 characters)
--(@NL + 'AND D.col1 = N' + QUOTENAME(S.col1, '''')),
--(@NL + 'AND D.col2 = N' + QUOTENAME(S.col2, '''')),
--(@NL + 'AND D.col3 = N' + QUOTENAME(S.col3, '''')),
-- Use the following for text values (possibly longer than 128 characters)
--(@NL + 'AND D.col1 = N''' + REPLACE(S.col1, '''', '''''') + ''''),
--(@NL + 'AND D.col2 = N''' + REPLACE(S.col2, '''', '''''') + ''''),
--(@NL + 'AND D.col3 = N''' + REPLACE(S.col3, '''', '''''') + ''''),
-- Use an appropriate mix of the above, if the columns have mixed types
''
) AS Sql
) Q
)
PRINT @sql
EXEC sp_executesql @sql
When a criteria column is null, the generated AND column = value expression will also be null, and the CONCAT() function will quietly skip that snippet.
Extreme care must be taken when injecting values into dynamic SQL to ensure that they are properly quoted and sanitized. Integer and other numeric values are safe as is, but text values must be carefully quoted and any embedded quotes doubled up. (For example, "O'Malley" becomes N'O''Malley'.) If you have any floating point values (SQL REAL or FLOAT types), you would need to cast them using CONVERT(VARCHAR(30), value, 3) to ensure maximum precision is preserved.
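The same null-skipping idea can be expressed with bound parameters instead of string-injected values, which sidesteps the quoting concerns entirely. A sketch in Python/SQLite (table and column names are illustrative; only the column names, which come from our own code rather than user input, are interpolated into the SQL):

```python
import sqlite3

def build_search(criteria):
    """Build (sql, params): one AND clause per non-null criterion."""
    sql = "SELECT data_id, col1, col2, col3 FROM DataToSearch WHERE 1 = 1"
    params = []
    for col, value in criteria.items():
        if value is not None:            # null criterion => no filter at all
            sql += f" AND {col} = ?"     # value is passed as a bound parameter
            params.append(value)
    return sql, params

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE DataToSearch (data_id INTEGER, col1 INTEGER, col2 INTEGER, col3 INTEGER);
    INSERT INTO DataToSearch VALUES (46, 4, 5, NULL), (146, 4, 5, 1), (313, 1, 2, 3);
""")

# col3 is null, so only col1 and col2 are filtered on
sql, params = build_search({"col1": 4, "col2": 5, "col3": None})
rows = con.execute(sql, params).fetchall()
print(rows)
```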
A safer alternative is to use a cursor loop to generate and execute multiple parameterized queries that each add their results to a temp #Results table. The final results are then selected from that temp table.
CREATE TABLE #Results (
search_id INT,
data_id INT,
col1 INT,
col2 INT,
col3 INT
)
DECLARE CSR CURSOR FAST_FORWARD FOR
SELECT *
FROM SearchCriteria
ORDER BY search_id
OPEN CSR
DECLARE @search_id INT, @col1 INT, @col2 INT, @col3 INT
FETCH NEXT FROM CSR INTO @search_id, @col1, @col2, @col3
WHILE @@FETCH_STATUS = 0
BEGIN
DECLARE @sql NVARCHAR(MAX) = CONCAT(
CAST('' AS NVARCHAR(MAX))
, 'INSERT INTO #Results'
, CHAR(10) + 'SELECT @search_id, D.data_id, D.col1, D.col2, D.col3'
, CHAR(10) + 'FROM DataToSearch D'
, CHAR(10) + 'WHERE 1 = 1'
, CASE WHEN @col1 IS NOT NULL THEN CHAR(10) + 'AND D.col1 = @col1' END
, CASE WHEN @col2 IS NOT NULL THEN CHAR(10) + 'AND D.col2 = @col2' END
, CASE WHEN @col3 IS NOT NULL THEN CHAR(10) + 'AND D.col3 = @col3' END
)
DECLARE @params NVARCHAR(MAX) = '@search_id INT, @col1 INT, @col2 INT, @col3 INT'
PRINT @sql
EXEC sp_executesql @sql, @params, @search_id, @col1, @col2, @col3
FETCH NEXT FROM CSR INTO @search_id, @col1, @col2, @col3
END
CLOSE CSR
DEALLOCATE CSR
SELECT *
FROM #Results
DROP TABLE #Results
Both of the above should perform index seeks for nearly all cases, depending on index availability. Cursor and temp table overhead in the second case should be minimal.
The UNION ALL in the above code can be changed to a simple UNION if you wish to eliminate duplicate results. (The search_id select item should also then be removed from the SQL for this to be effective.)
Given the following SearchCriteria data:
search_id | col1 | col2 | col3
---|---|---|---
1 | 1 | 2 | 3
2 | 4 | 5 | null
3 | null | 6 | 7
The following Dynamic SQL is generated:
SELECT 1 AS search_id, D.data_id, D.col1, D.col2, D.col3
FROM DataToSearch D
WHERE 1 = 1
AND D.col1 = 1
AND D.col2 = 2
AND D.col3 = 3
UNION ALL
SELECT 2 AS search_id, D.data_id, D.col1, D.col2, D.col3
FROM DataToSearch D
WHERE 1 = 1
AND D.col1 = 4
AND D.col2 = 5
UNION ALL
SELECT 3 AS search_id, D.data_id, D.col1, D.col2, D.col3
FROM DataToSearch D
WHERE 1 = 1
AND D.col2 = 6
AND D.col3 = 7
The second version above generates the following sequence of queries, executing each using appropriate parameter values:
INSERT INTO #Results
SELECT @search_id, D.data_id, D.col1, D.col2, D.col3
FROM DataToSearch D
WHERE 1 = 1
AND D.col1 = @col1
AND D.col2 = @col2
AND D.col3 = @col3
INSERT INTO #Results
SELECT @search_id, D.data_id, D.col1, D.col2, D.col3
FROM DataToSearch D
WHERE 1 = 1
AND D.col1 = @col1
AND D.col2 = @col2
INSERT INTO #Results
SELECT @search_id, D.data_id, D.col1, D.col2, D.col3
FROM DataToSearch D
WHERE 1 = 1
AND D.col2 = @col2
AND D.col3 = @col3
Using some sample data-to-search that includes a mix of null and non-null values, the following results are obtained:
search_id | data_id | col1 | col2 | col3
---|---|---|---|---
1 | 313 | 1 | 2 | 3
2 | 46 | 4 | 5 | null
2 | 146 | 4 | 5 | 1
2 | 246 | 4 | 5 | 2
2 | 346 | 4 | 5 | 3
... | ... | ... | ... | ...
3 | 707 | null | 6 | 7
3 | 717 | 1 | 6 | 7
3 | 727 | 2 | 6 | 7
3 | 737 | 3 | 6 | 7
... | ... | ... | ... | ...
See this db<>fiddle for a demo of both techniques, including the generated execution plans.
The original posted queries also produce the same results, but at a much greater cost caused by multiple table scans.
Bonus: If you have the option to code this up as a LINQ query, the following C# code might work (untested):
var query = SearchCriteria
    .Select(sc => {
        var innerQuery = context.DataToSearch.AsQueryable();
        if (sc.col1 != null)
        {
            innerQuery = innerQuery.Where(d => d.col1 == sc.col1);
        }
        if (sc.col2 != null)
        {
            innerQuery = innerQuery.Where(d => d.col2 == sc.col2);
        }
        if (sc.col3 != null)
        {
            innerQuery = innerQuery.Where(d => d.col3 == sc.col3);
        }
        return innerQuery;
    })
    .Aggregate((l, r) => l.Concat(r)); // Use .Union() to dedup
If you wish to dedup the results, the .Concat() can be changed to .Union() above. (I think this will be executed on the SQL Server side. If not, you might need to define a comparator so that actual values, not references, are compared.)
You can filter the rows with NULL values before the join:
SELECT *
FROM #table_a AS a
LEFT JOIN #table_b AS b
ON a.col1 = b.col1 AND a.col2 = b.col2 AND a.col3 = b.col3
WHERE a.col1 IS NOT NULL AND a.col2 IS NOT NULL AND a.col3 IS NOT NULL;
By filtering out rows where col1, col2, or col3 is NULL, you reduce the number of rows that need to be processed in the join operation. Also ensure col1, col2, and col3 are indexed in both #table_a and #table_b.
You can use ISNULL:
SELECT *
FROM #table_a AS a
LEFT JOIN #table_b AS b
ON ISNULL(a.col1,b.col1) = b.col1
AND ISNULL(a.col2,b.col2) = b.col2
AND ISNULL(a.col3,b.col3) = b.col3
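As a quick sanity check of this rewrite (SQLite's IFNULL standing in for T-SQL's ISNULL, sample data made up): when a.col1 is null the condition collapses to b.col1 = b.col1, which is true for every non-null b.col1, so it reproduces the OR form as long as the b columns themselves contain no nulls. (A null b.col1 would make the comparison unknown, unlike the original OR version.)

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE table_a (id INTEGER, col1 INTEGER);
    CREATE TABLE table_b (col1 INTEGER, val TEXT);
    INSERT INTO table_a VALUES (1, 10), (2, NULL);
    INSERT INTO table_b VALUES (10, 'x'), (20, 'y');   -- no NULLs in b.col1
""")

or_form = con.execute("""
    SELECT a.id, b.val FROM table_a a
    LEFT JOIN table_b b ON (a.col1 = b.col1 OR a.col1 IS NULL)
    ORDER BY a.id, b.val
""").fetchall()

ifnull_form = con.execute("""
    SELECT a.id, b.val FROM table_a a
    LEFT JOIN table_b b ON IFNULL(a.col1, b.col1) = b.col1
    ORDER BY a.id, b.val
""").fetchall()

print(or_form == ifnull_form)   # equal on this data
```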
Comments:
- If #table_a contains the "values to search for" (one or few rows) and #table_b contains the "data to search" (many rows), your best bet may be to use dynamic SQL on the server side or LINQ on the client side. Both can just omit any search filter conditions where the search value is null. This allows SQL Server to efficiently build an execution plan that makes optimal use of available indexes. If you have multiple "values to search for" rows with a heterogeneous mix of null and non-null search values, it may be best to loop to match one at a time before combining results. – T N