The problem is matching MCC values ("Merchant Category Codes") by business name rather than by code, where the names come from two sources and so are not identical. The goal is to check how often the MCC matches or does not match.
For instance, Home Depot may be listed as 'Home Depot', 'The Home Depot', or 'Home Depot #276'.
As a first try, I am pulling the first 5 characters, then trying to match that substring from one data source against the 'master' (or at least cleaner) data source.
But at the end of the day, I need to check whether one column is contained in the other column.
I have tried contains_substr, which only accepts a string literal as the search value, so I can't pass the column. I've also been trying regexp_contains, which throws an error:
"Cannot parse regular expression: no argument for repetition operator: *"
I've tried multiple variations of code like this:
select * from space.linking_t.MCCs_join_test where REGEXP_CONTAINS(entity_name, CONCAT(r"(?i)", merch_name_5, r"",''))
Here 'entity_name' is the master to match to and 'merch_name_5' is the substring.
asked Feb 21 at 1:39 by Matt Miller

1 Answer
For substring matching, you can use joins with the like operator. Note that in BigQuery, ilike is not available, but you can always convert everything to upper or lower first if needed.
with coded as (
  select 'Home Depot' as name, 1 as code union all
  select 'Chase', 2 union all
  select 'Walmart', 3 union all
  select 'Google', 4 union all
  select 'AirBnB', 5
), uncoded as (
  select 'The Home Depot' as name union all
  select 'Google LLC' union all
  select 'JP Man Chase' union all
  select 'AirBnB' union all
  select 'Penguin Random House' union all
  select 'Chase Bank'
)
select
  a.name as uncoded_name,
  b.name,
  b.code
from
  uncoded a
  left outer join coded b on a.name like concat('%', b.name, '%')
order by 2, 1
;
--Penguin Random House, null, null
--AirBnB, AirBnB, 5
--Chase Bank, Chase, 2
--JP Man Chase, Chase, 2
--Google LLC, Google, 4
--The Home Depot, Home Depot, 1
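One caveat with building a like pattern from a column is that any % or _ inside the stored name acts as a wildcard, and like itself is case-sensitive. A minimal sketch of an alternative, assuming plain "is contained in" semantics are what you want, is strpos on lowercased values. The sample rows below are made up for illustration, and an inner join is used, so unmatched names simply drop out.

```sql
-- Hypothetical sample data; names and codes are illustrative only.
with coded as (
  select 'Home Depot' as name, 1 as code union all
  select '100% Pure', 2
), uncoded as (
  select 'THE HOME DEPOT #276' as name union all
  select '100 Percent Pure' union all
  select '100% pure cosmetics'
)
select
  a.name as uncoded_name,
  b.name as coded_name,
  b.code
from
  uncoded a
  -- strpos returns the 1-based position of the substring, or 0 if absent;
  -- lower() on both sides makes the comparison case-insensitive
  join coded b on strpos(lower(a.name), lower(b.name)) > 0
order by 1
;
--100% pure cosmetics, 100% Pure, 2
--THE HOME DEPOT #276, Home Depot, 1
```

'100 Percent Pure' does not appear in the result because strpos treats the % in '100% Pure' as a literal character, whereas a like pattern built from that name would treat it as a wildcard.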
In all likelihood, your actual query will need to be more sophisticated than just checking for substring matching.
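As for the regexp_contains error in the question: it most likely means that some merch_name_5 values contain regex metacharacters, such as a * with nothing before it to repeat, so the concatenated pattern is not a valid regular expression. If you do want to stay with regular expressions, one possible approach is to backslash-escape those characters before building the pattern. The sketch below assumes a table shaped like the one in the question; the sample rows and the column names escaped_name and is_contained are invented for illustration.

```sql
-- Hypothetical rows shaped like the question's table (entity_name, merch_name_5).
with mccs_join_sample as (
  select 'The Home Depot' as entity_name, 'Home ' as merch_name_5 union all
  select 'SQ *Coffee Shop', 'SQ *C' union all
  select '*Star Cafe', '*STAR'
), escaped as (
  select
    entity_name,
    merch_name_5,
    -- put a backslash in front of each regex metacharacter so that it is
    -- matched literally; r'\\\1' means "a literal backslash, then group 1"
    regexp_replace(merch_name_5, r'([\\.^$|?*+()\[\]{}])', r'\\\1') as escaped_name
  from mccs_join_sample
)
select
  entity_name,
  merch_name_5,
  -- (?i) makes the match case-insensitive
  regexp_contains(entity_name, concat('(?i)', escaped_name)) as is_contained
from escaped
;
```

If all you need is a literal substring check, strpos(lower(entity_name), lower(merch_name_5)) > 0 sidesteps regex parsing entirely.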
Source: google cloud platform - In BigQuery, check if a substring of one column is in another column (or across tables) - Stack Overflow