I am trying to match MCC values (Merchant Category Codes) across two data sources by business name rather than by code. Because the names come from two sources, they are not identical. The goal is to check how often the MCCs match or do not match.

For instance, Home Depot may be listed as 'Home Depot', 'The Home Depot', or 'Home Depot #276'.

As a first try, I am pulling the first 5 characters of the name from one data source, then trying to match that substring against the 'master' (or at least cleaner) data source.

Ultimately, though, I need to check whether the value in one column is contained in the value of another column.

I have tried contains_substr, which only accepts a string literal as the search value, so I can't pass it the column. I've also been trying regexp_contains, which throws an error:

"Cannot parse regular expression: no argument for repetition operator: *"

I've tried multiple variations of code like this:

select * from space.linking_t.MCCs_join_test where REGEXP_CONTAINS(entity_name, CONCAT(r"(?i)", merch_name_5, r"",''))

Here, 'entity_name' is the master name to match against and 'merch_name_5' is the five-character substring.

Asked Feb 21 at 1:39 by Matt Miller

1 Answer

For substring matching, you can join the two sources with the like operator. Note that ilike is not available in BigQuery, but you can get a case-insensitive match by converting both sides to upper or lower case first.

with coded as (
  select 'Home Depot' as name, 1 as code union all
  select 'Chase', 2 union all
  select 'Walmart', 3 union all
  select 'Google', 4 union all
  select 'AirBnB', 5
), uncoded as (
  select 'The Home Depot' as name union all
  select 'Google LLC' union all
  select 'JP Man Chase' union all
  select 'AirBnB' union all
  select 'Penguin Random House' union all
  select 'Chase Bank'
) 
select 
  a.name as uncoded_name,
  b.name,
  b.code
from   
  uncoded a
  left outer join coded b on a.name like concat('%', b.name, '%')
order by 2, 1
;
--Penguin Random House, null,   null
--AirBnB,   AirBnB, 5
--Chase Bank,   Chase,  2
--JP Man Chase,  Chase,  2
--Google LLC,   Google, 4
--The Home Depot,   Home Depot, 1
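
If you want a plain "is this column contained in that column" test rather than a pattern match, strpos combined with lower is one option. The sketch below uses made-up sample rows (the pairs CTE is hypothetical; only the column names entity_name and merch_name_5 come from the question). Because strpos treats the search value as literal text, a value that starts with * should not trigger the "no argument for repetition operator" error. That error is consistent with a merch_name_5 value containing a regex metacharacter, although I can't confirm that from the question alone.

```sql
-- Hypothetical sample rows; column names follow the question.
with pairs as (
  select 'The Home Depot' as entity_name, 'home ' as merch_name_5 union all
  select 'Walmart Inc', '*WALM' union all
  select 'Google LLC', 'Chase'
)
select
  entity_name,
  merch_name_5,
  -- strpos returns a 1-based position, or 0 when the substring is absent
  strpos(lower(entity_name), lower(merch_name_5)) > 0 as is_contained
from pairs
;
--The Home Depot,   home ,  true
--Walmart Inc,  *WALM,  false
--Google LLC,   Chase,  false
```

Unlike like, this also avoids treating % or _ inside the stored names as wildcards.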

In all likelihood, your actual query will need to be more sophisticated than a plain substring check.
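
As one illustration of what "more sophisticated" could mean, you might normalize the names before comparing them. The rules here are assumptions chosen to fit the Home Depot examples from the question (lowercase, drop a leading "The", drop a trailing store number such as "#276"); real data will almost certainly need a different rule set.

```sql
-- Hypothetical sample rows based on the variants listed in the question.
with uncoded as (
  select 'The Home Depot' as name union all
  select 'Home Depot #276' union all
  select 'Home Depot'
)
select
  name,
  -- strip a leading "the " and a trailing "#<digits>" after lowercasing
  trim(regexp_replace(lower(name), r'^the\s+|\s*#\d+$', '')) as normalized_name
from uncoded
;
--The Home Depot,   home depot
--Home Depot #276,  home depot
--Home Depot,   home depot
```

With both sources normalized the same way, an equality join on normalized_name becomes possible, which is stricter than a substring match.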
