Remove words from a string based on matches in an array

Time: 2022-12-20 15:35:05

I have this table:

Assume that "florio" is a city contained somewhere in the AllLocationTerms array column.

How do I remove "florio" when it exists on my list of locations in AllLocationTerms array column?

Basically, I want to remove all matching items in AllLocationTerms from "Query" column.

There can be two or more matching words: for example, "new york apartments" as the Query, with "new" and "york" in the array. The outcome in this case should be "apartments".

2 Solutions

#1



The following addresses a use case like yours, where you need to check 800,000 rows against a location list array of around 40K items.
40K items are definitely too many to build a regular expression from, as my previous answer does.
Instead, I propose splitting the query string into separate words while preserving each word's position, excluding the words that match a term by means of a left join, and finally assembling the surviving words back into a string.

#standardSQL
WITH `project.dataset.table` AS (
  -- Dummy data standing in for the real table
  SELECT 'florio management apartments' query, 1 clicks, ['battle','creek','iowa','florio'] allLocationTerms UNION ALL
  SELECT 'florio creek management iowa apartments' query, 1 clicks, ['battle','creek','iowa','florio'] allLocationTerms
)
SELECT *,
  (
    -- Reassemble the surviving words in their original order
    SELECT STRING_AGG(word, ' ' ORDER BY pos)
    FROM (
      SELECT word, MIN(pos) pos
      -- Split the query into words, keeping each word's position
      FROM UNNEST(SPLIT(query, ' ')) word WITH OFFSET AS pos
      LEFT JOIN UNNEST(allLocationTerms) term
      ON word = term
      -- Note: grouping by word alone collapses a repeated word into one
      GROUP BY word
      -- Keep only the words that matched no location term
      HAVING COUNT(DISTINCT term) = 0
    )
  ) modified_query
FROM `project.dataset.table`
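
Because the query above groups by word, a word that appears twice in the query string is kept only once. If repeated words should survive, a simpler filter can stand in for the left join. The following is a minimal sketch, not the method from the answer above: the sample rows are hypothetical (the first comes from the question), it assumes allLocationTerms contains no NULL elements, and it has not been measured against a 40K-item array.

```sql
#standardSQL
-- Hypothetical sample rows; the second repeats a non-location word
WITH `project.dataset.table` AS (
  SELECT 'new york apartments' query, ['new','york'] allLocationTerms UNION ALL
  SELECT 'cheap florio apartments cheap' query, ['florio'] allLocationTerms
)
SELECT *,
  (
    -- Keep every word that is not a location term, in original order
    SELECT STRING_AGG(word, ' ' ORDER BY pos)
    FROM UNNEST(SPLIT(query, ' ')) word WITH OFFSET AS pos
    WHERE word NOT IN UNNEST(allLocationTerms)
  ) modified_query
FROM `project.dataset.table`
```

With these rows, modified_query should be "apartments" for the first and "cheap apartments cheap" for the second. When every word is a location term, STRING_AGG has nothing to aggregate and the result is NULL.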

#2



The following is for BigQuery Standard SQL.

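#standardSQL
SELECT *,
  -- Collapse the gaps left by removed terms and trim the ends
  TRIM(REGEXP_REPLACE(
    -- Remove every whole-word occurrence of any location term
    REGEXP_REPLACE(query,
      (SELECT CONCAT('\\b', STRING_AGG(term, '\\b|\\b'), '\\b') FROM UNNEST(allLocationTerms) term),
      ''),
    r'\s+', ' ')) modified_query
FROM `project.dataset.table`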

You can test and play with the above using dummy data, as below:

#standardSQL
WITH `project.dataset.table` AS (
  -- Dummy data standing in for the real table
  SELECT 'florio management apartments' query, 1 clicks, ['battle','creek','iowa','florio'] allLocationTerms UNION ALL
  SELECT 'florio creek management iowa apartments' query, 1 clicks, ['battle','creek','iowa','florio'] allLocationTerms
)
SELECT *,
  -- Collapse the gaps left by removed terms and trim the ends
  TRIM(REGEXP_REPLACE(
    -- Remove every whole-word occurrence of any location term
    REGEXP_REPLACE(query,
      (SELECT CONCAT('\\b', STRING_AGG(term, '\\b|\\b'), '\\b') FROM UNNEST(allLocationTerms) term),
      ''),
    r'\s+', ' ')) modified_query
FROM `project.dataset.table`

The result is:

Row  query                                    clicks  allLocationTerms  modified_query
1    florio management apartments             1       battle            management apartments
                                                      creek
                                                      iowa
                                                      florio
2    florio creek management iowa apartments  1       battle            management apartments
                                                      creek
                                                      iowa
                                                      florio
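
To see the regular expression that the subquery hands to REGEXP_REPLACE, the subquery can be run by itself. This is a minimal sketch, assuming the same four terms as the dummy data; the column alias pattern is only illustrative.

```sql
#standardSQL
-- Build the alternation pattern from a literal array to inspect it
SELECT CONCAT('\\b', STRING_AGG(term, '\\b|\\b'), '\\b') AS pattern
FROM UNNEST(['battle','creek','iowa','florio']) term
```

This should return a single string of the form \bbattle\b|\bcreek\b|\biowa\b|\bflorio\b. STRING_AGG without an ORDER BY does not guarantee the order of the alternatives, which should not affect matching for single-word terms like these. The terms are inserted as-is, so a term containing regex metacharacters (a period, for example) would be interpreted as part of the pattern.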
