Efficiently cleaning strings in a table

Time: 2022-05-19 21:03:21

I'm currently working on a problem where certain characters need to be cleaned from strings that exist in a table. Normally I'd do a simple UPDATE with a replace, but in this case there are 32 different characters that need to be removed.

I've done some looking around and I can't find any great solutions for quickly cleaning strings that already exist in a table.

Things I've looked into:

  1. Doing a series of nested replaces

    This solution is doable, but 32 different replacements would require either some ugly code or hacky dynamic SQL to build a huge chain of REPLACE calls.

  2. PATINDEX and while loops

    As seen in this answer, it is possible to mimic a kind of regex replace, but I'm working with a lot of data, so I'm hesitant to trust even the improved solution to run in a reasonable amount of time.

  3. Recursive CTEs

    I tried a CTE approach to this problem, but it didn't run terribly fast once the number of rows got large.

For reference:

-- Lookup table: each bad string and its replacement.
CREATE TABLE #BadChar(
    id int IDENTITY(1,1),
    badString nvarchar(10),
    replaceString nvarchar(10)
);

INSERT INTO #BadChar(badString, replaceString) SELECT 'A', '^';
INSERT INTO #BadChar(badString, replaceString) SELECT 'B', '}';
INSERT INTO #BadChar(badString, replaceString) SELECT 's', '5';
INSERT INTO #BadChar(badString, replaceString) SELECT '-', ' ';

-- Test data: 300,000 dirty strings.
CREATE TABLE #CleanMe(
    clean_id int IDENTITY(1,1),
    DirtyString nvarchar(20)
);

DECLARE @i int;
SET @i = 0;
WHILE @i < 100000 BEGIN
    INSERT INTO #CleanMe(DirtyString) SELECT 'AAAAA';
    INSERT INTO #CleanMe(DirtyString) SELECT 'BBBBB';
    INSERT INTO #CleanMe(DirtyString) SELECT 'AB-String-BA';
    SET @i = @i + 1;
END;

-- Recursive CTE: step N applies the replacement whose id = N to every string.
WITH FixedString (Step, String, cid) AS (
    -- Anchor: apply the first replacement to every row.
    SELECT 1 AS Step, REPLACE(DirtyString, badString, replaceString), clean_id
    FROM #BadChar CROSS JOIN #CleanMe
    WHERE id = 1

    UNION ALL

    -- Recursive step: apply the next replacement to the previous result.
    SELECT Step + 1, REPLACE(String, badString, replaceString), cid
    FROM FixedString AS T1
    JOIN #BadChar AS T2 ON T1.Step + 1 = T2.id
    JOIN #CleanMe AS T3 ON T1.cid = T3.clean_id
)
-- Keep only the rows produced by the final step.
SELECT String FROM FixedString WHERE Step = (SELECT MAX(Step) FROM FixedString);

DROP TABLE #BadChar;
DROP TABLE #CleanMe;

  4. Use a CLR

    This seems to be a common solution that many people use, but the environment I'm in doesn't make it an easy one to embark on.
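
For comparison, option 1 applied to the four sample pairs from #BadChar might look like the sketch below. The inline VALUES list and the alias `d` are only there to make the snippet self-contained; they are not part of the original script. Under a case-insensitive default collation, N's' would also match the uppercase S in 'String'.

```sql
-- Nested REPLACE: the innermost call runs first, so 'A' is handled
-- before 'B', then 's', then '-'.
SELECT d.DirtyString,
       REPLACE(
       REPLACE(
       REPLACE(
       REPLACE(d.DirtyString,
           N'A', N'^'),
           N'B', N'}'),
           N's', N'5'),
           N'-', N' ') AS CleanString
FROM (VALUES (N'AAAAA'), (N'BBBBB'), (N'AB-String-BA')) AS d(DirtyString);
```

The same expression can be used on the right-hand side of an UPDATE ... SET. With 32 pairs it grows to 32 nested calls, which is the ugliness mentioned in option 1.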

Are there any other ways to go about this that I've overlooked? Or any improvements to the methods I've already looked into?

1 solution

#1


Leveraging the idea from Alan Burstein's solution, you could do something like this, if you wanted to hard code the bad/replace strings. This would work for bad/replace strings longer than a single character as well.

CREATE FUNCTION [dbo].[CleanStringV1]
(
  @String   nvarchar(4000)
)
RETURNS nvarchar(4000) WITH SCHEMABINDING AS 
BEGIN
 -- Each row of the VALUES list applies one REPLACE to the running value of @String.
 SELECT @String = REPLACE
  (
    @String COLLATE Latin1_General_BIN,  -- binary collation makes matching case sensitive
    badString,
    replaceString
  )
 FROM
 (VALUES
      ('A', '^')
    , ('B', '}')
    , ('s', '5')
    , ('-', ' ')
    ) t(badString, replaceString);
 RETURN @String;
END;

Or, if you have a table containing the bad/replace strings:

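CREATE FUNCTION [dbo].[CleanStringV2]
(
  @String   nvarchar(4000)
)
RETURNS nvarchar(4000) AS 
BEGIN
 -- Each row of BadChar applies one REPLACE to the running value of @String.
 SELECT @String = REPLACE
  (
    @String COLLATE Latin1_General_BIN,  -- binary collation makes matching case sensitive
    badString,
    replaceString
  )
 FROM BadChar;
 RETURN @String;
END;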
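
Note that CleanStringV2 reads from a permanent table named BadChar, not the #BadChar temp table from the question, because user-defined functions cannot access temporary tables. A minimal sketch of such a table follows. The dbo schema, the PRIMARY KEY, and the NOT NULL constraints are assumptions added for illustration; the column names are copied from the question.

```sql
-- Hypothetical permanent lookup table matching the name used in CleanStringV2.
-- NOT NULL matters here: REPLACE returns NULL if any argument is NULL,
-- which would wipe out the whole string.
CREATE TABLE dbo.BadChar(
    id int IDENTITY(1,1) PRIMARY KEY,
    badString nvarchar(10) NOT NULL,
    replaceString nvarchar(10) NOT NULL
);

INSERT INTO dbo.BadChar(badString, replaceString)
VALUES (N'A', N'^'), (N'B', N'}'), (N's', N'5'), (N'-', N' ');
```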

These are case sensitive. You can remove the COLLATE clause if you want case-insensitive matching. I did a few small tests, and these were not much slower than nested REPLACE. The first one, with the hardcoded strings, was the faster of the two and was nearly as fast as nested REPLACE.
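
Applying either function to existing rows could then look like the sketch below. It assumes dbo.CleanStringV1 has already been created as shown above; the #Demo table is a hypothetical stand-in for the real table.

```sql
-- Small stand-in table so the example can run on its own.
CREATE TABLE #Demo(DirtyString nvarchar(20));

INSERT INTO #Demo(DirtyString)
VALUES (N'AAAAA'), (N'BBBBB'), (N'AB-String-BA');

-- Clean every row in place with the scalar function.
UPDATE #Demo
SET DirtyString = dbo.CleanStringV1(DirtyString);

SELECT DirtyString FROM #Demo;

DROP TABLE #Demo;
```

One caveat, as far as I know: Microsoft's documentation for SELECT @local_variable cautions against relying on variable assignment across multiple rows, so it is worth confirming on your own data that every bad/replace pair actually gets applied.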
