Remove a sentence from a paragraph that has a specific pattern with T-SQL

I have a large number of descriptions that can be anywhere from 5 to 20 sentences each. I am trying to put a script together that will locate and remove a sentence that contains a word with numbers before or after it.

Before: Hello world. Todays department has 345 employees. Have a good day.
After: Hello world. Have a good day.

My main problem right now is identifying the violation.
Here "345 employees" is what causes the sentence to be removed. However, each description will have a different number and possibly a different variation of the word employee. I would like to avoid having to create a table of all the different variations of employee.

JTB

3 solutions

#1

This would make a good SQL Puzzle.

Disclaimer: there are probably TONS of edge cases that would blow this up

This would take a string, split it out into a table with a row for each sentence, then remove the rows that matched a condition, and then finally join them all back into a string.

CREATE FUNCTION dbo.fn_SplitRemoveJoin(@Val VARCHAR(2000), @FilterCond VARCHAR(100))
RETURNS VARCHAR(2000)
AS 
BEGIN
    DECLARE @tbl TABLE (rid INT IDENTITY(1,1), val VARCHAR(2000))
    DECLARE @t VARCHAR(2000)

    -- Split into table @tbl
    WHILE CHARINDEX('.',@Val) > 0
    BEGIN
        SET @t = LEFT(@Val, CHARINDEX('.', @Val))
        INSERT @tbl (val) VALUES (@t)
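        -- STUFF removes the sentence just taken; unlike RIGHT with LEN, it is not thrown off by trailing spaces
        SET @Val = STUFF(@Val, 1, LEN(@t), '')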
    END

    IF (LEN(@Val) > 0)
        INSERT @tbl VALUES (@Val)


    -- Filter out condition 
    DELETE FROM @tbl WHERE val LIKE @FilterCond

    -- Join back into 1 string
    DECLARE @i INT, @rv VARCHAR(2000)
    SET @i = 1
    WHILE @i <= (SELECT MAX(rid) FROM @tbl)
    BEGIN
        SELECT @rv = IsNull(@rv,'') + IsNull(val,'') FROM @tbl WHERE rid = @i
        SET @i = @i + 1
    END
    RETURN @rv

END
go


CREATE TABLE #TMP (rid INT IDENTITY(1,1), sentence VARCHAR(2000))
INSERT #tmp (sentence) VALUES ('Hello world. Todays department has 345 employees. Have a good day.')
INSERT #tmp (sentence) VALUES ('Hello world. Todays department has 15 emps. Have a good day. Oh and by the way there are 12 employees somewhere else')


SELECT 
    rid, sentence, dbo.fn_SplitRemoveJoin(sentence, '%[0-9] Emp%')
FROM #tmp t

This returns:

rid | sentence | (no column name)
1 | Hello world. Todays department has 345 employees. Have a good day. | Hello world. Have a good day.
2 | Hello world. Todays department has 15 emps. Have a good day. Oh and by the way there are 12 employees somewhere else | Hello world. Have a good day.
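
On newer versions of SQL Server, the same split, filter, and join steps might be sketched without a loop. This is not the function above but a swapped-in variant using the built-in STRING_SPLIT and STRING_AGG. It assumes SQL Server 2022 or Azure SQL (for the enable_ordinal argument of STRING_SPLIT) and a case-insensitive collation; the variable @d and the column alias cleaned are made up for the illustration.

```sql
-- A minimal sketch, assuming SQL Server 2022+ / Azure SQL and a case-insensitive collation.
DECLARE @d VARCHAR(2000) = 'Hello world. Todays department has 345 employees. Have a good day.';

SELECT STRING_AGG(s.value + '.', '') WITHIN GROUP (ORDER BY s.ordinal) AS cleaned
FROM STRING_SPLIT(@d, '.', 1) AS s        -- third argument 1 adds the ordinal column
WHERE s.value <> ''                       -- skip the empty piece after the final period
  AND s.value NOT LIKE '%[0-9] emp%';     -- drop sentences with "digit, space, emp..."
```

For the sample string this should give 'Hello world. Have a good day.'. Note that STRING_SPLIT drops the delimiter, so the sketch appends a period to every kept piece, including a final sentence that did not originally end with one, and it shares the edge cases of splitting on '.' (decimals, abbreviations).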

#2

I've used the split/remove/join technique as well.

The main points are:

This uses a pair of recursive CTEs, rather than a UDF.
This will work with all English sentence endings: . or ! or ?
This removes whitespace to make the comparison for "digit then employee" so you don't have to worry about multiple spaces and such.
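
To see what the whitespace removal buys, here is a small sketch that applies the same nested REPLACE used in this answer's query to a few sample sentences. The sample rows and the names squeezed and is_removed are hypothetical and only there for illustration.

```sql
-- Hypothetical sample sentences with irregular spacing between the number and the word.
SELECT v.sentence,
       x.squeezed,
       CASE WHEN x.squeezed LIKE '%[0-9]employ%' THEN 1 ELSE 0 END AS is_removed
FROM (VALUES ('Todays department has 345 employees.'),
             ('Todays department has 345    employees.'),
             ('We employ 345 people.'),
             ('Have a good day.')) AS v(sentence)
CROSS APPLY (
    -- Strip tab, line feed, carriage return, and space before comparing
    SELECT REPLACE(REPLACE(REPLACE(REPLACE(v.sentence, CHAR(9), ''), CHAR(10), ''), CHAR(13), ''), CHAR(32), '') AS squeezed
) AS x;
```

The first two rows are flagged regardless of how many spaces separate the number from the word. The third is not, because the pattern only covers a digit followed by "employ", not the word followed by a number.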

Here's the SqlFiddle demo, and the code:

-- Split descriptions into sentences (could use period, exclamation point, or question mark)
-- Delete any sentences that, without whitespace, are like '%[0-9]employ%'
-- Join sentences back into descriptions
;with Splitter as (
    select ID
        , ltrim(rtrim(Data)) as Data
        , cast(null as varchar(max)) as Sentence
        , 0 as SentenceNumber
    from Descriptions -- Your table here
    union all
    select ID
        , case when Data like '%[.!?]%' then right(Data, len(Data) - patindex('%[.!?]%', Data)) else null end
        , case when Data like '%[.!?]%' then left(Data, patindex('%[.!?]%', Data)) else Data end
        , SentenceNumber + 1
    from Splitter
    where Data is not null
), Joiner as (
    select ID
        , cast('' as varchar(max)) as Data
        , 0 as SentenceNumber
    from Splitter
    group by ID
    union all
    select j.ID
        , j.Data +
            -- Don't want "digit+employ" sentences, remove whitespace to search
            case when replace(replace(replace(replace(s.Sentence, char(9), ''), char(10), ''), char(13), ''), char(32), '') like '%[0-9]employ%' then '' else s.Sentence end
        , s.SentenceNumber
    from Joiner j
        join Splitter s on j.ID = s.ID and s.SentenceNumber = j.SentenceNumber + 1
)
-- Final Select
select a.ID, a.Data
from Joiner a
    join (
        -- Only get max SentenceNumber
        select ID, max(SentenceNumber) as SentenceNumber
        from Joiner
        group by ID
    ) b on a.ID = b.ID and a.SentenceNumber = b.SentenceNumber
order by a.ID, a.SentenceNumber

#3

Here is one way to do this. Please note that it only works if a single sentence in the description contains a number.

declare @d VARCHAR(1000) = 'Hello world. Todays department has 345 employees. Have a good day.'
declare @dr VARCHAR(1000)

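-- Reverse the string so the period that precedes the numbered sentence can be found with CHARINDEX
set @dr = REVERSE(@d)

SELECT REVERSE(RIGHT(@dr, LEN(@dr) - CHARINDEX('.', @dr, PATINDEX('%[0-9]%', @dr))))  -- text before the numbered sentence, minus its closing period
     + RIGHT(@d, LEN(@d) - CHARINDEX('.', @d, PATINDEX('%[0-9]%', @d)) + 1)           -- from the numbered sentence's period to the end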
