T-SQL: remove img tags with a specific src from HTML

Date: 2022-01-12 00:15:43

I have HTML text in my database with many img tags. My goal is to remove the img tags that have a specific src.

My input is:

<div>
    <p>some text goes here <img width="100" src="/upload/remove-me.png" /></p>
    <p>some other text goes here <img height="100" src='/upload/remove-me.png' width="200" /></p>    
    <p>some other text goes here <img src="/upload/filename.png" /></p>
</div>

I'd like to remove all images where src="/upload/remove-me.png", so that my output is:

<div>
    <p>some text goes here</p>
    <p>some other text goes here</p>
    <p>some other text goes here <img src="/upload/filename.png" /></p>
</div>

Is there any way to do this with regex in T-SQL?

4 solutions

#1


1  

From your example it seems the tags can have their attributes in any order, so we need to loop through the text and take out the img tags one at a time. You will want to try this on a backed-up copy of your data first to make sure it removes only what you intend:

declare @HTML table(a nvarchar(max)) 
insert into @HTML
select 
'<div>
    <p>some text goes here <img width="100" src="/upload/remove-me.png" /></p>
    <p>some other text goes here <img height="100" src="/upload/remove-me.png" width="200" /></p>    
    <p>some other text goes here <img src="/upload/filename.png" /></p>
</div>'


declare @URL nvarchar(50) = 'src="/upload/remove-me.png"'   -- Search for img tags containing this text.
declare @TagStart int
declare @TagLen int

-- Find the start of the first img tag that is followed by the URL. Returns 0 if there is none.
-- Caveat: patindex anchors on the first '<img' that has @URL anywhere after it,
-- so this assumes no unrelated img tag comes before a matching one.
select @TagStart = patindex('%<img%' + @URL + '%/>%',a)
from @HTML

while @TagStart > 0
begin
    -- Length of the tag, up to and including its closing '/>'.
    select @TagLen = patindex('%/>%',substring(a,@TagStart,999999999))+1
    from @HTML

    update @HTML                -- Update the table to remove just this tag.
    set a = stuff(a,@TagStart,@TagLen,'')

    select @TagStart = patindex('%<img%' + @URL + '%/>%',a)     -- Check if there are any more img tags with the URL to remove.
    from @HTML
end

select a as CleanHTML
from @HTML

#2


2  

XML DML gives a more elegant solution, provided the stored HTML is well-formed XML. Most probably your main table stores the HTML in an (n)varchar(max) column, so a temporary table with an xml column is necessary.

declare @HTML table(id int, a xml) 
insert into @HTML
select id, html
from dbo.myTable
/* content of html field
'<div>
    <p>some text goes here <img width="100" src="/upload/remove-me.png" /></p>
    <p>some other text goes here <img height="100" src="/upload/remove-me.png" width="200" /></p>    
    <p>some other text goes here <img src="/upload/filename.png" /></p>
</div>'
*/
update @HTML
set a.modify('delete //img[contains(@src,"remove-me")]') --delete the matching nodes in place

--select * from @HTML --just to see what happens
update t
set html = cast(h.a as nvarchar(max)) --xml has to be cast back to text explicitly
from dbo.myTable t
inner join @HTML h on t.id = h.id
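
To try the XML DML delete on its own before touching a table, a minimal sketch on a single xml variable could look like the following. It reuses the sample markup from the question. The variable name @x and the exact-match predicate on @src (instead of contains()) are choices made here for illustration, not part of the answer above. Note that the xml type may normalize whitespace and attribute quoting when the value is cast back to text.

```sql
-- Sketch: run the delete against one xml variable holding the sample markup.
DECLARE @x xml = N'<div>
    <p>some text goes here <img width="100" src="/upload/remove-me.png" /></p>
    <p>some other text goes here <img height="100" src=''/upload/remove-me.png'' width="200" /></p>
    <p>some other text goes here <img src="/upload/filename.png" /></p>
</div>';

-- Delete every img element whose src attribute equals the target path exactly.
SET @x.modify('delete //img[@src="/upload/remove-me.png"]');

-- Cast back to text to inspect the result.
SELECT CAST(@x AS nvarchar(max)) AS CleanHTML;
```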

#3


0  

If the img tag is constant as a whole (not just the src):

<img height="100" src='/upload/remove-me.png' width="200" />

then you can use a simple REPLACE, like this:

UPDATE tablename SET columnname=REPLACE(
  columnname,
  N' <img height="100" src=''/upload/remove-me.png'' width="200" />',
  N''
)
WHERE columnname LIKE N'% <img height="100" src=''/upload/remove-me.png'' width="200" />%'

The space before the tag is intentional. If the markup is stored in an ntext column, convert it to nvarchar(max) first; otherwise REPLACE will fail.
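
For the ntext case, a sketch of the same statement with the cast added might look like this. tablename and columnname are the same placeholders as in the statement above, and the column is assumed to be ntext.

```sql
-- Sketch: REPLACE does not accept ntext, so cast the column to nvarchar(max) first.
-- The result is converted back to the column's type on assignment.
UPDATE tablename SET columnname=REPLACE(
  CAST(columnname AS nvarchar(max)),
  N' <img height="100" src=''/upload/remove-me.png'' width="200" />',
  N''
)
WHERE columnname LIKE N'% <img height="100" src=''/upload/remove-me.png'' width="200" />%'
```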

If this is more than a one-off data correction, you should handle it in your business logic layer instead.

#4


0  

The following function should do the job. It finds the start and end of each img tag that contains the targeted image name and removes that tag from the text.

CREATE FUNCTION dbo.Html_RemoveImageAttributes   -- Use ALTER FUNCTION if it already exists.
(
    @sourceImage        NVARCHAR(100),
    @inputHtml          NVARCHAR(MAX)
)
RETURNS NVARCHAR(MAX)
AS
BEGIN

    DECLARE @outputHtml NVARCHAR(MAX) = @inputHtml;

    DECLARE @imageTagStart INT = CHARINDEX('<img ' , @outputHtml);
    DECLARE @imageIndex INT;
    DECLARE @imageTagEnd INT;

    -- Visit each img tag in turn; stop when there are none left.
    WHILE (@imageTagStart > 0)
    BEGIN

        SET @imageTagEnd = CHARINDEX('/>' , @outputHtml, @imageTagStart);
        IF (@imageTagEnd = 0) BREAK;    -- Unterminated tag: leave the rest untouched.

        SET @imageIndex = CHARINDEX(@sourceImage, @outputHtml, @imageTagStart);

        IF (@imageIndex > 0) AND (@imageIndex < @imageTagEnd)
        BEGIN

            -- The image name sits inside this tag: cut the tag out, including its '/>'.
            SET @outputHtml = STUFF(@outputHtml, @imageTagStart, @imageTagEnd - @imageTagStart + 2, '');

        END
        ELSE
        BEGIN

            -- Not a match: resume the search after this tag.
            SET @imageTagStart = @imageTagEnd + 2;

        END

        SET @imageTagStart = CHARINDEX('<img ' , @outputHtml, @imageTagStart);

    END


    RETURN @outputHtml

END

The following example shows how it can be used:

DECLARE @sourceImage NVARCHAR(50) = 'remove-me.png';
DECLARE @input NVARCHAR(4000) = N'<div>
    <p>some text goes here <img width="100" src="/upload/remove-me.png" /></p>
    <p>some other text goes here <img height="100" src=''/upload/remove-me.png'' width="200" /></p>    
    <p>some other text goes here <img src="/upload/filename.png" /></p>
</div>';

PRINT dbo.Html_RemoveImageAttributes(@sourceImage, @input);
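
To clean stored rows rather than a variable, the function could be applied in an UPDATE along these lines. This is only a sketch: dbo.myTable and its html column are assumed names borrowed from answer #2, not something this answer defines. The WHERE clause limits the update to rows that mention the image at all.

```sql
-- Sketch: apply the function to every row that mentions the image.
-- dbo.myTable and html are assumed names; substitute your own table and column.
UPDATE dbo.myTable
SET html = dbo.Html_RemoveImageAttributes(N'remove-me.png', html)
WHERE html LIKE N'%remove-me.png%';
```

As with the other answers, it is worth running this against a backup copy first.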
