I've got data in SQL Server 2005 that contains HTML tags, and I'd like to strip all of that out, leaving just the text between the tags. Ideally also replacing things like &lt; with <, etc.
Is there an easy way to do this or has someone already got some sample t-sql code?
I don't have the ability to add extended stored procs and the like, so would prefer a pure t-sql approach (preferably one backwards compatible with sql 2000).
I just want to retrieve the data with stripped out html, not update it, so ideally it would be written as a user-defined function, to make for easy reuse.
So for example converting this:
<B>Some useful text</B>
<A onclick="return openInfo(this)"
href="http://there.com/3ce984e88d0531bac5349"
target=globalhelp>
<IMG title="Source Description" height=15 alt="Source Description"
src="/ri/new_info.gif" width=15 align=top border=0>
</A>&gt; <b>more text</b></TD></TR>
to this:
Some useful text > more text
7 solutions
#1
127
There is a UDF that does this, described here:
User Defined Function to Strip HTML
CREATE FUNCTION [dbo].[udf_StripHTML] (@HTMLText VARCHAR(MAX))
RETURNS VARCHAR(MAX) AS
BEGIN
DECLARE @Start INT
DECLARE @End INT
DECLARE @Length INT
SET @Start = CHARINDEX('<',@HTMLText)
SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
SET @Length = (@End - @Start) + 1
WHILE @Start > 0 AND @End > 0 AND @Length > 0
BEGIN
SET @HTMLText = STUFF(@HTMLText,@Start,@Length,'')
SET @Start = CHARINDEX('<',@HTMLText)
SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
SET @Length = (@End - @Start) + 1
END
RETURN LTRIM(RTRIM(@HTMLText))
END
GO
Edit: note this is for SQL Server 2005, but if you change the keyword MAX to something like 4000, it will work in SQL Server 2000 as well.
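As a quick check, the function can be called on a literal or applied to a column in a SELECT. This is only a sketch: dbo.Articles, ArticleId, and Body are hypothetical names used for illustration and are not part of the original answer.

```sql
-- Ad-hoc check against a literal string
SELECT dbo.udf_StripHTML('<B>Some useful text</B> <b>more text</b>') AS PlainText;

-- Applied to a column while reading, without updating the stored data.
-- dbo.Articles, ArticleId and Body are hypothetical names.
SELECT ArticleId,
       dbo.udf_StripHTML(Body) AS PlainBody
FROM dbo.Articles;
```

Note that this version only removes tags; entities such as &gt; are left as they are.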
#2
8
If your HTML is well formed, I think this is a better solution:
create function dbo.StripHTML( @text varchar(max) ) returns varchar(max) as
begin
declare @textXML xml
declare @result varchar(max)
set @textXML = REPLACE( @text, '&', '' );
with doc(contents) as
(
select chunks.chunk.query('.') from @textXML.nodes('/') as chunks(chunk)
)
select @result = contents.value('.', 'varchar(max)') from doc
return @result
end
go
select dbo.StripHTML('This <i>is</i> an <b>html</b> test')
#3
7
Derived from @Goner Doug's answer, with a few things updated:
- using REPLACE where possible
- conversion of predefined entities such as &eacute; (I chose the ones I needed :-)
- some conversion of list tags <ul> and <li>
ALTER FUNCTION [dbo].[udf_StripHTML]
--by Patrick Honorez --- www.idevlop.com
--inspired by http://*.com/questions/457701/best-way-to-strip-html-tags-from-a-string-in-sql-server/39253602#39253602
(
@HTMLText varchar(MAX)
)
RETURNS varchar(MAX)
AS
BEGIN
DECLARE @Start int
DECLARE @End int
DECLARE @Length int
set @HTMLText = replace(@htmlText, '<br>',CHAR(13) + CHAR(10))
set @HTMLText = replace(@htmlText, '<br/>',CHAR(13) + CHAR(10))
set @HTMLText = replace(@htmlText, '<br />',CHAR(13) + CHAR(10))
set @HTMLText = replace(@htmlText, '<li>','- ')
set @HTMLText = replace(@htmlText, '</li>',CHAR(13) + CHAR(10))
-- Decode a selection of HTML entities (case-sensitive collation keeps &Eacute; and &eacute; distinct)
set @HTMLText = replace(@htmlText, '&rsquo;' collate Latin1_General_CS_AS, '''' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&quot;' collate Latin1_General_CS_AS, '"' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&amp;' collate Latin1_General_CS_AS, '&' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&euro;' collate Latin1_General_CS_AS, '€' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&lt;' collate Latin1_General_CS_AS, '<' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&gt;' collate Latin1_General_CS_AS, '>' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&oelig;' collate Latin1_General_CS_AS, 'oe' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&nbsp;' collate Latin1_General_CS_AS, ' ' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&copy;' collate Latin1_General_CS_AS, '©' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&laquo;' collate Latin1_General_CS_AS, '«' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&reg;' collate Latin1_General_CS_AS, '®' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&plusmn;' collate Latin1_General_CS_AS, '±' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&sup2;' collate Latin1_General_CS_AS, '²' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&sup3;' collate Latin1_General_CS_AS, '³' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&micro;' collate Latin1_General_CS_AS, 'µ' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&middot;' collate Latin1_General_CS_AS, '·' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&ordm;' collate Latin1_General_CS_AS, 'º' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&raquo;' collate Latin1_General_CS_AS, '»' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&frac14;' collate Latin1_General_CS_AS, '¼' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&frac12;' collate Latin1_General_CS_AS, '½' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&frac34;' collate Latin1_General_CS_AS, '¾' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&AElig;' collate Latin1_General_CS_AS, 'Æ' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&Ccedil;' collate Latin1_General_CS_AS, 'Ç' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&Egrave;' collate Latin1_General_CS_AS, 'È' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&Eacute;' collate Latin1_General_CS_AS, 'É' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&Ecirc;' collate Latin1_General_CS_AS, 'Ê' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&Ouml;' collate Latin1_General_CS_AS, 'Ö' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&agrave;' collate Latin1_General_CS_AS, 'à' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&acirc;' collate Latin1_General_CS_AS, 'â' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&auml;' collate Latin1_General_CS_AS, 'ä' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&aelig;' collate Latin1_General_CS_AS, 'æ' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&ccedil;' collate Latin1_General_CS_AS, 'ç' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&egrave;' collate Latin1_General_CS_AS, 'è' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&eacute;' collate Latin1_General_CS_AS, 'é' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&ecirc;' collate Latin1_General_CS_AS, 'ê' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&euml;' collate Latin1_General_CS_AS, 'ë' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&icirc;' collate Latin1_General_CS_AS, 'î' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&ocirc;' collate Latin1_General_CS_AS, 'ô' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&ouml;' collate Latin1_General_CS_AS, 'ö' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&divide;' collate Latin1_General_CS_AS, '÷' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&oslash;' collate Latin1_General_CS_AS, 'ø' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&ugrave;' collate Latin1_General_CS_AS, 'ù' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&uacute;' collate Latin1_General_CS_AS, 'ú' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&ucirc;' collate Latin1_General_CS_AS, 'û' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&uuml;' collate Latin1_General_CS_AS, 'ü' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&quot;' collate Latin1_General_CS_AS, '"' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&amp;' collate Latin1_General_CS_AS, '&' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&lsaquo;' collate Latin1_General_CS_AS, '<' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&rsaquo;' collate Latin1_General_CS_AS, '>' collate Latin1_General_CS_AS)
-- Remove anything between <STYLE> tags
SET @Start = CHARINDEX('<STYLE', @HTMLText)
SET @End = CHARINDEX('</STYLE>', @HTMLText, CHARINDEX('<', @HTMLText)) + 7
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')
SET @Start = CHARINDEX('<STYLE', @HTMLText)
SET @End = CHARINDEX('</STYLE>', @HTMLText, CHARINDEX('</STYLE>', @HTMLText)) + 7
SET @Length = (@End - @Start) + 1
END
-- Remove anything between <whatever> tags
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText))
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText))
SET @Length = (@End - @Start) + 1
END
RETURN LTRIM(RTRIM(@HTMLText))
END
#4
4
This is not a completely new solution but a correction for afwebservant's solution:
--note comments to see the corrections
CREATE FUNCTION [dbo].[StripHTML] (@HTMLText VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
DECLARE @Start INT
DECLARE @End INT
DECLARE @Length INT
--DECLARE @TempStr varchar(255) (this is not used)
SET @Start = CHARINDEX('<',@HTMLText)
SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
SET @Length = (@End - @Start) + 1
WHILE @Start > 0 AND @End > 0 AND @Length > 0
BEGIN
IF (UPPER(SUBSTRING(@HTMLText, @Start, 4)) <> '<BR>') AND (UPPER(SUBSTRING(@HTMLText, @Start, 5)) <> '</BR>')
begin
SET @HTMLText = STUFF(@HTMLText,@Start,@Length,'')
end
-- this ELSE and SET is important
ELSE
SET @Length = 0;
-- minus @Length here below is important
SET @Start = CHARINDEX('<',@HTMLText, @End-@Length)
SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText, @Start))
-- instead of -1 it should be +1
SET @Length = (@End - @Start) + 1
END
RETURN RTRIM(LTRIM(@HTMLText))
END
#5
4
Here's an updated version of this function that incorporates the RedFilter answer (Pinal's original) with LazyCoders' additions, goodeye's typo corrections, and my own addition to handle inline <STYLE> tags inside the HTML.
ALTER FUNCTION [dbo].[udf_StripHTML]
(
@HTMLText varchar(MAX)
)
RETURNS varchar(MAX)
AS
BEGIN
DECLARE @Start int
DECLARE @End int
DECLARE @Length int
-- Replace the HTML entity &amp; with the '&' character (this needs to be done first, as
-- '&' might be double encoded as '&amp;amp;')
SET @Start = CHARINDEX('&amp;', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '&')
SET @Start = CHARINDEX('&amp;', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
END
-- Replace the HTML entity &lt; with the '<' character
SET @Start = CHARINDEX('&lt;', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '<')
SET @Start = CHARINDEX('&lt;', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
END
-- Replace the HTML entity &gt; with the '>' character
SET @Start = CHARINDEX('&gt;', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '>')
SET @Start = CHARINDEX('&gt;', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
END
-- Replace any remaining &amp; entities with the '&' character
SET @Start = CHARINDEX('&amp;', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '&')
SET @Start = CHARINDEX('&amp;', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
END
-- Replace the HTML entity &nbsp; with the ' ' character
SET @Start = CHARINDEX('&nbsp;', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, ' ')
SET @Start = CHARINDEX('&nbsp;', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1
END
-- Replace any <br> tags with a newline
SET @Start = CHARINDEX('<br>', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, CHAR(13) + CHAR(10))
SET @Start = CHARINDEX('<br>', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
END
-- Replace any <br/> tags with a newline
SET @Start = CHARINDEX('<br/>', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, CHAR(13) + CHAR(10))
SET @Start = CHARINDEX('<br/>', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
END
-- Replace any <br /> tags with a newline
SET @Start = CHARINDEX('<br />', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, CHAR(13) + CHAR(10))
SET @Start = CHARINDEX('<br />', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1
END
-- Remove anything between <STYLE> tags
SET @Start = CHARINDEX('<STYLE', @HTMLText)
SET @End = CHARINDEX('</STYLE>', @HTMLText, CHARINDEX('<', @HTMLText)) + 7
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')
SET @Start = CHARINDEX('<STYLE', @HTMLText)
SET @End = CHARINDEX('</STYLE>', @HTMLText, CHARINDEX('</STYLE>', @HTMLText)) + 7
SET @Length = (@End - @Start) + 1
END
-- Remove anything between <whatever> tags
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText))
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText))
SET @Length = (@End - @Start) + 1
END
RETURN LTRIM(RTRIM(@HTMLText))
END
#6
3
Try this. It's a modified version of the one posted by RedFilter ... this SQL removes all tags except BR, B, and P with any accompanying attributes:
CREATE FUNCTION [dbo].[StripHtml] (@HTMLText VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
DECLARE @Start INT
DECLARE @End INT
DECLARE @Length INT
DECLARE @TempStr varchar(255)
SET @Start = CHARINDEX('<',@HTMLText)
SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
SET @Length = (@End - @Start) + 1
WHILE @Start > 0 AND @End > 0 AND @Length > 0
BEGIN
IF (UPPER(SUBSTRING(@HTMLText, @Start, 3)) <> '<BR') AND (UPPER(SUBSTRING(@HTMLText, @Start, 2)) <> '<P') AND (UPPER(SUBSTRING(@HTMLText, @Start, 2)) <> '<B') AND (UPPER(SUBSTRING(@HTMLText, @Start, 3)) <> '</B')
BEGIN
SET @HTMLText = STUFF(@HTMLText,@Start,@Length,'')
END
SET @Start = CHARINDEX('<',@HTMLText, @End)
SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText, @Start))
SET @Length = (@End - @Start) - 1
END
RETURN RTRIM(LTRIM(@HTMLText))
END
#7
1
How about using XQuery with a one-liner:
select @xml.query('for $x in //. return ($x)//text()')
This loops through all elements and returns the text() only.
To avoid text between elements concatenating without spaces, use:
SELECT @xml.query('for $x in //. return concat((($x)//text())[1]," ")')
These are excellent for when you want to build search phrases, strip HTML, etc.
Just note that this returns type xml, so CAST or CONVERT to text where appropriate. The xml version of this data type is useless, as it is not well-formed XML.
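One way to get a plain string back is to cast the query result. This is a minimal sketch, assuming the HTML is well formed enough to convert to xml. It uses the simpler path expression //text() in place of the FLWOR expression above, and the sample string and the @plain variable are made up for illustration.

```sql
DECLARE @xml xml;
DECLARE @plain varchar(max);

-- The conversion raises an error if the markup is not well-formed XML
SET @xml = CAST('<p>This <i>is</i> an <b>html</b> test</p>' AS xml);

-- query() returns xml, so cast the result to get a character string
SET @plain = CAST(@xml.query('//text()') AS varchar(max));

SELECT @plain AS PlainText;
```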