In my dataset, I have a field which stores text marked up with HTML. The general format is as follows:
<html><head></head><body><p>My text.</p></body></html>
I could attempt to solve the problem by doing the following:
REPLACE(REPLACE(Table.HtmlData, '<html><head></head><body><p>', ''), '</p></body></html>', '')
However, this is not a strict rule, as some entries break W3C standards and, for example, do not include <head>
tags. Even worse, closing tags may be missing. So I would need to include a REPLACE
call for each opening and closing tag that could exist.
REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
Table.HtmlData,
'<html>', ''),
'</html>', ''),
'<head>', ''),
'</head>', ''),
'<body>', ''),
'</body>', ''),
'<p>', ''),
'</p>', '')
I was wondering if there was a better way to accomplish this than using multiple nested REPLACE
functions. Unfortunately, the only languages I have available in this environment are SQL and Visual Basic (not .NET).
7 solutions
#1
7
DECLARE @x XML = '<html><head></head><body><p>My text.</p></body></html>'
SELECT t.c.value('.', 'NVARCHAR(MAX)')
FROM @x.nodes('*') t(c)
Update - For strings with unclosed tags:
DECLARE @x NVARCHAR(MAX) = '<html><head></head><body><p>My text.<br>More text.</p></body></html>'
SELECT x.value('.', 'NVARCHAR(MAX)')
FROM (
SELECT x = CAST(REPLACE(REPLACE(@x, '>', '/>'), '</', '<') AS XML)
) r
#2
5
If the HTML is well formed then there's no need to use replace to parse XML.
Just cast or convert it to an XML type and get the value(s).
Here's an example to output the text from all tags:
declare @htmlData nvarchar(100) = '<html>
<head>
</head>
<body>
<p>My text.</p>
<p>My other text.</p>
</body>
</html>';
select convert(XML,@htmlData,1).value('.', 'nvarchar(max)');
select cast(@htmlData as XML).value('.', 'nvarchar(max)');
Note that cast and convert handle whitespace differently: CONVERT with style 1 preserves insignificant whitespace, while CAST discards it.
To only get content from a specific node, the XQuery syntax is used. (XQuery is based on the XPath syntax)
For example:
select cast(@htmlData as XML).value('(//body/p/node())[1]', 'nvarchar(max)');
select convert(XML,@htmlData,1).value('(//body/p/node())[1]', 'nvarchar(max)');
Result : My text.
Of course, this still assumes a valid XML.
If for example, a closing tag is missing then this would raise an XML parsing
error.
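One way to guard against that parsing error is to attempt the cast with TRY_CAST and treat a NULL result as "not well formed". This is a minimal sketch, assuming SQL Server 2012 or later (where TRY_CAST is available). The variable names and sample values are illustrative and not part of the original answer.

```sql
-- Assumption: SQL Server 2012+; @good and @bad are illustrative sample values.
DECLARE @good NVARCHAR(MAX) = N'<html><body><p>My text.</p></body></html>';
DECLARE @bad  NVARCHAR(MAX) = N'<html><body><p>My text.</body></html>';  -- missing </p>

-- TRY_CAST yields NULL instead of raising an error when the cast fails.
DECLARE @goodXml XML = TRY_CAST(@good AS XML);
DECLARE @badXml  XML = TRY_CAST(@bad  AS XML);

SELECT @goodXml.value('.', 'NVARCHAR(MAX)') AS GoodText,
       CASE WHEN @badXml IS NULL
            THEN 'not well formed'
            ELSE 'ok' END AS BadStatus;
```

Rows that come back as NULL could then be routed to one of the string-based fallbacks instead of failing the whole query.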
If the HTML isn't well formed as an XML, then one could use PATINDEX & SUBSTRING to get the first p tag. And then cast that to an XML type to get the value.
select cast(SUBSTRING(@htmlData,patindex('%<p>%',@htmlData),patindex('%</p>%',@htmlData) - patindex('%<p>%',@htmlData)+4) as xml).value('.','nvarchar(max)');
or via a funky recursive way:
declare @xmlData nvarchar(100);
WITH Lines(n, x, y) AS (
SELECT 1, 1, CHARINDEX(char(13), @htmlData)
UNION ALL
SELECT n+1, y+1, CHARINDEX(char(13), @htmlData, y+1) FROM Lines
WHERE y > 0
)
SELECT @xmlData = concat(@xmlData,SUBSTRING(@htmlData,x,IIF(y>0,y-x,8)))
FROM Lines
where PATINDEX('%<p>%</p>%', SUBSTRING(@htmlData,x,IIF(y>0,y-x,10))) > 0
order by n;
select
@xmlData as xmlData,
convert(XML,@xmlData,1).value('(/p/node())[1]', 'nvarchar(max)') as FirstP;
#3
2
Firstly create a user defined function that strips the HTML out like so:
CREATE FUNCTION [dbo].[udf_StripHTML] (@HTMLText VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
DECLARE @Start INT;
DECLARE @End INT;
DECLARE @Length INT;
SET @Start = CHARINDEX('<', @HTMLText);
SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText));
SET @Length = (@End - @Start) + 1;
WHILE @Start > 0
AND @End > 0
AND @Length > 0
BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '');
SET @Start = CHARINDEX('<', @HTMLText);
SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText));
SET @Length = (@End - @Start) + 1;
END;
RETURN LTRIM(RTRIM(@HTMLText));
END;
GO
When you're trying to select it:
SELECT dbo.udf_StripHTML([column]) FROM SOMETABLE
This should let you avoid nesting several REPLACE calls.
Credit and further info: http://blog.sqlauthority.com/2007/06/16/sql-server-udf-user-defined-function-to-strip-html-parse-html-no-regular-expression/
#4
1
One more solution, just to demonstrate a trick for replacing many values that are held in a table (easy to maintain!) in a single statement:
--add any replace templates here:
CREATE TABLE ReplaceTags (HTML VARCHAR(100));
INSERT INTO ReplaceTags VALUES
('<html>'),('<head>'),('<body>'),('<p>'),('<br>')
,('</html>'),('</head>'),('</body>'),('</p>'),('</br>');
GO
--This function will perform the "trick"
CREATE FUNCTION dbo.DoReplace(@Content VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
SELECT @Content=REPLACE(@Content,HTML,'')
FROM ReplaceTags;
RETURN @Content;
END
GO
--All examples I found in your question and in comments
DECLARE @content TABLE(Content VARCHAR(MAX));
INSERT INTO @content VALUES
('<html><head></head><body><p>My text.</p></body></html>')
,('<html><head></head><body><p>My text.<br>More text.</p></body></html>')
,('<html><head></head><body><p>My text.<br>More text.</p></body></html>')
,('<html><head></head><body><p>My text.</p></html>');
--this is the actual query
SELECT dbo.DoReplace(Content) FROM @content;
GO
--Clean-Up
DROP FUNCTION dbo.DoReplace;
DROP TABLE ReplaceTags;
UPDATE
If you add a replacement-value column to the template table, you could even use a different replacement for each tag, such as replacing a <br>
with an actual line break...
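A minimal sketch of that idea follows. The table name ReplaceTags2, the column name Replacement, the function name DoReplace2, and the choice of CHAR(13)+CHAR(10) as the line break are all hypothetical and not part of the original answer. Like the function above, it relies on a variable being reassigned once per row in a multi-row SELECT, a behavior that generally works but is not documented as guaranteed.

```sql
-- Hypothetical variant of the template table: one replacement value per tag.
CREATE TABLE ReplaceTags2 (HTML VARCHAR(100), Replacement VARCHAR(100));
INSERT INTO ReplaceTags2 VALUES
 ('<html>', ''),('</html>', ''),('<head>', ''),('</head>', '')
,('<body>', ''),('</body>', ''),('<p>', ''),('</p>', '')
,('<br>', CHAR(13) + CHAR(10));   -- assumption: <br> becomes CR+LF
GO
-- Same "trick" as before, but the replacement is read from the table.
CREATE FUNCTION dbo.DoReplace2(@Content VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
    SELECT @Content = REPLACE(@Content, HTML, Replacement)
    FROM ReplaceTags2;
    RETURN @Content;
END
GO
SELECT dbo.DoReplace2('<html><head></head><body><p>My text.<br>More text.</p></body></html>');
GO
--Clean-Up
DROP FUNCTION dbo.DoReplace2;
DROP TABLE ReplaceTags2;
```

Keeping the replacement next to each tag means a new rule is a single INSERT rather than a change to the function.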
#5
0
This is just an example. You can use this in a script to remove any HTML tags:
DECLARE @VALUE VARCHAR(MAX),@start INT,@end int,@remove varchar(max)
SET @VALUE='<html itemscope itemtype="http://schema.org/QAPage">
<head>
<title>sql - Converting INT to DATE then using GETDATE on conversion? - Stack Overflow</title>
<html>
</html>
'
set @start=charindex('<',@value)
while @start>0
begin
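set @end=charindex('>',@VALUE,@start)               -- look for '>' only after the current '<'
if @end=0 break                                     -- unclosed tag: stop instead of looping forever
set @remove=substring(@VALUE,@start,@end-@start+1)  -- third argument is a length, not an end position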
set @value=replace(@value,@remove,'')
set @start=charindex('<',@value)
end
print @value
#6
0
This is the simplest way.
DECLARE @str VARCHAR(299)
SELECT @str = '<html><head></head><body><p>My text.</p></body></html>'
SELECT cast(@str AS XML).query('.').value('.', 'varchar(200)')
#7
0
You mention the XML is not always valid, but does it always contain the <p> and </p> tags?
In that case the following would work:
SUBSTRING(Table.HtmlData,
CHARINDEX('<p>', Table.HtmlData) + 3,
CHARINDEX('</p>', Table.HtmlData) - CHARINDEX('<p>', Table.HtmlData) - 3)
For finding all positions of a <p> within an HTML string, there's already a good post here: https://dba.stackexchange.com/questions/41961/how-to-find-all-positions-of-a-string-within-another-string
Alternatively I suggest using Visual Basic, as you mentioned that is also an option.