What is the most convenient way to remove all the HTML tags when using the SAS URL access method to read web pages?
2 solutions
#1
This should do what you want. It removes everything between < and >, including the angle brackets themselves, and leaves just the text content (aka innerHTML). Because the DATA step reads the page one line at a time, a tag that spans several lines is not removed.
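filename INDEXIN URL "http://www.zug.com/";

data HTMLData;
  infile INDEXIN lrecl=32767 truncover;
  length textline $32767;
  retain re1;
  /*-- Compile the regex once, on the first iteration --*/
  if _n_ = 1 then re1 = prxparse("s/<(.|\n)*?>//");
  input;
  textline = _infile_;
  /*-- Clear out the HTML tags (-1 = replace every match) --*/
  call prxchange(re1, -1, textline);
  drop re1;
run;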
#2
I think the better approach is not to strip the HTML from the page but to identify the standard patterns around the data you are trying to capture. This is the Perl regular expression way of doing it.
An example might be a table or other data that always appears a fixed number of characters after the logo image. You could write a script that keeps only that data.
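As a rough sketch of that pattern-based idea, the step below keeps only the text found inside a particular cell instead of stripping tags. The `td class="price"` markup, the sample lines, and the names `extracted`, `re`, and `price` are all hypothetical, made up for illustration; a real page would need its own pattern.

```sas
/* Hypothetical sample lines stand in for a fetched page. */
data extracted;
  length textline $200 price $20;
  retain re;
  /* Compile the pattern once; capture group 1 holds the value to keep. */
  if _n_ = 1 then re = prxparse('/<td class="price">\s*([^<]+?)\s*<\/td>/i');
  input;
  textline = _infile_;
  /* Output a row only when the line contains the pattern. */
  if prxmatch(re, textline) then do;
    price = prxposn(re, 1, textline);
    output;
  end;
  keep price;
  datalines4;
<tr><td class="name">Widget</td><td class="price"> 19.99 </td></tr>
<p>No data on this line</p>
<tr><td class="name">Gadget</td><td class="price">5.00</td></tr>
;;;;
run;
```

To run this against a live page, the in-stream data would be replaced by an `infile` statement pointing at a URL fileref. Lines that do not match the pattern are simply skipped, so the rest of the markup never needs to be cleaned.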
If you post some of the HTML, maybe we can help decode it.