How do I remove HTML when using the SAS URL access method?

What is the most convenient way to remove all the HTML tags when using the SAS URL access method to read web pages?

2 solutions

#1

This should do what you want. It removes everything between < and >, including the angle brackets themselves, and leaves just the text content.

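filename INDEXIN URL "http://www.zug.com/";

data HTMLData;
    /* Read the page through the URL fileref, one line per observation */
    infile INDEXIN truncover lrecl=32767;
    input;
    length textline $32767;
    textline = _INFILE_;

    /*-- Clear out the HTML text --*/
    /* The pattern only sees one input line at a time, so a tag that spans lines is not removed */
    re1 = prxparse("s/<(.|\n)*?>//");
    call prxchange(re1, -1, textline);
run;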
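To check what the pattern does without fetching a page, here is a minimal sketch that runs the same substitution on a literal string. The sample markup and the 200-character length are made up for illustration.

```sas
/* Offline check of the tag-stripping pattern; the sample string is hypothetical */
data _null_;
    length textline $200;
    textline = '<p>Hello <b>world</b></p>';

    /* Same substitution as above: delete every <...> run, as many times as it occurs */
    re1 = prxparse("s/<(.|\n)*?>//");
    call prxchange(re1, -1, textline);

    /* Write the cleaned value to the log */
    put textline=;
run;
```

The log should show only the text Hello world, with all four tags removed.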

#2

I think the better approach is not to strip the HTML from the page, but to identify the standard patterns around the data you are trying to capture. This is the Perl regular-expression style of approach.

For example, the data or table you want might always start a fixed number of characters after the logo image. You could write a script that keeps only that data.
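As a rough sketch of that idea, the DATA step below keeps only the text that matches a known pattern instead of stripping tags. The target here (the text inside a <title> element that sits on a single line) is only a stand-in, since the real pattern depends on the page you are reading; the dataset name Titles and the variable names are likewise assumptions.

```sas
filename INDEXIN URL "http://www.zug.com/";

data Titles;
    infile INDEXIN truncover lrecl=32767;
    input;
    length title $200;

    /* Hypothetical target: whatever sits between <title> and </title> on one line */
    re = prxparse("/<title>(.*?)<\/title>/i");

    /* Keep the captured text only when the current line matches */
    if prxmatch(re, _infile_) then do;
        title = prxposn(re, 1, _infile_);
        output;
    end;

    keep title;
run;
```

Swapping in a different pattern, for instance one anchored on the logo image tag, follows the same shape: match the line, then pull out the capture group with prxposn.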

If you post some of the HTML, maybe we can help decode it.
