Regex capture string between delimiters and excluding them

Time: 2022-02-09 09:57:33

I saw an answer on this forum that is close to what I need, but not quite enough (Regexp to capture string between delimiters).


My question is: I have an HTML page and I would like to get only the src of every "img" tag on the page and put them in one array, without using cheerio (I'm using Node.js).


The problem is that I would prefer to exclude the delimiters. How can I solve this?


1 solution

#1



Yes, this is possible with a regex, but it would be much easier (and probably faster, though I haven't measured it) to use a native DOM method. Let's start with the regex approach. A capture group lets us pull out the src of an img tag without the surrounding quotes:


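const html = `test<div>hello</div>
<img src="first">
<img class="test" src="second" data-lang="en">
test
<img src="third" >`;

// Group 1 remembers the opening quote so \1 can require the same closing quote;
// group 2 is the src value itself, without the delimiters.
const re = /<img\b[^<>]*?\bsrc\s*=\s*(['"])(.*?)\1[^<>]*>/gi;
const srcs = Array.from(html.matchAll(re), m => m[2]);

console.log(srcs); // [ 'first', 'second', 'third' ]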
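
To make the difference between the whole match (delimiters included) and the captured value (delimiters excluded) explicit, here is a minimal sketch. The helper name extractImgSrcs and the sample string are made up for illustration, and String.prototype.matchAll assumes Node 12 or later. It is a regex heuristic, not an HTML parser, so unusual markup (unquoted attributes, a ">" inside an attribute value) may slip through.

```javascript
// Hypothetical helper, not part of the original answer:
// collects every <img> src found in an HTML string.
function extractImgSrcs(html) {
  // Group 1: the opening quote (' or "), reused as \1 for the closing quote.
  // Group 2: the text between the quotes, i.e. the src without delimiters.
  const pattern = /<img\b[^<>]*?\bsrc\s*=\s*(['"])(.*?)\1[^<>]*>/gi;
  const results = [];
  for (const match of html.matchAll(pattern)) {
    // match[0] is the full <img ...> tag; match[2] is only the captured value.
    results.push({ whole: match[0], src: match[2] });
  }
  return results;
}

const sample = `<img src="first"> <img class="test" src='second' data-lang="en">`;
const found = extractImgSrcs(sample);
console.log(found.map(r => r.src));   // only the values between the quotes
console.log(found.map(r => r.whole)); // the complete tags, for comparison
```

Reading the capture group instead of the full match is what drops the delimiters; nothing has to be stripped afterwards.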

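However, when a DOM is available (for example in a browser), the better way is to use getElementsByTagName. Note that img.src returns the resolved absolute URL, so with the relative, made-up srcs used here you will get them prefixed with the page's base URL, but you get the idea: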


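// Array.from turns the live HTMLCollection into a plain array and maps it in one step.
var srcs = Array.from(document.getElementsByTagName('img'), img => img.src);

console.log(srcs);

HTML used for this snippet:

test<div>hello</div>
<img src="first">
<img class="test" src="second" data-lang="en">
test
<img src="third" >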
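
If you want each src exactly as it is written in the markup rather than the resolved URL, getAttribute('src') is one option. A small sketch: collectImgSrcs is a hypothetical name, and it takes the document as a parameter so that it works with any object exposing getElementsByTagName (the browser's document, or a DOM implementation in Node if you have one).

```javascript
// Hypothetical helper, not part of the original answer.
// doc is assumed to be a DOM Document (or anything with getElementsByTagName).
function collectImgSrcs(doc) {
  // getAttribute('src') returns the attribute text as written,
  // whereas the img.src property returns the resolved absolute URL.
  return Array.from(doc.getElementsByTagName('img'), img => img.getAttribute('src'));
}

// In a browser you would call it like this:
// console.log(collectImgSrcs(document));
```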
