Windows - How to grep (or findstr) an HTML file and display only the first matching expression

Time: 2022-10-31 13:52:18

Using grep or findstr, I want to get the correct IMDB number when searching for a specific movie by its real name.

For example the movie "Das Boot" is listed at IMDB with movie number tt0082096.

Currently I'm trying to grep (or findstr) through HTML files that are generated by a search engine.

The generated html file contains several parts like this:

<div id="statbox"> 
  <span class="uschr2">1. </span> <a href="http://www.imdb.com/title/tt0082096/" class="dublaulink">Das Boot (1981) - IMDb</a> <br>
  <div id="descbox"> 
  www.imdb.com/title/tt0082096/ - Im Cache - Ähnliche Seiten <BR>
  </div>

The string I'm looking for is the one containing the URL of the movie. In this case it's:

http://www.imdb.com/title/tt0082096/

The string format is like:

http://www.imdb.com/title/tt???????/

where '?' stands for a single digit 0-9.

My question is: How can grep or findstr return only the first occurrence of the matching string itself and not the complete line containing a match?

Thank you a lot for your assistance! Best regards

2 Solutions

#1


3  

Windows findstr returns complete lines. You can avoid this with GNU sed:

sed -rn "\#http://www.imdb.com/title/tt#s#.*href=\"(.*)\"\s.*#\1#p" file
http://www.imdb.com/title/tt0082096/
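
The sed command above prints the URL from every matching line. A minimal sketch of a variant that stops after the first one is shown here. It assumes a POSIX shell with GNU sed (for example Git Bash, Cygwin, or WSL on Windows) and at most one IMDB link per line, as in the sample HTML; the function name first_imdb_url_sed is made up for illustration.

```shell
# first_imdb_url_sed FILE: print the first IMDB title URL found in FILE.
# Hypothetical helper; assumes a sed that supports -E (GNU sed does).
first_imdb_url_sed() {
  # The address selects lines containing a title URL with 7 digits.
  # On the first such line: print only the URL, then quit.
  sed -nE '\#http://www\.imdb\.com/title/tt[0-9]{7}/#{
    s#.*(http://www\.imdb\.com/title/tt[0-9]{7}/).*#\1#p
    q
  }' "$1"
}
```

Calling first_imdb_url_sed file should then print a single URL even when the file lists several movies.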

In addition, you can use grep -o:

  -o, --only-matching       show only the part of a line matching PATTERN
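
As a rough sketch of how grep -o could answer the question directly: match the URL format from the question and keep only the first hit with head. This assumes a POSIX shell with grep and head available (not plain cmd.exe); the function name first_imdb_url is hypothetical.

```shell
# first_imdb_url FILE: print the first IMDB title URL found in FILE.
# Hypothetical helper; assumes a grep that supports -o and -E.
first_imdb_url() {
  # -o prints only the matched text, -E enables extended regex syntax.
  # "tt" followed by exactly 7 digits follows the format in the question.
  # head -n 1 keeps only the first match, even if one line holds several.
  grep -oE 'http://www\.imdb\.com/title/tt[0-9]{7}/' "$1" | head -n 1
}
```

Because the pattern requires the http:// prefix, the plain-text copy of the address inside the descbox div is not matched.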

#2


2  

With grep you can do something like:

grep -oP '(?<=href=\")[^"]+(?=\")' html.file

This is not the ideal way of parsing an HTML file. However, if it is a one-off task, you can probably get away with it. (?<=href=\") is a lookbehind assertion: it requires href=" to appear directly before the match without including it in the output. If the command above returns too many results, you can probably add something to the pattern that is unique to the URL lines.
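
One way to narrow the pattern, sketched under the assumption that GNU grep was built with PCRE support (the -P option): put the IMDB URL format from the question between the lookbehind and the lookahead, and pipe through head to keep only the first match. The function name first_imdb_href is invented for this example.

```shell
# first_imdb_href FILE: print the first href value that is an IMDB title URL.
# Hypothetical helper; assumes GNU grep with -P (PCRE) support.
first_imdb_href() {
  # (?<=href=") : lookbehind, the match must directly follow href="
  # (?=")       : lookahead, the match must be followed by the closing quote
  # Neither assertion is part of the printed match.
  grep -oP '(?<=href=")http://www\.imdb\.com/title/tt[0-9]{7}/(?=")' "$1" | head -n 1
}
```

Other links in the file, such as cache or "similar pages" links, no longer show up because they do not match the title URL format.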
