I want to extract page titles from a huge haystack, but the elements do not have any class or unique id, so I can't use a DOM parser here; I am aware I must use regular expressions. Here is an example of what I am trying to find:
<a href="http://example.com/xyz">
Series Hell In Heaven information
</a>
<a href="http://example.com/123">
Series What is going information
</a>
The output should be an array with:
[0] => Series Hell In Heaven information
[1] => Series What is going information
All series titles start with Series and end with information. From a huge string containing many other things, I only want to extract the titles. Currently I am trying to use a regex, but it's not working. Here's what I am doing right now:
$reg = "/^Series\..*information$/";
$str = $html;
preg_match_all($reg, $str, $matches);
echo "<pre>";
print_r($matches);
echo "</pre>";
I don't know much about writing regular expressions. Help would be appreciated. Thanks.
2 solutions
#1
1
Try
preg_match_all('/(Series.+?information)/', $str, $matches );
as demonstrated at
https://regex101.com/r/oJ0jZ4/1
As I said in the comments, remove the literal dot (\.) and the start and end anchors (^ and $). I would also use a non-greedy quantifier that requires at least one character: .+?
Otherwise (for instance with .*?) you could match this:
Seriesinformation
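To make the difference concrete, here is a minimal sketch that runs the question's original pattern and the corrected one against the sample markup. The variable names ($original, $fixed, $m1, $m2) are only illustrative and not part of the original post.

```php
<?php
// Sample input copied from the question.
$html = '<a href="http://example.com/xyz">
Series Hell In Heaven information
</a>
<a href="http://example.com/123">
Series What is going information
</a>';

// Original pattern: without the m flag, ^ and $ anchor to the whole string,
// and \. demands a literal dot directly after "Series".
$original = preg_match_all('/^Series\..*information$/', $html, $m1);

// Corrected pattern: no anchors, no literal dot, lazy .+? between the words.
$fixed = preg_match_all('/Series.+?information/', $html, $m2);

var_dump($original); // int(0)
print_r($m2[0]);     // the two titles from the sample markup
```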
If the casing of Series or information may vary, such as
Series .... Information
add the i flag, as in:
preg_match_all('/(Series.+?information)/i', $str, $matches );
The outer capture group isn't really needed, but I think it looks nicer with it in there. If you just want the variable content without Series or information, move the capture group ( ) to that part:
preg_match_all('/Series(.+?)information/i', $str, $matches );
Note that you'll want to trim() the match, because it will likely have spaces at the beginning and end. Alternatively, add the whitespace to the regex like this:
preg_match_all('/Series\s(.+?)\sinformation/i', $str, $matches );
But that will no longer match Series information, where only a single space separates the two words.
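As a rough illustration of the two options above, the following sketch captures only the middle part and trims it, then checks the single-space case against the \s variant. The sample strings and the $titles and $singleSpace names are assumptions made for the example.

```php
<?php
$html = "<a href=\"http://example.com/xyz\">\nSeries Hell In Heaven information\n</a>";

// Capture only the text between the two marker words.
preg_match_all('/Series(.+?)information/i', $html, $matches);

// $matches[1] holds the first capture group; trim the surrounding spaces.
$titles = array_map('trim', $matches[1]);
print_r($titles);

// The \s variant needs whitespace on both sides of at least one character,
// so a bare "Series information" is not matched.
$singleSpace = preg_match('/Series\s(.+?)\sinformation/i', 'Series information');
var_dump($singleSpace); // int(0)
```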
If you want to be sure a single match doesn't run past an information, for example in
[Series Hell In Heaven information Series Hell In Heaven information]
so that the whole string isn't returned as one match, you can use a positive lookbehind (although the lazy .+? already stops at the first information):
preg_match_all('/(Series.+?(?<=information))/i', $str, $matches );
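The following sketch compares three patterns on the doubled string: a greedy one, the lazy one, and the lookbehind one. It is only meant to show which of them run past the first information; the variable names are illustrative.

```php
<?php
$str = 'Series Hell In Heaven information Series Hell In Heaven information';

// Greedy .+ backtracks from the end, so it stops at the LAST "information".
$greedyCount = preg_match_all('/Series.+information/i', $str, $greedy);

// Lazy .+? stops at the FIRST "information" after each "Series".
$lazyCount = preg_match_all('/Series.+?information/i', $str, $lazy);

// Lazy .+? plus a lookbehind asserting the match ends in "information".
$lookCount = preg_match_all('/(Series.+?(?<=information))/i', $str, $look);

var_dump($greedyCount);           // int(1): the whole string
var_dump($lazyCount, $lookCount); // int(2) and int(2)
```

On this input the lazy pattern and the lookbehind pattern return the same two matches, so the lookbehind is mainly a matter of preference here.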
Conversely, if there is a possibility that a title contains the word information twice, as in
<a href="http://example.com/123">
Series information is power information
</a>
you can do this:
preg_match_all('/(Series[^<]+)</i', $str, $matches );
which will match up to the < of the closing </a>. The captured text may include the trailing line break, so trim() it.
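Here is a small sketch of that case, assuming the title sits directly inside the a tag as in the sample. It contrasts the lazy pattern, which stops at the first information, with the negated character class, which runs up to the next <.

```php
<?php
$html = "<a href=\"http://example.com/123\">\nSeries information is power information\n</a>";

// Lazy pattern: stops at the first "information", cutting the title short.
preg_match('/Series.+?information/i', $html, $lazy);

// Negated class: consumes everything that is not "<", including the newline.
preg_match('/(Series[^<]+)</i', $html, $m);
$title = trim($m[1]);

var_dump($lazy[0]); // string(18) "Series information"
var_dump($title);   // string(39) "Series information is power information"
```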
As a side note, you could use the phpQuery library (which is a DOM parser) and look for an a tag that contains those words.
https://github.com/punkave/phpQuery
And
https://code.google.com/archive/p/phpquery/wikis/Manual.wiki
using something like:
$tags = $doc->find("a:contains('Series')")->text();
This is an excellent library for parsing HTML.
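If adding a third-party library is not an option, the same idea can be sketched with PHP's built-in DOM extension instead of phpQuery. Note that this swaps in DOMDocument and DOMXPath, which the answer itself does not mention, and the extra "Unrelated link" element is made up for the example.

```php
<?php
$html = '<a href="http://example.com/xyz">
Series Hell In Heaven information
</a>
<a href="http://example.com/other">Unrelated link</a>
<a href="http://example.com/123">
Series What is going information
</a>';

// Collect parser warnings internally instead of printing them,
// since real-world HTML is rarely valid.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$titles = [];
// Select every <a> element whose text contains the word "Series".
foreach ($xpath->query("//a[contains(., 'Series')]") as $a) {
    $titles[] = trim($a->textContent);
}

print_r($titles);
```

This needs no class or id on the elements, because the XPath filter works on the text content alone.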
#2
1
Try this:
$str = '<a href="http://example.com/xyz">
Series Hell In Heaven information
</a>
<a href="http://example.com/123">
Series What is going information
</a>';
preg_match_all('/Series(.*?)information/', $str, $matches);
echo "<pre>";
print_r($matches);
echo "</pre>";
The captures will be in $matches[1]. Basically, your regex does not match because of the \. (a literal dot).
[EDIT]
If you also need the words Series and information, then you don't need a capture group; just use /Series.*?information/ and find the matches in $matches[0].
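For reference, this short sketch shows how preg_match_all lays out its result array in the default PREG_PATTERN_ORDER mode when the pattern has one capture group. The two-line $str is a shortened stand-in for the sample markup.

```php
<?php
$str = "Series Hell In Heaven information\nSeries What is going information";

preg_match_all('/Series(.*?)information/', $str, $matches);

// Index 0: full matches, including the words Series and information.
print_r($matches[0]);

// Index 1: contents of the first (and only) capture group, spaces included.
print_r($matches[1]);

// With a single group there is no further index.
var_dump(count($matches)); // int(2)
```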