Extract X words around a given search string within a string

I am looking for a way to extract X number of words on either side of a given word in a search.

For example, if a user enters "inmate" as a search word and the MySQL query finds a post that contains "inmate" in its content, I would like to return not the entire post but just X words on either side of the match. That gives the user the gist of the post, and they can then decide whether to open the post and read it in full.

I am using PHP.

Thanks!

2 solutions

#1

You might not be able to fully solve this problem with regex. There are too many possibilities of other characters between the words...

But you can try this regex:

((?:\S+\s*){0,5}\S*inmate\S*(?:\s*\S+){0,5})

See it here: Rubular

You might also want to exclude certain characters, since they should not count as words. Right now the regex treats any sequence of non-space characters surrounded by whitespace as a word.

To match only real words:

((?:\w+\s*){0,5}<search word>(?:\s*\w+){0,5})

But here any non-word character (, " . etc.) breaks the match.

So you can go on...

((?:[\w"',.-]+\s*){0,5}["',.-]?<search word>["',.-]?(?:\s*[\w"',.-]+){0,5})

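This also matches up to 5 words on each side when those words contain any of the characters "',.- and it allows one of those characters directly before or after your search term.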
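
A quick way to see how the three patterns differ is to run each of them against the same sentence. The sketch below does that with "inmate" filled in for <search word>; the labels and variable names are only illustrative and not part of the answer.

```php
<?php
// Sample sentence in which the search word is wrapped in double quotes.
$text = 'For example, if a user enters "inmate" as a search word and the MySQL';

// The three patterns from this answer, with "inmate" as the search word.
// Inside single-quoted PHP strings only the apostrophe needs escaping (\').
$patterns = [
    'non-space runs'         => '/(?:\S+\s*){0,5}\S*inmate\S*(?:\s*\S+){0,5}/',
    'word characters only'   => '/(?:\w+\s*){0,5}inmate(?:\s*\w+){0,5}/',
    'words plus punctuation' => '/(?:[\w"\',.-]+\s*){0,5}["\',.-]?inmate["\',.-]?(?:\s*[\w"\',.-]+){0,5}/',
];

$results = [];
foreach ($patterns as $label => $pattern) {
    // preg_match() stores the first full match in $m[0].
    preg_match($pattern, $text, $m);
    $results[$label] = $m[0] ?? '';
    echo $label, ': ', $results[$label], PHP_EOL;
}
```

With this sentence, the first and third patterns return `example, if a user enters "inmate" as a search word and`, while the \w-only pattern returns just `inmate`, because the quotes around the search word stop it from reaching the neighbouring words.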

To use it in PHP:

$sourcestring = "For example, if a user enters \"inmate\" as a search word and the MySQL";
// Capture up to 5 whitespace-delimited words on each side of "inmate".
preg_match_all('/(?:\S+\s*){0,5}\S*inmate\S*(?:\s*\S+){0,5}/s', $sourcestring, $matches);
// $matches[0] holds every full match; further matches are in $matches[0][x].
if (!empty($matches[0])) {
    echo $matches[0][0];
}
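
Since the search word comes from the user, it probably should not be pasted into the pattern as-is. The following is a minimal sketch, not a definitive implementation: extractContext() is a hypothetical helper name, it reuses the \S-based pattern from above, and it assumes that case-insensitive matching (the i flag) is wanted. preg_quote() is the standard PHP function for escaping regex metacharacters.

```php
<?php
// Hypothetical helper: returns every snippet containing $word together with
// up to $n whitespace-delimited words on each side of it.
function extractContext(string $text, string $word, int $n = 5): array
{
    // Escape regex metacharacters in the user input; the second argument
    // makes preg_quote() escape the '/' delimiter as well.
    $quoted = preg_quote($word, '/');

    // Same shape as the pattern above, with the word count parameterised.
    // The 'i' flag (case-insensitive) is an assumption, not part of the answer.
    $pattern = '/(?:\S+\s*){0,' . $n . '}\S*' . $quoted . '\S*(?:\s*\S+){0,' . $n . '}/i';

    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

$post = 'For example, if a user enters "inmate" as a search word and the MySQL';
foreach (extractContext($post, 'inmate', 2) as $snippet) {
    echo $snippet, PHP_EOL; // user enters "inmate" as a
}
```

Returning an array keeps the case of several hits in one post simple; the caller can show only the first snippet or join a few of them.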

#2

I would use this regex for PHP, which also takes UTF-8 characters into account:

'~(?:[\p{L}\p{N}\']+[^\p{L}\p{N}\']+){0,5}<search word>(?:[^\p{L}\p{N}\']+[\p{L}\p{N}\']+){0,5}~u'

In this case '~' is the delimiter, and the modifier 'u' at the end tells PHP to interpret both the pattern and the subject string as UTF-8.
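
As a rough sketch of how this pattern could be used, the helper below builds it from a search word and a word count. The name extractContextUtf8() and the sample sentence are assumptions made for illustration; the character classes are the ones from the pattern above.

```php
<?php
// Hypothetical helper built on the Unicode-aware pattern from this answer.
function extractContextUtf8(string $text, string $word, int $n = 5): array
{
    // Escape the user input, including the '~' delimiter.
    $quoted = preg_quote($word, '~');

    $token = '[\p{L}\p{N}\']+';   // a run of letters, digits, or apostrophes
    $gap   = '[^\p{L}\p{N}\']+';  // anything between two such runs

    $pattern = '~(?:' . $token . $gap . '){0,' . $n . '}'
             . $quoted
             . '(?:' . $gap . $token . '){0,' . $n . '}~u';

    // preg_match_all() returns false on error, e.g. when $text is not valid UTF-8.
    if (preg_match_all($pattern, $text, $matches) === false) {
        return [];
    }
    return $matches[0];
}

$post = 'Gestern früh wurde der Häftling aus der Zelle entlassen.';
print_r(extractContextUtf8($post, 'Häftling', 2)); // one snippet: "wurde der Häftling aus der"
```

Unlike \w without the u modifier, \p{L} treats letters such as "ä" and "ü" as part of a word, so words with accented characters are counted as single words.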

See the documentation on Unicode regex properties here:

http://www.regular-expressions.info/refunicode.html
