I'm looking for a RegEx that returns the first [n] words in a paragraph or, if the paragraph contains fewer than [n] words, the complete paragraph.
For example, assuming I need, at most, the first 7 words:
<p>one two <tag>three</tag> four five, six seven eight nine ten.</p><p>ignore</p>
I'd get:
one two <tag>three</tag> four five, six seven
And the same RegEx on a paragraph containing fewer than the requested number of words:
<p>one two <tag>three</tag> four five.</p><p>ignore</p>
Would simply return:
one two <tag>three</tag> four five.
My attempt at the problem resulted in the following RegEx:
^(?:\<p.*?\>)((?:\w+\b.*?){1,7}).*(?:\</p\>)
However, this captures just the first word, "one", so it doesn't work. I think the .*? (after the \w+\b) is causing the problem.
Where am I going wrong? Can anyone present a RegEx that will work?
FYI, I'm using .NET 3.5's regex engine (via C#).
Many thanks
3 solutions
#1
7
OK, complete re-edit to acknowledge the new "spec" :)
I'm pretty sure you can't do that with one regex. The best tool definitely is an HTML parser. The closest I can get with regexes is a two-step approach.
First, isolate each paragraph's contents with:
<p>(.*?)</p>
You need to set RegexOptions.Singleline if paragraphs can span multiple lines.
Then, in the next step, iterate over your matches and apply the following regex once to each match's Groups[1].Value:
((?:(\S+\s+){1,6})\w+)
That will match the first seven items separated by spaces/tabs/newlines, ignoring any trailing punctuation or non-word characters.
But it will treat a tag separated by spaces as one of those items, i.e. in
One, two three <br\> four five six seven
it will only match up to "six". I guess that, regex-wise, there's no way around that.
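The two steps above can be wired together roughly as follows. This is only a sketch: the class and method names (FirstWordsSketch, FirstSevenPerParagraph) are made up for illustration, and falling back to the whole paragraph when the second regex finds no match (for example, a one-word paragraph) is an assumption rather than part of the approach described above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FirstWordsSketch
{
    // Step 1: capture the contents of each <p>...</p>.
    // Singleline lets '.' match newlines, so paragraphs may span lines.
    private static readonly Regex Paragraph =
        new Regex(@"<p>(.*?)</p>", RegexOptions.Singleline);

    // Step 2: up to six whitespace-terminated items, then one run of word characters.
    private static readonly Regex FirstSeven =
        new Regex(@"((?:(\S+\s+){1,6})\w+)");

    public static List<string> FirstSevenPerParagraph(string html)
    {
        var results = new List<string>();
        foreach (Match p in Paragraph.Matches(html))
        {
            string contents = p.Groups[1].Value;
            Match m = FirstSeven.Match(contents);
            // Assumption: keep the whole paragraph if the word regex does not match.
            results.Add(m.Success ? m.Groups[1].Value : contents);
        }
        return results;
    }
}
```

Note that a paragraph with fewer than seven items loses its trailing punctuation, since the pattern has to end on \w+.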
#2
0
- Use an HTML parser to get the first paragraph, flattening its structure (i.e. remove decorating HTML tags inside the paragraph).
- Search for the position of the nth whitespace character.
- Take the substring from 0 to that position.
Edit: I removed the regex proposal for steps 2 and 3, since it was wrong (thanks to the commenter). Also, the HTML structure needs to be flattened.
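Steps 2 and 3 might look something like the sketch below, assuming the paragraph has already been flattened to plain text by a parser. The names (NthWhitespaceSketch, FirstWords) are hypothetical. It also deviates slightly from the literal steps: it cuts at the whitespace that ends the nth word rather than at the nth raw whitespace character, so that runs of spaces are not counted twice.

```csharp
using System;

public static class NthWhitespaceSketch
{
    // Returns the text up to (not including) the whitespace that follows the
    // nth word, or the whole trimmed text if it has fewer than n words.
    public static string FirstWords(string flatText, int n)
    {
        if (n <= 0) return string.Empty;
        string text = flatText.Trim();
        int wordsSeen = 0;
        bool inWord = false;
        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsWhiteSpace(text[i]))
            {
                // A word just ended; stop once it is the nth one.
                if (inWord && ++wordsSeen == n) return text.Substring(0, i);
                inWord = false;
            }
            else
            {
                inWord = true;
            }
        }
        return text;
    }
}
```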
#3
0
I had the same problem and combined a few Stack Overflow answers into this class. It uses the HtmlAgilityPack, which is a better tool for the job. Call:
Words(string html, int n)
to get the first n words.
using HtmlAgilityPack;
using System;
using System.Linq;
using System.Text;

namespace UmbracoUtilities
{
    public class Text
    {
        /// <summary>
        /// Returns the first n words of the text content of an HTML string.
        /// </summary>
        /// <param name="html">The HTML to extract words from.</param>
        /// <param name="n">The maximum number of words to return.</param>
        /// <returns>At most n words, separated by single spaces.</returns>
        public static string Words(string html, int n)
        {
            string text = StripHtml(html);
            return GetNWords(text, n);
        }

        /// <summary>
        /// Returns the first n words in text.
        /// Assumes text is not an HTML string.
        /// http://*.com/questions/13368345/get-first-250-words-of-a-string
        /// </summary>
        /// <param name="text">Plain text.</param>
        /// <param name="n">The maximum number of words to return.</param>
        /// <returns>At most n words, separated by single spaces.</returns>
        public static string GetNWords(string text, int n)
        {
            // A null separator splits on any whitespace; RemoveEmptyEntries
            // collapses runs of whitespace and ignores leading/trailing spaces,
            // so Take(n) yields exactly n words (the original Take(n + 1) relied
            // on an empty first entry and returned one word too many otherwise).
            string[] words = text
                .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
                .Take(n)
                .ToArray();
            return string.Join(" ", words);
        }

        /// <summary>
        /// Returns the text of an HTML string with the tags removed.
        /// </summary>
        /// <param name="html">The HTML to strip.</param>
        /// <returns>The text of all leaf nodes, separated by spaces.</returns>
        public static string StripHtml(string html)
        {
            HtmlDocument document = new HtmlDocument();
            document.LoadHtml(html);
            var root = document.DocumentNode;
            var stringBuilder = new StringBuilder();
            foreach (var node in root.DescendantsAndSelf())
            {
                // Only leaf nodes carry text of their own.
                if (!node.HasChildNodes)
                {
                    string text = node.InnerText;
                    if (!string.IsNullOrEmpty(text))
                        stringBuilder.Append(" " + text.Trim());
                }
            }
            return stringBuilder.ToString().Trim();
        }
    }
}
Merry Christmas!