Need a RegEx to return the first paragraph or the first n words

Time: 2021-10-28 21:35:59

I'm looking for a RegEx that returns either the first [n] words of a paragraph or, if the paragraph contains fewer than [n] words, the complete paragraph.

For example, assuming I need, at most, the first 7 words:

<p>one two <tag>three</tag> four five, six seven eight nine ten.</p><p>ignore</p>

I'd get:

one two <tag>three</tag> four five, six seven

And the same RegEx on a paragraph containing fewer than the requested number of words:

<p>one two <tag>three</tag> four five.</p><p>ignore</p>

Would simply return:

one two <tag>three</tag> four five.

My attempt at the problem resulted in the following RegEx:

^(?:\<p.*?\>)((?:\w+\b.*?){1,7}).*(?:\</p\>)

However, this returns just the first word - "one". It doesn't work. I think the .*? (after the \w+\b) is causing problems.

Where am I going wrong? Can anyone present a RegEx that will work?

FYI, I'm using .NET 3.5's Regex engine (via C#).

Many thanks

3 solutions

#1


7  

OK, complete re-edit to acknowledge the new "spec" :)

I'm pretty sure you can't do that with one regex. The best tool definitely is an HTML parser. The closest I can get with regexes is a two-step approach.

First, isolate each paragraph's contents with:

<p>(.*?)</p>

You need to set RegexOptions.Singleline if paragraphs can span multiple lines.
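As a rough illustration of this first step, the sketch below collects the contents of each paragraph with the pattern above. The class and method names (`ParagraphExtractor`, `GetParagraphs`) are made up for this example, and it assumes plain `<p>` tags without attributes, exactly as the pattern does.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ParagraphExtractor
{
    // Hypothetical helper: returns the inner text of every <p>...</p> pair, in document order.
    // Assumes plain <p> tags without attributes, as in the pattern above.
    public static List<string> GetParagraphs(string html)
    {
        var result = new List<string>();
        // Singleline makes '.' match '\n' too, so paragraphs spanning several lines are captured.
        foreach (Match m in Regex.Matches(html, "<p>(.*?)</p>", RegexOptions.Singleline))
        {
            result.Add(m.Groups[1].Value);
        }
        return result;
    }
}
```

Without `RegexOptions.Singleline`, a paragraph containing a line break would not be matched by this pattern, because `.` stops at `\n` by default.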

Then, in a next step, iterate over your matches and apply the following regex once on each match's Groups[1].Value:

((?:(\S+\s+){1,6})\w+)

That will match the first seven items separated by spaces/tabs/newlines, ignoring any trailing punctuation or non-word characters.

BUT it will treat a tag separated by spaces as one of those items, i.e. in

One, two three <br\> four five six seven

it will only match up until six. I guess that regex-wise, there's no way around that.
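Here is a small sketch of the second step, applied to the contents of one paragraph. `WordLimiter` and `FirstSevenItems` are names invented for this example; returning the whole input when the pattern does not match (for instance a one-word paragraph) is an assumption of the sketch, not part of the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordLimiter
{
    // The pattern from the answer: up to six whitespace-terminated items, then one final word.
    private static readonly Regex FirstSeven = new Regex(@"((?:(\S+\s+){1,6})\w+)");

    // Hypothetical helper: returns the matched prefix, or the unchanged input
    // when the pattern finds nothing (assumed fallback, e.g. a single-word paragraph).
    public static string FirstSevenItems(string paragraph)
    {
        Match m = FirstSeven.Match(paragraph);
        return m.Success ? m.Groups[1].Value : paragraph;
    }
}
```

With the question's first sample this should yield `one two <tag>three</tag> four five, six seven`. With the shorter sample the trailing period is dropped, and with the `<br\>` line above the result stops at `six`, because the tag is counted as an item.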

#2


0  

  1. Use an HTML parser to get the first paragraph, flattening its structure (i.e. remove decorating HTML tags inside the paragraph).

  2. Search for the position of the nth whitespace character.

  3. Take the substring from 0 to that position.

edit: I removed the regex proposal for step 2 and 3, since it was wrong (thanks to the commenter). Also, the HTML structure needs to be flattened.
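The search-and-substring steps could look roughly like the sketch below. It assumes the text has already been flattened by an HTML parser (that part is not shown), and it counts runs of whitespace rather than single characters so that repeated spaces do not count as extra words; that choice, like the names `WhitespaceCutter` and `FirstWords`, is an assumption of this example.

```csharp
using System;

public static class WhitespaceCutter
{
    // Hypothetical helper: returns the text up to (not including) the nth run of whitespace,
    // or the whole trimmed text if it contains fewer than n such runs.
    // Assumes 'text' is already flattened (no HTML tags) and n >= 1.
    public static string FirstWords(string text, int n)
    {
        string s = text.Trim();
        int runs = 0;
        for (int i = 0; i < s.Length; i++)
        {
            // Count a run only at its first character, so repeated spaces count once.
            bool startsRun = char.IsWhiteSpace(s[i]) && (i == 0 || !char.IsWhiteSpace(s[i - 1]));
            if (startsRun && ++runs == n)
            {
                return s.Substring(0, i);
            }
        }
        return s;
    }
}
```

For example, `WhitespaceCutter.FirstWords("one two three four five, six seven eight nine ten.", 7)` cuts just before the space that follows `seven`, while a five-word input is returned whole.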

#3


0  

I had the same problem and combined a few Stack Overflow answers into this class. It uses the HtmlAgilityPack which is a better tool for the job. Call:

 Words(string html, int n)

To get n words

using HtmlAgilityPack;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;


namespace UmbracoUtilities
{
    public class Text
    {
      /// <summary>
      /// Return the first n words in the html
      /// </summary>
      /// <param name="html"></param>
      /// <param name="n"></param>
      /// <returns></returns>
      public static string Words(string html, int n)
      {
        string words = html, n_words;

        words = StripHtml(html);
        n_words = GetNWords(words, n);

        return n_words;
      }


      /// <summary>
      /// Returns the first n words in text
      /// Assumes text is not a html string
      /// http://*.com/questions/13368345/get-first-250-words-of-a-string
      /// </summary>
      /// <param name="text"></param>
      /// <param name="n"></param>
      /// <returns></returns>
      public static string GetNWords(string text, int n)
      {
        // Collapse runs of whitespace into single spaces
        //http://*.com/questions/1279859/how-to-replace-multiple-white-spaces-with-one-white-space
        string cleanedString = System.Text.RegularExpressions.Regex.Replace(text, @"\s+", " ");

        // Drop empty entries so leading or trailing spaces are not counted as words,
        // then keep at most n words
        string[] words = cleanedString
          .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
          .Take(n)
          .ToArray();

        return string.Join(" ", words);
      }


      /// <summary>
      /// Returns a string of html with tags removed
      /// </summary>
      /// <param name="html"></param>
      /// <returns></returns>
      public static string StripHtml(string html)
      {
        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(html);

        var root = document.DocumentNode;
        var stringBuilder = new StringBuilder();

        foreach (var node in root.DescendantsAndSelf())
        {
          if (!node.HasChildNodes)
          {
            string text = node.InnerText;
            if (!string.IsNullOrEmpty(text))
              stringBuilder.Append(" " + text.Trim());
          }
        }

        return stringBuilder.ToString();
      }



    }
}

Merry Christmas!
