Writing a regular expression to extract words from text in Java

Time: 2022-10-14 21:45:55

I am writing a program in Java using regex. The input sentences come in many structures. Given strings such as "book 'learning java' for doctor ahmed mohamed" or "the best title: learning java for ahmed mohamed", and so on,

meaning that:

(book) may be [the book or text: or (text)].

(for doctor) may be [for author, or for, or by, or for doctor].

the output:

I want to extract the words after (book) and before (for doctor) and name them Title, and extract the words after (for doctor) and name them Author.

String inputtext = "book 'learning java' for doctor  ahmed mohamed";

Pattern p = Pattern.compile("(?<=(book| the book| \\( . \\)|\\:)) .*? (?=(for doctor| for| for author))");

Matcher m = p.matcher(inputtext);

if (m.matches()) {
    String bookTitle = m.group(1).trim();
    String author = m.group(2).trim();

    System.out.println("Title is : " + bookTitle);
    System.out.println("Author is : " + author);
}

1 solution

#1


I'll try to provide a hint, but since I can't read your expression, I can only guess.

So your expression is this:

(?<=(للدكتورة|للعلامه|للشيخ|للكاتب |للكاتبه|للامام|للاستاذ|للقاضى|للدكتور|ل ))\s[^\s]+\s[^\s]+

Broken down, it looks like this:

  • positive look behind for (?<=(للدكتورة|للعلامه|للشيخ|للكاتب |للكاتبه|للامام|للاستاذ|للقاضى|للدكتور|ل ))

  • a whitespace character followed by a word: \s[^\s]+

  • another whitespace character followed by a second word: \s[^\s]+

Basically, the match is any sequence that contains exactly two whitespace-word combinations and is preceded by any of the words in your lookbehind.
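
To make the two-word limit concrete, here is a minimal Java sketch of the same expression shape. The prefixes "by" and "for" are hypothetical English stand-ins for the Arabic words in the lookbehind, and the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoWordDemo {
    // Assumption: "by" and "for" stand in for the Arabic prefixes in the lookbehind.
    // After the prefix, the pattern consumes exactly two (whitespace, word) pairs.
    static final Pattern TWO_WORDS =
            Pattern.compile("(?<=(by|for))\\s[^\\s]+\\s[^\\s]+");

    // Returns the first match with surrounding whitespace trimmed, or null if nothing matches.
    static String firstMatch(String text) {
        Matcher m = TWO_WORDS.matcher(text);
        return m.find() ? m.group().trim() : null;
    }

    public static void main(String[] args) {
        // Only the first two words after the prefix are returned.
        System.out.println(firstMatch("learning java by ahmed mohamed ali hassan"));
    }
}
```

Because each \s matches exactly one character, the same pattern finds nothing at all when two spaces follow the prefix.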

This seems to be your actual problem, as you stated:

this expression give me only 2 word

A possible solution would be to match more than two words, and possibly more than one whitespace character between them. So after your lookbehind, try this: (?>\s+[^\s]+)+ instead of \s[^\s]+\s[^\s]+. This part matches one or more repetitions of whitespace followed by non-whitespace, e.g. (in English letters) it would match aaa bbb as well as aaa bbb ccc ddd (HTML won't display multiple whitespace here, but imagine the gaps were larger than just one space).
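
The suggested replacement can be tried out with a short Java sketch. As before, "by" and "for" are hypothetical English stand-ins for the Arabic prefixes, and the class and method names are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultiWordDemo {
    // Assumption: "by" and "for" stand in for the Arabic prefixes in the lookbehind.
    // (?>\s+[^\s]+)+ is an atomic group repeated one or more times:
    // each repetition is a run of whitespace followed by a run of non-whitespace.
    static final Pattern ANY_WORDS =
            Pattern.compile("(?<=(by|for))(?>\\s+[^\\s]+)+");

    // Returns the first match with surrounding whitespace trimmed, or null if nothing matches.
    static String firstMatch(String text) {
        Matcher m = ANY_WORDS.matcher(text);
        return m.find() ? m.group().trim() : null;
    }

    public static void main(String[] args) {
        // Several spaces after the prefix and more than two words are both accepted.
        System.out.println(firstMatch("learning java by   ahmed mohamed ali hassan"));
    }
}
```

Note that this version consumes every word up to the end of the input, so any text that follows the name would be included in the match as well.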
