Java regex to split a sentence into words, keeping a value and its unit of measure as a single word

Time: 2020-12-21 12:45:46

I am trying to split a sentence into a set of words. When a number is followed by its unit of measure, I want the number and the unit kept together as a single word.

E.g. (made up):
 document= The root cause of the problem is the temperature, it is currently 40 degrees which is 30 percent likely to turn into an infection doctor has prescribed 1-19666 tablet which contains 1.67 gpm and has advised to consume them every 3 hrs.

What is required is a set of words:

the
root
cause
problem
...
40 degrees
30 percent
1.67 gpm
1-19666 tablet
3 hrs

What I have tried is:

    List<String> bagOfWords = new ArrayList<>();
    String[] words = StringUtils.normalizeSpace(document.replaceAll("[^0-9a-zA-Z_.-]", " ")).split(" ");
    for (String word : words) {
        bagOfWords.add(StringUtils.normalizeSpace(word.replaceAll("\\.(?!\\d)", " ")));
    }
    System.out.println("NEW 2 :: " + bagOfWords.toString());

2 Solutions

#1


2  

Let's assume that a word containing a number is always followed by a word that does not contain one. Then here is the code:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    private static final String DOC = "The root cause of the problem is the temperature, it is currently 40 degrees which is 30 percent likely to turn into an infection doctor has prescribed 1-19666 tablet which contains 1.67 gpm and has advised to consume them every 3 hrs";

    // ...

    // Optional "word containing a digit + whitespace", followed by one more word
    Pattern pattern = Pattern.compile("(\\b\\S*\\d\\S*\\b\\s+)?\\b\\S+\\b");
    Matcher matcher = pattern.matcher(DOC);
    List<String> words = new ArrayList<>();
    while (matcher.find()) {
        // group() is the whole match, e.g. "40 degrees" or "root"
        words.add(matcher.group());
    }
    for (String word : words) {
        System.out.println(word);
    }

Explanation:

  • \\b matches a word boundary.

  • \\S matches any non-whitespace character, so a word may contain anything that is not whitespace, such as a dot or a comma.

  • (...)? is the optional first part. It captures the word containing a number, if there is one: some characters (\\S*), then a digit (\\d), then some more characters (\\S*), followed by whitespace (\\s+).

  • The second word is simple: at least one non-whitespace character. Hence the S is followed by a +, not a *.
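The pattern explained above can be wrapped in a small self-contained class to try it on other sentences. This is only a minimal sketch, not the answerer's exact code: the class name `UnitTokenizer` and the method `tokenize` are made up for illustration, and it relies on the same assumption that a word containing a digit is followed by a word without one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the class and method names are illustrative only.
class UnitTokenizer {

    // Optional "word containing a digit + whitespace", then one more word.
    private static final Pattern WORD =
            Pattern.compile("(\\b\\S*\\d\\S*\\b\\s+)?\\b\\S+\\b");

    // Returns every match of WORD in the text, in order of appearance.
    static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        String text = "prescribed 1-19666 tablet which contains 1.67 gpm every 3 hrs.";
        for (String word : tokenize(text)) {
            System.out.println(word);
        }
    }
}
```

The assumption matters: with this sketch, `tokenize("room 40 is cold")` yields `room`, `40 is`, `cold`, because any word after a number is treated as its unit.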

#2


1  

Your question's scope is a bit large, but here's a hack that can work for most sentences in this format.

First, create a list of keywords for your units, such as hrs, tablet, gpm ... Once you have this list, the words you need become easy to pick out.

    String document = "The root cause of the problem is the temperature, it is currently 40 degrees which is 30 percent likely to turn into an infection doctor has prescribed 1-19666 tablet which contains 1.67 gpm and has advised to consume them every 3 hrs.";
    // Drop the final full stop so that "hrs." is recognised as the keyword "hrs"
    if (document.endsWith(".")) {
        document = document.substring(0, document.length() - 1);
    }
    System.out.println(document);
    String[] tokens = document.split(" ");
    // Unit keywords that are joined to the item in front of them
    List<String> keywords = new ArrayList<>();
    keywords.add("degrees");
    keywords.add("percent");
    keywords.add("gpm");
    keywords.add("tablet");
    keywords.add("hrs");

    List<String> words = new ArrayList<>();

    for (String token : tokens) {
        // Strip trailing commas, e.g. "temperature," becomes "temperature"
        String s = token.replaceAll(",+$", "");
        if (!s.isEmpty()) {
            if (keywords.contains(s) && !words.isEmpty()) {
                // s is a keyword: append it to the last item in the list
                int lastIndex = words.size() - 1;
                words.set(lastIndex, words.get(lastIndex) + " " + s);
            } else {
                words.add(s);
            }
        }
    }
    for (String s : words) {
        System.out.println(s);
    }
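The keyword list can also be folded into a regular expression, so that a number is joined to the following word only when that word is a known unit. The following is a minimal sketch under that assumption and is not part of the answer above; the class name `UnitKeywordTokenizer`, the method `tokenize` and the `UNITS` list are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the class, method and field names are illustrative only.
class UnitKeywordTokenizer {

    // Assumed list of units, taken from the example sentence.
    private static final List<String> UNITS =
            Arrays.asList("degrees", "percent", "gpm", "tablet", "hrs");

    // First alternative: "word containing a digit + whitespace + known unit".
    // Second alternative: any single word.
    private static final Pattern WORD = Pattern.compile(
            "\\b\\S*\\d\\S*\\b\\s+(?:" + String.join("|", UNITS) + ")\\b"
            + "|\\b\\S+\\b");

    // Returns every match of WORD in the text, in order of appearance.
    static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        for (String word : tokenize("room 40 is at 40 degrees.")) {
            System.out.println(word);
        }
    }
}
```

Here a number that is not followed by a listed unit stays on its own, so `room 40 is at 40 degrees.` gives `room`, `40`, `is`, `at`, `40 degrees`. The units in this sketch contain no regex metacharacters; units that do would need escaping, for example with `Pattern.quote`.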
