I am trying to split a sentence into a set of words. What I am looking for is a way to keep each number together with its unit (metric) when chunking.
E.g. (made-up):
document= The root cause of the problem is the temperature, it is currently 40 degrees which is 30 percent likely to turn into an infection doctor has prescribed 1-19666 tablet which contains 1.67 gpm and has advised to consume them every 3 hrs.
What is required is a set of words:
the
root
cause
problem
...
40 degrees
30 percent
1.67 gpm
1-19666 tablet
3 hrs
What I have tried is:
List<String> bagOfWords = new ArrayList<>();
String[] words = StringUtils.normalizeSpace(document.replaceAll("[^0-9a-zA-Z_.-]", " ")).split(" ");
for (String word : words) {
    bagOfWords.add(StringUtils.normalizeSpace(word.replaceAll("\\.(?!\\d)", " ")));
}
System.out.println("NEW 2 :: " + bagOfWords.toString());
2 Solutions
#1
2
Let's assume that one word that contains a number is followed by another one that doesn't. Then here is the code:
private static final String DOC = "The root cause of the problem is the temperature, it is currently 40 degrees which is 30 percent likely to turn into an infection doctor has prescribed 1-19666 tablet which contains 1.67 gpm and has advised to consume them every 3 hrs";
// ...
Pattern pattern = Pattern.compile("(\\b\\S*\\d\\S*\\b\\s+)?\\b\\S+\\b");
Matcher matcher = pattern.matcher(DOC);
List<String> words = new ArrayList<>();
while (matcher.find()) {
    words.add(matcher.group());
}
for (String word : words) {
    System.out.println(word);
}
Explanation:
- \\b finds a word boundary.
- \\S is a non-whitespace character, so a word can contain anything, such as a dot or a comma.
- (...)? is the first, optional part. It catches the word with a number, if any. I.e. it has some characters (\\S*), then a digit (\\d), then again some characters (\\S*).
- The second word is simple: at least one non-whitespace character. Hence it has a + rather than a * after the \\S.
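To see the pattern at work on a shorter input, here is a minimal, self-contained sketch. The class name RegexChunker and the chunk helper are made up for illustration; the regex itself is the one from this answer, and the same assumption applies (a token containing a digit is followed by a token that should be attached to it).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexChunker {

    // Same pattern as above: an optional token containing a digit,
    // followed by whitespace, then one more token.
    private static final Pattern WORDS =
            Pattern.compile("(\\b\\S*\\d\\S*\\b\\s+)?\\b\\S+\\b");

    // Hypothetical helper: collects every match of the pattern in order.
    static List<String> chunk(String document) {
        List<String> words = new ArrayList<>();
        Matcher matcher = WORDS.matcher(document);
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        String sample = "prescribed 1-19666 tablet which contains 1.67 gpm every 3 hrs.";
        for (String word : chunk(sample)) {
            System.out.println(word);
        }
    }
}
```

Note that the trailing \\b drops punctuation at the end of a token, so "hrs." comes out as "hrs". Two numbers in a row (say "3 4 hrs") would be paired with each other rather than with the unit, which is the limitation the assumption above describes.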
#2
1
Your question's scope is a bit broad, but here's a hack that can work for most sentences in this format.
First, create a list of keywords for your units, like hrs, tablet, gpm ...
Once you have this, what you need becomes easy to pick out.
String document = "The root cause of the problem is the temperature, it is currently 40 degrees which is 30 percent likely to turn into an infection doctor has prescribed 1-19666 tablet which contains 1.67 gpm and has advised to consume them every 3 hrs.";
if (document.endsWith(".")) {
    document = document.substring(0, document.length() - 1);
}
System.out.println(document);
// Commas are attached to words ("temperature,"), so a token never equals ",";
// remove them before splitting instead.
String[] splitted = document.replace(",", "").split(" ");
List<String> keywords = new ArrayList<>();
keywords.add("degrees");
keywords.add("percent");
keywords.add("gpm");
keywords.add("tablet");
keywords.add("hrs");
List<String> words = new ArrayList<>();
for (String s : splitted) {
    if (keywords.contains(s) && !words.isEmpty()) {
        // if s is a keyword, append it to the last item in the list
        int lastIndex = words.size() - 1;
        words.set(lastIndex, words.get(lastIndex) + " " + s);
    } else {
        words.add(s);
    }
}
for (String s : words) {
    System.out.println(s);
}
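One caveat with the keyword approach is that a unit word gets glued to whatever precedes it, even when that is not a number (e.g. "a tablet"). A possible refinement, sketched below, only merges when the previous token contains a digit. The class KeywordChunker, its chunk method and the punctuation-trimming rule are assumptions made for this illustration, not part of the original answer; Set.of needs Java 9 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class KeywordChunker {

    // Hypothetical helper: attaches a unit keyword to the preceding token,
    // but only when that token contains a digit.
    static List<String> chunk(String document, Set<String> units) {
        List<String> words = new ArrayList<>();
        boolean previousIsNumber = false;
        for (String raw : document.trim().split("\\s+")) {
            // Trim punctuation from both ends; inner characters such as
            // the dot in "1.67" or the dash in "1-19666" are kept.
            String token = raw.replaceAll("^[^0-9A-Za-z]+|[^0-9A-Za-z]+$", "");
            if (token.isEmpty()) {
                continue;
            }
            if (previousIsNumber && units.contains(token)) {
                int last = words.size() - 1;
                words.set(last, words.get(last) + " " + token);
                previousIsNumber = false; // at most one unit per number
            } else {
                words.add(token);
                previousIsNumber = token.matches(".*\\d.*");
            }
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> units = Set.of("degrees", "percent", "gpm", "tablet", "hrs");
        String sample = "it is currently 40 degrees, take a tablet every 3 hrs.";
        chunk(sample, units).forEach(System.out::println);
    }
}
```

With this guard, "tablet" in "take a tablet" stays a separate word, while "40 degrees" and "3 hrs" are still joined.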