Extracting words from a text file

Time: 2022-09-13 09:49:12

Let's say you have a text file like this one: http://www.gutenberg.org/files/17921/17921-8.txt

Does anyone have a good algorithm, or open-source code, to extract words from a text file? How can I get all the words while skipping special characters but keeping things like "it's"?

I'm working in Java. Thanks

5 solutions

#1


17  

This sounds like the right job for regular expressions. Here is some Java code to give you an idea, in case you don't know how to start:

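import java.util.regex.Matcher;
import java.util.regex.Pattern;

String input = "Input text, with words, punctuation, etc. Well, it's rather short.";
// Word characters (letters, digits, underscore) or apostrophes, one or more times
Pattern p = Pattern.compile("[\\w']+");
Matcher m = p.matcher(input);

// Print each match on its own line
while (m.find()) {
    System.out.println(m.group());
}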

The pattern [\w']+ matches one or more consecutive word characters (letters, digits and the underscore) or apostrophes. The example string is printed one word per line. Have a look at the Java Pattern class documentation to read more.
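
As a minimal sketch, the same loop can be wrapped in a reusable method that collects the matches into a list. The class name RegexWords and the method extractWords are made up for illustration and are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexWords {
    // Same pattern as in the answer: word characters or apostrophes, one or more.
    private static final Pattern WORD = Pattern.compile("[\\w']+");

    // Hypothetical helper: returns every match of WORD in the text, in order.
    static List<String> extractWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        String input = "Input text, with words, punctuation, etc. Well, it's rather short.";
        System.out.println(extractWords(input));
    }
}
```

For a file, the text would first have to be loaded into a String, for example with java.nio.file.Files.readAllBytes and an explicit charset. Note that [\w']+ also keeps digits, underscores and leading or trailing apostrophes, which may or may not be what you want.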

#2


3  

Pseudocode would look like this:

create words, a list of words, by splitting the input by whitespace
for every word, strip out whitespace and punctuation on the left and the right

The Python code would be something like this:

words = input.split()
words = [word.strip(PUNCTUATION) for word in words]

where

PUNCTUATION = ",. \n\t\\\"'][#*:"

or any other characters you want to remove.

I believe Java has an equivalent in the String class: String.split().

Output of running this code on the text you provided in your link:

>>> print(words[:100])
['Project', "Gutenberg's", 'Manual', 'of', 'Surgery', 'by', 'Alexis', 
'Thomson', 'and', 'Alexander', 'Miles', 'This', 'eBook', 'is', 'for', 
'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no', 'cost', 'and', 
'with', 'almost', 'no', 'restrictions', 'whatsoever', 'You', 'may', 
'copy', 'it', 'give', 'it', 'away', 'or', 're-use', 'it', 'under', 
... etc etc.
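
Since the question is about Java, here is a rough sketch of the same split-and-strip idea using String.split(). Java's String class has no direct counterpart to Python's strip(chars), so the strip helper below is a hypothetical stand-in written for this example. The class and method names are assumptions as well.

```java
import java.util.ArrayList;
import java.util.List;

class SplitAndStrip {
    // Same characters as PUNCTUATION in the Python snippet
    // (whitespace is already removed by the split).
    private static final String PUNCTUATION = ",.\\\"'][#*:";

    // Hypothetical helper: removes leading and trailing PUNCTUATION characters.
    static String strip(String word) {
        int start = 0;
        int end = word.length();
        while (start < end && PUNCTUATION.indexOf(word.charAt(start)) >= 0) start++;
        while (end > start && PUNCTUATION.indexOf(word.charAt(end - 1)) >= 0) end--;
        return word.substring(start, end);
    }

    static List<String> extractWords(String text) {
        List<String> words = new ArrayList<>();
        // "\\s+" splits on runs of whitespace, like Python's split() with no arguments
        for (String token : text.trim().split("\\s+")) {
            String word = strip(token);
            if (!word.isEmpty()) {   // unlike the Python version, drop tokens that were pure punctuation
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(extractWords("Project Gutenberg's Manual of Surgery, by Alexis Thomson"));
    }
}
```

Because only the ends of each token are stripped, inner apostrophes and hyphens survive, as in "Gutenberg's" and "re-use" in the output above.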

#3


3  

Here's a good approach to your problem: this function receives your text as input and returns a list of all the words in it.

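import java.util.ArrayList;

// Collects runs of letters and digits; any other character ends the current word.
// Note: an apostrophe also ends a word, so "it's" becomes "it" and "s".
private ArrayList<String> get_Words(String SInput){

    ArrayList<String> all_Words_List = new ArrayList<String>();
    StringBuilder SWord = new StringBuilder();

    for(int i=0; i<SInput.length(); i++){
        char charAt = SInput.charAt(i);
        if(Character.isAlphabetic(charAt) || Character.isDigit(charAt)){
            SWord.append(charAt);
        }
        else{
            if(SWord.length() > 0) all_Words_List.add(SWord.toString());
            SWord.setLength(0);
        }
    }

    // Flush the last word when the text does not end with a separator
    if(SWord.length() > 0) all_Words_List.add(SWord.toString());

    return all_Words_List;

}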
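
The loop above treats the apostrophe as a separator, so "it's" would come out as "it" and "s". One possible variant, sketched here under the assumption that an apostrophe should only count when it sits between two letters or digits, keeps such words intact. The names ApostropheScanner and getWords are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

class ApostropheScanner {
    // Hypothetical variant: an apostrophe is part of a word only when it is
    // directly between two letters or digits.
    static List<String> getWords(String text) {
        List<String> words = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean innerApostrophe = c == '\''
                    && word.length() > 0
                    && i + 1 < text.length()
                    && Character.isLetterOrDigit(text.charAt(i + 1));
            if (Character.isLetterOrDigit(c) || innerApostrophe) {
                word.append(c);
            } else if (word.length() > 0) {
                words.add(word.toString());
                word.setLength(0);
            }
        }
        if (word.length() > 0) {
            words.add(word.toString());   // last word, if the text ends inside one
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(getWords("Well, it's 'quoted' text"));
    }
}
```

With this rule, quotation marks written as apostrophes around a word are dropped, while contractions and possessives such as "it's" are kept.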

#4


1  

Basically, you want to match

([A-Za-z])+('([A-Za-z])*)?

right?
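
To see what this pattern does in practice, here is a small Java sketch. It uses the same expression with the capturing groups around single letters dropped, which should not change what is matched. The class and method names are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LettersAndApostrophe {
    // ASCII letters, optionally followed by one apostrophe and more letters.
    private static final Pattern WORD = Pattern.compile("[A-Za-z]+('[A-Za-z]*)?");

    // Hypothetical helper: returns every match of WORD in the text, in order.
    static List<String> extractWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(extractWords("it's Gutenberg's re-use 42 dogs'"));
    }
}
```

Compared with [\w']+, this pattern skips digits and underscores, splits hyphenated words, and only matches ASCII letters, so accented characters would break a word apart.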

#5


0  

You could try a regex: write a pattern that describes a word, then count the number of times that pattern is found in the text.
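
A minimal sketch of that counting idea, assuming the [\w']+ pattern from answer #1 as the word pattern (the answer itself does not name one). The class MatchCounter and the method countWords are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchCounter {
    // Assumption: reuse the [\w']+ pattern from answer #1 as the word pattern.
    private static final Pattern WORD = Pattern.compile("[\\w']+");

    // Hypothetical helper: counts how many times WORD matches in the text.
    static int countWords(String text) {
        Matcher m = WORD.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWords("Well, it's rather short."));  // prints 4
    }
}
```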
