I need a regular expression that will extract sentences from a text file. Example text:
Consider, for example, the Asian tsunami disaster that happened in the end of 2004. A query to Google News (http://news.google.com) returned more than 80,000 online news articles about this event within one month (Jan.17 through Feb.17, 2005). information by mr. Kahana.
Here's my code:
$re = '/(?<=[.!?]|[.!?][\'"])\s+/';
$sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);
But the last sentence is still split into "information by mr." and "Kahana."
How can I solve this? Thank you :)
1 solution
#1
7
You Can't Do this with Regular Expressions
English does not follow rigid formatting rules, so regular expressions alone are a poor fit for what you are trying to do. What you are really looking for is something like a natural language processor.
Unless this is critical to your program, I suggest you instead determine the following things:
- What is an acceptable level of error? Nothing you do will be perfect. But if it works 80% of the time, is that okay? 90%? 99%? How critical is this to you/your client?
- Where is the text coming from? For example, a textbook will most likely be written differently than people's Twitter feeds. You can do research and make exceptions based on what you see in the actual text you are using.
- What am I doing with the text? If you are just indexing things like keywords, then it doesn't matter (as much) if you get the sentences split correctly. It's all about tuning the program to get the appropriate output for this specific purpose.
My recommendation is to use trial and error to get your error rate down as much as possible. Run your program on a large set of text, and keep adding exceptions until you get an acceptable error rate. If, however, you need more than a couple dozen rules or so, you will probably just want to rethink the problem.
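As a rough illustration of "adding exceptions", here is a minimal sketch that keeps the split pattern from the question and refuses to split after a few known abbreviations. The `$abbreviations` list and the `splitSentences()` helper are hypothetical names made up for this example, not part of any library; the list is an assumption you would grow from your own text.

```php
<?php
// Hypothetical exception list: extend it based on what you see in your own text.
$abbreviations = ['mr', 'mrs', 'ms', 'dr', 'prof'];

// Hypothetical helper: splits on the same points as the pattern in the
// question, except where the whitespace directly follows a listed abbreviation.
function splitSentences(string $text, array $abbreviations): array
{
    // Build one fixed-length negative lookbehind per abbreviation, e.g. (?<!\bmr\.)
    $exceptions = '';
    foreach ($abbreviations as $abbr) {
        $exceptions .= '(?<!\b' . preg_quote($abbr, '/') . '\.)';
    }

    // The "i" flag makes the abbreviation check case-insensitive (mr., Mr., MR.).
    $re = '/(?<=[.!?]|[.!?][\'"])' . $exceptions . '\s+/i';

    // preg_split() returns false on error; fall back to an empty array.
    return preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
}

$text = 'It happened in 2004. Information by mr. Kahana.';
print_r(splitSentences($text, $abbreviations));
// Two sentences: "It happened in 2004." and "Information by mr. Kahana."
```

This still gets it wrong whenever a sentence really does end with one of the listed abbreviations, which is exactly the kind of error you have to decide whether you can live with.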
In short, PHP and Regular Expressions aren't meant for this because English is funky. So either live with adding exceptions to get a small(er) error rate, or rethink the point altogether.