Does anyone have a suggestion for where to find archives or collections of everyday English text for use in a small corpus? I have been using Project Gutenberg books for a working prototype and would like to incorporate more contemporary language. A recent answer here pointed indirectly to a great archive of Usenet movie reviews, which hadn't occurred to me and works very well. For this particular program, technical Usenet archives or programming mailing lists would tilt the results and be hard to analyze, but any kind of general blog text, chat transcripts, or anything that has been useful to others would be very helpful. A partial or downloadable research corpus that isn't too marked up, a heuristic for finding an appropriate subset of Wikipedia articles, or any other idea would also be much appreciated.
(BTW, I am being a good citizen w/r/t downloading, using a deliberately slow script that is not demanding on servers hosting such material, in case you perceive a moral hazard in pointing me to something enormous.)
UPDATE: User S0rin points out that Wikipedia asks not to be crawled and provides this export tool instead. Project Gutenberg has a policy specified here; the bottom line is to try not to crawl, but if you need to: "Configure your robot to wait at least 2 seconds between requests."
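In that spirit, here is a minimal sketch of the kind of deliberately slow fetcher I mean, assuming a plain list of URLs; the URL list and output directory below are placeholders, not real endpoints:

```python
# Minimal polite-fetcher sketch: waits at least 2 seconds between requests,
# per the Project Gutenberg guideline quoted above. The URL list and output
# directory are placeholders.
import time
import urllib.request
from pathlib import Path

urls = [
    # "https://www.gutenberg.org/files/1342/1342-0.txt",  # placeholder example
]
out_dir = Path("corpus_raw")
out_dir.mkdir(exist_ok=True)

for i, url in enumerate(urls):
    with urllib.request.urlopen(url) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    (out_dir / f"doc_{i:05d}.txt").write_text(text, encoding="utf-8")
    time.sleep(2)  # be a good citizen: at least 2 seconds between requests
```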
UPDATE 2: The Wikipedia dumps are the way to go, thanks to the answerers who pointed them out. I ended up using the English version from here: http://download.wikimedia.org/enwiki/20090306/, and a Spanish dump about half the size. They take some work to clean up, but it's well worth it, and they contain a lot of useful data in the links.
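For anyone doing the same cleanup, here is a rough sketch of the approach: stream the pages-articles XML and crudely strip the wiki markup. The file name is a placeholder and the regexes are deliberately simplistic (nested templates, tables, refs, etc. are not handled):

```python
# Rough sketch: stream a Wikipedia pages-articles dump and strip the most
# common wiki markup. Placeholder file name; real cleanup needs more care.
import bz2
import re
import xml.etree.ElementTree as ET

DUMP = "enwiki-20090306-pages-articles.xml.bz2"  # placeholder name

def strip_markup(wikitext):
    text = re.sub(r"\{\{.*?\}\}", "", wikitext, flags=re.DOTALL)    # {{templates}}
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)   # [[link|label]] -> label
    text = re.sub(r"<[^>]+>", "", text)                             # stray tags
    text = re.sub(r"'{2,}", "", text)                               # ''italics'' / '''bold'''
    return text

with bz2.open(DUMP, "rt", encoding="utf-8") as dump:
    for _, elem in ET.iterparse(dump):
        if elem.tag.endswith("}text") and elem.text:
            print(strip_markup(elem.text)[:200])
        elem.clear()  # keep memory bounded on a multi-gigabyte dump
```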
7 Answers
#1
8
- Use the Wikipedia dumps
  - needs lots of cleanup
- See if anything in nltk-data helps you (a quick access sketch follows this list)
  - the corpora are usually quite small
- the Wacky people have some free corpora
  - tagged
  - you can spider your own corpus using their toolkit
- Europarl is free and the basis of pretty much every academic MT system
  - spoken language, translated
- The Reuters Corpora are free of charge, but only available on CD
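For the nltk-data point, a quick way to see what ships with it (assuming nltk is installed; the Brown corpus is small, but it is exactly the sort of everyday English the question asks about):

```python
# Quick look at what ships with nltk-data.
import nltk

nltk.download("brown")        # downloads into ~/nltk_data by default
from nltk.corpus import brown

print(brown.categories())                              # news, fiction, romance, ...
print(" ".join(brown.words(categories="news")[:50]))   # a taste of the text
```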
You can always get your own, but be warned: HTML pages often need heavy cleanup, so restrict yourself to RSS feeds.
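As a sketch of the RSS route, something like this works (it uses the third-party feedparser package; the feed URL is a placeholder, and feeds that only carry summaries will give you short snippets):

```python
# Sketch: harvest text from RSS entries instead of raw HTML pages.
import feedparser
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collects only the text nodes of whatever HTML is left in the feed."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

feed = feedparser.parse("https://example.com/full-text.rss")  # placeholder URL
for entry in feed.entries:
    stripper = TextOnly()
    stripper.feed(entry.get("summary", ""))
    print(entry.get("title", ""), "->", " ".join(stripper.chunks)[:120])
```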
If you do this commercially, the LDC might be a viable alternative.
#2
4
Wikipedia sounds like the way to go. There is an experimental Wikipedia API that might be of use, but I have no clue how it works. So far I've only scraped Wikipedia with custom spiders or even wget.
Then you could search for pages that offer their full article text in RSS feeds. RSS, because no HTML tags get in your way.
Scraping mailing lists and/or Usenet has several disadvantages: you'll be getting AOLbonics and Techspeak, and that will tilt your corpus badly.
The classical corpora are the Penn Treebank and the British National Corpus, but they are paid for. You can read the Corpora list archives, or even ask them about it. Perhaps you will find useful data using the Web as Corpus tools.
I actually have a small project under construction that allows linguistic processing of arbitrary web pages. It should be ready for use within the next few weeks, but so far it's not really meant to be a scraper. I could write a module for it, though; I guess the functionality is already there.
#3
1
If you're willing to pay money, you should check out the data available at the Linguistic Data Consortium, such as the Penn Treebank.
#4
1
Wikipedia seems to be the best way. Yes, you'd have to parse the output, but thanks to Wikipedia's categories you could easily get different types of articles and words. E.g., by parsing all the science categories you could get lots of science words; details about places would be skewed towards geographic names, etc.
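For illustration, one way to pull article titles from a category is the MediaWiki API; "Category:Physics" here is just an example, and for bulk work the dumps are kinder to the servers than crawling:

```python
# Sketch: list articles in a Wikipedia category via the MediaWiki API.
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "list": "categorymembers",
    "cmtitle": "Category:Physics",   # example category
    "cmlimit": "100",
    "format": "json",
}
req = urllib.request.Request(
    API + "?" + urllib.parse.urlencode(params),
    headers={"User-Agent": "small-corpus-builder-sketch/0.1"},  # identify your bot
)
with urllib.request.urlopen(req) as resp:
    data = json.load(resp)

for member in data["query"]["categorymembers"]:
    print(member["title"])
```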
#5
0
You've covered the obvious ones. The only other areas I can think of to supplement:
1) News articles / blogs.
2) Magazines are posting a lot of free material online, and you can get a good cross section of topics.
#6
0
Looking into the Wikipedia data, I noticed that they had done some analysis on bodies of TV and movie scripts. I thought that might be interesting text but not readily accessible -- it turns out it is everywhere, and it is structured and predictable enough that it should be possible to clean it up. This site, helpfully titled "A bunch of movie scripts and screenplays in one location on the 'net", would probably be useful to anyone who stumbles on this thread with a similar question.
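As a very rough illustration of what "clean it up" might look like, here is a heuristic that drops scene headings and character cues (mostly upper-case lines) from a plain-text script; formats vary a lot, so treat it as a starting point only:

```python
# Rough heuristic: drop lines that look like scene headings or character cues
# (all letters upper-case), keep description and dialogue.
def looks_like_cue(line):
    letters = [c for c in line if c.isalpha()]
    return bool(letters) and all(c.isupper() for c in letters)

def clean_script(text):
    kept = [ln for ln in text.splitlines() if ln.strip() and not looks_like_cue(ln)]
    return "\n".join(kept)

sample = "INT. DINER - NIGHT\nJOHN\nI'll have the usual.\n"
print(clean_script(sample))   # -> I'll have the usual.
```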
#7
0
You can get quotations content (in limited form) here: http://quotationsbook.com/services/
This content also happens to be on Freebase.