Is there a way to determine whether a document is a file of text sentences?

Time: 2021-01-21 22:48:26

I'm processing hundreds of thousands of files, and potentially millions later on down the road. A bad file will contain a text version of an Excel spreadsheet or other text that isn't binary but also isn't sentences. Such files cause CoreNLP to blow up (technically, these files just take a very long time to process, such as 15 seconds per kilobyte of text). I'd love to detect these files and discard them in sub-second time.


What I am considering is taking a few thousand files at random, examining the first, say, 200 characters, and looking at the distribution of characters to determine what is legit and what is an outlier. For example, flag a file if there are no punctuation marks or too many of them. Does this seem like a good approach? Is there a better one that has been proven? I think this will work well enough for sure, possibly throwing out potentially good files, but only rarely.

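A minimal sketch of that character-distribution check follows. The class name, the 200-character sample size, the punctuation set, and the two ratio thresholds are all illustrative assumptions that would need tuning against a labelled sample of good and bad files, not values from the original question:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class QuickTextCheck {

    // Illustrative constants -- sample size and thresholds are guesses to be tuned on real data.
    private static final int SAMPLE_CHARS = 200;
    private static final double MIN_LETTER_RATIO = 0.60;
    private static final double MAX_PUNCT_RATIO = 0.15;

    /** Returns true if the first couple hundred characters look like prose rather than a tabular dump. */
    public static boolean looksLikeProse(Path file) throws IOException {
        char[] buf = new char[SAMPLE_CHARS];
        int n;
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            // A MalformedInputException here (non-UTF-8 bytes) is itself a signal the file is suspect.
            n = reader.read(buf, 0, SAMPLE_CHARS);
        }
        if (n <= 0) {
            return false; // empty file
        }

        int letters = 0, punct = 0;
        for (int i = 0; i < n; i++) {
            char c = buf[i];
            if (Character.isLetter(c)) letters++;
            else if (".,;:!?'\"".indexOf(c) >= 0) punct++;
        }
        double letterRatio = (double) letters / n;
        double punctRatio = (double) punct / n;

        // Prose is mostly letters with some, but not too much, punctuation.
        return letterRatio >= MIN_LETTER_RATIO && punctRatio <= MAX_PUNCT_RATIO;
    }
}
```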

Another idea is to simply run with only the tokenize and ssplit annotators and do a word and sentence count. That seems to do a good job as well and returns quickly. I can think of cases where this might fail too, possibly.

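For reference, a lightweight pipeline along those lines might look like the sketch below, using CoreNLP's CoreDocument API with only the cheap annotators. The class name is illustrative; acceptance cutoffs on the resulting statistic (say, an average sentence length between 5 and 60 tokens) are assumptions to be validated, not established values:

```java
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

public class SentenceStats {

    private final StanfordCoreNLP pipeline;

    public SentenceStats() {
        Properties props = new Properties();
        // Only the cheap annotators, so this stays fast even on large batches.
        props.setProperty("annotators", "tokenize,ssplit");
        pipeline = new StanfordCoreNLP(props);
    }

    /** Average tokens per sentence; prose usually sits in a middling range, dumps at the extremes. */
    public double averageSentenceLength(String text) {
        CoreDocument doc = new CoreDocument(text);
        pipeline.annotate(doc);
        int sentences = doc.sentences().size();
        int tokens = doc.tokens().size();
        return sentences == 0 ? 0.0 : (double) tokens / sentences;
    }
}
```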

4 Solutions

#1


1  

This kind of processing pipeline is always in a state of continuous improvement. To kick off that process, the first thing I would build is an instrument around the timing behavior of CoreNLP. If CoreNLP is taking too long, kick out the offending file into a separate queue. If this isn't good enough, you can write recognizers for the most common things in the takes-too-long queue and divert them before they hit CoreNLP. The main advantage of this approach is that it works with inputs that you don't expect in advance.

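One way to instrument the timing behavior is to run each file's annotation under a time budget and divert anything that blows the budget into a separate queue, roughly as sketched below. This is an assumed harness around the answer's idea, not code from it; the budget value and queue handling are placeholders, and since CoreNLP may not react quickly to thread interruption, production code might run each file on a disposable thread or in a separate process instead:

```java
import java.nio.file.Path;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedAnnotator {

    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final BlockingQueue<Path> tooSlowQueue = new LinkedBlockingQueue<>();

    /**
     * Runs one file's annotation (passed in as a Runnable) under a time budget.
     * Files that exceed the budget are diverted to a queue for later inspection.
     */
    public boolean annotateWithBudget(Path file, Runnable annotateFile, long budgetMillis)
            throws InterruptedException {
        Future<?> task = worker.submit(annotateFile);
        try {
            task.get(budgetMillis, TimeUnit.MILLISECONDS);
            return true;                       // finished within budget
        } catch (TimeoutException e) {
            task.cancel(true);                 // best effort; the worker may stay busy if CoreNLP ignores the interrupt
            tooSlowQueue.put(file);            // divert for later inspection / recognizer development
            return false;
        } catch (ExecutionException e) {
            tooSlowQueue.put(file);            // annotation failed outright; also worth inspecting
            return false;
        }
    }

    public BlockingQueue<Path> slowFiles() {
        return tooSlowQueue;
    }
}
```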

#2


1  

There are two main approaches to this kind of problem.


The first is to take the approach you are considering in which you examine the contents of the file and decide whether it is acceptable text or not based on a statistical analysis of the data in the file.


The second approach is to use some kind of meta tag, such as a file extension, to at least eliminate those files that are almost certain to be a problem (.pdf, .jpg, etc.).


I would suggest a mixture of the two approaches so as to cut down on the amount of processing.


You might consider a pipeline approach in which you have a sequence of tests. The first test filters out files based on metadata such as the file extension; the second step then does a preliminary statistical check on the first few bytes of the file to filter out obvious problem files; a third step does a more involved statistical analysis of the text; and the fourth handles the CoreNLP rejection step.

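A rough sketch of that staged triage, with the cheapest check first, could be wired together as below. The class name, the extension list, and the idea of passing the later stages in as predicates are all illustrative assumptions; the deeper statistical stages would be the kinds of checks sketched earlier in this thread:

```java
import java.nio.file.Path;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

public class TriagePipeline {

    // Stage 1: metadata check -- extensions that are almost never sentence text (list is illustrative).
    private static final Set<String> BAD_EXTENSIONS = Set.of("pdf", "jpg", "png", "xls", "xlsx", "zip");

    static boolean extensionOk(Path file) {
        String name = file.getFileName().toString().toLowerCase();
        int dot = name.lastIndexOf('.');
        return dot < 0 || !BAD_EXTENSIONS.contains(name.substring(dot + 1));
    }

    // Stages 2 and 3 (cheap character check, deeper statistical analysis) are supplied by the
    // caller as predicates; stage 4, the CoreNLP timeout handling, happens downstream.
    private final List<Predicate<Path>> stages;

    TriagePipeline(Predicate<Path> quickCheck, Predicate<Path> deepCheck) {
        this.stages = List.of(TriagePipeline::extensionOk, quickCheck, deepCheck);
    }

    /** A file is handed to CoreNLP only if it passes every stage, cheapest first. */
    boolean accept(Path file) {
        return stages.stream().allMatch(stage -> stage.test(file));
    }
}
```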

You do not say where the files originate, nor whether there are any language considerations (English versus French versus Simplified Chinese text). For instance, are the acceptable text files using UTF-8, UTF-16, or some other encoding for the text?


Also, is it possible for the CoreNLP application to be more graceful about detecting and rejecting incompatible text files?


#3


1  

Could you not just train a Naive Bayes classifier to recognize the bad files? For features, use things like the (binned) percentage of punctuation, the percentage of numerical characters, and the average sentence length.

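The feature extraction for such a classifier could look roughly like the sketch below. The binning cutoffs, the punctuation set, and the crude sentence splitting are assumptions for illustration; the resulting discrete feature vectors could then be fed to any off-the-shelf Naive Bayes implementation trained on hand-labelled good and bad files:

```java
public class BadFileFeatures {

    /** Extracts three binned features: punctuation %, digit %, and average sentence length. */
    public static int[] extract(String text) {
        int punct = 0, digits = 0, sentenceEnds = 0;
        for (char c : text.toCharArray()) {
            if (".,;:!?'\"".indexOf(c) >= 0) punct++;
            if (Character.isDigit(c)) digits++;
            if (c == '.' || c == '!' || c == '?') sentenceEnds++;
        }
        int length = Math.max(1, text.length());
        int words = text.split("\\s+").length;

        // Bin each feature into a few buckets so a discrete Naive Bayes model can use it.
        // The cutoff values are illustrative, not calibrated.
        int punctBin = bin(100.0 * punct / length, new double[]{1, 3, 6, 12});
        int digitBin = bin(100.0 * digits / length, new double[]{2, 10, 30, 60});
        int avgSentenceLenBin = bin((double) words / Math.max(1, sentenceEnds),
                                    new double[]{5, 12, 25, 60});
        return new int[]{punctBin, digitBin, avgSentenceLenBin};
    }

    private static int bin(double value, double[] cutoffs) {
        for (int i = 0; i < cutoffs.length; i++) {
            if (value < cutoffs[i]) return i;
        }
        return cutoffs.length;
    }
}
```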

#4


-1  

Peter,

You are clearly dealing with files for ediscovery. Anything and everything is possible, and as you know, anything kicked out must be logged as an exception. I've faced this, and have heard the same from other analytics processors.


Some of the solutions above, both pre-process and in-line, can help. In some ediscovery solutions it may be feasible to dump text into a field in SQL and truncate it, or otherwise truncate, and still get what you need. In other apps, anything to do with semantic clustering or predictive coding, it may be better to use pre-filters based on metadata (e.g. file type), document-type classification libraries, and entity extraction based upon prior examples, current sampling, or your best guess as to the nature of the "bad file" contents.


Good luck.
