Machine Learning
Support Vector Machine
- An implementation of Vapnik's Support Vector Machine
- A Library for Support Vector Machines
Decision Tree
- The "classic" decision-tree tool, developed by J. R. Quinlan
- Tutorial
Maximum Entropy
- Yet Another Small MaxEnt Toolkit
Conditional Random Field
- A simple, customizable, and open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data
Natural Language Processing
General
- An organizational center for open source projects related to natural language processing
- A suite of UNIX software tools to facilitate the construction and testing of statistical language models
- A Java-based development package for academic use in information retrieval (IR) and text mining. Includes many NLP tools
- A suite of Java libraries for the linguistic analysis of human language, which can:
  - track mentions of entities (e.g. people or proteins);
  - link entity mentions to database entries;
  - uncover relations between entities and actions;
  - classify text passages by language, character encoding, genre, topic, or sentiment;
  - correct spelling with respect to a text collection;
  - cluster documents by implicit topic and discover significant trends over time; and
  - provide part-of-speech tagging and phrase chunking.
- Advanced Natural Language Object-oriented Processing Environment. Includes a range of tools (notably a C# version of the Stanford parser)
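Word Segmentation
- The Chinese word segmentation system from the Chinese Academy of Sciences
- A Java implementation of a CRF-based Chinese Word Segmenter
Part-of-Speech Tagging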
- An error-driven transformation-based tagger implemented by Eric Brill
- A Java implementation of the log-linear part-of-speech taggers described by Kristina Toutanova et al.
- A decision tree based tagger from the University of Stuttgart
- SVMTool, a POS Tagger based on SVMs
- QTAG Part of speech tagger
Named Entity Recognition
- A Java implementation of a Conditional Random Field sequence model, together with well-engineered features for Named Entity Recognition
- Tools include statistical named-entity recognition, a heuristic sentence boundary detector, and a heuristic within-document coreference resolution engine. Java. GPL. By Bob Carpenter, Breck Baldwin and co.
- SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)
Stemming
- A process for removing the commoner morphological and inflexional endings from words in English, by Martin Porter
- A small string processing language designed for creating stemming algorithms for use in Information Retrieval
Syntactic Parsing
- Java implementations of probabilistic natural language parsers, both highly optimized PCFG and dependency parsers, and a lexicalized PCFG parser
Text Mining
Summarization
- Rouge
- Configuring Rouge on Windows
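For readers new to Rouge, the core idea behind its ROUGE-N score is n-gram recall: the fraction of the reference summary's n-grams that also appear in the candidate summary. The following is only a minimal sketch of that idea, not the Rouge tool's own code. The function names `ngrams` and `rouge_n_recall` are hypothetical, and whitespace tokenization is an assumption made for brevity.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Clipped n-gram recall of `candidate` against a single `reference`.

    Assumption: both texts are tokenized by splitting on whitespace.
    """
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Each reference n-gram is credited at most as often as it occurs
    # in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


if __name__ == "__main__":
    print(rouge_n_recall("the cat sat on the mat", "the cat is on the mat", n=1))
    print(rouge_n_recall("the cat sat on the mat", "the cat is on the mat", n=2))
```

A real evaluation should use the Rouge package itself, since published scores depend on its exact preprocessing and options.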
Miscellaneous
Encryption
- Includes many cryptographic algorithms, such as RSA, DES, MD5, and SHA
- Win32 installer version
Compression
- A Massively Spiffy Yet Delicately Unobtrusive Compression Library
Logging
- Creates and maintains open-source software related to the logging of application behavior, released at no charge to the public. Note: the official release of log4cxx has a memory leak problem.
Unicode
- A mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications
XML
- A validating XML parser, available in C and Java editions
Multi-Pattern String Matching
- AC in C#: Aho-Corasick string matching in C#
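As a rough illustration of what an Aho-Corasick matcher does, here is a minimal Python sketch. It is not the C# library listed above, and the names `build_automaton` and `find_all` are hypothetical. It builds a trie of the patterns, adds failure links with a breadth-first pass, and then scans the text once, reporting every occurrence of every pattern. Patterns are assumed to be non-empty strings.

```python
from collections import deque


def build_automaton(patterns):
    """Build the trie, failure links, and output lists for the patterns."""
    goto = [{}]   # goto[node][char] -> child node
    fail = [0]    # fail[node] -> node for the longest proper suffix in the trie
    out = [[]]    # out[node] -> patterns that end at this node
    for pat in patterns:
        node = 0
        for ch in pat:
            nxt = goto[node].get(ch)
            if nxt is None:
                goto.append({})
                fail.append(0)
                out.append([])
                nxt = len(goto) - 1
                goto[node][ch] = nxt
            node = nxt
        out[node].append(pat)

    # Breadth-first pass: depth-1 nodes keep fail = 0; deeper nodes follow
    # the parent's failure chain until a node with a matching edge is found.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure target.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def find_all(text, patterns):
    """Return (start_index, pattern) for every match, in scan order."""
    goto, fail, out = build_automaton(patterns)
    matches = []
    node = 0
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            matches.append((i - len(pat) + 1, pat))
    return matches


if __name__ == "__main__":
    print(find_all("ushers", ["he", "she", "his", "hers"]))
```

The text is read once regardless of how many patterns there are, which is why this approach suits dictionary-style lookups such as keyword filtering.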
HTML Parser
- Html Agility Pack, an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files.
- Majestic-12, an open source high-performance .NET C# module that was created to parse HTML for links, indexing and other purposes. It is fast, but does not build a DOM tree.
External Links
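To make the trade-off between the two styles concrete, the sketch below shows event-based link extraction that never builds a DOM tree. It uses Python's standard `html.parser` module purely as an illustration and does not reflect the API of either .NET library above. The names `LinkCollector` and `extract_links` are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags as the parser streams through the input."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the href of every anchor in document order, without a DOM."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


if __name__ == "__main__":
    page = '<p><a href="http://a.example/">A</a> <A HREF="/b">B</a><a name="x">no</a>'
    print(extract_links(page))
```

A streaming parser like this is enough for crawling and indexing, while a DOM-building parser is the better fit when the document has to be queried with XPath or modified and written back.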
- An annotated list of resources by Stanford NLP Group
- KDnuggets, which lists software and other resources related to KDD
Source: http://fuliang.iteye.com/blog/955023