File name: Context-Based Chinese Word Segmentation without Dictionary Support
File size: 533 KB
File format: PDF
Last updated: 2018-08-26 05:00:45
NLP, SVM, CWS
This paper presents a new machine-learning approach to Chinese word segmentation (CWS), which defines CWS as a break-point classification problem; a break point is the boundary between two consecutive words. Further, this paper exploits a support vector machine (SVM) classifier, which learns the segmentation rules of the Chinese language from a context model of break points in a corpus. Additionally, we have designed an effective feature set for building the context model, and a systematic approach for creating the positive and negative samples used for training the classifier. Unlike the traditional approach, which requires the assistance of large-scale known information sources such as dictionaries or linguistic tagging, the proposed approach selects the most frequent words in the corpus as the learning sources. In this way, CWS can be carried out on any novel corpus, even when no suitable supporting resources are available. According to our experimental results, the proposed approach achieves results competitive with those of the Chinese Knowledge and Information Processing (CKIP) system from Academia Sinica.
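The abstract does not spell out the feature set or the sample-construction procedure, so the following is only a minimal sketch of the break-point framing it describes, not the authors' method. It assumes a sentence whose word boundaries are already known, treats every gap between two adjacent characters as a candidate break point, and labels the gap positive when it separates two words and negative otherwise. The function name `break_point_samples`, the character-window context, the `window` size, and the padding symbol are all illustrative assumptions.

```python
def break_point_samples(words, window=2, pad="_"):
    """Turn one segmented sentence into (context, label) pairs.

    Hypothetical helper: one sample per gap between adjacent characters.
    label is 1 if the gap is a word boundary (break point), else 0.
    The context is the `window` characters on each side of the gap,
    which is an assumed feature choice, not the paper's feature set.
    """
    text = "".join(words)

    # Character offsets in `text` at which one word ends and the next begins.
    boundaries = set()
    offset = 0
    for word in words[:-1]:
        offset += len(word)
        boundaries.add(offset)

    # Pad both ends so every gap has a full-width context.
    padded = pad * window + text + pad * window

    samples = []
    for gap in range(1, len(text)):
        left = padded[gap:gap + window]                 # characters before the gap
        right = padded[gap + window:gap + 2 * window]   # characters after the gap
        label = 1 if gap in boundaries else 0
        samples.append(((left, right), label))
    return samples


if __name__ == "__main__":
    for context, label in break_point_samples(["我们", "喜欢", "自然", "语言"]):
        print(context, label)
```

Pairs of this kind could then be encoded as feature vectors and passed to any SVM implementation; how the paper derives its samples from the most frequent words in the corpus is not stated in the abstract and is not reproduced here.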