http://yixuan.cos.name/cn/2011/03/text-mining-of-song-poems/
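After reading the Song ci word-frequency example, I wanted to reimplement it in PHP, but splitting Chinese text in PHP drove me up the wall...
PHP is already at 5.3.8, and its handling of Chinese is still this bad...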
-----------------------------------------------------
(PHP source added)
Finally, a PHP version that works around the Chinese-text problem:
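<?php
// Workaround: the input is assumed to be UTF-8 with every matched character
// taking 3 bytes, so "n characters" is measured throughout as n * 3 bytes.
ini_set("memory_limit", "1024M");
echo "start\n";

// Read the corpus, keeping only lines that look like verse.
$file_handle = fopen("tangshi.txt", "r");
if ($file_handle === false) {
    exit("cannot open tangshi.txt\n");
}
$ci = array();
while (($line = fgets($file_handle)) !== false) {
    // Skip any line that contains digits.
    if (preg_match('/\d+/', $line)) continue;
    // Keep lines longer than 6 and shorter than 500 characters.
    if (strlen($line) > 6 * 3 && strlen($line) < 500 * 3) {
        $ci[] = $line;
    }
}
fclose($file_handle);

$groupcount = 3; // number of characters per group

// Break each line into phrases.
$sentence = array();
echo "start sentence\n";
foreach ($ci as $v) {
    // With the /u modifier, \w+ picks up runs of Chinese characters,
    // so punctuation acts as the separator.
    preg_match_all('/\w+/u', $v, $matches);
    foreach ($matches[0] as $val) {
        if (strlen($val) >= $groupcount * 3) {
            $sentence[] = $val;
        }
    }
}

// Slide a window of $groupcount characters over each phrase,
// advancing one character (3 bytes) at a time.
$word = array();
echo "start word\n";
foreach ($sentence as $v) {
    if (strlen($v) > 500 * 3 || strlen($v) < $groupcount * 3) continue;
    for ($i = 0; $i <= strlen($v) - $groupcount * 3; $i += 3) {
        $word[] = substr($v, $i, $groupcount * 3);
    }
}

// Count how often each group occurs.
echo "start statistic\n";
$statistic = array();
foreach ($word as $v) {
    if (array_key_exists($v, $statistic)) {
        $statistic[$v] = $statistic[$v] + 1;
    } else {
        $statistic[$v] = 1;
    }
}

// Print the groups that occur more than 50 times, most frequent first.
echo "start output\n";
arsort($statistic);
// each() was removed in PHP 8, so iterate with foreach instead.
foreach ($statistic as $key => $val) {
    if ($val > 50) {
        echo $key . $val . "\r\n";
    }
}
echo "done";
?>
Results: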
倚阑干125 知何处107 广寒宫94 到如今89 东风吹86 留不住79 人何处76 有谁知74 三十六70 西风吹66 云深处63 人间世62 不知何62 人不见61 人千里58 君知否58 与谁同57 不如归55 春归去54 归去来53 年今日53 何处是52
东风1400 何处1224 人间1196 风流848 春风800 归去793 西风780 江南773 归来770 相思763 梅花744 千里675 明月659 回首648 多少646 如今636 阑干632 年年608 万里592 一笑586 黄昏550 天涯538 当年537 一枝528 芳草527 相逢524 尊前516 风雨506 故人505 流水475 风吹472 依旧470 木兰467 风月460 多情455 当时446 无人444 斜阳438 减字435 不知434 兰花428 不见423 深处422 字木412 时节401 平生399 春色396 凄凉390 功名380 一点378 匆匆377 无限377 天上374 今日369 西湖360 杨柳360 桃花355 消息350 扁舟349 芙蓉344 憔悴338 神仙335 何事334 桃李333 一片332 十分332 心事329 黄花329 人生328 一声325 佳人322 长安320 断肠316 东君315 鸳鸯314 为谁313 而今313 少年310 十年310 去年309 海棠308 无情306 不是306 昨夜304 富贵303 时候303 行人302 今夜301 江上300 蓬莱300 不似299 青山296 谁知296 寂寞296 几度295 天气295 何时293 惟有293 一曲292 肠断292 往事291 月明290 笙歌287 如何285 燕子282 清明281 人在281 悠悠280 无数280 明年279 落花278 千古270 一番269 十二266 精神266 梨花264 今年263 十里261 垂杨261 分付260 何人260 年时260 春归259 *258 有人257 帘幕257 思量256 几时256 如此255 如许252 秋风252 明朝249 夕阳248 记得247 清风244 风露243 不须243 只有243 相对240 不堪240 酒醒239 从今239 庭院239 又是238 今朝238 缥缈238 盈盈237 花开237 不如236 殷勤236 风光235 夜来233 无语232 分明232 只恐229 旧时229 倚阑229 秋千228 依然228 今宵228 行云227 春来224 自有224 携手223 金缕221 花前221
PS: results for the Tang poems
何处1662 不知1462 万里1445 千里1299 今日1159 不见1155 不可1144 春风1123 白云1105 不得943 人间890 明月886 无人875 风吹832 故人778 惆怅775 秋风745 悠悠737 相思733 长安721 白日689 如何684 十年675 青山664 何人659 少年631 相逢627 平生589 年年587 寂寞586 天子585 天地584 人不583 黄金580 何事575 江上555 流水547 可怜535 回首533 如此521 主人520 白发520 今朝515 月明512 从此505 日月505 行人501 不如495 三十494 将军494 归去493 落日493 日暮487 不能482 别离480 洛阳474 何时474 此时472 天下471 芳草470 归来469 无事468 相见466 夕阳459 江南458 当时454 杨柳452 风雨450 东风440 青云435 洞庭432 参差432 花落430 落花429 天涯429 芙蓉428 清风423 不是420 烟霞415 白头415 桃花412 不相411 唯有408 君不408 何如407 南山403 谁能398 千年392 如今392 天上389 十二387 花开384 与君383 桃李381 君王379 终日379 殷勤379 此地377 浮云377 二十377 苍苍374 门前374 凤凰374 神仙373 千万370 山中367 美人363 鸳鸯361 有时360 不敢360 自有359 无限356 有馀355 风起354 处处353 萧萧352 一声352 尽日352 三千351 风流349 山川349 君子348 春色343 故乡343 萧条341 何必339 裴回339 分明338 不堪336 几时333
君不见234 不知何127 行路难108 三千里108 不可见100 三十六94 在何处90 知何处90 二十年87 卷三百84 百六十82 卷八百81 卷五百81 卷二百81 卷七百81 卷四百81 卷一百81 卷六百80 百四十77 百二十76 三十年75 百七十75 百八十74 百九十74 百三十74 不相见73 无消息73 百一十73 何处去72 百五十72 无一事70 洛阳城69 千万里69 何处是68 水东流68 向人间66 归未得65 歌一曲62 一杯酒61 千里外61 明月夜58 归何处57 从此去56 东风吹56 归去来55 今何在55 人不知55 草萋萋55 春风吹54 无人知53 不知谁53 人不见53 不见人52 与君同52 不得意52 长安道52 复何如51 人间事51
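The PHP script above moves through each phrase in 3-byte steps, which is only safe while every matched character takes exactly three bytes in UTF-8. A stray ASCII fragment in the corpus would throw the window out of alignment. Below is a minimal sketch of an alternative that counts in code points instead. It assumes UTF-8 input and a PCRE build with Unicode support. The helper names `utf8_chars` and `char_ngrams` are hypothetical and exist only for this illustration.

```php
<?php
// Hypothetical helper: split a UTF-8 string into an array of characters.
// An empty pattern with the /u modifier matches between code points.
function utf8_chars($s)
{
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    // preg_split() returns false on invalid UTF-8; treat that as "no characters".
    return $chars === false ? array() : $chars;
}

// Hypothetical helper: every overlapping group of $n consecutive characters.
function char_ngrams($s, $n)
{
    $chars = utf8_chars($s);
    $grams = array();
    for ($i = 0; $i + $n <= count($chars); $i++) {
        $grams[] = implode('', array_slice($chars, $i, $n));
    }
    return $grams;
}

// Two-character groups of a five-character phrase.
echo implode(' ', char_ngrams("犹解嫁东风", 2)), "\n";
// Mixed-width input also works, because nothing is measured in bytes.
echo implode(' ', char_ngrams("TXT东风", 3)), "\n";
```

Because the window is built from an array of characters, the group size can be changed without touching any byte arithmetic.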
-----------------------------------------------------
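So I took a look at the R code yixuan wrote and installed an R environment.
When I ran it, every piece of Chinese text printed in the console came out blank...
It turned out the R console would not display Chinese here...
I changed the script to write its output to a file, and only then could I see the results: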
l = scan("Ci.txt", "character", sep = "\n");
l.len = nchar(l);
# Some lines are authors and titles, so keep only lines longer than 10 characters;
# the text file is also rather messy, with URLs and the like,
# so overly long lines are excluded as well.
ci = l[l.len > 10 & l.len < 500];
# Split sentences on (full-width) punctuation marks.
sentences = strsplit(ci, ",|。|!|?|、");
sentences = unlist(sentences);
sentences = sentences[sentences != ""];
s.len = nchar(sentences);
# A phrase that is too long probably contains stray characters, so drop it.
sentences = sentences[s.len <= 10 & s.len >= 2];
s.len = nchar(sentences);
# Brute-force split into every adjacent pair; for example, the two-character
# combinations of "犹解嫁东风" are "犹解", "解嫁", "嫁东", "东风".
# Meaningless pairs naturally end up low in the frequency ranking.
splitwords = function(x, x.len) substring(x, 1:(x.len+1 - 2), 2:x.len);
words = mapply(splitwords, sentences, s.len, SIMPLIFY = TRUE, USE.NAMES = FALSE);
words = unlist(words);
words.freq = table(words);
words.freq = sort(words.freq, decreasing = TRUE);
# Write the 100 most frequent pairs to a file instead of the console.
df<-data.frame(Word = names(words.freq[1:100]), Freq = as.integer(words.freq[1:100]));
write.table(df, "1.txt");
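Changing the part highlighted in green in the original post (presumably the hard-coded 2s that set the group length) gives the 3- and 4-character results.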
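The green highlighting is lost in plain text, so the exact edit is not visible here. For comparison with the PHP script earlier, the sketch below makes the group length a parameter. The function name `top_ngrams` and the sample phrases are made up for illustration. The overlapping-lookahead pattern is a substitute technique, not what the R code or the PHP script above does, and it assumes UTF-8 input.

```php
<?php
// Hypothetical helper: count every run of $n consecutive characters in the
// given phrases and return the $k most frequent ones as gram => count.
function top_ngrams(array $sentences, $n, $k)
{
    // A lookahead consumes nothing, so successive matches may overlap; with
    // the /u modifier the engine advances one code point at a time.
    $pattern = '/(?=(.{' . (int) $n . '}))/us';
    $freq = array();
    foreach ($sentences as $s) {
        if (preg_match_all($pattern, $s, $m)) {
            foreach ($m[1] as $gram) {
                $freq[$gram] = isset($freq[$gram]) ? $freq[$gram] + 1 : 1;
            }
        }
    }
    arsort($freq);                          // most frequent first
    return array_slice($freq, 0, $k, true); // keep the gram keys
}

// Made-up sample phrases, not taken from the corpus.
$sample = array("东风吹雨", "又东风", "东风吹");
foreach (array(2, 3) as $n) {
    print_r(top_ngrams($sample, $n, 1));
}
```

Calling the function with 2, 3 and 4 in turn would produce the three kinds of tables shown in this post, given the same list of phrases.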
With R, the results finally came out right.
"Word" "Freq"
1,□□,1485
2,东风,1382
3,何处,1230
4,人间,1202
5,风流,857
6,归去,812
7,春风,802
8,西风,779
9,归来,771
10,江南,765
11,相思,753
12,梅花,732
13,千里,676
14,回首,656
15,明月,651
16,多少,648
17,如今,642
18,阑干,630
19,年年,613
20,万里,590
21,一笑,582
22,黄昏,550
23,当年,542
24,天涯,537
25,相逢,528
26,芳草,527
27,尊前,516
28,一枝,512
29,风雨,505
30,流水,472
31,依旧,472
32,风吹,471
33,风月,461
34,多情,457
35,故人,451
36,当时,450
37,无人,445
38,斜阳,438
39,不知,430
40,不见,429
41,深处,422
42,时节,403
43,平生,398
44,凄凉,398
45,春色,394
46,匆匆,383
47,功名,383
48,一点,378
49,无限,377
50,今日,369
51,天上,368
52,杨柳,362
53,西湖,356
54,桃花,354
55,扁舟,353
56,消息,351
57,憔悴,344
58,何事,339
59,芙蓉,338
60,神仙,334
61,一片,334
62,桃李,333
63,人生,332
64,十分,331
65,心事,329
66,黄花,328
67,一声,325
68,佳人,324
69,长安,321
70,东君,319
71,断肠,316
72,而今,315
73,鸳鸯,314
74,为谁,313
75,十年,310
76,去年,309
77,少年,308
78,海棠,307
79,寂寞,306
80,无情,306
81,不是,305
82,时候,304
83,肠断,303
84,富贵,303
85,蓬莱,303
86,昨夜,303
87,行人,302
88,今夜,301
89,谁知,300
90,不似,299
91,江上,298
92,悠悠,296
93,几度,295
94,青山,295
95,何时,294
96,天气,293
97,惟有,293
98,一曲,291
99,月明,291
100,往事,290
Three-character results
"Word" "Freq"
"1" "□□□" 893
"2" "倚阑干" 125
"3" "知何处" 107
"4" "广寒宫" 94
"5" "到如今" 89
"6" "东风吹" 85
"7" "留不住" 76
"8" "人何处" 76
"9" "有谁知" 74
"10" "三十六" 70
"11" "西风吹" 66
"12" "云深处" 63
"13" "不知何" 62
"14" "人间世" 61
"15" "人不见" 60
"16" "君知否" 58
"17" "人千里" 58
"18" "与谁同" 57
"19" "不如归" 54
"20" "春归去" 54
"21" "年今日" 53
"22" "何处是" 52
"23" "归去来" 51
"24" "二十四" 47
"25" "归何处" 47
"26" "花深处" 47
"27" "江南春" 46
"28" "向尊前" 46
"29" "花飞絮" 45
"30" "花流水" 45
"31" "记当年" 44
"32" "雨初晴" 43
"33" "长安道" 43
"34" "倚东风" 42
"35" "功名事" 41
"36" "海棠开" 41
"37" "谁知道" 41
"38" "知多少" 41
"39" "东风里" 40
"40" "海棠花" 40
"41" "老人星" 40
"42" "那堪更" 40
"43" "桥流水" 40
"44" "山无数" 40
"45" "在何处" 40
"46" "断人肠" 39
"47" "水晶宫" 39
"48" "月清风" 39
"49" "落花飞" 38
"50" "如归去" 38
"51" "无消息" 38
"52" "花无数" 37
"53" "人间天" 37
"54" "随流水" 37
"55" "TXT" 36
"56" "杜鹃啼" 36
"57" "歌金缕" 36
"58" "个人人" 36
"59" "故人相" 36
"60" "间天上" 36
"61" "明月清" 36
"62" "送春归" 36
"63" "影横斜" 36
"64" "春消息" 35
"65" "人间何" 35
"66" "一番新" 35
"67" "一年春" 35
"68" "一枝春" 35
"69" "二十年" 34
"70" "寄相思" 34
"71" "去来兮" 34
"72" "去年今" 34
"73" "人如玉" 34
"74" "是人间" 34
"75" "寿阳妆" 34
"76" "无觅处" 34
"77" "有个人" 34
"78" "此时情" 33
"79" "江南路" 33
"80" "留春住" 33
"81" "然一笑" 33
"82" "唱阳关" 32
"83" "分付与" 32
"84" "何处去" 32
"85" "江南岸" 32
"86" "人憔悴" 32
"87" "人去后" 32
"88" "十年前" 32
"89" "天如水" 32
"90" "须信道" 32
"91" "一枝斜" 32
"92" "又是一" 32
"93" "几时休" 31
"94" "尽人间" 31
"95" "天上人" 31
"96" "月黄昏" 31
"97" "在人间" 31
"98" "最好是" 31
"99" "暗香浮" 30
"100" "断肠声" 30
Four-character results
"Word" "Freq"
"1" "□□□□" 514
"2" "不知何处" 39
"3" "不如归去" 36
"4" "人间天上" 36
"5" "归去来兮" 34
"6" "明月清风" 34
"7" "广寒宫殿" 29
"8" "落花飞絮" 28
"9" "天上人间" 28
"10" "江南江北" 27
"11" "落花流水" 26
"12" "绿鬓朱颜" 26
"13" "一觞一咏" 26
"14" "去年今日" 25
"15" "人间何处" 25
"16" "朝朝暮暮" 24
"17" "三万六千" 24
"18" "疏影横斜" 24
"19" "清风明月" 23
"20" "三十六宫" 23
"21" "十洲三岛" 23
"22" "暗香浮动" 22
"23" "独倚阑干" 22
"24" "良辰美景" 22
"25" "功名富贵" 21
"26" "寒食清明" 21
"27" "岁岁年年" 21
"28" "五云深处" 21
"29" "嫣然一笑" 21
"30" "一轮明月" 21
"31" "朱颜绿鬓" 21
"32" "不堪回首" 20
"33" "二十四桥" 20
"34" "江北江南" 20
"35" "十二阑干" 20
"36" "冰肌玉骨" 19
"37" "满城风雨" 19
"38" "钱塘江上" 19
"39" "竹篱茅舍" 19
"40" "百花头上" 18
"41" "登山临水" 18
"42" "故人何处" 18
"43" "纶巾羽扇" 18
"44" "前度刘郎" 18
"45" "夜来风雨" 18
"46" "一钩新月" 18
"47" "有个人人" 18
"48" "玉皇香案" 18
"49" "葱葱佳气" 17
"50" "道骨仙风" 17
"51" "故人千里" 17
"52" "凌波微步" 17
"53" "茂林修竹" 17
"54" "赏心乐事" 17
"55" "数声啼鸟" 17
"56" "斜风细雨" 17
"57" "一年春事" 17
"58" "玉骨冰肌" 17
"59" "非烟非雾" 16
"60" "南北东西" 16
"61" "年年今日" 16
"62" "沈香亭北" 16
"63" "春归何处" 15
"64" "广寒宫阙" 15
"65" "记得年时" 15
"66" "桃花流水" 15
"67" "小春时候" 15
"68" "燕子归来" 15
"69" "又是一番" 15
"70" "与花为主" 15
"71" "整顿乾坤" 15
"72" "尊前一笑" 15
"73" "今夕何夕" 14
"74" "金风玉露" 14
"75" "绿窗朱户" 14
"76" "年年岁岁" 14
"77" "千岩万壑" 14
"78" "人生行乐" 14
"79" "如今憔悴" 14
"80" "天长地久" 14
"81" "仙风道骨" 14
"82" "一叶扁舟" 14
"83" "暗香疏影" 13
"84" "不比寻常" 13
"85" "潮生潮落" 13
"86" "对酒当歌" 13
"87" "富贵功名" 13
"88" "画桥流水" 13
"89" "*如画" 13
"90" "七十古来" 13
"91" "曲水流觞" 13
"92" "十古来稀" 13
"93" "天寒日暮" 13
"94" "天若有情" 13
"95" "文章太守" 13
"96" "细雨斜风" 13
"97" "一年一度" 13
"98" "倚遍阑干" 13