On Song Ci Word-Frequency Statistics (R)

Date: 2022-07-20 04:04:49

http://yixuan.cos.name/cn/2011/03/text-mining-of-song-poems/


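After reading the Song ci word-frequency example, I wanted to reimplement it in PHP, but splitting Chinese text in PHP drove me up the wall...

PHP is already at 5.3.8, and it still handles something as basic as Chinese this badly...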


-----------------------------------------------------

(Update: PHP source added)

I finally found a workaround that gets PHP to cope with Chinese:

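<?php
// Workaround: the input is assumed to be UTF-8, where each Chinese character
// occupies 3 bytes, so all lengths and offsets below are multiples of 3.
ini_set("memory_limit", "1024M");
echo "start\n";

// Read the corpus and keep only lines that look like poem text.
$file_handle = fopen("tangshi.txt", "r");
$ci = array();
while (($line = fgets($file_handle)) !== false) {
    // Skip lines that contain digits.
    if (preg_match('/\d+/', $line)) continue;
    // Keep lines longer than 6 and shorter than 500 characters.
    if (strlen($line) > 6 * 3 && strlen($line) < 500 * 3) {
        $ci[] = $line;
    }
}
fclose($file_handle);

$groupcount = 3; // number of characters per group

// Split each line into runs of word characters; punctuation acts as the separator.
echo "start sentence\n";
$sentence = array();
foreach ($ci as $v) {
    preg_match_all('/\w+/u', $v, $matches);
    foreach ($matches[0] as $val) {
        if (strlen($val) >= $groupcount * 3) {
            $sentence[] = $val;
        }
    }
}

// Slide a window of $groupcount characters over each sentence,
// advancing one character (3 bytes) at a time.
echo "start word\n";
$word = array();
foreach ($sentence as $v) {
    if (strlen($v) > 500 * 3 || strlen($v) < $groupcount * 3) continue;
    for ($i = 0; $i <= strlen($v) - $groupcount * 3; $i += 3) {
        $word[] = substr($v, $i, $groupcount * 3);
    }
}

// Count the occurrences of each group.
echo "start statistic\n";
$statistic = array();
foreach ($word as $v) {
    if (array_key_exists($v, $statistic)) {
        $statistic[$v] = $statistic[$v] + 1;
    } else {
        $statistic[$v] = 1;
    }
}

// Print the groups that occur more than 50 times, most frequent first.
// foreach replaces each(), which was removed in PHP 8.
echo "start output\n";
arsort($statistic);
foreach ($statistic as $key => $val) {
    if ($val > 50) {
        echo $key . $val . "\r\n";
    }
}
echo "done";
?>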
Results

倚阑干125    知何处107    广寒宫94    到如今89    东风吹86    留不住79    人何处76    有谁知74    三十六70    西风吹66    云深处63    人间世62    不知何62    人不见61    人千里58    君知否58    与谁同57    不如归55    春归去54    归去来53    年今日53    何处是52


东风1400    何处1224    人间1196    风流848    春风800    归去793    西风780    江南773    归来770    相思763    梅花744    千里675    明月659    回首648    多少646    如今636    阑干632    年年608    万里592    一笑586    黄昏550    天涯538    当年537    一枝528    芳草527    相逢524    尊前516    风雨506    故人505    流水475    风吹472    依旧470    木兰467    风月460    多情455    当时446    无人444    斜阳438    减字435    不知434    兰花428    不见423    深处422    字木412    时节401    平生399    春色396    凄凉390    功名380    一点378    匆匆377    无限377    天上374    今日369    西湖360    杨柳360    桃花355    消息350    扁舟349    芙蓉344    憔悴338    神仙335    何事334    桃李333    一片332    十分332    心事329    黄花329    人生328    一声325    佳人322    长安320    断肠316    东君315    鸳鸯314    为谁313    而今313    少年310    十年310    去年309    海棠308    无情306    不是306    昨夜304    富贵303    时候303    行人302    今夜301    江上300    蓬莱300    不似299    青山296    谁知296    寂寞296    几度295    天气295    何时293    惟有293    一曲292    肠断292    往事291    月明290    笙歌287    如何285    燕子282    清明281    人在281    悠悠280    无数280    明年279    落花278    千古270    一番269    十二266    精神266    梨花264    今年263    十里261    垂杨261    分付260    何人260    年时260    春归259    *258    有人257    帘幕257    思量256    几时256    如此255    如许252    秋风252    明朝249    夕阳248    记得247    清风244    风露243    不须243    只有243    相对240    不堪240    酒醒239    从今239    庭院239    又是238    今朝238    缥缈238    盈盈237    花开237    不如236    殷勤236    风光235    夜来233    无语232    分明232    只恐229    旧时229    倚阑229    秋千228    依然228    今宵228    行云227    春来224    自有224    携手223    金缕221    花前221


P.S. The results for Tang poetry:

何处1662    不知1462    万里1445    千里1299    今日1159    不见1155    不可1144    春风1123    白云1105    不得943    人间890    明月886    无人875    风吹832    故人778    惆怅775    秋风745    悠悠737    相思733    长安721    白日689    如何684    十年675    青山664    何人659    少年631    相逢627    平生589    年年587    寂寞586    天子585    天地584    人不583    黄金580    何事575    江上555    流水547    可怜535    回首533    如此521    主人520    白发520    今朝515    月明512    从此505    日月505    行人501    不如495    三十494    将军494    归去493    落日493    日暮487    不能482    别离480    洛阳474    何时474    此时472    天下471    芳草470    归来469    无事468    相见466    夕阳459    江南458    当时454    杨柳452    风雨450    东风440    青云435    洞庭432    参差432    花落430    落花429    天涯429    芙蓉428    清风423    不是420    烟霞415    白头415    桃花412    不相411    唯有408    君不408    何如407    南山403    谁能398    千年392    如今392    天上389    十二387    花开384    与君383    桃李381    君王379    终日379    殷勤379    此地377    浮云377    二十377    苍苍374    门前374    凤凰374    神仙373    千万370    山中367    美人363    鸳鸯361    有时360    不敢360    自有359    无限356    有馀355    风起354    处处353    萧萧352    一声352    尽日352    三千351    风流349    山川349    君子348    春色343    故乡343    萧条341    何必339    裴回339    分明338    不堪336    几时333


君不见234    不知何127    行路难108    三千里108    不可见100    三十六94    在何处90    知何处90    二十年87    卷三百84    百六十82    卷八百81    卷五百81    卷二百81    卷七百81    卷四百81    卷一百81    卷六百80    百四十77    百二十76    三十年75    百七十75    百八十74    百九十74    百三十74    不相见73    无消息73    百一十73    何处去72    百五十72    无一事70    洛阳城69    千万里69    何处是68    水东流68    向人间66    归未得65    歌一曲62    一杯酒61    千里外61    明月夜58    归何处57    从此去56    东风吹56    归去来55    今何在55    人不知55    草萋萋55    春风吹54    无人知53    不知谁53    人不见53    不见人52    与君同52    不得意52    长安道52    复何如51    人间事51

-----------------------------------------------------


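So I took a look at the R code yixuan wrote and installed an R environment.

When I ran it, all the Chinese text in the console output came out blank...

It turns out the R console, at least in my setup, does not display Chinese...


I changed the script to write its output to a file, and only then could I see the results: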

l = scan("Ci.txt", "character", sep = "\n");
l.len = nchar(l);
 
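# Some lines are authors and titles, so keep only lines longer than 10 characters;
# the text file is also somewhat messy and contains URLs and the like,
# so overly long lines are excluded as well.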
ci = l[l.len > 10 & l.len < 500];
 
# Split sentences on (full-width) punctuation marks.
sentences = strsplit(ci, ",|。|!|?|、");
sentences = unlist(sentences);
sentences = sentences[sentences != ""];
s.len = nchar(sentences);
 
# An overly long sentence probably contains bad characters, so drop it.
sentences = sentences[s.len <= 10 &  s.len >= 2];
s.len = nchar(sentences);
 
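# Brute-force split into every adjacent pair; for example, the two-character
# groups of “犹解嫁东风” are “犹解”, “解嫁”, “嫁东”, “东风”.
# Meaningless combinations naturally end up at the low end of the counts.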
splitwords = function(x, x.len) substring(x, 1:(x.len+1 - 2), 2:x.len);
 
words = mapply(splitwords, sentences, s.len, SIMPLIFY = TRUE, USE.NAMES = FALSE);
words = unlist(words);
words.freq = table(words);
words.freq = sort(words.freq, decreasing = TRUE);
df<-data.frame(Word = names(words.freq[1:100]), Freq = as.integer(words.freq[1:100]));
write.table(df, "1.txt");
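
Since the console could not show the Chinese text, it may help to state the file encoding explicitly when writing the results out. The following is only a minimal sketch, assuming UTF-8 is the encoding you want; the two sample rows and the temporary path `out` are illustrative and not part of the original script.

```r
# Write a small frequency table to a UTF-8 file and read it back.
# The sample rows and the path `out` are illustrative assumptions.
df = data.frame(Word = c("东风", "何处"), Freq = c(1382L, 1230L))
out = tempfile(fileext = ".txt")

# fileEncoding declares the encoding used for the file on disk.
write.table(df, out, fileEncoding = "UTF-8")

# Read the file back with the same encoding to check the round trip.
back = read.table(out, header = TRUE, fileEncoding = "UTF-8",
                  stringsAsFactors = FALSE)
print(back$Freq)
```

Opening the resulting file in an editor that understands UTF-8 should then show the words even when the console does not.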

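Change the group length (the 2 in `s.len >= 2`, `x.len+1 - 2`, and `2:x.len`, which were highlighted in green in the original post) to get the results for 3 and 4 characters.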
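
To avoid editing several literals each time, the group length can be pulled out into a parameter. This is a minimal sketch of that idea rather than yixuan's code: the function name `ngrams` and the parameter `n` are my own, and the two sample sentences stand in for the real corpus.

```r
# Split one sentence into all overlapping groups of n characters.
# `ngrams` and `n` are illustrative names, not part of the original script.
ngrams = function(x, n) {
  len = nchar(x)
  # Sentences shorter than n characters yield no groups.
  if (len < n) return(character(0))
  starts = 1:(len - n + 1)
  substring(x, starts, starts + n - 1)
}

# Two sample sentences standing in for the corpus.
sentences = c("犹解嫁东风", "东风")
words = unlist(lapply(sentences, ngrams, n = 2))

# Count each group and sort by descending frequency.
words.freq = sort(table(words), decreasing = TRUE)
print(words.freq)
```

Calling `ngrams` with `n = 3` or `n = 4` gives the three- and four-character groups without touching the rest of the pipeline.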

This time the results finally came out right.

"Word" "Freq"
1,□□,1485
2,东风,1382
3,何处,1230
4,人间,1202
5,风流,857
6,归去,812
7,春风,802
8,西风,779
9,归来,771
10,江南,765
11,相思,753
12,梅花,732
13,千里,676
14,回首,656
15,明月,651
16,多少,648
17,如今,642
18,阑干,630
19,年年,613
20,万里,590
21,一笑,582
22,黄昏,550
23,当年,542
24,天涯,537
25,相逢,528
26,芳草,527
27,尊前,516
28,一枝,512
29,风雨,505
30,流水,472
31,依旧,472
32,风吹,471
33,风月,461
34,多情,457
35,故人,451
36,当时,450
37,无人,445
38,斜阳,438
39,不知,430
40,不见,429
41,深处,422
42,时节,403
43,平生,398
44,凄凉,398
45,春色,394
46,匆匆,383
47,功名,383
48,一点,378
49,无限,377
50,今日,369
51,天上,368
52,杨柳,362
53,西湖,356
54,桃花,354
55,扁舟,353
56,消息,351
57,憔悴,344
58,何事,339
59,芙蓉,338
60,神仙,334
61,一片,334
62,桃李,333
63,人生,332
64,十分,331
65,心事,329
66,黄花,328
67,一声,325
68,佳人,324
69,长安,321
70,东君,319
71,断肠,316
72,而今,315
73,鸳鸯,314
74,为谁,313
75,十年,310
76,去年,309
77,少年,308
78,海棠,307
79,寂寞,306
80,无情,306
81,不是,305
82,时候,304
83,肠断,303
84,富贵,303
85,蓬莱,303
86,昨夜,303
87,行人,302
88,今夜,301
89,谁知,300
90,不似,299
91,江上,298
92,悠悠,296
93,几度,295
94,青山,295
95,何时,294
96,天气,293
97,惟有,293
98,一曲,291
99,月明,291
100,往事,290

Three characters:

"Word" "Freq"
"1" "□□□" 893
"2" "倚阑干" 125
"3" "知何处" 107
"4" "广寒宫" 94
"5" "到如今" 89
"6" "东风吹" 85
"7" "留不住" 76
"8" "人何处" 76
"9" "有谁知" 74
"10" "三十六" 70
"11" "西风吹" 66
"12" "云深处" 63
"13" "不知何" 62
"14" "人间世" 61
"15" "人不见" 60
"16" "君知否" 58
"17" "人千里" 58
"18" "与谁同" 57
"19" "不如归" 54
"20" "春归去" 54
"21" "年今日" 53
"22" "何处是" 52
"23" "归去来" 51
"24" "二十四" 47
"25" "归何处" 47
"26" "花深处" 47
"27" "江南春" 46
"28" "向尊前" 46
"29" "花飞絮" 45
"30" "花流水" 45
"31" "记当年" 44
"32" "雨初晴" 43
"33" "长安道" 43
"34" "倚东风" 42
"35" "功名事" 41
"36" "海棠开" 41
"37" "谁知道" 41
"38" "知多少" 41
"39" "东风里" 40
"40" "海棠花" 40
"41" "老人星" 40
"42" "那堪更" 40
"43" "桥流水" 40
"44" "山无数" 40
"45" "在何处" 40
"46" "断人肠" 39
"47" "水晶宫" 39
"48" "月清风" 39
"49" "落花飞" 38
"50" "如归去" 38
"51" "无消息" 38
"52" "花无数" 37
"53" "人间天" 37
"54" "随流水" 37
"55" "TXT" 36
"56" "杜鹃啼" 36
"57" "歌金缕" 36
"58" "个人人" 36
"59" "故人相" 36
"60" "间天上" 36
"61" "明月清" 36
"62" "送春归" 36
"63" "影横斜" 36
"64" "春消息" 35
"65" "人间何" 35
"66" "一番新" 35
"67" "一年春" 35
"68" "一枝春" 35
"69" "二十年" 34
"70" "寄相思" 34
"71" "去来兮" 34
"72" "去年今" 34
"73" "人如玉" 34
"74" "是人间" 34
"75" "寿阳妆" 34
"76" "无觅处" 34
"77" "有个人" 34
"78" "此时情" 33
"79" "江南路" 33
"80" "留春住" 33
"81" "然一笑" 33
"82" "唱阳关" 32
"83" "分付与" 32
"84" "何处去" 32
"85" "江南岸" 32
"86" "人憔悴" 32
"87" "人去后" 32
"88" "十年前" 32
"89" "天如水" 32
"90" "须信道" 32
"91" "一枝斜" 32
"92" "又是一" 32
"93" "几时休" 31
"94" "尽人间" 31
"95" "天上人" 31
"96" "月黄昏" 31
"97" "在人间" 31
"98" "最好是" 31
"99" "暗香浮" 30
"100" "断肠声" 30

Four characters:

"Word" "Freq"
"1" "□□□□" 514
"2" "不知何处" 39
"3" "不如归去" 36
"4" "人间天上" 36
"5" "归去来兮" 34
"6" "明月清风" 34
"7" "广寒宫殿" 29
"8" "落花飞絮" 28
"9" "天上人间" 28
"10" "江南江北" 27
"11" "落花流水" 26
"12" "绿鬓朱颜" 26
"13" "一觞一咏" 26
"14" "去年今日" 25
"15" "人间何处" 25
"16" "朝朝暮暮" 24
"17" "三万六千" 24
"18" "疏影横斜" 24
"19" "清风明月" 23
"20" "三十六宫" 23
"21" "十洲三岛" 23
"22" "暗香浮动" 22
"23" "独倚阑干" 22
"24" "良辰美景" 22
"25" "功名富贵" 21
"26" "寒食清明" 21
"27" "岁岁年年" 21
"28" "五云深处" 21
"29" "嫣然一笑" 21
"30" "一轮明月" 21
"31" "朱颜绿鬓" 21
"32" "不堪回首" 20
"33" "二十四桥" 20
"34" "江北江南" 20
"35" "十二阑干" 20
"36" "冰肌玉骨" 19
"37" "满城风雨" 19
"38" "钱塘江上" 19
"39" "竹篱茅舍" 19
"40" "百花头上" 18
"41" "登山临水" 18
"42" "故人何处" 18
"43" "纶巾羽扇" 18
"44" "前度刘郎" 18
"45" "夜来风雨" 18
"46" "一钩新月" 18
"47" "有个人人" 18
"48" "玉皇香案" 18
"49" "葱葱佳气" 17
"50" "道骨仙风" 17
"51" "故人千里" 17
"52" "凌波微步" 17
"53" "茂林修竹" 17
"54" "赏心乐事" 17
"55" "数声啼鸟" 17
"56" "斜风细雨" 17
"57" "一年春事" 17
"58" "玉骨冰肌" 17
"59" "非烟非雾" 16
"60" "南北东西" 16
"61" "年年今日" 16
"62" "沈香亭北" 16
"63" "春归何处" 15
"64" "广寒宫阙" 15
"65" "记得年时" 15
"66" "桃花流水" 15
"67" "小春时候" 15
"68" "燕子归来" 15
"69" "又是一番" 15
"70" "与花为主" 15
"71" "整顿乾坤" 15
"72" "尊前一笑" 15
"73" "今夕何夕" 14
"74" "金风玉露" 14
"75" "绿窗朱户" 14
"76" "年年岁岁" 14
"77" "千岩万壑" 14
"78" "人生行乐" 14
"79" "如今憔悴" 14
"80" "天长地久" 14
"81" "仙风道骨" 14
"82" "一叶扁舟" 14
"83" "暗香疏影" 13
"84" "不比寻常" 13
"85" "潮生潮落" 13
"86" "对酒当歌" 13
"87" "富贵功名" 13
"88" "画桥流水" 13
"89" "*如画" 13
"90" "七十古来" 13
"91" "曲水流觞" 13
"92" "十古来稀" 13
"93" "天寒日暮" 13
"94" "天若有情" 13
"95" "文章太守" 13
"96" "细雨斜风" 13
"97" "一年一度" 13
"98" "倚遍阑干" 13