Python 3 Natural Language (NLTK): Language Big Data

Date: 2022-09-09 11:12:53

NLTK

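NLTK is a Python library for processing text. Written knowledge adds up to an enormous volume of data, which is why this post belongs to the big data series.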

This post only scratches the surface. I have not worked in this area yet and wrote it out of occasional curiosity.

Loading every item from NLTK's book module

  • The book module contains all the data
from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
text1
<Text: Moby Dick by Herman Melville 1851>
text2
<Text: Sense and Sensibility by Jane Austen 1811>

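Searching text or topics

  1. concordance looks up a word in a text and prints every occurrence together with its surrounding context.
  2. similar lists words that occur in contexts similar to those of the search word; this can be useful for relevance ranking in a search engine.
  3. common_contexts shows the contexts shared by two or more words.
  4. dispersion_plot draws a lexical dispersion plot for the given words.
text1.concordance('monstrous') # look up the word 'monstrous' in text1
# concordance
# UK [kən'kɔːd(ə)ns] US [kən'kɔrdns]
# n. agreement, harmony; an index of the important words used in a work or in an author's collected works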
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
text2.concordance('affection')
Displaying 25 of 79 matches:
, however , and , as a mark of his affection for the three girls , he left them
t . It was very well known that no affection was ever supposed to exist between
deration of politeness or maternal affection on the side of the former , the tw
d the suspicion -- the hope of his affection for me may warrant , without impru
hich forbade the indulgence of his affection . She knew that his mother neither
rd she gave one with still greater affection . Though her late conversation wit
can never hope to feel or inspire affection again , and if her home be uncomfo
m of the sense , elegance , mutual affection , and domestic comfort of the fami
, and which recommended him to her affection beyond every thing else . His soci
ween the parties might forward the affection of Mr . Willoughby , an equally st
the most pointed assurance of her affection . Elinor could not be surprised at
he natural consequence of a strong affection in a young and ardent mind . This
opinion . But by an appeal to her affection for her mother , by representing t
every alteration of a place which affection had established as perfect with hi
e will always have one claim of my affection , which no other can possibly shar
f the evening declared at once his affection and happiness . " Shall we see you
ause he took leave of us with less affection than his usual behaviour has shewn
ness ." " I want no proof of their affection ," said Elinor ; " but of their en
onths , without telling her of his affection ;-- that they should part without
ould be the natural result of your affection for her . She used to be all unres
distinguished Elinor by no mark of affection . Marianne saw and listened with i
th no inclination for expense , no affection for strangers , no profession , an
till distinguished her by the same affection which once she had felt no doubt o
al of her confidence in Edward ' s affection , to the remembrance of every mark
was made ? Had he never owned his affection to yourself ?" " Oh , no ; but if
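To make the idea behind concordance concrete, here is a minimal sketch in plain Python, with no NLTK required. The function name `simple_concordance`, its `width` parameter, and the toy sentence are all made up for illustration; NLTK's own implementation is more elaborate (it aligns the output by characters, for example). The case-insensitive matching mirrors what the output above shows, where both "monstrous" and "Monstrous" are found.

```python
def simple_concordance(tokens, word, width=3):
    """Return each occurrence of `word` with `width` tokens of context on each side."""
    target = word.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:  # case-insensitive match
            left = ' '.join(tokens[max(0, i - width):i])
            right = ' '.join(tokens[i + 1:i + 1 + width])
            lines.append('{} [{}] {}'.format(left, tok, right))
    return lines


# Hypothetical toy text, already split into tokens
toy_tokens = 'the whale was of a most monstrous size and a monstrous bulk'.split()
concordance_lines = simple_concordance(toy_tokens, 'monstrous')
for line in concordance_lines:
    print(line)
```

Each printed line shows the matched word in brackets with up to three tokens on either side.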
text1.similar('monstrous')
true contemptible christian abundant few part mean careful puzzled
mystifying passing curious loving wise doleful gamesome singular
delightfully perilous fearless
text2.similar('monstrous')
very so exceedingly heartily a as good great extremely remarkably
sweet vast amazingly
text2.common_contexts(['monstrous','very'])
a_pretty am_glad a_lucky is_pretty be_glad
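The left_right notation in the output above (for example a_pretty) names the word before and the word after the search term. The following is a rough sketch of that idea in plain Python, assuming a context is simply the pair (previous token, next token). The helper names `contexts_of` and `shared_contexts` and the toy sentence are hypothetical, and NLTK additionally ranks contexts by frequency, which this sketch does not do.

```python
def contexts_of(tokens, word):
    """Collect the (previous, next) token pairs around each occurrence of `word`."""
    found = set()
    # Skip the first and last token: they have no neighbour on one side
    for i in range(1, len(tokens) - 1):
        if tokens[i] == word:
            found.add((tokens[i - 1], tokens[i + 1]))
    return found


def shared_contexts(tokens, word_a, word_b):
    """Return the contexts, written as left_right, in which both words occur."""
    common = contexts_of(tokens, word_a) & contexts_of(tokens, word_b)
    return sorted('{}_{}'.format(left, right) for left, right in common)


# Hypothetical toy text
toy_tokens = ('a monstrous pretty girl and a very pretty house ; '
              'i am very glad , i am monstrous glad').split()
common = shared_contexts(toy_tokens, 'monstrous', 'very')
print(common)
```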
# Show where each word occurs in the text, measured as its offset in words from the beginning.
# Each stripe represents an instance of a word,
# and each row represents the entire text.
text4.dispersion_plot(['citizens','democracy','freedom','duties','America','liberty'])
# dispersion
# UK [dɪ'spɜːʃ(ə)n] US [dɪ'spɝʒn]
# n. scattering; (statistics) deviation; dispersal
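A dispersion plot is essentially a picture of word offsets. As a small sketch of the data behind such a plot, assuming only a list of tokens, the hypothetical helper `word_offsets` below records the position of every occurrence of each target word; plotting those positions as stripes along one row per word would give the chart. The toy sentence is invented for the example.

```python
def word_offsets(tokens, words):
    """Map each target word to the list of token positions where it occurs."""
    positions = {w: [] for w in words}
    for i, tok in enumerate(tokens):
        if tok in positions:
            positions[tok].append(i)
    return positions


# Hypothetical toy text
toy_tokens = 'citizens cherish liberty and citizens defend liberty for America'.split()
offsets = word_offsets(toy_tokens, ['citizens', 'liberty', 'America'])
for word, where in offsets.items():
    print(word, where)
```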

print(text3.generate('monstrous'))
None

Counting vocabulary

len(text3)
44764
sorted(set(text3))
['!',
"'",
'(',
')',
',',
',)',
'.',
'.)',
':',
';',
';)',
'?',
'?)',
'A',
'Abel',
'Abelmizraim',
'Abidah',
'Abide',
'Abimael',
'Abimelech',
'Abr',
'Abrah',
'Abraham',
'Abram',
'Accad',
'Achbor',
'Adah',
'Adam',
'Adbeel',
'Admah',
'Adullamite',
'After',
'Aholibamah',
'Ahuzzath',
'Ajah',
'Akan',
'All',
'Allonbachuth',
'Almighty',
'Almodad',
'Also',
'Alvah',
'Alvan',
'Am',
'Amal',
'Amalek',
'Amalekites',
'Ammon',
'Amorite',
'Amorites',
'Amraphel',
'An',
'Anah',
'Anamim',
'And',
'Aner',
'Angel',
'Appoint',
'Aram',
'Aran',
'Ararat',
'Arbah',
'Ard',
'Are',
'Areli',
'Arioch',
'Arise',
'Arkite',
'Arodi',
'Arphaxad',
'Art',
'Arvadite',
'As',
'Asenath',
'Ashbel',
'Asher',
'Ashkenaz',
'Ashteroth',
'Ask',
'Asshur',
'Asshurim',
'Assyr',
'Assyria',
'At',
'Atad',
'Avith',
'Baalhanan',
'Babel',
'Bashemath',
'Be',
'Because',
'Becher',
'Bedad',
'Beeri',
'Beerlahairoi',
'Beersheba',
'Behold',
'Bela',
'Belah',
'Benam',
'Benjamin',
'Beno',
'Beor',
'Bera',
'Bered',
'Beriah',
'Bethel',
'Bethlehem',
'Bethuel',
'Beware',
'Bilhah',
'Bilhan',
'Binding',
'Birsha',
'Bless',
'Blessed',
'Both',
'Bow',
'Bozrah',
'Bring',
'But',
'Buz',
'By',
'Cain',
'Cainan',
'Calah',
'Calneh',
'Can',
'Cana',
'Canaan',
'Canaanite',
'Canaanites',
'Canaanitish',
'Caphtorim',
'Carmi',
'Casluhim',
'Cast',
'Cause',
'Chaldees',
'Chedorlaomer',
'Cheran',
'Cherubims',
'Chesed',
'Chezib',
'Come',
'Cursed',
'Cush',
'Damascus',
'Dan',
'Day',
'Deborah',
'Dedan',
'Deliver',
'Diklah',
'Din',
'Dinah',
'Dinhabah',
'Discern',
'Dishan',
'Dishon',
'Do',
'Dodanim',
'Dothan',
'Drink',
'Duke',
'Dumah',
'Earth',
'Ebal',
'Eber',
'Edar',
'Eden',
'Edom',
'Edomites',
'Egy',
'Egypt',
'Egyptia',
'Egyptian',
'Egyptians',
'Ehi',
'Elah',
'Elam',
'Elbethel',
'Eldaah',
'EleloheIsrael',
'Eliezer',
'Eliphaz',
'Elishah',
'Ellasar',
'Elon',
'Elparan',
'Emins',
'En',
'Enmishpat',
'Eno',
'Enoch',
'Enos',
'Ephah',
'Epher',
'Ephra',
'Ephraim',
'Ephrath',
'Ephron',
'Er',
'Erech',
'Eri',
'Es',
'Esau',
'Escape',
'Esek',
'Eshban',
'Eshcol',
'Ethiopia',
'Euphrat',
'Euphrates',
'Eve',
'Even',
'Every',
'Except',
'Ezbon',
'Ezer',
'Fear',
'Feed',
'Fifteen',
'Fill',
'For',
'Forasmuch',
'Forgive',
'From',
'Fulfil',
'G',
'Gad',
'Gaham',
'Galeed',
'Gatam',
'Gather',
'Gaza',
'Gentiles',
'Gera',
'Gerar',
'Gershon',
'Get',
'Gether',
'Gihon',
'Gilead',
'Girgashites',
'Girgasite',
'Give',
'Go',
'God',
'Gomer',
'Gomorrah',
'Goshen',
'Guni',
'Hadad',
'Hadar',
'Hadoram',
'Hagar',
'Haggi',
'Hai',
'Ham',
'Hamathite',
'Hamor',
'Hamul',
'Hanoch',
'Happy',
'Haran',
'Hast',
'Haste',
'Have',
'Havilah',
'Hazarmaveth',
'Hazezontamar',
'Hazo',
'He',
'Hear',
'Heaven',
'Heber',
'Hebrew',
'Hebrews',
'Hebron',
'Hemam',
'Hemdan',
'Here',
'Hereby',
'Heth',
'Hezron',
'Hiddekel',
'Hinder',
'Hirah',
'His',
'Hitti',
'Hittite',
'Hittites',
'Hivite',
'Hobah',
'Hori',
'Horite',
'Horites',
'How',
'Hul',
'Huppim',
'Husham',
'Hushim',
'Huz',
'I',
'If',
'In',
'Irad',
'Iram',
'Is',
'Isa',
'Isaac',
'Iscah',
'Ishbak',
'Ishmael',
'Ishmeelites',
'Ishuah',
'Isra',
'Israel',
'Issachar',
'Isui',
'It',
'Ithran',
'Jaalam',
'Jabal',
'Jabbok',
'Jac',
'Jachin',
'Jacob',
'Jahleel',
'Jahzeel',
'Jamin',
'Japhe',
'Japheth',
'Jared',
'Javan',
'Jebusite',
'Jebusites',
'Jegarsahadutha',
'Jehovahjireh',
'Jemuel',
'Jerah',
'Jetheth',
'Jetur',
'Jeush',
'Jezer',
'Jidlaph',
'Jimnah',
'Job',
'Jobab',
'Jokshan',
'Joktan',
'Jordan',
'Joseph',
'Jubal',
'Judah',
'Judge',
'Judith',
'Kadesh',
'Kadmonites',
'Karnaim',
'Kedar',
'Kedemah',
'Kemuel',
'Kenaz',
'Kenites',
'Kenizzites',
'Keturah',
'Kiriathaim',
'Kirjatharba',
'Kittim',
'Know',
'Kohath',
'Kor',
'Korah',
'LO',
'LORD',
'Laban',
'Lahairoi',
'Lamech',
'Lasha',
'Lay',
'Leah',
'Lehabim',
'Lest',
'Let',
'Letushim',
'Leummim',
'Levi',
'Lie',
'Lift',
'Lo',
'Look',
'Lot',
'Lotan',
'Lud',
'Ludim',
'Luz',
'Maachah',
'Machir',
'Machpelah',
'Madai',
'Magdiel',
'Magog',
'Mahalaleel',
'Mahalath',
'Mahanaim',
'Make',
'Malchiel',
'Male',
'Mam',
'Mamre',
'Man',
'Manahath',
'Manass',
'Manasseh',
'Mash',
'Masrekah',
'Massa',
'Matred',
'Me',
'Medan',
'Mehetabel',
'Mehujael',
'Melchizedek',
'Merari',
'Mesha',
'Meshech',
'Mesopotamia',
'Methusa',
'Methusael',
'Methuselah',
'Mezahab',
'Mibsam',
'Mibzar',
'Midian',
'Midianites',
'Milcah',
'Mishma',
'Mizpah',
'Mizraim',
'Mizz',
'Moab',
'Moabites',
'Moreh',
'Moreover',
'Moriah',
'Muppim',
'My',
'Naamah',
'Naaman',
'Nahath',
'Nahor',
'Naphish',
'Naphtali',
'Naphtuhim',
'Nay',
'Nebajoth',
'Neither',
'Night',
'Nimrod',
'Nineveh',
'Noah',
'Nod',
'Not',
'Now',
'O',
'Obal',
'Of',
'Oh',
'Ohad',
'Omar',
'On',
'Onam',
'Onan',
'Only',
'Ophir',
'Our',
'Out',
'Padan',
'Padanaram',
'Paran',
'Pass',
'Pathrusim',
'Pau',
'Peace',
'Peleg',
'Peniel',
'Penuel',
'Peradventure',
'Perizzit',
'Perizzite',
'Perizzites',
'Phallu',
'Phara',
'Pharaoh',
'Pharez',
'Phichol',
'Philistim',
'Philistines',
'Phut',
'Phuvah',
'Pildash',
'Pinon',
'Pison',
'Potiphar',
'Potipherah',
'Put',
'Raamah',
'Rachel',
'Rameses',
'Rebek',
'Rebekah',
'Rehoboth',
'Remain',
'Rephaims',
'Resen',
'Return',
'Reu',
'Reub',
'Reuben',
'Reuel',
'Reumah',
'Riphath',
'Rosh',
'Sabtah',
'Sabtech',
'Said',
'Salah',
'Salem',
'Samlah',
'Sarah',
'Sarai',
'Saul',
'Save',
'Say',
'Se',
'Seba',
'See',
'Seeing',
'Seir',
'Sell',
'Send',
'Sephar',
'Serah',
'Sered',
'Serug',
'Set',
'Seth',
'Shalem',
'Shall',
'Shalt',
'Shammah',
'Shaul',
'Shaveh',
'She',
'Sheba',
'Shebah',
'Shechem',
'Shed',
'Shel',
'Shelah',
'Sheleph',
'Shem',
'Shemeber',
'Shepho',
'Shillem',
'Shiloh',
'Shimron',
'Shinab',
'Shinar',
'Shobal',
'Should',
'Shuah',
'Shuni',
'Shur',
'Sichem',
'Siddim',
'Sidon',
'Simeon',
'Sinite',
'Sitnah',
'Slay',
'So',
'Sod',
'Sodom',
'Sojourn',
'Some',
'Spake',
'Speak',
'Spirit',
'Stand',
'Succoth',
'Surely',
'Swear',
'Syrian',
'Take',
'Tamar',
'Tarshish',
'Tebah',
'Tell',
'Tema',
'Teman',
'Temani',
'Terah',
'Thahash',
'That',
'The',
'Then',
'There',
'Therefore',
'These',
'They',
'Thirty',
'This',
'Thorns',
'Thou',
'Thus',
'Thy',
'Tidal',
'Timna',
'Timnah',
'Timnath',
'Tiras',
'To',
'Togarmah',
'Tola',
'Tubal',
'Tubalcain',
'Twelve',
'Two',
'Unstable',
'Until',
'Unto',
'Up',
'Upon',
'Ur',
'Uz',
'Uzal',
'We',
'What',
'When',
'Whence',
'Where',
'Whereas',
'Wherefore',
'Which',
'While',
'Who',
'Whose',
'Whoso',
'Why',
'Wilt',
'With',
'Woman',
'Ye',
'Yea',
'Yet',
'Zaavan',
'Zaphnathpaaneah',
'Zar',
'Zarah',
'Zeboiim',
'Zeboim',
'Zebul',
'Zebulun',
'Zemarite',
'Zepho',
'Zerah',
'Zibeon',
'Zidon',
'Zillah',
'Zilpah',
'Zimran',
'Ziphion',
'Zo',
'Zoar',
'Zohar',
'Zuzims',
'a',
'abated',
'abide',
'able',
'abode',
'abomination',
'about',
'above',
'abroad',
'absent',
'abundantly',
'accept',
'accepted',
'according',
'acknowledged',
'activity',
'add',
'adder',
'afar',
'afflict',
'affliction',
'afraid',
'after',
'afterward',
'afterwards',
'aga',
'again',
'against',
'age',
'aileth',
'air',
'al',
'alive',
'all',
'almon',
'alo',
'alone',
'aloud',
'also',
'altar',
'altogether',
'always',
'am',
'among',
'amongst',
'an',
'and',
'angel',
'angels',
'anger',
'angry',
'anguish',
'anointedst',
'anoth',
'another',
'answer',
'answered',
'any',
'anything',
'appe',
'appear',
'appeared',
'appease',
'appoint',
'appointed',
'aprons',
'archer',
'archers',
'are',
'arise',
'ark',
'armed',
'arms',
'army',
'arose',
'arrayed',
'art',
'artificer',
'as',
'ascending',
'ash',
'ashamed',
'ask',
'asked',
'asketh',
'ass',
'assembly',
'asses',
'assigned',
'asswaged',
'at',
'attained',
'audience',
'avenged',
'aw',
'awaked',
'away',
'awoke',
'back',
'backward',
'bad',
'bade',
'badest',
'badne',
'bak',
'bake',
'bakemeats',
'baker',
'bakers',
'balm',
'bands',
'bank',
'bare',
'barr',
'barren',
'basket',
'baskets',
'battle',
'bdellium',
'be',
'bear',
'beari',
'bearing',
'beast',
'beasts',
'beautiful',
'became',
'because',
'become',
'bed',
'been',
'befall',
'befell',
'before',
'began',
'begat',
'beget',
'begettest',
'begin',
'beginning',
'begotten',
'beguiled',
'beheld',
'behind',
'behold',
'being',
'believed',
'belly',
'belong',
'beneath',
'bereaved',
'beside',
'besides',
'besought',
'best',
'betimes',
'better',
'between',
'betwixt',
'beyond',
'binding',
'bird',
'birds',
'birthday',
'birthright',
'biteth',
'bitter',
'blame',
'blameless',
'blasted',
'bless',
'blessed',
'blesseth',
'blessi',
'blessing',
'blessings',
'blindness',
'blood',
'blossoms',
'bodies',
'boldly',
'bondman',
'bondmen',
'bondwoman',
'bone',
'bones',
'book',
'booths',
'border',
'borders',
'born',
'bosom',
'both',
'bottle',
'bou',
'boug',
'bough',
'bought',
'bound',
'bow',
'bowed',
'bowels',
'bowing',
'boys',
'bracelets',
'branches',
'brass',
'bre',
'breach',
'bread',
'breadth',
'break',
'breaketh',
'breaking',
'breasts',
'breath',
'breathed',
'breed',
'brethren',
'brick',
'brimstone',
'bring',
'brink',
'broken',
'*',
'broth',
'brother',
'brought',
'brown',
'bruise',
'budded',
'build',
'builded',
'built',
'bulls',
'bundle',
'bundles',
'burdens',
'buried',
'burn',
'burning',
'burnt',
'bury',
'buryingplace',
'business',
'but',
'butler',
'butlers',
'butlership',
'butter',
'buy',
'by',
'cakes',
'calf',
'call',
'called',
'came',
'camel',
'camels',
'camest',
'can',
'cannot',
'canst',
'captain',
'captive',
'captives',
'carcases',
'carried',
'carry',
'cast',
'castles',
'catt',
'cattle',
'caught',
'cause',
'caused',
'cave',
'cease',
'ceased',
'certain',
'certainly',
'chain',
'chamber',
'change',
'changed',
'changes',
'charge',
'charged',
'chariot',
'chariots',
'chesnut',
'chi',
'chief',
'child',
'childless',
'childr',
'children',
'chode',
'choice',
'chose',
'circumcis',
'circumcise',
'circumcised',
'citi',
'cities',
'city',
'clave',
'clean',
'clear',
'cleave',
'clo',
'closed',
'clothed',
'clothes',
'cloud',
'clusters',
'co',
'coat',
'coats',
'coffin',
'cold',
...]
len(set(text3))
2789
len(text3)/len(set(text3))
16.050197203298673
text3.count('smote')
5
100*text4.count('a')/len(text4)
1.4643016433938312
def lexical_diversity(text):
    # lexical: UK ['leksɪk(ə)l], US ['lɛksɪkl]; adj. relating to vocabulary or to a dictionary
    # diversity: UK [daɪ'vɜːsɪtɪ; dɪ-], US [dɪˈvəsɪti]; n. variety; difference
    # average number of occurrences per distinct token (tokens / types)
    return len(text)/len(set(text))

def percentage(count, total):
    return 100*count/total

print('Lexical diversity of text3: {}'.format(lexical_diversity(text3)))
print('Percentage of text4 made up of the word "a": {}'.format(percentage(text4.count('a'),len(text4))))
Lexical diversity of text3: 16.050197203298673
Percentage of text4 made up of the word "a": 1.4643016433938312
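To see what these two numbers mean, here is a small worked example on a hand-made token list, so the arithmetic can be checked by eye. The list `toy_tokens` is invented; the two functions repeat the definitions used in this post. Note that this ratio is tokens divided by distinct tokens, so a higher value means more repetition; the inverse ratio (distinct tokens divided by tokens) is also shown, under the assumed name `type_token_ratio`.

```python
def lexical_diversity(tokens):
    """Average number of times each distinct token is used (tokens / types)."""
    return len(tokens) / len(set(tokens))


def percentage(count, total):
    """Express count as a percentage of total."""
    return 100 * count / total


# Hypothetical toy text: 10 tokens, 6 distinct tokens
toy_tokens = ['the', 'cat', 'saw', 'the', 'dog', '.', 'the', 'dog', 'ran', '.']

tokens_per_type = lexical_diversity(toy_tokens)           # 10 / 6
type_token_ratio = len(set(toy_tokens)) / len(toy_tokens)  # 6 / 10
the_share = percentage(toy_tokens.count('the'), len(toy_tokens))  # 3 of 10 tokens

print(tokens_per_type, type_token_ratio, the_share)
```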

Lists

sent1 = ['Call', 'me','Ishmael','.']
print('Contents of sent1: {}'.format(sent1))
print('Length of sent1: {}'.format(len(sent1)))
print('Lexical diversity of sent1: {}'.format(lexical_diversity(sent1)))
Contents of sent1: ['Call', 'me', 'Ishmael', '.']
Length of sent1: 4
Lexical diversity of sent1: 1.0
sent1,sent2,sent3,sent4 # these lists are predefined by nltk.book
(['Call', 'me', 'Ishmael', '.'],
['The',
'family',
'of',
'Dashwood',
'had',
'long',
'been',
'settled',
'in',
'Sussex',
'.'],
['In',
'the',
'beginning',
'God',
'created',
'the',
'heaven',
'and',
'the',
'earth',
'.'],
['Fellow',
'-',
'Citizens',
'of',
'the',
'Senate',
'and',
'of',
'the',
'House',
'of',
'Representatives',
':'])
sent4+sent1
['Fellow',
'-',
'Citizens',
'of',
'the',
'Senate',
'and',
'of',
'the',
'House',
'of',
'Representatives',
':',
'Call',
'me',
'Ishmael',
'.']
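sent1.append('Some') # append modifies the list in place and returns None
sent1 # 'Some' appears four times here, which suggests the append above was run four times
['Call', 'me', 'Ishmael', '.', 'Some', 'Some', 'Some', 'Some']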

Indexing lists

type(text4)
nltk.text.Text
text4[173]
'awaken'
text4.index('awaken')
173
text5[16715:16735]
['U86',
'thats',
'why',
'something',
'like',
'gamefly',
'is',
'so',
'good',
'because',
'you',
'can',
'actually',
'play',
'a',
'full',
'game',
'without',
'buying',
'it']
text6[1600:1625]
['We',
"'",
're',
'an',
'anarcho',
'-',
'syndicalist',
'commune',
'.',
'We',
'take',
'it',
'in',
'turns',
'to',
'act',
'as',
'a',
'sort',
'of',
'executive',
'officer',
'for',
'the',
'week']
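The same indexing and slicing rules apply to any Python list, so they can be tried without loading a corpus. The sketch below uses a short hand-typed token list (taken from the start of the text6 slice above) and shows single-item access, `index`, a half-open slice, and a negative slice. The variable names are chosen for this example only.

```python
tokens = ['We', "'", 're', 'an', 'anarcho', '-', 'syndicalist', 'commune', '.']

first = tokens[0]                   # indexing starts at 0
position = tokens.index('commune')  # position of the first occurrence
window = tokens[3:8]                # items 3, 4, 5, 6, 7 (the end index is excluded)
last_two = tokens[-2:]              # negative indices count from the end

print(first, position, window, last_two)
```

Because the end index is excluded, `tokens[m:n]` always contains `n - m` items when both indices are within range.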

Variables

sent1 = ['Call','me','Ishmael','.']
my_sent = ['Bravely','bold','Sir','Robin',',','rode','forth','from','Camelot','.']
noun_phrase = my_sent[1:4]
print('Sliced list, noun_phrase -> {}'.format(noun_phrase))
wOrDs = sorted(noun_phrase)
print('Sorted list, wOrDs -> {}'.format(wOrDs))
Sliced list, noun_phrase -> ['bold', 'Sir', 'Robin']
Sorted list, wOrDs -> ['Robin', 'Sir', 'bold']

Strings

name = 'bright'
print('First letter of name: {}'.format(name[0]))
print(name[:4])
print(name*2)
print(name + '!')
First letter of name: b
brig
brightbright
bright!
' '.join(['Monty', 'Python'])
'Monty Python'
'Monty Python'.split()
['Monty', 'Python']
saying = ['After','all','is','said','and','done','more','is','said','than','done']
tokens = set(saying)
tokens = sorted(tokens)
tokens[-2:]
['said', 'than']
fdist1 = FreqDist(text1)
vocabulary1 = fdist1.keys()
type(vocabulary1)
dict_keys
fdist1.plot(50, cumulative=True)
#Cumulative frequency plot for the 50 most frequently used words in Moby Dick, which
#account for nearly half of the tokens.
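The cumulative plot answers the question "what share of all tokens do the N most frequent words cover?". As a hedged sketch of that calculation without NLTK, the example below uses `collections.Counter` from the standard library (in NLTK 3, `FreqDist` is built on `Counter`). The toy sentence and variable names are made up.

```python
from collections import Counter

# Hypothetical toy text: 8 tokens
toy_tokens = 'the whale and the ship and the sea'.split()

counts = Counter(toy_tokens)
top_two = counts.most_common(2)  # the two most frequent tokens with their counts

# Share of all tokens covered by the two most frequent words
covered = sum(n for _, n in top_two) / len(toy_tokens)

print(top_two, covered)
```

Here 'the' (3 occurrences) and 'and' (2 occurrences) together cover 5 of the 8 tokens.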

fdist1.hapaxes() #the words that occur once only
['Herman',
'Melville',
']',
'ETYMOLOGY',
'Late',
'Consumptive',
'School',
'threadbare',
'lexicons',
'mockingly',
'flags',
'mortality',
'signification',
'HACKLUYT',
'Sw',
'HVAL',
'roundness',
'Dut',
'Ger',
'WALLEN',
'WALW',
'IAN',
'RICHARDSON',
'KETOS',
'GREEK',
'CETUS',
'LATIN',
'WHOEL',
'ANGLO',
'SAXON',
'WAL',
'HWAL',
'SWEDISH',
'ICELANDIC',
'BALEINE',
'BALLENA',
'FEGEE',
'ERROMANGOAN',
'Librarian',
'painstaking',
'burrower',
'grub',
'Vaticans',
'stalls',
'higgledy',
'piggledy',
'gospel',
'promiscuously',
'commentator',
'belongest',
'sallow',
'Pale',
'Sherry',
'loves',
'bluntly',
'Subs',
'thankless',
'Hampton',
'Court',
'hie',
'refugees',
'pampered',
'Michael',
'Raphael',
'unsplinterable',
'GENESIS',
'JOB',
'JONAH',
'punish',
'ISAIAH',
'soever',
'cometh',
'incontinently',
'perisheth',
'PLUTARCH',
'MORALS',
'breedeth',
'Whirlpooles',
'Balaene',
'arpens',
'PLINY',
'Scarcely',
'TOOKE',
'LUCIAN',
'TRUE',
'catched',
'OCTHER',
'VERBAL',
'TAKEN',
'MOUTH',
'ALFRED',
'890',
'gudgeon',
'retires',
'MONTAIGNE',
'APOLOGY',
'RAIMOND',
'SEBOND',
'Nick',
'RABELAIS',
'cartloads',
'STOWE',
'ANNALS',
'LORD',
'BACON',
'Touching',
'ork',
'DEATH',
'sovereignest',
'bruise',
'HAMLET',
'leach',
'Mote',
'availle',
'returne',
'againe',
'worker',
'Dinting',
'paine',
'thro',
'maine',
'FAERIE',
'Immense',
'til',
'DAVENANT',
'PREFACE',
'GONDIBERT',
'spermacetti',
'Hosmannus',
'Nescio',
'VIDE',
'Spencer',
'Talus',
'flail',
'threatens',
'jav',
'lins',
'WALLER',
'SUMMER',
'ISLANDS',
'Commonwealth',
'Civitas',
'OPENING',
'SENTENCE',
'HOBBES',
'LEVIATHAN',
'Silly',
'Mansoul',
'chewing',
'sprat',
'PILGRIM',
'PROGRESS',
'Created',
'PARADISE',
'LOST',
'---"',
'Hugest',
'Stretched',
'Draws',
'FULLLER',
'PROFANE',
'HOLY',
'STATE',
'DRYDEN',
'ANNUS',
'MIRABILIS',
'aground',
'EDGE',
'TEN',
'SPITZBERGEN',
'PURCHAS',
'wantonness',
'fuzzing',
'vents',
'HERBERT',
'INTO',
'ASIA',
'AFRICA',
'SCHOUTEN',
'SIXTH',
'CIRCUMNAVIGATION',
'Elbe',
'ducat',
'herrings',
'GREENLAND',
'Several',
'Fife',
'Anno',
'1652',
'Pitferren',
'SIBBALD',
'FIFE',
'KINROSS',
'Myself',
'Sperma',
'ceti',
'fierceness',
'RICHARD',
'STRAFFORD',
'LETTER',
'BERMUDAS',
'PHIL',
'TRANS',
'1668',
'PRIMER',
'COWLEY',
'1729',
'"...',
'frequendy',
'insupportable',
'disorder',
'ULLOA',
'SOUTH',
'AMERICA',
'sylphs',
'petticoat',
'Oft',
'Tho',
'RAPE',
'LOCK',
'NAT',
'wales',
'JOHNSON',
'COOK',
'dung',
'lime',
'juniper',
'UNO',
'VON',
'TROIL',
'LETTERS',
'BANKS',
'SOLANDER',
'1772',
'Nantuckois',
'JEFFERSON',
'MEMORIAL',
'MINISTER',
'REFERENCE',
'PARLIAMENT',
'SOMEWHERE',
'guarding',
'protecting',
'robbers',
'BLACKSTONE',
'Rodmond',
'suspends',
'attends',
'FALCONER',
'Bright',
'roofs',
'domes',
'rockets',
'Around',
'unwieldy',
'COWPER',
'VISIT',
'LONDON',
'HUNTER',
'DISSECTION',
'SMALL',
'SIZED',
'aorta',
'gushing',
'PALEY',
'THEOLOGY',
'mammiferous',
'hind',
'BARON',
'CUVIER',
'COLNETT',
'PURPOSE',
'EXTENDING',
'SPERMACETI',
'Floundered',
'chace',
'peopling',
'Gather',
'Led',
'instincts',
'trackless',
'Assaulted',
'voracious',
'spiral',
'MONTGOMERY',
'WORLD',
'FLOOD',
'Paean',
'fatter',
'Flounders',
'CHARLES',
'LAMB',
'TRIUMPH',
'1690',
'OBED',
'Susan',
'HAWTHORNE',
'TWICE',
'bespeak',
'raal',
'COOPER',
'PILOT',
'Berlin',
'Gazette',
'ECKERMANN',
'CONVERSATIONS',
'GOETHE',
'ESSEX',
'WAS',
'ATTACKED',
'FINALLY',
'DESTROYED',
'OWEN',
'CHACE',
'FIRST',
'SAID',
'VESSEL',
'YORK',
'1821',
'piping',
'dimmed',
'phospher',
'ELIZABETH',
'OAKES',
'SMITH',
'amounted',
'440',
'SCORESBY',
'Mad',
'agonies',
'endures',
'infuriated',
'rears',
'snaps',
'propelled',
'observers',
'opportunities',
'habitudes',
'BEALE',
'offensively',
'artful',
'mischievous',
'FREDERICK',
'DEBELL',
'1840',
'October',
'Raise',
'ay',
'THAR',
'bowes',
'os',
'ROSS',
'ETCHINGS',
'CRUIZE',
'1846',
'Globe',
'transactions',
'relate',
'HUSSEY',
'SURVIVORS',
'parried',
'MISSIONARY',
'JOURNAL',
'TYERMAN',
'boldest',
'persevering',
'REPORT',
'DANIEL',
'SPEECH',
'SENATE',
'APPLICATION',
'ERECTION',
'BREAKWATER',
'CAPTORS',
'WHALEMAN',
'ADVENTURES',
'BIOGRAPHY',
'GATHERED',
'HOMEWARD',
'COMMODORE',
'PREBLE',
'REV',
'CHEEVER',
'MUTINEER',
'BROTHER',
'ANOTHER',
'MCCULLOCH',
'COMMERCIAL',
'reciprocal',
'clews',
'SOMETHING',
'UNPUBLISHED',
'CURRENTS',
'Pedestrians',
'recollect',
'gateways',
'VOYAGER',
'ARCTIC',
'NEWSPAPER',
'TAKING',
'RETAKING',
'HOBOMACK',
'MIRIAM',
'FISHERMAN',
'appliance',
'RIBS',
'TRUCKS',
'Terra',
'Del',
'Fuego',
'DARWIN',
'NATURALIST',
";--'",
'!\'"',
'WHARTON',
'Loomings',
'spleen',
'regulating',
'circulation',
'Whenever',
'drizzly',
'hypos',
'philosophical',
'Cato',
'Manhattoes',
'reefs',
'downtown',
'gazers',
'Circumambulate',
'Corlears',
'Coenties',
'Slip',
'Whitehall',
'Posted',
'sentinels',
'spiles',
'pier',
'lath',
'counters',
'desks',
'loitering',
'shady',
'Inlanders',
'lanes',
'alleys',
'attract',
'dale',
'dreamiest',
'shadiest',
'quietest',
'enchanting',
'Saco',
'crucifix',
'Deep',
'mazy',
'Tiger',
'Tennessee',
'Rockaway',
'Persians',
'deity',
'Narcissus',
'ungraspable',
'hazy',
'quarrelsome',
'offices',
'abominate',
'toils',
'trials',
'barques',
'schooners',
'broiling',
'buttered',
'judgmatically',
'peppered',
'reverentially',
'idolatrous',
'dotings',
'ibis',
'roasted',
'bake',
'plumb',
'Van',
'Rensselaers',
'Randolphs',
'Hardicanutes',
'lording',
'tallest',
'decoction',
'Seneca',
'Stoics',
'Testament',
'promptly',
'rub',
'infliction',
'BEING',
'PAID',
'urbane',
'ills',
'monied',
'consign',
'prevalent',
'violate',
'Pythagorean',
'commonalty',
'police',
'surveillance',
'programme',
'solo',
'CONTESTED',
'ELECTION',
'PRESIDENCY',
'UNITED',
'STATES',
'ISHMAEL',
'BLOODY',
'AFFGHANISTAN',
'managers',
'genteel',
'comedies',
'farces',
'cunningly',
'disguises',
'cajoling',
'unbiased',
'freewill',
'discriminating',
'overwhelming',
'undeliverable',
'itch',
'forbidden',
'ignoring',
'lodges',
'Carpet',
'Bag',
'Manhatto',
'candidates',
'penalties',
'Tyre',
'Carthage',
'imported',
'cobblestones',
'bitingly',
'shouldering',
'price',
'fervent',
'asphaltic',
'pavement',
'flinty',
'projections',
'soles',
'Too',
'cheapest',
'cheeriest',
'invitingly',
'particles',
'peer',
'Angel',
'Doom',
'wailing',
'gnashing',
'Wretched',
'entertainment',
'Moving',
'emigrant',
'poverty',
'creak',
'lodgings',
'zephyr',
'hob',
'toasting',
'observest',
'sashless',
'glazier',
'reasonest',
'chinks',
'crannies',
'lint',
'chattering',
'shiverings',
'cob',
'redder',
'Orion',
'glitters',
'conservatories',
'president',
'temperance',
'blubbering',
'straggling',
'wainscots',
'reminding',
'oilpainting',
'besmoked',
'defaced',
'unequal',
'crosslights',
'hags',
'delineate',
'bewitched',
'ponderings',
'boggy',
'soggy',
'squitchy',
'froze',
'heath',
'icebound',
'represents',
'Horner',
'foundered',
'clubs',
'harvesting',
'hacking',
'horrifying',
'Mixed',
'Nathan',
'Swain',
'corkscrew',
'Blanco',
'sojourning',
'fireplaces',
'duskier',
'cockpits',
'rarities',
'Projecting',
'Within',
'shelves',
'flasks',
'bustles',
'deliriums',
'Abominable',
'tumblers',
'cylinders',
'goggling',
'deceitfully',
'tapered',
'Parallel',
'pecked',
'footpads',
'Fill',
'shilling',
'examining',
'SKRIMSHANDER',
'accommodated',
'unoccupied',
'haint',
'pose',
'whalin',
'decidedly',
'objectionable',
'wander',
'Battery',
'ruminating',
'adorning',
'potatoes',
'sartainty',
'diabolically',
'steaks',
'undress',
'looker',
'rioting',
'Grampus',
'seed',
'Feegees',
'tramping',
'Enveloped',
'bedarned',
'eruption',
'officiating',
'brimmers',
'complained',
'potion',
'colds',
'catarrhs',
'liquor',
'arrantest',
'topers',
'obstreperously',
'aloof',
'desirous',
'hilarity',
'coffer',
'Southerner',
'mountaineers',
'Alleghanian',
'missed',
'supernaturally',
'congratulate',
'multiply',
'bachelor',
'abominated',
'tidiest',
'bedwards',
'shan',
'tablecloth',
'Skrimshander',
'bump',
'spraining',
'eider',
'yoking',
'rickety',
'whirlwinds',
'knockings',
'dismissed',
'popped',
'cherishing',
'chuckled',
'chuckle',
'mightily',
'catches',
'bamboozingly',
'overstocked',
'toothpick',
'rayther',
'BROWN',
'slanderin',
'farrago',
'BROKE',
'Sartain',
'Mt',
'Hecla',
'persist',
'mystifying',
'unsay',
'criminal',
'Wall',
'purty',
'sarmon',
'rips',
'tellin',
'bought',
'balmed',
'curios',
'sellin',
'inions',
'fooling',
'idolators',
'Depend',
'reg',
'lar',
'spliced',
'Johnny',
'sprawling',
'Arter',
'glim',
'jiffy',
'irresolute',
'vum',
'WON',
'Folding',
'scrutiny',
'porcupine',
'moccasin',
'ponchos',
'parade',
'rainy',
'remembering',
'commended',
'cobs',
'Nod',
'footfall',
'unlacing',
'blackish',
'plasters',
'inkling',
'Placing',
'crammed',
'scalp',
'mildewed',
'Ignorance',
'parent',
'nonplussed',
'undressing',
'checkered',
'Thirty',
'frogs',
'quaked',
'wrapall',
'dreadnaught',
'fumbled',
'Remembering',
'manikin',
'tenpin',
'andirons',
'jambs',
'bricks',
'appropriate',
'applying',
'hastier',
'withdrawals',
'antics',
'devotee',
'extinguishing',
'unceremoniously',
'bagged',
'sportsman',
'woodcock',
'uncomfortableness',
'deliberating',
'puffed',
'sang',
'Stammering',
'conjured',
'responses',
'debel',
'flourishing',
'Angels',
'flourishings',
'peddlin',
'sleepe',
'grunted',
'gettee',
'motioning',
'comely',
'insured',
'Counterpane',
'parti',
'triangles',
'interminable',
'caper',
'supperless',
'21st',
'hemisphere',
'sigh',
'Sixteen',
'ached',
'coaches',
'stockinged',
'slippering',
'misbehaviour',
'unendurable',
'stepmothers',
'misfortunes',
'steeped',
'shudderingly',
'confounding',
'soberly',
'recurred',
'predicament',
'unlock',
'bridegroom',
'clasp',
'hugged',
'rouse',
'snore',
'scratch',
'Throwing',
'expostulations',
'unbecomingness',
'matrimonial',
'dawning',
'overture',
'innate',
'compliment',
'civility',
'rudeness',
'toilette',
'dressing',
'donning',
'gaspings',
'booting',
'caterpillar',
'outlandishness',
'manners',
'education',
'undergraduate',
'dreamt',
'cowhide',
'pinched',
'curtains',
'indecorous',
'contented',
'restricting',
'donned',
'lathering',
'unsheathes',
'whets',
'Rogers',
'cutlery',
'Afterwards',
'baton',
'Breakfast',
'pleasantly',
'bountifully',
'laughable',
'bosky',
'unshorn',
'gowns',
'toasted',
'lingers',
'tarried',
'barred',
'Grub',
'Park',
'assurance',
'polish',
'occasioned',
'embarrassed',
'bashfulness',
'duelled',
'winking',
'tastes',
'sheepishly',
'bashful',
'icicle',
'admirer',
'cordially',
'grappling',
'genteelly',
'eschewed',
'undivided',
'6',
'circulating',
'nondescripts',
'Chestnut',
'jostle',
'Regent',
'Lascars',
'Bombay',
'Apollo',
'Feegeeans',
'Tongatobooarrs',
'Erromanggoans',
'Pannangians',
'Brighggians',
'weekly',
'Vermonters',
'stalwart',
'frames',
'felled',
'strutting',
'wester',
'bombazine',
'cloak',
'mow',
'gloves',
'joins',
'outfit',
'waistcoats',
'Hay',
'Seed',
'tract',
'dearest',
'pave',
'eggs',
'patrician',
'parks',
'scraggy',
'scoria',
'Herr',
'dowers',
'nieces',
'reservoirs',
'maples',
'bountiful',
'proffer',
'passer',
'cones',
'blossoms',
'superinduced',
'carnation',
'Salem',
'sweethearts',
'Puritanic',
'Whaleman',
'Wrapping',
'Each',
'quote',
'TALBOT',
'Near',
'Desolation',
'1st',
'SISTER',
'ROBERT',
'WILLIS',
'ELLERY',
'NATHAN',
'COLEMAN',
'WALTER',
'CANNY',
'SETH',
'GLEIG',
'Forming',
'ELIZA',
'31st',
'MARBLE',
'SHIPMATES',
'EZEKIEL',
'HARDY',
'AUGUST',
'3d',
'1833',
'WIDOW',
'Shaking',
'glazed',
'Affected',
'relatives',
'unhealing',
'sympathetically',
'wounds',
'bleed',
'blanks',
...]
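A hapax is simply a word whose count is exactly 1, so the idea can be sketched with a standard-library `Counter`. The function below is a stand-in written for this example, not NLTK's implementation, and the toy sentence is invented.

```python
from collections import Counter


def hapaxes(tokens):
    """Return the tokens that occur exactly once, in order of first appearance."""
    counts = Counter(tokens)
    return [word for word, n in counts.items() if n == 1]


# Hypothetical toy text
toy_tokens = 'the whale and the ship and the sea'.split()
once_only = hapaxes(toy_tokens)
print(once_only)
```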

Fine-grained selection of words

  1. The set of all w such that w is an element of V (the vocabulary) and w has property P:

    \(\{w \mid w \in V \text{ and } P(w)\}\)
  2. The corresponding Python expression is:

    [w for w in V if p(w)]
V = set(text1)
long_words = [w for w in V if len(w)>15]
sorted(long_words)
['CIRCUMNAVIGATION',
'Physiognomically',
'apprehensiveness',
'cannibalistically',
'characteristically',
'circumnavigating',
'circumnavigation',
'circumnavigations',
'comprehensiveness',
'hermaphroditical',
'indiscriminately',
'indispensableness',
'irresistibleness',
'physiognomically',
'preternaturalness',
'responsibilities',
'simultaneousness',
'subterraneousness',
'supernaturalness',
'superstitiousness',
'uncomfortableness',
'uncompromisedness',
'undiscriminating',
'uninterpenetratingly']
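The condition in the comprehension can combine several properties. As an illustrative sketch, assuming we want words that are both long and used more than once, the example below pairs the length test with a frequency test from a `Counter`. The toy text and the thresholds are invented for the example.

```python
from collections import Counter

# Hypothetical toy text
toy_tokens = ('circumnavigation is long ; circumnavigation is rare ; '
              'uncomfortableness is long').split()

counts = Counter(toy_tokens)
V = set(toy_tokens)  # the vocabulary: each distinct token once

# P(w): w has more than 15 characters
long_words = sorted(w for w in V if len(w) > 15)

# P(w): w has more than 15 characters and occurs more than once
long_and_repeated = sorted(w for w in V if len(w) > 15 and counts[w] > 1)

print(long_words)
print(long_and_repeated)
```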

This post is drawn from "Natural Language Processing with Python".
