pandas Series.apply does not work with strings

This may be related to a Japanese-language issue, so I also asked on the Japanese Stack Overflow.

When I pass a plain string object, it works fine.

I tried encoding the text, but I couldn't find the cause of this error. Could you please give me some advice?

MeCab is an open source text segmentation library for use with text written in the Japanese language originally developed by the Nara Institute of Science and Technology and currently maintained by Taku Kudou (工藤拓) as part of his work on the Google Japanese Input project. https://en.wikipedia.org/wiki/MeCab

sample.csv

0,今日も夜まで働きました。
1,オフィスには誰もいませんが、エラーと格闘中
2,デバッグばかりしていますが、どうにもなりません。

This is the pandas Python 3 code:

import pandas as pd
import MeCab  
# https://en.wikipedia.org/wiki/MeCab
from tqdm import tqdm_notebook as tqdm
# This is working...
df = pd.read_csv('sample.csv', encoding='utf-8')

m = MeCab.Tagger("-Ochasen")

text = "りんごを食べました、そして、みかんも食べました"
a = m.parse(text)

print(a)  # working!

# But I want to use Pandas's Series



def extractKeyword(text):
    """Morphological analysis of text and returning a list of only nouns"""
    tagger = MeCab.Tagger('-Ochasen')
    node = tagger.parseToNode(text)
    keywords = []
    while node:
        if node.feature.split(",")[0] == u"名詞": # this means noun
            keywords.append(node.surface)
        node = node.next
    return keywords



aa = extractKeyword(text) #working!!

me = df.apply(lambda x: extractKeyword(x))

#TypeError: ("in method 'Tagger_parseToNode', argument 2 of type 'char const *'", 'occurred at index 0')

This is the output and the error traceback:

りんご リンゴ りんご 名詞-一般       
を   ヲ   を   助詞-格助詞-一般       
食べ  タベ  食べる 動詞-自立   一段  連用形
まし  マシ  ます  助動詞 特殊・マス   連用形
た   タ   た   助動詞 特殊・タ    基本形
、   、   、   記号-読点       
そして ソシテ そして 接続詞     
、   、   、   記号-読点       
みかん ミカン みかん 名詞-一般       
も   モ   も   助詞-係助詞      
食べ  タベ  食べる 動詞-自立   一段  連用形
まし  マシ  ます  助動詞 特殊・マス   連用形
た   タ   た   助動詞 特殊・タ    基本形
EOS

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-174-81a0d5d62dc4> in <module>()
    32 aa = extractKeyword(text) #working!!
    33 
---> 34 me = df.apply(lambda x: extractKeyword(x))

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
4260                         f, axis,
4261                         reduce=reduce,
-> 4262                         ignore_failures=ignore_failures)
4263             else:
4264                 return self._apply_broadcast(f, axis)

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in _apply_standard(self, func, axis, ignore_failures, reduce)
4356             try:
4357                 for i, v in enumerate(series_gen):
-> 4358                     results[i] = func(v)
4359                     keys.append(v.name)
4360             except Exception as e:

<ipython-input-174-81a0d5d62dc4> in <lambda>(x)
    32 aa = extractKeyword(text) #working!!
    33 
---> 34 me = df.apply(lambda x: extractKeyword(x))

<ipython-input-174-81a0d5d62dc4> in extractKeyword(text)
    20     """Morphological analysis of text and returning a list of only nouns"""
    21     tagger = MeCab.Tagger('-Ochasen')
---> 22     node = tagger.parseToNode(text)
    23     keywords = []
    24     while node:

~/anaconda3/lib/python3.6/site-packages/MeCab.py in parseToNode(self, *args)
    280     __repr__ = _swig_repr
    281     def parse(self, *args): return _MeCab.Tagger_parse(self, *args)
--> 282     def parseToNode(self, *args): return _MeCab.Tagger_parseToNode(self, *args)
    283     def parseNBest(self, *args): return _MeCab.Tagger_parseNBest(self, *args)
    284     def parseNBestInit(self, *args): return _MeCab.Tagger_parseNBestInit(self, *args)

TypeError: ("in method 'Tagger_parseToNode', argument 2 of type 'char const *'", 'occurred at index 0')

2 Answers

#1

I see you got some help on the Japanese Stack Overflow, but here's an answer in English:

The first thing to fix is that read_csv was treating the first line of your sample.csv as the header. To fix that, use the names argument of read_csv.
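
To see the header behaviour in isolation, here is a minimal sketch. It assumes the three sample.csv rows are inlined through io.StringIO so that it runs without the file; the variable names with_default and with_names are illustrative only.

```python
import io

import pandas as pd

# The three rows of sample.csv, inlined so the sketch needs no file on disk.
csv_text = (
    "0,今日も夜まで働きました。\n"
    "1,オフィスには誰もいませんが、エラーと格闘中\n"
    "2,デバッグばかりしていますが、どうにもなりません。\n"
)

# Default: the first row is consumed as the header, so one sentence is lost.
with_default = pd.read_csv(io.StringIO(csv_text))
print(list(with_default.columns))
print(len(with_default))

# With names=..., no row is treated as a header and all three rows are data.
with_names = pd.read_csv(io.StringIO(csv_text), names=["Number", "String"])
print(list(with_names.columns))
print(len(with_names))
```

Passing names also gives the text column a stable label ('String' here), which the rest of the answer relies on.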

Next, df.apply by default applies the function to each column of the DataFrame, so extractKeyword receives a whole Series rather than a string. You could try something like df.apply(lambda x: extractKeyword(x['String']), axis=1), but this won't work because each sentence has a different number of nouns, and pandas will complain that it cannot stack a 1x2 array on top of a 1x5 array. The simplest way is to apply the function to the String Series.
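
The difference between the two apply calls can be checked without MeCab. The sketch below uses a hypothetical stand-in function, describe, that only reports the type of the argument it receives; the small DataFrame is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Number": [0, 1],
    "String": ["今日も夜まで働きました。", "デバッグばかりしています。"],
})


def describe(value):
    """Hypothetical stand-in for extractKeyword: report the argument's type."""
    return type(value).__name__


# DataFrame.apply (axis=0 by default) calls the function once per column
# and passes each whole column as a Series.
per_column = df.apply(describe)

# Series.apply calls the function once per element and passes a plain str.
per_element = df["String"].apply(describe)

print(per_column.tolist())
print(per_element.tolist())
```

This is why MeCab's parseToNode, which expects a string, raises a TypeError when reached through df.apply but not through df['String'].apply.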

The final problem is a bug in the MeCab Python 3 bindings: see https://github.com/SamuraiT/mecab-python3/issues/3. You found a workaround by calling parseToNode twice; you can also call parse before parseToNode.

Putting these three things together:

import pandas as pd
import MeCab  
df = pd.read_csv('sample.csv', encoding='utf-8', names=['Number', 'String'])

def extractKeyword(text):
    """Morphological analysis of text and returning a list of only nouns"""
    tagger = MeCab.Tagger('-Ochasen')
    tagger.parse(text)
    node = tagger.parseToNode(text)
    keywords = []
    while node:
        if node.feature.split(",")[0] == u"名詞": # this means noun
            keywords.append(node.surface)
        node = node.next
    return keywords

me = df['String'].apply(extractKeyword)
print(me)

When you run this script with the sample.csv you provided, you get:

➜  python3 demo.py
0                  [今日, 夜]
1    [オフィス, 誰, エラー, 格闘, 中]
2                   [デバッグ]
Name: String, dtype: object

#2

parseToNode failed every time, so I needed to put this code

 tagger.parseToNode('dummy')

before

 node = tagger.parseToNode(text)

and it worked!

But I don't know the reason; maybe the parseToNode method has a bug.

def extractKeyword(text):
    """Morphological analysis of text and returning a list of only nouns"""
    tagger = MeCab.Tagger('-Ochasen')
    tagger.parseToNode('ダミー')  # dummy call to work around the parseToNode failure
    node = tagger.parseToNode(text)
    keywords = []
    while node:
        if node.feature.split(",")[0] == u"名詞": # this means noun
            keywords.append(node.surface)
        node = node.next
    return keywords
