Applying a function to a pandas DataFrame — POS tagger computation time

Time: 2022-11-29 21:27:26

I'm very confused about the apply function in pandas. I have a big dataframe where one column is a column of strings. I'm then using a function to count part-of-speech occurrences. I'm just not sure how to set up my apply statement or my function.


def noun_count(row):
    # tag the whitespace-split tokens of the string at this row index
    tagged = tagger(df['string'][row].split())
    # flatten the output and filter out all but nouns, then sum them
    num = sum(tag.startswith('NN') for word, tag in tagged)
    return num

So basically I have a function similar to the above, where I use a POS tagger on a column and output a single number (the noun count). I may rewrite it later to output counts for several different parts of speech, but I can't wrap my head around apply.


I'm pretty sure I don't really have either part arranged correctly. For instance, I can run noun_count(row) and get the correct value for any index, but I can't figure out how to make it work with apply as I have it set up. Basically I don't know how to pass the row value to the function within the apply statement.


df['num_nouns'] = df.apply(noun_count(??),1)
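The pattern that usually resolves this confusion is to apply the function to the Series itself, so the function receives each cell's string directly rather than an index into the dataframe. A minimal runnable sketch, using a hypothetical dictionary-based stand-in for the tagger (the real code would call the POS tagger here instead):

```python
import pandas as pd

# toy stand-in for a real POS tagger: maps known words to tags (assumption)
TOY_TAGS = {'cat': 'NN', 'cats': 'NNS', 'two': 'CD'}

def noun_count(text):
    # receives the cell's string directly -- no DataFrame indexing needed
    tags = [TOY_TAGS.get(tok, 'UNK') for tok in text.split()]
    return sum(tag in ('NN', 'NNS', 'NNP', 'NNPS') for tag in tags)

df = pd.DataFrame({'string': ['cat', 'two cats']})
# Series.apply passes each string to noun_count one at a time
df['num_nouns'] = df['string'].apply(noun_count)
# num_nouns ends up as [1, 1]: 'cat' is NN, 'cats' is NNS
```

Because the function takes the string itself, it no longer needs to reach back into `df` by index, and no axis argument is required.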

Sorry this question is all over the place. So what can I do to get a simple result like


       string  num_nouns
0       'cat'          1
1  'two cats'          1

EDIT: So I've managed to get something working by using list comprehension (someone posted an answer, but they've deleted it).


df['num_nouns'] = df['string'].apply(tagger_nouns)

which required an adjustment to my function:


from collections import Counter

def tagger_nouns(x):
    # tag the whitespace-split tokens of one string
    list_of_lists = st.tag(x.split())
    # flatten the nested output into a single list of (word, tag) pairs
    flattened = [y for z in list_of_lists for y in z]
    parts_of_speech = [pair[1] for pair in flattened]
    c = Counter(parts_of_speech)
    nouns = c['NN'] + c['NNS'] + c['NNP'] + c['NNPS']
    return nouns

I'm using the Stanford tagger with the left3words model, but I have a big problem with computation time. I'm noticing that it calls the .jar file again and again (Java keeps opening and closing in the task manager); maybe that's unavoidable, but it's really taking far too long to run. Is there any way I can speed it up?
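One common fix for this is batching: instead of launching the JVM once per row via apply, tag all rows in a single call (NLTK's Stanford wrapper exposes a `tag_sents` method for exactly this), then assign the per-row counts afterward. A hedged sketch of the pattern, using a runnable stand-in class since the real jar isn't shown here:

```python
from collections import Counter
import pandas as pd

class StubTagger:
    """Stand-in for the Stanford tagger (assumption: the real code uses
    nltk.tag.stanford.StanfordPOSTagger, whose tag_sents method tags a
    whole batch of token lists in one JVM invocation)."""
    TAGS = {'cat': 'NN', 'cats': 'NNS', 'two': 'CD'}

    def tag_sents(self, sentences):
        return [[(tok, self.TAGS.get(tok, 'UNK')) for tok in sent]
                for sent in sentences]

def count_nouns(tagged_sentence):
    # tally tags for one sentence and sum the noun categories
    c = Counter(tag for _, tag in tagged_sentence)
    return c['NN'] + c['NNS'] + c['NNP'] + c['NNPS']

st = StubTagger()
df = pd.DataFrame({'string': ['cat', 'two cats']})

# one batched call instead of one tagger launch per row
tagged = st.tag_sents(df['string'].str.split().tolist())
df['num_nouns'] = [count_nouns(sent) for sent in tagged]
```

With the real `StanfordPOSTagger`, this turns N separate Java startups into one, which is usually where nearly all of the per-row time goes.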


1 solution

#1


I don't know what 'tagger' is but here's a simple example with a word count that ought to work more or less the same way:


f = lambda x: len(x.split())

df['num_words'] = df['string'].apply(f)

       string  num_words
0       'cat'          1
1  'two cats'          2
