Apply a function to values from two data frames simultaneously to produce a third data frame

Time: 2021-07-26 12:21:48

Apologies if this turns out to be a very specific problem that may not generalise to the problems of others.

Background

I hope to do some sentiment analysis, starting from the basic binary matching of words from a lexicon, and then moving towards some more complex form of Sentiment Analysis, making use of grammatical rules, etc.

Problem

To do some binary matching, which will form the first phase of Sentiment Analysis, I have two tables: one containing words, and the other containing the Parts-Of-Speech tags for these words.

    V1     V2        V3          V4   V5
1    R     is fantastic    language <NA>
2 Java     is       far        from good
3 Data mining        is fascinating <NA>


   V1  V2  V3 V4   V5
1  NN VBZ  JJ NN <NA>
2 NNP VBZ  RB IN   JJ
3 NNP  NN VBZ JJ <NA>

I would like to carry out some basic Sentiment Analysis as follows: I want to apply a function that takes two arguments, a word (from the first data frame) and its corresponding POS tag (from the second), to determine which word list to use when deciding the positive/negative orientation of that word. For example, the word fantastic would be extracted along with the POS tag 'JJ', so only the list of adjectives would be checked for the presence or absence of this word.
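
As a rough illustration of that idea, the tag can be used to pick which lexicon to consult before the word itself is looked up. This is only a sketch: the `lexicons` list and the `pickLexicon` helper are hypothetical names, the word lists are the small ones used in the code further down, and grouping tags by their first two letters (so that NN, NNP and NNS share one list) is an assumption made for brevity.

```r
# Hypothetical lookup table: one positive and one negative word list per tag family
lexicons <- list(
  JJ = list(pos = c("fantastic", "fascinating", "good"), neg = c("bad", "poor")),
  NN = list(pos = "R", neg = "Java")
)

# Hypothetical helper: map a POS tag to its lexicon via the first two letters.
# Returns NULL for tag families that have no lexicon (e.g. "VBZ").
pickLexicon <- function(tag) {
  lexicons[[substr(tag, 1, 2)]]
}

lex <- pickLexicon("JJ")
"fantastic" %in% lex$pos   # TRUE: only the adjective list was inspected
```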

Eventually, I would like to end up with a data frame that shows the result of matching:

   V1  V2  V3 V4   V5
1  0   0   1   0   <NA>
2  0   0  -1   0   1
3  0   0   0   1   <NA>

I tried formulating my own code, but kept getting an error, after which I felt this was not going to work.

#test sentences
sentences<- as.list(c("R is fantastic language", "Java is far from good", "Data mining is fascinating"))

#using the OpenNLP package
require(openNLP)

#perform tagging
taggedSentences<- tagPOS(sentences)

#split to words
individualWords<- unname(sapply(taggedSentences, function(x){strsplit(x,split=" ")}))

#Strip Tags
individualWordsClean<- unname(sapply(individualWords, function(x){gsub("/.+","",x)}))

#Strip words
individualTags<- unname(sapply(individualWords, function(x){gsub(".+/","",x)}))

#create a dataframe for words; courtesy @trinker
numberRow<- length(individualWords)
numberCol<- unname(sapply(individualWords, length))
df1<- as.data.frame(matrix(nrow=numberRow, ncol=max(numberCol)))
for (i in 1:numberRow){
df1[i,1:numberCol[i]]<- individualWordsClean [[i]]
}


#create a dataframe for tags; courtesy @trinker
numberRow<- length(individualWords)
numberCol<- unname(sapply(individualTags, length))
df2<- as.data.frame(matrix(nrow=numberRow, ncol=max(numberCol)))
for (i in 1:numberRow){
df2[i,1:numberCol[i]]<- individualTags [[i]]
}

#Create negative/positive words' lists
posAdj<- c("fantastic","fascinating","good")
negAdj<- c("bad","poor")
posNoun<- "R"
negNoun<- "Java"

#Function to match words and assign sentiment score
checkLexicon<- function(word,tag){
if (grep("JJ|JJR|JJS",tag)){
ifelse(word %in% posAdj, +1, ifelse(word  %in% negAdj, -1, 0))
}
else if(grep("NN|NNP|NNPS|NNS",tag)){
ifelse(word %in% posNoun, +1, ifelse(word %in% negNoun, -1, 0))
}
else if(grep("VBZ",tag)){
ifelse(word %in% "is","ok","none")
}
else if(grep("RB",tag)){
ifelse(word %in% "not",-1,0)
}
else if(grep("IN",tag)){
ifelse(word %in% "far",-1,0)
}
}

#Method to output a single value when used in conjunction with apply
justShow<- function(x){
    x
    }

#Main method that intends to extract word/POS tag pair, and determine sentiment score
mapply(FUN=checkLexicon, word=apply(df1,2,justShow),tag=apply(df2,2,justShow))

Unfortunately, I have had no success with this method, and the error received is:

Error in if (grep("JJ|JJR|JJS", tag)) { : argument is of length zero

I am relatively new to R, but it seems that I am unable to use the apply function here, as it returns no argument to the mapply function. Also, I am not sure if mapply will actually produce another data frame.
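
On both doubts, a small stand-alone check may help. It suggests the error comes from `grep` rather than from `apply`, and that `mapply` returns a flat vector (or list) which can be reshaped afterwards. The toy matrices `m1` and `m2` are made up for the demonstration and are not part of the original data.

```r
tag <- "NN"

# grep() returns the positions of matching elements; with no match that is
# integer(0), and if (integer(0)) fails with "argument is of length zero".
noMatch <- grep("JJ|JJR|JJS", tag)
length(noMatch)               # 0

# grepl() returns one TRUE/FALSE per input element, so it is safe inside if().
grepl("JJ|JJR|JJS", tag)      # FALSE

# mapply() does not return a data frame. With matrix inputs it walks the
# elements in column-major order, so the flat result can be reshaped.
m1 <- matrix(1:6, nrow = 2)
m2 <- matrix(10 * (1:6), nrow = 2)
flat <- mapply(function(a, b) a + b, m1, m2)
reshaped <- matrix(flat, nrow = nrow(m1))
```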

Please do criticise/advise. Thanks

PS. Link to TRinker's notes on R for those interested.

1 Solution

The mistake was using grep where grepl was needed. This was corrected after Joran pointed it out. The working function is as follows.

>df1

    V1     V2        V3          V4   V5
1    R     is fantastic    language <NA>
2 Java     is       far        from good
3 Data mining        is fascinating <NA>

>df2

   V1  V2  V3 V4   V5
1  NN VBZ  JJ NN <NA>
2 NNP VBZ  RB IN   JJ
3 NNP  NN VBZ JJ <NA>

#Function to match words and assign sentiment score
checkLexicon<- function(word,tag){
if (grepl("JJ|JJR|JJS",tag)){
ifelse(word %in% posAdj, +1, ifelse(word  %in% negAdj, -1, 0))
}
else if(grepl("NN|NNP|NNPS|NNS",tag)){
ifelse(word %in% posNoun, +1, ifelse(word %in% negNoun, -1, 0))
}
else if(grepl("VBZ",tag)){
ifelse(word %in% "is","ok","none")
}
else if(grepl("RB",tag)){
ifelse(word %in% "not",-1,0)
}
else if(grepl("IN",tag)){
ifelse(word %in% "far",-1,0)
}
}

#Method to output a single value when used in conjunction with apply
justShow<- function(x){
    x
    }

#Main method that intends to extract word/POS tag pair, and determine sentiment score
myObject<- mapply(FUN=checkLexicon, word=apply(df1,2,justShow),tag=apply(df2,2,justShow))

#Shaping the final dataframe
scoredDF<- as.data.frame(matrix(myObject,nrow=3))

  V1 V2 V3 V4   V5
1  1 ok  1  0 NULL
2 -1 ok  0  0    1
3  0  0 ok  1 NULL
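
The result above mixes numbers, "ok" and NULL, so the columns end up as lists rather than numeric vectors. If a purely numeric data frame with NA for the missing cells is wanted, as in the table the question asks for, one possible variant is sketched below. It is not the original poster's code: `scoreWord` is a hypothetical rewrite of `checkLexicon`, the anchored patterns such as `^JJ` are an assumption, and scoring verbs as 0 instead of "ok" is a simplification.

```r
# Sample data reproduced from the tables above
df1 <- data.frame(V1 = c("R", "Java", "Data"), V2 = c("is", "is", "mining"),
                  V3 = c("fantastic", "far", "is"),
                  V4 = c("language", "from", "fascinating"),
                  V5 = c(NA, "good", NA), stringsAsFactors = FALSE)
df2 <- data.frame(V1 = c("NN", "NNP", "NNP"), V2 = c("VBZ", "VBZ", "NN"),
                  V3 = c("JJ", "RB", "VBZ"), V4 = c("NN", "IN", "JJ"),
                  V5 = c(NA, "JJ", NA), stringsAsFactors = FALSE)

posAdj <- c("fantastic", "fascinating", "good")
negAdj <- c("bad", "poor")
posNoun <- "R"
negNoun <- "Java"

# Hypothetical variant of checkLexicon: always returns exactly one number,
# and NA when either the word or the tag is missing.
scoreWord <- function(word, tag) {
  if (is.na(word) || is.na(tag)) return(NA_real_)
  if (grepl("^JJ", tag)) {
    if (word %in% posAdj) 1 else if (word %in% negAdj) -1 else 0
  } else if (grepl("^NN", tag)) {
    if (word %in% posNoun) 1 else if (word %in% negNoun) -1 else 0
  } else if (grepl("^RB", tag)) {
    if (word == "not") -1 else 0
  } else if (grepl("^IN", tag)) {
    if (word == "far") -1 else 0
  } else {
    0
  }
}

# Pair each word with its tag element by element, then restore the shape
words <- as.matrix(df1)
tags <- as.matrix(df2)
scores <- mapply(scoreWord, words, tags, USE.NAMES = FALSE)
scoredDF <- as.data.frame(matrix(scores, nrow = nrow(df1),
                                 dimnames = dimnames(words)))
```

Because every call returns a single numeric value, `mapply` simplifies to a plain numeric vector and each column of `scoredDF` is numeric. Note that "far" still scores 0 here, since the tagger labelled it RB and the -1 rule for "far" only applies to the IN tag.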
