Conditionally append a string to rows in a Pandas dataframe based on the presence of a value in another column

I'm working with tweets in a Pandas dataframe (Python). I'm trying to indicate that a specific tweet is a 'quoted tweet' by:

1) Looking at whether the 'quoted_author' field is blank or not

2) If the field is NOT blank, add the following prefix in front of the tweet text that includes the quoted author's username:

'QT @[quoted_author]: [tweet text]'

This is the code that is not working for me. What am I doing wrong? Thanks!

for row in df['quoted_author']:
        if row == "":
            pass
        else:
            df['Text'].append('QT ' + df['quoted_author'].astype(str) + ': ' + df['Text'].astype(str))

3 solutions

#1

Instead of looping over every row and checking whether it is blank, select all the rows that are not blank in one step.

df_author = df[df['quoted_author'] != ""]

Then use the apply function to prepend the corresponding author name to the text of each row in df_author.
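A minimal sketch of that idea, assuming the columns are named `quoted_author` and `Text` as in the question and that blank authors are stored as empty strings. The sample rows are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; the column names follow the question.
df = pd.DataFrame({
    "quoted_author": ["person1", "", "author"],
    "Text": ["tweettext", "fooootext", "atweet"],
})

# Keep only the rows whose quoted_author is not blank.
df_author = df[df["quoted_author"] != ""]

# Build the prefixed text for those rows only.
prefixed = df_author.apply(
    lambda row: "QT @{}: {}".format(row["quoted_author"], row["Text"]),
    axis=1,
)

# Write the result back. Rows with a blank author are not in
# prefixed.index, so they keep their original text.
df.loc[prefixed.index, "Text"] = prefixed

print(df)
```

Writing back through `df.loc` matters here: `df_author` is a separate filtered frame, so changing it alone would leave `df` untouched.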

#2

I went through and evaluated two different methods of achieving this. The first involves using apply and a separate function. See below:

df
      quoted_author        tweet_text
    0       person1         tweettext
    1       person2  somethingtweeted
    2           NaN         fooootext
    3           NaN        sometweets
    4        author            atweet
    5   some_author    someothertweet

Method 1 - Function and apply:

import numpy as np
import pandas as pd

def nullCheck(author, tweet):
    # Prefix the tweet only when a quoted author is present.
    if not pd.isnull(author):
        return 'QT ' + str(author) + ': ' + str(tweet)
    else:
        return np.nan


df['output'] = df[['quoted_author', 'tweet_text']].apply(lambda x: nullCheck(*x), axis=1)

%timeit df['output'] = df[['quoted_author', 'tweet_text']].apply(lambda x: nullCheck(*x), axis=1)
1000 loops, best of 3: 1.01 ms per loop

Method 2 - Slice the dataframe to only the non-null authors, then produce the output in a separate column:

df.loc[~pd.isnull(df['quoted_author']),'output'] = 'QT ' + df['quoted_author'] + ': ' + df['tweet_text']

%timeit df.loc[~pd.isnull(df['quoted_author']),'output'] = 'QT ' + df['quoted_author'] + ': ' + df['tweet_text']
    1000 loops, best of 3: 1.68 ms per loop

Interestingly, the first method is faster on this six-row sample, though I'm not exactly sure why. Can anyone else share some insight on this? Either way, this will get you what you're looking for.
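On a frame this small, fixed per-call overhead likely dominates the timings, so the comparison may look different with more rows. The sketch below reruns both approaches on a larger frame. The frame size, the generated author names, and the helper names `method_1` and `method_2` are assumptions for illustration, and no particular timing result is implied.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical larger frame: every other row has no quoted author.
n = 20_000
authors = ["user{}".format(i) if i % 2 else np.nan for i in range(n)]
big = pd.DataFrame({"quoted_author": authors, "tweet_text": "sometext"})


def null_check(author, tweet):
    # Same logic as Method 1: prefix only when an author is present.
    if not pd.isnull(author):
        return "QT " + str(author) + ": " + str(tweet)
    return np.nan


def method_1(df):
    # Row-wise apply with a separate function.
    return df[["quoted_author", "tweet_text"]].apply(
        lambda x: null_check(*x), axis=1
    )


def method_2(df):
    # Column-wise string concatenation; a missing author propagates
    # as a missing value in the result.
    return "QT " + df["quoted_author"] + ": " + df["tweet_text"]


t1 = timeit.timeit(lambda: method_1(big), number=3)
t2 = timeit.timeit(lambda: method_2(big), number=3)
print("apply:      {:.3f} s for 3 runs".format(t1))
print("vectorized: {:.3f} s for 3 runs".format(t2))
```

Running it with a few different values of `n` shows how each approach scales on your own machine.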

#3

Another one-liner solution

Setup (using Andrew L's example)

df = pd.DataFrame({'quoted_author': {0: 'person1',
  1: 'person2',  2: '',  3: '',  4: 'author',  5: 'some_author'}, 'text': {0: 'tweettext',
  1: 'somethingtweeted',  2: 'fooootext',  3: 'sometweets',  4: 'atweet',  5: 'someothertweet'}})

Solution

# Use apply to reset the text column based on the value of quoted_author.
df.text = df.apply(lambda x: 'QT {}: {}'.format(x.quoted_author, x.text) if x.quoted_author else x.text, axis=1)

  quoted_author                            text
0       person1           QT person1: tweettext
1       person2    QT person2: somethingtweeted
2                                     fooootext
3                                    sometweets
4        author               QT author: atweet
5   some_author  QT some_author: someothertweet
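The same result can also be sketched without a row-wise apply, using `Series.where` to keep the prefixed text only where an author is present. This version adds the `@` from the format given in the question and assumes blank authors are empty strings, as in the setup above.

```python
import pandas as pd

# Same sample data as the setup above, written as plain lists.
df = pd.DataFrame({
    "quoted_author": ["person1", "person2", "", "", "author", "some_author"],
    "text": ["tweettext", "somethingtweeted", "fooootext",
             "sometweets", "atweet", "someothertweet"],
})

# True for rows that have a quoted author.
has_author = df["quoted_author"] != ""

# Build the prefixed text for every row, then keep it only where
# has_author is True; elsewhere fall back to the original text.
prefixed = "QT @" + df["quoted_author"] + ": " + df["text"]
df["text"] = prefixed.where(has_author, df["text"])

print(df)
```

If missing authors are stored as NaN rather than empty strings, `df["quoted_author"].notna()` would be the mask to use instead.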
