Pandas DataFrame: splitting one column into multiple columns

I have a pandas DataFrame with the following columns (all of dtype object):

1. x_id
2. y_id
3. Sentence1
4. Sentence2
5. Label

I want to split Sentence1 and Sentence2 into multiple columns within the same DataFrame.

Here is an example DataFrame, named df:

x_id     y_id     Sentence1          Sentence2          Label
0        2        This is a ball     I hate you         0
1        5        I am a boy         Ahmed Ali          1
2        1        Apple is red       Rose is red        1
3        9        I love you so much Me too             1

After splitting the columns [Sentence1, Sentence2] on the space character ' ', the DataFrame should look like this:

x_id     y_id     1     2     3    4     5      6      7     8      Label
0        2        This  is    a    ball  NONE   I      hate  you    0
1        5        I     am    a    boy   NONE   Ahmed  Ali   NONE   1
2        1        Apple is    red  NONE  NONE   Rose   is    red    1
3        9        I     love  you  so    much   Me     too   NONE   1

How can I split the columns like this in Python, using a pandas DataFrame?

4 solutions

#1

A one-hot-style (bag-of-words count) encoding solution:

In [14]: df.Sentence1 += ' ' + df.pop('Sentence2')

In [15]: df
Out[15]:
   x_id  y_id                  Sentence1  Label
0     0     2  This is a ball I hate you      0
1     1     5       I am a boy Ahmed Ali      1
2     2     1   Apple is red Rose is red      1
3     3     9  I love you so much Me too      1

In [16]: from sklearn.feature_extraction.text import CountVectorizer

In [17]: vect = CountVectorizer()

In [18]: X = vect.fit_transform(df.Sentence1.fillna(''))

X is a sparse (memory-saving) matrix:

In [23]: X
Out[23]:
<4x17 sparse matrix of type '<class 'numpy.int64'>'
        with 19 stored elements in Compressed Sparse Row format>

In [24]: type(X)
Out[24]: scipy.sparse.csr.csr_matrix

In [19]: X.toarray()
Out[19]:
array([[0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1],
       [1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 2, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1]], dtype=int64)

Most sklearn methods accept sparse matrices.

If you want to "unpack" it:

In [21]: r = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())

In [22]: r
Out[22]:
   ahmed  ali  am  apple  ball  boy  hate  is  love  me  much  red  rose  so  this  too  you
0      0    0   0      0     1    0     1   1     0   0     0    0     0   0     1    0    1
1      1    1   1      0     0    1     0   0     0   0     0    0     0   0     0    0    0
2      0    0   0      1     0    0     0   2     0   0     0    2     1   0     0    0    0
3      0    0   0      0     0    0     0   0     1   1     1    0     0   1     0    1    1

#2

In [26]: x = pd.concat([df.pop('Sentence1').str.split(expand=True),
    ...:                df.pop('Sentence2').str.split(expand=True)],
    ...:               axis=1)
    ...:

In [27]: x.columns = np.arange(1, x.shape[1]+1)

In [28]: x
Out[28]:
       1     2    3     4     5      6     7     8
0   This    is    a  ball  None      I  hate   you
1      I    am    a   boy  None  Ahmed   Ali  None
2  Apple    is  red  None  None   Rose    is   red
3      I  love  you    so  much     Me   too  None

In [29]: df = df.join(x)

In [30]: df
Out[30]:
   x_id  y_id  Label      1     2    3     4     5      6     7     8
0     0     2      0   This    is    a  ball  None      I  hate   you
1     1     5      1      I    am    a   boy  None  Ahmed   Ali  None
2     2     1      1  Apple    is  red  None  None   Rose    is   red
3     3     9      1      I  love  you    so  much     Me   too  None
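
For readers who want to run this outside an interactive session, the same approach can be written as a standalone script. This is a minimal sketch: the sample data is copied from the question, the variable names parts, words, label, and result are illustrative, and moving Label to the end is an extra step added here to match the layout requested in the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "x_id": [0, 1, 2, 3],
    "y_id": [2, 5, 1, 9],
    "Sentence1": ["This is a ball", "I am a boy", "Apple is red", "I love you so much"],
    "Sentence2": ["I hate you", "Ahmed Ali", "Rose is red", "Me too"],
    "Label": [0, 1, 1, 1],
})

# Split each sentence column on whitespace; expand=True yields one column per word.
parts = [df.pop(col).str.split(expand=True) for col in ["Sentence1", "Sentence2"]]
words = pd.concat(parts, axis=1)

# Renumber the word columns 1..N so they are unique across both sentences.
words.columns = range(1, words.shape[1] + 1)

# Re-attach Label last, as in the desired output from the question.
label = df.pop("Label")
result = df.join(words).assign(Label=label)
print(result)
```

Shorter sentences are padded with missing values on the right, so the number of word columns is set by the longest sentence in each original column.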

#3

Here is how to do it for the sentences in the Sentence1 column. The idea is identical for the Sentence2 column.

splits = df.Sentence1.str.split(' ')
longest = splits.apply(len).max()

Note that longest is the number of words in the longest sentence. Now create the empty columns:

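for j in range(1, longest + 1):
    df[str(j)] = None  # object-dtype placeholder column, one per word position

And finally, go through the split values and assign them:

# Pair each row label with its list of words, then fill columns '1', '2', ...
for i, words in zip(df.index, splits):
    for k, word in enumerate(words, start=1):
        df.loc[i, str(k)] = word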


#4

This looks like a machine learning problem. Expanding one column into as many columns as the longest sentence has words may not be efficient.

Another (probably more efficient) solution is to convert each word to an integer and then pad every sequence to the length of the longest sentence. TensorFlow has tools for that.
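
To make the integer-encoding idea concrete, here is a minimal sketch in plain Python rather than TensorFlow, so it runs without extra dependencies. The names vocab, encoded, and padded are illustrative, and lowercasing the words and reserving id 0 for padding are assumptions of this sketch, not something the answer specifies.

```python
sentences = ["This is a ball", "I am a boy", "Apple is red", "I love you so much"]

# Build a vocabulary: id 0 is reserved for padding, real words start at 1.
vocab = {}
for sentence in sentences:
    for word in sentence.lower().split():
        vocab.setdefault(word, len(vocab) + 1)

# Encode each sentence as a list of integer ids.
encoded = [[vocab[w] for w in s.lower().split()] for s in sentences]

# Right-pad every sequence with 0 up to the length of the longest sentence.
max_len = max(len(seq) for seq in encoded)
padded = [seq + [0] * (max_len - len(seq)) for seq in encoded]

for row in padded:
    print(row)
```

The result is a rectangular list of integers (one row per sentence) that can be turned into an array and fed to a model, instead of a wide table of string columns.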
