
Time: 2021-03-13 22:34:30

I have a pandas DataFrame with the following columns (all of dtype object):


1. x_id
2. y_id
3. Sentence1
4. Sentence2
5. Label

I want to split Sentence1 and Sentence2 into multiple columns in the same DataFrame.


Here is an example DataFrame, named df:


x_id     y_id     Sentence1          Sentence2          Label
0        2        This is a ball     I hate you         0
1        5        I am a boy         Ahmed Ali          1
2        1        Apple is red       Rose is red        1
3        9        I love you so much Me too             1

After splitting the columns [Sentence1, Sentence2] on spaces (' '), the DataFrame should look like this:


x_id     y_id     1     2     3    4     5      6      7     8      Label
0        2        This  is    a    ball  NONE   I      hate  you    0
1        5        I     am    a    boy   NONE   Ahmed  Ali   NONE   1
2        1        Apple is    red  NONE  NONE   Rose   is    red    1
3        9        I     love  you  so    much   Me     too   NONE   1

How can I split the columns like this in Python, using a pandas DataFrame?


4 solutions



Bag-of-words (one-hot-style) encoding solution:


In [14]: df.Sentence1 += ' ' + df.pop('Sentence2')

In [15]: df
   x_id  y_id                  Sentence1  Label
0     0     2  This is a ball I hate you      0
1     1     5       I am a boy Ahmed Ali      1
2     2     1   Apple is red Rose is red      1
3     3     9  I love you so much Me too      1

In [16]: from sklearn.feature_extraction.text import CountVectorizer

In [17]: vect = CountVectorizer()

In [18]: X = vect.fit_transform(df.Sentence1.fillna(''))

X is a sparse (memory-saving) matrix:


In [23]: X
<4x17 sparse matrix of type '<class 'numpy.int64'>'
        with 19 stored elements in Compressed Sparse Row format>

In [24]: type(X)
Out[24]: scipy.sparse.csr.csr_matrix

In [19]: X.toarray()
array([[0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1],
       [1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 2, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1]], dtype=int64)

Most sklearn methods accept sparse matrices.
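As a minimal sketch of passing the sparse matrix straight to an estimator, the snippet below rebuilds the example frame and fits a classifier on X without calling toarray(). The choice of LogisticRegression and the use of Label as the target are assumptions made for illustration only; the original answer does not name an estimator.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Example data copied from the question.
df = pd.DataFrame({
    "x_id": [0, 1, 2, 3],
    "y_id": [2, 5, 1, 9],
    "Sentence1": ["This is a ball", "I am a boy", "Apple is red", "I love you so much"],
    "Sentence2": ["I hate you", "Ahmed Ali", "Rose is red", "Me too"],
    "Label": [0, 1, 1, 1],
})

# Join both sentences into one text per row, then count words.
text = df["Sentence1"] + " " + df["Sentence2"]
vect = CountVectorizer()
X = vect.fit_transform(text)  # scipy sparse matrix, never densified here

# Assumption: LogisticRegression is just a stand-in estimator.
clf = LogisticRegression()
clf.fit(X, df["Label"])
pred = clf.predict(X)
print(X.shape, pred.shape)
```

The matrix stays sparse from vectorization through fitting, which is where the memory saving matters once the vocabulary grows.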


If you want to "unpack" it:


In [21]: r = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())  # get_feature_names() before scikit-learn 1.0

In [22]: r
   ahmed  ali  am  apple  ball  boy  hate  is  love  me  much  red  rose  so  this  too  you
0      0    0   0      0     1    0     1   1     0   0     0    0     0   0     1    0    1
1      1    1   1      0     0    1     0   0     0   0     0    0     0   0     0    0    0
2      0    0   0      1     0    0     0   2     0   0     0    2     1   0     0    0    0
3      0    0   0      0     0    0     0   0     1   1     1    0     0   1     0    1    1



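To get the exact layout from the question, start again from the original df (with both sentence columns intact) and split each column with expand=True:

In [26]: x = pd.concat([df.pop('Sentence1').str.split(expand=True),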
    ...:                df.pop('Sentence2').str.split(expand=True)],
    ...:               axis=1)

In [27]: x.columns = np.arange(1, x.shape[1]+1)

In [28]: x
       1     2    3     4     5      6     7     8
0   This    is    a  ball  None      I  hate   you
1      I    am    a   boy  None  Ahmed   Ali  None
2  Apple    is  red  None  None   Rose    is   red
3      I  love  you    so  much     Me   too  None

In [29]: df = df.join(x)

In [30]: df
   x_id  y_id  Label      1     2    3     4     5      6     7     8
0     0     2      0   This    is    a  ball  None      I  hate   you
1     1     5      1      I    am    a   boy  None  Ahmed   Ali  None
2     2     1      1  Apple    is  red  None  None   Rose    is   red
3     3     9      1      I  love  you    so  much     Me   too  None
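The transcript above can be sketched as a self-contained script. The helper name split_sentence_columns is hypothetical, and unlike the transcript it works on a copy instead of popping columns from df in place.

```python
import pandas as pd


def split_sentence_columns(df, columns):
    """Return a new frame with each column in `columns` split on whitespace.

    Hypothetical helper: the word columns are numbered 1..N across all
    split columns, matching the layout asked for in the question.
    """
    rest = df.drop(columns=columns)
    # expand=True turns each list of words into its own set of columns.
    parts = [df[col].str.split(expand=True) for col in columns]
    words = pd.concat(parts, axis=1)
    words.columns = range(1, words.shape[1] + 1)
    return rest.join(words)


df = pd.DataFrame({
    "x_id": [0, 1, 2, 3],
    "y_id": [2, 5, 1, 9],
    "Sentence1": ["This is a ball", "I am a boy", "Apple is red", "I love you so much"],
    "Sentence2": ["I hate you", "Ahmed Ali", "Rose is red", "Me too"],
    "Label": [0, 1, 1, 1],
})

result = split_sentence_columns(df, ["Sentence1", "Sentence2"])
print(result)
```

Shorter sentences are padded with missing values, so every row ends up with the same number of word columns.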



Here is how to do it for the sentences in the column Sentence1. The idea is identical for the Sentence2 column.


splits = df.Sentence1.str.split(' ')
longest = splits.apply(len).max()

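Note that longest is the number of words in the longest sentence. Now create the empty columns:


for j in range(1, longest + 1):
    df[str(j)] = None  # object dtype, so words can be assigned later

Finally, go through the split values and assign each word to its column:


for i, words in zip(df.index, splits):
    # enumerate from 1 so the k-th word lands in column str(k)
    for k, word in enumerate(words, start=1):
        df.loc[i, str(k)] = word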




This looks like a machine learning problem. Expanding one column into as many columns as the longest sentence has words may not be efficient.


Another (probably more efficient) solution is to convert each word to an integer and then pad every sentence to the length of the longest one. TensorFlow has tools for that.
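To make the idea concrete, here is a rough sketch in plain Python rather than TensorFlow, so it runs without extra dependencies. The function name encode_and_pad and the choice of 0 as the padding id are assumptions made for illustration.

```python
def encode_and_pad(sentences, pad_id=0):
    """Map each word to an integer id and right-pad to the longest sentence.

    Hypothetical helper: ids are assigned in order of first appearance,
    starting at 1 so that pad_id=0 never collides with a real word.
    """
    vocab = {}
    encoded = []
    for sentence in sentences:
        ids = []
        for word in sentence.split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
            ids.append(vocab[word])
        encoded.append(ids)
    longest = max((len(ids) for ids in encoded), default=0)
    padded = [ids + [pad_id] * (longest - len(ids)) for ids in encoded]
    return padded, vocab


sentences = ["This is a ball", "I am a boy", "Apple is red", "I love you so much"]
padded, vocab = encode_and_pad(sentences)
for row in padded:
    print(row)
```

Every row has the same length, so the result can be turned into a single integer array and fed to a model, which is usually cheaper than one string column per word.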




