I have a pandas DataFrame with the following columns (all of dtype object):
1. x_id
2. y_id
3. Sentence1
4. Sentence2
5. Label
I want to split Sentence1 and Sentence2 into multiple columns, one word per column, in the same DataFrame.
Here is an example DataFrame, named df:
x_id  y_id  Sentence1           Sentence2    Label
0     2     This is a ball      I hate you   0
1     5     I am a boy          Ahmed Ali    1
2     1     Apple is red        Rose is red  1
3     9     I love you so much  Me too       1
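To try the answers on this data, the example can be rebuilt roughly as follows. This is only a sketch: the integer dtypes for x_id, y_id and Label are an assumption, since the real columns are said to be of type object.

```python
import pandas as pd

# Example data copied from the table above.
# Assumption: x_id, y_id and Label are stored as integers here.
df = pd.DataFrame({
    "x_id": [0, 1, 2, 3],
    "y_id": [2, 5, 1, 9],
    "Sentence1": ["This is a ball", "I am a boy", "Apple is red", "I love you so much"],
    "Sentence2": ["I hate you", "Ahmed Ali", "Rose is red", "Me too"],
    "Label": [0, 1, 1, 1],
})

print(df)
```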
After splitting the columns Sentence1 and Sentence2 on spaces, the DataFrame should look like this:
x_id  y_id  1      2     3    4     5     6      7     8     Label
0     2     This   is    a    ball  NONE  I      hate  you   0
1     5     I      am    a    boy   NONE  Ahmed  Ali   NONE  1
2     1     Apple  is    red  NONE  NONE  Rose   is    red   1
3     9     I      love  you  so    much  Me     too   NONE  1
How can I split the columns like this in Python, using a pandas DataFrame?
4 solutions
#1
0
A one-hot-style encoding solution (CountVectorizer produces per-word counts):
In [14]: df.Sentence1 += ' ' + df.pop('Sentence2')
In [15]: df
Out[15]:
   x_id  y_id                  Sentence1  Label
0     0     2  This is a ball I hate you      0
1     1     5       I am a boy Ahmed Ali      1
2     2     1   Apple is red Rose is red      1
3     3     9  I love you so much Me too      1
In [16]: from sklearn.feature_extraction.text import CountVectorizer
In [17]: vect = CountVectorizer()
In [18]: X = vect.fit_transform(df.Sentence1.fillna(''))
X is a sparse (memory-saving) matrix:
In [23]: X
Out[23]:
<4x17 sparse matrix of type '<class 'numpy.int64'>'
with 19 stored elements in Compressed Sparse Row format>
In [24]: type(X)
Out[24]: scipy.sparse.csr.csr_matrix
In [19]: X.toarray()
Out[19]:
array([[0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1],
       [1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 2, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1]], dtype=int64)
Most sklearn methods accept sparse matrices.
If you want to "unpack" it into a regular DataFrame:
In [21]: r = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0
In [22]: r
Out[22]:
   ahmed  ali  am  apple  ball  boy  hate  is  love  me  much  red  rose  so  this  too  you
0      0    0   0      0     1    0     1   1     0   0     0    0     0   0     1    0    1
1      1    1   1      0     0    1     0   0     0   0     0    0     0   0     0    0    0
2      0    0   0      1     0    0     0   2     0   0     0    2     1   0     0    0    0
3      0    0   0      0     0    0     0   0     1   1     1    0     0   1     0    1    1
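Note that the matrix above holds word counts ("is" and "red" are 2 in row 2), and that CountVectorizer's default tokenizer drops one-letter words such as "a" and "I". If strict 0/1 indicators are preferred, one possible sketch uses pandas' own Series.str.get_dummies. The name indicators is made up for illustration, and the DataFrame is rebuilt here with only the two sentence columns so that the snippet runs on its own. Passing binary=True to CountVectorizer should be another way to get 0/1 values.

```python
import pandas as pd

# Only the two text columns from the question are needed for this sketch.
df = pd.DataFrame({
    "Sentence1": ["This is a ball", "I am a boy", "Apple is red", "I love you so much"],
    "Sentence2": ["I hate you", "Ahmed Ali", "Rose is red", "Me too"],
})

# Join both sentences per row, lowercase them, then build one 0/1 column
# per distinct word (a word that occurs twice in a row still gives 1).
text = df["Sentence1"].str.cat(df["Sentence2"], sep=" ").str.lower()
indicators = text.str.get_dummies(sep=" ")

print(indicators.loc[2, ["apple", "is", "red", "rose"]])
```

Unlike the CountVectorizer output, this keeps the one-letter words, so there are two extra columns ("a" and "i").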
#2
1
In [26]: x = pd.concat([df.pop('Sentence1').str.split(expand=True),
...: df.pop('Sentence2').str.split(expand=True)],
...: axis=1)
...:
In [27]: x.columns = np.arange(1, x.shape[1]+1)
In [28]: x
Out[28]:
   1      2     3    4     5     6      7     8
0  This   is    a    ball  None  I      hate  you
1  I      am    a    boy   None  Ahmed  Ali   None
2  Apple  is    red  None  None  Rose   is    red
3  I      love  you  so    much  Me     too   None
In [29]: df = df.join(x)
In [30]: df
Out[30]:
   x_id  y_id  Label  1      2     3    4     5     6      7     8
0  0     2     0      This   is    a    ball  None  I      hate  you
1  1     5     1      I      am    a    boy   None  Ahmed  Ali   None
2  2     1     1      Apple  is    red  None  None  Rose   is    red
3  3     9     1      I      love  you  so    much  Me     too   None
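The same idea can be wrapped in a small helper so that it works for any list of text columns and leaves the original df untouched (df.pop above modifies it in place). This is a sketch rather than part of the answer: the function name split_words and the Sentence1_1-style column names are invented for illustration.

```python
import pandas as pd


def split_words(df, columns):
    """Return a new DataFrame where each column in `columns` is replaced
    by one column per whitespace-separated word."""
    parts = []
    for col in columns:
        words = df[col].str.split(expand=True)
        # Name the new columns <col>_1, <col>_2, ... (hypothetical naming scheme).
        words.columns = [f"{col}_{i}" for i in range(1, words.shape[1] + 1)]
        parts.append(words)
    # Keep the remaining columns and append the word columns side by side.
    return pd.concat([df.drop(columns=columns)] + parts, axis=1)


df = pd.DataFrame({
    "x_id": [0, 1, 2, 3],
    "y_id": [2, 5, 1, 9],
    "Sentence1": ["This is a ball", "I am a boy", "Apple is red", "I love you so much"],
    "Sentence2": ["I hate you", "Ahmed Ali", "Rose is red", "Me too"],
    "Label": [0, 1, 1, 1],
})

out = split_words(df, ["Sentence1", "Sentence2"])
print(out)
```

Prefixing the new columns with the source column name avoids clashes when two text columns are split, at the cost of not matching the plain 1..8 numbering asked for in the question.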
#3
0
Here is how to do it for the sentences in the column Sentence1. The idea is identical for the Sentence2 column.
splits = df.Sentence1.str.split(' ')
longest = splits.apply(len).max()
Note that longest is the number of words in the longest sentence. Now create the empty columns (object dtype, so they can hold strings):
for j in range(1, longest + 1):
    df[str(j)] = None
And finally, go through the split values and assign them row by row:
for i, words in zip(df.index, splits):
    # Column names start at "1", so enumerate from 1.
    for k, word in enumerate(words, start=1):
        df.loc[i, str(k)] = word
#4
0
This looks like a machine learning problem. Converting one column into as many columns as the longest sentence has words may not be efficient.
Another (probably more efficient) solution is to convert each word to an integer and then pad every sequence to the length of the longest sentence. TensorFlow has tools for that.
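As a rough illustration of that idea without any deep learning library, the encoding and padding can be done in plain Python. The function name encode_and_pad, the choice of 0 as the padding id and the lowercasing are assumptions made for this sketch. TensorFlow's own utilities are not shown here.

```python
PAD_ID = 0  # assumption: id 0 is reserved for padding


def encode_and_pad(sentences):
    """Map each word to an integer id and right-pad every sequence with
    PAD_ID to the length of the longest sentence."""
    vocab = {}
    encoded = []
    for sentence in sentences:
        ids = []
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1  # ids start at 1
            ids.append(vocab[word])
        encoded.append(ids)
    max_len = max((len(ids) for ids in encoded), default=0)
    padded = [ids + [PAD_ID] * (max_len - len(ids)) for ids in encoded]
    return padded, vocab


sentences = ["This is a ball", "I am a boy", "Apple is red", "I love you so much"]
padded, vocab = encode_and_pad(sentences)
for row in padded:
    print(row)
```

Every row then has the same length, so the result can be turned into a rectangular array and fed to a model.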