I am trying to join two numpy arrays. In one I have a set of columns/features after running TF-IDF on a single column of text. In the other I have one column/feature which is an integer. So I read in a column of train and test data, run TF-IDF on this, and then I want to add another integer column because I think this will help my classifier learn more accurately how it should behave.
Unfortunately, I am getting the error in the title when I try to run hstack to add this single column to my other numpy array.
Here is my code:
#imports used by the snippet below
import numpy as np
import pandas as p
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import linear_model as lm
#reading in test/train data for TF-IDF
traindata = list(np.array(p.read_csv('FinalCSVFin.csv', delimiter=";"))[:,2])
testdata = list(np.array(p.read_csv('FinalTestCSVFin.csv', delimiter=";"))[:,2])
#reading in labels for training
y = np.array(p.read_csv('FinalCSVFin.csv', delimiter=";"))[:,-2]
#reading in single integer column to join
AlexaTrainData = p.read_csv('FinalCSVFin.csv', delimiter=";")[["alexarank"]]
AlexaTestData = p.read_csv('FinalTestCSVFin.csv', delimiter=";")[["alexarank"]]
AllAlexaAndGoogleInfo = AlexaTestData.append(AlexaTrainData)
tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode',
                      analyzer='word', token_pattern=r'\w{1,}', ngram_range=(1, 2),
                      use_idf=1, smooth_idf=1, sublinear_tf=1) #tf-idf object
rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001,
                           C=1, fit_intercept=True, intercept_scaling=1.0,
                           class_weight=None, random_state=None) #Classifier
X_all = traindata + testdata #adding test and train data to put into tf-idf
lentrain = len(traindata) #find length of train data
tfv.fit(X_all) #fit tf-idf on all our text
X_all = tfv.transform(X_all) #transform it
X = X_all[:lentrain] #reduce to size of training set
AllAlexaAndGoogleInfo = AllAlexaAndGoogleInfo[:lentrain] #reduce to size of training set
X_test = X_all[lentrain:] #reduce to size of test set
#printing debug info, output below :
print "X.shape => " + str(X.shape)
print "AllAlexaAndGoogleInfo.shape => " + str(AllAlexaAndGoogleInfo.shape)
print "X_all.shape => " + str(X_all.shape)
#line we get error on
X = np.hstack((X, AllAlexaAndGoogleInfo))
Below is the output and error message:
X.shape => (7395, 238377)
AllAlexaAndGoogleInfo.shape => (7395, 1)
X_all.shape => (10566, 238377)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-2b310887b5e4> in <module>()
31 print "X_all.shape => " + str(X_all.shape)
32 #X = np.column_stack((X, AllAlexaAndGoogleInfo))
---> 33 X = np.hstack((X, AllAlexaAndGoogleInfo))
34 sc = preprocessing.StandardScaler().fit(X)
35 X = sc.transform(X)
C:\Users\Simon\Anaconda\lib\site-packages\numpy\core\shape_base.pyc in hstack(tup)
271 # As a special case, dimension 0 of 1-dimensional arrays is "horizontal"
272 if arrs[0].ndim == 1:
--> 273 return _nx.concatenate(arrs, 0)
274 else:
275 return _nx.concatenate(arrs, 1)
ValueError: all the input arrays must have same number of dimensions
What is causing my problem here? How can I fix this? As far as I can see I should be able to join these columns, so what have I misunderstood?
Thank you.
Edit:
Using the method in the answer below gives the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-16-640ef6dd335d> in <module>()
---> 36 X = np.column_stack((X, AllAlexaAndGoogleInfo))
37 sc = preprocessing.StandardScaler().fit(X)
38 X = sc.transform(X)
C:\Users\Simon\Anaconda\lib\site-packages\numpy\lib\shape_base.pyc in column_stack(tup)
294 arr = array(arr,copy=False,subok=True,ndmin=2).T
295 arrays.append(arr)
--> 296 return _nx.concatenate(arrays,1)
297
298 def dstack(tup):
ValueError: all the input array dimensions except for the concatenation axis must match exactly
Interestingly, I tried to print the dtype of X and this worked fine:
X.dtype => float64
However, trying to print the dtype of AllAlexaAndGoogleInfo like so:
print "AllAlexaAndGoogleInfo.dtype => " + str(AllAlexaAndGoogleInfo.dtype)
produces:
'DataFrame' object has no attribute 'dtype'
3 Answers
#1
13
As X is a sparse array, use scipy.sparse.hstack instead of numpy.hstack to join the arrays. In my opinion the error message is kind of misleading here.
This minimal example illustrates the situation:
import numpy as np
from scipy import sparse
X = sparse.rand(10, 10000)
xt = np.random.random((10, 1))
print 'X shape:', X.shape
print 'xt shape:', xt.shape
print 'Stacked shape:', np.hstack((X,xt)).shape
#print 'Stacked shape:', sparse.hstack((X,xt)).shape #This works
Based on the following output
X shape: (10, 10000)
xt shape: (10, 1)
one may expect that the hstack in the following line will work, but the fact is that it throws this error:
ValueError: all the input arrays must have same number of dimensions
So, use scipy.sparse.hstack when you have a sparse array to stack.
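Applied to the question's variables, a minimal sketch could look like this (untested, and it assumes the extra column is already numeric):
from scipy import sparse
# keep X sparse and let scipy.sparse.hstack do the join;
# .values pulls the plain numpy array out of the DataFrame,
# .tocsr() converts to a format scikit-learn estimators handle well
X = sparse.hstack((X, AllAlexaAndGoogleInfo.values)).tocsr()
print X.shape   # expected (7395, 238378) given the shapes printed above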
In fact I have answered this as a comment in another of your questions, and you mentioned that another error message pops up:
TypeError: no supported conversion for types: (dtype('float64'), dtype('O'))
First of all, AllAlexaAndGoogleInfo does not have a dtype as it is a DataFrame. To get its underlying numpy array, simply use AllAlexaAndGoogleInfo.values. Check its dtype. Based on the error message, it has a dtype of object, which means that it might contain non-numerical elements like strings.
This is a minimal example that reproduces this situation:
X = sparse.rand(100, 10000)
xt = np.random.random((100, 1))
xt = xt.astype('object') # Comment this to fix the error
print 'X:', X.shape, X.dtype
print 'xt:', xt.shape, xt.dtype
print 'Stacked shape:', sparse.hstack((X,xt)).shape
The error message:
TypeError: no supported conversion for types: (dtype('float64'), dtype('O'))
So, check if there are any non-numerical values in AllAlexaAndGoogleInfo and repair them before doing the stacking.
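A minimal sketch of that check, assuming a pandas version that has pd.to_numeric and using the "alexarank" column name from the question's code:
import pandas as pd
# coerce the column to numeric; anything that cannot be parsed becomes NaN
col = pd.to_numeric(AllAlexaAndGoogleInfo['alexarank'], errors='coerce')
print col.isnull().sum()   # how many values were not numeric
# one possible repair: fill the unparseable entries with the column median
AllAlexaAndGoogleInfo['alexarank'] = col.fillna(col.median())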
#2
8
Use np.column_stack. Like so:
X = np.column_stack((X, AllAlexaAndGoogleInfo))
From the docs:
Take a sequence of 1-D arrays and stack them as columns to make a single 2-D array. 2-D arrays are stacked as-is, just like with hstack.
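A tiny dense illustration of that behaviour, with made-up data unrelated to the question:
import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
# each 1-D input becomes a column of the 2-D result
print np.column_stack((a, b))   # shape (3, 2)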
#3
1
Try:
X = np.hstack((X, AllAlexaAndGoogleInfo.values))
I don't have a running Pandas module, so I can't test it. But the DataFrame documentation describes .values as the "Numpy representation of NDFrame". np.hstack is a numpy function, and as such knows nothing about the internal structure of a DataFrame.
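A small illustration, again with made-up data, of what .values hands to numpy:
import numpy as np
import pandas as pd
df = pd.DataFrame({'alexarank': [10.0, 20.0, 30.0]})
print type(df.values)   # a plain numpy.ndarray, not a DataFrame
print np.hstack((np.ones((3, 1)), df.values))   # stacks fine once both inputs are ndarrays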