Mean squared error returns an unreasonably high number

I'm trying to predict the profit each film made on IMDb.

My dataframe and features are as follows:

   Actor1  Actor2  Actor3  Actor4   Day  Director  Genre1  Genre2  Genre3  \
0       0       0       0       0  19.0         0       0       0       0   
1       1       1       1       1   6.0         1       1       1       1   
2       2       2       2       2  20.0         2       0       2       2   
3       3       3       3       3   9.0         3       2       0      -1   
4       4       4       4       4   9.0         4       3       3       3   

   Language  Month  Production  Rated  Runtime  Writer    Year    BoxOffice
0         1      0           0      0    118.0       0  2007.0   37500000.0
1         2      1           1      0    151.0       1  2006.0  132300000.0
2         1      1           2      1    130.0       2  2006.0   53100000.0
3         1      2           1      0    117.0       3  2007.0  210500000.0
4         4      3           3      2    117.0       4  2006.0  244052771.0

The value I'm trying to predict (the target) is BoxOffice.

I'm following the sklearn documentation exactly as written (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error):

from sklearn import preprocessing, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split, cross_val_score

X = dataset[:,0:16] # Features
Y = dataset[:,16] #Target

X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.33)

regr = linear_model.LinearRegression()
regr.fit(X_train,Y_train)
mean_squared_error(Y_test, regr.predict(X_test))

and the output is always something along the lines of: 11385650623660550 ($11,385,650,623,660,500.00)

While the mean of the BoxOffice is: 107989121

etc.

I've tried several different approaches, including cross-validation and other models (Keras), and I feel like I've tried everything.

The returned value is extremely high, which makes me suspect that the problem is not in the model or the data but in something else that I'm missing.

3 answers

#1

I think your problem is not related to the mean squared error; it is the model itself.

For your categorical features, I recommend trying another encoding method such as OneHotEncoder. LabelEncoder is not a good option for linear regression, because the model treats the integer codes as ordered quantities.

(For more information: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f)
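
As a minimal sketch of the encoding suggested above: the `genre_codes` array below is a made-up stand-in for one label-encoded column such as Genre1, not the asker's actual data. OneHotEncoder turns each distinct code into its own 0/1 column, so the regression no longer treats code 3 as "three times" code 1.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical label-encoded column (integer codes, one per film).
genre_codes = np.array([[0], [1], [2], [0], [3]])

# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")

# fit_transform returns a sparse matrix; toarray() makes it dense for inspection.
genre_onehot = encoder.fit_transform(genre_codes).toarray()

print(genre_onehot.shape)  # rows = films, columns = distinct codes
print(genre_onehot)
```

With columns like Actor1 or Director, which may have thousands of distinct values, the one-hot matrix gets very wide, so keeping the sparse output may be preferable in practice.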

Before training your model, look at the correlation of your numeric features with your target variable; some of them may be irrelevant. For categorical features, you can try different methods to analyze their relationship with the target variable (boxplots, for example).
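
A rough sketch of the correlation check, using only the five rows shown in the question (far too few to conclude anything; on the real data you would run this over the full columns). The variable names are illustrative.

```python
import numpy as np

# Numeric columns from the five rows printed in the question.
runtime = np.array([118.0, 151.0, 130.0, 117.0, 117.0])
year = np.array([2007.0, 2006.0, 2006.0, 2007.0, 2006.0])
box_office = np.array([37500000.0, 132300000.0, 53100000.0,
                       210500000.0, 244052771.0])

# Pearson correlation of each numeric feature with the target.
correlations = {}
for name, column in {"Runtime": runtime, "Year": year}.items():
    correlations[name] = np.corrcoef(column, box_office)[0, 1]

for name, r in correlations.items():
    print(f"{name}: r = {r:.3f}")
```

Features whose correlation with the target is close to zero are candidates for removal, though a weak linear correlation does not rule out a nonlinear relationship.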

Linear regression expects continuous variables, so you may want to try other algorithms as well. Just make sure you have enough background before applying them.

#2

Try rescaling your output (Y) variable so that its values lie between 0 and 1.
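
One way this could look, as a sketch that assumes sklearn's MinMaxScaler and uses the five BoxOffice values from the question as sample data. Note that rescaling the target only changes the units the error is reported in; for an ordinary linear regression it does not make the fit itself any better.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# BoxOffice values from the five rows shown in the question.
y = np.array([37500000.0, 132300000.0, 53100000.0,
              210500000.0, 244052771.0])

# MinMaxScaler expects a 2D array, so reshape to a single column.
scaler = MinMaxScaler()
y_scaled = scaler.fit_transform(y.reshape(-1, 1)).ravel()

# Predictions made on the scaled target can be mapped back to dollars.
y_back = scaler.inverse_transform(y_scaled.reshape(-1, 1)).ravel()

print(y_scaled)
```

In a real train/test split, the scaler would be fitted on Y_train only and then applied to Y_test.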

#3

Can you check the accuracy of your model? I guess it's very low, which is why you are getting a high mean squared error. Because the model's accuracy is low, the difference between the predicted and actual box office is very large, and squaring it makes it even bigger.

regr.score(X_test, Y_test)  # R^2 on the test set; `regr` is the model fitted in the question
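
To make the effect of squaring concrete, here is a small calculation using only the two numbers reported in the question. The MSE is expressed in squared dollars, so taking its square root (the RMSE) brings it back to dollars, where it can be compared with the mean of BoxOffice.

```python
import math

mse = 11385650623660550       # MSE from the question, in dollars squared
mean_box_office = 107989121   # mean of BoxOffice from the question, in dollars

# The square root of the MSE is the RMSE, in the same units as the target.
rmse = math.sqrt(mse)
ratio = rmse / mean_box_office

print(f"RMSE: {rmse:,.0f}")
print(f"RMSE / mean BoxOffice: {ratio:.2f}")
```

The RMSE comes out at roughly the same size as the mean box office, so the huge MSE is a matter of units rather than a bug in `mean_squared_error`. It still indicates a weak model, since a typical error is about as large as a typical value.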
