I have a pandas dataframe in which values in a column are used as the group-by basis to create submodels.
import pandas as pd
from sklearn.linear_model import Ridge
data = pd.DataFrame({"Name": ["A", "A", "A", "B", "B", "B"], "Score": [90, 80, 90, 92, 87, 80], "Age": [10, 12, 14, 9, 11, 12], "Training": [0, 1, 2, 0, 1, 2]})
"Name" is used as the basis to create a submodel for each individual. I want to use the variables "Age" and "Training" to predict the "Score" of one individual "Name" (i.e., "A" and "B" in this case). That is, if I have "A" and know the "Age" and "Training" of "A", I would like to use "A", "Age", and "Training" to predict "Score". However, "A" should only be used to select the model that "A" belongs to, rather than any other model.
grouped_df = data.groupby("Name")
for key, item in grouped_df:
    # item is the sub-DataFrame holding the rows for this Name
    Score_item = item["Score"]
    Y_item = item[["Age", "Training"]]
    model = Ridge(alpha=1.2)
    modelfit = model.fit(Y_item, Score_item)
    modelpred = model.predict(Y_item)
    modelscore = model.score(Y_item, Score_item)
    print(modelscore)
Up to here, I have built simple Ridge models for sub-groups A and B.
My question is, with test data as below:
test_data = [u"A, 13, 0", u"B, 12, 1", u"A, 10, 0"]  # each element represents Name, Age and Training, respectively
How do I feed this data to the prediction models? So far I have:
line = test_data
# Split each record on commas and strip the surrounding whitespace
fields = [[f.strip() for f in rec.split(",")] for rec in line]
Name = [f[0] for f in fields]
Age = [float(f[1]) for f in fields]
Training = [float(f[2]) for f in fields]
Y = pd.DataFrame({"Name": Name, "Age": Age, "Training": Training})
This gives me a pandas dataframe of the test data. However, I am not sure how to proceed from here to feed the test data to the models. I would highly appreciate your help. Thank you!
UPDATE
After I adopted Parfait's code, the code looks better now. However, I did not create another pandas dataframe for the test data (as I am not sure how to deal with the rows in it). Instead, I feed in the test values by splitting the strings. I obtained the warning shown below. I searched and found a related post, "Preprocessing in scikit learn - single sample - Depreciation warning". However, when I tried to reshape the test data, it turned out to be a list, so it does not have a reshape attribute. I think I am misunderstanding something. I would highly appreciate it if you could let me know how to fix this error. Thank you.
import pandas as pd
from sklearn.linear_model import Ridge
import numpy as np
data = pd.DataFrame({"Name": ["A", "A", "A", "B", "B", "B"], "Score": [90, 80, 90, 92, 87, 80], "Age": [10, 12, 14, 9, 11, 12], "Training": [0, 1, 2, 0, 1, 2]})
modeldict = {}  # INITIALIZE DICT
grouped_df = data.groupby("Name")
for key, item in grouped_df:
    Score_item = item["Score"]
    Y_item = item[["Age", "Training"]]
    model = Ridge(alpha=1.2)
    modelfit = model.fit(Y_item, Score_item)
    modelpred = model.predict(Y_item)
    modelscore = model.score(Y_item, Score_item)
    modeldict[key] = modelfit  # SAVE EACH FITTED MODEL TO DICT
line = [u"A, 13, 0", u"B, 12, 1", u"A, 10, 0"]
for i in range(len(line)):
    Name = line[i].split(",")[0]
    Age = line[i].split(",")[1]
    Training = line[i].split(",")[2]
    model = modeldict[Name]
    ip = [float(Age), float(Training)]  # 1-D list: one sample, not yet reshaped to 2-D
    score = model.predict(ip)
    print(score)
ERROR
/opt/conda/lib/python2.7/site-packages/sklearn/utils/validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample. DeprecationWarning)
86.6666666667
/opt/conda/lib/python2.7/site-packages/sklearn/utils/validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.DeprecationWarning)
83.5320600273
/opt/conda/lib/python2.7/site-packages/sklearn/utils/validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.DeprecationWarning)
86.6666666667
/opt/conda/lib/python2.7/site-packages/sklearn/utils/validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.DeprecationWarning)
[ 86.66666667]
/opt/conda/lib/python2.7/site-packages/sklearn/utils/validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.DeprecationWarning)
[ 83.53206003]
/opt/conda/lib/python2.7/site-packages/sklearn/utils/validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample. DeprecationWarning)
[ 86.66666667]
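The warning is asking for a 2-D input of shape (n_samples, n_features). A minimal sketch of the reshaping it describes, assuming the single sample from the loop above (Age 13, Training 0); the names ip_1d, ip_2d and ip_nested are only illustrative:

```python
import numpy as np

# One test sample as a flat list: [Age, Training]
ip = [13.0, 0.0]

# A plain list has no .reshape, so convert it to an array first
ip_1d = np.asarray(ip)
print(ip_1d.shape)  # (2,)

# Reshape to one row (one sample) with two columns (two features)
ip_2d = ip_1d.reshape(1, -1)
print(ip_2d.shape)  # (1, 2)

# Equivalent without numpy: wrap the sample in an outer list
ip_nested = [ip]
```

Passing ip_2d (or ip_nested) to model.predict() should then return an array holding one prediction, so score[0] gives the scalar.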
1 Answer
Consider saving the submodels in a dictionary with Name as the key, and then use pandas.DataFrame.apply() to run the prediction on each row, matching the row's Name to the corresponding model.
NOTE: Below is untested code, but it hopefully gives a general idea that you can adjust accordingly. The main issue might be the model.predict() input and output in the defined function, runModel, used in the apply(). A two-dimensional numpy input holding the row's Age and Training values is passed to model.predict(), which hopefully returns a numpy array with one value per sample (i.e., per row). See Ridge model:
import numpy as np

modeldict = {}  # INITIALIZE DICT
grouped_df = data.groupby("Name")
for key, item in grouped_df:
    Score_item = item["Score"]
    Y_item = item[["Age", "Training"]]
    model = Ridge(alpha=1.2)
    modelfit = model.fit(Y_item, Score_item)
    modelpred = model.predict(Y_item)
    modelscore = model.score(Y_item, Score_item)
    print(modelscore)
    modeldict[key] = modelfit  # SAVE EACH FITTED MODEL TO DICT
line = [u"A, 13, 0", u"B, 12, 1", u"A, 10, 0"]
# SPLIT ON COMMAS, STRIP WHITESPACE, CONVERT NUMERIC FIELDS TO FLOAT
fields = [[f.strip() for f in rec.split(",")] for rec in line]
Name = [f[0] for f in fields]
Age = [float(f[1]) for f in fields]
Training = [float(f[2]) for f in fields]
testdata = pd.DataFrame({"Name": Name, "Age": Age, "Training": Training})
def runModel(row):
    # LOCATE MODEL BY NAME KEY
    model = modeldict[row["Name"]]
    # PREDICT VALUES FROM A 2-D INPUT (1 SAMPLE x 2 FEATURES)
    score = model.predict(np.array([[row["Age"], row["Training"]]]))
    # RETURN SCALAR FROM score ARRAY
    return score[0]
testdata["predictedScore"] = testdata.apply(runModel, axis=1)
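As an alternative to the row-wise apply(), the test rows could also be predicted one group at a time, so that each model receives a 2-D block of rows in a single call. The following is only a sketch under assumptions: it presumes the test strings have already been parsed into numeric columns, the names features and parts are made up for illustration, and every Name in the test data must also exist in the training data (otherwise the dictionary lookup raises a KeyError).

```python
import pandas as pd
from sklearn.linear_model import Ridge

data = pd.DataFrame({
    "Name": ["A", "A", "A", "B", "B", "B"],
    "Score": [90, 80, 90, 92, 87, 80],
    "Age": [10, 12, 14, 9, 11, 12],
    "Training": [0, 1, 2, 0, 1, 2],
})
features = ["Age", "Training"]  # hypothetical helper name

# Fit one Ridge model per Name and keep it in a dict keyed by Name
modeldict = {
    name: Ridge(alpha=1.2).fit(group[features], group["Score"])
    for name, group in data.groupby("Name")
}

# Test data, assumed to be already parsed into numeric columns
testdata = pd.DataFrame({
    "Name": ["A", "B", "A"],
    "Age": [13.0, 12.0, 10.0],
    "Training": [0.0, 1.0, 0.0],
})

# Predict group by group: each call hands a 2-D block of rows to the matching model
parts = []
for name, group in testdata.groupby("Name"):
    preds = modeldict[name].predict(group[features])
    parts.append(pd.Series(preds, index=group.index))

# Reassemble the predictions in the original row order via the index
testdata["predictedScore"] = pd.concat(parts).sort_index()
print(testdata)
```

Because each predict() call receives a DataFrame slice, the input is already two-dimensional and no reshaping is needed.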