Use pandas groupby to create submodels and locate each model with test data

Date: 2021-10-10 07:35:57

I have a pandas dataframe in which values in a column are used as the group-by basis to create submodels.

import pandas as pd
from sklearn.linear_model import Ridge

data = pd.DataFrame({"Name": ["A", "A", "A", "B", "B", "B"], "Score": [90, 80, 90, 92, 87, 80], "Age": [10, 12, 14, 9, 11, 12], "Training": [0, 1, 2, 0, 1, 2]})

"Name" is the basis for creating a submodel for each individual. I want to use the variables "Age" and "Training" to predict the "Score" of one individual "Name" (i.e., "A" and "B" in this case). That is, if I have "A" and know the "Age" and "Training" of "A", I would like to use "A", "Age", and "Training" to predict "Score". However, "A" should be used to access the model that "A" belongs to, rather than any other model.

grouped_df = data.groupby('Name')
for key, item in grouped_df:
    Score = grouped_df['Score']
    Y = grouped_df[['Age', 'Training']]   # selecting several columns needs a list (double brackets)
    Score_item = Score.get_group(key)
    Y_item = Y.get_group(key)
    model = Ridge(alpha = 1.2)
    modelfit = model.fit(Y_item, Score_item)
    modelpred = model.predict(Y_item)
    modelscore = model.score(Y_item, Score_item)
    print(modelscore)

Up to this point, I have built simple Ridge models for subgroups A and B.

My question is, with test data as below:

test_data = [u"A, 13, 0", u"B, 12, 1", u"A, 10, 0"]  # each element holds `Name`, `Age` and `Training`, in that order

How do I feed this data to the prediction models? So far I have:

line = test_data
# Treat commas as whitespace so every element splits into exactly three fields
fields = [s.replace(",", " ").split() for s in line]
Name = [f[0] for f in fields]
Age = [float(f[1]) for f in fields]            # convert to numbers for the model
Training = [float(f[2]) for f in fields]
Y = pd.DataFrame({"Name": Name, "Age": Age, "Training": Training})

This gives me a pandas dataframe of the test data. However, I am not sure how to proceed from here and feed the test data to the models. I would highly appreciate your help. Thank you!

UPDATE

After adopting Parfait's code, things look better. However, I did not create another pandas dataframe for the test data (I am not sure how to handle the rows there). Instead, I feed in the test values by splitting the strings. I get the warning shown below. I searched and found a related post, "Preprocessing in scikit learn - single sample - Depreciation warning". I tried to reshape the test data, but it is a list, so it has no reshape attribute. I think I am misunderstanding something. I would highly appreciate it if you could let me know how to fix this. Thank you.

import pandas as pd
from sklearn.linear_model import Ridge
import numpy as np

data = pd.DataFrame({"Name": ["A", "A", "A", "B", "B", "B"], "Score": [90, 80, 90, 92, 87, 80], "Age": [10, 12, 14, 9, 11, 12], "Training": [0, 1, 2, 0,$


modeldict = {}                                           # INITIALIZE DICT
grouped_df = data.groupby('Name')                        # STRING KEY KEEPS GROUP KEYS SCALAR ("A", "B")

for key, item in grouped_df:
    Score = grouped_df['Score']
    Y = grouped_df[['Age', 'Training']]
    Score_item = Score.get_group(key)
    Y_item = Y.get_group(key)
    model = Ridge(alpha = 1.2)
    modelfit = model.fit(Y_item, Score_item)
    modelpred = model.predict(Y_item)
    modelscore = model.score(Y_item, Score_item)
    modeldict[key] = modelfit                            # SAVE EACH FITTED MODEL TO DICT


line = [u"A, 13, 0", u"B, 12, 1", u"A, 10, 0"]
Name = [line[i].split(",")[0] for i in range(len(line))]
Age = [line[i].split(",")[1] for i in range(len(line))]
Training = [line[i].split(",")[2] for i in range(len(line))]


for i in range(len(line)):
    Name = line[i].split(",")[0]
    Age = line[i].split(",")[1]
    Training = line[i].split(",")[2]
    model = modeldict[Name]
    ip = [float(Age), float(Training)]
    score = model.predict(ip)

    print(score)

ERROR

/opt/conda/lib/python2.7/site-packages/sklearn/utils/validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample. DeprecationWarning)
86.6666666667
/opt/conda/lib/python2.7/site-packages/sklearn/utils/validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.DeprecationWarning)
83.5320600273
/opt/conda/lib/python2.7/site-packages/sklearn/utils/validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.DeprecationWarning)
86.6666666667
/opt/conda/lib/python2.7/site-packages/sklearn/utils/validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.DeprecationWarning)
[ 86.66666667]
/opt/conda/lib/python2.7/site-packages/sklearn/utils/validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.DeprecationWarning)
[ 83.53206003]
/opt/conda/lib/python2.7/site-packages/sklearn/utils/validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample. DeprecationWarning)
[ 86.66666667]
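
The warning above asks for 2-D input of shape (n_samples, n_features). A plain Python list has no reshape method, but it can be converted to a NumPy array first. The following is only a minimal sketch under stated assumptions: it refits a model for individual "A" from the question's data, and the name `ip_2d` is introduced here for illustration. It fits on plain arrays so that no column names are attached to the model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

# Training rows for individual "A", copied from the question's data.
train = pd.DataFrame({"Score": [90, 80, 90], "Age": [10, 12, 14], "Training": [0, 1, 2]})

# Fit on plain NumPy arrays so prediction with a plain array matches the training input type.
X_train = train[["Age", "Training"]].to_numpy()
y_train = train["Score"].to_numpy()
model = Ridge(alpha=1.2).fit(X_train, y_train)

ip = [13.0, 0.0]                        # one sample as a flat (1-D) list
ip_2d = np.asarray(ip).reshape(1, -1)   # shape (1, 2): one sample, two features
score = model.predict(ip_2d)            # array holding a single prediction

print(ip_2d.shape, score.shape)
```

Wrapping the list in another list, as in `model.predict([ip])`, should give the same one-row input without calling reshape.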

1 Answer

Consider saving the submodels in a dictionary with Name as the key, then use pandas.DataFrame.apply() to process each row, matching the row's Name to its corresponding model.

NOTE: The code below is untested, but it should give a general idea that you can adjust as needed. The main issue might be the input and output of model.predict() inside the function runModel, which is passed to apply(). A 2-D numpy array holding the row's Age and Training values is passed to model.predict(), which hopefully returns a numpy array with one prediction per sample (i.e., per row). See Ridge model:

import numpy as np

modeldict = {}                                           # INITIALIZE DICT
grouped_df = data.groupby('Name')                        # STRING KEY KEEPS GROUP KEYS SCALAR

for key, item in grouped_df:
    # item IS THE SUB-DATAFRAME FOR THIS NAME
    Score_item = item['Score']
    Y_item = item[['Age', 'Training']]
    model = Ridge(alpha = 1.2)
    modelfit = model.fit(Y_item, Score_item)
    modelpred = model.predict(Y_item)
    modelscore = model.score(Y_item, Score_item)
    print(modelscore)

    modeldict[key] = modelfit                            # SAVE EACH FITTED MODEL TO DICT

line = [u"A, 13, 0", u"B, 12, 1", u"A, 10, 0"]
# SPLIT ON COMMAS AND STRIP SPACES SO NAMES MATCH THE DICT KEYS
fields = [[p.strip() for p in s.split(",")] for s in line]
Name = [f[0] for f in fields]
Age = [float(f[1]) for f in fields]                      # NUMERIC VALUES FOR THE MODEL
Training = [float(f[2]) for f in fields]

testdata = pd.DataFrame({"Name": Name, "Age": Age, "Training": Training})

def runModel(row):
    # LOCATE MODEL BY NAME KEY
    model = modeldict[row['Name']]
    # PREDICT VALUES; predict() EXPECTS 2-D INPUT OF SHAPE (n_samples, n_features)
    score = model.predict(np.array([[row['Age'], row['Training']]]))
    # RETURN SCALAR FROM score ARRAY
    return score[0]

testdata['predictedScore'] = testdata.apply(runModel, axis=1)
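
As an alternative to the row-wise apply() above, the test rows can be grouped by Name so that each submodel predicts all of its rows in one call. This is only a sketch under stated assumptions: the test values are the question's strings, assumed already parsed into numbers, and the names `features`, `models`, and `pieces` are introduced here for illustration.

```python
import pandas as pd
from sklearn.linear_model import Ridge

data = pd.DataFrame({
    "Name": ["A", "A", "A", "B", "B", "B"],
    "Score": [90, 80, 90, 92, 87, 80],
    "Age": [10, 12, 14, 9, 11, 12],
    "Training": [0, 1, 2, 0, 1, 2],
})
features = ["Age", "Training"]

# One fitted Ridge model per Name, keyed by the group label.
models = {
    name: Ridge(alpha=1.2).fit(grp[features], grp["Score"])
    for name, grp in data.groupby("Name")
}

# Test rows, assumed already parsed from the question's strings.
testdata = pd.DataFrame({
    "Name": ["A", "B", "A"],
    "Age": [13.0, 12.0, 10.0],
    "Training": [0.0, 1.0, 0.0],
})

# Predict group by group, keeping each row's original index so results align.
pieces = [
    pd.Series(models[name].predict(grp[features]), index=grp.index)
    for name, grp in testdata.groupby("Name")
]
testdata["predictedScore"] = pd.concat(pieces).sort_index()

print(testdata)
```

Passing the same DataFrame columns to fit() and predict() keeps the input 2-D, so no reshaping is needed. A test row whose Name has no fitted model would raise a KeyError in this sketch.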
