Exclude missing values from the model performance calculation

Time: 2022-03-21 14:57:46

I have a dataset and I want to build a model, preferably with the caret package. My data is actually a time series, but the question is not specific to time series; I only mention it because I use createTimeSlices for the data partition.

My data has a certain number of missing values (NA), which I imputed separately, outside the caret code. I also kept a record of their locations:

# logical vector, same length as the data: TRUE marks observations that were imputed
imputed <- c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)
# recode the imputed positions (TRUE) as NA
imputed[imputed] <- NA; print(imputed)
#### [1] FALSE FALSE FALSE    NA FALSE FALSE

I know the caret train function has an option to either exclude the NAs or impute them with different techniques. That's not what I want. I need to build the model on the already imputed dataset, but I want to exclude the imputed points from the calculation of the error metrics (RMSE, MAE, ...).

I don't know how to do this in caret. In my first script I did the whole cross-validation manually, with a customized error measure:

actual <- c(5, 4, 3, 6, 7, 5)
predicted <- c(4, 4, 3.5, 7, 6.8, 4)
Metrics::rmse(actual, predicted) # with all the points
#### [1] 0.7404953
# !imputed is NA at the imputed positions, so na.rm = TRUE drops them
sqrt(mean((!imputed) * (actual - predicted)^2, na.rm = TRUE)) # excluding the imputed
#### [1] 0.676757
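
The masked version works because `!NA` is `NA`, so the product is `NA` at the imputed position and `na.rm = TRUE` drops it; the mean is then taken over the five remaining points. As a minimal base-R sketch of the same idea, assuming the mask is kept as plain TRUE/FALSE instead of being recoded to NA (the names `was_imputed` and `keep` are hypothetical, not from the original script):

```r
# Illustrative data copied from the question.
actual    <- c(5, 4, 3, 6, 7, 5)
predicted <- c(4, 4, 3.5, 7, 6.8, 4)

# Hypothetical name: a plain TRUE/FALSE mask, TRUE where the value was imputed.
was_imputed <- c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)

# Keep only the observed (non-imputed) points, then compute the metrics.
keep <- !was_imputed
err  <- actual[keep] - predicted[keep]

rmse_observed <- sqrt(mean(err^2))
mae_observed  <- mean(abs(err))

print(rmse_observed)
print(mae_observed)
```

Subsetting before computing the metric avoids relying on `na.rm`, so a genuine NA in the predictions would still surface instead of being silently dropped.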

How can I do this in caret? Or is there another way to avoid coding everything by hand?

1 solution

#1


4  

I don't know if this is what you are looking for, but here is a simple solution using a helper function.

# indices of the non-imputed observations; which() drops the NA entries
i <- which(imputed == FALSE)

# apply any metric function to the non-imputed points only
metric_na <- function(fun, actual, predicted, index) {
    fun(actual[index], predicted[index])
}

metric_na(Metrics::rmse, actual, predicted, index = i)
#### [1] 0.676757
metric_na(Metrics::mae, actual, predicted, index = i)
#### [1] 0.54

You can also use the index directly when calculating the desired metrics.

Metrics::rmse(actual[i], predicted[i])
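
To get the same exclusion inside caret's resampling, one possible route is a custom `summaryFunction` passed to `trainControl()`. The sketch below is not from the original answer. It assumes that the data frame caret hands to the summary function carries a `rowIndex` column (row positions in the training set) next to `obs` and `pred`; `was_imputed`, `masked_summary`, `my_data` and `y` are hypothetical names.

```r
# Hypothetical mask: TRUE where the row of the training data was imputed.
was_imputed <- c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)

# Summary function in the shape caret expects:
# function(data, lev = NULL, model = NULL) returning a named numeric vector.
# Assumption: `data` has a `rowIndex` column pointing to training-set rows.
masked_summary <- function(data, lev = NULL, model = NULL) {
  keep <- !was_imputed[data$rowIndex]
  err  <- data$obs[keep] - data$pred[keep]
  c(RMSE = sqrt(mean(err^2)), MAE = mean(abs(err)))
}

# Stand-alone check on a hand-built hold-out set (no caret needed).
holdout <- data.frame(
  obs      = c(5, 4, 3, 6, 7, 5),
  pred     = c(4, 4, 3.5, 7, 6.8, 4),
  rowIndex = 1:6
)
result <- masked_summary(holdout)
print(result)

# With caret installed, the function would be passed to trainControl().
if (requireNamespace("caret", quietly = TRUE)) {
  ctrl <- caret::trainControl(
    method = "timeslice",
    initialWindow = 3, horizon = 1, fixedWindow = TRUE,
    summaryFunction = masked_summary
  )
  # Hypothetical call; `my_data` and `y` are placeholders:
  # fit <- caret::train(y ~ ., data = my_data, method = "lm",
  #                     trControl = ctrl, metric = "RMSE", maximize = FALSE)
}
```

One caveat with this design: if every hold-out point of a slice was imputed (easy to hit with `horizon = 1`), the metrics for that slice are `NaN`, so such slices need a longer horizon or explicit handling.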