R machine learning packages for handling factors with many levels

Time: 2021-06-28 16:11:37

I'm trying to do some machine learning stuff that involves a lot of factor-type variables (words, descriptions, times, basically non-numeric stuff). I usually rely on randomForest, but it doesn't work with factors that have more than 32 levels.


Can anyone suggest some good alternatives?


3 Solutions

#1


15  

Tree methods won't work as-is, because the number of possible splits increases exponentially with the number of levels (a factor with c levels admits 2^(c-1) - 1 distinct binary partitions). However, with words this is typically addressed by creating indicator variables for each word (of the description etc.) - that way a split can use one word at a time (yes/no) instead of picking among all possible combinations. In general you can always expand levels into indicators (and some models, such as glm, do that implicitly). The same is true in ML when handling text with other methods such as SVMs. So the answer may be that you need to think about your input data structure, not so much the methods. Alternatively, if you have some kind of order on the levels, you can linearize them (so there are only c-1 splits).

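For instance, here is a minimal sketch of that indicator expansion in base R; the data frame `df` and its `word` column are made up for illustration:

    # Hypothetical data: a factor predictor with several levels
    df <- data.frame(y    = rnorm(6),
                     word = factor(c("cat", "dog", "cat", "fish", "dog", "cat")))

    # model.matrix() expands the factor into one 0/1 column per level
    # ("- 1" drops the intercept so no level is treated as a baseline);
    # a tree can then split on one word at a time.
    X <- model.matrix(~ word - 1, data = df)
    head(X)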

#2


5  

In general, the best package I've found for situations where there are lots of factor levels is gbm.


It can handle up to 1024 factor levels.

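As a rough sketch of how that looks (the data frame `df`, numeric response `y`, and factor `cat` here are hypothetical):

    library(gbm)

    # Fit a boosted tree model; gbm accepts factors with up to 1024 levels.
    fit <- gbm(y ~ cat, data = df,
               distribution = "gaussian",   # squared-error loss for a numeric y
               n.trees = 500, interaction.depth = 3)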

If there are more than 1024 levels, I usually change the data by keeping the 1023 most frequently occurring factor levels and then coding all the remaining levels as a single level.

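That recoding step might look like the following sketch, where `x` is a hypothetical factor and the `keep = 1023` cutoff comes from gbm's limit (the forcats package's fct_lump_n() does the same thing in one call):

    # Keep the `keep` most frequent levels and pool the rest into "other"
    collapse_levels <- function(x, keep = 1023) {
      top <- names(sort(table(x), decreasing = TRUE))[seq_len(min(keep, nlevels(x)))]
      factor(ifelse(x %in% top, as.character(x), "other"))
    }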

#3


3  

There is nothing wrong in theory with applying randomForest's method to class variables that have more than 32 classes - it's computationally expensive, but not impossible, to handle any number of classes with the randomForest methodology. However, the standard R randomForest package sets 32 as the maximum number of classes for a given class variable, and thus refuses to run on anything where any class variable has more than 32 classes.


Linearizing the variable is a very good suggestion - I've used the method of ranking the classes, then breaking them up evenly into 32 meta-classes. So if there are actually 64 distinct classes, meta-class 1 consists of everything in classes 1 and 2, and so on. The only problem is figuring out a sensible way of doing the ranking - and if you're working with, say, words, it's very difficult to know how each word should be ranked against every other word.

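Assuming you already have a ranking (encoded as the order of the factor's levels), the binning itself is short; this is just a sketch with a hypothetical factor `x`:

    # Fold ranked classes evenly into `n_meta` meta-classes; with 64
    # levels and n_meta = 32, each meta-class absorbs two classes.
    to_meta <- function(x, n_meta = 32) {
      rank <- as.integer(x)                      # position in the level order
      bins <- cut(rank, breaks = n_meta, labels = FALSE)
      factor(paste0("meta", bins))
    }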

A way around this is to make n different prediction sets, where each set contains all instances having some particular subset of 31 of the classes of each class variable with more than 32 classes. You can fit a model on every set, then use the variable importance measures that come with the package to find the run whose classes were most predictive. Once you've uncovered the 31 most predictive classes, fit a new version of RF on all the data, recoding those most predictive classes as 1 through 31 and everything else as an 'other' class. This gives you at most 32 classes for the categorical variable while hopefully preserving much of the predictive power.

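A condensed sketch of that search, assuming a hypothetical classification data frame `df` with factor response `y` and one over-large factor `cat` (and simplifying to picking the single most predictive subset rather than re-ranking individual classes):

    library(randomForest)

    lev    <- levels(df$cat)
    chunks <- split(lev, ceiling(seq_along(lev) / 31))   # subsets of <= 31 classes

    scores <- sapply(chunks, function(cls) {
      sub <- droplevels(df[df$cat %in% cls, ])           # instances in this subset
      fit <- randomForest(y ~ ., data = sub, importance = TRUE)
      importance(fit)["cat", "MeanDecreaseGini"]         # how predictive cat was
    })

    best <- chunks[[which.max(scores)]]                  # most predictive subset
    df$cat <- factor(ifelse(df$cat %in% best, as.character(df$cat), "other"))
    final  <- randomForest(y ~ ., data = df)             # now <= 32 classes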

Good luck!

