SMOTE initialization expects n_neighbors <= n_samples

Time: 2022-03-05 21:22:13

I have already pre-cleaned the data, and below is the format of the first five rows:

     [IN] df.head()

    [OUT]   Year    cleaned
         0  1909    acquaint hous receiv follow letter clerk crown...
         1  1909    ask secretari state war whether issu statement...
         2  1909    i beg present petit sign upward motor car driv...
         3  1909    i desir ask secretari state war second lieuten...
         4  1909    ask secretari state war whether would introduc...

I have called train_test_split() as follows:

     [IN] X_train, X_test, y_train, y_test = train_test_split(df['cleaned'], df['Year'], random_state=2)
   [Note*] `X_train` and `y_train` are now pandas.core.series.Series of shape (1785,), and `X_test` and `y_test` are also pandas.core.series.Series, of shape (595,)

I have then vectorized the X training and testing data using the following TfidfVectorizer and fit/transform procedures:

     [IN] v = TfidfVectorizer(decode_error='replace', encoding='utf-8', stop_words='english', ngram_range=(1, 1), sublinear_tf=True)
          X_train = v.fit_transform(X_train)
          X_test = v.transform(X_test)

I'm now at the stage where I would normally apply a classifier (if this were a balanced dataset). Because it is not, I instead initialize imblearn's SMOTE() class to perform over-sampling...

     [IN] smote_pipeline = make_pipeline_imb(SMOTE(), classifier(random_state=42))
          smote_model = smote_pipeline.fit(X_train, y_train)
          smote_prediction = smote_model.predict(X_test)

... but this results in:

     [OUT] ValueError: Expected n_neighbors <= n_samples, but n_samples = 5, n_neighbors = 6

I've tried reducing n_neighbors, but to no avail. Any tips or advice would be much appreciated. Thanks for reading.

------------------------------------------------------------------------------------------------------------------------------------

EDIT:

Full Traceback

The dataset/dataframe (df) contains 2380 rows across two columns, as shown in df.head() above. X_train contains 1785 of these rows in the format of a list of strings (df['cleaned']) and y_train also contains 1785 rows in the format of strings (df['Year']).

Post-vectorization using TfidfVectorizer(): X_train and X_test are converted from pandas.core.series.Series of shape '(1785,)' and '(595,)' respectively, to scipy.sparse.csr.csr_matrix of shape '(1785, 126459)' and '(595, 126459)' respectively.

As for the number of classes: using Counter(), I've calculated that there are 199 classes (years). Each instance of a class is attached to one element of the aforementioned df['cleaned'] data, which contains a list of strings extracted from a textual corpus.
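A count like this can be reproduced along the following lines. This is only a sketch: the helper name `classes_below` is hypothetical, the labels in the usage lines are made up, and `labels` stands for any iterable of year labels such as `y_train`.

```python
from collections import Counter


def classes_below(labels, threshold=6):
    """Return {label: count} for every class with fewer than `threshold` samples."""
    counts = Counter(labels)
    return {label: n for label, n in counts.items() if n < threshold}


if __name__ == "__main__":
    # Made-up labels standing in for y_train.
    y = [1909] * 5 + [1910] * 9 + [1911] * 2
    print(len(Counter(y)))      # number of distinct classes
    print(classes_below(y))     # classes too small for a 6-neighbour query
```

Any class listed by `classes_below` has fewer samples than the n_neighbors = 6 reported in the error.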

The objective of this process is to automatically determine/guess the year, decade or century (any degree of classification will do!) of input textual data based on the vocabulary present.

1 answer

#1


Since there are approximately 200 classes and 1800 samples in the training set, you have on average 9 samples per class. The reason for the error message is that a) the data are probably not perfectly balanced, so some classes have fewer than 6 samples (the error reports a class with n_samples = 5), and b) the number of neighbors requested is 6: SMOTE's default k_neighbors=5 plus the sample itself. A few solutions for your problem:

  1. Calculate the minimum number of samples (n_samples) among the 199 classes and set the k_neighbors parameter of the SMOTE class so that k_neighbors + 1 is less than or equal to n_samples.

  2. Exclude the classes with n_samples < n_neighbors from oversampling using the ratio parameter of the SMOTE class (called sampling_strategy since imbalanced-learn 0.4).

  3. Use the RandomOverSampler class, which does not have a similar restriction.

  4. Combine solutions 2 and 3: create a pipeline that uses SMOTE and RandomOverSampler such that the condition n_neighbors <= n_samples is satisfied for the classes handled by SMOTE, and random oversampling is used for the classes where it is not.
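The first and last suggestions above could be sketched roughly as follows. Treat this as an illustration under stated assumptions, not a tested recipe: the helper names `safe_k_neighbors`, `split_strategies` and `build_pipeline` are hypothetical, the labels in the usage lines are made up, and the pipeline part assumes an imbalanced-learn version (0.4 or later) in which `SMOTE` and `RandomOverSampler` accept a `sampling_strategy` dict mapping class labels to target sample counts.

```python
from collections import Counter


def safe_k_neighbors(labels, default_k=5):
    """Largest k_neighbors (capped at default_k) that every class supports.

    SMOTE queries k_neighbors + 1 points per class (the sample itself plus
    its neighbours), so k_neighbors must be at most smallest class size - 1.
    """
    smallest = min(Counter(labels).values())
    if smallest < 2:
        raise ValueError("a class with a single sample has no neighbour to interpolate with")
    return min(default_k, smallest - 1)


def split_strategies(labels, k_neighbors=5):
    """Per-class targets for a RandomOverSampler -> SMOTE two-stage pipeline.

    Stage 1 randomly duplicates samples of classes that are too small for
    SMOTE until they hold k_neighbors + 1 samples. Stage 2 lets SMOTE raise
    every remaining minority class to the largest class size.
    """
    counts = Counter(labels)
    ros_targets = {c: k_neighbors + 1 for c, n in counts.items() if n < k_neighbors + 1}
    # Class sizes as they will be once stage 1 has run.
    after_ros = {c: max(n, k_neighbors + 1) for c, n in counts.items()}
    target = max(after_ros.values())
    smote_targets = {c: target for c, n in after_ros.items() if n < target}
    return ros_targets, smote_targets


def build_pipeline(labels, classifier, k_neighbors=5):
    """Assemble the two-stage pipeline (assumes imbalanced-learn is installed)."""
    # Imported lazily so the helpers above run without imbalanced-learn.
    from imblearn.over_sampling import SMOTE, RandomOverSampler
    from imblearn.pipeline import make_pipeline as make_pipeline_imb

    ros_targets, smote_targets = split_strategies(labels, k_neighbors)
    return make_pipeline_imb(
        RandomOverSampler(sampling_strategy=ros_targets, random_state=42),
        SMOTE(sampling_strategy=smote_targets, k_neighbors=k_neighbors, random_state=42),
        classifier,
    )


if __name__ == "__main__":
    # Made-up labels standing in for y_train.
    y = [1909] * 5 + [1910] * 9 + [1911] * 2
    print(safe_k_neighbors(y))
    print(split_strategies(y))
```

With these helpers, the simplest fix would be `SMOTE(k_neighbors=safe_k_neighbors(y_train))`, while `build_pipeline(y_train, some_classifier)` keeps the default neighbourhood size for the larger classes. Note that classes padded by random duplication give SMOTE identical points to interpolate between, so their synthetic samples add little variety.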
