Cluster Analysis using R for large data sample

Time: 2021-02-10 20:11:35

I am just starting out with segmenting a customer database for an e-commerce retail business using R, and I would like some guidance on the best approach for this exercise. I have searched the topics already posted here and tried some of the suggestions myself, such as dist() and hclust(). However, I keep running into one issue or another and cannot get past them because I am new to R. Here is a brief description of my problem. I have approximately 480K records of customers who have made a purchase so far. The data contains the following columns:

  • email id
  • gender
  • city
  • total transactions so far
  • average basket value
  • average basket size (number of items purchased in one transaction)
  • average discount claimed per transaction
  • number of days since the user's first purchase
  • average duration between two purchases
  • number of days since the last transaction

The business goal of this exercise is to identify the most profitable segments and encourage repeat purchases in those segments through campaigns. Can I please get some guidance on how to do this successfully without running into problems such as the size of the sample or the data types of the columns?

2 solutions

#1


1  

Read up on how to subset data frames. When you try to define d, it looks like you are passing in far too much data, which might be fixed by subsetting your table first. If that is not enough, you might want to take a random sample of your data instead of using all of it. Suppose you know that columns 4 through 10 of your data frame, called cust_data, contain numerical data; then you might try this:

cust_data2 <- cust_data[, 4:10]  # keep only the numeric columns
d <- dist(cust_data2)            # pairwise Euclidean distances between rows
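
To see why dist() struggles with all 480K rows: it stores n(n-1)/2 pairwise distances as 8-byte doubles, which for n = 480,000 comes to roughly 920 GB. Below is a minimal sketch of the random-sample route. The simulated cust_data, its column names, and the sample size of 2,000 are stand-ins chosen for illustration, not details from the question.

```r
# How many distances would dist() have to store for all 480K customers?
n_full <- 480000
n_pairs <- n_full * (n_full - 1) / 2   # number of pairwise distances
gb_needed <- n_pairs * 8 / 1e9         # 8 bytes per double, in GB (about 920)

# Simulated stand-in for cust_data: 3 non-numeric columns, then 7 numeric ones
set.seed(42)
n <- 10000
cust_data <- data.frame(
  email  = paste0("user", seq_len(n), "@example.com"),
  gender = sample(c("F", "M"), n, replace = TRUE),
  city   = sample(c("A", "B", "C"), n, replace = TRUE),
  matrix(rexp(n * 7), ncol = 7)        # becomes columns 4 to 10 (numeric)
)

# Draw a random sample of rows, keep the numeric columns, then compute distances
idx <- sample(nrow(cust_data), 2000)
cust_sample <- cust_data[idx, 4:10]
d <- dist(cust_sample)
```

A sample of a few thousand rows keeps the distance object in the tens of megabytes, which is usually manageable on a laptop.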

For large values, you may want to log-transform them; just experiment and see what makes sense. I am really not sure about this, so treat it only as a suggestion. Choosing a more appropriate clustering method or distance metric might work better.
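
One way to try the log-transform idea is sketched below. Because dist() uses Euclidean distance by default, columns measured on large scales (such as basket value) would otherwise dominate the result, so the sketch also standardizes the columns with scale(). The column names and simulated values are hypothetical.

```r
# Hypothetical numeric columns with very different scales
set.seed(1)
cust_num <- data.frame(
  avg_basket_value   = rlnorm(500, meanlog = 4, sdlog = 1),  # right-skewed amounts
  total_transactions = rpois(500, lambda = 3) + 1,
  days_since_last    = sample(0:365, 500, replace = TRUE)
)

# log1p(x) = log(1 + x) also handles zeros; apply it to the skewed columns only
cust_num$avg_basket_value   <- log1p(cust_num$avg_basket_value)
cust_num$total_transactions <- log1p(cust_num$total_transactions)

# Standardize every column to mean 0 and standard deviation 1
cust_scaled <- scale(cust_num)
d <- dist(cust_scaled)
```

Whether the transform helps depends on the data, so it is worth comparing the resulting clusters with and without it.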

Finally, when you run hclust, you need to pass in the distance object d, not the original data set.

h <- hclust(d, method = "average")  # average-linkage hierarchical clustering
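
Once the tree is built, cutree() turns it into segment labels and aggregate() gives a per-segment profile. This is a sketch on simulated data; k = 4 is an arbitrary choice and the column names are hypothetical.

```r
set.seed(7)
cust_sample <- data.frame(
  avg_basket_value   = rexp(300, rate = 1 / 50),
  total_transactions = rpois(300, lambda = 3) + 1
)

d <- dist(scale(cust_sample))
h <- hclust(d, method = "average")

# Cut the dendrogram into k groups; returns one integer label per row
segment <- cutree(h, k = 4)

# Mean of each variable within each segment
profile <- aggregate(cust_sample, by = list(segment = segment), FUN = mean)
print(profile)
```

Comparing the segment means is one simple way to judge which groups look most profitable.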

#2


0  

Sadly your data does not contain any attributes that indicate what types of items/transactions did NOT result in a sale.

I am not sure if clustering is the way to go here.

Here are some ideas:

First split your data into a training set (say 70%) and a test set.
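
A 70/30 split can be sketched as follows. The simulated custdata and its column names are placeholders; the names trainset and testset are chosen to match the predict() call used later in this answer.

```r
set.seed(123)
# Simulated stand-in for the customer table (hypothetical columns)
custdata <- data.frame(
  averagebasketvalue = rnorm(1000, mean = 50, sd = 10),
  totaltransactions  = rpois(1000, lambda = 3)
)

# Randomly pick 70% of the row indices for training; the rest form the test set
train_idx <- sample(nrow(custdata), size = round(0.7 * nrow(custdata)))
trainset <- custdata[train_idx, ]
testset  <- custdata[-train_idx, ]
```

Setting the seed makes the split reproducible between runs.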

Set up a basic linear regression model with, say, "average basket value" as the response variable and all other variables as independent variables.

fit <- lm(averagebasketvalue ~ ., data = custdata)

Fit the model on the training set, determine the significant attributes (those with at least one star in the summary(fit) output), then focus on those variables.
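
Instead of reading the stars by eye, the same information can be pulled out of the coefficient table. This is a sketch on simulated data with hypothetical column names; one star in the summary() output corresponds to p < 0.05.

```r
set.seed(99)
n <- 500
trainset <- data.frame(
  totaltransactions = rpois(n, lambda = 3),
  avgdiscount       = runif(n, 0, 20),
  noise             = rnorm(n)   # unrelated to the response by construction
)
trainset$averagebasketvalue <- 30 + 5 * trainset$totaltransactions -
  0.8 * trainset$avgdiscount + rnorm(n, sd = 5)

fit <- lm(averagebasketvalue ~ ., data = trainset)

# Coefficient table: estimate, standard error, t value, p-value
coefs <- summary(fit)$coefficients

# Keep the predictors whose p-value is below 0.05 (at least one star)
signif_vars <- rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]
signif_vars <- setdiff(signif_vars, "(Intercept)")
print(signif_vars)
```

With several hundred thousand rows, even very small effects tend to earn stars, so it may be worth looking at the size of each estimate as well as its p-value.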

Check how well the fitted model generalizes by calculating R-squared and the sum of squared errors (SSE) on the test set. You can use the predict() function; the calls will look like this:

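fitpred <- predict(fit, newdata = testset)
sse <- sum((testset$averagebasketvalue - fitpred)^2)  # SSE on the test set
sst <- sum((testset$averagebasketvalue - mean(testset$averagebasketvalue))^2)
r2 <- 1 - sse / sst  # test-set R²; summary(fitpred) would only summarize the predictions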

Maybe "city" contains too many unique values to be meaningful. Try to generalize them by introducing a new attribute CityClass (e.g. BigCity-MediumCity-SmallCity ... or whatever classification scheme is useful for your cities). You might also condition the model on "gender". Drop "email id".
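
A lookup vector is one simple way to build such a CityClass column. The city names and their class assignments below are made up for illustration; substitute whatever scheme suits your cities.

```r
# Hypothetical cities; replace with the values in your own data
custdata <- data.frame(
  city = c("Alphaville", "Betatown", "Gammaburg", "Alphaville", "Deltaport"),
  stringsAsFactors = FALSE
)

# Named lookup vector: city -> class (made-up assignments)
city_lookup <- c(
  Alphaville = "BigCity",
  Betatown   = "MediumCity",
  Gammaburg  = "SmallCity"
)

# Cities missing from the lookup come back as NA; label them "Other"
city_class <- unname(city_lookup[custdata$city])
city_class[is.na(city_class)] <- "Other"
custdata$CityClass <- factor(city_class)

table(custdata$CityClass)
```

A factor with a handful of levels adds only a few dummy variables to the regression, whereas the raw city column would add one per distinct city.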

This can go on for a while... play with the model to try to get better R-squared and SSEs.
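
To make that iteration less tedious, the test-set metrics can be wrapped in a small helper and used to compare candidate models side by side. The helper name test_metrics, the simulated data, and the column names are all hypothetical.

```r
# Helper: SSE and R-squared of a fitted model on held-out data
test_metrics <- function(model, testset, response) {
  pred   <- predict(model, newdata = testset)
  actual <- testset[[response]]
  sse <- sum((actual - pred)^2)
  sst <- sum((actual - mean(actual))^2)
  c(SSE = sse, R2 = 1 - sse / sst)
}

set.seed(5)
n <- 1000
custdata <- data.frame(
  totaltransactions = rpois(n, lambda = 3),
  avgdiscount       = runif(n, 0, 20)
)
custdata$averagebasketvalue <- 30 + 5 * custdata$totaltransactions -
  0.8 * custdata$avgdiscount + rnorm(n, sd = 5)

train_idx <- sample(n, 700)
trainset <- custdata[train_idx, ]
testset  <- custdata[-train_idx, ]

# Two candidate models fitted on the training set only
fit_small <- lm(averagebasketvalue ~ totaltransactions, data = trainset)
fit_full  <- lm(averagebasketvalue ~ ., data = trainset)

rbind(
  small = test_metrics(fit_small, testset, "averagebasketvalue"),
  full  = test_metrics(fit_full,  testset, "averagebasketvalue")
)
```

A lower SSE and a higher R-squared on the test set suggest the better model; both numbers are computed on rows the model has not seen.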

I think a tree-based model (rpart) might also work well here.
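
A regression tree with rpart could look like the sketch below. The data is simulated and the column names are hypothetical; rpart is a recommended package that ships with standard R installations.

```r
library(rpart)

set.seed(2021)
n <- 1000
trainset <- data.frame(
  totaltransactions = rpois(n, lambda = 3),
  gender            = factor(sample(c("F", "M"), n, replace = TRUE))
)
# Response jumps for customers with more than 3 transactions, plus noise
trainset$averagebasketvalue <- ifelse(trainset$totaltransactions > 3, 80, 40) +
  rnorm(n, sd = 5)

# method = "anova" fits a regression tree for a numeric response
tree <- rpart(averagebasketvalue ~ ., data = trainset, method = "anova")
print(tree)

# Each leaf predicts the mean response of the customers that fall into it
pred <- predict(tree, newdata = trainset)
```

Each leaf of the tree is a group of customers defined by simple rules, so the leaves can themselves be read as candidate segments.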

You could then switch to cluster analysis at a later stage.
