RDataMining Series: Chapter 4 Decision Trees -- Decision Tree Implementation (to be continued)

Date: 2021-04-14 16:21:33

 


 

*****************

4.1    Building Decision Trees with Package party

*****************

Data: the iris dataset

Goal:

Use Sepal.Length, Sepal.Width, Petal.Length and Petal.Width to predict the Species of flowers.

Preprocessing:

Split the data into a training set and a test set:

> # fix the random seed so that the split is reproducible
> set.seed(1234)
> ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.7, 0.3))
> trainData <- iris[ind==1,]
> testData <- iris[ind==2,]
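
A quick sanity check of the split: table(ind) counts how many rows went to each subset, and the exact counts depend on the random seed.

> # rows per subset: 1 = training, 2 = test
> table(ind)
> nrow(trainData); nrow(testData)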

Next, we:

1. load package party,

2. build a decision tree,

3. and check the prediction.

> library(party)
> myFormula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
> iris_ctree <- ctree(myFormula, data=trainData)
> # check the prediction
> table(predict(iris_ctree), trainData$Species)

             setosa versicolor virginica
  setosa         40          0         0
  versicolor      0         37         3
  virginica       0          1        31

 

Next we analyse the result, i.e., the decision tree that has been built:

>  print(iris_ctree)

[screenshot in the original post: the printed tree structure with its split rules]

 

>  plot(iris_ctree)

[screenshot in the original post: the plotted tree]

 

>  plot(iris_ctree,  type="simple")

[screenshot in the original post: the tree plotted in simple style]
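
Besides plotting, predict() on a ctree model accepts a type argument; type="prob" returns class probabilities and type="node" returns the terminal node that each observation falls into:

> # class probabilities for the first few training instances
> predict(iris_ctree, type="prob")[1:3]
> # terminal node IDs for the first few training instances
> predict(iris_ctree, type="node")[1:5]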

 

Now apply the decision tree to the test data and check the predictions:

> # predict on test data
> testPred <- predict(iris_ctree, newdata = testData)
> table(testPred, testData$Species)
testPred     setosa versicolor virginica
  setosa         10          0         0
  versicolor      0         12         2
  virginica       0          0        14
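
The confusion matrix can be condensed into a single accuracy figure: mean() over the logical comparison gives the proportion of correctly classified test instances, which from the table above is (10+12+14)/38 ≈ 0.95.

> # overall accuracy on the test set
> mean(testPred == testData$Species)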

 

Issues to note:

The current version of ctree (i.e., version 0.9-9995) does not handle missing values well. An instance with a missing value may sometimes go to the left sub-tree and sometimes to the right.

Another issue is that, when a variable exists in the training data and is fed into ctree but does not appear in the built decision tree, the test data must still contain that variable for prediction; otherwise, a call to predict would fail. Moreover, if the levels of a categorical variable in the test data differ from those in the training data, prediction on the test data would also fail. One way to get around these issues is, after building a decision tree, to call ctree to build a new tree from data containing only those variables that appear in the first tree, and to explicitly set the levels of categorical variables in the test data to the levels of the corresponding variables in the training data.
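
To make the level-setting part of that workaround concrete, the snippet below forces a categorical variable in the test data to use exactly the levels seen in training. The column Color is hypothetical (iris has only numeric predictors); the same pattern applies to any factor variable.

> # 'Color' is a hypothetical factor column, used only for illustration
> # align test levels with training levels so that predict() does not fail
> testData$Color <- factor(testData$Color, levels=levels(trainData$Color))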

 **********************************

4.2    Building Decision Trees with Package rpart

**********************************

To be continued.
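
Until that section is filled in, here is a minimal sketch of the same task with rpart, assuming the same trainData/testData split and the same myFormula as in Section 4.1:

> library(rpart)
> # grow a classification tree with the same formula as before
> iris_rpart <- rpart(myFormula, data=trainData, method="class")
> print(iris_rpart)
> # predict classes on the test data and compare with the truth
> table(predict(iris_rpart, newdata=testData, type="class"), testData$Species)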

*****************

4.3    Random Forest

*****************

Function randomForest() in package randomForest can be used to build a random forest for prediction. (Function cforest() in package party provides an alternative implementation.)

Step 1: The iris data is split below into two subsets: training (70%) and test (30%).

> ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.7, 0.3))
> trainData <- iris[ind==1,]
> testData <- iris[ind==2,]

Step 2: Load package randomForest and then train a random forest.

> library(randomForest)
> rf <- randomForest(Species ~ ., data=trainData, ntree=100, proximity=TRUE)
> table(predict(rf), trainData$Species)

             setosa versicolor virginica
  setosa         38          0         0
  versicolor      0         33         2
  virginica       0          2        28
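
A sketch of the natural next steps, not shown in the text above: apply the forest to the test data, and use importance() and varImpPlot() (both in package randomForest) to see which variables contribute most to the splits.

> # apply the trained forest to the held-out test set
> table(predict(rf, newdata=testData), testData$Species)
> # variable importance (mean decrease in Gini for classification)
> importance(rf)
> varImpPlot(rf)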

*****************

To be continued.

*****************