Random Forest Algorithm Implemented in Matlab
- Preamble
- Code
- Compute the Gini index of the current node
- Find the best split point and its Gini index
- Sort data by a decision attribute in ascending order
- Generate a node
- Generate randomly sampled training data
- Build a decision tree
- Evaluation function
- Random forest
- Sample decision function
- Accuracy calculation function
- Main function
- Sample data
Preamble
1. The goal: given sample data (viewing distance distance, longest continuous eye-use duration duration, total eye-use time total_time, outdoor activity time outdoor, viewing angle angle, proportion of eye use under healthy ambient light proportion), decide whether a myopia warning is needed.
2. This Matlab implementation of the random forest algorithm was written for a freshman mathematical-modeling school contest, back when I was still pretty green.
3. The code still has a bug: the decision tree is built with hard-coded branching, which breaks when there are too few samples.
4. If you are not required to use Matlab, the sklearn library in Python is an easier way to get a random forest.
Details: Random forest in Python /CYBERLIFERK800/article/details/90552735
5. To improve: replace the hard-coded branches with binary-tree recursion when building the tree, use recall as the evaluation metric, and add cross-validation. I will fix these when I find time.
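As a companion to note 4, here is a minimal sklearn sketch. The six feature columns follow the attribute order listed above (distance, duration, total_time, outdoor, angle, proportion); the numbers are made-up toy data purely for illustration, not from the contest data set.

```python
# Minimal random-forest sketch with sklearn (note 4's suggestion).
# Columns: distance, duration, total_time, outdoor, angle, proportion.
# All values below are made-up toy data.
from sklearn.ensemble import RandomForestClassifier

X = [[30, 120, 480,  60, 10, 70],
     [45,  30, 200, 120,  5, 90],
     [25, 150, 500,  30, 15, 60],
     [50,  20, 150, 150,  3, 95]]
y = [1, 0, 1, 0]  # 1 = myopia warning needed

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
pred = clf.predict([[28, 140, 470, 40, 12, 65]])
```

sklearn handles the bootstrap sampling, attribute subsetting, and tree recursion internally, which is exactly the machinery the Matlab code below builds by hand.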
Code
Compute the Gini index of the current node
%Compute the Gini index of the current node's data
function gini_now=gini_self(data)
sample_select=size(data,1)-1;    %number of samples (the last row is carried along separately)
decision_select=size(data,2)-1;  %number of decision attributes; the last column is the label
time=0;                          %count of samples labeled "warning needed"
for i=1:sample_select
    if data(i,decision_select+1)
        time=time+1;
    end
end
gini_now=1-(time/sample_select)^2-((sample_select-time)/sample_select)^2;
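Since note 4 already points to Python, the same computation as a short Python sketch (the function name mirrors the Matlab one; labels are 0/1):

```python
def gini_self(labels):
    """Gini index 1 - p^2 - (1-p)^2 of a list of 0/1 labels."""
    p = sum(labels) / len(labels)  # fraction labeled "warning needed"
    return 1 - p**2 - (1 - p)**2

# a pure node has Gini 0; an evenly mixed binary node has the maximum, 0.5
```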
Find the best split point and its Gini index
%Find the best split point and its Gini index: input the data and a decision
%attribute, output the best split point and the best Gini index
function [boundary_best,gini_best]=gini(data_new,decision)
sample_select=size(data_new,1)-1;    %number of samples
decision_select=size(data_new,2)-1;  %number of decision attributes
%initialization
range=[min(data_new(1:sample_select,decision)) max(data_new(1:sample_select,decision))];%value range of the attribute
gini_best=1;             %best Gini index so far
boundary_best=range(1);  %best boundary so far
%try every integer boundary in the range (assumes integer attribute values)
for boundary=range(1)+1:range(2)
    result_temp=[0 0];
    time_lt=0;  %number of samples below boundary
    sum_lt=0;   %number of warning samples below boundary
    time_ge=0;  %number of samples at or above boundary
    sum_ge=0;   %number of warning samples at or above boundary
    for i=1:sample_select
        if(data_new(i,decision)<boundary)
            time_lt=time_lt+1;
            sum_lt=sum_lt+data_new(i,decision_select+1);
        else
            time_ge=time_ge+1;
            sum_ge=sum_ge+data_new(i,decision_select+1);
        end
    end
    %weighted Gini index of the split; cnt avoids shadowing the builtin sum.
    %If one side is empty, rate contains NaN, the NaN comparison below is
    %false, and that boundary is simply skipped.
    time=[time_lt time_lt time_ge time_ge];
    cnt=[sum_lt time_lt-sum_lt sum_ge time_ge-sum_ge];
    rate=cnt./time;
    result_temp(1)=1-rate(1)^2-rate(2)^2;
    result_temp(2)=1-rate(3)^2-rate(4)^2;
    result=time_lt/sample_select*result_temp(1)+time_ge/sample_select*result_temp(2);
    if result<gini_best
        gini_best=result;
        boundary_best=boundary;
    end
end
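The same boundary search sketched in Python, assuming integer-valued attributes just as the Matlab loop does (the names `weighted_gini` and `best_split` are mine, not from the original):

```python
def weighted_gini(rows, labels, attr, boundary):
    """Weighted Gini index of splitting on rows[i][attr] < boundary."""
    def gini(part):
        if not part:
            return 0.0
        p = sum(part) / len(part)
        return 1 - p**2 - (1 - p)**2
    left = [l for r, l in zip(rows, labels) if r[attr] < boundary]
    right = [l for r, l in zip(rows, labels) if r[attr] >= boundary]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_split(rows, labels, attr):
    """Try every integer boundary in the attribute's range, keep the best."""
    vals = [r[attr] for r in rows]
    best_b, best_g = min(vals), 1.0
    for b in range(min(vals) + 1, max(vals) + 1):
        g = weighted_gini(rows, labels, attr, b)
        if g < best_g:
            best_b, best_g = b, g
    return best_b, best_g
```

For example, four samples with attribute values 1..4 and labels 0,0,1,1 split perfectly at boundary 3.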
Sort data by a decision attribute in ascending order
%Sort data rows by the decision attribute in ascending order; output the
%sorted data and the position of the split boundary
function [data_new,index]=BubbleSort(data,decision,boundary)
sample_select=size(data,1)-1;
for i=1:sample_select-1
    for j=1:sample_select-i
        if data(j,decision)>data(j+1,decision)
            temp=data(j,:);
            data(j,:)=data(j+1,:);
            data(j+1,:)=temp;
        end
    end
end
%index = number of samples with attribute value <= boundary
index=sample_select;  %default covers the case where no value exceeds boundary
for i=1:sample_select
    if data(i,decision)>boundary
        index=i-1;
        break
    end
end
data_new=data;
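Bubble sort is O(n²); in Python the sort-then-locate step collapses to a few vectorized numpy calls (the function name is mine):

```python
import numpy as np

def sort_and_split(data, attr, boundary):
    """Sort rows ascending by one attribute column and return the number of
    rows whose value is <= boundary (the `index` of the Matlab code)."""
    data_sorted = data[np.argsort(data[:, attr], kind="stable")]
    index = int(np.searchsorted(data_sorted[:, attr], boundary, side="right"))
    return data_sorted, index
```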
Generate a node
%Generate a node: input the data; output the best decision attribute, the best
%split boundary, the two child data sets with their Gini indices, and the
%weighted Gini index of the split
function [decision_global_best,boundary_global_best,data_new1,gini_now1,data_new2,gini_now2,gini_new]=generate_node(data_new)
decision_select=size(data_new,2)-1;
gini_global_best=1;
decision_global_best=1;
boundary_global_best=0;
for i=1:decision_select
    [boundary_best,gini_best]=gini(data_new,i);
    if gini_best<gini_global_best
        gini_global_best=gini_best;
        decision_global_best=i;
        boundary_global_best=boundary_best;
    end
end
%sort by the decision_global_best attribute in ascending order
[data_nnew,index]=BubbleSort(data_new,decision_global_best,boundary_global_best);
%generate the child nodes (the extra last row is carried along with each child)
data_new1=data_nnew(1:index,:);
data_new1(index+1,:)=data_nnew(end,:);
gini_now1=gini_self(data_new1);
data_new2=data_nnew(index+1:end,:);
gini_now2=gini_self(data_new2);
%remove the used decision_global_best column from both children
data_new1(:,decision_global_best)=[];
data_new2(:,decision_global_best)=[];
%weighted Gini index of the split
size1=size(data_new1,1)-1;
size2=size(data_new2,1)-1;
gini_new=gini_now1*size1/(size1+size2)+gini_now2*size2/(size1+size2);
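Putting the pieces together, a self-contained Python sketch of the node step: pick the attribute/boundary pair with the lowest weighted Gini, split the samples, and drop the used attribute from each child (helper names are mine; integer attribute values assumed, as in the Matlab code):

```python
def generate_node(rows, labels):
    """Return (best attribute, best boundary, left child, right child),
    where each child is (rows without the used attribute, labels)."""
    def gini(part):
        if not part:
            return 0.0
        p = sum(part) / len(part)
        return 1 - p**2 - (1 - p)**2

    best_attr, best_b, best_g = 0, None, 1.0
    for attr in range(len(rows[0])):
        vals = [r[attr] for r in rows]
        for b in range(min(vals) + 1, max(vals) + 1):
            left = [l for r, l in zip(rows, labels) if r[attr] < b]
            right = [l for r, l in zip(rows, labels) if r[attr] >= b]
            g = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if g < best_g:
                best_attr, best_b, best_g = attr, b, g

    def child(keep):
        # drop the used attribute column from each kept row
        return ([r[:best_attr] + r[best_attr + 1:] for r in rows if keep(r)],
                [l for r, l in zip(rows, labels) if keep(r)])

    left = child(lambda r: r[best_attr] < best_b)
    right = child(lambda r: r[best_attr] >= best_b)
    return best_attr, best_b, left, right
```

Recursing on each child until a node is pure (Gini 0) or out of attributes is the binary-tree recursion that note 5 suggests as the replacement for the hard-coded branches.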
Generate randomly sampled training data
%Generate randomly sampled training data: draw m sample rows with replacement
%and n decision attributes without replacement
function data_