Random Forest Algorithm: A MATLAB Implementation

Time: 2024-10-12 07:09:45


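  • Ramblings
  • Code
    • Computing the Gini index of the current node
    • Finding the best split point and its Gini index
    • Sorting data in ascending order of the decision attribute
    • Generating a node
    • Generating randomly sampled data
    • Generating a decision tree
    • Evaluation function
    • Random forest
    • Sample decision function
    • Accuracy function
    • Main function
  • Sample data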

Ramblings

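1. Goal: decide from sample data whether a myopia warning should be issued. The features are viewing distance (distance), longest continuous eye-use duration (duration), total eye-use time (total_time), outdoor activity time (outdoor), viewing angle (angle), and the proportion of eye use under healthy ambient lighting (proportion).
2. This is the MATLAB random forest code I wrote for my university's math modeling contest in my first year. I wasn't very good back then.
3. The code still has a bug: the decision tree is built with hard-coded branches, so it breaks when there are too few samples.
4. If you don't have to use MATLAB, you can implement random forest with the sklearn library in Python.
Details: Random forest algorithm, Python implementation /CYBERLIFERK800/article/details/90552735
5. To improve: build the decision tree by recursing over a binary tree instead of using hard-coded branches, switch the evaluation function to recall, and add cross-validation. I'll fix it when I have time.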
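
Note 5 suggests scoring by recall instead of plain accuracy. Here is a minimal sketch of how the two could be computed side by side. The vectors `actual` and `predicted` are made-up illustration data (1 = warning needed), not values from the dataset in this post.

```matlab
% Hypothetical labels for 8 samples: 1 = myopia warning needed, 0 = not needed.
actual    = [1 1 1 0 0 1 0 1];
predicted = [1 0 1 0 1 1 0 0];

tp = sum(predicted == 1 & actual == 1);   % true positives
fn = sum(predicted == 0 & actual == 1);   % missed warnings (false negatives)

recall   = tp / (tp + fn);                % share of real warnings that were caught
accuracy = mean(predicted == actual);     % share of all samples predicted correctly
```

For a warning system, a missed warning is arguably worse than a false alarm, which is presumably why recall is the more useful score here.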

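Code

Computing the Gini index of the current node

% Compute the Gini index of the current node
function gini_now=gini_self(data)
sample_select=size(data,1)-1;           % number of samples (the last row is excluded)
decision_select=size(data,2)-1;         % number of decision attributes; the label is in the next column
time=0;                                 % number of samples whose label is nonzero (warning needed)
for i=1:sample_select
    if data(i,decision_select+1)
        time=time+1;
    end
end
gini_now=1-(time/sample_select)^2-((sample_select-time)/sample_select)^2;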
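
As a quick sanity check of the formula above, here is a minimal vectorized sketch of the same two-class Gini index. The `labels` vector is made-up illustration data, and the variable names are mine rather than part of the original code.

```matlab
% Hypothetical 0/1 labels of the samples in one node.
labels = [1; 0; 1; 1];

p = mean(labels ~= 0);              % fraction of samples that need a warning
giniNow = 1 - p^2 - (1 - p)^2;      % two-class Gini index
```

With three positives out of four, p = 0.75 and the index is 1 - 0.5625 - 0.0625 = 0.375. A pure node (all 0 or all 1) gives 0, and a 50/50 node gives the two-class maximum of 0.5.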

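Finding the best split point and its Gini index

% Find the best split point and its Gini index.
% Input: the data and a decision attribute (column index). Output: the best boundary and the best Gini index.
function [boundary_best,gini_best]=gini(data_new,decision)
sample_select=size(data_new,1)-1;           % number of samples
decision_select=size(data_new,2)-1;         % number of decision attributes
% Initialization
% (value_range and class_count are used below to avoid shadowing the built-in range and sum)
value_range=[min(data_new(1:sample_select,decision)) max(data_new(1:sample_select,decision))]; % range of the attribute's values
gini_best=1;                    % best (lowest) weighted Gini index so far
boundary_best=value_range(1);   % best boundary so far
% Compute time_lt, sum_lt, time_ge, sum_ge for each candidate boundary (step 1 from min+1 to max)
for j=value_range(1)+1:value_range(2)
    result_temp=[0 0];
    time_lt=0;                      % number of samples below boundary
    sum_lt=0;                       % number of samples below boundary that need a warning
    time_ge=0;                      % number of samples at or above boundary
    sum_ge=0;                       % number of samples at or above boundary that need a warning
    boundary=j;
    for i=1:sample_select
        if(data_new(i,decision)<boundary)
            time_lt=time_lt+1;
            sum_lt=sum_lt+data_new(i,decision_select+1);
        else
            time_ge=time_ge+1;
            sum_ge=sum_ge+data_new(i,decision_select+1);
        end
    end
    % Compute the Gini index of each side, then the size-weighted average
    time=[time_lt time_lt time_ge time_ge];
    class_count=[sum_lt time_lt-sum_lt sum_ge time_ge-sum_ge];
    rate=class_count./time;
    result_temp(1)=1-rate(1)^2-rate(2)^2;
    result_temp(2)=1-rate(3)^2-rate(4)^2;
    result=time_lt/sample_select*result_temp(1)+time_ge/sample_select*result_temp(2);
    if result<gini_best
        gini_best=result;
        boundary_best=boundary;
    end
end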
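
The loop above tries boundaries in steps of 1, which suits integer-like attribute values. For real-valued attributes, a common alternative (not what the code above does) is to try the midpoints between neighbouring distinct values. A minimal sketch, with made-up data and my own variable names:

```matlab
% Hypothetical toy data: one attribute column x and a 0/1 label column y.
x = [1.2; 2.5; 2.5; 3.1; 4.8];
y = [0; 0; 1; 1; 1];

% Two-class Gini index of a label vector.
giniOf = @(labels) 1 - mean(labels)^2 - (1 - mean(labels))^2;

v = unique(x);                               % sorted distinct values
candidates = (v(1:end-1) + v(2:end)) / 2;    % midpoints between neighbours

bestGini = 1;
bestBoundary = v(1);
for k = 1:numel(candidates)
    b = candidates(k);
    left  = y(x <  b);                       % same "< boundary" rule as gini
    right = y(x >= b);
    g = numel(left)/numel(y)*giniOf(left) + numel(right)/numel(y)*giniOf(right);
    if g < bestGini
        bestGini = g;
        bestBoundary = b;
    end
end
```

Midpoints guarantee that both sides of every candidate split are non-empty, so the division by the group size never hits zero.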

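Sorting data in ascending order of the decision attribute

% Sort data in ascending order of the decision attribute; output the sorted data and the split position
function [data_new,index]=BubbleSort(data,decision,boundary)
sample_select=size(data,1)-1;
for i=1:sample_select-1
    for j=1:sample_select-i
        if data(j,decision)>data(j+1,decision)
            temp=data(j,:);
            data(j,:)=data(j+1,:);
            data(j+1,:)=temp;
        end
    end
end
% index = number of samples below boundary. The comparison is >= so that the
% split matches the "<boundary" rule used in gini; if no sample reaches the
% boundary, every sample stays on the left.
index=sample_select;
for i=1:sample_select
    if data(i,decision)>=boundary
        index=i-1;
        break
    end
end
data_new=data;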
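
For comparison, the same sort-and-split step can be sketched with the built-in `sortrows`. This assumes, as the functions above seem to, that the last row of `data` is not a sample and must stay in place. The toy matrix and its contents are made up for illustration.

```matlab
% Hypothetical data: columns = [attribute label]; the last row is a non-sample row.
data = [3 1; 1 0; 2 0; 9 9];
decision = 1;        % column to sort by
boundary = 2;        % split boundary

n = size(data, 1) - 1;                              % number of real samples
data(1:n, :) = sortrows(data(1:n, :), decision);    % sort the samples only
index = sum(data(1:n, decision) < boundary);        % samples strictly below boundary
```

Here `index` counts the samples strictly below the boundary, matching the `<boundary` comparison used in gini.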

Generating a node

% Generate a node. Input: the data. Output: the best decision attribute, the best split boundary,
% the two groups of data after the split with their Gini indices, and the weighted Gini index.
function [decision_global_best,boundary_global_best,data_new1,gini_now1,data_new2,gini_now2,gini_new]=generate_node(data_new)
decision_select=size(data_new,2)-1;
gini_global_best=1;
decision_global_best=1;
boundary_global_best=0;
% Try every decision attribute and keep the one with the lowest weighted Gini index
for i=1:decision_select
    decision=i;
    [boundary_best,gini_best]=gini(data_new,decision);
    if gini_best<gini_global_best
        gini_global_best=gini_best;
        decision_global_best=decision;
        boundary_global_best=boundary_best;
    end
end
% Sort in ascending order of attribute decision_global_best
[data_nnew,index]=BubbleSort(data_new,decision_global_best,boundary_global_best);
% Generate the child nodes

% First child: the first index samples, plus a copy of the last (non-sample) row
data_new1=data_nnew(1:index,:);
data_new1(index+1,:)=data_nnew(end,:);
gini_now1=gini_self(data_new1);
% Remove column decision_global_best
data_new1(:,decision_global_best)=[];

% Second child: the remaining samples; the last row is already included
data_new2=data_nnew(index+1:end,:);
gini_now2=gini_self(data_new2);
% Remove column decision_global_best
data_new2(:,decision_global_best)=[];

% Weighted Gini index of the two children, weighted by sample count
size1=size(data_new1,1)-1;
size2=size(data_new2,1)-1;
gini_new=gini_now1*size1/(size1+size2)+gini_now2*size2/(size1+size2);
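
The split itself can also be sketched without sorting, using a logical mask. This is only an illustration under my own assumptions: the toy matrix is made up, and I assume the last row holds attribute identifiers (the post does not say what it contains, only that the code skips it when counting samples).

```matlab
% Hypothetical data: columns = [attr1 attr2 label]; last row = assumed attribute IDs.
data = [1 5 0; 2 7 0; 3 6 1; 4 8 1; 101 102 0];
best = 1;            % column chosen for the split
boundary = 3;        % chosen boundary

n = size(data, 1) - 1;
samples = data(1:n, :);
meta = data(end, :);

leftMask = samples(:, best) < boundary;      % same "< boundary" rule as gini
child1 = [samples(leftMask, :);  meta];      % samples below the boundary
child2 = [samples(~leftMask, :); meta];      % samples at or above the boundary

% Drop the attribute that was just used, as generate_node does.
child1(:, best) = [];
child2(:, best) = [];

% Weighted Gini index of the split, computed from the label (last) column.
giniOf = @(labels) 1 - mean(labels)^2 - (1 - mean(labels))^2;
n1 = size(child1, 1) - 1;
n2 = size(child2, 1) - 1;
giniNew = n1/(n1+n2)*giniOf(child1(1:n1, end)) + n2/(n1+n2)*giniOf(child2(1:n2, end));
```

In this toy case both children are pure, so the weighted Gini index is 0.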

Generating randomly sampled data

% Generate randomly sampled data: draw m samples with replacement and n decision attributes without replacement
function data_