The main idea of Bag of Words: cluster the training-sample features with k-means; then, for each feature of a test sample, find its nearest cluster center and increment that cluster's count. Each test sample thus yields an ncenter-dimensional histogram.
For example, given the training features a, b, c, a, d, f, e, b, e, d, c, f and ncenter = 6, they can be clustered into six classes [a, b, c, d, e, f]. Note that in practice the cluster centers need not coincide with any training feature, because k-means recomputes the centers at every update step.
Now suppose a test sample has the features a, b, c, d. BoW then produces the 6-dimensional histogram [1, 1, 1, 1, 0, 0].
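The toy example above can be sketched in a few lines of Python. This is only an illustration of the counting step with scalar "features" standing in for descriptors; the function name `bow_histogram` and the numeric stand-ins for a–f are hypothetical, not part of the MATLAB code below.

```python
def bow_histogram(features, centers):
    """Count, for each cluster center, how many features fall nearest to it."""
    hist = [0] * len(centers)
    for f in features:
        # index of the nearest center under squared distance (hard voting)
        nearest = min(range(len(centers)), key=lambda j: (f - centers[j]) ** 2)
        hist[nearest] += 1
    return hist

# Numeric stand-ins for the six cluster centers [a, b, c, d, e, f]
centers = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
# Test sample with features a, b, c, d
print(bow_histogram([0.0, 1.0, 2.0, 3.0], centers))  # [1, 1, 1, 1, 0, 0]
```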
So the whole pipeline is simply k-means followed by hard voting. K-means itself needs no elaboration here: the centers are updated iteratively until they change by less than a tolerance.
The k-means here is initialized with ncenter points chosen at random from the training data, and Euclidean distance is used thereafter; the distance computation is vectorized to speed it up.
This implementation is adapted from code by others; you may need small modifications for your own research. It is suitable reading for beginners.
function dic = CalDic(data, dicsize)
fprintf('Building Dictionary using Training Data\n\n');
dictionarySize = dicsize;
niters = 100;                          % maximum number of iterations
centres = zeros(dictionarySize, size(data, 2));
[ndata, data_dim] = size(data);
[ncentres, dim] = size(centres);
ThrError = 0.009;                      % convergence threshold

%% initialization: pick ncentres random training points as initial centres
perm = randperm(ndata);
perm = perm(1:ncentres);
centres = data(perm, :);
old_centres = centres;
display('Run k-means');

for n = 1:niters
    % Save old centres to check for termination
    e2 = max(max(abs(centres - old_centres)));
    inError(n) = e2;
    old_centres = centres;

    tempc = zeros(ncentres, dim);
    num_points = zeros(1, ncentres);
    id = eye(ncentres);

    d2 = EuclideanDistance(data, centres);
    % Assign each point to its nearest centre
    [minvals, index] = min(d2', [], 1);
    post = id(index, :);               % post(i,j)=1 if feature i is in cluster j, else 0
    num_points = num_points + sum(post, 1);

    % Recompute each centre as the mean of the points assigned to it
    for j = 1:ncentres
        tempc(j, :) = tempc(j, :) + sum(data(find(post(:, j)), :), 1);
    end
    for j = 1:ncentres
        if num_points(j) > 0
            centres(j, :) = tempc(j, :) / num_points(j);
        end
    end

    if n > 1
        % Test for termination
        if max(max(abs(centres - old_centres))) < ThrError
            fprintf('Saving texton dictionary\n');
            mkdir('data');                          % create the data folder
            dictionary = centres;
            save('data\dictionary', 'dictionary'); % save dictionary under the data folder
            break;
        end
        fprintf('The %dth iteration finished\n', n);
    end
end
dic = centres;
Below is the Euclidean distance function:
function d = EuclideanDistance(a, b)
% DISTANCE - computes Euclidean distance matrix
%
% E = EuclideanDistance(A,B)
%
%    A - (MxD) matrix
%    B - (NxD) matrix
%
% Returns:
%    E - (MxN) Euclidean distances between vectors in A and B
%
% Description:
%    This fully vectorized (VERY FAST!) m-file computes the
%    Euclidean distance between two vectors by:
%
%    ||A-B|| = sqrt ( ||A||^2 + ||B||^2 - 2*A.B )
%
% Example:
%    A = rand(100,400); B = rand(200,400);
%    d = EuclideanDistance(A,B);

% Author   : Roland Bunschoten
%            University of Amsterdam
%            Intelligent Autonomous Systems (IAS) group
%            Kruislaan 403  1098 SJ Amsterdam
%            tel.(+31)20-5257524
%            bunschot@wins.uva.nl
% Last Rev : Oct 29 16:35:48 MET DST 1999
% Tested   : PC Matlab v5.2 and Solaris Matlab v5.3
% Thanx    : Nikos Vlassis
% Copyright notice: You are free to modify, extend and distribute
%    this code granted that the author of the original code is
%    mentioned as the original author of the code.

if (nargin ~= 2)
    b = a;
end

if (size(a,2) ~= size(b,2))
    error('A and B should be of same dimensionality');
end

aa = sum(a.*a, 2);
bb = sum(b.*b, 2);
ab = a * b';
d = sqrt(abs(repmat(aa, [1 size(bb,1)]) + repmat(bb', [size(aa,1) 1]) - 2*ab));
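The vectorization above rests on the identity ||a − b||² = ||a||² + ||b||² − 2 a·b, which lets the whole M×N distance matrix be computed from norms and one matrix product. A small pure-Python check of that identity for a single pair of vectors (the function names here are illustrative, not from the m-file):

```python
import math

def euclidean(a, b):
    """Direct distance: sqrt(sum((a_i - b_i)^2))."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def euclidean_expanded(a, b):
    """Same distance via ||a||^2 + ||b||^2 - 2 a.b, as in the m-file."""
    aa = sum(x * x for x in a)
    bb = sum(y * y for y in b)
    ab = sum(x * y for x, y in zip(a, b))
    # abs() guards against tiny negative values from floating-point error,
    # mirroring the abs() in the MATLAB code
    return math.sqrt(abs(aa + bb - 2 * ab))

a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 8.0]
print(euclidean(a, b), euclidean_expanded(a, b))  # both ~7.0711
```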
Below is the hard voting function:
function His = HardVoting(data, dic)
ncentres = size(dic, 1);
id = eye(ncentres);
d2 = EuclideanDistance(data, dic);  % distance from each feature to each centre
% Assign each point to its nearest centre
[minvals, index] = min(d2', [], 1);
post = id(index, :);                % post(i,j)=1 if feature i is in cluster j, else 0
His = sum(post, 1);                 % histogram: number of votes per cluster
end
For classification problems, you may want to try LLC (CVPR 2010), which generally performs better than hard voting.