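High-Speed Tracking with Kernelized Correlation Filters: Translation and Analysis
A high-speed object tracking method based on kernelized correlation filters, KCF for short
A note before starting. I am reading this paper closely because it is extremely important: it was a groundbreaking paper in object tracking, and many KCF-based papers followed it. The paper is like the foundation stone of a building. Modules from deep learning, long short-term memory networks and similar frameworks can also be added on top of KCF with good results. I will therefore study the paper in depth and share my notes here. My knowledge is limited and mistakes are inevitable, so corrections are welcome.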
High-Speed Tracking with Kernelized Correlation Filters:
Abstract---The core component of most modern trackers is a discriminative classifier, tasked with distinguishing between the target and the surrounding environment. To cope with natural image changes, this classifier is typically trained with translated and scaled sample patches. Such sets of samples are riddled with redundancies: any overlapping pixels are constrained to be the same. Based on this simple observation, we propose an analytic model for datasets of thousands of translated patches. By showing that the resulting data matrix is circulant, we can diagonalize it with the Discrete Fourier Transform, reducing both storage and computation by several orders of magnitude. Interestingly, for linear regression our formulation is equivalent to a correlation filter, used by some of the fastest competitive trackers. For kernel regression, however, we derive a new Kernelized Correlation Filter (KCF), that unlike other kernel algorithms has the exact same complexity as its linear counterpart. Building on it, we also propose a fast multi-channel extension of linear correlation filters, via a linear kernel, which we call Dual Correlation Filter (DCF). Both KCF and DCF outperform top-ranking trackers such as Struck or TLD on a 50 videos benchmark, despite running at hundreds of frames-per-second, and being implemented in a few lines of code (Algorithm 1). To encourage further developments, our tracking framework was made open-source.
Index Terms---Visual tracking, circulant matrices, discrete Fourier transform, kernel methods, ridge regression, correlation filters.
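Abstract (translation): The core component of most state-of-the-art trackers is a discriminative classifier, which is responsible for separating the target from its surrounding background. To cope with natural image changes, this classifier is usually trained on translated and scaled sample patches. Such sample sets are full of redundancy: any overlapping pixels are constrained to be the same. Based on this simple observation, the authors propose an analytic model for datasets of thousands of translated patches. They show that the resulting data matrix is circulant, so it can be diagonalized with the discrete Fourier transform, which cuts both storage and computation by several orders of magnitude. Interestingly, for linear regression the formulation is equivalent to a correlation filter, as used by some of the fastest competitive trackers. For kernel regression, they derive a new Kernelized Correlation Filter (KCF) which, unlike other kernel algorithms, has the same computational complexity as its linear counterpart. Building on it, they also propose a fast multi-channel extension of linear correlation filters through a linear kernel, called the Dual Correlation Filter (DCF). On a benchmark of 50 videos, both KCF and DCF outperform top-ranking trackers such as Struck and TLD while running at hundreds of frames per second, and the implementation takes only a few lines of code. To encourage further development, the code was released as open source.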
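The claim that a circulant data matrix is diagonalized by the DFT can be checked numerically. The following is a minimal sketch, not the authors' code: the helper `circulant`, the 1-D base sample `x` and the size `n` are illustrative assumptions, and the data matrix is built so that row i is `x` cyclically shifted by i elements.

```python
import numpy as np

def circulant(x):
    """Data matrix whose i-th row is x cyclically shifted by i elements (assumed layout)."""
    return np.stack([np.roll(x, i) for i in range(len(x))])

rng = np.random.default_rng(0)
n = 8
x = rng.standard_normal(n)      # hypothetical 1-D "base sample"
X = circulant(x)                # n x n matrix holding all translated samples

# Unitary DFT matrix: F @ v equals np.fft.fft(v) / sqrt(n).
F = np.fft.fft(np.eye(n)) / np.sqrt(n)

# The n DFT coefficients of x are the eigenvalues; F supplies the eigenvectors.
x_hat = np.fft.fft(x)
X_rebuilt = F @ np.diag(x_hat) @ F.conj().T

print(np.allclose(X, X_rebuilt))
```

Only the n values of `x_hat` need to be stored instead of the n x n entries of `X`, which is where the saving in storage and computation comes from.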
1 Introduction
Arguably one of the biggest breakthroughs in recent visual tracking research was the widespread adoption of discriminative learning methods. The task of tracking, a crucial component of many computer vision systems, can be naturally specified as an online learning problem. Given an initial image patch containing the target, the goal is to learn a classifier to discriminate between its appearance and that of the environment. This classifier can be evaluated exhaustively at many locations, in order to detect it in subsequent frames. Of course, each new detection provides a new image patch that can be used to update the model.
Introduction (translation):
One of the biggest breakthroughs in recent visual tracking research is the widespread adoption of discriminative learning methods. Tracking is a key component of many computer vision systems, and it can naturally be cast as an online learning problem. Given an initial image patch that contains the target, the goal is to learn a classifier that separates the target's appearance from that of its surroundings. To detect the target in subsequent frames, the classifier can be evaluated exhaustively at many locations. Each new detection, in turn, provides a new image patch that can be used to update the model.
Key term: image patches
It is tempting to focus on characterizing the object of interest - the positive samples for the classifier. However, a core tenet of discriminative methods is to give as much importance, or more, to the relevant environment - the negative samples. The most commonly used negative samples are image patches from different locations and scales, reflecting the prior knowledge that the classifier will be evaluated under those conditions.
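It is tempting to concentrate on describing the object of interest, that is, the positive samples for the classifier. A core principle of discriminative methods, however, is to give equal or greater weight to the relevant environment, the negative samples. The most commonly used negative samples are image patches taken from different locations and scales, which reflects the prior knowledge that the classifier will be evaluated under those conditions.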
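An extremely challenging factor is the virtually unlimited amount of negative samples that can be obtained from an image. Due to the time-sensitive nature of tracking, modern trackers walk a fine line between incorporating as many samples as possible and keeping computational demand low. It is common practice to randomly choose only a few samples each frame [3], [4], [5], [6], [7].
An extremely challenging factor is that a virtually unlimited number of negative samples can be extracted from a single image. Because tracking is time-sensitive, modern trackers walk a tightrope between incorporating as many samples as possible and keeping the computational load low.
A very common practice is to randomly pick only a few samples in each frame.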
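Although the reasons for doing so are understandable, we argue that undersampling negatives is the main factor inhibiting performance in tracking. In this paper, we develop tools to analytically incorporate thousands of samples at different relative translations, without iterating over them explicitly. This is made possible by the discovery that, in the Fourier domain, some learning algorithms actually become easier as we add more samples, if we use a specific model for translations.
Although the reasons for doing so are understandable, the authors argue that undersampling the negatives is the main factor that holds back tracking performance. In this paper they develop tools to analytically incorporate thousands of samples at different relative translations without iterating over them explicitly. This is made possible by a discovery: in the Fourier domain, if a specific model is used for translations, some learning algorithms actually become easier as more samples are added.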
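To make "incorporating all translated samples without iterating over them" concrete, here is a small sketch under stated assumptions: 1-D signals, cyclic shifts as the translation model, ridge regression as the learner, and Gaussian-shaped regression targets `y` chosen purely for illustration. The element-wise Fourier-domain solution is checked against an explicit solve on the full matrix of shifted samples. Where the complex conjugate appears depends on the shift and DFT conventions; the form below matches NumPy's `fft` and rows built with `np.roll(x, i)`.

```python
import numpy as np

def circulant(x):
    """Row i is x cyclically shifted by i elements (assumed translation model)."""
    return np.stack([np.roll(x, i) for i in range(len(x))])

rng = np.random.default_rng(1)
n, lam = 16, 1e-2                        # lam: ridge regularization (illustrative value)
x = rng.standard_normal(n)               # hypothetical base sample
shift = np.minimum(np.arange(n), n - np.arange(n))
y = np.exp(-0.5 * (shift / 2.0) ** 2)    # assumed targets: peak at zero shift

# Explicit route: build all n shifted samples, then solve the n x n ridge system.
X = circulant(x)
w_explicit = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Fourier route: the same solution from element-wise operations on n coefficients.
x_hat, y_hat = np.fft.fft(x), np.fft.fft(y)
w_fourier = np.real(np.fft.ifft(x_hat * y_hat / (np.abs(x_hat) ** 2 + lam)))

print(np.allclose(w_explicit, w_fourier))
```

The explicit route costs O(n^3) for the solve, while the Fourier route costs O(n log n), and it never materializes the shifted samples.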
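These analytical tools, namely circulant matrices, provide a useful bridge between popular learning algorithms and classical signal processing. The implication is that we are able to propose a tracker based on Kernel Ridge Regression that does not suffer from the "curse of kernelization", which is its larger asymptotic complexity, and even exhibits lower complexity than unstructured linear regression. Instead, it can be seen as a kernelized version of a linear correlation filter, which forms the basis for the fastest trackers available. We leverage the powerful kernel trick at the same computational complexity as linear correlation filters. Our framework easily incorporates multiple feature channels, and by using a linear kernel we show a fast extension of linear correlation filters to the multichannel case.
These analytical tools, namely circulant matrices, form a useful bridge between popular learning algorithms and classical signal processing. As a consequence, the authors can propose a tracker based on kernel ridge regression that avoids the "curse of kernelization", meaning its larger asymptotic complexity, and that even has lower complexity than unstructured linear regression. The tracker can be viewed as a kernelized version of a linear correlation filter, the kind of filter that underlies the fastest trackers available. The kernel trick is used at the same computational complexity as a linear correlation filter. The framework easily accommodates multiple feature channels, and with a linear kernel it yields a fast extension of linear correlation filters to the multi-channel case.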
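The point about escaping the "curse of kernelization" can be illustrated with a toy example. The sketch below is an assumption-laden illustration rather than the paper's algorithm as published: it uses 1-D signals, cyclic shifts, and a Gaussian kernel of the form exp(-||a - b||^2 / sigma^2); the function name `gaussian_kernel_correlation` and all parameter values are hypothetical. Because the kernel matrix over all cyclic shifts is itself circulant, the kernel ridge regression coefficients follow from an element-wise division in the Fourier domain.

```python
import numpy as np

def gaussian_kernel_correlation(x, sigma):
    """Kernel values between x and every cyclic shift of x (first row of K)."""
    x_hat = np.fft.fft(x)
    c = np.real(np.fft.ifft(np.abs(x_hat) ** 2))   # c[j] = x . roll(x, j)
    d2 = 2.0 * (x @ x) - 2.0 * c                   # squared distances ||x - roll(x, j)||^2
    return np.exp(-np.maximum(d2, 0.0) / sigma ** 2)  # clamp tiny negative rounding errors

rng = np.random.default_rng(3)
n, lam, sigma = 16, 1e-2, 4.0                      # illustrative values
x = rng.standard_normal(n)
shift = np.minimum(np.arange(n), n - np.arange(n))
y = np.exp(-0.5 * (shift / 2.0) ** 2)              # assumed targets: peak at zero shift

# Explicit kernel ridge regression: n x n kernel matrix over all shifted samples.
shifts = np.stack([np.roll(x, i) for i in range(n)])
sq_dists = np.sum((shifts[:, None, :] - shifts[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / sigma ** 2)
alpha_explicit = np.linalg.solve(K + lam * np.eye(n), y)

# Fourier route: only the first row k of K is needed.
k = gaussian_kernel_correlation(x, sigma)
alpha_fourier = np.real(np.fft.ifft(np.fft.fft(y) / (np.fft.fft(k) + lam)))

print(np.allclose(alpha_explicit, alpha_fourier))
```

Both routes give the same dual coefficients, but the Fourier route needs only a handful of FFTs of length n, the same order of work as a linear correlation filter.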
2 Related Work
2.1 On tracking-by-detection
A comprehensive review of tracking-by-detection is outside the scope of this article, but we refer the interested reader to two excellent and very recent surveys. The most popular approach is to use a discriminative appearance model. It consists of training a classifier online, inspired by statistical machine learning methods, to predict the presence or absence of the target in an image patch. This classifier is then tested on many candidate patches to find the most likely location. Alternatively, the position can also be predicted directly. Regression with class labels can be seen as classification, so we use the two terms interchangeably.
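2 Related Work (translation)
2.1 On tracking-by-detection (translation)
A comprehensive review of tracking-by-detection is beyond the scope of the paper, but the authors point the reader to two excellent and very recent surveys. The most popular approach uses a discriminative appearance model. Inspired by statistical machine learning, it trains a classifier online to predict whether the target is present in an image patch. The classifier is then tested on many candidate patches to find the most likely location of the target. Alternatively, the position can be predicted directly. Regression with class labels can be regarded as classification, so the two terms are used interchangeably.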
We will discuss some relevant trackers before focusing on the literature that is more directly related to our analytical methods. Canonical examples of the tracking-by-detection paradigm include those based on Support Vector Machines (SVM), Random Forest classifiers, or boosting variants. All the mentioned algorithms had to be adapted for online learning, in order to be useful for tracking. Zhang et al. propose a projection to a fixed random basis, to train a Naive Bayes classifier, inspired by compressive sensing techniques. Aiming to predict the target's location directly, instead of its presence in a given image patch, Hare et al. employed a Structured Output SVM and Gaussian kernels, based on a large number of image features. Examples of non-discriminative trackers include the work of Wu et al., who formulate tracking as a sequence of image alignment objectives, and of Sevilla-Lara and Learned-Miller, who propose a strong appearance descriptor based on distribution fields. Another discriminative approach by Kalal et al. uses a set of structural constraints to guide the sampling process of a boosting classifier. Finally, Bolme et al. employ classical signal processing analysis to derive fast correlation filters. We will discuss these last two works in more detail shortly.
Before turning to the literature that relates most directly to the analytical methods, the authors discuss some relevant trackers. Canonical examples of tracking-by-detection are based on SVMs, Random Forest classifiers, or boosting variants. All of these algorithms had to be adapted to online learning to be useful for tracking. Zhang et al. project onto a fixed random basis to train a Naive Bayes classifier, an idea inspired by compressive sensing. Hare et al. aim to predict the target's location directly, rather than its presence in a given image patch, and use a Structured Output SVM with Gaussian kernels on a large number of image features. Among non-discriminative trackers, Wu et al. formulate tracking as a sequence of image alignment objectives, and Sevilla-Lara and Learned-Miller propose a strong appearance descriptor based on distribution fields. Another discriminative approach, by Kalal et al., uses a set of structural constraints to guide the sampling process of a boosting classifier. Finally, Bolme et al. use classical signal processing analysis to derive fast correlation filters. The last two works are discussed in more detail below.
2.2 On sample translations and correlation filtering
Recall that our goal is to learn and detect over translated image patches efficiently. Unlike our approach, most attempts so far have focused on trying to weed out irrelevant image patches. On the detection side, it is possible to use branch-and-bound to find the maximum of a classifier's response while avoiding unpromising candidate patches. Unfortunately, in the worst case the algorithm may still have to iterate over all patches. A related method finds the most similar patches of a pair of images efficiently, but is not directly translated to our setting. Though it does not preclude an exhaustive search, a notable optimization is to use a fast but inaccurate classifier to select promising patches, and only apply the full, slower classifier on those.
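2.2 On sample translations and correlation filtering (translation):
Recall that the goal is to learn and detect over translated image patches efficiently. Unlike the authors' approach, most attempts so far have tried to weed out irrelevant image patches. On the detection side, branch-and-bound can find the maximum of a classifier's response while skipping unpromising candidate patches. Unfortunately, in the worst case the algorithm may still have to iterate over all patches. A related method efficiently finds the most similar patches in a pair of images, but it does not carry over directly to this setting. Another notable optimization, although it does not remove the need for an exhaustive search, is to first use a fast but inaccurate classifier to select promising patches and then apply the full, slower classifier only to those.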
On the training side, Kalal et al. propose using structural constraints to select relevant sample patches from each new image. This approach is relatively expensive, limiting the features that can be used, and requires careful tuning of the structural heuristics. A popular and related method, though it is mainly used in offline detector learning, is hard-negative mining. It consists of running an initial detector on a pool of images, and selecting any wrong detections as samples for re-training. Even though both approaches reduce the number of training samples, a major drawback is that the candidate patches have to be considered exhaustively, by running a detector.
On the training side, Kalal et al. propose using structural constraints to select relevant sample patches from each new image. This approach is relatively expensive, which limits the features that can be used, and it requires careful tuning of the structural heuristics. A popular related method, used mainly in offline detector learning, is hard-negative mining. It runs an initial detector on a pool of images and selects any wrong detections as samples for re-training. Although both approaches reduce the number of training samples, a major drawback is that all candidate patches still have to be considered exhaustively by running a detector.
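The initial motivation for our line of research was the recent success of correlation filters in tracking. Correlation filters have proved to be competitive with far more complicated approaches, but using only a fraction of the computational power, at hundreds of frames-per-second. They take advantage of the fact that the convolution of two patches (loosely, their dot-product at different relative translations) is equivalent to an element-wise product in the Fourier domain. Thus, by formulating their objective in the Fourier domain, they can specify the desired output of a linear classifier for several translations, or image shifts, at once.
The initial motivation for this line of research was the recent success of correlation filters in tracking. Correlation filters have proved competitive with far more complicated approaches while using only a fraction of the computation, running at hundreds of frames per second. They exploit the fact that the convolution of two patches (loosely speaking, their dot product at different relative translations) is equivalent to an element-wise product in the Fourier domain. By formulating the objective in the Fourier domain, the desired output of a linear classifier can therefore be specified for several translations, or image shifts, at once.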
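The equivalence between "dot products at different relative translations" and an element-wise product in the Fourier domain can be verified on a small example. This is a sketch assuming cyclic (wrap-around) shifts and real-valued patches; the function name `circular_correlation` and the patch sizes are illustrative.

```python
import numpy as np

def circular_correlation(a, b):
    """Dot product of a with b at every cyclic 2-D shift, computed via the FFT."""
    return np.real(np.fft.ifft2(np.conj(np.fft.fft2(a)) * np.fft.fft2(b)))

rng = np.random.default_rng(2)
a = rng.standard_normal((6, 5))
b = rng.standard_normal((6, 5))

# Direct evaluation: one dot product per relative translation (dy, dx).
direct = np.empty_like(a)
for dy in range(a.shape[0]):
    for dx in range(a.shape[1]):
        direct[dy, dx] = np.sum(a * np.roll(b, (-dy, -dx), axis=(0, 1)))

fast = circular_correlation(a, b)
print(np.allclose(direct, fast))

# Detection use: if the second patch is a shifted copy of the first,
# the response peaks at exactly that shift.
shifted = np.roll(a, (2, 3), axis=(0, 1))
response = circular_correlation(a, shifted)
peak = np.unravel_index(np.argmax(response), response.shape)
print(peak)
```

The double loop evaluates every translation separately, while the FFT version produces the whole response map in one pass, which is why correlation filters can score all image shifts at once.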
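A Fourier domain approach can be very efficient, and has several decades of research in signal processing to draw from. Unfortunately, it can also be extremely limiting. We would like to simultaneously leverage more recent advances in computer vision, such as more powerful features, large-margin classifiers or kernel methods.
A Fourier-domain approach can be very efficient and can draw on several decades of signal processing research. Unfortunately, it can also be extremely limiting. The authors would also like to take advantage of more recent advances in computer vision, such as more powerful features, large-margin classifiers, or kernel methods.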
A few studies go in that direction, and attempt to apply kernel methods to correlation filters. In these works, a distinction must be drawn between two types of objective functions: those that do not consider the power spectrum or image translations, such as Synthetic Discriminant Function (SDF) filters, and those that do, such as Minimum Average Correlation Energy, Optimal Trade-Off and Minimum Output Sum of Squared Error (MOSSE) filters. Since the spatial structure can effectively be ignored, the former are easier to kernelize, and Kernel SDF filters have been proposed. However, lacking a clearer relationship between translated images, non-linear kernels and the Fourier domain, applying the kernel trick to other filters has proven much more difficult, with some proposals requiring significantly higher computation times and imposing strong limits on the number of image shifts that can be considered.
A few studies go in this direction and try to apply kernel methods to correlation filters. In these works, two types of objective function must be distinguished. The first type does not consider the power spectrum or image translations, for example SDF filters. The second type does, for example Minimum Average Correlation Energy, Optimal Trade-Off, and Minimum Output Sum of Squared Error (MOSSE) filters. Because the spatial structure can effectively be ignored, the first type is easier to kernelize, and Kernel SDF filters have been proposed. However, without a clearer relationship between translated images, non-linear kernels, and the Fourier domain, applying the kernel trick to the other filters has proved much harder. Some proposals require significantly more computation time and impose strong limits on the number of image shifts that can be considered.
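For us, this hinted that a deeper connection between translated image patches and training algorithms was needed, in order to overcome the limitations of direct Fourier domain formulations.
For the authors, this suggested that a deeper connection between translated image patches and training algorithms was needed in order to overcome the limitations of direct Fourier-domain formulations.
To be continued.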