论文笔记：SiamRPN++: Evolution of Siamese Visual Tracking with Very Deep Networks

SiamRPN++: Evolution of Siamese Visual Tracking with Very Deep Networks

2019-04-02 12:44:36

Paper：https://arxiv.org/pdf/1812.11703.pdf

Project：https://lb1100.github.io/SiamRPN++

1. Background and Motivation:

与 CVPR 2019 的另一篇文章 Deeper and Wider Siamese Networks for Real-Time Visual Tracking 类似，这篇文章也是为了解决 Siamese Tracker 无法利用 Deep Backbone Network 的问题。作者的实验发现，较深的网络，如 ResNet, 无法带来跟踪精度提升的原因在于：the distroy of the strict translation invariance。因为目标可能出现在搜索区域的任何位置，所以学习的target template 的特征表达应该保持 spatial invariant，而作者发现，在众多网络中，仅仅 AlexNet 满足这种约束。本文中，作者提出一种 layer-wise feature aggravation structure 来进行 cross-correlation operation，帮助跟踪器从多个层次来预测相似形图。

此外，作者通过分析 Siamese Network 发现：the two network branches are highly imbalanced in terms of parameter number; 作者进一步提出 depth-wise separable correlation structure，这种结构不但可以大幅度的降低 target template branch 的参数个数，还可以稳定整个模型的训练。此外，另一个有趣的现象是：objects in the same categories have high response on the same channels while responses of the rest channels are supressed. 这种正交的属性可能有助于改善跟踪的效果。

2. Analysis on Siamese Networks for Tracking:

各种实验说明了 stride，padding 对深度网络的影响。

3. ResNet-driven Siamese Tracking :

为了降低上述影响因子对跟踪结果的影响，作者对原始的 ResNet 进行了修改。因为原始的残差网络 stride 为 32，这个参数对跟踪的影响非常之大。所以作者对最后两个 block 的有效 stride，从 32 和 16 改为 8，并且通过 dilated convolution 来增加 receptive field。利用 1*1 的卷积，将维度降为 256。但是这篇文章，并没有将 padding 的参数进行更改，所以 template feature map 的空间分辨率增加到 15，这就在进行 correlation 操作的时候，计算量较大，影响跟踪速度。所以，作者从中 crop 一块 7*7 regions 作为 template feature，每一个 feature cell 仍然可以捕获整个目标区域。作者发现仔细的调整 ResNet，是可以进一步提升效果的。通过将 ResNet extractor 的学习率设置为 RPN 网络的 1/10，得到的 feature 可以更加适合 tracking 任务。