Weilin Huang——【ECCV2016】Detecting Text in Natural Image with Connectionist Text Proposal Network

作者和相关链接

个人主页：Zhi Tian，黄伟林，Tong He，Pan He，乔宇
作者简单信息：

论文阅读（Weilin Huang——【ECCV2016】Detecting Text in Natural Image with Connectionist Text Proposal Network）

论文下载：论文传送门
代码下载：代码传送门

几个关键的Idea出发点

文本检测和一般目标检测的不同——文本线是一个sequence（字符、字符的一部分、多字符组成的一个sequence），而不是一般目标检测中只有一个独立的目标。这既是优势，也是难点。优势体现在同一文本线上不同字符可以互相利用上下文，可以用sequence的方法比如RNN来表示。难点体现在要检测出一个完整的文本线，同一文本线上不同字符可能差异大，距离远，要作为一个整体检测出来难度比单个目标更大——因此，作者认为预测文本的竖直位置（文本bounding box的上下边界）比水平位置（文本bounding box的左右边界）更容易。
Top-down（先检测文本区域，再找出文本线）的文本检测方法比传统的bottom-up的检测方法（先检测字符，再串成文本线）更好。自底向上的方法的缺点在于（这点在作者的另一篇文章中说的更清楚），总结起来就是没有考虑上下文，不够鲁棒，系统需要太多子模块，太复杂且误差逐步积累，性能受限。
RNN和CNN的无缝结合可以提高检测精度。CNN用来提取深度特征，RNN用来序列的特征识别（2类），二者无缝结合，用在检测上性能更好。

方法概括

基本流程如Fig 1，整个检测分六步：
- 第一，用VGG16的前5个Conv stage（到conv5）得到feature map(W*H*C)
- 第二，在Conv5的feature map的每个位置上取3*3*C的窗口的特征，这些特征将用于预测该位置k个anchor（anchor的定义和Faster RCNN类似）对应的类别信息，位置信息。
- 第三，将每一行的所有窗口对应的3*3*C的特征（W*3*3*C）输入到RNN（BLSTM）中，得到W*256的输出
- 第四，将RNN的W*256输入到512维的fc层
- 第五，fc层特征输入到三个分类或者回归层中。第二个2k scores 表示的是k个anchor的类别信息（是字符或不是字符）。第一个2k vertical coordinate和第三个k side-refinement是用来回归k个anchor的位置信息。2k vertical coordinate表示的是bounding box的高度和中心的y轴坐标（可以决定上下边界），k个side-refinement表示的bounding box的水平平移量。这边注意，只用了3个参数表示回归的bounding box，因为这里默认了每个anchor的width是16，且不再变化（VGG16的conv5的stride是16）。回归出来的box如Fig.1中那些红色的细长矩形，它们的宽度是一定的。
- 第六，用简单的文本线构造算法，把分类得到的文字的proposal（图Fig.1（b）中的细长的矩形）合并成文本线

论文阅读（Weilin Huang——【ECCV2016】Detecting Text in Natural Image with Connectionist Text Proposal Network）

Fig. 1: (a) Architecture of the Connectionist Text Proposal Network (CTPN). We densely slide a 3×3 spatial window through the last convolutional maps (conv5 ) of the VGG16 model [27]. The sequential windows in each row are recurrently connected by a Bi-directional LSTM (BLSTM) [7], where the convolutional feature (3×3×C) of each window is used as input of the 256D BLSTM (including two 128D LSTMs). The RNN layer is connected to a 512D fully-connected layer, followed by the output layer, which jointly predicts text/non-text scores, y-axis coordinates and side-refinement offsets of k anchors. (b) The CTPN outputs sequential fixed-width fine-scale text proposals. Color of each box indicates the text/non-text score. Only the boxes with positive scores are presented.

方法细节

Detecting Text in Fine-scale proposals
- k个anchor尺度和长宽比设置：宽度都是16，k = 10，高度从11~273（每次除于0.7）
- 回归的高度和bounding box的中心的y坐标如下，带*的表示是groundTruth，带a的表示是anchor

论文阅读（Weilin Huang——【ECCV2016】Detecting Text in Natural Image with Connectionist Text Proposal Network）

- score阈值设置：0.7 （+NMS）
- 一般的RPN和采用本文的方法检测出的效果对比

论文阅读（Weilin Huang——【ECCV2016】Detecting Text in Natural Image with Connectionist Text Proposal Network）

Recurrent Connectionist Text Proposals
- RNN类型：BLSTM（双向LSTM），每个LSTM有128个隐含层
- RNN输入：每个滑动窗口的3*3*C的特征（可以拉成一列），同一行的窗口的特征形成一个序列
- RNN输出：每个窗口对应256维特征
- 使用RNN和不适用RNN的效果对比，CTPN是本文的方法（Connectionist Text Proposal Network）

论文阅读（Weilin Huang——【ECCV2016】Detecting Text in Natural Image with Connectionist Text Proposal Network）

Side-refinement
- 文本线构造算法（多个细长的proposal合并成一条文本线）
  - 主要思想：每两个相近的proposal组成一个pair，合并不同的pair直到无法再合并为止（没有公共元素）
  - 判断两个proposal，Bi和Bj组成pair的条件：
    1. Bj->Bi，且Bi->Bj。（Bj->Bi表示Bj是Bi的最好邻居）
    2. Bj->Bi条件1：Bj是Bi的邻居中距离Bi最近的，且该距离小于50个像素
    3. Bj->Bi条件2：Bj和Bi的vertical overlap大于0.7
- 固定要regression的box的宽度和水平位置会导致predict的box的水平位置不准确，所以作者引入了side-refinement，用于水平位置的regression。where x_side is the predicted x-coordinate of the nearest horizontal side (e.g., left or right side) to current anchor. x^∗ side is the ground truth (GT) side coordinate in x-axis, which is pre-computed from the GT bounding box and anchor location. c^a_xis the center of anchor in x-axis. wa is the width of anchor, which is fixed, w_a = 16

论文阅读（Weilin Huang——【ECCV2016】Detecting Text in Natural Image with Connectionist Text Proposal Network）

- 使用side-refinement的效果对比

论文阅读（Weilin Huang——【ECCV2016】Detecting Text in Natural Image with Connectionist Text Proposal Network）

实验结果

时间：0.14s with GPU
ICDAR2011，ICDAR2013，ICDAR2015库上检测结果

论文阅读（Weilin Huang——【ECCV2016】Detecting Text in Natural Image with Connectionist Text Proposal Network）

总结与收获点

这篇文章的方法最大亮点在于把RNN引入检测问题（以前一般做识别）。文本检测，先用CNN得到深度特征，然后用固定宽度的anchor来检测text proposal（文本线的一部分），并把同一行anchor对应的特征串成序列，输入到RNN中，最后用全连接层来分类或回归，并将正确的text proposal进行合并成文本线。这种把RNN和CNN无缝结合的方法提高了检测精度。

秒客网

论文阅读（Weilin Huang——【ECCV2016】Detecting Text in Natural Image with Connectionist Text Proposal Network）

Weilin Huang——【ECCV2016】Detecting Text in Natural Image with Connectionist Text Proposal Network

目录

作者和相关链接

几个关键的Idea出发点

方法概括

基本流程如Fig 1，整个检测分六步：

方法细节

Detecting Text in Fine-scale proposals

Recurrent Connectionist Text Proposals

Side-refinement

实验结果

总结与收获点

相关文章

论文阅读（Weilin Huang——【ECCV2016】Detecting Text in Natural Image with Connectionist Text Proposal Network）

Weilin Huang——【ECCV2016】Detecting Text in Natural Image with Connectionist Text Proposal Network

目录

作者和相关链接

几个关键的Idea出发点

方法概括

基本流程如Fig 1， 整个检测分六步：

方法细节

Detecting Text in Fine-scale proposals

Recurrent Connectionist Text Proposals

Side-refinement

实验结果

总结与收获点

相关文章

基本流程如Fig 1，整个检测分六步：