Gather-Excite: Exploiting Feature Context in Convolutional Neural Networks

NIPS 2018, by the same authors as SENet.

A common pattern in CV algorithms

augment functions that perform local decisions with functions that operate on a larger context, providing a cue for resolving local ambiguities

using simple aggregations of low level features can be effective at encoding contextual information for visual tasks.

Why does using context often improve the performance of object detection algorithms?

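The role of context is providing a cue for resolving local ambiguities.
In theory, a CNN's receptive field is very large and should already cover enough context. In practice, however, the effective receptive field is usually much smaller than the theoretical one. Explicitly exploiting context therefore helps resolve local ambiguities, which is why introducing context often improves performance.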
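To make the "theoretical receptive field is large" claim concrete, the following is a minimal sketch of the standard receptive-field arithmetic for a stack of convolutions. The function name and the layer stacks are hypothetical and chosen only for illustration. It computes the theoretical receptive field only; the effective receptive field is an empirical quantity and is not modelled here.

```python
def theoretical_receptive_field(layers):
    """Return the theoretical receptive field (in input pixels) of a conv stack.

    `layers` is a list of (kernel_size, stride) pairs, ordered input to output.
    """
    rf = 1    # a single input pixel sees itself
    jump = 1  # distance in input pixels between adjacent units of the current layer
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


# Hypothetical stacks, for illustration only.
print(theoretical_receptive_field([(3, 1)] * 3))  # three 3x3 convs, stride 1 -> 7
print(theoretical_receptive_field([(3, 2)] * 3))  # three 3x3 convs, stride 2 -> 15
```

The theoretical field grows quickly once strides are involved, which is the sense in which it "should" already cover enough context.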

Why pursue context exploitation? Because a larger context helps in resolving local ambiguities.

How can the Squeeze-and-Excitation network be viewed from the Gather-Excite perspective?

What SENet does is reweighting feature channels as a function of features from the full extent of input.

SENet's gather operation: the squeeze operator (implemented with global average pooling) acts as a lightweight context aggregator.

Therefore, Squeeze-and-Excitation Networks can be viewed as a particular GE pairing, in which the gather operator is a parameter-free operation (global average pooling) and the excite operator is a fully connected subnetwork.
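This pairing can be sketched in plain Python, with a feature tensor stored as a list of C channels, each an H x W list of lists. It is an illustrative sketch, not the authors' implementation: the function names are hypothetical, and the excite subnetwork is assumed to be the usual FC, ReLU, FC, sigmoid bottleneck with weights supplied by the caller.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gather_global_avg(x):
    """Gather: parameter-free global average pooling, one scalar per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]


def dense(vec, weights, bias):
    """Fully connected layer; weights[o][i] maps input i to output o."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def squeeze_excite(x, w1, b1, w2, b2):
    """Gather with global average pooling, then excite with an FC subnetwork."""
    z = gather_global_avg(x)                        # C aggregates
    hidden = [max(0.0, h) for h in dense(z, w1, b1)]  # FC + ReLU (assumed design)
    gates = [sigmoid(s) for s in dense(hidden, w2, b2)]  # FC + sigmoid, one gate per channel
    # Every location in a channel is scaled by the same gate.
    return [[[v * g for v in row] for row in ch] for ch, g in zip(x, gates)]


# Hypothetical input (C=2, H=W=2) and all-zero weights, so every gate is 0.5.
x = [[[1.0, 2.0], [3.0, 4.0]], [[2.0, 2.0], [2.0, 2.0]]]
y = squeeze_excite(x, w1=[[0.0, 0.0]], b1=[0.0], w2=[[0.0], [0.0]], b2=[0.0, 0.0])
```

Because the gather step collapses each channel to a single scalar, the resulting gate cannot vary across spatial locations.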

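The convolution operator is a local operator. SENet succeeds because its squeeze operation (implemented with global average pooling, which is essentially a lightweight context aggregator) lets the network be guided by global information from its early layers onward. In the paper's words, this idea is capturing contextual long-range feature interactions, i.e. context exploitation.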

The Gather-Excite operation

The gather operator aggregates contextual information across large neighbourhoods of each feature map; in other words, it aggregates neuron responses over a given spatial extent.
The excite operator modulates the feature maps by conditioning on the aggregates; in other words, it takes in both the aggregates and the original input to produce a new tensor with the same dimensions of the original input.

define a gather operator with extent ratio e

How does the spatial extent come into play?

The gather operator is essentially a depth-wise convolution: for each output location u of the channel c, the gather operator has a receptive field of the input that lies within a single channel and has an area bounded by $(2e - 1)^2$.
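As a rough illustration of the extent ratio, here is a sketch of a parameter-free gather on a single channel, written in plain Python. Several choices are assumptions made for this sketch rather than details confirmed by these notes: the aggregate is an average, the window is (2e - 1) x (2e - 1) with stride e, windows are clipped at the borders, and the output size is ceil(H / e) x ceil(W / e). The function name `gather_avg` is hypothetical.

```python
import math


def gather_avg(channel, e):
    """Average one H x W channel over (2e-1) x (2e-1) windows with stride e.

    Depth-wise behaviour follows from applying this to each channel separately.
    """
    H, W = len(channel), len(channel[0])
    pad = e - 1  # half-width of a (2e - 1)-wide window
    out = []
    for i in range(math.ceil(H / e)):
        row = []
        for j in range(math.ceil(W / e)):
            ci, cj = i * e, j * e  # window centre in input coordinates
            vals = [channel[r][c]
                    for r in range(max(0, ci - pad), min(H, ci + pad + 1))
                    for c in range(max(0, cj - pad), min(W, cj + pad + 1))]
            row.append(sum(vals) / len(vals))  # average over valid entries only
        out.append(row)
    return out


# Hypothetical 4x4 channel holding the values 0..15, gathered with e = 2.
channel = [[r * 4 + c for c in range(4)] for r in range(4)]
aggregates = gather_avg(channel, e=2)  # 2x2 map of local averages
```

Each output value depends on at most (2e - 1)^2 inputs from one channel, which is the area bound stated above.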

excite operator

The excite operator is also responsible for upsampling the aggregates back to the original size.

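Finally, the Gather-Excite operator can be expressed in a single formula:

$y = x \odot \sigma(\mathrm{interp}(\xi_G(x)))$

where $\xi_G$ is the gather operator, $\sigma$ is the sigmoid function, $\odot$ is element-wise multiplication, and interp(·) denotes resizing to the original input size via nearest neighbour interpolation.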
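The excite step (resize the aggregates to the input size with nearest neighbour interpolation, then modulate the input) can be sketched for a single channel as follows. The sigmoid gate and the index mapping used for the resize (`dst * in_size // out_size`) are assumptions of this sketch, and the function names are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def interp_nearest(agg, out_h, out_w):
    """Resize a small 2D map to out_h x out_w by nearest-neighbour lookup."""
    in_h, in_w = len(agg), len(agg[0])
    return [[agg[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
            for i in range(out_h)]


def excite(x, agg):
    """Modulate channel x (H x W) by gates derived from its aggregates."""
    H, W = len(x), len(x[0])
    up = interp_nearest(agg, H, W)  # back to the original H x W size
    return [[x[i][j] * sigmoid(up[i][j]) for j in range(W)] for i in range(H)]


# Hypothetical 4x4 channel and 2x2 aggregates.
x = [[2.0] * 4 for _ in range(4)]
agg = [[0.0, 100.0], [-100.0, 0.0]]
y = excite(x, agg)  # same 4x4 shape as x
```

With these values, the top-left block is halved (gate 0.5), the top-right block passes through almost unchanged, and the bottom-left block is almost fully suppressed.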

The authors were inspired by the success of bag-of-visual-words, a classic example of pooling information from local descriptors to build a global image representation, going from local to global. Similarly, the convolution operator also extracts a local descriptor, and it likewise needs guidance from contextual information over a range larger than the local neighbourhood.

Relationship to SENet, BAM, and CBAM

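SENet weights only channels, whereas GENet, BAM, and CBAM weight both the spatial and channel dimensions (that is, every point of the feature map tensor). SENet is a special case of GENet: when the selection operator covers the whole feature map, the form is the same as SENet, with the same weight applied to all points within a channel. Concretely, SENet is the special case of GENet in which the gather operator is the parameter-free global average pooling operation and the excite operator is a fully connected subnetwork.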

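In terms of spatial attention, SENet differs from the other three because it applies no attention along the spatial axis. BAM and CBAM split attention into a spatial axis and a channel axis: the spatial attention weight at a given location is the same across all channels, so fusing the spatial attention weight with the channel attention weight involves a broadcasting step. GENet instead computes the weights element by element over the 3D feature tensor, without separating the spatial and channel dimensions, so every element along both the channel and spatial dimensions is visited in turn. Because the context patch is 2D (it lies within a single channel), the spatial weights differ across channels even at the same spatial location.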
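The difference in weight structure can be illustrated with a small plain-Python sketch. All values and function names below are hypothetical. The fusion of channel and spatial weights is shown as an element-wise product purely for illustration; BAM and CBAM each combine the two maps in their own way.

```python
def se_style_weights(channel_w, H, W):
    """SE-style: one weight per channel, repeated at every spatial location."""
    return [[[w] * W for _ in range(H)] for w in channel_w]


def fused_weights(channel_w, spatial_w):
    """Broadcast a C-vector against an H x W map:
    w[c][i][j] = channel_w[c] * spatial_w[i][j] (product fusion is an assumption).
    """
    return [[[cw * sw for sw in row] for row in spatial_w] for cw in channel_w]


channel_w = [0.2, 0.8]                # hypothetical channel attention, C = 2
spatial_w = [[0.5, 1.0], [1.0, 0.5]]  # hypothetical spatial attention, H = W = 2

se_w = se_style_weights(channel_w, 2, 2)  # constant within each channel
bc_w = fused_weights(channel_w, spatial_w)  # same spatial pattern in every channel

# GE-style gates form a full C x H x W tensor, so the spatial pattern
# is free to differ from one channel to the next (hypothetical values).
ge_w = [[[0.9, 0.1], [0.1, 0.9]],
        [[0.1, 0.9], [0.9, 0.1]]]
```

In `bc_w` every channel shares the spatial pattern of `spatial_w` up to a per-channel scale, whereas `ge_w` has no such constraint.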
