NIPS 2018; the authors are the same as the authors of SENet.
augment functions that perform local decisions with functions that operate on a larger context, providing a cue for resolving local ambiguities
using simple aggregations of low level features can be effective at encoding contextual information for visual tasks
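- The role of context is providing a cue for resolving local ambiguities.
- A CNN's theoretical receptive field is actually very large, so in principle it should already cover enough context. In practice, however, the effective receptive field is usually much smaller than the theoretical one, so a mechanism that explicitly exploits context to resolve local ambiguities tends to improve performance.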
Why pursue context exploitation? Because a larger context helps in resolving local ambiguities.
What SENet does is reweighting feature channels as a function of features from the full extent of input.
SENet's Gather operation: the squeeze operator (implemented with Global Average Pooling) acts as a lightweight context aggregator.
Therefore Squeeze-and-Excitation Networks can be viewed as a particular GE pairing, in which the gather operator is a parameter-free operation (global average pooling) and the excite operator is a fully connected subnetwork.
The convolution operator is a local operator. SENet succeeds because its Squeeze operation (implemented with Global Average Pooling, and in essence a lightweight context aggregator) lets the network be guided by global information from its early layers onward. In the paper's words, this idea is called capturing contextual long-range feature interactions, i.e. context exploitation.
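To make the "SENet as a GE pairing" view concrete, here is a minimal NumPy sketch. It assumes the usual SE layout (global average pooling, a two-layer bottleneck with ReLU, then a sigmoid gate); the weight names `w1`, `w2`, the reduction ratio `r`, and the omission of biases are illustrative choices, not taken from these notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block(x, w1, w2):
    """SE block viewed as gather + excite.

    x:  feature map of shape (C, H, W)
    w1: hypothetical bottleneck weights, shape (C // r, C)
    w2: hypothetical expansion weights, shape (C, C // r)
    """
    # Gather (squeeze): parameter-free global average pooling per channel.
    s = x.mean(axis=(1, 2))                  # (C,)
    # Excite: small fully connected subnetwork, ReLU then sigmoid (biases omitted).
    h = np.maximum(w1 @ s, 0.0)              # (C // r,)
    gate = sigmoid(w2 @ h)                   # (C,)
    # Every spatial location of a channel receives the same weight.
    return x * gate[:, None, None], gate

rng = np.random.default_rng(0)
C, H, W, r = 8, 5, 5, 4
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))
y, gate = se_block(x, w1, w2)
```

The gate has one entry per channel, which is what "reweighting feature channels" amounts to in this setting.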
- The Gather operator aggregates contextual information across large neighbourhoods of each feature map; put differently, it aggregates neuron responses over a given spatial extent.
- The Excite operator modulates the feature maps by conditioning on the aggregates; put differently, it takes in both the aggregates and the original input to produce a new tensor with the same dimensions as the original input.
define a gather operator with extent ratio e
How does the spatial extent show up?
Gather is essentially a depth-wise conv: for each output location u of channel c, the gather operator has a receptive field of the input that lies within a single channel and has an area bounded by (2e − 1)².
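A parameter-free gather with extent ratio e can be sketched as per-channel average pooling. The specific choices below (window 2e − 1, stride e, zero padding e − 1, padded zeros counted in the mean, output size ceil(H/e) × ceil(W/e)) are assumptions picked so that each output location sees at most (2e − 1)² inputs from a single channel; the function name `gather_avg` is hypothetical.

```python
import numpy as np

def gather_avg(x, e):
    """Per-channel (depth-wise) average pooling with extent ratio e.

    x: feature map of shape (C, H, W).
    Assumed settings: window 2e-1, stride e, zero padding e-1.
    """
    C, H, W = x.shape
    k, p = 2 * e - 1, e - 1
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))      # zero padding on H and W only
    Ho, Wo = -(-H // e), -(-W // e)               # ceil(H / e), ceil(W / e)
    out = np.empty((C, Ho, Wo))
    for i in range(Ho):
        for j in range(Wo):
            # Window of area (2e-1)^2, restricted to one channel at a time.
            win = xp[:, i * e:i * e + k, j * e:j * e + k]
            out[:, i, j] = win.mean(axis=(1, 2))  # padded zeros are included
    return out

x = np.ones((2, 7, 7))
pooled = gather_avg(x, e=2)   # shape (2, 4, 4)
```

Because the mean is taken over axes 1 and 2 only, channels never mix, which is the depth-wise property described above.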
excite operator
The excite operator is also responsible for upsampling back to the original size.
Finally, the Gather-Excite operator can be expressed in a single formula:
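Written out under the definitions in these notes, the operator plausibly takes the following form. This is a reconstruction under assumptions: ξ_G stands for the gather operator, σ for a sigmoid gate (as in SENet), ⊙ for element-wise multiplication, and the superscript c indexes the channel.

```latex
y^{c} = x^{c} \odot \sigma\!\left(\operatorname{interp}\!\left(\xi_G(x)^{c}\right)\right)
```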
where interp(·) denotes resizing to the original input size via nearest neighbour interpolation.
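The whole gather, resize, gate, multiply pipeline can be sketched in a few lines of NumPy. Two simplifications are assumed here: the gather is a non-overlapping e × e average pool (a stand-in that requires H and W to be divisible by e), and the gate is a sigmoid. The function name `gather_excite` is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gather_excite(x, e):
    """Parameter-free gather-excite sketch on a (C, H, W) feature map."""
    C, H, W = x.shape
    assert H % e == 0 and W % e == 0, "this sketch assumes e divides H and W"
    # Gather: per-channel e x e average pooling (simplified, non-overlapping).
    g = x.reshape(C, H // e, e, W // e, e).mean(axis=(2, 4))   # (C, H/e, W/e)
    # interp: nearest neighbour upsampling back to (C, H, W).
    up = g.repeat(e, axis=1).repeat(e, axis=2)
    # Excite: gate the input element-wise with values in (0, 1).
    return x * sigmoid(up)

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 4, 4))
y_local = gather_excite(x, e=2)    # weights vary across space and channel
y_global = gather_excite(x, e=4)   # one weight per channel (global extent)
```

With e equal to the full spatial size, the gate collapses to one value per channel, which is the SENet-like limit of the operator.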
The authors were inspired by the success of bag-of-visual-words, a successful example of pooling information from local descriptors to build a global image representation, going from local to global. Like bag-of-visual-words, the convolutional operator also extracts a local descriptor, and we likewise need guidance from contextual information that covers a larger range than the local one.
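SENet only weights channels; GENet, BAM and CBAM all weight both the spatial and the channel dimensions (that is, every point of the whole feature map tensor). SENet is a special case of GENet: when the extent of the gather operator covers the whole feature map, the form is the same as SENet, and all points within a channel receive the same weight. Concretely, SENet is the special case of GENet in which the gather operator is a parameter-free global average pooling and the excite operator is a fully connected subnetwork.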
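On spatial attention, SENet differs from the other three because it applies no attention along the spatial axis. BAM and CBAM split attention into two dimensions, the spatial axis and the channel axis. The weight computed by spatial attention for a given spatial location is the same across all channels at that location, so fusing the spatial attention weight with the channel attention weight involves a broadcasting step. GENet instead computes weights pixel by pixel over the 3D feature tensor: spatial and channel are not separated, and every pixel across both channel and spatial positions is visited in turn. Because the context patch is 2D, the spatial weights differ across channels even at the same spatial location.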
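The difference comes down to the shape of the weight tensor before broadcasting. The sketch below uses random placeholder weights (not real attention outputs) and fuses the channel and spatial weights by plain multiplication, which is a simplification and not the exact BAM or CBAM fusion rule.

```python
import numpy as np

C, H, W = 4, 6, 6
rng = np.random.default_rng(0)
x = rng.standard_normal((C, H, W))

# SENet-style: one weight per channel, shared by every spatial location.
w_channel = rng.uniform(size=(C, 1, 1))
# BAM/CBAM-style spatial attention: one weight per location, shared by every channel.
w_spatial = rng.uniform(size=(1, H, W))
# GENet-style: an independent weight for every (channel, location) pair.
w_ge = rng.uniform(size=(C, H, W))

y_se = x * w_channel                 # broadcast over H and W
w_fused = w_channel * w_spatial      # broadcast to (C, H, W), but separable
y_fused = x * w_fused
y_ge = x * w_ge                      # no broadcasting needed

# The fused map has rank 1 when flattened to (C, H*W); the GE map generally does not.
rank_fused = np.linalg.matrix_rank(w_fused.reshape(C, -1))
rank_ge = np.linalg.matrix_rank(w_ge.reshape(C, -1))
```

A separable (channel × spatial) weight map forces every channel to share the same spatial pattern up to a scale, whereas the GE-style map lets each channel carry its own.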