Paper: Deep Residual Learning for Image Recognition


Starting with this backbone network, I first came across the concept of pretrained weights.

The following is adapted from a note on deep learning weight initialization.

Deep learning is, at its core, the optimization of all the weights in a network toward an optimal solution; the layers whose weights need updating include convolutional layers, BN layers, and FC layers, among others. In this optimization, weight initialization is an important step toward reaching a good solution. If the weights are initialized poorly, the model may get stuck in a bad local optimum and predict poorly, or the loss may even oscillate and the model may fail to converge. Moreover, different initialization schemes can lead to very different final results. Mastering weight initialization is therefore one of the essential skills for any deep learning practitioner.
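As a concrete illustration, PyTorch exposes the common schemes through torch.nn.init; the snippet below is only a minimal sketch of applying He (Kaiming) and Xavier initialization to freshly built layers (the layer shapes are arbitrary).

import torch.nn as nn

conv = nn.Conv2d(3, 64, kernel_size=3, padding=1, bias=False)
fc = nn.Linear(512, 1000)

# He (Kaiming) initialization, commonly paired with ReLU activations
nn.init.kaiming_normal_(conv.weight, mode='fan_out', nonlinearity='relu')

# Xavier (Glorot) initialization, commonly paired with tanh/sigmoid activations
nn.init.xavier_uniform_(fc.weight)
nn.init.constant_(fc.bias, 0)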

Pretrained initialization is one of these weight initialization schemes.

In practice, we mostly modify an existing network and only prune or otherwise trim it after we get the expected results. A pretrained model has already been trained by someone else for many iterations on a specific dataset (such as ImageNet), so it can be regarded as a fairly good weight initialization. Initializing from a pretrained model lets us run engineering iterations at a much higher frequency and get results sooner; in addition, loading pretrained weights usually also gives better results.

However, a few caveats:

  1. Many articles point out that when the source domain and the target domain differ too much, loading a pretrained model may not be a good choice and can lead to a domain mismatch;
  2. Kaiming He's 2019 paper "Rethinking ImageNet Pre-training" argues that loading a pretrained model does not improve accuracy: with enough training iterations, random initialization can reach the same result. In practice, however, we cannot know in advance how many iterations the model needs before it saturates, and more iterations also mean a higher time cost.

Overall, if a pretrained model exists for the initial version of the network, we can initialize from it to get results quickly, and then modify the network as needed.
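The sketch below shows this typical workflow of loading pretrained weights and then modifying the network; it assumes torchvision is available and uses its older pretrained=True API, and the 10-class head is just a placeholder for whatever the target task needs.

import torch.nn as nn
import torchvision.models as models

# load a ResNet-18 whose weights were pretrained on ImageNet
model = models.resnet18(pretrained=True)

# example modification: replace the 1000-class ImageNet head with a 10-class one
model.fc = nn.Linear(model.fc.in_features, 10)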


What problem does ResNet actually solve? In the paper, Kaiming He frames it as addressing a degradation problem of deep networks, and does not directly claim to mitigate vanishing or exploding gradients; see the discussion Resnet到底在解决一个什么问题呢? - 知乎. With ResNet the network can be made much deeper. The paper also places a BN operation after each convolution and before the ReLU activation (we already used this in the earlier post CNN学习系列:骨干网络学习之VGG).
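As a small illustration of that ordering (a minimal sketch; the channel numbers are arbitrary), a convolution stage looks like conv -> BN -> ReLU:

import torch.nn as nn

# conv -> BN -> ReLU ordering described above
stage = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(64),    # BN directly after the convolution
    nn.ReLU(inplace=True)  # activation applied after BN
)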

For an introduction to ResNet, see An Overview of ResNet and its Variants.

The structure of ResNet:

(Figure: overall ResNet structure)

ResNet uses two kinds of residual structures, shown in the figure below.

(Figure: the two residual blocks)

The left one is called BasicBlock and is used in ResNet-18 and ResNet-34; the right one is called BottleNeck and is used in ResNet-50, ResNet-101, and ResNet-152.

After going through the official source code and some explanations on Zhihu, I wrote the following code and analysis.

import torch.nn as nn
from torch.nn import functional as F


def conv3x3(in_channels, out_channels, stride=(1, 1)):
    return nn.Conv2d(in_channels=in_channels, out_channels=out_channels,
                     kernel_size=(3, 3), stride=stride, padding=(1, 1), bias=False)


def conv1x1(in_channels, out_channels, stride=(1, 1)):
    return nn.Conv2d(
        in_channels=in_channels, out_channels=out_channels,
        kernel_size=(1, 1), stride=stride, bias=False)
With kernel_size 3, stride 1, and padding 1, the output spatial size of the convolution equals the input size.
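In general, output size = floor((input + 2 * padding - kernel_size) / stride) + 1; with kernel_size 3, stride 1, and padding 1 this gives floor((H + 2 - 3) / 1) + 1 = H. A quick shape check using the conv3x3 helper above (the input shape is chosen arbitrarily):

import torch

x = torch.randn(1, 64, 56, 56)
y = conv3x3(64, 128)(x)   # kernel 3, stride 1, padding 1
print(y.shape)            # torch.Size([1, 128, 56, 56]) -- only the channels change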

class BasicBlock(nn.Module):
    expansion = 1

    def __init__(self, in_channels, out_channels, stride=(1, 1), down_sample=None):
        super(BasicBlock, self).__init__()
        self.conv1 = conv3x3(in_channels, out_channels, stride)
        self.bn1 = nn.BatchNorm2d(out_channels)

        self.conv2 = conv3x3(out_channels, out_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.down_sample = down_sample
        self.stride = stride

    def forward(self, x):
        residual = x

        t = self.conv1(x)
        t = self.bn1(t)
        t = F.relu(t)

        t = self.conv2(t)
        t = self.bn2(t)

        if self.down_sample is not None:
            residual = self.down_sample(x)

        t += residual
        t = F.relu(t)

        return t

Note the expansion attribute here. Why define it? In BasicBlock the residual blocks are very regular: every convolution in a block outputs the same number of channels, so expansion is 1.

In BottleNeck, however, the last convolution expands the channels by a factor of 4, as shown in the figure.

Also, when I first read this I had a question: what does a 1x1 convolution mean?
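The way I understand it: a 1x1 convolution operates on each spatial position independently, so it only mixes channels; in the bottleneck it first reduces and finally restores the channel dimension, which keeps the 3x3 convolution in the middle cheap. A small sketch (the shapes are chosen arbitrarily):

import torch
import torch.nn as nn

x = torch.randn(1, 256, 56, 56)

reduce = nn.Conv2d(256, 64, kernel_size=1, bias=False)   # 1x1: 256 -> 64 channels
expand = nn.Conv2d(64, 256, kernel_size=1, bias=False)   # 1x1: 64 -> 256 channels

print(reduce(x).shape)                              # torch.Size([1, 64, 56, 56]) -- spatial size unchanged
print(sum(p.numel() for p in reduce.parameters()))  # 16384 parameters (256 * 64 * 1 * 1)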

class BottleNeck(nn.Module):
    expansion = 4

    def __init__(self, in_channels, out_channels, stride=(1, 1), down_sample=None):
        super(BottleNeck, self).__init__()
        self.conv1 = conv1x1(in_channels, out_channels)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = conv3x3(out_channels, out_channels, stride)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.conv3 = conv1x1(out_channels, out_channels * 4)
        self.bn3 = nn.BatchNorm2d(out_channels * 4)
        self.down_sample = down_sample
        self.stride = stride

    def forward(self, x):
        residual = x

        t = F.relu(self.bn1(self.conv1(x)))
        t = F.relu(self.bn2(self.conv2(t)))
        t = self.bn3(self.conv3(t))

        if self.down_sample is not None:
            residual = self.down_sample(x)

        t += residual
        t = F.relu(t)

        return t

Why is down-sampling (down_sample) needed?

If the output dimensions of the previous residual block do not match the dimensions required by the current one, the shortcut (residual) has to go through the down_sample module before it can be added to t; if the dimensions already match, we simply add with t += residual. In Kaiming He's paper this corresponds to the dashed shortcut lines in the architecture figure.
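A minimal sketch of the mismatch case, reusing the conv1x1 helper and BasicBlock defined above (the concrete channel and stride values are just an example):

import torch
import torch.nn as nn

# going from a 64-channel 56x56 feature map to a 128-channel 28x28 one:
# the identity shortcut cannot be added directly, so it is projected by a 1x1 conv
down_sample = nn.Sequential(
    conv1x1(64, 128, stride=(2, 2)),
    nn.BatchNorm2d(128),
)
block = BasicBlock(64, 128, stride=(2, 2), down_sample=down_sample)

x = torch.randn(1, 64, 56, 56)
print(block(x).shape)   # torch.Size([1, 128, 28, 28])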

class ResNet(nn.Module):

    # block selects the residual class (BasicBlock or BottleNeck); layers is the list of block counts per stage
    def __init__(self, block, layers: list, num_classes=1000):
        super(ResNet, self).__init__()

        # note that the number of channels and the spatial size of the image are two different things
        self.in_channels = 64
        self.conv1 = nn.Conv2d(
            in_channels=3, out_channels=64, kernel_size=(7, 7),
            stride=(2, 2), padding=(3, 3), bias=False
        )
        self.bn1 = nn.BatchNorm2d(64)

        self.layer2 = self.__make_layer(block, 64, layers[0])
        self.layer3 = self.__make_layer(block, 128, layers[1], stride=(2, 2))
        self.layer4 = self.__make_layer(block, 256, layers[2], stride=(2, 2))
        self.layer5 = self.__make_layer(block, 512, layers[3], stride=(2, 2))
        self.avg_pool = nn.AdaptiveAvgPool2d((1, 1))

        self.fc = nn.Linear(512 * block.expansion, num_classes)

        for m in self.modules():
            # isinstance checks whether an object is an instance of a given type
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

    def __make_layer(self, block, out_channels, num, stride=(1, 1)):
        down_sample = None

        if stride != (1, 1) or self.in_channels != out_channels * block.expansion:
            down_sample = nn.Sequential(
                conv1x1(self.in_channels, out_channels * block.expansion, stride),
                nn.BatchNorm2d(out_channels * block.expansion),
            )

        layers = list()
        # the first residual block of each stage (the one that may change channels/stride) goes in first
        layers.append(block(self.in_channels, out_channels, stride, down_sample))
        self.in_channels = out_channels * block.expansion
        # the remaining num - 1 blocks keep the channel count fixed
        for _ in range(1, num):
            layers.append(block(self.in_channels, out_channels))

        return nn.Sequential(*layers)

    def forward(self, t):
        t = self.bn1(self.conv1(t))
        t = F.relu(t)
        t = F.max_pool2d(t, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))

        t = self.layer2(t)
        t = self.layer3(t)
        t = self.layer4(t)
        t = self.layer5(t)

        t = self.avg_pool(t)
        t = t.view(t.size(0), -1)
        t = self.fc(t)

        return t

When a ResNet is initialized, the input first goes through self.conv1. With kernel_size 7 and padding 3, the output size would equal the input size (still 224) if stride were 1; but here stride is 2, so the spatial size is halved to 112. This corresponds to the architecture table in the paper.

PS: note that stride 2 does not always halve the spatial size; the output size is determined by kernel_size, padding, and the other parameters together.
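Using the same formula as before, the first layer gives floor((224 + 2 * 3 - 7) / 2) + 1 = floor(223 / 2) + 1 = 111 + 1 = 112. A quick check, assuming the standard 3x224x224 input:

import torch
import torch.nn as nn

conv1 = nn.Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
x = torch.randn(1, 3, 224, 224)
print(conv1(x).shape)   # torch.Size([1, 64, 112, 112])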

In addition, the official source code has some more code below the for loop in the ResNet class:

# Zero-initialize the last BN in each residual branch,
# so that the residual branch starts with zeros, and each residual block behaves like an identity.
# This improves the model by 0.2~0.3% according to https://arxiv.org/abs/1706.02677
if zero_init_residual:
    for m in self.modules():
        if isinstance(m, Bottleneck):
            nn.init.constant_(m.bn3.weight, 0)
        elif isinstance(m, BasicBlock):
            nn.init.constant_(m.bn2.weight, 0)

According to the comment, this change improves accuracy by about 0.2%~0.3%: with the last BN's scale initialized to zero, each residual branch starts out producing zeros, so every residual block initially behaves like an identity mapping. I will leave a closer study of it for later.
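A minimal check of that effect, using the BasicBlock defined above (stride 1, no down-sampling): after zeroing bn2.weight, the block output is just ReLU applied to its input.

import torch
import torch.nn as nn

block = BasicBlock(64, 64)                      # identity shortcut, no down_sample
nn.init.constant_(block.bn2.weight, 0)          # zero the last BN's scale (gamma)
block.eval()

x = torch.randn(1, 64, 56, 56)
print(torch.allclose(block(x), torch.relu(x)))  # True: the residual branch contributes nothing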

from torch.utils import model_zoo

# model_urls is the name -> pretrained-weight-URL dict defined in torchvision.models.resnet


def resnet18(pretrained=False):
    model = ResNet(BasicBlock, [2, 2, 2, 2])
    if pretrained:
        model.load_state_dict(model_zoo.load_url(model_urls['resnet18']))

    return model


def resnet34(pretrained=False):
    model = ResNet(BasicBlock, [3, 4, 6, 3])
    if pretrained:
        model.load_state_dict(model_zoo.load_url(model_urls['resnet34']))

    return model


def resnet50(pretrained=False):
    model = ResNet(BottleNeck, [3, 4, 6, 3])
    if pretrained:
        model.load_state_dict(model_zoo.load_url(model_urls['resnet50']))

    return model


def resnet101(pretrained=False):
    model = ResNet(BottleNeck, [3, 4, 23, 3])
    if pretrained:
        model.load_state_dict(model_zoo.load_url(model_urls['resnet101']))

    return model


def resnet152(pretrained=False):
    model = ResNet(BottleNeck, [3, 8, 36, 3])
    if pretrained:
        model.load_state_dict(model_zoo.load_url(model_urls['resnet152']))

    return model

Finally, let's look at the structure of resnet18:

# summary is assumed to come from the torchsummary package
from torchsummary import summary

model = resnet18()
model = model.cuda()
summary(model, (3, 224, 224))

The output:

----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 64, 112, 112] 9,408
BatchNorm2d-2 [-1, 64, 112, 112] 128
Conv2d-3 [-1, 64, 56, 56] 36,864
BatchNorm2d-4 [-1, 64, 56, 56] 128
Conv2d-5 [-1, 64, 56, 56] 36,864
BatchNorm2d-6 [-1, 64, 56, 56] 128
BasicBlock-7 [-1, 64, 56, 56] 0
Conv2d-8 [-1, 64, 56, 56] 36,864
BatchNorm2d-9 [-1, 64, 56, 56] 128
Conv2d-10 [-1, 64, 56, 56] 36,864
BatchNorm2d-11 [-1, 64, 56, 56] 128
BasicBlock-12 [-1, 64, 56, 56] 0
Conv2d-13 [-1, 128, 28, 28] 73,728
BatchNorm2d-14 [-1, 128, 28, 28] 256
Conv2d-15 [-1, 128, 28, 28] 147,456
BatchNorm2d-16 [-1, 128, 28, 28] 256
Conv2d-17 [-1, 128, 28, 28] 8,192
BatchNorm2d-18 [-1, 128, 28, 28] 256
BasicBlock-19 [-1, 128, 28, 28] 0
Conv2d-20 [-1, 128, 28, 28] 147,456
BatchNorm2d-21 [-1, 128, 28, 28] 256
Conv2d-22 [-1, 128, 28, 28] 147,456
BatchNorm2d-23 [-1, 128, 28, 28] 256
BasicBlock-24 [-1, 128, 28, 28] 0
Conv2d-25 [-1, 256, 14, 14] 294,912
BatchNorm2d-26 [-1, 256, 14, 14] 512
Conv2d-27 [-1, 256, 14, 14] 589,824
BatchNorm2d-28 [-1, 256, 14, 14] 512
Conv2d-29 [-1, 256, 14, 14] 32,768
BatchNorm2d-30 [-1, 256, 14, 14] 512
BasicBlock-31 [-1, 256, 14, 14] 0
Conv2d-32 [-1, 256, 14, 14] 589,824
BatchNorm2d-33 [-1, 256, 14, 14] 512
Conv2d-34 [-1, 256, 14, 14] 589,824
BatchNorm2d-35 [-1, 256, 14, 14] 512
BasicBlock-36 [-1, 256, 14, 14] 0
Conv2d-37 [-1, 512, 7, 7] 1,179,648
BatchNorm2d-38 [-1, 512, 7, 7] 1,024
Conv2d-39 [-1, 512, 7, 7] 2,359,296
BatchNorm2d-40 [-1, 512, 7, 7] 1,024
Conv2d-41 [-1, 512, 7, 7] 131,072
BatchNorm2d-42 [-1, 512, 7, 7] 1,024
BasicBlock-43 [-1, 512, 7, 7] 0
Conv2d-44 [-1, 512, 7, 7] 2,359,296
BatchNorm2d-45 [-1, 512, 7, 7] 1,024
Conv2d-46 [-1, 512, 7, 7] 2,359,296
BatchNorm2d-47 [-1, 512, 7, 7] 1,024
BasicBlock-48 [-1, 512, 7, 7] 0
AdaptiveAvgPool2d-49 [-1, 512, 1, 1] 0
Linear-50 [-1, 1000] 513,000
================================================================
Total params: 11,689,512
Trainable params: 11,689,512
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.57
Forward/backward pass size (MB): 43.65
Params size (MB): 44.59
Estimated Total Size (MB): 88.82
----------------------------------------------------------------

Finally, here is a diagram of resnet18 that I drew with Visio.