[Paper Review] Very Deep Convolutional Networks for Large-Scale Image Recognition, VGG Net

탱젤 2021. 2. 16. 02:17

Finally posting a paper summary I wrote up a whole year ago :)

I think this is a good paper to start with when getting into image recognition.

Very Deep Convolutional Networks for Large-Scale Image Recognition arxiv.org/abs/1409.1556

Very Deep Convolutional Networks for Large-Scale Image Recognition

Abstract

Evaluates networks of increasing depth using an architecture with very small 3x3 convolution filters
Pushing the depth to 16-19 weight layers gives a significant improvement over prior configurations

1. Introduction

Convolutional Networks (ConvNets): have recently had great success in large-scale image & video recognition
This paper's approach to improving accuracy: it addresses depth, one aspect of ConvNet architecture design
- Increases the depth of the network by adding more convolutional layers (feasible because very small 3x3 convolution filters are used in all layers)
Result: good accuracy on the ILSVRC[1] classification and localisation tasks

2. ConvNet Configurations

2.1 Architecture

During training, the input to the ConvNets is a fixed-size 224x224 RGB image
The only preprocessing: subtracting the mean RGB value, computed on the training set, from each pixel
The input image is passed through a stack of conv layers (using 3x3 filters)
The convolution stride is fixed to 1 pixel
Convolution padding: 1 pixel for 3x3 conv layers
- The spatial padding of a conv layer is chosen so that the spatial resolution is preserved after convolution
Max pooling: performed over a 2x2 pixel window with stride 2
- Spatial pooling is carried out by five max pooling layers (not every conv layer is followed by a pooling layer)
The stack of conv layers is followed by three Fully-Connected (FC) layers
- Conv layer stack: its depth differs between architectures.
- Fully-Connected layers: their configuration is the same in all architectures.
- The first two FC layers: 4096 channels each
- The last FC layer: performs 1000-way ILSVRC classification → has 1000 channels, one per class (followed by the final soft-max layer)
All hidden layers: equipped with the rectification non-linearity (ReLU)
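To make the padding and pooling rules above concrete, here is a minimal sketch that traces the feature-map side length through the five conv blocks. The helper names (`conv_out`, `pool_out`, `trace_sizes`) are made up for this note, and the per-block layer counts default to configuration D as an assumption.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Output side length of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window=2, stride=2):
    """Output side length of max pooling without padding."""
    return (size - window) // stride + 1


def trace_sizes(input_size=224, convs_per_block=(2, 2, 3, 3, 3)):
    """Return the feature-map side length after each of the five conv blocks.

    convs_per_block defaults to configuration D (assumption of this sketch);
    the counts do not change the result because padded 3x3 convs keep the
    resolution unchanged.
    """
    sizes = []
    size = input_size
    for n_convs in convs_per_block:
        for _ in range(n_convs):
            size = conv_out(size)  # 3x3, stride 1, padding 1: size preserved
        size = pool_out(size)      # 2x2, stride 2: size halved
        sizes.append(size)
    return sizes


print(trace_sizes())  # [112, 56, 28, 14, 7]
```

The last block therefore outputs a 7x7 map with 512 channels, which is 7 * 7 * 512 = 25088 values going into the first FC layer.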

2.2 Configurations

All configurations follow the same design and differ only in depth
- From 11 weight layers in network A (8 conv + 3 FC layers) to 19 weight layers in network E (16 conv + 3 FC layers)
The width of the conv layers (number of channels) is rather small: it starts at 64 in the first layer and doubles after each max-pooling layer until it reaches 512
Table 1) ConvNet Configuration

- The depth of the configurations increases from left (A) to right (E) as more layers are added (added layers are shown in bold)

- Conv layer parameters are denoted as “conv<receptive field size>-<number of channels>”, e.g. conv3-64 (ReLU is not shown)

Table 2) Number of parameters (for each configuration)

- Despite their large depth, these networks do not have more weights than a shallower network with wider conv layers and larger receptive fields.
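The parameter counts in Table 2 can be roughly reproduced by hand. The sketch below is my own bookkeeping, not code from the paper: the layer lists are transcribed from Table 1 of the paper, biases are counted (an assumption), and `CONFIGS` and `count_parameters` are names invented here.

```python
# (kernel_size, out_channels) per conv layer; "M" marks a 2x2 max-pooling layer.
CONFIGS = {
    "A": [(3, 64), "M", (3, 128), "M", (3, 256), (3, 256), "M",
          (3, 512), (3, 512), "M", (3, 512), (3, 512), "M"],
    "B": [(3, 64), (3, 64), "M", (3, 128), (3, 128), "M", (3, 256), (3, 256), "M",
          (3, 512), (3, 512), "M", (3, 512), (3, 512), "M"],
    "C": [(3, 64), (3, 64), "M", (3, 128), (3, 128), "M",
          (3, 256), (3, 256), (1, 256), "M",
          (3, 512), (3, 512), (1, 512), "M",
          (3, 512), (3, 512), (1, 512), "M"],
    "D": [(3, 64), (3, 64), "M", (3, 128), (3, 128), "M",
          (3, 256), (3, 256), (3, 256), "M",
          (3, 512), (3, 512), (3, 512), "M",
          (3, 512), (3, 512), (3, 512), "M"],
    "E": [(3, 64), (3, 64), "M", (3, 128), (3, 128), "M",
          (3, 256), (3, 256), (3, 256), (3, 256), "M",
          (3, 512), (3, 512), (3, 512), (3, 512), "M",
          (3, 512), (3, 512), (3, 512), (3, 512), "M"],
}


def count_parameters(layers, in_channels=3, input_size=224,
                     fc_sizes=(4096, 4096, 1000)):
    """Count weights and biases of the conv stack plus the three FC layers."""
    total = 0
    channels = in_channels
    size = input_size
    for layer in layers:
        if layer == "M":
            size //= 2  # max pooling halves the spatial resolution
            continue
        kernel, out_channels = layer
        total += kernel * kernel * channels * out_channels + out_channels
        channels = out_channels
    features = channels * size * size  # flattened input of the first FC layer
    for units in fc_sizes:
        total += features * units + units
        features = units
    return total


for name, layers in CONFIGS.items():
    print(name, round(count_parameters(layers) / 1e6), "M")
# A 133 M, B 133 M, C 134 M, D 138 M, E 144 M
```

Most of the weights sit in the first FC layer (25088 x 4096), which is why adding conv layers from A to E changes the total so little.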

2.3 Discussion

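Very small 3x3 receptive fields are used throughout the whole network (stride 1)
A stack of two 3x3 conv layers with no spatial pooling in between has an effective receptive field of 5x5; three 3x3 conv layers have a 7x7 effective receptive field
- What is gained by using three 3x3 conv layers instead of a single 7x7 layer?
  1. Three non-linear rectification layers are used instead of one, which makes the decision function more non-linear
    - features become more discriminative
  2. Fewer parameters to learn
    - Assuming the input and output of a three-layer 3x3 conv stack both have C channels, the stack is parametrised by 3(3^2 C^2) = 27C^2 weights, whereas a single 7x7 layer needs 7^2 C^2 = 49C^2 parameters, i.e. about 81% more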
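The two numbers in this argument (receptive field and weight count) are easy to check. A small sketch, with helper names of my own choosing and an arbitrary channel count C = 256:

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Effective receptive field of stacked stride-1 convs with no pooling in between."""
    return 1 + num_layers * (kernel - 1)


def conv_weights(kernel, channels, num_layers=1):
    """Weights (biases ignored) of num_layers convs with `channels` inputs and outputs."""
    return num_layers * kernel * kernel * channels * channels


C = 256  # arbitrary channel count chosen for illustration
stack = conv_weights(3, C, num_layers=3)  # 27 * C^2
single = conv_weights(7, C)               # 49 * C^2

print(stacked_receptive_field(2), stacked_receptive_field(3))  # 5 7
print(round(single / stack, 2))                                # 1.81
```

The ratio 49/27 does not depend on C, so the single 7x7 layer always needs about 81% more weights than the three-layer stack.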

3. Classification Framework

3.1 Training

Method) Optimise the multinomial logistic regression objective using mini-batch gradient descent (based on back-propagation) with momentum
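As a rough picture of what one optimisation step looks like, here is a sketch of a momentum update with L2 weight decay on a flat list of weights. The exact formulation (Caffe-style velocity update) is my assumption, and the default values are the hyper-parameters reported in the paper (learning rate 10^(-2), momentum 0.9, weight decay 5*10^(-4)); `sgd_momentum_step` is a made-up name.

```python
def sgd_momentum_step(weights, grads, velocity,
                      lr=1e-2, momentum=0.9, weight_decay=5e-4):
    """One in-place step of mini-batch SGD with momentum and L2 weight decay.

    Assumed formulation:
        v <- momentum * v - lr * (grad + weight_decay * w)
        w <- w + v
    `grads` is the loss gradient averaged over the mini-batch.
    """
    for i in range(len(weights)):
        velocity[i] = momentum * velocity[i] - lr * (grads[i] + weight_decay * weights[i])
        weights[i] += velocity[i]
    return weights, velocity


w, v = [1.0, -2.0], [0.0, 0.0]
w, v = sgd_momentum_step(w, grads=[0.5, -0.1], velocity=v)
print(w, v)
```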

batch size: 256, momentum: 0.9
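Regularised by weight decay (L2 penalty multiplier set to 5*10^(-4))
Dropout regularisation on the first two FC layers (dropout ratio: 0.5)
Learning rate: initially set to 10^(-2), then decreased by a factor of 10 whenever the validation set accuracy stopped improving → decreased 3 times in total; training stopped after 370K iterations (74 epochs)
Initialisation of the network weights is important, since bad initialisation can stall learning due to the instability of gradients in deep nets; to avoid this, training starts with configuration A (Table 1), which is shallow enough to be trained with random initialisation
Then, when training the deeper architectures, the first four conv layers and the last three FC layers are initialised with the layers of net A
- The intermediate layers are initialised randomly
- For random initialisation, the weights are sampled from a normal distribution with zero mean and 10^(-2) variance, and the biases are initialised to zero
To obtain the fixed-size 224x224 ConvNet input images, the rescaled training images are randomly cropped
- One crop per image per SGD (Stochastic Gradient Descent) iteration
- To further augment the training set, the crops undergo random horizontal flipping and random RGB colour shift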

Training image size

Training scale S: the smallest side of an isotropically-rescaled (scaled so that the aspect ratio of the original is preserved) training image
Two approaches for setting the training scale S
1. Single-scale training, S fixed (S=256, S=384)
  - A network is first trained with S=256; then, to speed up training of the S=384 network, it is initialised with the weights pre-trained with S=256 and a smaller learning rate of 10^(-3) is used
2. Multi-scale training
  - Each training image is individually rescaled by randomly sampling S from the range [Smin, Smax] (Smin=256, Smax=512)
  - Objects in images can be of different sizes, so it is beneficial to take this into account during training; sampling S gives the network more varied input (scale jittering)
  - For speed reasons, the multi-scale models are trained by fine-tuning all layers of a single-scale model with the same configuration, pre-trained with fixed S=384
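The rescale-then-crop procedure can be sketched without any image library by working with sizes only. Everything here is illustrative: `rescaled_size` and `sample_training_crop` are hypothetical helpers, the rounding of the longer side is my assumption, and the RGB colour shift is left out.

```python
import random


def rescaled_size(width, height, s):
    """Isotropically rescale (width, height) so that the smallest side equals s."""
    if width <= height:
        return s, round(height * s / width)
    return round(width * s / height), s


def sample_training_crop(width, height, s_min=256, s_max=512, crop=224, rng=random):
    """Pick a training scale S, a random 224x224 crop window and a flip flag.

    Setting s_min == s_max gives single-scale training; a wider range gives
    scale jittering.
    """
    s = rng.randint(s_min, s_max)          # inclusive on both ends
    new_w, new_h = rescaled_size(width, height, s)
    left = rng.randint(0, new_w - crop)    # crop window stays inside the image
    top = rng.randint(0, new_h - crop)
    flip = rng.random() < 0.5              # random horizontal flip
    return {"S": s, "size": (new_w, new_h),
            "box": (left, top, left + crop, top + crop), "flip": flip}


print(sample_training_crop(500, 375))                         # multi-scale
print(sample_training_crop(500, 375, s_min=256, s_max=256))   # single-scale, S=256
```

With S=256 the crop covers most of the image, while with S close to 512 it covers only a small part of it, which is how the same object ends up at different sizes in the network input.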

3.2 Testing

The input image is isotropically rescaled to a pre-defined smallest image side, denoted as the test scale Q
- Q does not have to equal the training scale S; using several values of Q for each S improves performance
The Fully Connected layers are converted to conv layers (the first FC layer to a 7x7 conv layer, the last two FC layers to 1x1 conv layers)
- The resulting fully-convolutional network is applied to the whole (uncropped) image
- Result
  1. A class score map with as many channels as there are classes
  2. A spatial resolution that varies with the input image size, i.e. the input image size is no longer constrained
  3. Classification accuracy can be improved by combining the results of one image evaluated at several scales
To obtain a fixed-size vector of class scores for the image, the class score map is spatially averaged (sum-pooled)
The test set is also augmented by horizontal flipping of the images
The final scores for an image are obtained by averaging the soft-max class posteriors of the original and flipped images
Multi-crop evaluation: samples the input image more finely, which can improve accuracy, but is less efficient because it requires network re-computation for each crop
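A small sketch of the dense evaluation bookkeeping: how large the class score map gets for a given test image size, and how it is reduced to one score per class. The function names are mine, and the score map is represented as a plain nested list `[class][row][col]` purely for illustration.

```python
def score_map_size(width, height, num_pools=5, fc_kernel=7):
    """Spatial size of the class score map when the first FC layer acts as a 7x7 conv."""
    w, h = width, height
    for _ in range(num_pools):
        w, h = w // 2, h // 2  # each 2x2, stride-2 max pooling halves the size
    return w - fc_kernel + 1, h - fc_kernel + 1


def spatial_average(score_map):
    """Average a [class][row][col] score map into one score per class."""
    return [
        sum(sum(row) for row in channel) / sum(len(row) for row in channel)
        for channel in score_map
    ]


def average_with_flip(scores_original, scores_flipped):
    """Average the class scores of the original and horizontally flipped image."""
    return [(a + b) / 2 for a, b in zip(scores_original, scores_flipped)]


print(score_map_size(224, 224))  # (1, 1)
print(score_map_size(384, 384))  # (6, 6)
print(spatial_average([[[1, 2], [3, 4]], [[0, 0], [0, 4]]]))  # [2.5, 1.0]
```

At 224x224 the network produces a single score vector, exactly as with the original FC layers; larger inputs simply produce a larger map that is averaged afterwards.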

3.3 Implementation Details

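Multi-GPU training splits each batch of training images into several GPU batches, which are processed in parallel on each GPU
After the GPU batch gradients are computed, they are averaged to obtain the gradient of the full batch
Gradient computation is synchronous across the GPUs → the result is exactly the same as when training on a single GPU
A 4-GPU system gives a 3.75x speedup over a single-GPU system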
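Why averaging the GPU batch gradients matches single-GPU training can be seen on a toy problem. This is only a sketch under my own assumptions: a one-parameter model y = w * x with a mean squared error loss stands in for the ConvNet, the GPU batches are equally sized, and the "GPUs" are just a Python loop.

```python
def batch_gradient(w, batch):
    """Gradient of the mean squared error of the toy model y = w * x over a batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_gradient(w, batch, num_gpus=4):
    """Split the batch into equal GPU batches, compute each gradient, then average."""
    assert len(batch) % num_gpus == 0, "sketch assumes equally sized GPU batches"
    size = len(batch) // num_gpus
    gpu_batches = [batch[i * size:(i + 1) * size] for i in range(num_gpus)]
    gpu_grads = [batch_gradient(w, b) for b in gpu_batches]  # would run in parallel
    return sum(gpu_grads) / num_gpus


batch = [(float(x), 3.0 * x + 1.0) for x in range(8)]
full = batch_gradient(0.5, batch)
split = data_parallel_gradient(0.5, batch)
print(abs(full - split) < 1e-9)  # True
```

Because the loss is a mean over examples, the mean of equally sized sub-batch gradients is the full-batch gradient, so only the wall-clock time changes.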

4. Classification Experiments

Presents the image classification results achieved by the described ConvNet architectures on the ILSVRC-2012 dataset
Dataset: includes images of 1000 classes, split into three sets
1. Training set (1.3M images)
2. Validation set (50K images)
3. Testing set (100K images)
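Classification performance is evaluated with two measures (top-1 error, top-5 error)
1. Top-1 error: the multi-class classification error, i.e. the proportion of incorrectly classified images
2. Top-5 error: the proportion of images whose ground-truth category is not among the model's top-5 predicted categories; the main evaluation criterion used in ILSVRC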
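Both error measures are the same computation with a different k. A minimal sketch on made-up scores for three images and three classes (`top_k_error` is a name chosen for this note):

```python
def top_k_error(scores, labels, k=5):
    """Fraction of images whose true label is not among the k highest-scoring classes."""
    errors = 0
    for image_scores, label in zip(scores, labels):
        ranked = sorted(range(len(image_scores)),
                        key=lambda c: image_scores[c], reverse=True)
        if label not in ranked[:k]:
            errors += 1
    return errors / len(labels)


scores = [
    [0.1, 0.6, 0.3],  # true class 1: correct at top-1
    [0.5, 0.2, 0.3],  # true class 2: wrong at top-1, correct at top-2
    [0.7, 0.2, 0.1],  # true class 2: wrong even at top-2
]
labels = [1, 2, 2]

print(top_k_error(scores, labels, k=1))  # 2 of 3 images wrong
print(top_k_error(scores, labels, k=2))  # 1 of 3 images wrong
```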

4.1 Single Scale Evaluation (see the layer configurations in Table 1)

Test image size

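The classification error decreases as the ConvNet depth increases from A (11 layers) to E (19 layers); even deeper models might be beneficial for larger datasets
C outperforms B: the additional non-linearity helps
D outperforms C: despite having the same depth, C (which contains three 1x1 conv layers) performs worse than D (which uses 3x3 conv layers throughout)
→ capturing spatial context with conv filters that have non-trivial receptive fields is also important
A shallow net derived from B by replacing each pair of 3x3 conv layers with a single 5x5 conv layer had a top-1 error 7% higher than that of B
→ a deep net with small filters outperforms a shallow net with larger filters
Scale jittering at training time (S ∈ [Smin, Smax]) gives better results than training with fixed S (256, 384)
→ training set augmentation by scale jittering helps capture multi-scale image statistics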

Table 3) ConvNet performance at a single test scale.

4.2 Multi-Scale Evaluation

Performance is better with multi-scale evaluation (Table 4) than with single-scale evaluation (Table 3)

→ models trained with multi-scale jittering over [Smin, Smax] (Smin=256, Smax=512) perform best

Table 4) ConvNet performance at multiple test scales.

4.3 Multi-Crop Evaluation

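The complementarity of the two evaluation techniques (multi-crop evaluation[2], dense evaluation[3]) is assessed by averaging their soft-max outputs
Multi-crop evaluation performs slightly better than dense evaluation but is more computationally expensive; the combination of the two outperforms each of them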

Table 5) ConvNet evaluation techniques comparison.

4.4 ConvNet Fusion

The outputs of several models are combined by averaging their soft-max class posteriors → performance improves thanks to the complementarity of the models
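Fusion by posterior averaging fits in a few lines. The sketch below uses two made-up logit vectors as stand-ins for the outputs of two trained ConvNets; `softmax` and `fuse` are helper names of my own.

```python
import math


def softmax(logits):
    """Numerically stable soft-max over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(posteriors):
    """Average the class posteriors produced by several models."""
    n = len(posteriors)
    return [sum(p[c] for p in posteriors) / n for c in range(len(posteriors[0]))]


model_a = softmax([2.0, 1.0, 0.1])  # hypothetical model, prefers class 0
model_b = softmax([0.5, 2.5, 0.3])  # hypothetical model, prefers class 1
fused = fuse([model_a, model_b])
prediction = max(range(len(fused)), key=fused.__getitem__)

print(fused)
print(prediction)  # 1
```

The same averaging is what combines the original and flipped image at test time, and the multi-crop and dense evaluation outputs in section 4.3.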

Table 6) Multiple ConvNet fusion results.

4.5 Comparison with the state of the art

Significantly outperforms the previous generation of models
Single-net performance beats a single GoogLeNet by 0.9%

Table 7) Comparison with the state of the art in ILSVRC classification

5. Conclusion

Uses smaller receptive fields than previous ConvNet architectures (3x3, stride 1)
Achieves good performance by making the networks deep, with up to 19 weight layers

[1] ImageNet Large Scale Visual Recognition Challenge

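[2] The original image is cropped multiple times and the ConvNet is applied to each crop

[3] The ConvNet is applied directly to the whole image and the results are read out on a regular pixel grid, as if a sliding window were applied; because of the grid size, accuracy can be slightly lower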
