[Paper Review] MICCAI 2015, U-Net: Convolutional Networks for Biomedical Image Segmentation
hwchung · 2026. 2. 26. 14:30

U-Net: Convolutional Networks for Biomedical Image Segmentation
MICCAI 2015
0. Abstract
In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. → When training U-Net, data augmentation is applied so that the model can learn well even from a small number of annotated samples.
1. Introduction
While convolutional networks have already existed for a long time [8], their success was limited due to the size of the available training sets and the size of the considered networks.
However, in many visual tasks, especially in biomedical image processing, the desired output should include localization, i.e., a class label is supposed to be assigned to each pixel.
And above all, biomedical data is hard to obtain. So if segmentation could be done with only a small number of images, wouldn't that be where the novelty lies...?
In this paper, we build upon a more elegant architecture, the so-called “fully convolutional network” [9]. We modify and extend this architecture such that it works with very few training images and yields more precise segmentations; see Figure 1.

Fig. 1. U-net architecture (example for 32x32 pixels in the lowest resolution). Each blue box corresponds to a multi-channel feature map. The number of channels is denoted on top of the box. The x-y-size is provided at the lower left edge of the box. White boxes represent copied feature maps. The arrows denote the different operations.
The main idea in [9] is to supplement a usual contracting network by successive layers, where pooling operators are replaced by upsampling operators. → A contracting network is, by definition, one that extracts features while making the input image smaller. Typically, convolution layers and pooling layers shrink the image step by step (hence "contracting") while extracting the important features: convolution finds patterns in the image, and pooling shrinks the image to summarize the important information. Here the pooling operators are replaced by upsampling operators; upsampling is the process of enlarging the image size, so think of it as the opposite of pooling. It is used to restore the image resolution and to produce fine-grained segmentations. In U-Net, upsampling expands the contracted image back toward its original size while preserving the important information, and the segmentation is produced from that.
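To make the pooling-vs-upsampling contrast concrete, here is a minimal NumPy sketch of the two operations (a toy illustration only: U-Net's expansive path uses learned up-convolutions, not this nearest-neighbour repetition):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2: the image shrinks by half per side,
    keeping the strongest response in each 2x2 block (contracting step)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def upsample_2x2(x):
    """Nearest-neighbour upsampling: each pixel is repeated into a 2x2
    block, doubling the resolution (the opposite of pooling)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

x = np.arange(16, dtype=float).reshape(4, 4)
pooled = max_pool_2x2(x)
print(pooled.shape)                 # (2, 2)
print(upsample_2x2(pooled).shape)   # (4, 4) -- resolution restored
```

Note that pooling is lossy: upsampling restores the spatial size but not the discarded detail, which is exactly why U-Net also concatenates the high-resolution feature maps from the contracting path.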
Hence, these layers increase the resolution of the output. → In a way, this is stating the obvious.
One important modification in our architecture is that in the upsampling part we have also a large number of feature channels, which allow the network to propagate context information to higher resolution layers. → The upsampling part also keeps a large number of feature channels. Because many channels are kept, the high-resolution stages can carry not just plain pixel information but the overall context information as well.
As a consequence, the expansive path is more or less symmetric to the contracting path, and yields a u-shaped architecture.
The network does not have any fully connected layers and only uses the valid part of each convolution, i.e., the segmentation map only contains the pixels, for which the full context is available in the input image. → U-Net uses no fully connected layers: predictions are made pixel by pixel, so the structural characteristics of the image are reflected directly. It also uses only "valid" padding, meaning the border region of each convolution is cropped away and only the valid part is kept, which avoids distortion at the image edges. Consequently, the segmentation map contains only the pixels for which the full context is available, i.e., only those pixels whose information for segmentation is fully provided. → U-Net predicts each pixel using local information only, so it can learn at every position individually, which enables fine-grained segmentation.
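The effect of "valid" convolutions can be seen in a tiny sketch (a hypothetical `conv2d_valid` helper written for illustration, not code from the paper): each 3x3 valid convolution trims a 1-pixel border, so repeated convolutions steadily shrink the map.

```python
import numpy as np

def conv2d_valid(img, kernel):
    """'Valid' 2D convolution: keep only positions where the kernel fully
    overlaps the image, so a 3x3 kernel trims a 1-pixel border per conv."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i + kh, j:j + kw] * kernel).sum()
    return out

img = np.ones((8, 8))
k = np.ones((3, 3)) / 9.0            # simple averaging kernel
print(conv2d_valid(img, k).shape)    # (6, 6): border pixels are gone
```

Two such convolutions per U-Net block remove 4 pixels per side in total, which is why the copied contracting-path feature maps must be cropped before concatenation.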
This strategy allows the seamless segmentation of arbitrarily large images by an overlap-tile strategy (see Figure 2).

Fig. 2. Overlap-tile strategy for seamless segmentation of arbitrary large images (here segmentation of neuronal structures in EM stacks). Prediction of the segmentation in the yellow area requires image data within the blue area as input. Missing input data is extrapolated by mirroring.
To predict the pixels in the border region of the image, the missing context is extrapolated by mirroring the input image. This tiling strategy is important to apply the network to large images, since otherwise the resolution would be limited by the GPU memory.
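The mirroring extrapolation corresponds directly to NumPy's `reflect` padding mode; a small sketch on a toy 3x3 "image":

```python
import numpy as np

# Missing context outside the image border is extrapolated by mirroring
# the image about its edge; np.pad(mode="reflect") does exactly this
# (the edge row/column itself is not duplicated).
img = np.arange(9).reshape(3, 3)
tile = np.pad(img, pad_width=1, mode="reflect")
print(tile)
```

In the actual overlap-tile strategy, this mirrored border supplies the input context for predicting tiles that touch the image boundary.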
As for our tasks there is very little training data available, we use excessive data augmentation by applying elastic deformations to the available training images.
To this end, we propose the use of a weighted loss, where the separating background labels between touching cells obtain a large weight in the loss function. → A weighted loss assigns a weight to each term of the loss function to adjust how strongly it influences training. An ordinary loss function simply measures the difference between predictions and ground truth, but with weights the influence can differ by category: errors on a particular class, or in a particular region, can be given a larger penalty.
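As a sketch of the idea (not the paper's exact weight map, which also involves class-frequency balancing and a distance-based border term), a per-pixel weighted cross-entropy could look like this, with `weighted_cross_entropy` being a hypothetical helper:

```python
import numpy as np

def weighted_cross_entropy(probs, labels, weights):
    """Pixel-wise cross-entropy where each pixel carries its own weight.
    Pixels on the thin background ridges between touching cells would be
    assigned large weights, so mistakes there are penalised more heavily.
    probs:   (H, W, C) softmax probabilities
    labels:  (H, W)    integer class labels
    weights: (H, W)    per-pixel weight map"""
    h, w = labels.shape
    # probability assigned to the true class at every pixel
    p_true = probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return float((weights * -np.log(p_true)).mean())

probs = np.full((2, 2, 2), 0.5)            # maximally uncertain predictions
labels = np.zeros((2, 2), dtype=int)
uniform = np.ones((2, 2))
border = np.array([[1.0, 1.0], [1.0, 5.0]])  # one "border" pixel upweighted
print(weighted_cross_entropy(probs, labels, uniform))  # ≈ 0.693
print(weighted_cross_entropy(probs, labels, border))   # ≈ 1.386
```

The same prediction errors cost twice as much on average once the border pixel is upweighted, which is exactly the pressure that forces the network to learn the separating background between touching cells.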
2. Network Architecture

The network architecture is illustrated in Figure 1. It consists of a contracting path (left side) and an expansive path (right side). The contracting path follows the typical architecture of a convolutional network. It consists of the repeated application of two 3x3 convolutions (unpadded convolutions), each followed by a rectified linear unit (ReLU) and a 2x2 max pooling operation with stride 2 for downsampling. At each downsampling step we double the number of feature channels.

Every step in the expansive path consists of an upsampling of the feature map followed by a 2x2 convolution ("up-convolution") that halves the number of feature channels, a concatenation with the correspondingly cropped feature map from the contracting path, and two 3x3 convolutions, each followed by a ReLU. The cropping is necessary due to the loss of border pixels in every convolution.

At the final layer a 1x1 convolution is used to map each 64-component feature vector to the desired number of classes. In total the network has 23 convolutional layers. To allow a seamless tiling of the output segmentation map (see Figure 2), it is important to select the input tile size such that all 2x2 max-pooling operations are applied to a layer with an even x- and y-size.
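The tile-size constraint in this paragraph can be checked with a short helper (a hypothetical `unet_output_size`, assuming the depth-4 architecture of Figure 1 with two 3x3 valid convolutions per block):

```python
def unet_output_size(tile, depth=4):
    """Trace a square input tile through the U-Net of Fig. 1 (valid
    convolutions only) and return the output segmentation-map size,
    or None if some 2x2 max-pool would see an odd-sized layer."""
    n = tile
    # contracting path: two 3x3 valid convs (-2 each), then 2x2 max-pool
    for _ in range(depth):
        n -= 4
        if n <= 0 or n % 2 != 0:    # pooling requires an even x-y size
            return None
        n //= 2
    n -= 4                           # two 3x3 convs at the bottleneck
    # expansive path: 2x2 up-conv doubles the size, then two 3x3 convs
    for _ in range(depth):
        n = n * 2 - 4
    return n if n > 0 else None

# The paper's Figure 1 example: a 572x572 input tile -> 388x388 output map.
print(unet_output_size(572))   # → 388
# 573 is not a seamless tile size: the first pool would see an odd layer.
print(unet_output_size(573))   # → None
```

This also makes the overlap-tile picture from Figure 2 concrete: each 572-wide input tile contributes only a 388-wide central output, so neighbouring tiles must overlap by the difference.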