Image segmentation tasks using the Unet neural network

Segmentation of images with U-Net in practice


In this blog post, we’ll take a look at how Unet works, how to implement it, and what data is needed to train it. To do this, we will use:

  1. The original Unet article as a source of inspiration.
  2. Pytorch as a tool to bring our vision to life.
  3. A Kaggle competition as a place where we can test our hypotheses on real data.

We will not follow the article 100%, but we will try to implement its essence and adapt it to our needs.
Presentation of the problem

Our problem is one of the Kaggle competitions. In this task, we are given an image of a car and its binary mask (which localizes the position of the car in the image). We want to create a model that will be able to separate the car from the background with a pixel-by-pixel precision of over 99%. The gif below shows what we want:

The image on the left is the original image, and on the right is the mask that will be applied to it. We will use a Unet neural network that will learn to automatically create a mask:

  1. By feeding images of cars into the neural network.
  2. By using a loss function that compares the output of the neural network with the corresponding masks and returns an error, so the network can find out where it is wrong.
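The two steps above amount to a standard supervised training loop. A minimal framework-agnostic sketch (hypothetical names — the real loop lives inside the classifier's train method):

```python
def train_one_epoch(model, loader, criterion, optimizer):
    """One pass over the data: predict masks, measure the error, update weights."""
    for images, masks in loader:              # loader yields (car images, ground-truth masks)
        optimizer.zero_grad()                 # reset gradients from the previous step
        predictions = model(images)           # 1. feed the car images into the network
        loss = criterion(predictions, masks)  # 2. compare the output with the masks
        loss.backward()                       # propagate the error back through the network
        optimizer.step()                      # nudge the weights to reduce the error
```

The same shape of loop appears in virtually every PyTorch project; only the model, the data and the criterion change.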

Code structure

The code has been simplified as much as possible so that it is easy to understand how it works. The main code is in this file; let’s break it down line by line.

The code

We will walk through the code and through the article together. Don’t worry about the details hidden in other project files; we’ll show you the ones you need as we go. Let’s start at the beginning:

```python
def main():
    # Hyperparameters
    input_img_resize = (572, 572)    # The resize size of the input images of the neural net
    output_img_resize = (388, 388)   # The resize size of the output images of the neural net
    batch_size = 3
    epochs = 50
    threshold = 0.5
    validation_size = 0.2
    sample_size = None

    # - Optional parameters
    threads = cpu_count()
    use_cuda = torch.cuda.is_available()
    script_dir = os.path.dirname(os.path.abspath(__file__))

    # Training callbacks
    tb_viz_cb = TensorboardVisualizerCallback(os.path.join(script_dir, '../logs/tb_viz'))
    tb_logs_cb = TensorboardLoggerCallback(os.path.join(script_dir, '../logs/tb_logs'))
    model_saver_cb = ModelSaverCallback(os.path.join(script_dir, '../output/models/model_' + helpers.get_model_timestamp()), verbose=True)
```

In the first section you define your hyperparameters; you can tweak them as you see fit, for example depending on your GPU memory. The optional parameters define some useful settings and callbacks: TensorboardVisualizerCallback is a class that will store predictions in Tensorboard at every epoch of the training process, TensorboardLoggerCallback will store loss-function values and pixel-by-pixel “precision” in Tensorboard, and finally ModelSaverCallback will save your model after training is complete.
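The asymmetric 572 → 388 sizes come straight from the paper: every 3×3 convolution there is unpadded and trims one pixel per border, so the output mask ends up smaller than the input. The arithmetic for the first encoder stage can be checked with a quick sketch (illustrative helpers, not project code):

```python
def conv3x3_valid(size):
    """An unpadded ('valid') 3x3 convolution trims one pixel on each border."""
    return size - 2

def maxpool2(size):
    """A 2x2 max-pool halves the spatial size."""
    return size // 2

# First encoder stage of the original Unet: two 3x3 convolutions, then a 2x2 max-pool
size = conv3x3_valid(conv3x3_valid(572))  # 572 -> 570 -> 568
pooled = maxpool2(size)                   # 568 -> 284
```

Carrying this on down and back up the U gives exactly the 388×388 output for a 572×572 input.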

```python
# Download the datasets
ds_fetcher = DatasetFetcher()
ds_fetcher.download_dataset()
```

This section automatically downloads and extracts the dataset from Kaggle. Note that for this piece of code to work you need a Kaggle account, whose login and password must be placed in the KAGGLE_USER and KAGGLE_PASSWD environment variables before running the script. You are also required to accept the competition rules before downloading the data; this can be done on the Data tab of the competition page.
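A quick sanity check for the credentials can save you a failed download (a hypothetical helper; DatasetFetcher itself reads these variables):

```python
import os

def kaggle_credentials_present(env=None):
    """Return True only if both Kaggle credential variables are set and non-empty."""
    env = os.environ if env is None else env
    return bool(env.get('KAGGLE_USER')) and bool(env.get('KAGGLE_PASSWD'))
```

Call it at the top of main() and fail fast with a clear message instead of a cryptic download error.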

```python
# Get the path to the files for the neural net
X_train, y_train, X_valid, y_valid = ds_fetcher.get_train_files(sample_size=sample_size, validation_size=validation_size)
full_x_test = ds_fetcher.get_test_files(sample_size)
```

Here we simply split the train set into training and validation sets, then load the test dataset (for which we have no masks, and on which we must predict masks that will be scored on the public and private Kaggle leaderboards).
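Conceptually, get_train_files does something like the following (an illustrative sketch with hypothetical names, not the project's actual implementation):

```python
import random

def split_train_valid(files, validation_size=0.2, sample_size=None, seed=0):
    """Optionally subsample the file list, shuffle it, and split off a validation share."""
    files = list(files)
    if sample_size is not None:
        files = files[:sample_size]
    random.Random(seed).shuffle(files)       # deterministic shuffle for reproducibility
    n_valid = int(len(files) * validation_size)
    return files[n_valid:], files[:n_valid]  # (train files, validation files)
```

With validation_size = 0.2, one image in five is held out so we can measure accuracy on cars the network never trains on.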

```python
# Testing callbacks
pred_saver_cb = PredictionsSaverCallback(os.path.join(script_dir, '../output/submit.csv.gz'), origin_img_size, threshold)
```

This line defines the callback for the test (or prediction) run. It will append the predictions to a gzip file each time a new batch of predictions is made, so the predictions are never all held in memory, as they are very large. After the prediction process finishes, you can submit the resulting submit.csv.gz file from the output folder to Kaggle.
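The idea behind that callback is simple: stream each batch straight to a gzipped CSV so nothing accumulates in memory. A stripped-down sketch (a hypothetical class, not the project's actual PredictionsSaverCallback):

```python
import csv
import gzip

class GzipPredictionWriter:
    """Append each prediction batch to a gzipped CSV instead of keeping it in memory."""

    def __init__(self, path):
        self._file = gzip.open(path, 'wt', newline='')
        self._writer = csv.writer(self._file)
        self._writer.writerow(['img', 'rle_mask'])  # Carvana submission header

    def on_batch(self, image_names, encoded_masks):
        # Called after every prediction batch with run-length-encoded masks
        for name, rle in zip(image_names, encoded_masks):
            self._writer.writerow([name, rle])

    def close(self):
        self._file.close()
```

Because each batch is flushed to disk as it arrives, peak memory stays constant no matter how many test images there are.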

```python
# - Define our neural net architecture
# The original paper has 1 input channel, in our case we have 3 (RGB)
net = unet_origin.UNetOriginal((3, *input_img_resize))
classifier = nn.classifier.CarvanaClassifier(net, epochs)
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.99)
```

```python
train_ds = TrainImageDataset(X_train, y_train, input_img_resize, output_img_resize, X_transform=aug.augment_img)
train_loader = DataLoader(train_ds, batch_size,
                          sampler=RandomSampler(train_ds),
                          num_workers=threads,
                          pin_memory=use_cuda)

valid_ds = TrainImageDataset(X_valid, y_valid, input_img_resize, output_img_resize, threshold=threshold)
valid_loader = DataLoader(valid_ds, batch_size,
                          sampler=SequentialSampler(valid_ds),
                          num_workers=threads,
                          pin_memory=use_cuda)
```

Here we define our net and optimizer (more on that later), then create loaders for both the train and validation sets that will load the data in batches.

```python
print("Training on {} samples and validating on {} samples"
      .format(len(train_loader.dataset), len(valid_loader.dataset)))

# Train the classifier
classifier.train(train_loader, valid_loader, epochs, callbacks=[tb_viz_cb, tb_logs_cb, model_saver_cb])
```

We now start training the model, passing in the train and validation loaders and the callbacks we defined above. Below we will look at some implementation details of this method.

```python
test_ds = TestImageDataset(full_x_test, input_img_resize)
test_loader = DataLoader(test_ds, batch_size,
                         sampler=SequentialSampler(test_ds),
                         num_workers=threads,
                         pin_memory=use_cuda)

# Predict & save
classifier.predict(test_loader, callbacks=[pred_saver_cb])
pred_saver_cb.close_saver()
```

Finally, we do the same as above, but for the prediction run, and call pred_saver_cb.close_saver() to flush and close the file containing the predictions.
Neural network architecture implementation

The Unet article presents an approach for medical image segmentation. However, it turns out that this approach can also be used for other segmentation tasks, including the one we are going to work on now.

Before going forward, you should read the entire article at least once. Don’t worry if you don’t gain a complete understanding of the math; you can skip that section, as well as the Experiments chapter. Our goal is to get the big picture.

The purpose of the original article is different from ours, so we will need to adapt some parts to our needs. At the time the paper was written, two things were missing that are now needed to speed up the convergence of a neural network:

  1. BatchNorm.
  2. Powerful GPUs.

BatchNorm was introduced just 3 months before Unet, so it was probably too early for the Unet authors to add it to their article. Today BatchNorm is used almost everywhere. You can remove it from the code if you want to reproduce the article 100%, but you may wait a very long time for the network to converge.
In terms of GPUs, the article says:
To minimize the overhead and make maximum use of the GPU memory, we favor large input tiles over a large batch size and hence reduce the batch to a single image
They used a GPU with 6 GB of RAM, but modern GPUs have more memory and can fit several images in one batch. A batch size of 3 works on a GPU with 8 GB of RAM; if you don’t have such a video card, try reducing the batch size to 2 or 1.

As for augmentation methods (that is, distortions of the original image following some pattern), we will use different ones from those described in the article, since our images are very different from biomedical images.

Now let’s start from the very beginning by designing the architecture of the neural network:

This is what Unet looks like. You can find the equivalent Pytorch implementation in the module. All classes in this file have at least 2 methods:

  • __init__(), where we initialize our neural network layers;
  • forward(), the method called when the neural network receives an input.

Let’s take a look at the implementation details:

  • ConvBnRelu is a block containing the Conv2d, BatchNorm and ReLU operations. Instead of typing all 3 of them for each encoder stack (group of downsampling operations) and decoder stack (group of upsampling operations), we group them into this object and reuse it as needed.
  • StackEncoder encapsulates an entire downsampling “stack” of operations, including the ConvBnRelu and MaxPool operations, as shown below:

We keep the output of the last ConvBnRelu operation in x_trace and return it, because we will concatenate this output with the decoder stacks.

  • StackDecoder is the same as StackEncoder, but for the decoding operations, circled in red below:

Note that it takes the crop/concatenation operation (circled in orange) into account by receiving down_tensor, which is nothing more than the x_trace tensor returned by StackEncoder.
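The crop half of that crop/concatenation step is just a center crop of the encoder output down to the decoder's spatial size. Over plain nested lists it looks like this (an illustrative sketch; the real version operates on tensors):

```python
def center_crop(grid, target_h, target_w):
    """Center-crop a 2D grid (list of rows), as done to x_trace before concatenation."""
    h, w = len(grid), len(grid[0])
    top = (h - target_h) // 2
    left = (w - target_w) // 2
    return [row[left:left + target_w] for row in grid[top:top + target_h]]
```

The crop is needed because the unpadded convolutions make the encoder tensor slightly larger than the upsampled decoder tensor it is concatenated with.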

  • UNetOriginal is where the magic happens. This is our neural network, which assembles all the little bricks described above. Its init and forward methods add a bunch of StackEncoders, a central part, and a few StackDecoders at the end. We then take the output of the last StackDecoder and add a 1×1 convolution to it, as in the article; but instead of defining two filters as the output, we define only 1, which will in fact be our grayscale mask prediction. Finally, we “squeeze” our output to remove the channel dimension (there is only 1 channel, so we don’t need to keep it).

If you want to understand each block in more detail, place a debug breakpoint in the forward method of each class to inspect the objects. You can also print the shapes of the tensors between layers with print(x.size()).
Neural network training

  1. Loss function

Now to the real world. According to the article:

“The energy function is computed by a pixel-wise soft-max over the final feature map combined with the cross entropy loss function.”

The point is that in our case we want to use the dice coefficient as the loss function instead of what they call an “energy function”, since this is the metric used in the Kaggle competition. It is defined as:

dice = 2 · |X ∩ Y| / (|X| + |Y|)

where X is our prediction and Y is the correctly labeled mask for the current object; |X| denotes the cardinality of the set X (the number of elements in the set), and ∩ the intersection of X and Y.
The code for the dice coefficient can be found at nn.losses.SoftDiceLoss .

```python
class SoftDiceLoss(nn.Module):
    def __init__(self, weight=None, size_average=True):
        super(SoftDiceLoss, self).__init__()

    def forward(self, logits, targets):
        smooth = 1
        num = targets.size(0)
        probs = F.sigmoid(logits)
        m1 = probs.view(num, -1)
        m2 = targets.view(num, -1)
        intersection = (m1 * m2)

        score = 2. * (intersection.sum(1) + smooth) / (m1.sum(1) + m2.sum(1) + smooth)
        score = 1 - score.sum() / num
        return score
```

The reason the intersection is implemented as a multiplication, and the cardinality as sum() over axis 1 (the sum over all pixels of the flattened mask), is that the predictions and the target are one-hot encoded vectors. For example, if the prediction at pixel (0, 0) is 0.567 and the target is 1, we get 0.567 * 1 = 0.567; if the target is 0, we get 0 at that pixel position.

We also use a smooth factor of 1 for backpropagation: if predictions were hard-thresholded to 0 and 1, it would be difficult to backpropagate the dice loss.
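You can check the arithmetic by hand on a tiny flattened example (a plain-Python mirror of the formula above, with the same smooth factor available):

```python
def soft_dice(probs, targets, smooth=1.0):
    """Soft dice score for one flattened mask: 2*(|X ∩ Y| + s) / (|X| + |Y| + s)."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    return 2.0 * (intersection + smooth) / (sum(probs) + sum(targets) + smooth)

# Without smoothing, a perfect hard prediction scores exactly 1
print(soft_dice([1.0, 0.0, 1.0], [1, 0, 1], smooth=0.0))  # 1.0
```

Note that the products act as a soft intersection: a pixel contributes to the numerator in proportion to how confidently it is (correctly) predicted.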
We then combine the dice loss with cross entropy to get our total loss function, which you can find in the _criterion method of nn.Classifier.CarvanaClassifier. According to the original article, they also use a weight map in the cross-entropy loss to give some pixels more weight during training. In our case we don’t need such a thing, so we just use cross entropy without a weight map.
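A minimal sketch of such a combined criterion on a flattened mask (plain Python, unweighted binary cross-entropy; the project's real version is the _criterion method mentioned above):

```python
import math

def bce(probs, targets, eps=1e-7):
    """Plain binary cross-entropy averaged over pixels (no weight map)."""
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for p, t in zip(probs, targets)) / len(probs)

def combined_loss(probs, targets, smooth=1.0):
    """Cross-entropy plus soft dice loss, mirroring the total loss described above."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    dice = 2.0 * (intersection + smooth) / (sum(probs) + sum(targets) + smooth)
    return bce(probs, targets) + (1.0 - dice)
```

Both terms decrease as the prediction approaches the mask, so they can simply be added.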
2. Optimizer

Here we pay homage to the original article by using the SGD optimizer with momentum 0.99. The optimizer is defined in the main method:

```python
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.99)
```

That’s all we need to do for the optimizer.
3. Augmentations
Since we are not dealing with biomedical images, we will use our own augmentations. The code can be found in img.augmentation.augment_img, where we do random shifts, rotations, flips and scaling.
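The key property of these augmentations is that the image and its mask must be transformed together, or the labels stop matching the pixels. A horizontal flip, the simplest case, might look like this (a hypothetical helper; the project's versions live in img.augmentation):

```python
import random

def random_hflip(image, mask, p=0.5, rng=random):
    """Flip the image and its mask together with probability p, keeping them aligned."""
    if rng.random() < p:
        image = [row[::-1] for row in image]
        mask = [row[::-1] for row in mask]
    return image, mask
```

Shifts, rotations and scalings follow the same pattern: draw the random parameters once, then apply the identical transform to both the image and the mask.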
Neural network training

Now you can start training. As you progress through each epoch, you will be able to visualize your model’s predictions on the validation set. To do this, you need to launch Tensorboard in the logs folder with the command:

```shell
tensorboard --logdir=./logs
```

An example of what you can see in Tensorboard after epoch 1:

And after epoch 50:

After training for 50 epochs, we get a pixel-by-pixel accuracy of about 95–96%. This is much better than after our first epoch, but the quality is still unsatisfactory if we want to automate the task of image segmentation.
But didn’t we claim at the beginning of the article that the model achieves more than 99%, while here we only got a little over 95%? Let me explain:
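For reference, the pixel-by-pixel accuracy reported here is simply the share of pixels where the thresholded prediction agrees with the ground-truth mask (an illustrative helper, not project code):

```python
def pixel_accuracy(pred_probs, mask, threshold=0.5):
    """Fraction of pixels where (probability > threshold) matches the binary mask."""
    total = correct = 0
    for prob_row, mask_row in zip(pred_probs, mask):
        for p, m in zip(prob_row, mask_row):
            total += 1
            correct += int((p > threshold) == bool(m))
    return correct / total
```

Because the car occupies only part of the frame, even a mediocre mask scores high on this metric, which is why 95% still looks visibly rough.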

We will now use a special Unet, which you can find at nn.unet.UNet1024. We will not go into the implementation details of this architecture, since it is very similar to our original Unet with a few changes. All you need to do is make a few changes in your main file:

Replace

```python
input_img_resize = (572, 572)
output_img_resize = (388, 388)
```

with

```python
input_img_resize = (1024, 1024)
output_img_resize = (1024, 1024)
```

and replace

```python
net = unet_origin.UNetOriginal((3, *input_img_resize))
```

with

```python
net = unet_custom.UNet1024((3, *input_img_resize))
```

You can also change the optimizer from SGD to RMSprop, replacing

```python
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.99)
```

with

```python
optimizer = optim.RMSprop(net.parameters(), lr=0.0002)
```

Then run the training again for 50 epochs with batch size 2 (or reduce it to 1 if you don’t have enough GPU memory). This new architecture takes higher-resolution inputs, allowing Unet to learn richer representations. For example, here is what epoch 24 looks like after these changes:

With a pixel-by-pixel accuracy of 0.995 it looks much better, doesn’t it? Of course, you can keep tweaking the optimizer / epochs / architecture yourself to achieve even better quality.

In this post, we saw how to perform image segmentation using Unet. While there are many other solutions for this task, Unet is very efficient at it. Perhaps we will look at other methods of image segmentation in the future.
