#### Depthwise separable convolutions

In this 4-part series, we’ll implement image segmentation step by step from scratch using deep learning techniques in PyTorch. This part will focus on optimizing our CNN baseline model using depthwise separable convolutions to reduce the number of trainable parameters, making the model deployable on mobile and other edge devices.

Co-authored with Naresh Singh

Figure 1: Result of running image segmentation using a CNN with depth-wise separable convolutions instead of regular convolutions. From top to bottom, input images, ground truth segmentation masks, and predicted segmentation masks. Source: Author(s)

### Article outline

In this article, we will modify the Convolutional Neural Network (CNN) we built earlier to reduce the number of learnable parameters in our network. The task of identifying pet pixels (pixels belonging to cats, dogs, hamsters, etc.) in an input image remains unchanged. Our network of choice remains SegNet, and the only change we’ll make is to replace its convolutional layers with depthwise separable convolutions (DSC). Before we do this, we will dive into the theory and practice of depthwise separable convolutions and build an appreciation for the idea behind the technique.

Throughout this article, we will reference code and results from this notebook for model training, and this notebook for a primer on DSC. If you wish to reproduce the results, you’ll need a GPU to ensure that the first notebook completes running in a reasonable amount of time. The second notebook can be run on a regular CPU.

### Articles in this series

This series is for readers at all experience levels with deep learning. If you want to learn about the practice of deep learning and vision AI along with some solid theory and hands-on experience, you’ve come to the right place! This is expected to be a 4-part series with the following articles:

1. Concepts and Ideas
2. A CNN-based model
3. **Depthwise separable convolutions (this article)**
4. A Vision Transformer-based model

### Introduction

Let’s start with a closer look at convolutions from the perspective of model size and computation cost. The number of trainable parameters is a good indication of the size of a model, and the number of tensor operations reflects the model’s complexity or computation cost. Consider a convolution layer with *n* filters of size *dₖ x dₖ*. Further assume that this layer processes an input with shape *m x h x w*, where *m* is the number of input channels, and *h* and *w* are the height and width dimensions respectively. In this case, the convolution layer will produce an output with shape *n x h x w* as shown in Figure 2. We are assuming that the convolution uses *stride=1*. Let’s go ahead and evaluate this setup in terms of trainable parameters and computation cost.

Figure 2: Regular convolutional filters applied to input to produce output. Assume stride=1 and padding=dₖ-2. Source: Efficient Deep Learning Book

**Evaluation Of Trainable Parameters:** We have *n* filters, each of which has *m x dₖ x dₖ* learnable parameters. This results in a total of *n x m x dₖ x dₖ* learnable parameters. Bias terms are ignored to simplify this discussion. Let’s look at the PyTorch code below to validate our understanding.

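```python
import torch
from torch import nn

def num_parameters(module):
    # Total number of elements across all parameter tensors of the module.
    return sum(p.numel() for p in module.parameters())

dk, m, n = 3, 16, 32
print(f"Expected number of parameters: {m * dk * dk * n}")

# padding=dk-2 (1 for dk=3) keeps the output at h x w, as assumed in the text.
conv1 = nn.Conv2d(in_channels=m, out_channels=n, kernel_size=dk, padding=dk-2, bias=False)
print(f"Actual number of parameters: {num_parameters(conv1)}")
```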

Prints the following.

```
Expected number of parameters: 4608
Actual number of parameters: 4608
```

Now, let’s evaluate the computation costs of convolution.

**Evaluation Of Computational Cost:** A single convolutional filter of shape *m x dₖ x dₖ*, when run with *stride=1* and *padding=dₖ-2* (which is 1 for *dₖ=3* and keeps the output at *h x w*) on an input with size *h x w*, is applied *h x w* times, once for each image section of size *dₖ x dₖ*. This results in a cost of *m x dₖ x dₖ x h x w* per filter or output channel. Since we wish to compute *n* output channels, the total cost will be *m x dₖ x dₖ x h x w x n*. Let’s go ahead and validate this using the torchinfo PyTorch package.

```python
from torchinfo import summary

h, w = 128, 128
print(f"Expected total multiplies: {m * dk * dk * h * w * n}")
summary(conv1, input_size=(1, m, h, w))
```

Will print the following.

```
Expected total multiplies: 75497472
==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
Conv2d                                   [1, 32, 128, 128]         4,608
==========================================================================================
Total params: 4,608
Trainable params: 4,608
Non-trainable params: 0
Total mult-adds (M): 75.50
==========================================================================================
Input size (MB): 1.05
Forward/backward pass size (MB): 4.19
Params size (MB): 0.02
Estimated Total Size (MB): 5.26
==========================================================================================
```

If we ignore the implementation details of a convolution layer for a moment, we realize that, at a high level, a convolution layer just transforms an *m x h x w* input into an *n x h x w* output. The transformation is achieved through trainable filters which progressively learn features as they *see* inputs. The question that follows is: can we achieve this transformation using fewer learnable parameters while compromising as little as possible on the learning capability of the layer? Depthwise Separable Convolutions were proposed to answer this exact question. Let’s understand them in detail and learn how they stack up on our evaluation metrics.

### Depthwise Separable Convolution

The concept of Depthwise Separable Convolutions (DSC) was first proposed by Laurent Sifre in their PhD thesis titled Rigid-Motion Scattering For Image Classification. Since then, they have been used successfully in various popular deep convolutional networks such as XceptionNet and MobileNet.

The main difference between a regular convolution, and a DSC is that a DSC is composed of 2 convolutions as described below:

1. A **depthwise grouped convolution**, where the number of input channels *m* is equal to the number of output channels such that each output channel is affected only by a single input channel. In PyTorch, this is called a “grouped” convolution. You can read more about grouped convolutions in PyTorch here.
2. A **pointwise convolution** (filter size=1), which operates like a regular convolution such that each of the *n* filters operates on all *m* input channels to produce a single output value.

Figure 3: Depthwise Separable Convolution filters applied to input to produce output. Assume *stride=1* and *padding=dₖ-2*. Source: Efficient Deep Learning Book
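The claim that each output channel of the depthwise step depends on exactly one input channel can be checked directly. The following is a minimal sketch rather than code from the notebooks: the tensor sizes and the perturbed channel index are arbitrary assumptions chosen for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

m, dk = 4, 3  # small, arbitrary sizes chosen for illustration
# groups=m gives every input channel its own dk x dk filter.
depthwise = nn.Conv2d(m, m, kernel_size=dk, padding=dk // 2, groups=m, bias=False)

x = torch.randn(1, m, 8, 8)
x_perturbed = x.clone()
x_perturbed[:, 2] += 1.0  # change input channel 2 only

with torch.no_grad():
    diff = (depthwise(x_perturbed) - depthwise(x)).abs()

# Largest change per output channel; only the perturbed channel should move.
per_channel_change = diff.amax(dim=(0, 2, 3))
changed_channels = [i for i, c in enumerate(per_channel_change.tolist()) if c > 1e-6]
print(f"Depthwise weight shape: {tuple(depthwise.weight.shape)}")
print(f"Output channels that changed: {changed_channels}")
```

Mixing information across channels is left entirely to the pointwise convolution that follows, which is why the two steps together can stand in for a regular convolution.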

Let’s perform the same exercise that we did for regular convolutions for DSCs and compute the number of trainable parameters and computations.

**Evaluation Of Trainable Parameters:** The “grouped” convolution has *m* filters, each of which has *dₖ x dₖ* learnable parameters, and together they produce *m* output channels. This results in a total of *m x dₖ x dₖ* learnable parameters. The pointwise convolution has *n* filters of size *m x 1 x 1*, which adds up to *n x m x 1 x 1* learnable parameters. The DSC layer therefore has *m x dₖ x dₖ + n x m* learnable parameters in total. Let’s look at the PyTorch code below to validate our understanding.

```python
class DepthwiseSeparableConv(nn.Sequential):
    def __init__(self, chin, chout, dk):
        super().__init__(
            # Depthwise convolution
            nn.Conv2d(chin, chin, kernel_size=dk, stride=1, padding=dk-2, bias=False, groups=chin),
            # Pointwise convolution
            nn.Conv2d(chin, chout, kernel_size=1, bias=False),
        )

conv2 = DepthwiseSeparableConv(chin=m, chout=n, dk=dk)
print(f"Expected number of parameters: {m * dk * dk + m * 1 * 1 * n}")
print(f"Actual number of parameters: {num_parameters(conv2)}")
```

Which will print.

```
Expected number of parameters: 656
Actual number of parameters: 656
```

We can see that the DSC version has roughly *7x* fewer parameters (656 versus 4608). Next, let’s focus our attention on the computation costs for a DSC layer.

**Evaluation Of Computational Cost:** Let’s assume our input has shape *m x h x w*. In the grouped convolution segment of DSC, we have *m* filters, each with size *dₖ x dₖ*. Each filter is applied to its corresponding input channel, resulting in a segment cost of *m x dₖ x dₖ x h x w*. For the pointwise convolution, we apply *n* filters of size *m x 1 x 1* to produce *n* output channels. This results in a segment cost of *n x m x 1 x 1 x h x w*. We need to add up the costs of the grouped and pointwise operations to compute the total cost. Let’s go ahead and validate this using the torchinfo PyTorch package.

```python
print(f"Expected total multiplies: {m * dk * dk * h * w + m * 1 * 1 * h * w * n}")
s2 = summary(conv2, input_size=(1, m, h, w))
print(f"Actual multiplies: {s2.total_mult_adds}")
print(s2)
```

Which will print.

```
Expected total multiplies: 10747904
Actual multiplies: 10747904
==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
DepthwiseSeparableConv                   [1, 32, 128, 128]         --
├─Conv2d: 1-1                            [1, 16, 128, 128]         144
├─Conv2d: 1-2                            [1, 32, 128, 128]         512
==========================================================================================
Total params: 656
Trainable params: 656
Non-trainable params: 0
Total mult-adds (M): 10.75
==========================================================================================
Input size (MB): 1.05
Forward/backward pass size (MB): 6.29
Params size (MB): 0.00
Estimated Total Size (MB): 7.34
==========================================================================================
```
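As a quick sanity check on the numbers above, we can divide the DSC count by the regular count. The *h x w* factor cancels in the cost ratio, so the same expression holds for both trainable parameters and mult-adds (biases ignored and *stride=1* assumed, as in the rest of this section):

```latex
\frac{m\,d_k^2 + m\,n}{m\,n\,d_k^2} = \frac{1}{n} + \frac{1}{d_k^2}
\qquad\Rightarrow\qquad
\frac{1}{32} + \frac{1}{9} \approx 0.142 \quad (m = 16,\ n = 32,\ d_k = 3)
```

This agrees with 656 / 4608 and 10747904 / 75497472 from the outputs above. It also shows that for a *3 x 3* kernel the ratio approaches 1/9 (about 11%) as the number of output channels grows.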

Let’s compare the sizes and costs of both the convolutions for a few examples to gain some intuition.

#### Size and Cost comparison for regular and depthwise separable convolutions

To compare the size and cost of regular and depthwise separable convolution, we will assume an input size of *128 x 128* to the network, a kernel size of *3 x 3*, and a network that progressively halves the spatial dimensions and doubles the number of channel dimensions. We assume a single 2d-conv layer at every step, but in practice, there could be more.

Figure 4: Comparing the number of trainable parameters (size) and multi-adds (cost) of regular and depthwise separable convolutions. We also show the ratio of the size and cost for the 2 types of convolutions. Source: Author(s).

You can see that, on average, both the size and the computational cost of DSC are about 11% to 12% of those of regular convolutions for the configuration mentioned above.

Figure 5: Relative size and cost of regular v/s DSC. Source: Author(s).
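A comparison of this kind can be approximated with a few lines of plain Python. The sketch below is not the code that produced the figures: the channel counts in `stages` are hypothetical, chosen only to follow the "halve the spatial size, double the channels" pattern described above, so the exact ratios may differ from those shown.

```python
def regular_conv(m, n, dk, h, w):
    """Return (parameters, mult-adds) of a regular convolution, ignoring biases."""
    params = m * n * dk * dk
    return params, params * h * w


def dsc_conv(m, n, dk, h, w):
    """Return (parameters, mult-adds) of a depthwise separable convolution."""
    params = m * dk * dk + m * n  # depthwise filters + pointwise filters
    return params, params * h * w


dk = 3
# Hypothetical stages: (input channels, output channels, spatial size).
stages = [(3, 16, 128), (16, 32, 64), (32, 64, 32), (64, 128, 16), (128, 256, 8)]

totals = {"reg_size": 0, "dsc_size": 0, "reg_cost": 0, "dsc_cost": 0}
for m, n, size in stages:
    reg_size, reg_cost = regular_conv(m, n, dk, size, size)
    dsc_size, dsc_cost = dsc_conv(m, n, dk, size, size)
    totals["reg_size"] += reg_size
    totals["dsc_size"] += dsc_size
    totals["reg_cost"] += reg_cost
    totals["dsc_cost"] += dsc_cost
    print(f"{m:>3} -> {n:<3} @ {size:>3}x{size:<3} "
          f"size ratio: {dsc_size / reg_size:.3f}  cost ratio: {dsc_cost / reg_cost:.3f}")

total_size_ratio = totals["dsc_size"] / totals["reg_size"]
total_cost_ratio = totals["dsc_cost"] / totals["reg_cost"]
print(f"Overall size ratio: {total_size_ratio:.3f}")
print(f"Overall cost ratio: {total_cost_ratio:.3f}")
```

Because the later stages hold most of the parameters, the overall size ratio is dominated by layers with many output channels, where the per-layer ratio is closest to 1/9.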

Now that we have developed a good understanding of the types of convolutions and their relative costs, you must be wondering if there’s any downside of using DSCs. Everything we’ve seen so far seems to suggest that they are better in every way! Well, we haven’t yet considered an important aspect which is the impact they have on the accuracy of our model. Let’s dive into it via an experiment below.

### SegNet Using Depthwise Separable Convolutions

This notebook contains all the code for this section.

We will adapt our SegNet model from the previous post and replace all the regular convolutional layers with DSC layers. Once we do this, the number of parameters in our model drops from 15.27M to 1.75M, which is a reduction of 88.5%! This is in line with our earlier estimate that a DSC network has about 11% to 12% of the trainable parameters of its regular counterpart.
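One way to perform such a swap on an existing model is to walk its module tree and replace each regular convolution in place. The helper below, `replace_with_dsc`, is a hypothetical sketch and not necessarily how the notebook builds its model. It assumes square, odd-sized kernels with *stride=1*, and it drops any bias term from the layers it replaces.

```python
import torch
from torch import nn


class DepthwiseSeparableConv(nn.Sequential):
    def __init__(self, chin, chout, dk):
        super().__init__(
            # Depthwise convolution: one dk x dk filter per input channel.
            nn.Conv2d(chin, chin, kernel_size=dk, padding=dk // 2, bias=False, groups=chin),
            # Pointwise convolution: mixes channels with 1 x 1 filters.
            nn.Conv2d(chin, chout, kernel_size=1, bias=False),
        )


def replace_with_dsc(module):
    """Recursively swap regular convolutions (kernel > 1, groups == 1) for DSC blocks."""
    for name, child in list(module.named_children()):
        if isinstance(child, nn.Conv2d) and child.kernel_size[0] > 1 and child.groups == 1:
            dsc = DepthwiseSeparableConv(child.in_channels, child.out_channels, child.kernel_size[0])
            setattr(module, name, dsc)
        else:
            replace_with_dsc(child)
    return module


# A toy stand-in for a real network, used only to exercise the helper.
toy = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(16),
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1, bias=False),
)
before = sum(p.numel() for p in toy.parameters())
replace_with_dsc(toy)
after = sum(p.numel() for p in toy.parameters())
out = toy(torch.randn(1, 3, 32, 32))
print(f"Parameters before: {before}, after: {after}, output shape: {tuple(out.shape)}")
```

Adding a predicate argument to the helper would allow only selected layers to be replaced, which is useful when trading model size against accuracy.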

A similar configuration as before was used during model training and validation. The configuration is specified below.

- The *random horizontal flip* and *colour jitter* data augmentations are applied to the training set to prevent overfitting.
- The images are resized to 128×128 pixels in a non-aspect-preserving resize operation.
- No input normalization is applied to the images; instead, a batch normalization layer is used as the first layer of the model.
- The model is trained for 20 epochs using the Adam optimizer with a LR of 0.001 and no LR scheduler.
- The cross-entropy loss function is used to classify a pixel as belonging to a pet, the background, or a pet border.
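The optimizer and loss parts of this configuration can be sketched as a single training step. The snippet below is illustrative only: the two-layer `model` is a placeholder for SegNet, the batch is random data, and we assume that "accuracy" means per-pixel accuracy.

```python
import torch
from torch import nn

torch.manual_seed(0)

num_classes = 3  # pet, background, pet border
model = nn.Sequential(
    nn.BatchNorm2d(3),  # first layer, used in place of input normalization
    nn.Conv2d(3, num_classes, kernel_size=3, padding=1),  # placeholder for SegNet
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# Random stand-ins for a batch of images and their segmentation masks.
images = torch.rand(4, 3, 128, 128)
targets = torch.randint(0, num_classes, (4, 128, 128))  # one class index per pixel

logits = model(images)  # shape [4, 3, 128, 128]: one score per class per pixel
loss = criterion(logits, targets)

optimizer.zero_grad()
loss.backward()
optimizer.step()

# Per-pixel accuracy: fraction of pixels whose highest-scoring class matches the mask.
pixel_accuracy = (logits.argmax(dim=1) == targets).float().mean().item()
print(f"loss: {loss.item():.4f}, pixel accuracy: {pixel_accuracy:.4f}")
```

Note that `nn.CrossEntropyLoss` accepts `[batch, classes, height, width]` logits together with `[batch, height, width]` integer targets, so no reshaping is needed for segmentation.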

The model achieved a validation accuracy of 86.96% after 20 training epochs. This is less than the 88.28% accuracy achieved by the model using regular convolutions for the same number of training epochs. We have determined experimentally that training for more epochs improves the accuracy of both models, so 20 epochs is definitely not the end of the training cycle. We stop at 20 epochs in this article for demonstration purposes.

We created a gif showing how the model learns to predict the segmentation masks for 21 images in the validation set.

Figure 6: A gif showing how the SegNet model with DSC is learning to predict segmentation masks for 21 images in the validation set. Source: Author(s)

Now that we have seen how the model progresses through the training cycle, let’s compare the train cycles of models with regular convolutions and DSC.

#### Accuracy Comparisons

We found it useful to look at the training cycles of the models using regular convolutions and DSC. The main difference we noticed is in the early phases (epochs) of training, after which both models settled roughly into the same prediction flow. In fact, after training both models for 100 epochs, we noticed that the accuracy of the model with DSC is just about 1% less than that of the model with regular convolutions. This is in line with our observations from just 20 epochs of training.

Figure 7: A gif showing the progression of segmentation masks predicted by the SegNet model using regular convolutions v/s DSC. Source: Author(s).

You would have noticed that both models get the predictions roughly right after just 6 training epochs, i.e. one can visually see that the models are predicting something useful. Most of the hard work of training the model is then about ensuring that the borders of the predicted masks are as tight as possible and as close to the actual pets in the image as possible. This means that while one can expect a smaller absolute increase in accuracy in the later training epochs, its impact on the quality of predictions is much greater. We’ve noticed that a single percentage point of accuracy improvement at higher absolute accuracy values (going from 89% to 90%) results in significant qualitative improvements to the predictions.

#### Comparison with a UNet model

We ran an experiment that changed many hyperparameters, with a focus on improving overall accuracy, to get a sense of how far the setup above is from optimal. Here’s the configuration of that experiment.

- Image size: 128 x 128, the same as the experiments so far
- Train epochs: 100 (the current experiments trained for 20 epochs)
- Augmentations: A lot more augmentations such as image rotation, channel dropping, random block removal. We used Albumentations instead of torchvision transforms. Albumentations automatically transforms segmentation masks for us
- LR Scheduler: A StepLR scheduler was used with a decay of 0.8x every 25 train epochs
- Loss function: We tried 4 different loss functions: Cross Entropy, Focal, Dice, Weighted Cross Entropy. Dice performed worst whereas the rest were pretty much comparable to each other. In fact the difference in best accuracy between the rest after 100 epochs was in the 4th digit after the decimal (assuming the accuracy is a number between 0.0 and 1.0)
- Convolution type: Regular
- Model type: UNet (the current experiments used a SegNet model)
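To get a feel for the learning rate schedule listed above, the sketch below reproduces the StepLR rule in plain Python. The initial learning rate of 0.001 is an assumption carried over from the earlier experiments, since this configuration does not state it, and `step_lr` is an illustrative helper rather than code from the notebook.

```python
def step_lr(initial_lr, epoch, step_size=25, gamma=0.8):
    """Learning rate at a given epoch when it is multiplied by gamma every step_size epochs."""
    return initial_lr * gamma ** (epoch // step_size)


initial_lr = 0.001  # assumption: same starting LR as the earlier experiments
schedule = {epoch: step_lr(initial_lr, epoch) for epoch in (0, 24, 25, 50, 75, 99)}
for epoch, lr in schedule.items():
    print(f"epoch {epoch:>2}: lr = {lr:.6f}")

# In PyTorch, the equivalent scheduler would be created with
# torch.optim.lr_scheduler.StepLR(optimizer, step_size=25, gamma=0.8)
# and advanced by calling scheduler.step() once per epoch.
```

Over 100 epochs the learning rate is reduced three times, so training ends at roughly half of the starting value.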

We achieved a best validation accuracy of 91.3% for the setting above. We noticed that the image size significantly impacts the best validation accuracy. For example, when we changed the image size to 256 x 256, the best validation accuracy went up to 93.0%. However, training took much longer, and used more memory, which meant that we had to reduce the batch size.

Figure 8: Result of training a UNet model for 100 train epochs with the hyperparameters mentioned above. Source: Author(s).

You can see that the predictions are much smoother and crisper compared to the ones we have been seeing so far.

### Conclusion

In part 3 of this series, we learned about depthwise separable convolutions (DSC) as a technique to reduce model size and training/inference cost without a significant loss in validation accuracy. We learned about the size/cost tradeoff to expect between regular convolutions and DSC for a specific setting.

We showed how to adapt the SegNet model to use DSC in PyTorch. This technique can be applied to any deep CNN. In fact we can selectively replace some of the convolutional layers with DSC — i.e. we don’t need to necessarily replace all of them. Choosing which layers to replace will depend on the balance you wish to strike between model size/runtime-cost and prediction accuracy. This decision will depend on your specific use case and deployment setup.

While this article trained models for 20 epochs, we explained that this is insufficient for production workloads, and provided a glimpse into what one can expect if one trains the model for more epochs. In addition, we provided an introduction to some of the hyperparameters that one can tune during model training. While this list is by no means comprehensive, it should allow you to appreciate the complexity and decision making needed to train an image segmentation model for production workloads.

In the next part of this series, we’ll take a look at Vision Transformers, and how we can use this model architecture to perform image segmentation for the pets segmentation task.

### References and further reading

- Efficient Deep Learning Book, Chapter 04: Efficient Architectures
- A Basic Introduction to Separable Convolutions

Efficient Image Segmentation Using PyTorch: Part 3 was originally published in Towards Data Science on Medium.