Model Compression in Deep Vision Networks (Part 2)

A survey of two compression techniques for complex CNN architectures

Ria Gupta
12 min read · Nov 1, 2022

This article covers the second part of a review of 4 research papers read for the course Practical Deep Learning Systems Performance (COMSE6998) under the guidance of Prof. Parijat Dube at Columbia University. You can read the first part in Ashutosh’s blog where he covers the first two papers.

Introduction

Convolutional Neural Networks (CNNs) are among the most popular deep learning models owing to how well they work on image data and vision-related tasks. An immense amount of research has gone into creating complex architectures that achieve state-of-the-art results on various image-related applications, including image classification, segmentation, etc. However, as we create more complex architectures for better results, we also increase the size, number of parameters, and memory requirements of these models. As a result, compute and memory constraints often limit how well these networks can be deployed and run on different kinds of devices.

As a result, there has been growing interest in reducing the size and memory requirements of large CNN architectures so that they can be deployed on smaller devices with limited resources. Various techniques have been devised to achieve maximum compression in CNNs, including different types of pruning, quantization, knowledge distillation, etc.

In this post, I focus on two recent compression techniques: one combining weight pruning and knowledge distillation for optimal compression, and the other performing adaptive filter pruning with an adversarial approach.

Paper 1: Combining Weight Pruning and Knowledge Distillation for CNN Compression

Popular compression methods like weight quantization, knowledge distillation, and weight pruning are effective for simpler networks such as VGG or AlexNet. However, they cannot be applied directly to intricately connected networks like ResNets that involve dimensionality dependencies and skip connections. Applying these methods out-of-the-box to these models can break their structure and render the model untrainable.

To mitigate this problem, the authors of this paper combine two different techniques: Weight Pruning and Knowledge Distillation. First, they apply weight pruning to a subset of the layers that do not cause network breakage (called prunable layers). Next, they use knowledge distillation from a teacher-student architecture to compress the remaining layers (unprunable layers). Thus, they are able to compress both types of layers and obtain a much smaller network, without compromising much on the model accuracy.

Methodology

The proposed approach first prunes the initial model (the baseline network) using an activation analysis to find and remove neurons that do not influence the final prediction. This pruned model is then used to distill knowledge into a smaller student network.

Figure 1: Prunable layers (yellow) and un-prunable layers (red) in two residual blocks of the ResNet50 architecture

The dimensionality dependencies between layers of different blocks in ResNet-like architectures limit the direct application of neuron pruning. If one prunes a layer with a skip connection, both sides of the add operation would have different dimensions. This would break the network architecture and hamper the training of the model.

The authors, therefore, only prune specific neurons from layers that are prunable (i.e., that do not break the network architecture) and use finetuning to recover the model accuracy. The candidates for pruning are the neurons with the lowest importance, where a neuron's importance is computed from the Average Percentage of Zeros (APoZ, as described in [1]) in its activation output. Thus, with M validation samples and an output feature map of dimension N, the importance of a filter c in a prunable layer i can be computed as:

Figure 2: Filter importance computation
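Concretely, the APoZ score from [1] measures the fraction of zero activations a filter produces over the validation set; with the notation above it can be written (up to the paper's exact notation) as

$$\mathrm{APoZ}^{(i)}_c = \frac{\sum_{k=1}^{M}\sum_{j=1}^{N} f\big(O^{(i)}_{c,j}(k)=0\big)}{N \times M}$$

where f(·) equals 1 when its argument is true and 0 otherwise, and O^{(i)}_{c,j}(k) is the j-th element of the output feature map of filter c in layer i for validation sample k.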

If a neuron's average percentage of zeros is more than one standard deviation above the layer's mean APoZ, it contributes less than its peers and can safely be pruned without impacting the accuracy too much. This process is repeated for all prunable layers, after which the network is finetuned in an attempt to recover the baseline model's accuracy. These pruning and finetuning rounds are repeated until the drop in accuracy becomes significant, yielding the partially pruned model.
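As a concrete illustration, here is a minimal NumPy sketch (my own, not the authors' code) of this selection rule for one prunable layer, assuming its activations over the M validation samples are stored in an array of shape (M, C, H, W):

```python
import numpy as np

def select_prunable_filters(activations):
    """Return indices of filters to prune for one prunable layer."""
    M, C, H, W = activations.shape
    # Average Percentage of Zeros per filter, computed over all M samples
    # and all N = H * W positions of the output feature map
    apoz = (activations == 0).reshape(M, C, -1).mean(axis=(0, 2))  # shape (C,)
    # Prune filters whose APoZ is more than one standard deviation
    # above the layer's mean APoZ
    threshold = apoz.mean() + apoz.std()
    return np.where(apoz > threshold)[0]
```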

Next, the unprunable layers are compressed using knowledge distillation. The authors use the partially pruned model, rather than the original model, as the teacher so that the student network does not inherit the redundancy that was already pruned away. For the student network, they use a loss function based on the cosine similarity between the deep layers of the network and the prediction layer of the teacher network, with a hyperparameter λ controlling the importance of the final prediction in the total loss. Thus, for a teacher network T with M intermediate layers, a student network S, and Tₒ and Dₒ being the output layers, the distillation loss is computed as:

Figure 3: Cosine Distillation Loss

The student network thus learns the compressed knowledge of the teacher in a two-step process: a batch of input samples is passed through both networks in a forward pass, and the student network's weights are then optimized by backpropagating the distillation loss through the student only. This yields the final compressed model, obtained with both pruning and knowledge distillation.
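A minimal PyTorch-style sketch of this training step might look as follows. This is an illustration rather than the authors' implementation: I assume the loss sums cosine distances over matched intermediate features plus a λ-weighted cosine term on the outputs, that both networks return their intermediate features along with their predictions, and that matched features have compatible shapes (in practice a projection may be needed since the student is narrower).

```python
import torch
import torch.nn.functional as F

def cosine_distance(a, b):
    # 1 - cosine similarity, averaged over the batch
    return (1 - F.cosine_similarity(a.flatten(1), b.flatten(1), dim=1)).mean()

def distillation_loss(t_feats, s_feats, t_out, s_out, lam=0.5):
    # Cosine distances over the M matched intermediate layers ...
    feat_loss = sum(cosine_distance(t, s) for t, s in zip(t_feats, s_feats))
    # ... plus a lambda-weighted term on the final predictions
    return feat_loss + lam * cosine_distance(t_out, s_out)

def distill_step(teacher, student, optimizer, x, lam=0.5):
    teacher.eval()
    with torch.no_grad():                 # the teacher stays frozen
        t_feats, t_out = teacher(x)       # assumed to return (features, logits)
    s_feats, s_out = student(x)
    loss = distillation_loss(t_feats, s_feats, t_out, s_out, lam)
    optimizer.zero_grad()
    loss.backward()                       # error flows only through the student
    optimizer.step()
    return loss.item()
```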

The effectiveness of this approach was verified on both image-based regression and classification: 3D head pose estimation with ResNet50, and image classification with ResNet110 and ResNet164. For the baseline ResNet50 model, the first stage consisted of 16 pruning steps on the prunable layers. The authors observed that most of the redundancy was concentrated in the first and deeper layers, as shown in figure 4 below.

Figure 4: Number of pruned neurons with pruning step and layer of the baseline model

Next, the authors used this model as the teacher for a student network with the same architecture, but whose unprunable layers have 1/32 of the teacher's filters, achieving compression while acquiring the distilled knowledge. The final compressed model architecture can be seen in figure 5.

Figure 5: Final model architecture (bottom) after distillation from teacher to student. The red layers are unprunable and compressed via knowledge distillation, and the number of filters in each layer is indicated in the parentheses

The authors summarized metrics such as model size and regression MAE (shown below in table 1). The smallest models are the student networks, and the student with knowledge distilled from the pruned teacher achieves a lower MAE than the same student trained from scratch. The MAE increases from 4.22 to 5.25 after compression, but the model size decreases significantly. The authors also test for robustness by evaluating the models under motion blur, occlusion, and changes in brightness, as highlighted in table 2, where the student network performs strongly and even reduces the MAE in most scenarios.

Figure 6: Results on image-regression for 3D Pose Estimation

Similarly, for image classification, the authors tested on two ResNet models, with 6 pruning steps in stage 1 and half the filters for unprunable layers in stage 2. Again, we can see that the approach achieved a significant compression rate with only a slight accuracy drop of around 1%, as highlighted in table 3. The authors also compare the inference time for both regression and classification tasks, as highlighted in tables 5 and 6.

Figure 7: Results on Image Classification

Thus, the approach delivers sizeable compression rates with minimal accuracy drops across both regression and classification tasks. It also reduces inference times, which makes it well suited for deployment on low-resource devices and a strong competitor to various other state-of-the-art compression techniques.

Paper 2: Play and Prune: Adaptive Filter Pruning for Deep Model Compression

This paper differs from the first in two respects: first, instead of combining knowledge distillation with pruning, it adopts a min-max adversarial framework for joint pruning and accuracy optimization; second, it does not require any external finetuning, unlike most pruning-based compression algorithms. Additionally, this approach does not require a pre-defined pruning level to be specified for each layer. Instead, it works flexibly with a specified error tolerance level, based on which the framework decides which filters to prune from which layers.

Overview of Approach

The authors propose a min-max framework for pruning called Play and Prune, which prunes and finetunes the CNN jointly to achieve maximal pruning with minimal accuracy drop using adaptive pruning rates. The proposed approach comprises two adversaries: the Adaptive Filter Pruning (AFP) module and the Pruning Rate Controller (PRC) module. While the AFP tries to prune as much as possible to minimize the number of filters in the model, the PRC tries to maximize the accuracy after each round of pruning.

In a nutshell, the approach mainly involves the following two steps:

  1. The AFP will prune a filter only when the drop in accuracy is smaller than the specified error tolerance (ε).
  2. If the accuracy drop is higher than ε, the PRC tries to recover the accuracy using implicit finetuning.

If the PRC cannot finetune the model enough to bring the error within the tolerance, the AFP does not prune this filter and the game converges.

Methodology

Based on the above discussion, we can see that the AFP P tries to minimize the number of filters in the network, whereas the PRC C optimizes the accuracy given that number of filters. Thus, the objective function can be written as:

Figure 8: Combined Objective Function

where K is the number of layers in the CNN architecture, Lᵢ is the iᵗʰ layer, FLᵢ denotes the set of filters in the iᵗʰ layer, and #w denotes the number of remaining filters after pruning by AFP.

After each pruning step, the PRC checks the accuracy drop. If it is more than ε, the PRC finetunes the network to recover model performance and resets the pruning rate to zero; if not, the AFP continues to prune. The pruning rates in each iteration are thus decided adaptively. The game converges when the performance drop exceeds ε and the PRC cannot recover it: the last pruning step is rolled back and the previous model with accuracy drop ≤ ε is restored, yielding the optimal compressed model. This process is represented diagrammatically in figure 9 below.

Figure 9: Game between AFP and PRC
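To make the interplay concrete, here is a schematic Python sketch of the game loop (my own simplification; afp_prune, prc_finetune, and evaluate are hypothetical stand-ins for the paper's modules, passed in as callables):

```python
import copy

def play_and_prune(model, baseline_acc, eps, afp_prune, prc_finetune, evaluate,
                   max_rounds=100):
    best = copy.deepcopy(model)               # last model within tolerance
    for _ in range(max_rounds):
        afp_prune(model)                      # AFP: remove the least important filters
        acc = evaluate(model)
        if baseline_acc - acc > eps:
            acc = prc_finetune(model)         # PRC: pause pruning, try to recover accuracy
            if baseline_acc - acc > eps:      # PRC cannot recover: the game converges,
                return best                   # i.e., roll back the last pruning step
        best = copy.deepcopy(model)           # still within tolerance, keep pruning
    return best
```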

Working of the AFP

The AFP is responsible for deciding which filters to prune. It first needs to identify a candidate set of filters. In each epoch, it uses a filter partitioning approach to divide all the filters into two sets: important filters (I) and unimportant filters (U). The unimportant filters are the α% of filters in each layer with the smallest sum of absolute filter weights; the remaining filters in the layer belong to the important set. This is summarized in figure 10 below.

Figure 10: Computation of important and unimportant sets
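A minimal PyTorch sketch of this per-layer partitioning (again my own illustration, assuming a standard convolution weight tensor of shape (out_channels, in_channels, kH, kW)) could look like this:

```python
import torch

def partition_filters(conv_weight, alpha=0.1):
    # l1 score per filter: sum of absolute weights over all its coefficients
    scores = conv_weight.abs().sum(dim=(1, 2, 3))
    k = max(1, int(alpha * scores.numel()))
    order = torch.argsort(scores)              # ascending: smallest scores first
    unimportant = order[:k]                    # U: the alpha% weakest filters
    important = order[k:]                      # I: everything else
    return important, unimportant
```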

Next, the AFP decides which filters to prune from the candidate set U, as removing all of them might result in a steep drop in accuracy. This is done using a modified objective function, as shown in figure 11. Here, C(Θ) is the original cost function with remaining filters Θ’, #w is the collection of all filters in the layer, and λA is the l₁ regularization constant. The authors also choose an adaptive weight threshold Wγᵢ such that removing any filter with the sum of absolute weights less than this threshold results in a negligible accuracy drop.

Figure 11: Objective of the AFP module

Thus, the working of the AFP can be summarized as follows: in each epoch, it selects the α% least important filters in each layer to form U (the rest form I), applies the l₁ regularization to the filters in U, and prunes those whose sum of absolute weights falls below the adaptive threshold Wγᵢ, with both the regularization constant and the threshold chosen dynamically by the PRC.
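The sketch below (illustrative PyTorch, not the paper's code) shows how this could be wired up: an l1 penalty weighted by λA is added only over the filters in U, and filters whose absolute-weight sum falls below the threshold Wγ are zeroed out; both values are assumed to come from the PRC.

```python
import torch

def afp_regularizer(conv_weight, unimportant_idx, lambda_a):
    # l1 penalty on the unimportant filters only, added to the task loss
    return lambda_a * conv_weight[unimportant_idx].abs().sum()

def afp_threshold_prune(conv_weight, w_gamma):
    # Zero out (i.e., prune) filters whose sum of absolute weights falls
    # below the adaptive threshold supplied by the PRC
    with torch.no_grad():
        scores = conv_weight.abs().sum(dim=(1, 2, 3))
        mask = scores < w_gamma
        conv_weight[mask] = 0.0
    return mask.nonzero(as_tuple=True)[0]      # indices of the pruned filters
```

During training, the regularizer is simply added to the task loss for each prunable layer, which pushes the unimportant filters toward zero so that thresholding them away causes a negligible accuracy drop.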

Working of the PRC

The PRC is responsible for dynamically updating the hyperparameters λ and Wγ after every pruning iteration. The quantities involved are as follows: C(#w) is the accuracy with #w remaining filters, E is the accuracy of the unpruned network, the quantity C(#w) − (E − ε) measures how far we are from the tolerance level ε, and δw is a pruning rate accelerator or decelerator. Using these quantities, the PRC dynamically controls the rate at which the AFP prunes filters: while the accuracy is below the tolerance level, the update focuses only on recovering accuracy; once it is back within the tolerance level, the adaptive pruning rate depends on how far we are from it.
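Schematically (my own sketch based only on the behaviour described above, not the paper's exact update equations), the PRC's control logic could be expressed as:

```python
def prc_update(acc, baseline_acc, eps, lambda_a, w_gamma, delta_w):
    """Return updated (lambda_a, w_gamma) controlling how aggressively AFP prunes."""
    margin = acc - (baseline_acc - eps)     # how far we are from the tolerance level
    if margin < 0:
        # Accuracy has fallen below tolerance: stop pruning and only fine-tune
        return 0.0, 0.0
    # Within tolerance: scale the pruning pressure with the remaining margin,
    # with delta_w acting as an accelerator/decelerator
    return lambda_a * delta_w * margin, w_gamma * delta_w * margin
```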

To verify the working of the min-max framework, the authors conducted experiments on both a small (CIFAR-10) and a large (ILSVRC-2012) dataset with the VGG16 and ResNet50 networks. The results for VGG16 compression on CIFAR-10 are shown in table 1, where PP-1 is the first pruned model, obtained after 34 epochs, and PP-2 is the second, obtained after 62 epochs.

Table 1: Pruning results for each layer in the pruned models

They also found that iterative pruning performs much better than pruning in one shot: for the same 84.5% reduction in FLOPs, one-shot pruning suffers a significant accuracy drop of 8.03%.

The authors go a step further and compare the theoretical speedup implied by the reduction in FLOPs with the practical speedup actually observed. The practical speedup is often lower than expected because of factors such as the number of CPU/GPU cores, intermediate layer bottlenecks, batch size, etc. These results are summarized in figure 14. For example, for VGG16 with a batch size of 512, the theoretical speedup is 6.43x, whereas the observed GPU speedup is lower at 4.02x. The gap is much smaller on the CPU, with an observed speedup of 6.24x against the same theoretical 6.43x.

Figure 14: Speedup on CPU and GPU over different batch sizes for VGG16 on CIFAR-10

On the smaller CIFAR-10 dataset, the compression of both the VGG16 and ResNet50 models performs much better than various other popular pruning approaches. Similarly, on the larger ImageNet dataset, the framework shows better results than the other compression approaches for both VGG16 and ResNet50. These results are summarized in figure 13.

Figure 13: Summary of Compression of VGG16 and ResNet50 on CIFAR-10 and ImageNet datasets

Thus, we can see that the min-max Play and Prune framework achieves a significant reduction in model size, number of parameters, and FLOPs. Additionally, this approach leads to substantially improved performance compared to other recently proposed filter pruning strategies while being highly stable and efficient to train. The authors also highlight how this approach generalizes well to other image-related tasks like object detection, and can easily be extended to work with other compression techniques like quantization, weight pruning, etc.

Insights

Several insights can be drawn from the two papers:

  • The authors of the first paper combine weight pruning and knowledge distillation, using a new cosine similarity-based student-teacher loss, to compress complex architectures like ResNets. The approach achieves strong compression rates for both regression and classification tasks on images with only small accuracy drops
  • The second paper introduces an adversarial approach to compression, in which two modules compete to achieve maximum compression with minimum accuracy drop. The approach uses adaptive pruning rates to compress the network optimally without significantly losing performance

Conclusion

Finding ways to compress complex CNN architectures holds immense promise considering most state-of-the-art networks have complicated structures. Additionally, pruning filters in convolutional layers can prove more beneficial than pruning fully connected layers if we consider the reduction in FLOPs, as most computations occur in the convolutional layers. The adversarial approach can open doors to more approaches in the model compression domain to achieve maximal compression with minimal accuracy drop. Model compression in vision networks is an important domain with a lot of ongoing research, and the papers summarized above contribute immensely to this area.
