Zahra Babaiee

Supervisor: Radu Grosu

Towards Bio-inspired, Small and Robust Deep Learning Vision Systems

 

Cyber-Physical Systems (CPS) are systems of collaborating embedded systems in intensive interaction with the ever-changing surrounding physical world. They are composed of a computational core and a physical core: the computational core receives information about the environment through sensors in the physical part and, via a controller, tells the actuators of the physical component what to do. Many prominent applications of embedded systems rely on computer vision, ranging from industrial machine-vision systems and autonomous vehicles to image processing in medicine and disease diagnosis.

In recent years, deep learning methods have achieved state-of-the-art results in many fields, with computer vision as one of the most prominent. Deep learning methods outperform other machine learning methods on various vision problems, including image classification, object recognition, and image generation. Thanks to the huge boost in deep learning methods and the appearance of large, labelled, high-quality image datasets like ImageNet [12], computer vision systems are now part of our everyday life, from face recognition on smartphones to autonomous driving. While there is great interest in deploying deep vision systems in our homes, factories, and workplaces to help us solve complex tasks, we should still act carefully. These systems need to be efficient, interpretable, and above all safe, in the sense that they work reliably and consistently in uncertain, complex environments.

Many deep learning computer vision models use Convolutional Neural Networks (CNNs) [49], which are heavily inspired by the visual cortex. They consist of hierarchical layers that extract localized features. In each convolution layer, a small kernel shifts over the input image, convolving each patch with a set of filters.
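The sliding-kernel operation just described can be sketched in a few lines of NumPy. This is a minimal illustration only: the function name, the toy image, and the edge-style kernel below are illustrative choices, not part of any particular CNN library.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a small kernel over the image, computing one dot product
    per output position (no padding, stride 1). A didactic sketch,
    not an optimized convolution implementation."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i:i + kh, j:j + kw]  # local "receptive field"
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_kernel = np.array([[1.0, -1.0],
                        [1.0, -1.0]])  # crude vertical-edge detector
print(conv2d(image, edge_kernel).shape)  # prints (4, 4)
```

Each output value summarizes one patch, which is what lets stacked convolution layers build up hierarchical, localized features.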
These kernels function like the receptive fields in the retina: they change the activity of the neurons connected to that patch in the next layer. Fukushima’s Neocognitron [17] was one of the first hierarchical neural networks and inspired many other variants [18, 50, 9].

Large CNNs achieve considerable accuracy, but with a significant computing, memory, and energy footprint. These models are dense and over-parameterized. For instance, ResNet-152 [27] has 60 million parameters in its convolution layers alone and requires 11.3 GFLOPs for a single forward pass; training such a network on the full ImageNet dataset takes about 1.5 weeks on 4 GPUs. This over-parameterization limits the use of these systems in resource-limited environments such as mobile or embedded devices, due to the memory and computation costs. It is therefore important to come up with smaller models that perform without significantly losing the accuracy of their bigger counterparts. This can be achieved either by designing smaller network architectures, or by training a huge over-parameterized network and then sparsifying it, pruning either the synapses or the neurons.

In addition to their heavy costs, huge networks suffer from interpretability issues. As networks grow in size, so do concerns about their black-box nature, compromising the trust needed to use them in critical areas. Interpretability is especially important in safety-critical applications such as medicine, where opaque models can lead to algorithmic discrimination and ethical problems [69]. The great success of deep vision systems and their use in safety-critical embedded systems make the security and robustness of these networks increasingly important.
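The text above does not commit to a specific pruning criterion, so the sketch below shows one common illustrative instance of synapse (unstructured) pruning: magnitude-based pruning, which zeroes out the weights with the smallest absolute values. The function name and example matrix are assumptions for the sketch.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out roughly the fraction `sparsity` of weights with the
    smallest absolute value. A sketch of one pruning criterion only,
    not a full prune-and-retrain pipeline (ties at the threshold may
    prune slightly more than requested)."""
    flat = np.abs(weights).flatten()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    mask = np.abs(weights) > threshold            # keep only larger weights
    return weights * mask

w = np.array([[0.05, -0.8, 0.3],
              [0.01, 0.6, -0.02]])
pruned = magnitude_prune(w, 0.5)  # half the synapses removed
```

In practice such pruning is interleaved with retraining so the remaining weights can compensate, which is what allows high sparsity with little accuracy loss.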

Research has shown that CNNs are often brittle to different noise levels and image perturbations [13, 20, 31], and can easily be fooled by altered images that are completely unrecognisable to humans, yet are classified as particular objects with up to 99.99% confidence [58]. The robustness of these networks is endangered even further when they are sparsified [53]. In my thesis, I intend to focus on creating biologically inspired deep vision models with two main characteristics: a) a small memory footprint, achieved both by design and by pruning redundant synapses, and b) robustness to distribution shifts and adversarial attacks.