Alessio Colucci

Supervisors: Muhammad Shafique & Andreas Steininger

Fault Analysis and Mitigation for Modern Deep Learning Systems

 

In the last decade, Neural Networks (NNs) have become widely used due to their proficiency in complex tasks. However, while they have reached a level of performance and accuracy that allows system designers to deploy them even on resource-constrained IoT devices, few studies have focused on their reliability in safety-critical real-world scenarios. Most reliability research has focused on permanent faults, neglecting transient faults, which have become increasingly common with the continued scaling of technology nodes. Additionally, real-world deployments use various optimization techniques, e.g., quantization and pruning, and alternative models, e.g., Spiking Neural Networks (SNNs), to further improve performance, yet they still rely on older mitigation techniques, e.g., Triple Modular Redundancy, which are not cost-effective for NNs.
Hence, there is a need for further experiments and analysis of transient faults in NNs, especially SNNs and compressed NNs, in order to develop better error models and mitigations. Our work has focused on developing a modern fault injection framework that enables better analysis of different NNs while remaining independent of the model and platform used. With this framework, different fault models can be tested, and the results of the injections can be gathered to train a general error model for NNs. This model will be used to fine-tune the injected faults and to develop cost-effective mitigations for NNs.
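
As an illustration of the kind of transient fault such a framework might inject, the sketch below flips a single bit in a 32-bit floating-point weight. It is a minimal example under stated assumptions, not the framework's actual implementation: the single-bit-flip fault model, the float32 weight representation, and the function name `flip_bit` are all hypothetical choices made for illustration.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value`, stored as an IEEE 754 float32, with one bit inverted.

    Hypothetical helper: bit 0 is the least significant mantissa bit,
    bit 31 is the sign bit.
    """
    if not 0 <= bit <= 31:
        raise ValueError("bit must be in [0, 31]")
    # Reinterpret the float32 as an unsigned 32-bit integer.
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    # Invert the selected bit to emulate a transient single-bit upset.
    as_int ^= 1 << bit
    # Reinterpret the corrupted bit pattern as a float32 again.
    (faulty,) = struct.unpack("<f", struct.pack("<I", as_int))
    return faulty


weight = 0.5
sign_flipped = flip_bit(weight, 31)      # fault in the sign bit
exponent_flipped = flip_bit(weight, 30)  # fault in the top exponent bit
mantissa_flipped = flip_bit(weight, 0)   # fault in the lowest mantissa bit
```

In this example, a fault in the top exponent bit turns 0.5 into 2^127, while a fault in the lowest mantissa bit changes the value only marginally, which is one reason the impact of a transient fault depends strongly on where it occurs.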