
Piano-SSM

Diagonal State Space Models for Efficient MIDI-to-Raw Audio Synthesis at DAFx25


Researchers from the Christian Doppler Laboratory for Embedded Machine Learning at TU Wien's Institute of Computer Technology (ICT) introduce Piano-SSM, a lightweight and interpretable neural architecture for real-time piano audio synthesis from MIDI input. It builds on diagonal state space models and requires no domain-specific priors.

Audio Samples: https://domdal.github.io/piano-ssm-samples/
Repository: https://github.com/domdal/piano-ssm/

At the forthcoming 28th International Conference on Digital Audio Effects (DAFx25), researchers from TU Wien’s Christian Doppler Laboratory for Embedded Machine Learning will present Piano-SSM, a novel neural architecture for efficient and interpretable MIDI-to-raw audio synthesis.
Piano-SSM builds upon recent advances in deep diagonal state space models (SSMs) and proposes an end-to-end trainable architecture that synthesizes high-quality piano audio directly from MIDI input. Unlike prior approaches such as DDSP-Piano, which integrate domain-specific submodels, Piano-SSM eliminates the need for handcrafted acoustic priors. Instead, it relies solely on a compact SSM-based sequence model: even the largest variant has only 270k parameters, yet it achieves competitive results on established datasets.
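To make the architecture description more concrete, here is a minimal sketch of the recurrence at the heart of a diagonal SSM layer, written in Python for illustration. It assumes a generic single-input, single-output layer with complex diagonal eigenvalues discretized by zero-order hold (ZOH). The function names `zoh_discretize` and `diagonal_ssm` are hypothetical, and Piano-SSM's actual parameterization, discretization, and layer stacking may differ; the repository linked above contains the real code.

```python
import cmath
import math


def zoh_discretize(lam: complex, b: complex, dt: float) -> tuple[complex, complex]:
    """Zero-order-hold discretization of one diagonal state dx/dt = lam * x + b * u.

    Assumption: ZOH is used here for illustration; Piano-SSM's scheme may differ.
    """
    lam_bar = cmath.exp(lam * dt)
    # Exact ZOH input gain for a scalar (diagonal) state: (e^{lam*dt} - 1) / lam * b.
    b_bar = (lam_bar - 1) / lam * b
    return lam_bar, b_bar


def diagonal_ssm(u, lams, bs, cs, d, dt):
    """Run a hypothetical single-input, single-output diagonal SSM over the sequence u.

    Each state evolves independently: x_k[i] = lam_bar[i] * x_{k-1}[i] + b_bar[i] * u_k,
    and the output is y_k = Re(sum_i c[i] * x_k[i]) + d * u_k.
    """
    disc = [zoh_discretize(lam, b, dt) for lam, b in zip(lams, bs)]
    x = [0j] * len(lams)
    y = []
    for u_k in u:
        for i, (lam_bar, b_bar) in enumerate(disc):
            x[i] = lam_bar * x[i] + b_bar * u_k
        y.append(sum(c * xi for c, xi in zip(cs, x)).real + d * u_k)
    return y


# Usage: impulse response of one decaying 440 Hz mode at 44.1 kHz.
fs = 44100.0
lam = complex(-50.0, 2 * math.pi * 440.0)  # decay rate and angular frequency (rad/s)
impulse = [1.0] + [0.0] * 99
y = diagonal_ssm(impulse, [lam], [1.0 + 0j], [1.0 + 0j], 0.0, 1 / fs)
```

Because the state matrix is diagonal, each state costs one complex multiply-add per sample, which is what keeps sample-by-sample autoregressive inference cheap enough for a CPU.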
Quantitative evaluations on the MAESTRO v3.0.0 and MAPS datasets show that Piano-SSM outperforms or closely matches state-of-the-art baselines in terms of Multi-Scale Spectral Loss (MSSL), while enabling real-time autoregressive inference on conventional CPUs. A header-only C++17 implementation shows that even the largest model variant synthesizes one second of 44.1 kHz audio in 0.44 seconds (a real-time factor of 0.44) with minimal latency (10.1 µs I/O delay), demonstrating its suitability for embedded and low-latency systems.
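The Multi-Scale Spectral Loss compares magnitude spectrograms of the target and synthesized audio at several FFT resolutions. The sketch below follows a common DDSP-style formulation (an L1 term on linear magnitudes plus an L1 term on log magnitudes, summed over scales). The FFT sizes, quarter-frame hop, Hann window, `alpha` weight, and mean reduction are assumptions, not the configuration used to evaluate Piano-SSM. It uses a naive DFT so that it runs with the standard library alone; practical code would use an FFT library.

```python
import cmath
import math


def stft_magnitude(signal, n_fft, hop):
    """Magnitude spectrogram from Hann-windowed frames using a naive DFT (slow, illustrative)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    # Precompute DFT basis rows for the non-negative frequency bins 0..n_fft/2.
    basis = [[cmath.exp(-2j * math.pi * k * n / n_fft) for n in range(n_fft)]
             for k in range(n_fft // 2 + 1)]
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(n_fft)]
        frames.append([abs(sum(x * w for x, w in zip(frame, row))) for row in basis])
    return frames


def multi_scale_spectral_loss(target, pred, fft_sizes=(64, 128, 256), alpha=1.0, eps=1e-7):
    """Sum over FFT sizes of mean|S - S_hat| + alpha * mean|log S - log S_hat|.

    Assumed DDSP-style formulation; sizes, hop, window, and weighting are illustrative.
    """
    if len(target) != len(pred):
        raise ValueError("target and pred must have the same length")
    if len(target) < max(fft_sizes):
        raise ValueError("signals must be at least as long as the largest FFT size")
    total = 0.0
    for n_fft in fft_sizes:
        hop = n_fft // 4
        s = stft_magnitude(target, n_fft, hop)
        s_hat = stft_magnitude(pred, n_fft, hop)
        pairs = [(a, b) for fa, fb in zip(s, s_hat) for a, b in zip(fa, fb)]
        lin = sum(abs(a - b) for a, b in pairs) / len(pairs)
        log_diff = sum(abs(math.log(a + eps) - math.log(b + eps)) for a, b in pairs) / len(pairs)
        total += lin + alpha * log_diff
    return total


# Usage: compare a 440 Hz tone with itself and with an 880 Hz tone.
sr = 8000
a = [math.sin(2 * math.pi * 440 * n / sr) for n in range(512)]
b = [math.sin(2 * math.pi * 880 * n / sr) for n in range(512)]
loss_same = multi_scale_spectral_loss(a, a)  # 0.0 for identical signals
loss_diff = multi_scale_spectral_loss(a, b)  # positive
```

Using several FFT sizes trades time resolution against frequency resolution, so no single window length dominates the comparison, and the log term keeps quiet spectral regions from being ignored.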
A key contribution is the model’s sampling rate flexibility: Piano-SSM can be trained at a high sampling rate and then run at a lower sampling rate without retraining. An analysis of discretization effects reveals a strong correlation between the resulting performance loss and eigenvalue aliasing, offering insight into the trade-off between accuracy and computational efficiency.
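One way to picture eigenvalue aliasing: assuming ZOH discretization, a continuous-time mode with eigenvalue λ becomes λ̄ = exp(λ / f_s) at sampling rate f_s, so its per-sample rotation angle is Im(λ) / f_s. If that angle exceeds π, meaning the mode's frequency lies above the Nyquist limit f_s / 2, it wraps around and the mode rings at a lower, aliased frequency. The helpers below (`discretize`, `apparent_frequency_hz`, `is_aliased`) are hypothetical and only illustrate this mechanism; they do not reproduce the paper's analysis or its exact discretization scheme.

```python
import cmath
import math


def discretize(lam: complex, fs: float) -> complex:
    """ZOH state transition exp(lam / fs) for one diagonal mode (assumed scheme)."""
    return cmath.exp(lam / fs)


def apparent_frequency_hz(lam: complex, fs: float) -> float:
    """Oscillation frequency (Hz) that the discretized mode actually produces."""
    angle = cmath.phase(discretize(lam, fs))  # per-sample rotation, wrapped to [-pi, pi]
    return abs(angle) * fs / (2 * math.pi)


def is_aliased(lam: complex, fs: float) -> bool:
    """True if the mode's continuous-time frequency exceeds the Nyquist limit fs / 2."""
    return abs(lam.imag) / (2 * math.pi) > fs / 2


# Usage: a lightly damped 15 kHz mode evaluated at two sampling rates.
lam = complex(-200.0, 2 * math.pi * 15000.0)
f_high = apparent_frequency_hz(lam, 44100.0)  # high rate: mode stays below Nyquist
f_low = apparent_frequency_hz(lam, 16000.0)   # lower rate: mode folds back
```

Here `f_high` is about 15000 Hz, while `f_low` is about 1000 Hz, because 15 kHz lies above the 8 kHz Nyquist limit and folds back to |15000 − 16000| Hz. The damping term Re(λ) only scales |λ̄| = exp(Re(λ) / f_s) and does not affect the wrap-around.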