Beyond models

Félix Iglesias Vázquez on open code practices and the development of versatile algorithms for real-world data challenges.

Félix Iglesias Vázquez with network traffic captures

We meet Dr Félix Iglesias Vázquez at the Institute of Telecommunications on TU Wien’s Gußhaus campus, where he has been working for over 12 years. Surrounded by screens and servers, we dive into a conversation about data-centric research, open code practices for truly reproducible science, and the responsible use of AI and large language models.

Leading a network security working group and drawing on a background in electrical engineering, data analysis, and machine learning, Félix focuses on developing innovative methodologies and algorithms that detect anomalies in complex datasets from a variety of research fields.

Theory meets application

“From the theoretical part to the application – maybe it's too much to be everywhere. But if you are working on theory or methodology, it's very important to always have the application in mind – trying to solve a real problem. Not just play with mathematics.”

We discuss the intricacies of developing new theoretical approaches and designing models for data analysis while staying grounded in real-world applications. This balance is especially important in domains dealing with highly personalised data, such as medical research, where measurement instruments, laboratory artefacts, and patient records provide critical context for analysis. Similar issues arise in network security, where researchers face the persistent challenge of obtaining high-quality, well-documented datasets while respecting privacy, security, and data anonymisation requirements.

Mapping anomalies

“I'm going to use the tools and models that have been published for anomaly detection, but then you realise that they don't work in your domain. And you start asking yourself, why? Why aren't these tools working?”

One key insight is that anomaly detection must align with the nature of the data and the research context. In some domains, anomalies manifest as compact, dense clusters instead of isolated outliers, which requires tailored detection strategies. This calls for a data-centric approach and a broader perspective on anomalies – not just isolated points, but also novelties or patterns that don’t fit predefined norms.
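
A loose illustration of this point, using synthetic data and scikit-learn's LocalOutlierFactor purely as a stand-in detector (not one of the tools from Dr Iglesias's own research): a neighbourhood-based method tuned for isolated outliers can rate a compact anomalous cluster as perfectly normal.

    # Sketch: a compact cluster of anomalies can fool a detector
    # whose neighbourhood size is tuned for isolated outliers.
    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor

    rng = np.random.default_rng(0)
    normal = rng.normal(0.0, 1.0, size=(500, 2))          # bulk of ordinary observations
    lone_outlier = np.array([[8.0, 8.0]])                 # classic isolated anomaly
    dense_anomaly = rng.normal(6.0, 0.05, size=(30, 2))   # compact anomalous cluster

    X = np.vstack([normal, lone_outlier, dense_anomaly])
    scores = -LocalOutlierFactor(n_neighbors=20).fit(X).negative_outlier_factor_

    print("isolated point:", scores[500])           # far above 1: flagged
    print("cluster members:", scores[501:].mean())  # around 1: looks "locally normal"
    # Raising n_neighbors above the cluster size (e.g. to 50) is one simple
    # adaptation that exposes the cluster again.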

Data-centric methodologies 

“When I use a dataset, I want to know exactly which application it serves, which problem it addresses, how it is used, and what the metadata tells me – especially about the labelling. […] Then you need to rethink the concept of anomaly. Because an anomaly is not only an isolated point in the space, it can also be a novelty. Right? Something that doesn't fit a predefined normality – you have to change your perspective.”

Dr Iglesias emphasises the importance of findable metadata and thorough documentation, explaining that poor data quality and the reliance on synthetic benchmark datasets – unrepresentative of real communications – have slowed progress in the field. He has shifted to data-centric approaches, which prioritise the quality, relevance, and context of datasets over merely refining algorithms. Because model-centric tools often fail in new domains when data and intended application are mismatched, his group focuses on robust, adaptable methods and well-matched datasets, strategically developing versatile algorithms and publishing code – such as SDO (Sparse Data Observers) for anomaly detection and the Go-flows tool – that work effectively across various domains.
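
The article names SDO without describing how it works, so the sketch below is only a loose reconstruction of the general idea behind observer-based scoring – sample a handful of "observers" from the data, drop those that are rarely consulted, and score each point by its distance to the nearest remaining observers. The parameter names are invented; this is not the published algorithm or its reference implementation.

    # Simplified, illustrative observer-based anomaly scorer (not the published SDO code).
    import numpy as np

    def observer_scores(X, n_observers=100, x=5, idle_quantile=0.1, seed=0):
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(X), size=min(n_observers, len(X)), replace=False)
        obs = X[idx]                                                  # randomly sampled "observers"
        d = np.linalg.norm(X[:, None, :] - obs[None, :, :], axis=2)   # point-to-observer distances
        nearest = np.argsort(d, axis=1)[:, :x]                        # x closest observers per point
        counts = np.bincount(nearest.ravel(), minlength=len(obs))     # how often each observer is used
        active = counts >= np.quantile(counts, idle_quantile)         # drop rarely used ("idle") observers
        d_active = np.sort(d[:, active], axis=1)[:, :x]
        return np.median(d_active, axis=1)                            # higher score = more anomalous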

Reproducible code publishing

“The ideal for code sharing is: the fewer libraries you need, the better. The best solution that I found is just to dockerize everything – to publish the code and also a Docker version for reproducibility. In this Docker container, you can put the libraries, pin dependencies, and even define the operating system environment to create infrastructure independence. It’s like its own ecosystem to run and reproduce the experiments.”

He stresses the importance of adhering not only to the FAIR principles but also to field-specific standards and smart data models that make datasets more interpretable and useful for the community. To address reproducibility problems caused by software dependencies and library versioning, he advocates Docker containerisation, ensuring that experiments can be reliably replicated over time.
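
A minimal sketch of what such a container recipe might look like – the base image, file names, and script name here are placeholders, not taken from his published artefacts:

    # Illustrative Dockerfile: pins the OS layer, interpreter, and libraries.
    FROM python:3.11-slim

    WORKDIR /experiment

    # Pinned dependencies (e.g. numpy==1.26.4) so library versions cannot drift.
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt

    # Copy the experiment code; a single command reproduces the results.
    COPY . .
    CMD ["python", "run_experiments.py"]

Building and running the image (docker build -t experiment . followed by docker run experiment) then recreates the same environment on any machine, even years later.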

Turning to recent advances in AI, Dr Iglesias acknowledges that large language models (LLMs) are powerful tools, especially when acting as agents that can test and interpret results in complex environments. On the broader evolution of data analysis, he voices cautious optimism about the potential and the risks of AI agents, particularly concerning error propagation and model degradation.

Looking into the future

“The problem with machine learning and artificial intelligence is not that they fail – because they can fail. The problem is if they fail and we don’t notice – this is a catastrophe, because in such architectures, it is quite likely that all systems fail in a similar way.”

According to Dr Iglesias, the real danger is not just random mistakes, but systematic, invisible failures that can propagate on a large scale. That’s why it’s especially important to have strong monitoring and careful checks by humans whose different backgrounds and experiences can offer context-aware evaluations of these technologies. 

Félix further underscores the need for critical thinking, transparency, and adaptability in scientific practice, expressing the hope that his algorithms and educational influence will inspire his students to form their own opinions.

Contact

Dr Félix Iglesias Vázquez
Institute of Telecommunications
TU Wien
felix.iglesias@tuwien.ac.at

Center for Research Data Management
TU Wien
research.data@tuwien.ac.at