
Dissertation of Roman Parzer

Random projections and dimensionality reduction in high-dimensional statistical learning

CSTAT team celebrating Roman's defense

Abstract

This thesis is concerned with the development of tools for the analysis of high-dimensional data, with a focus on different robustness aspects in high-dimensional statistical learning. Recent advances in technology have allowed more and more quantities to be tracked and stored, which has led to a huge increase in the amount of data, making available datasets larger and more complex than ever, both in dimension (number of variables p) and size (number of observations n). Since many classical statistical methods do not account for the resulting practical and computational limitations, there is a need for new methods specifically designed with these problems in mind. One specific problem setting is to predict a response variable of interest Y from a large set of p available predictor variables X via some regression function g with Y = g(X, u) and an unknown error term u. Furthermore, we consider the case where the number of variables is much larger than the number of observations (p ≫ n). Such a setting often requires dimensionality reduction techniques to make computation feasible. Techniques considered in this work include pre-screening of variables before applying the regression methods, and so-called compressed regression, which uses projection matrices. In this high-dimensional setting, we also focus on building statistical models that are robust against anomalies in the data (e.g. outliers). This can be achieved by allowing adapted likelihoods in the Bayesian setting, or by using different robust estimators in the frequentist setting. Moreover, decisions based on predictions from probabilistic models may be sensitive to model misspecification. In this work, we aim to investigate frameworks that robustify the proposed models with regard to misspecification. One such framework is the specification of scoring rules tailored to the decision problem at hand.
The developed methodology will be illustrated on applications relevant to advancing sustainability efforts, including the analysis of environmental, medical or biological data.
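
To make the idea of compressed regression mentioned in the abstract more concrete, the following is a minimal illustrative sketch in Python with NumPy. It is not the method developed in the thesis: the Gaussian projection matrix, the reduced dimension m, the simulated data and the function name compressed_least_squares are all assumptions chosen for illustration. The predictors are projected from p to m < n dimensions, ordinary least squares is fitted in the reduced space, and the coefficients are mapped back to the original variables.

```python
import numpy as np


def compressed_least_squares(X, y, m, rng):
    """Least squares on randomly projected predictors (illustrative sketch only)."""
    n, p = X.shape
    # Assumption: dense Gaussian projection matrix Phi (m x p), scaled by 1/sqrt(m).
    Phi = rng.standard_normal((m, p)) / np.sqrt(m)
    Z = X @ Phi.T  # reduced predictors, shape (n, m)
    # Ordinary least squares in the reduced space (m < n keeps it well-posed).
    gamma, *_ = np.linalg.lstsq(Z, y, rcond=None)
    # Implied coefficients for the original p variables.
    beta = Phi.T @ gamma
    return beta, Phi


# Simulated high-dimensional data with p much larger than n (hypothetical values).
rng = np.random.default_rng(0)
n, p, m = 50, 500, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = 2.0
y = X @ beta_true + 0.1 * rng.standard_normal(n)

beta_hat, Phi = compressed_least_squares(X, y, m, rng)
y_pred = X @ beta_hat  # in-sample predictions
```

A single random projection gives a noisy fit. Averaging predictions over several independent draws of the projection matrix is one common way to reduce that variability.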