Bayesian inference techniques for Deep Learning
Date Issued
January 2018
Author(s)
Advisor
Abstract
Deep learning has achieved state-of-the-art performance in various challenging machine
learning tasks, pushing the Artificial Intelligence frontier to new heights. Tasks like object
recognition, speech perception, language understanding and robotics are improving year by
year. This is mainly due to the recent breakthroughs in Bayesian inference, the increased
volume of datasets and the increased computational power. Together, these make it feasible to
train these challenging, hierarchically structured models that contain millions of parameters.
Deep Learning is an umbrella term that encompasses numerous deep architecture models that
are able to capture even the most complex dynamics of the environment. Typically, they are
trained under the maximum likelihood estimation paradigm. Unfortunately, in many real-world
tasks the high dimensionality of the observations results in even the largest datasets
being sparse. As such, there is an immense need for the training algorithm to compensate for the
uncertainty introduced by the data sparsity, overcome the model’s overfitting tendencies and,
as a result, generalize well.
The statistical method of Bayesian inference provides a mathematically coherent way of
dealing with data sparsity and overfitting. It essentially uses Bayes' theorem to accumulate
evidence-based knowledge. This is achieved by postulating probability distributions over the
parameters instead of trying to derive point estimates of them. Under the Bayesian view, we
impose a prior distribution that encapsulates our initial belief about the model’s dynamics
and we correct that belief as we are presented with more data; this amounts to inferring the
posterior distribution. Clearly, the choice of the distribution heavily controls the
expressiveness of the model.
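The prior-to-posterior correction described above can be illustrated with a toy example. The sketch below is not taken from the thesis: it assumes a Beta prior over the success probability of a Bernoulli variable, a conjugate pair for which the posterior is available in closed form, and the function names `beta_bernoulli_update` and `beta_mean` are hypothetical. Deep networks do not admit such closed-form posteriors, which is why approximate inference is needed in practice.

```python
def beta_bernoulli_update(alpha, beta, observations):
    """Return the Beta posterior parameters after observing 0/1 outcomes.

    Assumes a Beta(alpha, beta) prior over the Bernoulli success probability.
    """
    successes = sum(observations)
    failures = len(observations) - successes
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Uniform prior belief: Beta(1, 1), i.e. no preference for any probability.
prior = (1.0, 1.0)
# Hypothetical observations (1 = success, 0 = failure).
data = [1, 1, 0, 1]

posterior = beta_bernoulli_update(*prior, data)
print("prior mean:", beta_mean(*prior))
print("posterior parameters:", posterior)
print("posterior mean:", beta_mean(*posterior))
```

With more observations the posterior concentrates around the empirical success rate, while with few observations it stays close to the prior; this is the sense in which the prior compensates for sparse data.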
In this thesis, we present innovative approaches to train deep networks by considering sparsity,
skewness and heavy tails in the form of the parameter distributions. Specifically, among our
contributions, we impose a sparsity-inducing distribution over the network synaptic weights
to improve generalization. In a different vein, we consider the imposition of a skew normal
distribution over the latent variables to increase the deep network's capacity. In parallel, we
examine the efficacy of inferring the feature functions by devising a novel random sampling
rationale combined with an optimizable sample weighting scheme. The models derived by the
aforementioned approaches are trained by means of an approximate Bayesian inference scheme
to allow for scalability to large datasets. We exhibit the advantages of these methods over
existing approaches by conducting an extensive experimental evaluation using benchmark
File(s)
Name
Παρταουρίδης Χαράλαμπος.pdf
Size
1.68 MB
Format
Adobe PDF
Checksum (MD5)
cc034fa506d071524c878002be64bc0b
Name
Abstract.pdf
Size
116.1 KB
Format
Adobe PDF
Checksum (MD5)
acb902cfb25330c59def76083a5ea55a

