Add A3I GMM and autoencoder

2024-09-27 14:07:23 +02:00
parent 4c887f7bc2
commit 1cbdee98dd


@ -64,7 +64,151 @@ The KDE model, bandwidth, and threshold are fitted as in \Cref{ch:ad_low}.
KDE on this dataset has the following problems:
\begin{itemize}
\item It is highly subject to the curse of dimensionality and requires more data to be reliable.
\begin{remark}
KDE does not compress the input features and grows with the training set. In other words, it has low bias and high variance.
\end{remark}
\item As the dataset is used during inference, a large dataset is computationally expensive (with time complexity $O(mn)$, where $m$ is the number of dimensions and $n$ the number of samples).
\item It only provides an alarm signal and not an explanation of the anomaly. In low dimensions this is still acceptable, but with high-dimensional data it is harder to find an explanation.
\end{itemize}
\end{description}
\end{description}
\subsection{Gaussian mixture model}
\begin{description}
\item[Gaussian mixture model (GMM)] \marginnote{Gaussian mixture model (GMM)}
Selection-based ensemble that estimates a distribution as a weighted sum of Gaussians. It is assumed that data can be generated by the following probabilistic model:
\[ X_Z \]
where:
\begin{itemize}
\item $X_k$ are random variables following a multivariate Gaussian distribution. The number of components is a hyperparameter that can be tuned to balance bias and variance.
\item $Z$ is a random variable representing the index $k$ of the component $X_k$ used to generate the data.
\end{itemize}
More specifically, the PDF of a GMM $g$ is defined as:
\[ g(x, \mu, \Sigma, \tau) = \sum_{k=1}^{n} \tau_k f(x, \mu_k, \Sigma_k) \]
where:
\begin{itemize}
\item $f$ is the PDF of a multivariate normal distribution.
\item $\mu_k$ and $\Sigma_k$ are the mean and covariance matrix for the $k$-th component, respectively.
\item $\tau_k$ is the weight for the $k$-th component and corresponds to $\prob{Z=k}$.
\end{itemize}
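As a minimal numerical sketch of this definition (assuming NumPy and SciPy are available; all parameter values are hypothetical), the density is just the weighted sum of the component densities:
\begin{verbatim}
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-component GMM in 2 dimensions.
tau = np.array([0.7, 0.3])                         # weights, sum to 1
mu = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]  # component means
Sigma = [np.eye(2), 2.0 * np.eye(2)]               # component covariances

def gmm_pdf(x):
    # g(x) = sum_k tau_k * f(x, mu_k, Sigma_k)
    return sum(t * multivariate_normal.pdf(x, mean=m, cov=S)
               for t, m, S in zip(tau, mu, Sigma))

print(gmm_pdf(np.array([1.0, 1.0])))
\end{verbatim}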
\begin{description}
\item[Training]
Likelihood maximization can be used to train a GMM to approximate another distribution:
\[ \arg\max_{\mu, \Sigma, \tau} \mathbb{E}_{x \sim X} \left[ L(x, \mu, \Sigma, \tau) \right] \qquad \text{subject to } \sum_{k=1}^{n} \tau_k = 1 \]
where the expectation can be approximated using the training set (i.e., empirical risk minimization):
\[ \mathbb{E}_{x \sim X} \left[ L(x, \mu, \Sigma, \tau) \right] \approx \prod_{i=1}^{m} g(x_i, \mu, \Sigma, \tau) \]
\begin{remark}
The empirical risk minimization problem can be approached in two ways:
\begin{itemize}
\item Use a single large sample (traditional approach).
\item Use many smaller samples (as in cross-validation).
\end{itemize}
\end{remark}
By putting the definitions together, we obtain the following problem:
\[ \arg\max_{\mu, \Sigma, \tau} \prod_{i=1}^{m} \sum_{k=1}^{n} \tau_k f(x_i, \mu_k, \Sigma_k) \qquad \text{subject to } \sum_{k=1}^{n} \tau_k = 1 \]
which:
\begin{itemize}
\item Cannot be solved using gradient descent, as gradient descent only handles unconstrained problems.
\item Cannot be solved using mixed-integer linear programming as the problem is non-linear.
\item Cannot be decomposed as the variables $\mu$, $\Sigma$, and $\tau$ appear in every term.
\end{itemize}
It is possible to simplify the formulation of the problem by introducing new variables:
\begin{itemize}
\item A new latent random variable $Z_i$ is added for each example. $Z_i = k$ iff the $i$-th example is drawn from the $k$-th component. The PDF of the GMM can be reformulated in terms of $z_i$, without the summation, as:
\[ \tilde{g}_i(x_i, z_i, \mu, \Sigma, \tau) = \tau_{z_i} f(x_i, \mu_{z_i}, \Sigma_{z_i}) \]
The expectation becomes:
\[ \mathbb{E}_{x \sim X, \{z_i\} \sim \{Z_i\}} \left[ L(x, z, \mu, \Sigma, \tau) \right] \approx \mathbb{E}_{\{z_i\} \sim \{Z_i\}} \left[ \prod_{i=1}^{m} \tilde{g}_i(x_i, z_i, \mu, \Sigma, \tau) \right] \]
Note that, as the distributions of the $Z_i$ are unknown, the expectation over the $Z_i$ cannot be approximated by sampling in the same way as for $X$.
\item New variables $\tilde{\tau}_{i, k}$ are introduced to represent the distribution of the $Z_i$ variables. In other words, $\tilde{\tau}_{i, k}$ corresponds to $\prob{Z_i = k}$. This makes it possible to approximate the expectation as:
\[ \mathbb{E}_{x \sim X, \{z_i\} \sim \{Z_i\}} \left[ L(x, z, \mu, \Sigma, \tau) \right] \approx \prod_{i=1}^{m} \prod_{k=1}^{n} \tilde{g}_i(x_i, k, \mu, \Sigma, \tau)^{\tilde{\tau}_{i, k}} \]
The intuitive idea is that, if $Z_i$ were sampled, a fraction $\tilde{\tau}_{i, k}$ of the samples would come from the $k$-th component. Therefore, the corresponding density appears in the product with exponent $\tilde{\tau}_{i, k}$.
\end{itemize}
Finally, the GMM training can be formulated as follows:
\[ \arg\max_{\mu, \Sigma, \tau, \tilde{\tau}} \prod_{i=1}^{m} \prod_{k=1}^{n} \tilde{g}_i(x_i, k, \mu, \Sigma, \tau)^{\tilde{\tau}_{i, k}} \,\,\text{ s.t. } \sum_{k=1}^{n} \tau_k = 1, \quad \forall i = 1 \dots m: \sum_{k=1}^{n} \tilde{\tau}_{i, k} = 1 \]
which can be simplified by applying a logarithm and solved using expectation-maximization.
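\begin{remark}
As a sketch in the notation above (the standard expectation-maximization updates for a GMM), each iteration alternates two closed-form steps. The E-step recomputes the responsibilities:
\[ \tilde{\tau}_{i, k} = \frac{\tau_k f(x_i, \mu_k, \Sigma_k)}{\sum_{j=1}^{n} \tau_j f(x_i, \mu_j, \Sigma_j)} \]
The M-step then re-estimates the parameters of each component:
\[ \tau_k = \frac{1}{m} \sum_{i=1}^{m} \tilde{\tau}_{i, k} \qquad \mu_k = \frac{\sum_{i=1}^{m} \tilde{\tau}_{i, k}\, x_i}{\sum_{i=1}^{m} \tilde{\tau}_{i, k}} \qquad \Sigma_k = \frac{\sum_{i=1}^{m} \tilde{\tau}_{i, k}\, (x_i - \mu_k)(x_i - \mu_k)^T}{\sum_{i=1}^{m} \tilde{\tau}_{i, k}} \]
\end{remark}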
\end{description}
\end{description}
\begin{remark}
Unlike KDE, a GMM supports several types of prediction (a code sketch follows this remark):
\begin{itemize}
\item Evaluate the (log) density of a sample.
\item Generate a sample.
\item Estimate the probability that a sample belongs to a component.
\item Assign samples to a component (i.e., clustering).
\begin{remark}
GMM can be seen as a generalization of $k$-means.
\end{remark}
\end{itemize}
\end{remark}
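As a minimal sketch of these prediction types (assuming scikit-learn and a training matrix \texttt{X\_train}; the number of components here is a hypothetical value):
\begin{verbatim}
from sklearn.mixture import GaussianMixture

# Fit a GMM with a hypothetical number of components.
gmm = GaussianMixture(n_components=4).fit(X_train)

log_density = gmm.score_samples(X_test)  # (log) density of each sample
samples, _ = gmm.sample(10)              # generate 10 new samples
memberships = gmm.predict_proba(X_test)  # P(Z = k | x) per component
clusters = gmm.predict(X_test)           # hard assignment (clustering)
\end{verbatim}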
\begin{remark}
Unlike KDE, most of the computation is done at training time.
\end{remark}
\begin{description}
\item[Number of components estimation]
The number of Gaussians to use in GMM can be determined through grid search and cross-validation.
\begin{remark}
Other methods, such as the elbow method, can also be applied. Some variants of GMM are able to infer the number of components automatically.
\end{remark}
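A possible sketch of the grid search (assuming scikit-learn; \texttt{GridSearchCV} relies on the estimator's default score, the average log-likelihood, so higher is better):
\begin{verbatim}
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV

# Cross-validated search over the number of components.
search = GridSearchCV(
    GaussianMixture(),
    param_grid={"n_components": list(range(1, 11))},
    cv=5,
)
search.fit(X_train)
best_gmm = search.best_estimator_
\end{verbatim}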
\item[Threshold optimization]
The threshold can be determined in the same way as in \Cref{sec:ad_taxi_kde_uni}.
\end{description}
\subsection{Autoencoder}
\begin{description}
\item[Autoencoder] \marginnote{Autoencoder}
Neural network trained to reconstruct its input. It is composed of two components:
\begin{descriptionlist}
\item[Encoder] $e(x, \theta_e)$ that maps an input $x$ into a latent vector $z$.
\item[Decoder] $d(z, \theta_d)$ that maps $z$ into a reconstruction of $x$.
\end{descriptionlist}
\begin{description}
\item[Training]
Training aims to minimize the reconstruction MSE:
\[ \arg\min_{\theta_e, \theta_d} \frac{1}{m} \sum_{i=1}^{m} \left\Vert d\left( e(x_i, \theta_e), \theta_d \right) - x_i \right\Vert_2^2 \]
To avoid trivial embeddings, the following can be done:
\begin{itemize}
\item Use a low-dimensional latent space.
\item Use L1 regularization to prefer sparse encodings.
\end{itemize}
\end{description}
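A minimal sketch of such an autoencoder (assuming PyTorch; the input dimension, latent size, and data loader are hypothetical):
\begin{verbatim}
import torch
from torch import nn

d, latent_dim = 64, 8  # hypothetical input and latent dimensions

encoder = nn.Sequential(nn.Linear(d, 32), nn.ReLU(),
                        nn.Linear(32, latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(),
                        nn.Linear(32, d))
autoencoder = nn.Sequential(encoder, decoder)

optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # reconstruction MSE

for epoch in range(100):
    for x in loader:  # batches of (normal) training data, assumed given
        optimizer.zero_grad()
        loss = loss_fn(autoencoder(x), x)
        loss.backward()
        optimizer.step()
\end{verbatim}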
\item[Autoencoder for anomaly detection]
By evaluating the quality of the reconstruction, an autoencoder can be used for anomaly detection:
\[ \Vert x - d(e(x, \theta_e), \theta_d) \Vert_2^2 \geq \varepsilon \]
The advantages of this approach are the following:
\begin{itemize}
\item The size of the neural network does not scale with the training data.
\item Neural networks perform well in high-dimensional spaces.
\item Inference is fast.
\end{itemize}
However, the task of reconstruction can be harder than density estimation.
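Continuing the sketch above, the reconstruction error of new samples can be compared against a threshold (tuned as for the other detectors; \texttt{x\_new} and \texttt{epsilon} are assumed given):
\begin{verbatim}
with torch.no_grad():
    reconstruction = autoencoder(x_new)
    # Per-sample squared reconstruction error.
    error = ((x_new - reconstruction) ** 2).sum(dim=-1)
    is_anomaly = error >= epsilon  # flag samples above the threshold
\end{verbatim}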
\end{description}
\begin{remark}
It is always a good idea to normalize the input of a neural network to have a more stable gradient descent. Moreover, with normalized data, common weight initialization techniques make the output approximately normalized too.
\end{remark}
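For instance (assuming scikit-learn), a scaler fitted on the training data only can be applied before the autoencoder:
\begin{verbatim}
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)    # fit on training data only
X_train_norm = scaler.transform(X_train)  # zero mean, unit variance
X_test_norm = scaler.transform(X_test)
\end{verbatim}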