Add A3I GMM and autoencoder
@@ -64,7 +64,151 @@ The KDE model, bandwidth, and threshold are fitted as in \Cref{ch:ad_low}.
KDE on this dataset has the following problems:
\begin{itemize}
\item It is highly subject to the curse of dimensionality and requires more data to be reliable.
\item As the dataset is used during inference, a large dataset is computationally expensive, with time complexity $O(mn)$, where $m$ is the number of dimensions and $n$ the number of samples (see the timing sketch after this list).
\begin{remark}
KDE does not compress the input features and grows with the training set. In other words, it has low bias and high variance.
\end{remark}

\item It only provides an alarm signal and not an explanation of the anomaly. With low-dimensional data, this is still acceptable, but with high-dimensional data it is harder to find an explanation.
\end{itemize}
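To make the inference cost concrete, the following is a minimal timing sketch (assuming scikit-learn and synthetic data): the cost of scoring grows with the number of stored training samples, since every stored point contributes to the density of each query.
\begin{verbatim}
# Sketch (assumes scikit-learn): KDE inference cost grows with the number of
# stored training samples.
import time
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X_query = rng.normal(size=(1_000, 10))      # 10-dimensional queries

for n_train in (1_000, 10_000, 50_000):
    X_train = rng.normal(size=(n_train, 10))
    kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X_train)
    start = time.perf_counter()
    kde.score_samples(X_query)              # log-density of each query
    print(f"n_train={n_train}: {time.perf_counter() - start:.2f}s")
\end{verbatim}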
\end{description}
\end{description}


\subsection{Gaussian mixture model}

\begin{description}
\item[Gaussian mixture model (GMM)] \marginnote{Gaussian mixture model (GMM)}
Selection-based ensemble that estimates a distribution as a weighted sum of Gaussians. It is assumed that the data can be generated by the following probabilistic model:
\[ X_Z \]
where:
\begin{itemize}
\item $X_k$ are random variables following a multivariate Gaussian distribution. The number of components is a hyperparameter that can be tuned to balance the bias-variance trade-off.

\item $Z$ is a random variable representing the index $k$ of the component $X_k$ used to generate the data.
\end{itemize}

More specifically, the PDF of a GMM $g$ is defined as:
\[ g(x, \mu, \Sigma, \tau) = \sum_{k=1}^{n} \tau_k f(x, \mu_k, \Sigma_k) \]
where:
\begin{itemize}
\item $f$ is the PDF of a multivariate normal distribution.
\item $\mu_k$ and $\Sigma_k$ are the mean and covariance matrix for the $k$-th component, respectively.
\item $\tau_k$ is the weight for the $k$-th component and corresponds to $\prob{Z=k}$.
\end{itemize}
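As a small numerical illustration of the definitions above, the following sketch (assuming NumPy and SciPy; the two components are made-up examples) samples from the generative model $X_Z$ and evaluates the mixture density $g$:
\begin{verbatim}
# Sketch (assumes NumPy/SciPy; the two components are made-up examples).
import numpy as np
from scipy.stats import multivariate_normal

tau = np.array([0.3, 0.7])                    # component weights, sum to 1
mus = [np.zeros(2), np.array([3.0, 3.0])]     # component means
Sigmas = [np.eye(2), 0.5 * np.eye(2)]         # component covariances

# Generative model: first sample Z (the component index), then draw from X_Z.
rng = np.random.default_rng(0)
z = rng.choice(len(tau), p=tau)
x = rng.multivariate_normal(mus[z], Sigmas[z])

# Mixture density: g(x) = sum_k tau_k * f(x, mu_k, Sigma_k).
g_x = sum(t * multivariate_normal(m, S).pdf(x)
          for t, m, S in zip(tau, mus, Sigmas))
\end{verbatim}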

\begin{description}
\item[Training]
Likelihood maximization can be used to train a GMM to approximate another distribution:
\[ \arg\max_{\mu, \Sigma, \tau} \mathbb{E}_{x \sim X} \left[ L(x, \mu, \Sigma, \tau) \right] \qquad \text{subject to } \sum_{k=1}^{n} \tau_k = 1 \]
where the expectation can be approximated using the training set (i.e., empirical risk minimization):
\[ \mathbb{E}_{x \sim X} \left[ L(x, \mu, \Sigma, \tau) \right] \approx \prod_{i=1}^{m} g(x_i, \mu, \Sigma, \tau) \]

\begin{remark}
Empirical risk minimization can be solved in two ways:
\begin{itemize}
\item Use a single large sample (traditional approach).
\item Use many smaller samples (as in cross-validation).
\end{itemize}
\end{remark}

By putting the definitions together, we obtain the following problem:
\[ \arg\max_{\mu, \Sigma, \tau} \prod_{i=1}^{m} \sum_{k=1}^{n} \tau_k f(x_i, \mu_k, \Sigma_k) \qquad \text{subject to } \sum_{k=1}^{n} \tau_k = 1 \]
which:
\begin{itemize}
\item Cannot be solved using gradient descent, as it is an unconstrained method.
\item Cannot be solved using mixed-integer linear programming, as the problem is non-linear.
\item Cannot be decomposed, as the variables $\mu$, $\Sigma$, and $\tau$ appear in every term.
\end{itemize}

It is possible to simplify the formulation of the problem by introducing new variables:
\begin{itemize}
\item A new latent random variable $Z_i$ is added for each example. $Z_i = k$ iff the $i$-th example is drawn from the $k$-th component. The PDF of the GMM can be reformulated to use $Z_i$ and without the summation as:
\[ \tilde{g}_i(x_i, z_i, \mu, \Sigma, \tau) = \tau_{z_i} f(x_i, \mu_{z_i}, \Sigma_{z_i}) \]
The expectation becomes:
\[ \mathbb{E}_{x \sim X, \{z_i\} \sim \{Z_i\}} \left[ L(x, z, \mu, \Sigma, \tau) \right] \approx \mathbb{E}_{\{z_i\} \sim \{Z_i\}} \left[ \prod_{i=1}^{m} \tilde{g}_i(x_i, z_i, \mu, \Sigma, \tau) \right] \]
Note that, as the distributions of the $Z_i$ are unknown, they cannot be approximated by sampling in the same way as $X$.

\item New variables $\tilde{\tau}_{i, k}$ are introduced to represent the distribution of the $Z_i$ variables. In other words, $\tilde{\tau}_{i, k}$ corresponds to $\prob{Z_i = k}$. This allows approximating the expectation as:
\[ \mathbb{E}_{\hat{x} \sim X, \hat{z} \sim Z} \left[ L(\hat{x}, \hat{z}, \mu, \Sigma, \tau) \right] \approx \prod_{i=1}^{m} \prod_{k=1}^{n} \tilde{g}_i(x_i, k, \mu, \Sigma, \tau)^{\tilde{\tau}_{i, k}} \]
The intuitive idea is that, if $Z_i$ were sampled many times, a fraction $\tilde{\tau}_{i, k}$ of the samples would come from the $k$-th component. Therefore, the corresponding density should be multiplied by itself $\tilde{\tau}_{i, k}$ times (i.e., raised to the power $\tilde{\tau}_{i, k}$).
\end{itemize}

Finally, the GMM training can be formulated as follows:
\[ \arg\max_{\mu, \Sigma, \tau, \tilde{\tau}} \prod_{i=1}^{m} \prod_{k=1}^{n} \tilde{g}_i(x_i, k, \mu, \Sigma, \tau)^{\tilde{\tau}_{i, k}} \,\,\text{ s.t. } \sum_{k=1}^{n} \tau_k = 1, \forall i = 1 \dots m: \sum_{k=1}^{n} \tilde{\tau}_{i, k} = 1 \]
which can be simplified by applying a logarithm and solved using expectation-maximization.
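In fact, the logarithm turns the product into a sum:
\[ \log \prod_{i=1}^{m} \prod_{k=1}^{n} \tilde{g}_i(x_i, k, \mu, \Sigma, \tau)^{\tilde{\tau}_{i, k}} = \sum_{i=1}^{m} \sum_{k=1}^{n} \tilde{\tau}_{i, k} \left( \log \tau_k + \log f(x_i, \mu_k, \Sigma_k) \right) \]
Expectation-maximization alternates between updating the $\tilde{\tau}_{i, k}$ with the Gaussian parameters fixed (E-step) and updating $\mu$, $\Sigma$, $\tau$ with the $\tilde{\tau}_{i, k}$ fixed (M-step). The following is a minimal NumPy sketch of this loop (an illustration only, with made-up initialization and no numerical safeguards or convergence check):
\begin{verbatim}
# Minimal EM sketch for a GMM (assumes NumPy/SciPy; illustration only).
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, n_components, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    m, d = X.shape
    # Initialization: random data points as means, identity covariances,
    # uniform weights.
    mu = X[rng.choice(m, n_components, replace=False)].copy()
    Sigma = np.stack([np.eye(d) for _ in range(n_components)])
    tau = np.full(n_components, 1.0 / n_components)

    for _ in range(n_iters):
        # E-step: tilde_tau[i, k] proportional to tau_k * f(x_i, mu_k, Sigma_k).
        dens = np.stack([tau[k] * multivariate_normal(mu[k], Sigma[k]).pdf(X)
                         for k in range(n_components)], axis=1)
        tilde_tau = dens / dens.sum(axis=1, keepdims=True)

        # M-step: weighted maximum-likelihood updates of tau, mu and Sigma.
        Nk = tilde_tau.sum(axis=0)
        tau = Nk / m
        mu = (tilde_tau.T @ X) / Nk[:, None]
        for k in range(n_components):
            diff = X - mu[k]
            Sigma[k] = (tilde_tau[:, k, None] * diff).T @ diff / Nk[k]

    return mu, Sigma, tau
\end{verbatim}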
\end{description}
\end{description}

\begin{remark}
Differently from KDE, a GMM allows making various types of predictions (illustrated in the sketch after this remark):
\begin{itemize}
\item Evaluate the (log) density of a sample.
\item Generate a sample.
\item Estimate the probability that a sample belongs to a component.
\item Assign samples to a component (i.e., clustering).
\begin{remark}
GMM can be seen as a generalization of $k$-means.
\end{remark}
\end{itemize}
\end{remark}
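A sketch of these prediction types using scikit-learn (the data matrices are made-up stand-ins):
\begin{verbatim}
# Sketch (assumes scikit-learn; X_train and X_new are stand-in data).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 4))
X_new = rng.normal(size=(10, 4))

gmm = GaussianMixture(n_components=3, random_state=0).fit(X_train)

log_density = gmm.score_samples(X_new)      # (log) density of each sample
samples, components = gmm.sample(5)         # generate new samples
membership = gmm.predict_proba(X_new)       # P(component k | sample)
clusters = gmm.predict(X_new)               # hard assignment (clustering)
\end{verbatim}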


\begin{remark}
Differently from KDE, most of the computation is done at training time.
\end{remark}


\begin{description}
\item[Number of components estimation]
The number of Gaussians to use in a GMM can be determined through grid search and cross-validation, as in the sketch below.

\begin{remark}
Other methods, such as the elbow method, can also be applied. Some variants of GMM are able to infer the number of components.
\end{remark}

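A sketch of this selection procedure (assuming scikit-learn; the data matrix is a stand-in). Since the score of a fitted GaussianMixture is the average log-likelihood, cross-validation selects the number of components that generalizes best to held-out data:
\begin{verbatim}
# Sketch (assumes scikit-learn; X is a stand-in data matrix).
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV

X = np.random.default_rng(0).normal(size=(1000, 5))

# GaussianMixture.score is the average held-out log-likelihood.
search = GridSearchCV(
    GaussianMixture(random_state=0),
    param_grid={"n_components": range(1, 11)},
    cv=5,
)
search.fit(X)
best_n_components = search.best_params_["n_components"]
\end{verbatim}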
\item[Threshold optimization]
The threshold can be determined in the same way as in \Cref{sec:ad_taxi_kde_uni}.
\end{description}


\subsection{Autoencoder}

\begin{description}
\item[Autoencoder] \marginnote{Autoencoder}
Neural network trained to reconstruct its input. It is composed of two components:
\begin{descriptionlist}
\item[Encoder] $e(x, \theta_e)$ that maps an input $x$ into a latent vector $z$.
\item[Decoder] $d(z, \theta_d)$ that maps $z$ into a reconstruction of $x$.
\end{descriptionlist}

\begin{description}
\item[Training]
Training aims to minimize the reconstruction MSE over the training set:
\[ \arg\min_{\theta_e, \theta_d} \sum_{i=1}^{m} \left\Vert d\left( e(x_i, \theta_e), \theta_d \right) - x_i \right\Vert_2^2 \]

To avoid trivial embeddings, the following can be done:
\begin{itemize}
\item Use a low-dimensional latent space.
\item Use L1 regularization to prefer sparse encodings.
\end{itemize}
\end{description}
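A minimal training sketch in PyTorch (the layer sizes, data, and hyperparameters are made up for illustration):
\begin{verbatim}
# Minimal autoencoder sketch (assumes PyTorch; sizes and data are made up).
import torch
from torch import nn

encoder = nn.Sequential(nn.Linear(30, 16), nn.ReLU(), nn.Linear(16, 4))  # x -> z
decoder = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 30))  # z -> x
autoencoder = nn.Sequential(encoder, decoder)

X_train = torch.randn(1024, 30)           # stand-in (normalized) training data
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(autoencoder(X_train), X_train)   # reconstruction MSE
    loss.backward()
    optimizer.step()
\end{verbatim}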


\item[Autoencoder for anomaly detection]
By evaluating the quality of the reconstruction, an autoencoder can be used for anomaly detection:
\[ \Vert x - d(e(x, \theta_e), \theta_d) \Vert_2^2 \geq \varepsilon \]

The advantages of this approach are the following:
\begin{itemize}
\item The size of the neural network does not scale with the training data.
\item Neural networks perform well in high-dimensional spaces.
\item Inference is fast.
\end{itemize}

However, the task of reconstruction can be harder than density estimation.
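Continuing the sketch above (reusing the hypothetical autoencoder and made-up data), anomalies are the samples whose reconstruction error exceeds a threshold, here set for illustration to a high quantile of the errors on normal validation data:
\begin{verbatim}
# Continues the previous sketch (hypothetical autoencoder; stand-in data).
X_val = torch.randn(256, 30)              # normal data for threshold selection
X_test = torch.randn(64, 30)              # data to score

with torch.no_grad():
    val_errors = ((autoencoder(X_val) - X_val) ** 2).sum(dim=1)
    epsilon = torch.quantile(val_errors, 0.99)
    test_errors = ((autoencoder(X_test) - X_test) ** 2).sum(dim=1)
    is_anomaly = test_errors >= epsilon   # flag high reconstruction error
\end{verbatim}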
\end{description}

\begin{remark}
It is always a good idea to normalize the input of a neural network to make gradient descent more stable. Moreover, with normalized data, common weight initialization techniques make the output approximately normalized too.
\end{remark}
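A sketch of such a normalization step (assuming scikit-learn; the data is a stand-in), where the statistics are computed on the training data only and reused at inference time:
\begin{verbatim}
# Sketch (assumes scikit-learn): standardize features using training statistics.
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(100, 30))
scaler = StandardScaler().fit(X_train)    # mean/std from training data only
X_train_norm = scaler.transform(X_train)  # ~ zero mean, unit variance per feature
\end{verbatim}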