Add A3I GMM and autoencoder

2024-09-27 14:07:23 +02:00
parent 4c887f7bc2
commit 1cbdee98dd


@ -64,7 +64,151 @@ The KDE model, bandwidth, and threshold are fitted as in \Cref{ch:ad_low}.
KDE on this dataset has the following problems:
\begin{itemize}
\item It is highly subject to the curse of dimensionality and requires more data to be reliable.
\begin{remark}
KDE does not compress the input features and grows with the training set. In other words, it has low bias and high variance.
\end{remark}
\item As the dataset is used during inference, a large dataset is computationally expensive (with time complexity $O(mn)$, where $m$ is the number of dimensions and $n$ the number of samples).
\item It only provides an alarm signal and not an explanation of the anomaly. In low dimensions this is still acceptable, but with high-dimensional data it is harder to find an explanation.
\end{itemize}
\end{description}
\end{description}
\subsection{Gaussian mixture model}
\begin{description}
\item[Gaussian mixture model (GMM)] \marginnote{Gaussian mixture model (GMM)}
Selection-based ensemble that estimates a distribution as a weighted sum of Gaussians. It is assumed that data can be generated by the following probabilistic model:
\[ X_Z \]
where:
\begin{itemize}
\item $X_k$ are random variables following a multivariate Gaussian distribution. The number of components is a hyperparameter that can be tuned to balance bias and variance.
\item $Z$ is a random variable representing the index $k$ of the component $X_k$ used to generate the data.
\end{itemize}
More specifically, the PDF of a GMM $g$ is defined as:
\[ g(x, \mu, \Sigma, \tau) = \sum_{k=1}^{n} \tau_k f(x, \mu_k, \Sigma_k) \]
where:
\begin{itemize}
\item $f$ is the PDF of a multivariate normal distribution.
\item $\mu_k$ and $\Sigma_k$ are the mean and covariance matrix for the $k$-th component, respectively.
\item $\tau_k$ is the weight for the $k$-th component and corresponds to $\prob{Z=k}$.
\end{itemize}
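As a minimal numerical sketch of this definition (assuming NumPy and SciPy are available; all parameter values are hypothetical), the density is just the weighted sum of the component densities:
\begin{verbatim}
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-component GMM in 2 dimensions.
tau = np.array([0.7, 0.3])                         # weights, sum to 1
mu = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]  # component means
Sigma = [np.eye(2), 2.0 * np.eye(2)]               # component covariances

def gmm_pdf(x):
    # g(x) = sum_k tau_k * f(x, mu_k, Sigma_k)
    return sum(t * multivariate_normal.pdf(x, mean=m, cov=S)
               for t, m, S in zip(tau, mu, Sigma))

print(gmm_pdf(np.array([1.0, 1.0])))
\end{verbatim}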
\begin{description}
\item[Training]
Likelihood maximization can be used to train a GMM to approximate another distribution:
\[ \arg\max_{\mu, \Sigma, \tau} \mathbb{E}_{x \sim X} \left[ L(x, \mu, \Sigma, \tau) \right] \qquad \text{subject to } \sum_{k=1}^{n} \tau_k = 1 \]
where the expectation can be approximated using the training set (i.e., empirical risk minimization):
\[ \mathbb{E}_{x \sim X} \left[ L(x, \mu, \Sigma, \tau) \right] \approx \prod_{i=1}^{m} g(x_i, \mu, \Sigma, \tau) \]
\begin{remark}
The empirical risk minimization problem can be approached in two ways:
\begin{itemize}
\item Use a single large sample (traditional approach).
\item Use many smaller samples (as in cross-validation).
\end{itemize}
\end{remark}
By putting the definitions together, we obtain the following problem:
\[ \arg\max_{\mu, \Sigma, \tau} \prod_{i=1}^{m} \sum_{k=1}^{n} \tau_k f(x_i, \mu_k, \Sigma_k) \qquad \text{subject to } \sum_{k=1}^{n} \tau_k = 1 \]
which:
\begin{itemize}
\item Cannot be solved using gradient descent, as gradient descent only handles unconstrained problems.
\item Cannot be solved using mixed-integer linear programming as the problem is non-linear.
\item Cannot be decomposed as the variables $\mu$, $\Sigma$, and $\tau$ appear in every term.
\end{itemize}
It is possible to simplify the formulation of the problem by introducing new variables:
\begin{itemize}
\item A new latent random variable $Z_i$ is added for each example. $Z_i = k$ iff the $i$-th example is drawn from the $k$-th component. The PDF of the GMM can be reformulated in terms of $z_i$, without the summation, as:
\[ \tilde{g}_i(x_i, z_i, \mu, \Sigma, \tau) = \tau_{z_i} f(x_i, \mu_{z_i}, \Sigma_{z_i}) \]
The expectation becomes:
\[ \mathbb{E}_{x \sim X, \{z_i\} \sim \{Z_i\}} \left[ L(x, z, \mu, \Sigma, \tau) \right] \approx \mathbb{E}_{\{z_i\} \sim \{Z_i\}} \left[ \prod_{i=1}^{m} \tilde{g}_i(x_i, z_i, \mu, \Sigma, \tau) \right] \]
Note that, as the distributions of the $Z_i$ are unknown, the expectation over the $Z_i$ cannot be approximated by sampling in the same way as for $X$.
\item New variables $\tilde{\tau}_{i, k}$ are introduced to represent the distribution of the $Z_i$ variables. In other words, $\tilde{\tau}_{i, k}$ corresponds to $\prob{Z_i = k}$. This makes it possible to approximate the expectation as:
\[ \mathbb{E}_{x \sim X, \{z_i\} \sim \{Z_i\}} \left[ L(x, z, \mu, \Sigma, \tau) \right] \approx \prod_{i=1}^{m} \prod_{k=1}^{n} \tilde{g}_i(x_i, k, \mu, \Sigma, \tau)^{\tilde{\tau}_{i, k}} \]
The intuitive idea is that, if $Z_i$ were sampled, a fraction $\tilde{\tau}_{i, k}$ of the samples would come from the $k$-th component. Therefore, the corresponding density appears in the product with exponent $\tilde{\tau}_{i, k}$.
\end{itemize}
Finally, the GMM training can be formulated as follows:
\[ \arg\max_{\mu, \Sigma, \tau, \tilde{\tau}} \prod_{i=1}^{m} \prod_{k=1}^{n} \tilde{g}_i(x_i, k, \mu, \Sigma, \tau)^{\tilde{\tau}_{i, k}} \,\,\text{ s.t. } \sum_{k=1}^{n} \tau_k = 1, \quad \forall i = 1 \dots m: \sum_{k=1}^{n} \tilde{\tau}_{i, k} = 1 \]
which can be simplified by applying a logarithm and solved using expectation-maximization.
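\begin{remark}
As a sketch in the notation above (the standard expectation-maximization updates for a GMM), each iteration alternates two closed-form steps. The E-step recomputes the responsibilities:
\[ \tilde{\tau}_{i, k} = \frac{\tau_k f(x_i, \mu_k, \Sigma_k)}{\sum_{j=1}^{n} \tau_j f(x_i, \mu_j, \Sigma_j)} \]
The M-step then re-estimates the parameters of each component:
\[ \tau_k = \frac{1}{m} \sum_{i=1}^{m} \tilde{\tau}_{i, k} \qquad \mu_k = \frac{\sum_{i=1}^{m} \tilde{\tau}_{i, k}\, x_i}{\sum_{i=1}^{m} \tilde{\tau}_{i, k}} \qquad \Sigma_k = \frac{\sum_{i=1}^{m} \tilde{\tau}_{i, k}\, (x_i - \mu_k)(x_i - \mu_k)^T}{\sum_{i=1}^{m} \tilde{\tau}_{i, k}} \]
\end{remark}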
\end{description}
\end{description}
\begin{remark}
Unlike KDE, a GMM supports several types of prediction (a code sketch follows this remark):
\begin{itemize}
\item Evaluate the (log) density of a sample.
\item Generate a sample.
\item Estimate the probability that a sample belongs to a component.
\item Assign samples to a component (i.e., clustering).
\begin{remark}
GMM can be seen as a generalization of $k$-means.
\end{remark}
\end{itemize}
\end{remark}
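As a minimal sketch of these prediction types (assuming scikit-learn and a training matrix \texttt{X\_train}; the number of components here is a hypothetical value):
\begin{verbatim}
from sklearn.mixture import GaussianMixture

# Fit a GMM with a hypothetical number of components.
gmm = GaussianMixture(n_components=4).fit(X_train)

log_density = gmm.score_samples(X_test)  # (log) density of each sample
samples, _ = gmm.sample(10)              # generate 10 new samples
memberships = gmm.predict_proba(X_test)  # P(Z = k | x) per component
clusters = gmm.predict(X_test)           # hard assignment (clustering)
\end{verbatim}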
\begin{remark}
Unlike KDE, most of the computation is done at training time.
\end{remark}
\begin{description}
\item[Number of components estimation]
The number of Gaussians to use in GMM can be determined through grid search and cross-validation.
\begin{remark}
Other methods, such as the elbow method, can also be applied. Some variants of GMM are able to infer the number of components automatically.
\end{remark}
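A possible sketch of the grid search (assuming scikit-learn; \texttt{GridSearchCV} relies on the estimator's default score, the average log-likelihood, so higher is better):
\begin{verbatim}
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV

# Cross-validated search over the number of components.
search = GridSearchCV(
    GaussianMixture(),
    param_grid={"n_components": list(range(1, 11))},
    cv=5,
)
search.fit(X_train)
best_gmm = search.best_estimator_
\end{verbatim}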
\item[Threshold optimization]
The threshold can be determined in the same way as in \Cref{sec:ad_taxi_kde_uni}.
\end{description}
\subsection{Autoencoder}
\begin{description}
\item[Autoencoder] \marginnote{Autoencoder}
Neural network trained to reconstruct its input. It is composed of two components:
\begin{descriptionlist}
\item[Encoder] $e(x, \theta_e)$ that maps an input $x$ into a latent vector $z$.
\item[Decoder] $d(z, \theta_d)$ that maps $z$ into a reconstruction of $x$.
\end{descriptionlist}
\begin{description}
\item[Training]
Training aims to minimize the reconstruction MSE:
\[ \arg\min_{\theta_e, \theta_d} \frac{1}{m} \sum_{i=1}^{m} \left\Vert d\left( e(x_i, \theta_e), \theta_d \right) - x_i \right\Vert_2^2 \]
To avoid trivial embeddings, the following can be done:
\begin{itemize}
\item Use a low-dimensional latent space.
\item Use L1 regularization to prefer sparse encodings.
\end{itemize}
\end{description}
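A minimal sketch of such an autoencoder (assuming PyTorch; the input dimension, latent size, and data loader are hypothetical):
\begin{verbatim}
import torch
from torch import nn

d, latent_dim = 64, 8  # hypothetical input and latent dimensions

encoder = nn.Sequential(nn.Linear(d, 32), nn.ReLU(),
                        nn.Linear(32, latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(),
                        nn.Linear(32, d))
autoencoder = nn.Sequential(encoder, decoder)

optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # reconstruction MSE

for epoch in range(100):
    for x in loader:  # batches of (normal) training data, assumed given
        optimizer.zero_grad()
        loss = loss_fn(autoencoder(x), x)
        loss.backward()
        optimizer.step()
\end{verbatim}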
\item[Autoencoder for anomaly detection]
By evaluating the quality of the reconstruction, an autoencoder can be used for anomaly detection:
\[ \Vert x - d(e(x, \theta_e), \theta_d) \Vert_2^2 \geq \varepsilon \]
The advantages of this approach are the following:
\begin{itemize}
\item The size of the neural network does not scale with the training data.
\item Neural networks perform well in high-dimensional spaces.
\item Inference is fast.
\end{itemize}
However, the task of reconstruction can be harder than density estimation.
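Continuing the sketch above, the reconstruction error of new samples can be compared against a threshold (tuned as for the other detectors; \texttt{x\_new} and \texttt{epsilon} are assumed given):
\begin{verbatim}
with torch.no_grad():
    reconstruction = autoencoder(x_new)
    # Per-sample squared reconstruction error.
    error = ((x_new - reconstruction) ** 2).sum(dim=-1)
    is_anomaly = error >= epsilon  # flag samples above the threshold
\end{verbatim}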
\end{description}
\begin{remark}
It is always a good idea to normalize the input of a neural network to have a more stable gradient descent. Moreover, with normalized data, common weight initialization techniques make the output approximately normalized too.
\end{remark}
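For instance (assuming scikit-learn), a scaler fitted on the training data only can be applied before the autoencoder:
\begin{verbatim}
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)    # fit on training data only
X_train_norm = scaler.transform(X_train)  # zero mean, unit variance
X_test_norm = scaler.transform(X_test)
\end{verbatim}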