mirror of
https://github.com/NotXia/unibo-ai-notes.git
synced 2025-12-14 18:51:52 +01:00
Add A3I KDE taxi calls
@@ -5,9 +5,13 @@
\def\lastupdate{{PLACEHOLDER-LAST-UPDATE}}
\def\giturl{{PLACEHOLDER-GIT-URL}}

\setcounter{secnumdepth}{3}
\setcounter{tocdepth}{3}

\begin{document}

\makenotesfront
\input{./sections/_preliminaries.tex}
\input{./sections/_anomaly_detection.tex}

\end{document}
Binary file not shown.
Binary file not shown.
Binary file not shown.
@@ -47,7 +47,7 @@ Assuming that the data follows a Gaussian distribution, mean and variance can be
Classify a data point as an anomaly if it is too unlikely.

\begin{description}
\item[Problem formalization]
Given a random variable $X$ whose values $x$ represent the number of taxi calls, we want to find its probability density function (PDF) $f(x)$.

A data point is classified as an anomaly when:
@@ -55,6 +55,148 @@ Classify a data point as an anomaly if it is too unlikely.
\[ f(x) \leq \varepsilon \]
where $\varepsilon$ is a threshold.

\begin{remark}
A PDF can be reasonably used even though the dataset is discrete if its data points are sufficiently fine-grained.
\end{remark}

\begin{remark}
It is handy to use negated log probabilities as:
\begin{itemize}
\item The logarithm adds numerical stability.
\item The negation turns the probability into an alarm signal (the higher the value, the more anomalous the point), which is a more common measure.
\end{itemize}
Therefore, the detection condition becomes:
\[ - \log f(x) \geq \varepsilon \]
\end{remark}
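
For instance, a minimal sketch of this detection rule in Python, assuming the Gaussian case mentioned above and a hypothetical NumPy array \texttt{train\_calls} of daily taxi-call counts (all names, values, and the threshold are illustrative):
\begin{verbatim}
import numpy as np
from scipy.stats import norm

# Hypothetical training data: daily taxi-call counts.
train_calls = np.array([26112., 27044., 25321., 26890., 25977., 26430.])

# Fit a Gaussian density by estimating mean and standard deviation.
mu, sigma = train_calls.mean(), train_calls.std()

def alarm_signal(x):
    # Negated log density: the higher, the more anomalous.
    return -norm.logpdf(x, loc=mu, scale=sigma)

eps = 12.0  # threshold, to be tuned on a validation set
new_calls = np.array([26500., 3000.])
anomalies = alarm_signal(new_calls) >= eps   # e.g. [False, True]
\end{verbatim}
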
\item[Solution formalization]
The problem can be tackled using a density estimation technique.
\end{description}

\subsubsection{Univariate kernel density estimation}

\begin{description}
\item[Kernel density estimation (KDE)] \marginnote{Kernel density estimation (KDE)}
Based on the assumption that where there is a data point, there are likely to be more around it. Therefore, each data point is the center of a density kernel.

\begin{description}
\item[Density kernel]
A kernel $K(x, h)$ is defined by:
\begin{itemize}
\item The input variable $x$.
\item The bandwidth $h$.
\end{itemize}

\begin{description}
\item[Gaussian kernel]
Kernel defined as:
\[ K(x, h) \propto e^{-\frac{x^2}{2h^2}} \]
where:
\begin{itemize}
\item The mean is $0$.
\item $h$ is the standard deviation.
\end{itemize}
As the mean is $0$, an affine transformation can be used to center the kernel on a data point $\mu$ as $K(x - \mu, h)$.
\end{description}
\end{description}

Given $m$ training data points $\bar{x}_i$, the density at any point $x$ can be computed as the kernel average:
\[ f(x, \bar{x}, h) = \frac{1}{m} \sum_{i=1}^{m} K(x - \bar{x}_i, h) \]

Therefore, the training data themselves are used as the parameters of the model, while the bandwidth $h$ has to be estimated.
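
For instance, a minimal sketch of this kernel average with a Gaussian kernel, assuming a hypothetical NumPy array \texttt{train\_calls} that holds the training points:
\begin{verbatim}
import numpy as np

def gaussian_kernel(x, h):
    # Gaussian density with mean 0 and standard deviation h.
    return np.exp(-x**2 / (2 * h**2)) / (h * np.sqrt(2 * np.pi))

def kde(x, train, h):
    # Univariate KDE: average of kernels centered on the training points.
    x = np.atleast_1d(x).astype(float)[:, None]     # shape (n, 1)
    return gaussian_kernel(x - train[None, :], h).mean(axis=1)

train_calls = np.array([25321., 25977., 26112., 26430., 26890., 27044.])
density = kde(26000., train_calls, h=500.)          # density at x = 26000
\end{verbatim}
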
\begin{remark}
According to some statistical arguments, a rule of thumb to estimate $h$ in the univariate case is the following:
\[ h = 0.9 \cdot \min\left\{ \hat{\sigma}, \frac{\texttt{IQR}}{1.34} \right\} \cdot m^{-\frac{1}{5}} \]
where:
\begin{itemize}
\item $\texttt{IQR}$ is the inter-quartile range.
\item $\hat{\sigma}$ is the standard deviation computed over the whole dataset.
\end{itemize}
\end{remark}
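
A minimal sketch of this rule of thumb, reusing the hypothetical \texttt{train\_calls} array from above:
\begin{verbatim}
import numpy as np

def bandwidth_rule_of_thumb(train):
    # h = 0.9 * min(std, IQR / 1.34) * m^(-1/5)
    m = len(train)
    sigma = np.std(train)
    q1, q3 = np.percentile(train, [25, 75])
    return 0.9 * min(sigma, (q3 - q1) / 1.34) * m ** (-1 / 5)

h = bandwidth_rule_of_thumb(train_calls)
\end{verbatim}
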
\item[Data split]
Time series are usually split chronologically:
\begin{descriptionlist}
\item[Train] Should ideally contain only data representing the normal pattern. A small number of anomalies may be tolerated, as they have low probability.
\item[Validation] Used to find the threshold $\varepsilon$.
\item[Test] Used to evaluate the model.
\end{descriptionlist}

\begin{figure}[H]
\centering
\includegraphics[width=0.8\linewidth]{./img/_ad_taxi_splits.pdf}
\caption{Train, validation, and test splits}
\end{figure}
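
A minimal sketch of a chronological split, assuming the data is stored in a hypothetical time-indexed \texttt{pandas} \texttt{Series} called \texttt{series} (the split ratios are illustrative):
\begin{verbatim}
n = len(series)
train_end, val_end = int(n * 0.5), int(n * 0.75)

train = series.iloc[:train_end]          # earliest data: normal pattern
val = series.iloc[train_end:val_end]     # used to tune the threshold
test = series.iloc[val_end:]             # used for the final evaluation
\end{verbatim}
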
\item[Metrics]
It is not straightforward to define a metric for anomaly detection. A cost model that measures the benefit of the predictions is more suitable. A simple cost model can be based on:
\begin{descriptionlist}
\item[True positives (\texttt{TP})] Windows for which at least one anomaly is detected;
\item[False positives (\texttt{FP})] Detections that are not actually anomalies;
\item[False negatives (\texttt{FN})] Undetected anomalies;
\item[Advance (\texttt{adv})] Time between an anomaly and when it is first detected;
\end{descriptionlist}
and is computed as:
\[ (c_\text{false} \cdot \texttt{FP}) + (c_\text{miss} \cdot \texttt{FN}) + (c_\text{late} \cdot \texttt{adv}_{\leq 0}) \]
where $c_\text{false}$, $c_\text{miss}$, and $c_\text{late}$ are hyperparameters.
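
A minimal sketch of this cost model, with hypothetical counts \texttt{fp} and \texttt{fn} and a list \texttt{advances} of per-anomaly advances, assuming that a non-positive advance indicates a late detection, penalized by its magnitude (the cost coefficients are illustrative):
\begin{verbatim}
def cost(fp, fn, advances, c_false=1.0, c_miss=5.0, c_late=0.5):
    # Weighted sum of false alarms, missed anomalies and late detections.
    lateness = sum(-adv for adv in advances if adv <= 0)
    return c_false * fp + c_miss * fn + c_late * lateness
\end{verbatim}
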
\item[Threshold optimization]
Using the train and validation sets, it is possible to find the best threshold $\varepsilon$ that minimizes the cost model through a linear search.

\begin{remark}
The train set can be used alongside the validation set to estimate $\varepsilon$, as this step does not need to guard against overfitting.
\end{remark}
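
A minimal sketch of the linear search, assuming a hypothetical function \texttt{evaluate(eps)} that runs the detector on the train and validation data with threshold \texttt{eps} and returns \texttt{fp}, \texttt{fn}, and \texttt{advances} for the cost model above, and a hypothetical array \texttt{signal} of alarm-signal values $-\log f(x)$:
\begin{verbatim}
import numpy as np

candidates = np.linspace(signal.min(), signal.max(), 200)

best_eps, best_cost = None, float("inf")
for eps in candidates:
    fp, fn, advances = evaluate(eps)   # hypothetical evaluation routine
    c = cost(fp, fn, advances)
    if c < best_cost:
        best_eps, best_cost = eps, c
\end{verbatim}
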
\end{description}

\begin{remark}
The evaluation data should be representative of the real-world distribution. Therefore, in this case, the whole dataset can be used to evaluate the model.
\end{remark}

\begin{remark}
KDE assumes that the data points are independent and identically distributed. Therefore, each data point is considered independent of the others and temporal dependencies are ignored.
\end{remark}


\subsubsection{Multivariate kernel density estimation}

\begin{remark}
In this dataset, nearby points tend to have similar values.
\begin{figure}[H]
\centering
\includegraphics[width=0.7\linewidth]{./img/_ad_taxi_subset.pdf}
\caption{Subset of the dataset}
\end{figure}
\end{remark}

\begin{description}
\item[Autocorrelation plot]
Plot to visualize the correlation between nearby points of a series.
The series is duplicated and shifted by a lag $l$, and the Pearson correlation coefficient is computed between the original and the shifted copy. This operation is repeated over different values of $l$.

\begin{figure}[H]
\centering
\includegraphics[width=0.7\linewidth]{./img/_ad_taxi_autocorrelation.pdf}
\caption{
\parbox[t]{0.6\linewidth}{
Autocorrelation plot of the subset of the dataset. There is a strong correlation up to 4-5 lags.
}
}
\end{figure}
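
A minimal sketch of this computation, assuming the data is stored in a hypothetical \texttt{pandas} \texttt{Series} called \texttt{series}:
\begin{verbatim}
# Pearson correlation between the series and a copy shifted by `lag`
# (equivalent to series.corr(series.shift(lag))).
max_lag = 48
autocorr = [series.autocorr(lag=lag) for lag in range(1, max_lag + 1)]
\end{verbatim}
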
\item[Sliding window]
Given a window size $w$ and a stride $s$, the dataset is split into sequences of $w$ contiguous elements, taken every $s$ steps.

\begin{remark}
Incomplete sequences at the start and end of the dataset are ignored.
\end{remark}

\begin{remark}
In \texttt{pandas}, the \texttt{rolling} method of \texttt{DataFrame} allows creating a sliding window iterator. This approach builds the windows row-wise and also includes incomplete windows.
However, it is usually more efficient to construct the sequences column-wise by hand, as in the sketch below.
\end{remark}
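
For instance, a sketch of the column-wise construction, assuming the data is in a hypothetical \texttt{pandas} \texttt{Series} called \texttt{series}:
\begin{verbatim}
import pandas as pd

def sliding_windows(series, w, stride=1):
    # Column j contains the series shifted back by j steps, so that
    # row i holds the window [x_i, x_{i+1}, ..., x_{i+w-1}].
    windows = pd.DataFrame({f"t{j}": series.shift(-j) for j in range(w)})
    # Drop the incomplete windows at the end, then apply the stride.
    return windows.dropna().iloc[::stride]
\end{verbatim}
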
\item[Multivariate KDE]
Extension of KDE to vector variables: each window of $w$ values is treated as a single $w$-dimensional point.
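
A minimal sketch using \texttt{scikit-learn}, assuming hypothetical matrices \texttt{train\_windows} and \texttt{new\_windows} of shape $(n, w)$ built with the sliding-window procedure above (bandwidth and threshold \texttt{eps} are illustrative):
\begin{verbatim}
from sklearn.neighbors import KernelDensity

# Fit a multivariate Gaussian KDE on the training windows.
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(train_windows)

# Alarm signal for new windows: negated log density.
signal = -kde.score_samples(new_windows)
anomalies = signal >= eps
\end{verbatim}
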
\end{description}
@@ -0,0 +1,20 @@
\chapter{Preliminaries}


\begin{description}
\item[Problem formalization] \marginnote{Problem formalization}
Defines the ideal goal.

\item[Solution formalization] \marginnote{Solution formalization}
Defines the actual approaches that can be used to solve the problem.
\end{description}


\begin{description}
\item[Occam's razor] \marginnote{Occam's razor}
Principle according to which, between two hypotheses, the simpler one is usually the correct one.

\begin{remark}
A simpler hypothesis has lower variance and higher bias, making it more robust.
\end{remark}
\end{description}