Add A3I statistical hypothesis + knowledge injection

2024-11-23 19:26:11 +01:00
parent eb397bba38
commit 8457ccb892
12 changed files with 406 additions and 4 deletions

View File

@ -19,5 +19,6 @@
\include{./sections/_wear_anomalies.tex}
\include{./sections/_arrivals_predicition.tex}
\include{./sections/_features_selection.tex}
\include{./sections/_knowledge_injection.tex}
\end{document}

View File

@ -50,7 +50,7 @@ Each row of the dataset represents a patient and the features are:
\end{figure}
\end{remark}
\subsection{Neuro-probabilistic model}
\subsection{Neuro-probabilistic model} \label{sec:arrivals_neuroprob}
\begin{description}
\item[Poisson distribution] \marginnote{Poisson distribution}

View File

@ -362,10 +362,10 @@ The dataset contains anonymized biomedical data and is composed of a binary targ
\end{description}
\subsection{Semantics for feature selection}
\subsection{Optimization-oriented feature selection}
\begin{description}
\item[Semantics for feature selection] \marginnote{Semantics for feature selection}
\item[Optimization-oriented feature selection] \marginnote{Optimization-oriented feature selection}
Given the set of features $\mathcal{J}$, solve the problem:
\[ \arg\min_{\mathcal{S} \subseteq \mathcal{J}} \left\{ |\mathcal{S}|: \forall (x, y).\, \hat{y} = \hat{f}_\mathcal{S}(x_\mathcal{S}), \mathcal{L}(y, \hat{y}) \leq \theta \right\} \]
In other words, this is the problem of determining the smallest subset of features $\mathcal{S}$ such that a model $\hat{f}_\mathcal{S}$ trained only considering them has an acceptable performance.
@ -383,4 +383,160 @@ The dataset contains anonymized biomedical data and is composed of a binary targ
\begin{remark}
This approach aims at obtaining the best possible performance. Therefore, it is reasonable for a normal machine learning problem. However, for determining feature importance, this is not ideal.
\end{remark}
\end{description}
\end{description}
\subsection{Statistical hypothesis testing}
\begin{description}
\item[Statistical hypothesis testing] \marginnote{Statistical hypothesis testing}
Given:
\begin{itemize}
\item A random (possibly multivariate) variable $X$,
\item A hypothesis $H(X)$.
\end{itemize}
We can define:
\begin{itemize}
\item A competing null hypothesis $H_0(X)$,
\item An experimental statistic $T[X]$ (e.g., expected value).
\end{itemize}
\begin{remark}
$H_0$ is often the negation of $H$ and $T[X]$ is a value that supports the hypothesis when large enough.
\end{remark}
By assuming that the null hypothesis holds, we need to compute:
\begin{itemize}
\item The empirical value $T[x]$ of the statistic on the observed data.
\item The theoretical distribution of $T[X \mid H_0]$.
\end{itemize}
\begin{description}
\item[$p$-value] \marginnote{$p$-value}
Probability of observing, under the null hypothesis, a value of the statistic at least as extreme as the empirical one $T[x]$ (e.g., $\prob{T[X \mid H_0] \geq t}$ or $\prob{T[X \mid H_0] \leq t}$, depending on the case). If $p < 1-\alpha$, for a given confidence level $\alpha$ (e.g., $0.95$), $H_0$ is likely false.
\end{description}
\end{description}
\begin{example}
For this problem:
\begin{itemize}
\item Variables can be feature-target pairs $(X, Y)$.
\item The hypothesis can be $H \equiv \text{``$X$ is important to predict $Y$''}$ (following a given correlation score $r[X, Y]$ such as average Shapley value, permutation importance, \dots).
\item The null hypothesis can be $H_0 \equiv \text{``$X$ appears important only by chance''}$.
\end{itemize}
Given the correlation score for the original dataset $r^* = r[x, y]$, we can define the following inequality:
\[ r[\tilde{x}, \tilde{y}] \leq r^* \]
where $(\tilde{x}, \tilde{y})$ is sampled from $(\tilde{X}, \tilde{Y})$, which is a similar but uncorrelated version of $(X, Y)$. If $X$ and $Y$ are truly correlated, the inequality is expected to hold most of the time; otherwise, it should hold roughly half of the time. Therefore, we can define the test statistic for this problem as:
\[ T[X, Y \mid H_0] \equiv \sum_{i=1}^{m} \mathbb{1}\left( r[\tilde{x}_i, \tilde{y}_i] \leq r^* \right) \]
where $m$ is the number of samples drawn from $(\tilde{X}, \tilde{Y})$ and $\mathbb{1}(\cdot)$ is the indicator function. If $T[X, Y \mid H_0] \approx \frac{m}{2}$, we can state that $X$ and $Y$ might not be correlated.
The test $r[\tilde{x}, \tilde{y}] \leq r^*$ has a binary outcome and therefore follows a Bernoulli distribution (if $H_0$ holds, it follows $\mathcal{B}(\frac{1}{2})$). As the experiment is repeated $m$ times, $T[X, Y \mid H_0]$ follows a binomial distribution $\mathcal{B}(m, \frac{1}{2})$.
In practice, to create an uncorrelated dataset $(\tilde{X}, \tilde{Y})$, it is possible to permute the values of a feature in $X$. Then, it is possible to compute the empirical value $T[X, Y \mid H_0]$ and match it against the theoretical probability. The $p$-value can be computed as:
\[ p = \prob{T[X, Y \mid H_0] \geq t} \]
where $t$ defines a target interval in the distribution and depends on the theoretical distribution and the confidence level $\alpha$. If $p < 1-\alpha$, we can reject $H_0$. To reduce sampling noise, the experiment can be repeated multiple times.
\begin{figure}[H]
\centering
\includegraphics[width=0.6\linewidth]{./img/_biomed_pvalue.pdf}
\caption{
\parbox[t]{0.5\linewidth}{
Binomial distribution with 30 samples. The interval defined with $\alpha=0.95$ is in green.
}
}
\end{figure}
\end{example}
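As a minimal sketch, the permutation test of the previous example can be implemented as follows (assuming, for illustration, the absolute Pearson correlation as score $r$; \texttt{scipy}'s \texttt{binomtest} computes the $p$-value, and all names are illustrative):
\begin{verbatim}
import numpy as np
from scipy.stats import binomtest

def permutation_test(x, y, m=30, alpha=0.95, seed=0):
    # H: "x is important to predict y" vs
    # H0: "x appears important only by chance"
    rng = np.random.default_rng(seed)
    score = lambda a, b: abs(np.corrcoef(a, b)[0, 1])  # score r[., .]
    r_star = score(x, y)                               # r* on original data
    # T = number of permuted (uncorrelated) copies whose score does
    # not exceed r*; under H0, T ~ Binomial(m, 1/2)
    t = sum(score(rng.permutation(x), y) <= r_star for _ in range(m))
    # one-sided p-value P(T >= t) under H0
    p = binomtest(t, n=m, p=0.5, alternative="greater").pvalue
    return t, p, p < 1 - alpha  # True: reject H0, x likely important
\end{verbatim}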
\begin{remark}
When $p < 1-\alpha$, it is possible to reject $H_0$ and state that $H$ is most likely true. However, in all other cases nothing can be concluded.
With binary hypotheses, it is possible to negate $H$ and repeat the procedure with $\lnot H$, which allows defining two target intervals and extracting more information.
\end{remark}
\begin{example}
For this case, we have that:
\begin{itemize}
\item The negated hypothesis is $\lnot H \equiv$ ``$X$ is not important to predict $Y$''.
\item $H_0$ remains the same.
\item The test statistic becomes $m - T[X, Y \mid H_0]$.
\end{itemize}
By proceeding as before, we can ultimately obtain the following intervals:
\begin{itemize}
\item One that supports $H$.
\item One that supports $\lnot H$.
\item One where no claim can be made.
\end{itemize}
\begin{figure}[H]
\centering
\includegraphics[width=0.6\linewidth]{./img/_biomed_pvalue_two_tails.pdf}
\end{figure}
\end{example}
\begin{description}
\item[Boruta] \marginnote{Boruta}
Algorithm for feature selection based on statistical hypothesis testing with the hypothesis $H \equiv$ ``feature $j$ is important among those in the dataset according to a given metric''.
Given a dataset $(X, Y)$, Boruta works as follows:
\begin{enumerate}
\item Augment the dataset with a permuted version $\tilde{X}_j$ of each feature (shadow features).
\item Run the statistical test based on:
\[ \phi_j((x, \tilde{x}), y) > \max_{k \in \tilde{\mathcal{J}}} \phi_k((x, \tilde{x}), y) \]
where $\phi$ is a feature importance metric (e.g., obtained from decision trees) and $\tilde{\mathcal{J}}$ is the set of shadow features. Intuitively, the importance of a feature $j$ is compared with the importances of the shadow features.
\item Repeat the experiment multiple times.
\end{enumerate}
\begin{remark}
Due to the maximum operation, the theoretical distribution of $T$ is only approximately binomial. Therefore, some statistical corrections are applied.
\end{remark}
\begin{remark}
Boruta tests both the positive and the negative hypothesis. Therefore, it is able to determine whether a feature is important, unimportant, or leave it undecided.
\end{remark}
\end{description}
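A minimal sketch of a Boruta-like round (without the statistical corrections of the full algorithm) could look as follows, using a random forest's impurity-based importances as the metric $\phi$ for illustration:
\begin{verbatim}
import numpy as np
from scipy.stats import binom
from sklearn.ensemble import RandomForestClassifier

def boruta_round(X, y, rng):
    # augment the dataset with a shadow (permuted) copy of each feature
    n, d = X.shape
    X_aug = np.hstack([X, rng.permuted(X, axis=0)])
    model = RandomForestClassifier(n_estimators=100).fit(X_aug, y)
    phi = model.feature_importances_
    # "hit": a real feature beats the best shadow feature
    return phi[:d] > phi[d:].max()

def boruta(X, y, rounds=30, alpha=0.95, seed=0):
    rng = np.random.default_rng(seed)
    hits = sum(boruta_round(X, y, rng) for _ in range(rounds))
    # under H0, hits ~ Binomial(rounds, 1/2): two target intervals
    lo = binom.ppf(1 - alpha, rounds, 0.5)
    hi = binom.ppf(alpha, rounds, 0.5)
    return np.where(hits > hi, "important",
                    np.where(hits < lo, "unimportant", "undecided"))
\end{verbatim}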
\subsection{Ground-truth check}
The synthetic dataset used in this section is described by the following causal graph:
\begin{figure}[H]
\centering
\includegraphics[width=0.3\linewidth]{./img/biomed_causal_graph.png}
\end{figure}
where:
\begin{itemize}
\item Black variables are relevant for the target.
\item Gray variables are latent.
\item Red variables are irrelevant.
\end{itemize}
This graph has some common patterns:
\begin{descriptionlist}
\item[Mediator]
A variable $A$ through which another variable $B$ acts on the target, hiding the direct effect of $B$. If $A$ affects the target, then $B$ also does (indirectly).
\begin{example}
In the causal graph, $X_2$ is a mediator between $X_0$, $X_1$ and $Y$, which might cause problems to detect $X_0$ and $X_1$ as relevant features. $X_2$ also mediates for $Z_0$, which, in this case, has a positive effect as $Z_0$ is not observed.
\end{example}
\begin{remark}
Strongly correlated features (e.g., mediator-mediator) might mislead algorithms when determining relevant features as a model might partition the importance between the two features.
\end{remark}
\item[Confounder]
A variable $A$ that has an effect on two variables $B_1$ and $B_2$, thus inducing a correlation between $B_1$ and $B_2$. This might cause a variable to be mistakenly considered important for predicting the target.
\begin{example}
In the causal graph, $Z_1$ is a confounder for $X_0$ and $X_4$. As a consequence, $X_4$ might be mistakenly considered a relevant feature.
\end{example}
\end{descriptionlist}
Overall, Boruta is able to:
\begin{itemize}
\item Identify all causal variables.
\item Correctly estimate the distribution of almost all the variables.
\item Determine monotonic effects and mediators.
\end{itemize}

View File

@ -0,0 +1,245 @@
\chapter{Knowledge injection}
\begin{remark}
ML models exploit implicit knowledge from the data. Explicit knowledge (e.g., rules-of-thumb, known correlations and causal factors, laws of physics, \dots) is instead not used.
\end{remark}
\section{Approaches}
\subsection{Generative approach}
\begin{description}
\item[Generative approach] \marginnote{Generative approach}
Use symbolic knowledge to create training samples that are used to train a model.
\begin{remark}
This approach does not let the model exploit the explicit knowledge directly and might be inefficient.
\end{remark}
\end{description}
\subsection{Lagrangian approaches}
\begin{description}
\item[Knowledge as constraint]
Use hard or soft constraints to inject knowledge.
\begin{example}[RUL estimation]
For RUL estimation, a simple piece of explicit knowledge is that the RUL decreases by one unit at each time step. Formally, given two pairs of examples $(x_i, y_i)$ and $(x_j, y_j)$, it holds that:
\[ y_i - y_j = j - i \quad \forall i,j = 1, \dots, m \text{ with } c_i = c_j \]
where $c_k$ indicates the machine of the $k$-th sample.
Given a model $f$, it can be constrained as follows:
\[ f(x_i; \theta) - f(x_j; \theta) \approx j - i \]
Its training problem becomes:
\[ \arg\min_\theta \mathcal{L}(y, f(x; \theta)) \text{ subj. to } f(x_i; \theta) - f(x_j; \theta) \approx j - i \]
\end{example}
\item[Lagrangian approaches] \marginnote{Lagrangian approaches}
Given:
\begin{itemize}
\item A training problem: $\arg\min_\theta \{ \mathcal{L}(\hat{y}) \mid \hat{y} = f(x; \theta) \}$,
\item Some constraints defined as an inequality on a vector function $\vec{g}(\hat{y}) \leq 0$ with $\vec{g}(\hat{y}) = \{ g_k(\hat{y}) \}_{k=1}^{m}$.
\end{itemize}
\begin{description}
\item[Naive formulation]
The constraints can be formulated as a loss penalty as follows:
\[ \mathcal{L}(\theta, \vec{\lambda}) = \mathcal{L}(\hat{y}) + \vec{\lambda}^T \vec{g}(\hat{y}) \]
where $\vec{\lambda}$ is a vector of multipliers (Lagrangian multipliers).
\begin{remark}
$g_k(\hat{y}) = 0$ and $g_k(\hat{y}) \leq 0$ both satisfy the constraint. However, if $g_k(\hat{y})$ goes below $0$, it becomes an unwanted reward.
In classical Lagrangian theory, this is solved by changing the sign of $\lambda_k$. However, this works by assuming convexity and requires optimizing for $\vec{\lambda}$.
\end{remark}
\item[Clipped formulation]
The constraints can be formulated as a loss penalty as follows:
\[ \mathcal{L}(\theta, \vec{\lambda}) = \mathcal{L}(\hat{y}) + \vec{\lambda}^T \max\{ 0, \vec{g}(\hat{y}) \} \]
with $\vec{\lambda} \geq 0$.
\begin{remark}
An equality constraint $g_k(\hat{y}) = 0$ can be formulated as:
\[
g_k(\hat{y}) \leq 0 \land -g_k(\hat{y}) \leq 0
\]
In penalty terms, it becomes:
\[
\lambda_k' \max \{ 0, g_k(\hat{y}) \} + \lambda_k'' \max \{ 0, -g_k(\hat{y}) \} = \lambda_k | g_k(\hat{y}) |
\]
where $\lambda_k = \lambda_k' + \lambda_k''$.
An alternative formulation is also:
\[ \vec{\lambda}^T \vec{g}(\hat{y})^2 \]
which can be justified by properties of the normal distribution (a squared penalty corresponds to a Gaussian negative log-likelihood).
\end{remark}
\end{description}
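As a concrete sketch, the clipped formulation above can be written as a loss term in PyTorch (names are illustrative; $\vec{g}$ is assumed to return one value per constraint):
\begin{verbatim}
import torch

def constrained_loss(base_loss, g_values, lam):
    # base_loss: scalar L(y_hat); g_values: tensor of g_k(y_hat);
    # lam: non-negative multipliers, one per constraint
    penalty = torch.clamp(g_values, min=0.0)  # max{0, g}: no reward for slack
    return base_loss + (lam * penalty).sum()
\end{verbatim}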
\begin{remark}
Lagrangian approaches usually work with differentiable constraints, but it is not strictly required. However, differentiability is needed for training with gradient descent.
\end{remark}
\begin{description}
\item[Multipliers calibration (maximal accuracy)]
Consider constraints in a soft manner. The penalty can be considered as a regularizer and $\vec{\lambda}$ can be considered as a hyperparameter and optimized through hyperparameter tuning.
\begin{remark}
This approach is viable when sufficient data is available.
\end{remark}
\item[Multipliers calibration (constraints satisfaction)]
Consider constraints in a hard manner.
\begin{description}
\item[Naive approach]
Use a large $\vec{\lambda}$.
\begin{remark}
This approach leads to numerical instability and disproportionate gradients.
\end{remark}
\item[Dual ascent] \marginnote{Dual ascent}
\begin{remark}
The loss with the constraint penalty is differentiable in $\vec{\lambda}$:
\[ \nabla_\lambda \mathcal{L}(\theta, \vec{\lambda}) = \max \{ 0, \vec{g}(f(x, \theta)) \} \]
where:
\begin{itemize}
\item If the constraint is satisfied, the partial derivative is $0$.
\item If the constraint is violated, the partial derivative is equal to the violation.
\end{itemize}
\end{remark}
Alternate gradient descent and ascent. The method works as follows:
\begin{enumerate}
\item Initialize the multipliers $\vec{\lambda}^{(0)} = 0$.
\item Until a stop condition is met:
\begin{enumerate}
\item Obtain $\vec{\lambda}^{(k)}$ via gradient ascent with $\nabla_{\vec{\lambda}} \mathcal{L}(\theta^{(k-1)}, \vec{\lambda})$.
\item Obtain $\theta^{(k)}$ via gradient descent with $\nabla_\theta \mathcal{L}(\theta, \vec{\lambda}^{(k)})$.
\end{enumerate}
\end{enumerate}
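A minimal sketch of dual ascent in PyTorch (model, data loader, base loss, and constraint function \texttt{g} are assumed given; here, as a common stochastic variant, the ascent step is performed per batch rather than with the full-dataset gradient):
\begin{verbatim}
import torch

def dual_ascent(model, loader, base_loss, g, epochs=10,
                n_constraints=1, lr=1e-3, lr_dual=1e-2):
    lam = torch.zeros(n_constraints)                 # lambda^(0) = 0
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            y_hat = model(x)
            violation = torch.clamp(g(y_hat), min=0.0)
            # ascent on lambda: the gradient is the clipped violation
            lam = lam + lr_dual * violation.detach()
            # descent on theta with the updated multipliers
            loss = base_loss(y_hat, y) + (lam * violation).sum()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return lam
\end{verbatim}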
\end{description}
\end{description}
\begin{remark}
Lagrangian approaches are enforced at training time. Constraints might be violated at test time.
\end{remark}
\begin{example}[RUL estimation]
The constraint:
\[ f(x_i; \theta) - f(x_j; \theta) \approx j - i \]
can be formulated as the following penalty:
\[ \lambda \sum_{\substack{i,j=1, \dots, m\\c_i = c_j}} (f(x_i; \theta) - f(x_j; \theta) - (j-i))^2 \]
where $\lambda$ is the same for all pairs for simplicity.
Moreover, to avoid redundancies, it is reasonable to only consider consecutive time steps (denoted as $i \prec j$):
\[ \lambda \sum_{\substack{i \prec j\\c_i = c_j}} (f(x_i; \theta) - f(x_j; \theta) - (j-i))^2 \]
\indenttbox
\begin{remark}
With batches, consecutivity refers to the closest available time step of the same machine in the same batch.
It is also necessary to sample batches in such a way that at least two time steps of the same machine are considered (otherwise the gradient step would be wasted).
\end{remark}
\begin{figure}[H]
\centering
\begin{subfigure}{0.6\linewidth}
\centering
\includegraphics[width=\linewidth]{./img/_rul_pre_lagrangian.pdf}
\caption{Results without knowledge injection}
\end{subfigure}
\begin{subfigure}{0.6\linewidth}
\centering
\includegraphics[width=\linewidth]{./img/_rul_lagrangian.pdf}
\caption{Results with knowledge injection (note that the scales are off)}
\end{subfigure}
\end{figure}
\end{example}
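A sketch of this penalty term for a batch, assuming (hypothetically) that the batch is sorted by machine and time step and carries machine identifiers:
\begin{verbatim}
import torch

def rul_penalty(preds, machines, steps, lam=1.0):
    # consecutive samples of the same machine within the batch
    same = machines[1:] == machines[:-1]
    diff = preds[:-1] - preds[1:]              # f(x_i) - f(x_j), i < j
    target = (steps[1:] - steps[:-1]).float()  # j - i
    return lam * ((diff - target)[same] ** 2).sum()
\end{verbatim}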
\end{description}
\subsection{Ordinary differential equations learning}
\begin{description}
\item[Ordinary differential equation (ODE)] \marginnote{Ordinary differential equation (ODE)}
Equation defined as:
\[ \dot{y} = f(y, t) \]
where $\dot{y}$ is the rate of change and $y$ is the state variable defined as a function of (usually) time $t$ (i.e., $y(t)$ is the state at time $t$).
\begin{remark}
This class of differential equations can be seen as a transition system with continuous steps.
\end{remark}
\begin{remark}
ODEs are useful to represent physics knowledge.
\end{remark}
\item[Initial value problem] \marginnote{Initial value problem}
Given an ODE $\dot{y} = f(y, t)$ and an initial state $y(0) = y_0$, determine how the states unfold (i.e., run a simulation).
\begin{description}
\item[Euler method] \marginnote{Euler method}
Solve an initial value problem by discretizing time and computing the state at each step $k$ as:
\[ y_k = y_{k-1} + (t_k - t_{k-1}) f(y_{k-1}, t_{k-1}) \]
It is assumed that the solution is piecewise linear (i.e., linear between two consecutive states).
\begin{figure}[H]
\centering
\includegraphics[width=0.7\linewidth]{./img/_rc_euler.pdf}
\end{figure}
\begin{remark}
The Euler method is not very accurate. Variants with better accuracy exist (e.g., Runge-Kutta methods).
\end{remark}
\end{description}
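For reference, a minimal implementation of the Euler method ($f$, $y_0$, and the time grid are the inputs of the initial value problem):
\begin{verbatim}
import numpy as np

def euler(f, y0, ts):
    # integrate dy/dt = f(y, t) over the time grid ts
    ys = [y0]
    for k in range(1, len(ts)):
        ys.append(ys[-1] + (ts[k] - ts[k-1]) * f(ys[-1], ts[k-1]))
    return np.array(ys)

# e.g., euler(lambda y, t: -y, 1.0, np.linspace(0, 5, 100))
# approximates y(t) = exp(-t)
\end{verbatim}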
\item[Learning ODE] \marginnote{Learning ODE}
Estimate the parameters of an ODE from the data. The training problem is defined as:
\[ \arg\min_\theta \{ \mathcal{L}(\hat{y}(t), y) \mid \dot{\hat{y}} = f(\hat{y}, t; \theta) \land \hat{y}(0) = y_0 \} \]
A possible approach is to discretize time and optimize on the relaxed problem. The steps are:
\begin{enumerate}
\item Solve the initial value problem using a numerical method (e.g., Euler).
\item Compute the loss $\mathcal{L}$ and optimize over $\theta$.
\end{enumerate}
\begin{remark}
This path can be taken only if the integration steps are differentiable; the Euler method satisfies this requirement.
\end{remark}
\begin{description}
\item[Architecture]
The method can be implemented in an RNN fashion with the Euler method encoded into a repeated layer. At each step, the network takes as input the state and the time variable, and estimates a new state.
\begin{figure}[H]
\centering
\includegraphics[width=0.3\linewidth]{./img/rc_ode_learning.png}
\end{figure}
\end{description}
\begin{remark}
This approach is slow to converge and is not very accurate.
\end{remark}
\begin{example}
Consider an RC circuit with a voltage source $V_S$, a capacitor $C$, and a resistor $R$.
\begin{figure}[H]
\centering
\includegraphics[width=0.23\linewidth]{./img/rc_circuit.png}
\end{figure}
Its dynamic behavior is described by the ODE:
\[ \dot{V} = \frac{1}{\tau}(V_S - V) \]
where $\tau = RC$.
Assume that we have a dataset of the ground-truth voltage $y$ at each time step and we want to find the parameters $V_S$ and $\tau$ of the ODE. The problem is defined as:
\[ \arg\min_{V_S, \tau} \mathcal{L}(\hat{y}(t), y) \text{ subj. to } \dot{\hat{y}} = \frac{1}{\tau} (V_S - \hat{y}) \land \hat{y}(0) = y_0 \]
A layer of the neural network estimates $V_S$ and $\tau$. The time steps are passed through the network and all the predicted voltages are used to compute the loss.
Note that $V_S$ and $\tau$ must be positive values. We can use the same trick as in \Cref{sec:arrivals_neuroprob}: optimize the logarithm of each parameter and apply $\exp$, with a scaling factor to start from a reasonable initial guess.
\end{example}
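A PyTorch sketch of this example, assuming a measured voltage trace \texttt{y} on a time grid \texttt{ts} (the initial guesses are illustrative; the log-parameterization keeps $V_S$ and $\tau$ positive):
\begin{verbatim}
import torch

class RCModel(torch.nn.Module):
    def __init__(self, v_s_guess=5.0, tau_guess=1.0):
        super().__init__()
        # optimize log-parameters: V_S = exp(.) and tau = exp(.) stay positive
        self.log_v_s = torch.nn.Parameter(torch.log(torch.tensor(v_s_guess)))
        self.log_tau = torch.nn.Parameter(torch.log(torch.tensor(tau_guess)))

    def forward(self, y0, ts):
        v_s, tau = self.log_v_s.exp(), self.log_tau.exp()
        ys = [y0]
        for k in range(1, len(ts)):  # differentiable Euler unrolling
            dt = ts[k] - ts[k-1]
            ys.append(ys[-1] + dt * (v_s - ys[-1]) / tau)
        return torch.stack(ys)

model = RCModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(1000):
    y_hat = model(y[0], ts)              # y, ts: observed data
    loss = torch.mean((y_hat - y) ** 2)
    opt.zero_grad()
    loss.backward()
    opt.step()
\end{verbatim}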
\end{description}