Fix typos <noupdate>
@@ -124,7 +124,7 @@ Given a floating-point system $\mathcal{F}(\beta, t, L, U)$, the representation
 \subsection{Machine precision}
 Machine precision $\varepsilon_{\text{mach}}$ determines the accuracy of a floating-point system. \marginnote{Machine precision}
-Depending on the approximation approach, machine precision can be computes as:
+Depending on the approximation approach, machine precision can be computed as:
 \begin{descriptionlist}
 \item[Truncation] $\varepsilon_{\text{mach}} = \beta^{1-t}$
 \item[Rounding] $\varepsilon_{\text{mach}} = \frac{1}{2}\beta^{1-t}$

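Aside: the two formulas above can be checked against a small numerical probe. For IEEE 754 doubles ($\beta = 2$, $t = 53$) the probe below reports $2^{-52}$, which matches the truncation formula $\beta^{1-t}$; the rounding value is half of that. A minimal Python sketch, not part of the notes:

# Probe machine epsilon: the smallest power of 2 such that 1.0 + eps != 1.0 in floating point.
eps = 1.0
while 1.0 + eps / 2 != 1.0:
    eps /= 2

# For IEEE 754 double precision (beta = 2, t = 53):
#   truncation: beta**(1 - t) = 2**-52
#   rounding:   0.5 * beta**(1 - t) = 2**-53
print(eps)          # 2.220446049250313e-16 == 2**-52
print(2.0 ** -52)   # matches the truncation formula beta**(1 - t)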
@@ -34,11 +34,11 @@ Note that $\max \{ f(x) \} = \min \{ -f(x)$ \}.
 \subsection{Optimality conditions}

 \begin{description}
-\item[First order condition] \marginnote{First order condition}
+\item[First-order condition] \marginnote{First-order condition}
 Let $f: \mathbb{R}^N \rightarrow \mathbb{R}$ be continuous and differentiable in $\mathbb{R}^N$.
 \[ \text{If } \vec{x}^* \text{ local minimum of } f \Rightarrow \nabla f(\vec{x}^*) = \nullvec \]

-\item[Second order condition] \marginnote{Second order condition}
+\item[Second-order condition] \marginnote{Second-order condition}
 Let $f: \mathbb{R}^N \rightarrow \mathbb{R}$ be continuous and twice differentiable.
 \[
 \text{If } \nabla f(\vec{x}^*) = \nullvec \text{ and } \nabla^2 f(\vec{x}^*) \text{ positive definite} \Rightarrow
@@ -46,7 +46,7 @@ Note that $\max \{ f(x) \} = \min \{ -f(x)$ \}.
 \]
 \end{description}

-As the second order condition requires to compute the Hessian matrix, which is expensive, in practice only the first order condition is checked.
+As the second-order condition requires computing the Hessian matrix, which is expensive, in practice only the first-order condition is checked.

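Aside: the two conditions in these hunks translate directly into a numerical test, with the Hessian part being the expensive one that is usually skipped. A sketch under assumed names (grad and hess are hypothetical callables supplied by the caller):

import numpy as np

def check_optimality(grad, hess, x, tol=1e-8):
    """First-order check: gradient ~ 0. Second-order check: Hessian positive definite."""
    g = grad(x)
    first_order = np.linalg.norm(g) <= tol          # nabla f(x*) = 0
    if not first_order:
        return False
    # Expensive part: build the Hessian and test positive definiteness
    # via its smallest eigenvalue (usually skipped in practice).
    eigvals = np.linalg.eigvalsh(hess(x))
    return bool(eigvals.min() > 0)

# Example on f(x) = ||x||^2, whose minimum is at the origin.
grad = lambda x: 2 * x
hess = lambda x: 2 * np.eye(x.size)
print(check_optimality(grad, hess, np.zeros(3)))    # True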
@@ -147,7 +147,7 @@ A generic gradient-like method can then be defined as:

 \begin{description}
 \item[Choice of the initialization point] \marginnote{Initialization point}
-The starting point of an iterative method is a user defined parameter.
+The starting point of an iterative method is a user-defined parameter.
 For simple problems, it is usually chosen randomly in $[-1, +1]$.

 For complex problems, the choice of the initialization point is critical as
@@ -184,9 +184,9 @@ A generic gradient-like method can then be defined as:
 \item[Difficult topologies]
 \marginnote{Cliff}
 A cliff in the objective function causes problems when evaluating the gradient at the edge.
-With a small step size, there is a slow down in convergence.
+With a small step size, there is a slowdown in convergence.
 With a large step size, there is an overshoot that may cause the algorithm to diverge.
-% a slow down when evaluating
+% a slowdown when evaluating
 % the gradient at the edge using a small step size and
 % an overshoot when the step is too large.

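Aside: the step-size trade-off described in this hunk is easy to reproduce with a plain gradient-descent loop; this sketch is only illustrative and is not taken from the notes (the function and parameter values are made up):

import numpy as np

def gradient_descent(grad, x0, step=0.1, iters=100):
    """Generic gradient-like method: x_{k+1} = x_k - step * grad(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - step * grad(x)
    return x

# f(x) = x^2, minimum at 0; start from a random point in [-1, +1].
grad = lambda x: 2 * x
x0 = np.random.uniform(-1, 1, size=1)
print(gradient_descent(grad, x0, step=0.1))   # small step: converges towards 0
print(gradient_descent(grad, x0, step=1.5))   # too large: overshoots and diverges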
@@ -145,7 +145,7 @@ This method has time complexity $O(\frac{n^3}{6})$.
 \section{Iterative methods}
 \marginnote{Iterative methods}
 Iterative methods solve a linear system by computing a sequence that converges to the exact solution.
-Compared to direct methods, they are less precise but computationally faster and more adapt for large systems.
+Compared to direct methods, they are less precise but computationally faster and more suited for large systems.

 The overall idea is to build a sequence of vectors $\vec{x}_k$
 that converges to the exact solution $\vec{x}^*$:
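Aside: the Jacobi method is one concrete instance of such a sequence $\vec{x}_k \rightarrow \vec{x}^*$. The sketch below is illustrative only and assumes a diagonally dominant matrix so that the iteration converges:

import numpy as np

def jacobi(A, b, iters=50):
    """Build the sequence x_{k+1} = D^{-1} (b - (A - D) x_k), which converges
    to the exact solution when A is strictly diagonally dominant."""
    D = np.diag(A)                 # diagonal of A, as a vector
    R = A - np.diag(D)             # off-diagonal part of A
    x = np.zeros_like(b, dtype=float)
    for _ in range(iters):
        x = (b - R @ x) / D
    return x

A = np.array([[4.0, 1.0], [2.0, 5.0]])
b = np.array([1.0, 2.0])
print(jacobi(A, b))                # approximates the exact solution
print(np.linalg.solve(A, b))       # direct method, for comparison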
@@ -192,7 +192,7 @@ Obviously, as the sequence is truncated, a truncation error is introduced when u

 \section{Condition number}
 Inherent error causes inaccuracies during the resolution of a system.
-This problem is independent from the algorithm and is estimated using exact arithmetic.
+This problem is independent of the algorithm and is estimated using exact arithmetic.

 Given a system $\matr{A}\vec{x} = \vec{b}$, we perturbate $\matr{A}$ and/or $\vec{b}$ and study the inherited error.
 For instance, if we perturbate $\vec{b}$, we obtain the following system:
@@ -210,8 +210,8 @@ Finally, we can define the \textbf{condition number} of a matrix $\matr{A}$ as:
 \[ K(\matr{A}) = \Vert \matr{A} \Vert \cdot \Vert \matr{A}^{-1} \Vert \]

 A system is \textbf{ill-conditioned} if $K(\matr{A})$ is large \marginnote{Ill-conditioned}
-(i.e. a small perturbation of the input causes a large change of the output).
-Otherwise it is \textbf{well-conditioned}. \marginnote{Well-conditioned}
+(i.e. a small perturbation of the input causes a large change in the output).
+Otherwise, it is \textbf{well-conditioned}. \marginnote{Well-conditioned}


 \section{Linear least squares problem}
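Aside: $K(\matr{A}) = \Vert \matr{A} \Vert \cdot \Vert \matr{A}^{-1} \Vert$ can be evaluated directly with NumPy; the Hilbert matrix below is a standard textbook example of an ill-conditioned matrix, not something taken from the notes:

import numpy as np

def condition_number(A):
    """K(A) = ||A|| * ||A^{-1}||, here in the 2-norm."""
    return np.linalg.norm(A, 2) * np.linalg.norm(np.linalg.inv(A), 2)

well = np.eye(3)                                          # identity: K = 1, well-conditioned
hilbert = np.array([[1.0 / (i + j + 1) for j in range(5)] for i in range(5)])
print(condition_number(well))                             # 1.0
print(condition_number(hilbert))                          # ~4.8e5: ill-conditioned
print(np.linalg.cond(hilbert))                            # same value via NumPy's helper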
@@ -118,7 +118,7 @@ The parameters are determined as the most likely to predict the correct label gi
 Moreover, as the dataset is identically distributed,
 each $p_\vec{\uptheta}(y_n \vert \bm{x}_n)$ of the product has the same distribution.

-By applying the logarithm, we have that the negative log-likelihood of a i.i.d. dataset is defined as:
+By applying the logarithm, we have that the negative log-likelihood of an i.i.d. dataset is defined as:
 \[ \mathcal{L}(\vec{\uptheta}) = -\sum_{n=1}^{N} \log p_\vec{\uptheta}(y_n \vert \bm{x}_n) \]
 and to find good parameters $\vec{\uptheta}$, we solve the problem:
 \[
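Aside: the negative log-likelihood $\mathcal{L}(\vec{\uptheta}) = -\sum_n \log p_{\vec{\uptheta}}(y_n \vert \vec{x}_n)$ is a single sum over the dataset. A sketch with a Gaussian likelihood; all names and data here are illustrative, not from the notes:

import numpy as np

def negative_log_likelihood(theta, X, y, sigma=1.0):
    """L(theta) = -sum_n log N(y_n | x_n^T theta, sigma^2)."""
    mean = X @ theta
    log_p = -0.5 * np.log(2 * np.pi * sigma**2) - (y - mean) ** 2 / (2 * sigma**2)
    return -np.sum(log_p)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
print(negative_log_likelihood(np.array([1.0, 2.0]), X, y))   # good parameters: low NLL
print(negative_log_likelihood(np.array([0.0, 0.0]), X, y))   # bad parameters: higher NLL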
@@ -170,7 +170,7 @@ The parameters are determined as the most likely to predict the correct label gi
 \begin{subfigure}{.45\textwidth}
 \centering
 \includegraphics[width=.75\linewidth]{img/gaussian_mle_bad.png}
-\caption{When the parameters are bad, the label will be far the mean}
+\caption{When the parameters are bad, the label will be far from the mean}
 \end{subfigure}

 \caption{Geometric interpretation of the Gaussian likelihood}
@@ -223,7 +223,7 @@ we want to estimate the function $f$.

 \begin{description}
 \item[Model]
-We use as predictor:
+We use as the predictor:
 \[ f(\vec{x}) = \vec{x}^T \vec{\uptheta} \]
 Because of the noise, we use a probabilistic model with likelihood:
 \[ p_\vec{\uptheta}(y \,\vert\, \vec{x}) = \mathcal{N}(y \,\vert\, f(\vec{x}), \sigma^2) \]
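Aside: for this model, maximizing the Gaussian likelihood in $\vec{\uptheta}$ reduces to a least-squares fit. A quick illustrative check; the synthetic data and names are made up for the example:

import numpy as np

# Model: f(x) = x^T theta with Gaussian noise, y = f(x) + eps, eps ~ N(0, sigma^2).
rng = np.random.default_rng(0)
theta_true = np.array([2.0, -1.0])
X = rng.uniform(-1, 1, size=(100, 2))
y = X @ theta_true + rng.normal(scale=0.1, size=100)

# Maximum likelihood estimate: for Gaussian noise this is the least-squares solution.
theta_mle, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_mle)        # close to [2.0, -1.0]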
@@ -322,7 +322,7 @@ Note: sometimes, instead of the full posterior, the maximum is considered (with
 \item[Expected value (multivariate)] \marginnote{Expected value (multivariate)}
 A multivariate random variable $X$ can be seen as
 a vector of univariate random variables $\begin{pmatrix} X_1, \dots, X_D \end{pmatrix}^T$.
-Its expected value can be computed element wise as:
+Its expected value can be computed element-wise as:
 \[
 \mathbb{E}_X[g(\bm{x})] =
 \begin{pmatrix} \mathbb{E}_{X_1}[g(x_1)] \\ \vdots \\ \mathbb{E}_{X_D}[g(x_D)] \end{pmatrix} \in \mathbb{R}^D
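Aside: the element-wise definition is also how a sample estimate is computed in practice, one mean per component. A tiny illustrative sketch (the distribution and sizes are made up):

import numpy as np

# Samples of a 3-dimensional random variable, one row per draw.
rng = np.random.default_rng(1)
samples = rng.normal(loc=[0.0, 1.0, 2.0], scale=1.0, size=(10_000, 3))

# E_X[x] estimated element-wise: one expectation per component X_1, ..., X_D.
print(samples.mean(axis=0))    # close to [0, 1, 2]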
@@ -466,7 +466,7 @@ Moreover, we have that:
 \begin{descriptionlist}
 \item[Uniform distribution] \marginnote{Uniform distribution}
 Given a discrete random variable $X$ with $\vert \mathcal{T}_X \vert = N$,
-$X$ has an uniform distribution if:
+$X$ has a uniform distribution if:
 \[ p_X(x) = \frac{1}{N}, \forall x \in \mathcal{T}_X \]

 \item[Poisson distribution] \marginnote{Poisson distribution}
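Aside: the standard instance of this definition is a fair six-sided die: $\mathcal{T}_X = \{1, \dots, 6\}$, $N = 6$, and $p_X(x) = \frac{1}{6}$ for every face.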
@@ -71,7 +71,7 @@
 the second matrix contains in the $i$-th row the gradient of $g_i$.

 Therefore, if $g_i$ are in turn multivariate functions $g_1(s, t), g_2(s, t): \mathbb{R}^2 \rightarrow \mathbb{R}$,
-the chain rule can be applies as follows:
+the chain rule can be applied as follows:
 \[
 \frac{\text{d}f}{\text{d}(s, t)} =
 \begin{pmatrix}
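Aside: the matrix form of the chain rule is a product of two Jacobians, and it can be checked numerically against finite differences. The functions f, g_1, g_2 below are chosen only for illustration:

import numpy as np

# f(x1, x2) = x1 * x2,  g1(s, t) = s + t,  g2(s, t) = s * t
def composed(s, t):
    return (s + t) * (s * t)

def chain_rule_jacobian(s, t):
    df_dg = np.array([s * t, s + t])                 # df/d(g1, g2) = (g2, g1)
    dg_dst = np.array([[1.0, 1.0],                   # row i = gradient of g_i w.r.t. (s, t)
                       [t, s]])
    return df_dg @ dg_dst                            # df/d(s, t), the 1x2 Jacobian

s, t, h = 1.5, -0.5, 1e-6
numeric = np.array([
    (composed(s + h, t) - composed(s - h, t)) / (2 * h),
    (composed(s, t + h) - composed(s, t - h)) / (2 * h),
])
print(chain_rule_jacobian(s, t))   # matches the finite-difference estimate
print(numeric)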
@@ -257,7 +257,7 @@ The computation graph can be expressed as:
 \]
 where $g_i$ are elementary functions and $x_{\text{Pa}(x_i)}$ are the parent nodes of $x_i$ in the graph.
 In other words, each intermediate variable is expressed as an elementary function of its preceding nodes.
-The derivatives of $f$ can then be computed step-by-step going backwards as:
+The derivatives of $f$ can then be computed step-by-step going backward as:
 \[ \frac{\partial f}{\partial x_D} = 1 \text{, as by definition } f = x_D \]
 \[
 \frac{\partial f}{\partial x_i} = \sum_{\forall x_c: x_i \in \text{Pa}(x_c)} \frac{\partial f}{\partial x_c} \frac{\partial x_c}{\partial x_i}
@@ -266,7 +266,7 @@ The derivatives of $f$ can then be computed step-by-step going backwards as:
 where $\text{Pa}(x_c)$ is the set of parent nodes of $x_c$ in the graph.
 In other words, to compute the partial derivative of $f$ w.r.t. $x_i$,
 we apply the chain rule by computing
-the partial derivative of $f$ w.r.t. the variables following $x_i$ in the graph (as the computation goes backwards).
+the partial derivative of $f$ w.r.t. the variables following $x_i$ in the graph (as the computation goes backward).

 Automatic differentiation is applicable to all functions that can be expressed as a computational graph and
 when the elementary functions are differentiable.
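Aside: the backward rule in these hunks (seed $\frac{\partial f}{\partial x_D} = 1$, then accumulate $\sum_c \frac{\partial f}{\partial x_c}\frac{\partial x_c}{\partial x_i}$ over the children of each node) is short to write out for a fixed graph. The example graph and names below are illustrative, not taken from the notes:

import math

# Computation graph for f(a, b) = (a + b) * sin(a):
#   x1 = a, x2 = b, x3 = x1 + x2, x4 = sin(x1), x5 = x3 * x4 = f
def reverse_mode(a, b):
    # Forward pass: evaluate every intermediate variable.
    x1, x2 = a, b
    x3 = x1 + x2
    x4 = math.sin(x1)
    x5 = x3 * x4

    # Backward pass: df/dx5 = 1, then accumulate over each node's children.
    d5 = 1.0
    d4 = d5 * x3                        # x4 only feeds x5, dx5/dx4 = x3
    d3 = d5 * x4                        # x3 only feeds x5, dx5/dx3 = x4
    d2 = d3 * 1.0                       # x2 only feeds x3
    d1 = d3 * 1.0 + d4 * math.cos(x1)   # x1 feeds both x3 and x4: sum over both paths
    return x5, (d1, d2)

value, grads = reverse_mode(0.5, 2.0)
print(value, grads)   # gradient: (sin(a) + (a + b) cos(a), sin(a))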