Fix typos <noupdate>

This commit is contained in:
2024-01-01 17:09:15 +01:00
parent 9e25277c0d
commit a7933cf3ba
27 changed files with 193 additions and 190 deletions


@@ -124,7 +124,7 @@ Given a floating-point system $\mathcal{F}(\beta, t, L, U)$, the representation
\subsection{Machine precision}
Machine precision $\varepsilon_{\text{mach}}$ determines the accuracy of a floating-point system. \marginnote{Machine precision}
-Depending on the approximation approach, machine precision can be computes as:
+Depending on the approximation approach, machine precision can be computed as:
\begin{descriptionlist}
\item[Truncation] $\varepsilon_{\text{mach}} = \beta^{1-t}$
\item[Rounding] $\varepsilon_{\text{mach}} = \frac{1}{2}\beta^{1-t}$
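These two formulas can be sanity-checked empirically (a minimal Python sketch, assuming IEEE-754 double precision, i.e. $\beta = 2$, $t = 53$):

```python
import sys

beta, t = 2, 53  # IEEE-754 double precision: base 2, 53 significand bits

# Truncation: eps_mach = beta**(1-t)
eps_trunc = beta ** (1 - t)
# Rounding:   eps_mach = (1/2) * beta**(1-t)
eps_round = 0.5 * beta ** (1 - t)

# Empirically: halve eps until 1 + eps is indistinguishable from 1
eps = 1.0
while 1.0 + eps > 1.0:
    eps /= 2.0

# Doubles round to nearest, so the loop stops at the rounding precision
assert eps == eps_round
# Python's float_info.epsilon is the gap to the next float, i.e. beta**(1-t)
assert eps_trunc == sys.float_info.epsilon
```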


@@ -34,11 +34,11 @@ Note that $\max \{ f(x) \} = -\min \{ -f(x) \}$.
\subsection{Optimality conditions}
\begin{description}
-\item[First order condition] \marginnote{First order condition}
+\item[First-order condition] \marginnote{First-order condition}
Let $f: \mathbb{R}^N \rightarrow \mathbb{R}$ be continuous and differentiable in $\mathbb{R}^N$.
\[ \text{If } \vec{x}^* \text{ local minimum of } f \Rightarrow \nabla f(\vec{x}^*) = \nullvec \]
-\item[Second order condition] \marginnote{Second order condition}
+\item[Second-order condition] \marginnote{Second-order condition}
Let $f: \mathbb{R}^N \rightarrow \mathbb{R}$ be continuous and twice differentiable.
\[
\text{If } \nabla f(\vec{x}^*) = \nullvec \text{ and } \nabla^2 f(\vec{x}^*) \text{ positive definite} \Rightarrow
@@ -46,7 +46,7 @@ Note that $\max \{ f(x) \} = -\min \{ -f(x) \}$.
\]
\end{description}
-As the second order condition requires to compute the Hessian matrix, which is expensive, in practice only the first order condition is checked.
+As the second-order condition requires computing the Hessian matrix, which is expensive, in practice only the first-order condition is checked.
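Both conditions can be verified numerically with finite differences (a minimal sketch; the test function, step sizes, and tolerances are illustrative assumptions):

```python
def shift(x, i, d):
    """Return a copy of x with x[i] shifted by d."""
    y = list(x)
    y[i] += d
    return y

def grad(f, x, h=1e-6):
    """Central-difference gradient."""
    return [(f(shift(x, i, h)) - f(shift(x, i, -h))) / (2 * h)
            for i in range(len(x))]

def hessian(f, x, h=1e-4):
    """Central-difference Hessian."""
    n = len(x)
    return [[(f(shift(shift(x, i, h), j, h)) - f(shift(shift(x, i, h), j, -h))
              - f(shift(shift(x, i, -h), j, h)) + f(shift(shift(x, i, -h), j, -h)))
             / (4 * h * h)
             for j in range(n)] for i in range(n)]

f = lambda x: x[0]**2 + 2 * x[1]**2 - 2 * x[0]   # local minimum at (1, 0)
x_star = [1.0, 0.0]

# First-order condition: the gradient vanishes at x*
g = grad(f, x_star)
assert all(abs(gi) < 1e-6 for gi in g)

# Second-order condition: Hessian positive definite (Sylvester's criterion, 2x2)
H = hessian(f, x_star)
assert H[0][0] > 0 and H[0][0] * H[1][1] - H[0][1] * H[1][0] > 0
```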
@@ -147,7 +147,7 @@ A generic gradient-like method can then be defined as:
\begin{description}
\item[Choice of the initialization point] \marginnote{Initialization point}
-The starting point of an iterative method is a user defined parameter.
+The starting point of an iterative method is a user-defined parameter.
For simple problems, it is usually chosen randomly in $[-1, +1]$.
For complex problems, the choice of the initialization point is critical as
@@ -184,9 +184,9 @@ A generic gradient-like method can then be defined as:
\item[Difficult topologies]
\marginnote{Cliff}
A cliff in the objective function causes problems when evaluating the gradient at the edge.
-With a small step size, there is a slow down in convergence.
+With a small step size, there is a slowdown in convergence.
With a large step size, there is an overshoot that may cause the algorithm to diverge.
-% a slow down when evaluating
+% a slowdown when evaluating
% the gradient at the edge using a small step size and
% an overshoot when the step is too large.
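The slowdown/overshoot trade-off can be illustrated on the toy objective $f(x) = x^2$ (a minimal sketch; the specific step sizes and thresholds are illustrative assumptions):

```python
def gd(lr, steps=50, x0=3.0):
    """Gradient descent on f(x) = x**2, whose gradient is 2x."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x
    return x

# Moderate step size: converges quickly
assert abs(gd(0.4)) < 1e-3
# Tiny step size: still far from the minimum after the same budget (slowdown)
assert abs(gd(0.001)) > 1.0
# Oversized step size: each iterate overshoots and the method diverges
assert abs(gd(1.5)) > 1e6
```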


@@ -145,7 +145,7 @@ This method has time complexity $O(\frac{n^3}{6})$.
\section{Iterative methods}
\marginnote{Iterative methods}
Iterative methods solve a linear system by computing a sequence that converges to the exact solution.
-Compared to direct methods, they are less precise but computationally faster and more adapt for large systems.
+Compared to direct methods, they are less precise but computationally faster and more suited for large systems.
The overall idea is to build a sequence of vectors $\vec{x}_k$
that converges to the exact solution $\vec{x}^*$:
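A minimal sketch of this idea using the Jacobi method (one classical iterative scheme, chosen here as an illustrative assumption; it converges for diagonally dominant matrices):

```python
def jacobi(A, b, iters=100):
    """Jacobi iteration: x_{k+1}[i] = (b[i] - sum_{j != i} A[i][j] x_k[j]) / A[i][i]."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        x = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]
    return x

# Diagonally dominant system with exact solution x* = (1, 2)
A = [[4.0, 1.0], [1.0, 3.0]]
b = [6.0, 7.0]
x = jacobi(A, b)
assert abs(x[0] - 1.0) < 1e-8 and abs(x[1] - 2.0) < 1e-8
```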
@@ -192,7 +192,7 @@ Obviously, as the sequence is truncated, a truncation error is introduced when u
\section{Condition number}
Inherent error causes inaccuracies during the resolution of a system.
-This problem is independent from the algorithm and is estimated using exact arithmetic.
+This problem is independent of the algorithm and is estimated using exact arithmetic.
Given a system $\matr{A}\vec{x} = \vec{b}$, we perturb $\matr{A}$ and/or $\vec{b}$ and study the inherited error.
For instance, if we perturb $\vec{b}$, we obtain the following system:
@@ -210,8 +210,8 @@ Finally, we can define the \textbf{condition number} of a matrix $\matr{A}$ as:
\[ K(\matr{A}) = \Vert \matr{A} \Vert \cdot \Vert \matr{A}^{-1} \Vert \]
A system is \textbf{ill-conditioned} if $K(\matr{A})$ is large \marginnote{Ill-conditioned}
-(i.e. a small perturbation of the input causes a large change of the output).
-Otherwise it is \textbf{well-conditioned}. \marginnote{Well-conditioned}
+(i.e. a small perturbation of the input causes a large change in the output).
+Otherwise, it is \textbf{well-conditioned}. \marginnote{Well-conditioned}
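A minimal sketch of the definition for $2 \times 2$ matrices, using the infinity norm (the example matrices and the threshold are illustrative assumptions):

```python
def inv2(A):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def norm_inf(A):
    """Infinity norm: maximum absolute row sum."""
    return max(sum(abs(v) for v in row) for row in A)

def cond(A):
    """Condition number K(A) = ||A|| * ||A^-1||."""
    return norm_inf(A) * norm_inf(inv2(A))

well = [[2.0, 0.0], [0.0, 1.0]]     # nearly identity: well-conditioned
ill  = [[1.0, 1.0], [1.0, 1.0001]]  # nearly singular: ill-conditioned

assert cond(well) == 2.0
assert cond(ill) > 1e4
```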
\section{Linear least squares problem}


@@ -118,7 +118,7 @@ The parameters are determined as the most likely to predict the correct label gi
Moreover, as the dataset is identically distributed,
each $p_\vec{\uptheta}(y_n \vert \bm{x}_n)$ of the product has the same distribution.
-By applying the logarithm, we have that the negative log-likelihood of a i.i.d. dataset is defined as:
+By applying the logarithm, we have that the negative log-likelihood of an i.i.d. dataset is defined as:
\[ \mathcal{L}(\vec{\uptheta}) = -\sum_{n=1}^{N} \log p_\vec{\uptheta}(y_n \vert \bm{x}_n) \]
and to find good parameters $\vec{\uptheta}$, we solve the problem:
\[
@@ -170,7 +170,7 @@ The parameters are determined as the most likely to predict the correct label gi
\begin{subfigure}{.45\textwidth}
\centering
\includegraphics[width=.75\linewidth]{img/gaussian_mle_bad.png}
-\caption{When the parameters are bad, the label will be far the mean}
+\caption{When the parameters are bad, the label will be far from the mean}
\end{subfigure}
\caption{Geometric interpretation of the Gaussian likelihood}
@@ -223,7 +223,7 @@ we want to estimate the function $f$.
\begin{description}
\item[Model]
-We use as predictor:
+We use as the predictor:
\[ f(\vec{x}) = \vec{x}^T \vec{\uptheta} \]
Because of the noise, we use a probabilistic model with likelihood:
\[ p_\vec{\uptheta}(y \,\vert\, \vec{x}) = \mathcal{N}(y \,\vert\, f(\vec{x}), \sigma^2) \]
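Under this Gaussian likelihood, minimizing the negative log-likelihood reduces (up to constants) to minimizing the sum of squared residuals. A minimal sketch with a scalar parameter $\theta$ (the noise-free data and the closed-form solution for the 1-D case are illustrative assumptions):

```python
# Data generated from y = 2x, so the MLE should recover theta = 2
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# For f(x) = x * theta with Gaussian noise, the negative log-likelihood is
# (up to constants) sum_n (y_n - x_n * theta)^2, minimized in closed form by:
theta = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
assert abs(theta - 2.0) < 1e-12
```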


@@ -322,7 +322,7 @@ Note: sometimes, instead of the full posterior, the maximum is considered (with
\item[Expected value (multivariate)] \marginnote{Expected value (multivariate)}
A multivariate random variable $X$ can be seen as
a vector of univariate random variables $\begin{pmatrix} X_1, \dots, X_D \end{pmatrix}^T$.
-Its expected value can be computed element wise as:
+Its expected value can be computed element-wise as:
\[
\mathbb{E}_X[g(\bm{x})] =
\begin{pmatrix} \mathbb{E}_{X_1}[g(x_1)] \\ \vdots \\ \mathbb{E}_{X_D}[g(x_D)] \end{pmatrix} \in \mathbb{R}^D
@@ -466,7 +466,7 @@ Moreover, we have that:
\begin{descriptionlist}
\item[Uniform distribution] \marginnote{Uniform distribution}
Given a discrete random variable $X$ with $\vert \mathcal{T}_X \vert = N$,
-$X$ has an uniform distribution if:
+$X$ has a uniform distribution if:
\[ p_X(x) = \frac{1}{N}, \forall x \in \mathcal{T}_X \]
\item[Poisson distribution] \marginnote{Poisson distribution}


@@ -71,7 +71,7 @@
the second matrix contains in the $i$-th row the gradient of $g_i$.
Therefore, if $g_i$ are in turn multivariate functions $g_1(s, t), g_2(s, t): \mathbb{R}^2 \rightarrow \mathbb{R}$,
-the chain rule can be applies as follows:
+the chain rule can be applied as follows:
\[
\frac{\text{d}f}{\text{d}(s, t)} =
\begin{pmatrix}
@@ -257,7 +257,7 @@ The computation graph can be expressed as:
\]
where $g_i$ are elementary functions and $x_{\text{Pa}(x_i)}$ are the parent nodes of $x_i$ in the graph.
In other words, each intermediate variable is expressed as an elementary function of its preceding nodes.
-The derivatives of $f$ can then be computed step-by-step going backwards as:
+The derivatives of $f$ can then be computed step-by-step going backward as:
\[ \frac{\partial f}{\partial x_D} = 1 \text{, as by definition } f = x_D \]
\[
\frac{\partial f}{\partial x_i} = \sum_{\forall x_c: x_i \in \text{Pa}(x_c)} \frac{\partial f}{\partial x_c} \frac{\partial x_c}{\partial x_i}
@@ -266,7 +266,7 @@ The derivatives of $f$ can then be computed step-by-step going backward as:
where $\text{Pa}(x_c)$ is the set of parent nodes of $x_c$ in the graph.
In other words, to compute the partial derivative of $f$ w.r.t. $x_i$,
we apply the chain rule by computing
-the partial derivative of $f$ w.r.t. the variables following $x_i$ in the graph (as the computation goes backwards).
+the partial derivative of $f$ w.r.t. the variables following $x_i$ in the graph (as the computation goes backward).
Automatic differentiation is applicable to any function that can be expressed as a computational graph whose elementary functions are differentiable.
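The backward recurrence above can be sketched with a toy reverse-mode implementation (a minimal, self-contained sketch, not a library API; the `Var` class and its operations are illustrative):

```python
class Var:
    """Toy reverse-mode autodiff node: stores value, parents, and local derivatives."""
    def __init__(self, value, parents=()):
        self.value, self.parents, self.grad = value, parents, 0.0
    def __add__(self, o):
        return Var(self.value + o.value, [(self, 1.0), (o, 1.0)])
    def __mul__(self, o):
        return Var(self.value * o.value, [(self, o.value), (o, self.value)])

def backward(f):
    """Accumulate df/dx_i over the graph in reverse topological order."""
    order, seen = [], set()
    def visit(v):
        if id(v) not in seen:
            seen.add(id(v))
            for p, _ in v.parents:
                visit(p)
            order.append(v)
    visit(f)
    f.grad = 1.0                       # df/df = 1, as f = x_D by definition
    for v in reversed(order):
        for p, local in v.parents:
            p.grad += v.grad * local   # chain rule, summed over child nodes

x, y = Var(3.0), Var(2.0)
f = x * y + x                          # f = xy + x
backward(f)
assert f.value == 9.0                  # 3*2 + 3
assert x.grad == 3.0                   # df/dx = y + 1
assert y.grad == 3.0                   # df/dy = x
```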