Fix typos <noupdate>
@@ -124,7 +124,7 @@ Given a floating-point system $\mathcal{F}(\beta, t, L, U)$, the representation
 \subsection{Machine precision}
 Machine precision $\varepsilon_{\text{mach}}$ determines the accuracy of a floating-point system. \marginnote{Machine precision}
-Depending on the approximation approach, machine precision can be computes as:
+Depending on the approximation approach, machine precision can be computed as:
 \begin{descriptionlist}
 \item[Truncation] $\varepsilon_{\text{mach}} = \beta^{1-t}$
 \item[Rounding] $\varepsilon_{\text{mach}} = \frac{1}{2}\beta^{1-t}$

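Aside: the two formulas above can be checked against a small numerical probe. For IEEE 754 doubles ($\beta = 2$, $t = 53$) the probe below reports $2^{-52}$, which matches the truncation formula $\beta^{1-t}$; the rounding value is half of that. A minimal Python sketch, not part of the notes:

# Probe machine epsilon: the smallest power of 2 such that 1.0 + eps != 1.0 in floating point.
eps = 1.0
while 1.0 + eps / 2 != 1.0:
    eps /= 2

# For IEEE 754 double precision (beta = 2, t = 53):
#   truncation: beta**(1 - t) = 2**-52
#   rounding:   0.5 * beta**(1 - t) = 2**-53
print(eps)          # 2.220446049250313e-16 == 2**-52
print(2.0 ** -52)   # matches the truncation formula beta**(1 - t)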
@@ -34,11 +34,11 @@ Note that $\max \{ f(x) \} = \min \{ -f(x)$ \}.
 \subsection{Optimality conditions}

 \begin{description}
-\item[First order condition] \marginnote{First order condition}
+\item[First-order condition] \marginnote{First-order condition}
 Let $f: \mathbb{R}^N \rightarrow \mathbb{R}$ be continuous and differentiable in $\mathbb{R}^N$.
 \[ \text{If } \vec{x}^* \text{ local minimum of } f \Rightarrow \nabla f(\vec{x}^*) = \nullvec \]

-\item[Second order condition] \marginnote{Second order condition}
+\item[Second-order condition] \marginnote{Second-order condition}
 Let $f: \mathbb{R}^N \rightarrow \mathbb{R}$ be continuous and twice differentiable.
 \[
 \text{If } \nabla f(\vec{x}^*) = \nullvec \text{ and } \nabla^2 f(\vec{x}^*) \text{ positive definite} \Rightarrow
@@ -46,7 +46,7 @@ Note that $\max \{ f(x) \} = \min \{ -f(x)$ \}.
 \]
 \end{description}

-As the second order condition requires to compute the Hessian matrix, which is expensive, in practice only the first order condition is checked.
+As the second-order condition requires computing the Hessian matrix, which is expensive, in practice only the first-order condition is checked.

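Aside: the two conditions in these hunks translate directly into a numerical test, with the Hessian part being the expensive one that is usually skipped. A sketch under assumed names (grad and hess are hypothetical callables supplied by the caller):

import numpy as np

def check_optimality(grad, hess, x, tol=1e-8):
    """First-order check: gradient ~ 0. Second-order check: Hessian positive definite."""
    g = grad(x)
    first_order = np.linalg.norm(g) <= tol          # nabla f(x*) = 0
    if not first_order:
        return False
    # Expensive part: build the Hessian and test positive definiteness
    # via its smallest eigenvalue (usually skipped in practice).
    eigvals = np.linalg.eigvalsh(hess(x))
    return bool(eigvals.min() > 0)

# Example on f(x) = ||x||^2, whose minimum is at the origin.
grad = lambda x: 2 * x
hess = lambda x: 2 * np.eye(x.size)
print(check_optimality(grad, hess, np.zeros(3)))    # True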
@@ -147,7 +147,7 @@ A generic gradient-like method can then be defined as:

 \begin{description}
 \item[Choice of the initialization point] \marginnote{Initialization point}
-The starting point of an iterative method is a user defined parameter.
+The starting point of an iterative method is a user-defined parameter.
 For simple problems, it is usually chosen randomly in $[-1, +1]$.

 For complex problems, the choice of the initialization point is critical as
@@ -184,9 +184,9 @@ A generic gradient-like method can then be defined as:
 \item[Difficult topologies]
 \marginnote{Cliff}
 A cliff in the objective function causes problems when evaluating the gradient at the edge.
-With a small step size, there is a slow down in convergence.
+With a small step size, there is a slowdown in convergence.
 With a large step size, there is an overshoot that may cause the algorithm to diverge.
-% a slow down when evaluating
+% a slowdown when evaluating
 % the gradient at the edge using a small step size and
 % an overshoot when the step is too large.

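Aside: the step-size trade-off described in this hunk is easy to reproduce with a plain gradient-descent loop; this sketch is only illustrative and is not taken from the notes (the function and parameter values are made up):

import numpy as np

def gradient_descent(grad, x0, step=0.1, iters=100):
    """Generic gradient-like method: x_{k+1} = x_k - step * grad(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - step * grad(x)
    return x

# f(x) = x^2, minimum at 0; start from a random point in [-1, +1].
grad = lambda x: 2 * x
x0 = np.random.uniform(-1, 1, size=1)
print(gradient_descent(grad, x0, step=0.1))   # small step: converges towards 0
print(gradient_descent(grad, x0, step=1.5))   # too large: overshoots and diverges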
@@ -145,7 +145,7 @@ This method has time complexity $O(\frac{n^3}{6})$.
 \section{Iterative methods}
 \marginnote{Iterative methods}
 Iterative methods solve a linear system by computing a sequence that converges to the exact solution.
-Compared to direct methods, they are less precise but computationally faster and more adapt for large systems.
+Compared to direct methods, they are less precise but computationally faster and more suited for large systems.

 The overall idea is to build a sequence of vectors $\vec{x}_k$
 that converges to the exact solution $\vec{x}^*$:
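Aside: the Jacobi method is one concrete instance of such a sequence $\vec{x}_k \rightarrow \vec{x}^*$. The sketch below is illustrative only and assumes a diagonally dominant matrix so that the iteration converges:

import numpy as np

def jacobi(A, b, iters=50):
    """Build the sequence x_{k+1} = D^{-1} (b - (A - D) x_k), which converges
    to the exact solution when A is strictly diagonally dominant."""
    D = np.diag(A)                 # diagonal of A, as a vector
    R = A - np.diag(D)             # off-diagonal part of A
    x = np.zeros_like(b, dtype=float)
    for _ in range(iters):
        x = (b - R @ x) / D
    return x

A = np.array([[4.0, 1.0], [2.0, 5.0]])
b = np.array([1.0, 2.0])
print(jacobi(A, b))                # approximates the exact solution
print(np.linalg.solve(A, b))       # direct method, for comparison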
@@ -192,7 +192,7 @@ Obviously, as the sequence is truncated, a truncation error is introduced when u

 \section{Condition number}
 Inherent error causes inaccuracies during the resolution of a system.
-This problem is independent from the algorithm and is estimated using exact arithmetic.
+This problem is independent of the algorithm and is estimated using exact arithmetic.

 Given a system $\matr{A}\vec{x} = \vec{b}$, we perturbate $\matr{A}$ and/or $\vec{b}$ and study the inherited error.
 For instance, if we perturbate $\vec{b}$, we obtain the following system:
@@ -210,8 +210,8 @@ Finally, we can define the \textbf{condition number} of a matrix $\matr{A}$ as:
 \[ K(\matr{A}) = \Vert \matr{A} \Vert \cdot \Vert \matr{A}^{-1} \Vert \]

 A system is \textbf{ill-conditioned} if $K(\matr{A})$ is large \marginnote{Ill-conditioned}
-(i.e. a small perturbation of the input causes a large change of the output).
-Otherwise it is \textbf{well-conditioned}. \marginnote{Well-conditioned}
+(i.e. a small perturbation of the input causes a large change in the output).
+Otherwise, it is \textbf{well-conditioned}. \marginnote{Well-conditioned}


 \section{Linear least squares problem}
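Aside: $K(\matr{A}) = \Vert \matr{A} \Vert \cdot \Vert \matr{A}^{-1} \Vert$ can be evaluated directly with NumPy; the Hilbert matrix below is a standard textbook example of an ill-conditioned matrix, not something taken from the notes:

import numpy as np

def condition_number(A):
    """K(A) = ||A|| * ||A^{-1}||, here in the 2-norm."""
    return np.linalg.norm(A, 2) * np.linalg.norm(np.linalg.inv(A), 2)

well = np.eye(3)                                          # identity: K = 1, well-conditioned
hilbert = np.array([[1.0 / (i + j + 1) for j in range(5)] for i in range(5)])
print(condition_number(well))                             # 1.0
print(condition_number(hilbert))                          # ~4.8e5: ill-conditioned
print(np.linalg.cond(hilbert))                            # same value via NumPy's helper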
@@ -118,7 +118,7 @@ The parameters are determined as the most likely to predict the correct label gi
 Moreover, as the dataset is identically distributed,
 each $p_\vec{\uptheta}(y_n \vert \bm{x}_n)$ of the product has the same distribution.

-By applying the logarithm, we have that the negative log-likelihood of a i.i.d. dataset is defined as:
+By applying the logarithm, we have that the negative log-likelihood of an i.i.d. dataset is defined as:
 \[ \mathcal{L}(\vec{\uptheta}) = -\sum_{n=1}^{N} \log p_\vec{\uptheta}(y_n \vert \bm{x}_n) \]
 and to find good parameters $\vec{\uptheta}$, we solve the problem:
 \[
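Aside: the negative log-likelihood $\mathcal{L}(\vec{\uptheta}) = -\sum_n \log p_{\vec{\uptheta}}(y_n \vert \vec{x}_n)$ is a single sum over the dataset. A sketch with a Gaussian likelihood; all names and data here are illustrative, not from the notes:

import numpy as np

def negative_log_likelihood(theta, X, y, sigma=1.0):
    """L(theta) = -sum_n log N(y_n | x_n^T theta, sigma^2)."""
    mean = X @ theta
    log_p = -0.5 * np.log(2 * np.pi * sigma**2) - (y - mean) ** 2 / (2 * sigma**2)
    return -np.sum(log_p)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
print(negative_log_likelihood(np.array([1.0, 2.0]), X, y))   # good parameters: low NLL
print(negative_log_likelihood(np.array([0.0, 0.0]), X, y))   # bad parameters: higher NLL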
@@ -170,7 +170,7 @@ The parameters are determined as the most likely to predict the correct label gi
 \begin{subfigure}{.45\textwidth}
 \centering
 \includegraphics[width=.75\linewidth]{img/gaussian_mle_bad.png}
-\caption{When the parameters are bad, the label will be far the mean}
+\caption{When the parameters are bad, the label will be far from the mean}
 \end{subfigure}

 \caption{Geometric interpretation of the Gaussian likelihood}
@@ -223,7 +223,7 @@ we want to estimate the function $f$.

 \begin{description}
 \item[Model]
-We use as predictor:
+We use as the predictor:
 \[ f(\vec{x}) = \vec{x}^T \vec{\uptheta} \]
 Because of the noise, we use a probabilistic model with likelihood:
 \[ p_\vec{\uptheta}(y \,\vert\, \vec{x}) = \mathcal{N}(y \,\vert\, f(\vec{x}), \sigma^2) \]
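Aside: for this model, maximizing the Gaussian likelihood in $\vec{\uptheta}$ reduces to a least-squares fit. A quick illustrative check; the synthetic data and names are made up for the example:

import numpy as np

# Model: f(x) = x^T theta with Gaussian noise, y = f(x) + eps, eps ~ N(0, sigma^2).
rng = np.random.default_rng(0)
theta_true = np.array([2.0, -1.0])
X = rng.uniform(-1, 1, size=(100, 2))
y = X @ theta_true + rng.normal(scale=0.1, size=100)

# Maximum likelihood estimate: for Gaussian noise this is the least-squares solution.
theta_mle, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_mle)        # close to [2.0, -1.0]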
@@ -322,7 +322,7 @@ Note: sometimes, instead of the full posterior, the maximum is considered (with
 \item[Expected value (multivariate)] \marginnote{Expected value (multivariate)}
 A multivariate random variable $X$ can be seen as
 a vector of univariate random variables $\begin{pmatrix} X_1, \dots, X_D \end{pmatrix}^T$.
-Its expected value can be computed element wise as:
+Its expected value can be computed element-wise as:
 \[
 \mathbb{E}_X[g(\bm{x})] =
 \begin{pmatrix} \mathbb{E}_{X_1}[g(x_1)] \\ \vdots \\ \mathbb{E}_{X_D}[g(x_D)] \end{pmatrix} \in \mathbb{R}^D
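Aside: the element-wise definition is also how a sample estimate is computed in practice, one mean per component. A tiny illustrative sketch (the distribution and sizes are made up):

import numpy as np

# Samples of a 3-dimensional random variable, one row per draw.
rng = np.random.default_rng(1)
samples = rng.normal(loc=[0.0, 1.0, 2.0], scale=1.0, size=(10_000, 3))

# E_X[x] estimated element-wise: one expectation per component X_1, ..., X_D.
print(samples.mean(axis=0))    # close to [0, 1, 2]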
@@ -466,7 +466,7 @@ Moreover, we have that:
 \begin{descriptionlist}
 \item[Uniform distribution] \marginnote{Uniform distribution}
 Given a discrete random variable $X$ with $\vert \mathcal{T}_X \vert = N$,
-$X$ has an uniform distribution if:
+$X$ has a uniform distribution if:
 \[ p_X(x) = \frac{1}{N}, \forall x \in \mathcal{T}_X \]

 \item[Poisson distribution] \marginnote{Poisson distribution}
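Aside: the standard instance of this definition is a fair six-sided die: $\mathcal{T}_X = \{1, \dots, 6\}$, $N = 6$, and $p_X(x) = \frac{1}{6}$ for every face.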
@@ -71,7 +71,7 @@
 the second matrix contains in the $i$-th row the gradient of $g_i$.

 Therefore, if $g_i$ are in turn multivariate functions $g_1(s, t), g_2(s, t): \mathbb{R}^2 \rightarrow \mathbb{R}$,
-the chain rule can be applies as follows:
+the chain rule can be applied as follows:
 \[
 \frac{\text{d}f}{\text{d}(s, t)} =
 \begin{pmatrix}
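Aside: the matrix form of the chain rule is a product of two Jacobians, and it can be checked numerically against finite differences. The functions f, g_1, g_2 below are chosen only for illustration:

import numpy as np

# f(x1, x2) = x1 * x2,  g1(s, t) = s + t,  g2(s, t) = s * t
def composed(s, t):
    return (s + t) * (s * t)

def chain_rule_jacobian(s, t):
    df_dg = np.array([s * t, s + t])                 # df/d(g1, g2) = (g2, g1)
    dg_dst = np.array([[1.0, 1.0],                   # row i = gradient of g_i w.r.t. (s, t)
                       [t, s]])
    return df_dg @ dg_dst                            # df/d(s, t), the 1x2 Jacobian

s, t, h = 1.5, -0.5, 1e-6
numeric = np.array([
    (composed(s + h, t) - composed(s - h, t)) / (2 * h),
    (composed(s, t + h) - composed(s, t - h)) / (2 * h),
])
print(chain_rule_jacobian(s, t))   # matches the finite-difference estimate
print(numeric)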
@@ -257,7 +257,7 @@ The computation graph can be expressed as:
 \]
 where $g_i$ are elementary functions and $x_{\text{Pa}(x_i)}$ are the parent nodes of $x_i$ in the graph.
 In other words, each intermediate variable is expressed as an elementary function of its preceding nodes.
-The derivatives of $f$ can then be computed step-by-step going backwards as:
+The derivatives of $f$ can then be computed step-by-step going backward as:
 \[ \frac{\partial f}{\partial x_D} = 1 \text{, as by definition } f = x_D \]
 \[
 \frac{\partial f}{\partial x_i} = \sum_{\forall x_c: x_i \in \text{Pa}(x_c)} \frac{\partial f}{\partial x_c} \frac{\partial x_c}{\partial x_i}
@@ -266,7 +266,7 @@ The derivatives of $f$ can then be computed step-by-step going backwards as:
 where $\text{Pa}(x_c)$ is the set of parent nodes of $x_c$ in the graph.
 In other words, to compute the partial derivative of $f$ w.r.t. $x_i$,
 we apply the chain rule by computing
-the partial derivative of $f$ w.r.t. the variables following $x_i$ in the graph (as the computation goes backwards).
+the partial derivative of $f$ w.r.t. the variables following $x_i$ in the graph (as the computation goes backward).

 Automatic differentiation is applicable to all functions that can be expressed as a computational graph and
 when the elementary functions are differentiable.
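Aside: the backward rule in these hunks (seed $\frac{\partial f}{\partial x_D} = 1$, then accumulate $\sum_c \frac{\partial f}{\partial x_c}\frac{\partial x_c}{\partial x_i}$ over the children of each node) is short to write out for a fixed graph. The example graph and names below are illustrative, not taken from the notes:

import math

# Computation graph for f(a, b) = (a + b) * sin(a):
#   x1 = a, x2 = b, x3 = x1 + x2, x4 = sin(x1), x5 = x3 * x4 = f
def reverse_mode(a, b):
    # Forward pass: evaluate every intermediate variable.
    x1, x2 = a, b
    x3 = x1 + x2
    x4 = math.sin(x1)
    x5 = x3 * x4

    # Backward pass: df/dx5 = 1, then accumulate over each node's children.
    d5 = 1.0
    d4 = d5 * x3                        # x4 only feeds x5, dx5/dx4 = x3
    d3 = d5 * x4                        # x3 only feeds x5, dx5/dx3 = x4
    d2 = d3 * 1.0                       # x2 only feeds x3
    d1 = d3 * 1.0 + d4 * math.cos(x1)   # x1 feeds both x3 and x4: sum over both paths
    return x5, (d1, d2)

value, grads = reverse_mode(0.5, 2.0)
print(value, grads)   # gradient: (sin(a) + (a + b) cos(a), sin(a))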