Fix typos <noupdate>
@@ -138,7 +138,7 @@
 \item \texttt{\char`\\ w} matches a single alphanumeric or underscore character (same as \texttt{[a-zA-Z0-9\_]}).
-\item \texttt{\char`\\ w} matches a single non-alphanumeric and non-underscore character (same as \texttt{[\textasciicircum\char`\\ w]}).
+\item \texttt{\char`\\ W} matches a single non-alphanumeric and non-underscore character (same as \texttt{[\textasciicircum\char`\\ w]}).
 \item \texttt{\char`\\ s} matches a single whitespace (space or tab).
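To make the corrected \W rule concrete, here is a quick check with Python's re module (the sample string is invented for the illustration):

    import re

    text = "user_42, ok!"
    # \w : a single alphanumeric or underscore character (as in [a-zA-Z0-9_])
    print(re.findall(r"\w", text))   # ['u', 's', 'e', 'r', '_', '4', '2', 'o', 'k']
    # \W : the complement of \w (as in [^\w])
    print(re.findall(r"\W", text))   # [',', ' ', '!']
    # \s : whitespace (in Python this also includes newlines, not only space/tab)
    print(re.findall(r"\s", text))   # [' ']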
@@ -150,7 +150,11 @@
 Operator to refer to previously matched substrings.
 
 \begin{example}
-In the regex \texttt{/the (.*)er they were, the \char`\\ 1er they will be/}, \texttt{\char`\\ 1} should match the same content matched by \texttt{(.*)}.
+In the regex:
+\begin{center}
+\texttt{/the (.*)er they were, the \char`\\ 1er they will be/}
+\end{center}
+\texttt{\char`\\ 1} should match the same content matched by \texttt{(.*)}.
 \end{example}
 \end{description}
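The backreference behaviour in this example can be verified directly; a small Python check (the test sentences are made up):

    import re

    pattern = r"the (.*)er they were, the \1er they will be"
    # \1 must repeat exactly what (.*) captured:
    m = re.search(pattern, "the bigger they were, the bigger they will be")
    print(m.group(1))   # 'bigg'
    # a different word after the comma -> no match:
    print(re.search(pattern, "the bigger they were, the faster they will be"))   # None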
@@ -236,7 +240,7 @@ Tokenization is done by two components:
 Given a training corpus $C$, BPE determines the vocabulary as follows:
 \begin{enumerate}
 \item Start with a vocabulary $V$ containing all the $1$-grams of $C$ and an empty set of merge rules $M$.
-\item While the desired size of the vocabulary has not been reached:
+\item Until the desired size of the vocabulary is reached:
 \begin{enumerate}
 \item Determine the pair of tokens $t_1 \in V$ and $t_2 \in V$ such that, among all the possible pairs, the $n$-gram $t_1 + t_2 = t_1t_2$ obtained by merging them is the most frequent in the corpus $C$.
 \item Add $t_1t_2$ to $V$ and the merge rule $t_1+t_2$ to $M$.
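The loop described in this hunk can be sketched in a few lines of Python. This is only a simplified illustration (the corpus is a plain list of words split into characters; real BPE implementations also handle pre-tokenization and word boundaries):

    from collections import Counter

    def learn_bpe(corpus_words, target_vocab_size):
        # Start from the 1-grams (characters) of the corpus and no merge rules.
        words = [list(w) for w in corpus_words]
        vocab = {ch for w in words for ch in w}
        merges = []
        # Until the desired vocabulary size is reached:
        while len(vocab) < target_vocab_size:
            # Count every adjacent pair of tokens in the corpus.
            pairs = Counter()
            for w in words:
                for t1, t2 in zip(w, w[1:]):
                    pairs[(t1, t2)] += 1
            if not pairs:
                break
            (t1, t2), _ = pairs.most_common(1)[0]   # most frequent pair
            merged = t1 + t2
            vocab.add(merged)          # add t1t2 to V
            merges.append((t1, t2))    # add the merge rule t1+t2 to M
            # Apply the new merge rule to the whole corpus.
            for i, w in enumerate(words):
                j, new_w = 0, []
                while j < len(w):
                    if j + 1 < len(w) and (w[j], w[j + 1]) == (t1, t2):
                        new_w.append(merged)
                        j += 2
                    else:
                        new_w.append(w[j])
                        j += 1
                words[i] = new_w
        return vocab, merges

    vocab, merges = learn_bpe(["low", "low", "lower", "lowest", "newest"], 12)
    print(merges)   # learned merge rules M, most frequent pairs first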
@@ -315,7 +319,7 @@ Tokenization is done by two components:
 \item[WordPiece] \marginnote{WordPiece}
 Similar to BPE with the addition of merge rules ranking and a special leading/tailing set of characters (usually \texttt{\#\#}) to identify subwords (e.g., \texttt{new\#\#}, \texttt{\#\#est} are possible tokens).
 
-\item[Unigram] \marginnote{Unigram}
+\item[Unigram tokenization] \marginnote{Unigram tokenization}
 Starts with a big vocabulary and removes tokens following a loss function.
 \end{description}
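For the ## convention mentioned above, a common way to segment a word at inference time is greedy longest-match-first against the vocabulary. The sketch below assumes the BERT-style variant, where only word-internal continuations carry the ## prefix, and uses a made-up toy vocabulary:

    def wordpiece_segment(word, vocab):
        # Greedy longest-match-first segmentation; "##" marks pieces that
        # continue a word rather than start it.
        tokens, start = [], 0
        while start < len(word):
            end = len(word)
            while end > start:
                piece = word[start:end] if start == 0 else "##" + word[start:end]
                if piece in vocab:
                    tokens.append(piece)
                    start = end
                    break
                end -= 1
            else:
                return ["[UNK]"]   # no known piece fits this position
        return tokens

    vocab = {"new", "low", "##est", "##er"}
    print(wordpiece_segment("newest", vocab))   # ['new', '##est']
    print(wordpiece_segment("lower", vocab))    # ['low', '##er']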
@@ -353,7 +357,7 @@ Tokenization is done by two components:
 Reduce terms to their stem.
 
 \begin{remark}
-Stemming is a simpler approach to lemmatization.
+Stemming is a simpler approach than lemmatization.
 \end{remark}
 
 \begin{description}
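As a quick illustration of the difference, NLTK (assuming it is installed, with the WordNet data downloaded) exposes both a Porter stemmer and a WordNet lemmatizer:

    # pip install nltk; then nltk.download("wordnet") once for the lemmatizer
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    print(PorterStemmer().stem("studies"))                    # 'studi' -- crude suffix stripping
    print(WordNetLemmatizer().lemmatize("studies", pos="n"))  # 'study' -- dictionary-based lemma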
@@ -381,10 +385,10 @@
 \end{remark}
 
 \item[Levenshtein distance] \marginnote{Levenshtein distance}
-Edit distance where:
+Minimum edit distance where:
 \begin{itemize}
-\item Insertions cost $1$;
-\item Deletions cost $1$;
+\item Insertions cost $1$,
+\item Deletions cost $1$,
 \item Substitutions cost $2$.
 \end{itemize}
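The costs listed above translate directly into the usual dynamic-programming recurrence; a minimal sketch:

    def levenshtein(a, b, ins_cost=1, del_cost=1, sub_cost=2):
        # d[i][j] = minimum cost to turn a[:i] into b[:j]
        d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(1, len(a) + 1):
            d[i][0] = i * del_cost
        for j in range(1, len(b) + 1):
            d[0][j] = j * ins_cost
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                d[i][j] = min(
                    d[i - 1][j] + del_cost,    # delete a[i-1]
                    d[i][j - 1] + ins_cost,    # insert b[j-1]
                    d[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else sub_cost),
                )
        return d[len(a)][len(b)]

    print(levenshtein("intention", "execution"))   # 8 with these costs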
@@ -74,19 +74,19 @@
 \textit{\textnormal{[...]} was called a ``stellar and versatile \texttt{acress} whose combination of sass and glamour has defined her \textnormal{[...]}''}
 }
 \end{center}
-By using the Corpus of Contemporary English (COCA), we can determine the following words as candidates:
+By using the Corpus of Contemporary American English (COCA), we can determine the following words as candidates:
 \[
 \texttt{actress} \cdot \texttt{cress} \cdot \texttt{caress} \cdot \texttt{access} \cdot \texttt{across} \cdot \texttt{acres}
 \]
 
 \begin{description}
-\item[Language model] By considering a language model without context, the priors are computed as $\prob{w} = \frac{\texttt{count}(w)}{\vert \texttt{COCA} \vert}$ (where $\vert \texttt{COCA} \vert = \num{404253213}$):
+\item[Language model] By considering a language model without context, the priors are computed as $\prob{w | \varepsilon} = \frac{\texttt{count}(w)}{\vert \texttt{COCA} \vert}$ (where $\vert \texttt{COCA} \vert = \num{404253213}$):
 \begin{table}[H]
 \centering
 \footnotesize
 \begin{tabular}{ccl}
 \toprule
-$w$ & $\texttt{count}(w)$ & $\prob{w}$ \\
+$w$ & $\texttt{count}(w)$ & $\prob{w | \varepsilon}$ \\
 \midrule
 \texttt{actress} & \num{9321} & $0.0000231$ \\
 \texttt{cress} & \num{220} & $0.000000544$ \\
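The priors in this table follow directly from the counts; a two-line check in Python reproduces the reported values:

    COCA_SIZE = 404_253_213
    for w, c in {"actress": 9321, "cress": 220}.items():
        print(w, c / COCA_SIZE)   # actress ~2.31e-05, cress ~5.44e-07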
@@ -125,15 +125,15 @@
 \footnotesize
 \begin{tabular}{cl}
 \toprule
-$w$ & $\prob{x | w} \prob{w}$ \\
+$w$ & $\prob{x | w} \prob{w | \varepsilon}$ \\
 \midrule
-\texttt{actress} & $2.7 \cdot 10^{9}$ \\
-\texttt{cress} & $0.00078 \cdot 10^{9}$ \\
-\texttt{caress} & $0.0028 \cdot 10^{9}$ \\
-\texttt{access} & $0.019 \cdot 10^{9}$ \\
-\texttt{across} & $2.8 \cdot 10^{9}$ \\
-\texttt{acres} & $1.02 \cdot 10^{9}$ \\
-\texttt{acres} & $1.09 \cdot 10^{9}$ \\
+\texttt{actress} & $2.7 \cdot 10^{-9}$ \\
+\texttt{cress} & $0.00078 \cdot 10^{-9}$ \\
+\texttt{caress} & $0.0028 \cdot 10^{-9}$ \\
+\texttt{access} & $0.019 \cdot 10^{-9}$ \\
+\texttt{across} & $2.8 \cdot 10^{-9}$ \\
+\texttt{acres} & $1.02 \cdot 10^{-9}$ \\
+\texttt{acres} & $1.09 \cdot 10^{-9}$ \\
 \bottomrule
 \end{tabular}
 \end{table}
@@ -165,8 +165,8 @@
 Finally, we have that:
 \[
 \begin{split}
-\prob{\texttt{versatile \underline{actress} whose} | \texttt{versatile acress whose}} &= 2.7 \cdot 210 \cdot 10^{-19} \\
-\prob{\texttt{versatile \underline{across} whose} | \texttt{versatile acress whose}} &= 2.8 \cdot 10^{-19} \\
+\prob{\texttt{versatile \underline{actress} whose} | \texttt{versatile acress whose}} &= (2.7 \cdot 10^{-9}) \cdot (210 \cdot 10^{-10}) \\
+\prob{\texttt{versatile \underline{across} whose} | \texttt{versatile acress whose}} &= (2.8 \cdot 10^{-9}) \cdot (1 \cdot 10^{-10}) \\
 \end{split}
 \]
 So \texttt{actress} is the most likely correction for \texttt{acress} in this model.
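The corrected right-hand sides can be multiplied out to confirm the conclusion; a minimal check with the numbers above:

    p_actress = (2.7e-9) * (210e-10)   # product of the two factors reported for 'actress'
    p_across  = (2.8e-9) * (1e-10)     # product of the two factors reported for 'across'
    print(p_actress, p_across)         # ~5.67e-17 vs 2.8e-19
    print(p_actress > p_across)        # True -> 'actress' is the preferred correction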
@@ -223,7 +223,7 @@
 
 \begin{description}
 \item[Estimating $\mathbf{N}$-gram probabilities]
-Consider the bigram case, the probability that a token $w_i$ follows $w_{i-1}$ can be determined by counting:
+Considering the bigram case, the probability that a token $w_i$ follows $w_{i-1}$ can be determined through counting:
 \[ \prob{w_i | w_{i-1}} = \frac{\texttt{count}(w_{i-1} w_i)}{\texttt{count}(w_{i-1})} \]
 \end{description}
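The counting estimate can be sketched directly from the formula; the toy corpus below is invented for the illustration:

    from collections import Counter

    corpus = "the cat sat on the mat the cat slept".split()
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))

    def p(w, w_prev):
        # P(w_i | w_{i-1}) = count(w_{i-1} w_i) / count(w_{i-1})
        return bigrams[(w_prev, w)] / unigrams[w_prev]

    print(p("cat", "the"))   # 2/3: "the" occurs 3 times, "the cat" occurs twice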
@@ -272,14 +272,14 @@
 Measure the quality of a model independently of the task.
 
 \item[Perplexity (\texttt{PP})] \marginnote{Perplexity}
-Probability-based metric based on the inverse probability of a sequence (usually the test set) normalized by the number of words:
+Probability-based metric based on the inverse probability of a sequence (usually using the test set) normalized by the number of words:
 \[
 \begin{split}
 \prob{w_{1..N}} &= \prod_{i} \prob{w_i | w_{1..i-1}} \\
 \texttt{PP}(w_{1..N}) &= \prob{w_{1..N}}^{-\frac{1}{N}} \in [1, +\infty]
 \end{split}
 \]
-A lower perplexity represents a generally better model.
+Generally, a lower perplexity represents a better model.
 
 \begin{example}
 For bigram models, perplexity is computed as:
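The perplexity formula is usually evaluated in log space to avoid underflow; a minimal sketch (the per-token probabilities are placeholders for whatever the model assigns):

    import math

    def perplexity(token_probs):
        # token_probs[i] = P(w_i | w_1..i-1); PP = P(w_1..N)^(-1/N)
        n = len(token_probs)
        return math.exp(-sum(math.log(p) for p in token_probs) / n)

    print(perplexity([0.25, 0.1, 0.05, 0.2]))   # larger when the model is more 'surprised'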
@@ -293,7 +293,7 @@
 \begin{remark}[Perplexity intuition]
 Perplexity can be seen as a measure of surprise of a language model when evaluating a sequence.
 
-Alternatively, it can also be seen as a weighted average branching factor (i.e., average number of possible next words that follow any word, accounting for their probabilities). For instance, consider a vocabulary of digits and a training corpus where every digit appears with uniform probability $0.1$. The perplexity of any sequence using a 1-gram model is:
+Alternatively, it can also be seen as a weighted average branching factor (i.e., average number of possible unique next words that follow any word, accounting for their probabilities). For instance, consider a vocabulary of digits and a training corpus where every digit appears with uniform probability $0.1$. The perplexity of any sequence using a 1-gram model is:
 \[ \texttt{PP}(w_{1..N}) = \left( 0.1^{N} \right)^{-\frac{1}{N}} = 10 \]
 Now consider a training corpus where $0$ occurs $91\%$ of the time and the other digits $1\%$ of the time. The perplexity of the sequence \texttt{0 0 0 0 0 3 0 0 0 0} is:
 \[ \texttt{PP}(\texttt{0 0 0 0 0 3 0 0 0 0}) = \left( 0.91^9 \cdot 0.01 \right)^{-\frac{1}{10}} \approx 1.73 \]
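The two values in this remark can be checked numerically (taking a length-10 sequence in the uniform case):

    print((0.1 ** 10) ** (-1 / 10))          # 10.0  (uniform digits)
    print((0.91 ** 9 * 0.01) ** (-1 / 10))   # ~1.73 (skewed corpus)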
@@ -349,7 +349,7 @@ There are two types of vocabulary systems:
 
 \subsection{Unseen sequences}
 
-Only for $n$-grams that occur enough times a representative probability can be estimated. For increasing values of $n$, the sparsity grows causing many unseen $n$-grams that produce a probability of $0$, with the risk of performing divisions by zero (e.g., perplexity) or zeroing probabilities (e.g., when applying the chain rule).
+Only for $n$-grams that occur enough times can a representative probability be estimated. For increasing values of $n$, the sparsity grows, causing many unseen $n$-grams that produce a probability of $0$, with the risk of performing divisions by zero (e.g., when computing perplexity) or zeroing probabilities (e.g., when applying the chain rule).
 
 \begin{description}
 \item[Laplace smoothing] \marginnote{Laplace smoothing}
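The diff ends at the Laplace smoothing item; since its formula is not visible in this hunk, the sketch below assumes the standard add-one formulation for bigrams (a pseudo-count of 1 for every bigram, with the vocabulary size added to the denominator):

    def laplace_bigram_prob(w_prev, w, bigram_counts, unigram_counts, vocab_size):
        # Unseen bigrams get a small non-zero probability instead of 0.
        return (bigram_counts.get((w_prev, w), 0) + 1) / (unigram_counts.get(w_prev, 0) + vocab_size)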