diff --git a/src/year2/natural-language-processing/sections/_basic_text.tex b/src/year2/natural-language-processing/sections/_basic_text.tex index 2a9aa2a..201d61e 100644 --- a/src/year2/natural-language-processing/sections/_basic_text.tex +++ b/src/year2/natural-language-processing/sections/_basic_text.tex @@ -138,7 +138,7 @@ \item \texttt{\char`\\ w} matches a single alphanumeric or underscore character (same as \texttt{[a-zA-Z0-9\_]}). - \item \texttt{\char`\\ w} matches a single non-alphanumeric and non-underscore character (same as \texttt{[\textasciicircum\char`\\ w]}). + \item \texttt{\char`\\ W} matches a single non-alphanumeric and non-underscore character (same as \texttt{[\textasciicircum\char`\\ w]}). \item \texttt{\char`\\ s} matches a single whitespace (space or tab). @@ -150,7 +150,11 @@ Operator to refer to previously matched substrings. \begin{example} - In the regex \texttt{/the (.*)er they were, the \char`\\ 1er they will be/}, \texttt{\char`\\ 1} should match the same content matched by \texttt{(.*)}. + In the regex: + \begin{center} + \texttt{/the (.*)er they were, the \char`\\ 1er they will be/} + \end{center} + \texttt{\char`\\ 1} should match the same content matched by \texttt{(.*)}. \end{example} \end{description} @@ -236,7 +240,7 @@ Tokenization is done by two components: Given a training corpus $C$, BPE determines the vocabulary as follows: \begin{enumerate} \item Start with a vocabulary $V$ containing all the $1$-grams of $C$ and an empty set of merge rules $M$. - \item While the desired size of the vocabulary has not been reached: + \item Until the desired size of the vocabulary is reached: \begin{enumerate} \item Determine the pair of tokens $t_1 \in V$ and $t_2 \in V$ such that, among all the possible pairs, the $n$-gram $t_1 + t_2 = t_1t_2$ obtained by merging them is the most frequent in the corpus $C$. \item Add $t_1t_2$ to $V$ and the merge rule $t_1+t_2$ to $M$. @@ -315,7 +319,7 @@ Tokenization is done by two components: \item[WordPiece] \marginnote{WordPiece} Similar to BPE with the addition of merge rules ranking and a special leading/tailing set of characters (usually \texttt{\#\#}) to identify subwords (e.g., \texttt{new\#\#}, \texttt{\#\#est} are possible tokens). - \item[Unigram] \marginnote{Unigram} + \item[Unigram tokenization] \marginnote{Unigram tokenization} Starts with a big vocabulary and remove tokens following a loss function. \end{description} @@ -353,7 +357,7 @@ Tokenization is done by two components: Reduce terms to their stem. \begin{remark} - Stemming is a simpler approach to lemmatization. + Stemming is a simpler approach than lemmatization. \end{remark} \begin{description} @@ -381,10 +385,10 @@ Tokenization is done by two components: \end{remark} \item[Levenshtein distance] \marginnote{Levenshtein distance} - Edit distance where: + Minimum edit distance where: \begin{itemize} - \item Insertions cost $1$; - \item Deletions cost $1$; + \item Insertions cost $1$, + \item Deletions cost $1$, \item Substitutions cost $2$. 
\end{itemize} diff --git a/src/year2/natural-language-processing/sections/_language_models.tex b/src/year2/natural-language-processing/sections/_language_models.tex index f32c470..67577d4 100644 --- a/src/year2/natural-language-processing/sections/_language_models.tex +++ b/src/year2/natural-language-processing/sections/_language_models.tex @@ -74,19 +74,19 @@ \textit{\textnormal{[...]} was called a ``stellar and versatile \texttt{acress} whose combination of sass and glamour has defined her \textnormal{[...]}''} } \end{center} - By using the Corpus of Contemporary English (COCA), we can determine the following words as candidates: + By using the Corpus of Contemporary American English (COCA), we can determine the following words as candidates: \[ \texttt{actress} \cdot \texttt{cress} \cdot \texttt{caress} \cdot \texttt{access} \cdot \texttt{across} \cdot \texttt{acres} \] \begin{description} - \item[Language model] By considering a language model without context, the priors are computed as $\prob{w} = \frac{\texttt{count}(w)}{\vert \texttt{COCA} \vert}$ (where $\vert \texttt{COCA} \vert = \num{404253213}$): + \item[Language model] By considering a language model without context, the priors are computed as $\prob{w | \varepsilon} = \frac{\texttt{count}(w)}{\vert \texttt{COCA} \vert}$ (where $\vert \texttt{COCA} \vert = \num{404253213}$): \begin{table}[H] \centering \footnotesize \begin{tabular}{ccl} \toprule - $w$ & $\texttt{count}(w)$ & $\prob{w}$ \\ + $w$ & $\texttt{count}(w)$ & $\prob{w | \varepsilon}$ \\ \midrule \texttt{actress} & \num{9321} & $0.0000231$ \\ \texttt{cress} & \num{220} & $0.000000544$ \\ @@ -125,15 +125,15 @@ \footnotesize \begin{tabular}{cl} \toprule - $w$ & $\prob{x | w} \prob{w}$ \\ + $w$ & $\prob{x | w} \prob{w | \varepsilon}$ \\ \midrule - \texttt{actress} & $2.7 \cdot 10^{9}$ \\ - \texttt{cress} & $0.00078 \cdot 10^{9}$ \\ - \texttt{caress} & $0.0028 \cdot 10^{9}$ \\ - \texttt{access} & $0.019 \cdot 10^{9}$ \\ - \texttt{across} & $2.8 \cdot 10^{9}$ \\ - \texttt{acres} & $1.02 \cdot 10^{9}$ \\ - \texttt{acres} & $1.09 \cdot 10^{9}$ \\ + \texttt{actress} & $2.7 \cdot 10^{-9}$ \\ + \texttt{cress} & $0.00078 \cdot 10^{-9}$ \\ + \texttt{caress} & $0.0028 \cdot 10^{-9}$ \\ + \texttt{access} & $0.019 \cdot 10^{-9}$ \\ + \texttt{across} & $2.8 \cdot 10^{-9}$ \\ + \texttt{acres} & $1.02 \cdot 10^{-9}$ \\ + \texttt{acres} & $1.09 \cdot 10^{-9}$ \\ \bottomrule \end{tabular} \end{table} @@ -165,8 +165,8 @@ Finally, we have that: \[ \begin{split} - \prob{\texttt{versatile \underline{actress} whose} | \texttt{versatile acress whose}} &= 2.7 \cdot 210 \cdot 10^{-19} \\ - \prob{\texttt{versatile \underline{across} whose} | \texttt{versatile acress whose}} &= 2.8 \cdot 10^{-19} \\ + \prob{\texttt{versatile \underline{actress} whose} | \texttt{versatile acress whose}} &= (2.7 \cdot 10^{-9}) \cdot (210 \cdot 10^{-10}) \\ + \prob{\texttt{versatile \underline{across} whose} | \texttt{versatile acress whose}} &= (2.8 \cdot 10^{-9}) \cdot (1 \cdot 10^{-10}) \\ \end{split} \] So \texttt{actress} is the most likely correction for \texttt{acress} in this model. 
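As a concrete companion to the BPE vocabulary-learning procedure described earlier (start from all the 1-grams of the corpus, then repeatedly merge the most frequent adjacent pair until the desired vocabulary size is reached), the following is a minimal Python sketch. It is not part of the original notes: the function name, the toy corpus, and the target size are illustrative assumptions, and practical details such as end-of-word markers and tie-breaking are omitted.
\begin{verbatim}
from collections import Counter

# Minimal BPE vocabulary learning: keep a frequency table of words split
# into symbols, repeatedly merge the most frequent adjacent symbol pair,
# and record the corresponding merge rule.
def learn_bpe(corpus_words, target_vocab_size):
    words = Counter(tuple(w) for w in corpus_words)   # word -> frequency
    vocab = {c for w in words for c in w}             # initial 1-grams
    merges = []                                       # ordered merge rules

    while len(vocab) < target_vocab_size:
        # count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]           # most frequent pair
        merges.append((a, b))
        vocab.add(a + b)

        # apply the new merge rule to every word in the table
        new_words = Counter()
        for word, freq in words.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == (a, b):
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words

    return vocab, merges

# toy usage (illustrative corpus):
# learn_bpe("low low low lower newest newest widest".split(), 15)
\end{verbatim}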
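The Levenshtein distance defined above (insertions and deletions cost $1$, substitutions cost $2$) can be computed with standard dynamic programming; edit distances of this kind are also the usual way candidate corrections such as those in the spelling example are generated. The sketch below is not part of the original notes and the function name is illustrative.
\begin{verbatim}
# Minimum edit distance with the costs listed above:
# insertion 1, deletion 1, substitution 2 (0 if the characters match).
def levenshtein(source, target):
    n, m = len(source), len(target)
    # dist[i][j] = min cost of transforming source[:i] into target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i                    # delete everything
    for j in range(1, m + 1):
        dist[0][j] = j                    # insert everything
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 2
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution/match
    return dist[n][m]

# e.g. levenshtein("intention", "execution") == 8 with these costs
\end{verbatim}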
@@ -223,7 +223,7 @@ \begin{description} \item[Estimating $\mathbf{N}$-gram probabilities] - Consider the bigram case, the probability that a token $w_i$ follows $w_{i-1}$ can be determined by counting: + Consider the bigram case, the probability that a token $w_i$ follows $w_{i-1}$ can be determined through counting: \[ \prob{w_i | w_{i-1}} = \frac{\texttt{count}(w_{i-1} w_i)}{\texttt{count}(w_{i-1})} \] \end{description} @@ -272,14 +272,14 @@ Measure the quality of a model independently of the task. \item[Perplexity (\texttt{PP})] \marginnote{Perplexity} - Probability-based metric based on the inverse probability of a sequence (usually the test set) normalized by the number of words: + Probability-based metric based on the inverse probability of a sequence (usually using the test set) normalized by the number of words: \[ \begin{split} \prob{w_{1..N}} &= \prod_{i} \prob{w_i | w_{1..i-1}} \\ \texttt{PP}(w_{1..N}) &= \prob{w_{1..N}}^{-\frac{1}{N}} \in [1, +\infty] \end{split} \] - A lower perplexity represents a generally better model. + Generally, a lower perplexity represents a better model. \begin{example} For bigram models, perplexity is computed as: @@ -293,7 +293,7 @@ \begin{remark}[Perplexity intuition] Perplexity can be seen as a measure of surprise of a language model when evaluating a sequence. - Alternatively, it can also be seen as a weighted average branching factor (i.e., average number of possible next words that follow any word, accounting for their probabilities). For instance, consider a vocabulary of digits and a training corpus where every digit appears with uniform probability $0.1$. The perplexity of any sequence using a 1-gram model is: + Alternatively, it can also be seen as a weighted average branching factor (i.e., average number of possible unique next words that follow any word, accounting for their probabilities). For instance, consider a vocabulary of digits and a training corpus where every digit appears with uniform probability $0.1$. The perplexity of any sequence using a 1-gram model is: \[ \texttt{PP}(w_{1..N}) = \left( 0.1^{N} \right)^{-\frac{1}{N}} = 10 \] Now consider a training corpus where $0$ occurs $91\%$ of the time and the other digits $1\%$ of the time. The perplexity of the sequence \texttt{0 0 0 0 0 3 0 0 0 0} is: \[ \texttt{PP}(\texttt{0 0 0 0 0 3 0 0 0 0}) = \left( 0.91^9 \cdot 0.01 \right)^{-\frac{1}{10}} \approx 1.73 \] @@ -349,7 +349,7 @@ There are two types of vocabulary systems: \subsection{Unseen sequences} -Only for $n$-grams that occur enough times a representative probability can be estimated. For increasing values of $n$, the sparsity grows causing many unseen $n$-grams that produce a probability of $0$, with the risk of performing divisions by zero (e.g., perplexity) or zeroing probabilities (e.g., when applying the chain rule). +Only for $n$-grams that occur enough times a representative probability can be estimated. For increasing values of $n$, the sparsity grows causing many unseen $n$-grams that produce a probability of $0$, with the risk of performing divisions by zero (e.g., when computing perplexity) or zeroing probabilities (e.g., when applying the chain rule). \begin{description} \item[Laplace smoothing] \marginnote{Laplace smoothing}
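To make the count-based bigram estimate and the perplexity definition above concrete, here is a minimal Python sketch (not part of the original notes; function names are illustrative). It reproduces the digit example: a uniform 1-gram model gives perplexity $10$, while the skewed model gives roughly $1.73$ on the sequence \texttt{0 0 0 0 0 3 0 0 0 0}.
\begin{verbatim}
import math
from collections import Counter

# MLE bigram estimate from counts:
#   P(w_i | w_{i-1}) = count(w_{i-1} w_i) / count(w_{i-1})
def bigram_probs(tokens):
    bigrams = Counter(zip(tokens, tokens[1:]))
    histories = Counter(tokens[:-1])                  # count(w_{i-1})
    return {(u, v): c / histories[u] for (u, v), c in bigrams.items()}

# Perplexity of a sequence from its per-token probabilities:
#   PP(w_1..N) = (prod_i P(w_i | history))^(-1/N), computed in log space.
def perplexity(token_probs):
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# Uniform 1-gram model over digits: every token has probability 0.1.
print(perplexity([0.1] * 10))                         # 10.0

# Skewed 1-gram model: P(0) = 0.91, P(d) = 0.01 for the other digits.
p = {"0": 0.91, **{d: 0.01 for d in "123456789"}}
seq = "0 0 0 0 0 3 0 0 0 0".split()
print(perplexity([p[w] for w in seq]))                # ~1.73
\end{verbatim}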