diff --git a/src/year2/natural-language-processing/sections/_language_models.tex b/src/year2/natural-language-processing/sections/_language_models.tex
index 67577d4..1e27d38 100644
--- a/src/year2/natural-language-processing/sections/_language_models.tex
+++ b/src/year2/natural-language-processing/sections/_language_models.tex
@@ -80,13 +80,13 @@
 \]
 
 \begin{description}
-    \item[Language model] By considering a language model without context, the priors are computed as $\prob{w | \varepsilon} = \frac{\texttt{count}(w)}{\vert \texttt{COCA} \vert}$ (where $\vert \texttt{COCA} \vert = \num{404253213}$):
+    \item[Language model] For a language model without context, the priors are computed as $\prob{w | \varnothing} = \frac{\texttt{count}(w)}{\vert \texttt{COCA} \vert}$ (where $\vert \texttt{COCA} \vert = \num{404253213}$):
         \begin{table}[H]
             \centering
             \footnotesize
             \begin{tabular}{ccl}
                 \toprule
-                $w$ & $\texttt{count}(w)$ & $\prob{w | \varepsilon}$ \\
+                $w$ & $\texttt{count}(w)$ & $\prob{w | \varnothing}$ \\
                 \midrule
                 \texttt{actress} & \num{9321} & $0.0000231$ \\
                 \texttt{cress} & \num{220} & $0.000000544$ \\
@@ -125,7 +125,7 @@
             \footnotesize
             \begin{tabular}{cl}
                 \toprule
-                $w$ & $\prob{x | w} \prob{w | \varepsilon}$ \\
+                $w$ & $\prob{x | w} \prob{w | \varnothing}$ \\
                 \midrule
                 \texttt{actress} & $2.7 \cdot 10^{-9}$ \\
                 \texttt{cress} & $0.00078 \cdot 10^{-9}$ \\
@@ -155,7 +155,7 @@
                 \bottomrule
             \end{tabular}
         \end{table}
-        This allows to measure the likelihood of a sentence as:
+        This allows measuring the likelihood of a word within its context as:
         \[
             \begin{split}
                 \prob{\texttt{versatile \underline{actress} whose}} &= \prob{\texttt{actress} | \texttt{versatile}} \prob{\texttt{whose} | \texttt{actress}} = 210 \cdot 10^{-10} \\
@@ -293,7 +293,7 @@
 
 \begin{remark}[Perplexity intuition]
     Perplexity can be seen as a measure of surprise of a language model when evaluating a sequence.
-    Alternatively, it can also be seen as a weighted average branching factor (i.e., average number of possible unique next words that follow any word, accounting for their probabilities). For instance, consider a vocabulary of digits and a training corpus where every digit appears with uniform probability $0.1$. The perplexity of any sequence using a 1-gram model is:
+    Alternatively, it can be seen as a weighted average branching factor (i.e., the average number of possible unique next words that follow any word, weighted by their probabilities). For instance, consider a vocabulary of digits and a training corpus where every digit appears with uniform probability $0.1$. The perplexity of any sequence using a 1-gram model is:
     \[ \texttt{PP}(w_{1..N}) = \left( 0.1^{N} \right)^{-\frac{1}{N}} = 10 \]
     Now consider a training corpus where $0$ occurs $91\%$ of the time and the other digits $1\%$ of the time. The perplexity of the sequence \texttt{0 0 0 0 0 3 0 0 0 0} is:
     \[ \texttt{PP}(\texttt{0 0 0 0 0 3 0 0 0 0}) = \left( 0.91^9 \cdot 0.01 \right)^{-\frac{1}{10}} \approx 1.73 \]
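
Not part of the patch itself, but since the hunks above only touch notation and wording, it is easy to double-check that the numbers they carry are internally consistent. The snippet below is a minimal Python sketch: the corpus size, word counts, and digit probabilities are copied verbatim from the LaTeX tables and the perplexity remark, while the `perplexity` helper and all variable names are illustrative assumptions, not code from the repository.

```python
# Sanity check for the numbers carried by the hunks above.
# Counts and probabilities are taken from the LaTeX tables; the
# variable names and the helper below are illustrative only.

COCA_SIZE = 404_253_213                # |COCA| as stated in the notes
counts = {"actress": 9_321, "cress": 220}

# Unigram ("no context") priors: P(w | empty context) = count(w) / |COCA|
for w, c in counts.items():
    print(f"P({w}) = {c / COCA_SIZE:.3g}")
# -> P(actress) = 2.31e-05, P(cress) = 5.44e-07


def perplexity(probs):
    """PP(w_1..N) = (prod_i P(w_i)) ** (-1/N) for a unigram model."""
    n = len(probs)
    product = 1.0
    for p in probs:
        product *= p
    return product ** (-1 / n)


# Uniform digit model: every digit has probability 0.1.
print(perplexity([0.1] * 10))           # -> 10.0

# Skewed digit model: P(0) = 0.91, every other digit 0.01.
# The sequence "0 0 0 0 0 3 0 0 0 0" contains nine 0s and one 3.
print(perplexity([0.91] * 9 + [0.01]))  # -> ~1.73
```

Running it reproduces the unigram priors (about 2.31e-05 for `actress` and 5.44e-07 for `cress`) and the two perplexity values (10 for the uniform digit model, about 1.73 for the skewed one) quoted in the diff.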