Fix typos <noupdate>
@@ -138,7 +138,7 @@
 \item \texttt{\char`\\ w} matches a single alphanumeric or underscore character (same as \texttt{[a-zA-Z0-9\_]}).
-\item \texttt{\char`\\ w} matches a single non-alphanumeric and non-underscore character (same as \texttt{[\textasciicircum\char`\\ w]}).
+\item \texttt{\char`\\ W} matches a single non-alphanumeric and non-underscore character (same as \texttt{[\textasciicircum\char`\\ w]}).
 \item \texttt{\char`\\ s} matches a single whitespace (space or tab).
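To make the corrected \W rule concrete, here is a quick check with Python's re module (the sample string is invented for the illustration):

    import re

    text = "user_42, ok!"
    # \w : a single alphanumeric or underscore character (as in [a-zA-Z0-9_])
    print(re.findall(r"\w", text))   # ['u', 's', 'e', 'r', '_', '4', '2', 'o', 'k']
    # \W : the complement of \w (as in [^\w])
    print(re.findall(r"\W", text))   # [',', ' ', '!']
    # \s : whitespace (in Python this also includes newlines, not only space/tab)
    print(re.findall(r"\s", text))   # [' ']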
@@ -150,7 +150,11 @@
 Operator to refer to previously matched substrings.
 
 \begin{example}
-In the regex \texttt{/the (.*)er they were, the \char`\\ 1er they will be/}, \texttt{\char`\\ 1} should match the same content matched by \texttt{(.*)}.
+In the regex:
+\begin{center}
+\texttt{/the (.*)er they were, the \char`\\ 1er they will be/}
+\end{center}
+\texttt{\char`\\ 1} should match the same content matched by \texttt{(.*)}.
 \end{example}
 \end{description}
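The backreference behaviour in this example can be verified directly; a small Python check (the test sentences are made up):

    import re

    pattern = r"the (.*)er they were, the \1er they will be"
    # \1 must repeat exactly what (.*) captured:
    m = re.search(pattern, "the bigger they were, the bigger they will be")
    print(m.group(1))   # 'bigg'
    # a different word after the comma -> no match:
    print(re.search(pattern, "the bigger they were, the faster they will be"))   # None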
@@ -236,7 +240,7 @@ Tokenization is done by two components:
 Given a training corpus $C$, BPE determines the vocabulary as follows:
 \begin{enumerate}
 \item Start with a vocabulary $V$ containing all the $1$-grams of $C$ and an empty set of merge rules $M$.
-\item While the desired size of the vocabulary has not been reached:
+\item Until the desired size of the vocabulary is reached:
 \begin{enumerate}
 \item Determine the pair of tokens $t_1 \in V$ and $t_2 \in V$ such that, among all the possible pairs, the $n$-gram $t_1 + t_2 = t_1t_2$ obtained by merging them is the most frequent in the corpus $C$.
 \item Add $t_1t_2$ to $V$ and the merge rule $t_1+t_2$ to $M$.
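The loop described in this hunk can be sketched in a few lines of Python. This is only a simplified illustration (the corpus is a plain list of words split into characters; real BPE implementations also handle pre-tokenization and word boundaries):

    from collections import Counter

    def learn_bpe(corpus_words, target_vocab_size):
        # Start from the 1-grams (characters) of the corpus and no merge rules.
        words = [list(w) for w in corpus_words]
        vocab = {ch for w in words for ch in w}
        merges = []
        # Until the desired vocabulary size is reached:
        while len(vocab) < target_vocab_size:
            # Count every adjacent pair of tokens in the corpus.
            pairs = Counter()
            for w in words:
                for t1, t2 in zip(w, w[1:]):
                    pairs[(t1, t2)] += 1
            if not pairs:
                break
            (t1, t2), _ = pairs.most_common(1)[0]   # most frequent pair
            merged = t1 + t2
            vocab.add(merged)          # add t1t2 to V
            merges.append((t1, t2))    # add the merge rule t1+t2 to M
            # Apply the new merge rule to the whole corpus.
            for i, w in enumerate(words):
                j, new_w = 0, []
                while j < len(w):
                    if j + 1 < len(w) and (w[j], w[j + 1]) == (t1, t2):
                        new_w.append(merged)
                        j += 2
                    else:
                        new_w.append(w[j])
                        j += 1
                words[i] = new_w
        return vocab, merges

    vocab, merges = learn_bpe(["low", "low", "lower", "lowest", "newest"], 12)
    print(merges)   # learned merge rules M, most frequent pairs first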
@@ -315,7 +319,7 @@ Tokenization is done by two components:
 \item[WordPiece] \marginnote{WordPiece}
 Similar to BPE with the addition of merge rules ranking and a special leading/tailing set of characters (usually \texttt{\#\#}) to identify subwords (e.g., \texttt{new\#\#}, \texttt{\#\#est} are possible tokens).
 
-\item[Unigram] \marginnote{Unigram}
+\item[Unigram tokenization] \marginnote{Unigram tokenization}
 Starts with a big vocabulary and removes tokens following a loss function.
 \end{description}
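For the ## convention mentioned above, a common way to segment a word at inference time is greedy longest-match-first against the vocabulary. The sketch below assumes the BERT-style variant, where only word-internal continuations carry the ## prefix, and uses a made-up toy vocabulary:

    def wordpiece_segment(word, vocab):
        # Greedy longest-match-first segmentation; "##" marks pieces that
        # continue a word rather than start it.
        tokens, start = [], 0
        while start < len(word):
            end = len(word)
            while end > start:
                piece = word[start:end] if start == 0 else "##" + word[start:end]
                if piece in vocab:
                    tokens.append(piece)
                    start = end
                    break
                end -= 1
            else:
                return ["[UNK]"]   # no known piece fits this position
        return tokens

    vocab = {"new", "low", "##est", "##er"}
    print(wordpiece_segment("newest", vocab))   # ['new', '##est']
    print(wordpiece_segment("lower", vocab))    # ['low', '##er']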
@@ -353,7 +357,7 @@ Tokenization is done by two components:
 Reduce terms to their stem.
 
 \begin{remark}
-Stemming is a simpler approach to lemmatization.
+Stemming is a simpler approach than lemmatization.
 \end{remark}
 
 \begin{description}
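As a quick illustration of the difference, NLTK (assuming it is installed, with the WordNet data downloaded) exposes both a Porter stemmer and a WordNet lemmatizer:

    # pip install nltk; then nltk.download("wordnet") once for the lemmatizer
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    print(PorterStemmer().stem("studies"))                    # 'studi' -- crude suffix stripping
    print(WordNetLemmatizer().lemmatize("studies", pos="n"))  # 'study' -- dictionary-based lemma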
@@ -381,10 +385,10 @@
 \end{remark}
 
 \item[Levenshtein distance] \marginnote{Levenshtein distance}
-Edit distance where:
+Minimum edit distance where:
 \begin{itemize}
-\item Insertions cost $1$;
-\item Deletions cost $1$;
+\item Insertions cost $1$,
+\item Deletions cost $1$,
 \item Substitutions cost $2$.
 \end{itemize}
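The costs listed above translate directly into the usual dynamic-programming recurrence; a minimal sketch:

    def levenshtein(a, b, ins_cost=1, del_cost=1, sub_cost=2):
        # d[i][j] = minimum cost to turn a[:i] into b[:j]
        d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(1, len(a) + 1):
            d[i][0] = i * del_cost
        for j in range(1, len(b) + 1):
            d[0][j] = j * ins_cost
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                d[i][j] = min(
                    d[i - 1][j] + del_cost,    # delete a[i-1]
                    d[i][j - 1] + ins_cost,    # insert b[j-1]
                    d[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else sub_cost),
                )
        return d[len(a)][len(b)]

    print(levenshtein("intention", "execution"))   # 8 with these costs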
@@ -74,19 +74,19 @@
 \textit{\textnormal{[...]} was called a ``stellar and versatile \texttt{acress} whose combination of sass and glamour has defined her \textnormal{[...]}''}
 }
 \end{center}
-By using the Corpus of Contemporary English (COCA), we can determine the following words as candidates:
+By using the Corpus of Contemporary American English (COCA), we can determine the following words as candidates:
 \[
 \texttt{actress} \cdot \texttt{cress} \cdot \texttt{caress} \cdot \texttt{access} \cdot \texttt{across} \cdot \texttt{acres}
 \]
 
 \begin{description}
-\item[Language model] By considering a language model without context, the priors are computed as $\prob{w} = \frac{\texttt{count}(w)}{\vert \texttt{COCA} \vert}$ (where $\vert \texttt{COCA} \vert = \num{404253213}$):
+\item[Language model] By considering a language model without context, the priors are computed as $\prob{w | \varepsilon} = \frac{\texttt{count}(w)}{\vert \texttt{COCA} \vert}$ (where $\vert \texttt{COCA} \vert = \num{404253213}$):
 \begin{table}[H]
 \centering
 \footnotesize
 \begin{tabular}{ccl}
 \toprule
-$w$ & $\texttt{count}(w)$ & $\prob{w}$ \\
+$w$ & $\texttt{count}(w)$ & $\prob{w | \varepsilon}$ \\
 \midrule
 \texttt{actress} & \num{9321} & $0.0000231$ \\
 \texttt{cress} & \num{220} & $0.000000544$ \\
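The priors in this table follow directly from the counts; a two-line check in Python reproduces the reported values:

    COCA_SIZE = 404_253_213
    for w, c in {"actress": 9321, "cress": 220}.items():
        print(w, c / COCA_SIZE)   # actress ~2.31e-05, cress ~5.44e-07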
@@ -125,15 +125,15 @@
 \footnotesize
 \begin{tabular}{cl}
 \toprule
-$w$ & $\prob{x | w} \prob{w}$ \\
+$w$ & $\prob{x | w} \prob{w | \varepsilon}$ \\
 \midrule
-\texttt{actress} & $2.7 \cdot 10^{9}$ \\
-\texttt{cress} & $0.00078 \cdot 10^{9}$ \\
-\texttt{caress} & $0.0028 \cdot 10^{9}$ \\
-\texttt{access} & $0.019 \cdot 10^{9}$ \\
-\texttt{across} & $2.8 \cdot 10^{9}$ \\
-\texttt{acres} & $1.02 \cdot 10^{9}$ \\
-\texttt{acres} & $1.09 \cdot 10^{9}$ \\
+\texttt{actress} & $2.7 \cdot 10^{-9}$ \\
+\texttt{cress} & $0.00078 \cdot 10^{-9}$ \\
+\texttt{caress} & $0.0028 \cdot 10^{-9}$ \\
+\texttt{access} & $0.019 \cdot 10^{-9}$ \\
+\texttt{across} & $2.8 \cdot 10^{-9}$ \\
+\texttt{acres} & $1.02 \cdot 10^{-9}$ \\
+\texttt{acres} & $1.09 \cdot 10^{-9}$ \\
 \bottomrule
 \end{tabular}
 \end{table}
@@ -165,8 +165,8 @@
 Finally, we have that:
 \[
 \begin{split}
-\prob{\texttt{versatile \underline{actress} whose} | \texttt{versatile acress whose}} &= 2.7 \cdot 210 \cdot 10^{-19} \\
-\prob{\texttt{versatile \underline{across} whose} | \texttt{versatile acress whose}} &= 2.8 \cdot 10^{-19} \\
+\prob{\texttt{versatile \underline{actress} whose} | \texttt{versatile acress whose}} &= (2.7 \cdot 10^{-9}) \cdot (210 \cdot 10^{-10}) \\
+\prob{\texttt{versatile \underline{across} whose} | \texttt{versatile acress whose}} &= (2.8 \cdot 10^{-9}) \cdot (1 \cdot 10^{-10}) \\
 \end{split}
 \]
 So \texttt{actress} is the most likely correction for \texttt{acress} in this model.
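The corrected right-hand sides can be multiplied out to confirm the conclusion; a minimal check with the numbers above:

    p_actress = (2.7e-9) * (210e-10)   # product of the two factors reported for 'actress'
    p_across  = (2.8e-9) * (1e-10)     # product of the two factors reported for 'across'
    print(p_actress, p_across)         # ~5.67e-17 vs 2.8e-19
    print(p_actress > p_across)        # True -> 'actress' is the preferred correction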
@@ -223,7 +223,7 @@
 
 \begin{description}
 \item[Estimating $\mathbf{N}$-gram probabilities]
-Consider the bigram case, the probability that a token $w_i$ follows $w_{i-1}$ can be determined by counting:
+Considering the bigram case, the probability that a token $w_i$ follows $w_{i-1}$ can be determined through counting:
 \[ \prob{w_i | w_{i-1}} = \frac{\texttt{count}(w_{i-1} w_i)}{\texttt{count}(w_{i-1})} \]
 \end{description}
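The counting estimate can be sketched directly from the formula; the toy corpus below is invented for the illustration:

    from collections import Counter

    corpus = "the cat sat on the mat the cat slept".split()
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))

    def p(w, w_prev):
        # P(w_i | w_{i-1}) = count(w_{i-1} w_i) / count(w_{i-1})
        return bigrams[(w_prev, w)] / unigrams[w_prev]

    print(p("cat", "the"))   # 2/3: "the" occurs 3 times, "the cat" occurs twice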
@@ -272,14 +272,14 @@
 Measure the quality of a model independently of the task.
 
 \item[Perplexity (\texttt{PP})] \marginnote{Perplexity}
-Probability-based metric based on the inverse probability of a sequence (usually the test set) normalized by the number of words:
+Probability-based metric based on the inverse probability of a sequence (usually using the test set) normalized by the number of words:
 \[
 \begin{split}
 \prob{w_{1..N}} &= \prod_{i} \prob{w_i | w_{1..i-1}} \\
 \texttt{PP}(w_{1..N}) &= \prob{w_{1..N}}^{-\frac{1}{N}} \in [1, +\infty]
 \end{split}
 \]
-A lower perplexity represents a generally better model.
+Generally, a lower perplexity represents a better model.
 
 \begin{example}
 For bigram models, perplexity is computed as:
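The perplexity formula is usually evaluated in log space to avoid underflow; a minimal sketch (the per-token probabilities are placeholders for whatever the model assigns):

    import math

    def perplexity(token_probs):
        # token_probs[i] = P(w_i | w_1..i-1); PP = P(w_1..N)^(-1/N)
        n = len(token_probs)
        return math.exp(-sum(math.log(p) for p in token_probs) / n)

    print(perplexity([0.25, 0.1, 0.05, 0.2]))   # larger when the model is more 'surprised'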
@@ -293,7 +293,7 @@
 \begin{remark}[Perplexity intuition]
 Perplexity can be seen as a measure of surprise of a language model when evaluating a sequence.
 
-Alternatively, it can also be seen as a weighted average branching factor (i.e., average number of possible next words that follow any word, accounting for their probabilities). For instance, consider a vocabulary of digits and a training corpus where every digit appears with uniform probability $0.1$. The perplexity of any sequence using a 1-gram model is:
+Alternatively, it can also be seen as a weighted average branching factor (i.e., average number of possible unique next words that follow any word, accounting for their probabilities). For instance, consider a vocabulary of digits and a training corpus where every digit appears with uniform probability $0.1$. The perplexity of any sequence using a 1-gram model is:
 \[ \texttt{PP}(w_{1..N}) = \left( 0.1^{N} \right)^{-\frac{1}{N}} = 10 \]
 Now consider a training corpus where $0$ occurs $91\%$ of the time and the other digits $1\%$ of the time. The perplexity of the sequence \texttt{0 0 0 0 0 3 0 0 0 0} is:
 \[ \texttt{PP}(\texttt{0 0 0 0 0 3 0 0 0 0}) = \left( 0.91^9 \cdot 0.01 \right)^{-\frac{1}{10}} \approx 1.73 \]
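The two values in this remark can be checked numerically (taking a length-10 sequence in the uniform case):

    print((0.1 ** 10) ** (-1 / 10))          # 10.0  (uniform digits)
    print((0.91 ** 9 * 0.01) ** (-1 / 10))   # ~1.73 (skewed corpus)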
@@ -349,7 +349,7 @@ There are two types of vocabulary systems:
 
 \subsection{Unseen sequences}
 
-Only for $n$-grams that occur enough times a representative probability can be estimated. For increasing values of $n$, the sparsity grows causing many unseen $n$-grams that produce a probability of $0$, with the risk of performing divisions by zero (e.g., perplexity) or zeroing probabilities (e.g., when applying the chain rule).
+Only for $n$-grams that occur enough times can a representative probability be estimated. For increasing values of $n$, the sparsity grows, causing many unseen $n$-grams that produce a probability of $0$, with the risk of performing divisions by zero (e.g., when computing perplexity) or zeroing probabilities (e.g., when applying the chain rule).
 
 \begin{description}
 \item[Laplace smoothing] \marginnote{Laplace smoothing}
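The diff ends at the Laplace smoothing item; since its formula is not visible in this hunk, the sketch below assumes the standard add-one formulation for bigrams (a pseudo-count of 1 for every bigram, with the vocabulary size added to the denominator):

    def laplace_bigram_prob(w_prev, w, bigram_counts, unigram_counts, vocab_size):
        # Unseen bigrams get a small non-zero probability instead of 0.
        return (bigram_counts.get((w_prev, w), 0) + 1) / (unigram_counts.get(w_prev, 0) + vocab_size)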