Fix typos <noupdate>

2025-01-07 11:21:29 +01:00
parent 8837682582
commit 420d39903e
6 changed files with 13 additions and 13 deletions

View File

@@ -46,7 +46,7 @@
\vec{s}^{(t)} \cdot \vec{h}^{(N)}
\end{bmatrix} \in \mathbb{R}^{N}
\]
-$\vec{e}^{(t)}$ is used to determine the attention distribution $\vec{\alpha}^{(t)}$ that is required to obtain the attention output $\vec{a}^{(t)}$ as the weighted encoder hidden states :
+$\vec{e}^{(t)}$ is used to determine the attention distribution $\vec{\alpha}^{(t)}$ that is required to obtain the attention output $\vec{a}^{(t)}$ as the weighted sum of the encoder hidden states:
\[
\begin{gathered}
\mathbb{R}^{N} \ni \vec{\alpha}^{(t)} = \texttt{softmax}(\vec{e}^{(t)}) \\
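
For reference, a minimal NumPy sketch of the dot-product attention step touched by this hunk: the score vector e^(t), the softmax distribution alpha^(t), and the attention output a^(t) as the weighted sum of the encoder hidden states. The array names (s_t, H) and the toy shapes are illustrative, not part of the notes.

import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D vector of scores.
    z = np.exp(x - np.max(x))
    return z / z.sum()

def dot_product_attention(s_t, H):
    # s_t: decoder state, shape (d,); H: encoder hidden states, shape (N, d).
    e_t = H @ s_t            # scores e^(t), one dot product per encoder state
    alpha_t = softmax(e_t)   # attention distribution alpha^(t)
    a_t = alpha_t @ H        # attention output a^(t): weighted sum of encoder states
    return a_t, alpha_t

# Toy usage: 4 encoder hidden states of dimension 3.
a_t, alpha_t = dot_product_attention(np.ones(3), np.random.randn(4, 3))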

View File

@@ -71,7 +71,7 @@
\hat{c} &= \arg\max_{c \in C} \prob{c | d} \\
&= \arg\max_{c \in C} \underbrace{\prob{d | c}}_{\text{likelihood}} \underbrace{\prob{c}}_{\text{prior}} \\
&= \arg\max_{c \in C} \prob{w_1, \dots, w_n | c} \prob{c} \\
-&= \arg\max_{c \in C} \prod_{i} \prob{w_i | c} \prob{c} \\
+&\approx \arg\max_{c \in C} \prod_{i} \prob{w_i | c} \prob{c} \\
&= \arg\max_{c \in C} \log\prob{c} + \sum_{i} \log\prob{w_i | c} \\
\end{split}
\]
@@ -82,7 +82,7 @@
\qquad
\prob{w_i | c} = \frac{\texttt{count}(w_i, c)}{\sum_{v \in V} \texttt{count}(v, c)}
\]
-where $N_c$ is the number of documents with class $c$ and $\texttt{count}(w, c)$ counts the occurrences of the word $w$ in the training samples with class $c$.
+where $N_c$ is the number of documents of class $c$ and $\texttt{count}(w, c)$ counts the occurrences of the word $w$ in the training samples of class $c$.
\begin{remark}
Laplace smoothing is used to avoid zero probabilities.
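
To make the estimates in this hunk concrete, a small Python sketch of the smoothed counts: the prior uses N_c over the number of documents, and the likelihood uses count(w, c) with add-one (Laplace) smoothing so unseen words do not get zero probability. The toy corpus and all variable names are made up for the example.

from collections import Counter

# Toy training set: (document tokens, class label). Purely illustrative.
docs = [(["good", "great", "fun"], "pos"),
        (["bad", "boring"], "neg"),
        (["good", "bad"], "pos")]

vocab = {w for tokens, _ in docs for w in tokens}
classes = {c for _, c in docs}

n_c = Counter(c for _, c in docs)               # N_c: documents of class c
word_counts = {c: Counter() for c in classes}   # count(w, c)
for tokens, c in docs:
    word_counts[c].update(tokens)

def prior(c):
    return n_c[c] / len(docs)

def likelihood(w, c, alpha=1.0):
    # Laplace smoothing: add alpha to every count to avoid zero probabilities.
    total = sum(word_counts[c].values())
    return (word_counts[c][w] + alpha) / (total + alpha * len(vocab))

print(prior("pos"), likelihood("boring", "pos"))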
@@ -226,7 +226,7 @@ Naive Bayes has the following properties:
\end{cases}
\]
By applying a log-transformation and inverting the sign, this corresponds to the cross-entropy loss in the binary case:
-\[ \mathcal{L}_{\text{BCE}}(\hat{y}, y) = -\log \prob{y | x} = -[y \log(\hat{y}) + (1-y)\log(1-\hat{y})] \]
+\[ \mathcal{L}_{\text{BCE}}(\hat{y}, y) = -\log(\prob{y | x}) = -[y \log(\hat{y}) + (1-y)\log(1-\hat{y})] \]
\item[Optimization]
As cross-entropy is convex, SGD is well suited to find the parameters $\vec{\theta}$ of a logistic regressor $f$ over batches of $m$ examples by solving:
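
A compact sketch of the binary cross-entropy loss above and of one SGD step for a logistic regressor over a batch of m examples; the learning rate, the toy batch, and the function names are assumptions of the example, not taken from the notes.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(y_hat, y, eps=1e-12):
    # L_BCE(y_hat, y) = -[y log(y_hat) + (1 - y) log(1 - y_hat)], averaged over the batch.
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return float(-(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)).mean())

def sgd_step(theta, X, y, lr=0.1):
    # Gradient of the mean BCE for logistic regression: X^T (y_hat - y) / m.
    y_hat = sigmoid(X @ theta)
    return theta - lr * X.T @ (y_hat - y) / len(y)

# Toy batch of m = 4 examples with 3 features each.
X = np.random.randn(4, 3)
y = np.array([0.0, 1.0, 1.0, 0.0])
theta = sgd_step(np.zeros(3), X, y)
print(bce_loss(sigmoid(X @ theta), y))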
@@ -255,7 +255,7 @@ Logistic regression has the following properties:
\end{itemize}
\begin{remark}
-Logistic regression is also a good baseline when experimenting with text classifications.
+Logistic regression is also a good baseline when experimenting with text classification.
As they are lightweight to train, it is a good idea to test both naive Bayes and logistic regression to determine the best baseline for other experiments.
\end{remark}

View File

@@ -60,7 +60,7 @@
\end{remark}
\item[Beam search] \marginnote{Beam search}
-Given a beam width $k$, perform a breadth-first search keeping at each branching level the top-$k$ tokens based on the probability of that sequence computes as:
+Given a beam width $k$, perform a breadth-first search keeping at each branching level the top-$k$ tokens based on the probability of that sequence computed as:
\[ \log\left( \prob{y \mid x} \right) = \sum_{i=1}^{t} \log\left( \prob{ y_i \mid x, y_1, \dots, y_{i-1} } \right) \]
\begin{example}
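
Beam search as described in this hunk can be sketched as follows: at each step every partial sequence is extended, scored by its summed log-probability, and only the top-k hypotheses are kept. The next_log_probs callback stands in for the model's conditional distribution and, like the toy distribution below, is an assumption of the example.

import math

def beam_search(next_log_probs, k, max_len, eos="</s>"):
    # Each beam is (tokens, summed log-probability of the sequence).
    beams = [([], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos:
                candidates.append((tokens, score))   # finished hypotheses are carried over
                continue
            for tok, logp in next_log_probs(tokens).items():
                candidates.append((tokens + [tok], score + logp))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:k]
    return beams

# Toy model: the same distribution regardless of the prefix.
dist = {"a": math.log(0.5), "b": math.log(0.3), "</s>": math.log(0.2)}
print(beam_search(lambda prefix: dist, k=2, max_len=3))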
@@ -192,7 +192,7 @@
Architecture that produces contextual embeddings by considering both left-to-right and right-to-left context.
\begin{remark}
-This architecture does feature extraction and is more suited for classification tasks.
+This architecture performs feature extraction and is more suited for classification tasks.
\end{remark}
\begin{description}
@@ -206,7 +206,7 @@
\end{description}
\item[Contextual embedding] \marginnote{Contextual embedding}
-Represent the meaning of word instances (i.e., dynamically depending on the surroundings).
+Represent the meaning of word instances (i.e., in a dynamic manner depending on the surroundings).
\begin{remark}[Sequence embedding]
Encoders usually have a classifier token (e.g., \texttt{[CLS]}) to model the whole sentence.
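
As a small illustration of the sequence-embedding remark, a sketch that collapses per-token contextual embeddings into a single sentence vector, either by taking the classifier token or by mean pooling; placing the classifier token at index 0 is an assumption of the example.

import numpy as np

def sequence_embedding(token_embeddings, use_cls=True):
    # token_embeddings: (seq_len, d) contextual embeddings; classifier token assumed at index 0.
    if use_cls:
        return token_embeddings[0]
    return token_embeddings.mean(axis=0)   # mean pooling as an alternative

sentence_vec = sequence_embedding(np.random.randn(5, 8))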

View File

@@ -61,7 +61,7 @@
\item Fine-tune the language model (i.e., train the policy) using an RL algorithm (e.g., PPO) and the learned reward model.
-Given a prompt $x$ and an answer $y$, the reward $r$ used for RL update is computed as:
+Given a prompt $x$ and an answer $y$, the reward $r$ used for the RL update is computed as:
\[ r = r_\theta(y \mid x) - \lambda_\text{KL} D_\text{KL}(\pi_{\text{PPO}}(y \mid x) \Vert \pi_{\text{base}}(y \mid x)) \]
where:
\begin{itemize}
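
A toy sketch of the KL-penalized reward in this hunk. Estimating the KL term from the sampled answer as the log-probability ratio between the PPO policy and the base model is a common practical shortcut; it, together with every variable name and number below, is an assumption of the example rather than something stated in the notes.

def rl_reward(reward_model_score, logp_ppo, logp_base, kl_coef=0.1):
    # r = r_theta(y | x) - lambda_KL * D_KL(pi_PPO || pi_base),
    # with the KL divergence estimated from the sampled answer y as
    # log pi_PPO(y | x) - log pi_base(y | x).
    return reward_model_score - kl_coef * (logp_ppo - logp_base)

# Toy usage: the reward model likes the answer; the policy drifted slightly from the base model.
print(rl_reward(reward_model_score=2.3, logp_ppo=-12.0, logp_base=-12.5))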

View File

@@ -207,7 +207,7 @@
Provide an initial hidden state to the RNN (e.g., speech-to-text).
\end{description}
-\item[Sequence labelling] \marginnote{Sequence labelling}
+\item[Sequence labeling] \marginnote{Sequence labeling}
Assign a class to each input token (e.g., POS-tagging, named-entity recognition, structure prediction, \dots).
\item[Sequence classification] \marginnote{Sequence classification}

View File

@@ -406,7 +406,7 @@
\item[Neural language modeling] \marginnote{Neural language modeling}
Use a neural network to predict the next word $w_{n+1}$ given an input sequence $w_{1..n}$. The general flow is the following:
\begin{enumerate}
-\item Encode the input words into one-hot vectors ($\mathbb{R}^{|V| \times n}$).
+\item Encode the input words into one-hot vectors ($\mathbb{N}^{|V| \times n}$).
\item Project the input vectors with an embedding matrix $\matr{E} \in \mathbb{R}^{d \times |V|}$ that encodes them into $d$-dimensional vectors.
\item Pass the embedding into the hidden layers.
\item The final layer is a probability distribution over the vocabulary ($\mathbb{R}^{|V| \times 1}$).
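
The four-step flow in this hunk, sketched end to end with NumPy: one-hot encode, project with an embedding matrix, pass through a hidden layer, and output a distribution over the vocabulary. All dimensions, the single tanh hidden layer, and the array names are illustrative assumptions.

import numpy as np

V, d, h, n = 10, 4, 8, 3                  # vocab size, embedding dim, hidden dim, context length
rng = np.random.default_rng(0)
E = rng.normal(size=(d, V))               # embedding matrix E in R^{d x |V|}
W = rng.normal(size=(h, d * n))           # hidden layer weights
U = rng.normal(size=(V, h))               # output projection over the vocabulary

def softmax(z):
    z = np.exp(z - z.max())
    return z / z.sum()

def next_word_distribution(word_ids):
    X = np.eye(V)[:, word_ids]            # 1. one-hot vectors, shape (|V|, n)
    emb = (E @ X).reshape(-1)             # 2. d-dimensional embeddings, concatenated
    hidden = np.tanh(W @ emb)             # 3. hidden layer
    return softmax(U @ hidden)            # 4. probability distribution over the vocabulary

print(next_word_distribution([1, 5, 7]).sum())   # sums to 1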
@@ -478,7 +478,7 @@
Use a binary logistic regressor as classifier. The two classes are:
\begin{itemize}
\item Context words within the context window (positive label).
-\item Words randomly sampled (negative label).
+\item Randomly sampled words (negative label).
\end{itemize}
The probabilities can be computed as:
\[