Fix errors and typos <noupdate>

2024-12-23 19:18:23 +01:00
parent f72d4164d2
commit d0229c69dc
11 changed files with 85 additions and 81 deletions

Binary image file changed (371 KiB).

View File

@ -34,16 +34,16 @@
The overall flow is the following:
\begin{enumerate}
\item The encoder computes its hidden states $\vec{h}^{(1)}, \dots, \vec{h}^{(N)} \in \mathbb{R}^{h}$.
\item The decoder processes the input tokens one at the time. Its hidden state is initialized with $\vec{h}^{(N)}$. Consider the token in position $t$, the output is determined as follows:
\item The decoder processes the input tokens one at a time, beginning with a \texttt{<start>} token. Its hidden state is initialized with $\vec{h}^{(N)}$. Considering the token at position $t$, the output is determined as follows:
\begin{enumerate}
\item The decoder outputs the hidden state $\vec{s}^{(t)}$.
\item Attention scores $\vec{e}^{(t)}$ are determined as the dot products between $\vec{s}^{(t)}$ and each $\vec{h}^{(i)}$:
\[
\vec{e}^{(t)} =
\begin{bmatrix}
\vec{s}^{(t)} \odot \vec{h}^{(1)} &
\vec{s}^{(t)} \cdot \vec{h}^{(1)} &
\cdots &
\vec{s}^{(t)} \odot \vec{h}^{(N)}
\vec{s}^{(t)} \cdot \vec{h}^{(N)}
\end{bmatrix} \in \mathbb{R}^{N}
\]
$\vec{e}^{(t)}$ is used to determine the attention distribution $\vec{\alpha}^{(t)}$, which is required to obtain the attention output $\vec{a}^{(t)}$ as the weighted sum of the encoder hidden states:
@ -104,7 +104,7 @@
\begin{example}[Character-aware neural LM]
\phantom{}\\
\begin{minipage}{0.6\linewidth}
RNN-LM that works on a character level:
RNN-LM that works on the character level:
\begin{itemize}
\item Given a token, each of its characters is embedded and the embeddings are concatenated.
\item Convolutions are used to refine the representation.
@ -147,8 +147,8 @@
Then, the attention weights $\alpha_{i,j}$ between two embeddings $\vec{x}_i$ and $\vec{x}_j$ are computed as:
\[
\begin{gathered}
\texttt{scores}(\vec{x}_i, \vec{x}_j) = \frac{\vec{q}_i \vec{k}_j}{\sqrt{d_k}} \\
\alpha_{i,j} = \texttt{softmax}_j\left( \left[\texttt{scores}(\vec{x}_i, \vec{x}_1), \dots, \texttt{scores}(\vec{x}_i, \vec{x}_T)\right] \right) \\
\texttt{score}(\vec{x}_i, \vec{x}_j) = \frac{\vec{q}_i \vec{k}_j}{\sqrt{d_k}} \\
\alpha_{i,j} = \texttt{softmax}_j\left( \left[\texttt{score}(\vec{x}_i, \vec{x}_1), \dots, \texttt{score}(\vec{x}_i, \vec{x}_T)\right] \right) \\
\end{gathered}
\]
The output $\vec{a}_i \in \mathbb{R}^{1 \times d_v}$ is a weighted sum of the values of each token:
@ -163,12 +163,12 @@
\item[Causal attention] \marginnote{Causal attention}
Self-attention mechanism where only past tokens can be used to determine the representation of a token at a specific position. It is computed by modifying the standard self-attention as:
Self-attention mechanism where only past tokens can be used to determine the representation of a token at a specific position. It is computed by modifying the standard self-attention as follows:
\[
\begin{gathered}
\forall j \leq i: \texttt{scores}(\vec{x}_i, \vec{x}_j) = \frac{\vec{q}_i \vec{k}_j}{\sqrt{d_k}} \qquad
\forall j > i: \texttt{scores}(\vec{x}_i, \vec{x}_j) = -\infty \\
\alpha_{i,j} = \texttt{softmax}_j\left( \left[\texttt{scores}(\vec{x}_i, \vec{x}_1), \dots, \texttt{scores}(\vec{x}_i, \vec{x}_T)\right] \right) \\
\forall j \leq i: \texttt{score}(\vec{x}_i, \vec{x}_j) = \frac{\vec{q}_i \vec{k}_j}{\sqrt{d_k}} \qquad
\forall j > i: \texttt{score}(\vec{x}_i, \vec{x}_j) = -\infty \\
\alpha_{i,j} = \texttt{softmax}_j\left( \left[\texttt{score}(\vec{x}_i, \vec{x}_1), \dots, \texttt{score}(\vec{x}_i, \vec{x}_T)\right] \right) \\
\vec{a}_i = \sum_{t: t \leq i} \alpha_{i,t} \vec{v}_t
\end{gathered}
\]
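As a rough illustration, a minimal NumPy sketch of causal self-attention (assuming the query, key, and value matrices \texttt{Q}, \texttt{K}, \texttt{V} have already been obtained from the learned projections; names and shapes are illustrative, not those of the notes) is the following:
\begin{verbatim}
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)    # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def causal_self_attention(Q, K, V):
    # Q, K: (T, d_k); V: (T, d_v)
    T, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)          # score(x_i, x_j) for every pair
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[mask] = -np.inf                   # forbid attending to future tokens (j > i)
    alpha = softmax(scores)                  # attention distribution, row-wise
    return alpha @ V                         # a_i = sum_{j <= i} alpha_{i,j} v_j
\end{verbatim}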
@ -181,11 +181,11 @@
\end{description}
\subsection{Components}
\subsection{Embeddings}
\begin{description}
\item[Input embedding] \marginnote{Input embedding}
The input is tokenized using standard tokenizers (e.g., BPE, SentencePiece, \dots). Each token is encoded using a learned embedding matrix.
The input is tokenized using standard tokenizers (e.g., BPE, SentencePiece, \dots). Each token is then encoded using a learned embedding matrix.
\begin{figure}[H]
\centering
@ -203,7 +203,12 @@
\centering
\includegraphics[width=0.45\linewidth]{./img/_positional_encoding.pdf}
\end{figure}
\end{description}
\subsection{Transformer block}
\begin{description}
\item[Transformer block] \marginnote{Transformer block}
Module with the same input and output dimensionality (i.e., allows stacking multiple blocks) composed of:
\begin{descriptionlist}
@ -221,13 +226,13 @@
where the hidden dimension $d_\text{ff}$ is usually larger than $d_\text{model}$.
\item[Normalization layer]
Applies token-wise normalization (i.e., layer norm) to help training stability.
Applies layer normalization (i.e., each token representation is normalized independently of the rest of the batch) to help training stability.
\item[Residual connection]
Helps to propagate information during training.
\begin{remark}[Residual stream]
An interpretation of residual connections is the residual stream where the input token in enhanced by the output of multi-head attention and the feedforward network.
An interpretation of residual connections is that of a residual stream, where the input token representation is enhanced by the output of multi-head attention and the feedforward network.
\begin{figure}[H]
\centering
@ -242,7 +247,7 @@
\caption{Overall attention block}
\end{figure}
\item[Language modelling head] \marginnote{Language modelling head}
\item[Language modeling head] \marginnote{Language modeling head}
Takes as input the output of the stack of transformer blocks corresponding to a token and outputs a distribution over the vocabulary.
\begin{figure}[H]

View File

@ -9,7 +9,7 @@
\begin{itemize}
\item The holder of attitude (i.e., the source).
\item The target of attitude (i.e., the aspect).
\item The type of attitude (e.g., positive and negative).
\item The type of attitude (e.g., positive or negative).
\item The text containing the attitude.
\end{itemize}
@ -43,12 +43,12 @@
\end{example}
\item[Supervised machine learning] \marginnote{Supervised machine learning}
Use a training set of $N$ labeled data $\{ (d_i, c_i) \}$ to fit a classifier.
Use a training set of $N$ labeled document-class data points $\{ (d_i, c_i) \}$ to fit a classifier.
An ML model can be:
\begin{descriptionlist}
\item[Generative] Informally, it learns the distribution of the data.
\item[Discriminative] Informally, it learns to exploit the features to determine the class.
\item[Generative] Informally, it learns the distribution of the data (i.e., $\prob{d_i | c_i}$).
\item[Discriminative] Informally, it learns to exploit the features to determine the class (i.e., $\prob{c_i | d_i}$).
\end{descriptionlist}
\end{descriptionlist}
\end{description}
@ -58,9 +58,9 @@
\begin{description}
\item[Bag-of-words (BoW)] \marginnote{Bag-of-words (BoW)}
Represents a document using the frequencies of its words.
Representation of a document using the frequency of its words.
Given a vocabulary $V$ and a document $d$, the bag-of-words embedding of $d$ is a vector in $\mathbb{N}^{\vert V \vert}$ where the $i$-th position contains the number of occurrences of the $i$-th token in $d$.
Given a vocabulary $V$ and a document $d$, the bag-of-words embedding of $d$ is a vector in $\mathbb{N}^{\vert V \vert}$ where the $i$-th position contains the number of occurrences of the $i$-th token of $V$ in $d$.
\item[Multinomial naive Bayes classifier] \marginnote{Multinomial naive Bayes classifier}
Generative probabilistic classifier based on the assumption that features are independent given the class.
@ -76,13 +76,13 @@
\end{split}
\]
Given a training set $D$ with $N_c$ classes and a vocabulary $V$, $\prob{w_i | c}$ and $\prob{c}$ are determined during training by maximum likelihood estimation as follows:
Given a training set $D$ and a vocabulary $V$, $\prob{w_i | c}$ and $\prob{c}$ are determined during training by maximum likelihood estimation as follows:
\[
\prob{c} = \frac{N_c}{\vert D \vert}
\qquad
\prob{w_i | c} = \frac{\texttt{count}(w_i, c)}{\sum_{v \in V} \texttt{count}(v, c)}
\]
where $\texttt{count}(w, c)$ counts the occurrences of the word $w$ in the training samples with class $c$.
where $N_c$ is the number of documents with class $c$ and $\texttt{count}(w, c)$ counts the occurrences of the word $w$ in the training samples with class $c$.
\begin{remark}
Laplace smoothing is used to avoid zero probabilities.
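For instance, with add-one (Laplace) smoothing the estimate becomes:
\[ \prob{w_i | c} = \frac{\texttt{count}(w_i, c) + 1}{\vert V \vert + \sum_{v \in V} \texttt{count}(v, c)} \]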
@ -149,7 +149,7 @@ Possible optimizations for naive Bayes applied to sentiment analysis are the fol
\end{example}
\item[Parse tree]
Build a tree to encode the sentiment and interactions of the words. By propagating the sentiments bottom-up it is possible to determine the overall sentiment of the sequence.
Build a tree to encode the sentiment and interactions of the words. By propagating the sentiments bottom-up, it is possible to determine the overall sentiment of the sequence.
\begin{example}
The parse tree for the sentence ``\texttt{This film doesn't care about cleverness, wit or any other kind of intelligent humor.}'' is the following:
@ -208,11 +208,11 @@ Naive Bayes has the following properties:
\item[Binary logistic regression] \marginnote{Binary logistic regression}
Discriminative probabilistic model that computes the joint distribution $\prob{c | d}$ of a class $c$ and a document $d$.
Discriminative probabilistic model that computes the conditional distribution $\prob{c | d}$ of the class $c$ given the document $d$.
Given the input features $\vec{x} = [x_1, \dots, x_n]$, logistic regression computes the following:
\[
\sigma\left( \sum_{i=1}^{n} w_i x_i \right) + b = \sigma(\vec{w}\vec{x} + b)
\sigma\left( \sum_{i=1}^{n} w_i x_i + b \right) = \sigma(\vec{w}\vec{x} + b)
\]
where $\sigma$ is the sigmoid function.
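In particular, $\sigma(z) = \frac{1}{1 + e^{-z}}$ squashes the score $\vec{w}\vec{x} + b$ into $(0, 1)$, so that the output can be read as the probability of the positive class.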
@ -233,8 +233,8 @@ Naive Bayes has the following properties:
\[ \arg\min_{\vec{\theta}} \sum_{i=1}^{m} \mathcal{L}_\text{BCE}(\hat{y}^{(i)}, f(x^{(i)}; \vec{\theta})) + \alpha \mathcal{R}(\vec{\theta}) \]
where $\alpha$ is the regularization factor and $\mathcal{R}(\vec{\theta})$ is the regularization term. Typical regularization approaches are:
\begin{descriptionlist}
\item[Ridge regression (L1)] $\mathcal{R}(\vec{\theta}) = \Vert \vec{\theta} \Vert^2_2 = \sum_{j=1}^{n} \vec{\theta}_j^2$.
\item[Lasso regression (L2)] $\mathcal{R}(\vec{\theta}) = \Vert \vec{\theta} \Vert_1 = \sum_{j=1}^{n} \vert \vec{\theta}_j \vert$.
\item[Lasso regression (L1)] $\mathcal{R}(\vec{\theta}) = \Vert \vec{\theta} \Vert_1 = \sum_{j=1}^{n} \vert \vec{\theta}_j \vert$.
\item[Ridge regression (L2)] $\mathcal{R}(\vec{\theta}) = \Vert \vec{\theta} \Vert^2_2 = \sum_{j=1}^{n} \vec{\theta}_j^2$.
\end{descriptionlist}
\end{description}
@ -242,7 +242,7 @@ Naive Bayes has the following properties:
Extension of logistic regression to the multi-class case. The conditional probability becomes $\prob{y = c | x}$ and softmax is used in place of the sigmoid.
Cross-entropy is extended over the classes $C$:
\[ \mathcal{L}_\text{CE}(\hat{y}, y) = - \sum_{c \in C} \mathbbm{1}\{y = c\} \log \prob{y = c | x} \]
\[ \mathcal{L}_\text{CE}(\hat{y}, y) = - \sum_{c \in C} \mathbbm{1}\{y = c\} \log(\prob{y = c | x}) \]
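Assuming one weight vector $\vec{w}_c$ and bias $b_c$ per class (a notation not fixed above), the class probabilities are obtained through the softmax:
\[ \prob{y = c | x} = \frac{\exp(\vec{w}_c \vec{x} + b_c)}{\sum_{c' \in C} \exp(\vec{w}_{c'} \vec{x} + b_{c'})} \]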
\end{description}

View File

@ -34,9 +34,9 @@
\end{enumerate}
\item[ALERT benchmark] \marginnote{ALERT benchmark}
Benchmark to test the safeness of an LLM based on 32 risk categories. The testing data are created as follows:
Benchmark to test the safeness of an LLM based on 32 risk categories. The testing data were created as follows:
\begin{enumerate}
\item Filter the ``\textit{Helpfulness \& Harmlessness-RLHF}'' dataset of \textit{Anthropic} by considering for each example the first prompt and red team attacks only.
\item Filter the ``\textit{Helpfulness \& Harmlessness-RLHF}'' dataset of \textit{Anthropic} by considering for each example the first prompt and red team (i.e., malicious) attacks only.
\item Use templates to automatically generate additional prompts.
\item Augment the prompts by formatting them as adversarial attacks. Examples of attacks are:
\begin{descriptionlist}

View File

@ -60,7 +60,7 @@
\end{remark}
\item[Beam search] \marginnote{Beam search}
Given a beam width $k$, perform a breadth-first search keeping at each branching level the top-$k$ tokens based on the probability of that sequence:
Given a beam width $k$, perform a breadth-first search keeping at each branching level the top-$k$ hypotheses based on the probability of the corresponding sequence, computed as:
\[ \log\left( \prob{y \mid x} \right) = \sum_{i=1}^{t} \log\left( \prob{ y_i \mid x, y_1, \dots, y_{i-1} } \right) \]
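As an illustrative sketch only (the interface of \texttt{next\_token\_logprobs} is assumed, not taken from these notes), beam search without length normalization can be written as:
\begin{verbatim}
import heapq

def beam_search(next_token_logprobs, bos, eos, k=4, max_len=50):
    # Each hypothesis is a pair (cumulative log-probability, token sequence).
    beams = [(0.0, [bos])]
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            if seq[-1] == eos:                  # finished hypotheses are kept as-is
                candidates.append((logp, seq))
                continue
            # next_token_logprobs(seq): dict mapping token -> log p(token | seq)
            for tok, lp in next_token_logprobs(seq).items():
                candidates.append((logp + lp, seq + [tok]))
        # Keep only the top-k sequences at this branching level.
        beams = heapq.nlargest(k, candidates, key=lambda c: c[0])
        if all(seq[-1] == eos for _, seq in beams):
            break
    return max(beams, key=lambda c: c[0])
\end{verbatim}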
\begin{example}
@ -93,7 +93,7 @@
\end{remark}
\item[Temperature sampling]
Skew the distribution to emphasize the most likely words and decrease the probability of less likely words. Given the logits $\vec{u}$ and the temperature $\tau$, the output distribution $\vec{y}$ is determined as:
Skew the distribution to emphasize the most likely words and decrease the probability of less likely ones. Given the logits $\vec{u}$ and the temperature $\tau$, the output distribution $\vec{y}$ is determined as:
\[ \vec{y} = \texttt{softmax}\left( \frac{\vec{u}}{\tau} \right) \]
where:
\begin{itemize}
@ -171,11 +171,11 @@
Specialize a model by adding new learnable parameters.
\begin{description}
\item[Parameter-efficient fine-tuning (PEFT)] \marginnote{Parameter-efficient fine-tuning (PEFT)}
Continue training a selected subset of parameters (e.g., LoRA \Cref{sec:lora}).
\item[Task-specific fine-tuning] \marginnote{Task-specific fine-tuning}
Add a new trainable head on top of the model.
\item[Parameter-efficient fine-tuning (PEFT)] \marginnote{Parameter-efficient fine-tuning (PEFT)}
Continue training a selected subset of parameters (e.g., LoRA \Cref{sec:lora}).
\end{description}
\item[Supervised fine-tuning] \marginnote{Supervised fine-tuning}
@ -215,7 +215,7 @@
\begin{example}[Word sense disambiguation]
Task of determining the sense of each word of a sequence. Senses usually come from an existing ontology (e.g., WordNet). An approach to solve the problem is the following:
\begin{enumerate}
\item Compute the embedding $\vec{v}_i$ of words using a pre-trained encoder (e.g., BERT).
\item Compute the embeddings $\vec{v}_i$ of the words using a pre-trained encoder (e.g., BERT).
\item Represent the embedding of a sense as the average of the embeddings of the tokens with that sense:
\[ \vec{v}_s = \frac{1}{n} \sum_i \vec{v}_i \]
\item Predict the sense of a word $\vec{t}$ as:
@ -240,7 +240,7 @@
\subsection{Pre-training}
\begin{description}
\item[Masked language modelling] \marginnote{Masked language modelling}
\item[Masked language modeling] \marginnote{Masked language modeling}
Task of predicting missing or corrupted tokens in a sequence.
\begin{remark}
@ -287,8 +287,7 @@
\item[Fine-tuning for sequence labeling]
Add a classification head on top of each token. A conditional random field (CRF) layer can also be added to produce globally more coherent tags.
\begin{description}
\item[Named entity recognition (NER)] \marginnote{Named entity recognition (NER)}
\begin{example}[Named entity recognition (NER)]
Task of assigning to each word of a sequence its entity class. NER taggers usually also capture concepts spanning across multiple tokens. To achieve this, additional information is provided with the entity class:
\begin{descriptionlist}
\item[Begin] Starting token of a concept.
@ -308,7 +307,7 @@
The entity (and thus possibly a multi-token span of text) is the atomic unit for NER metrics.
\end{remark}
\end{description}
\end{description}
\end{example}
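\begin{remark}
As a purely illustrative sentence, using the usual \texttt{B-}/\texttt{I-}/\texttt{O} prefixes together with the entity classes, ``\texttt{Marie Curie was born in Warsaw}'' would be tagged as \texttt{B-PER I-PER O O O B-LOC}.
\end{remark}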
\end{description}
@ -356,8 +355,8 @@
Given the sequence:
\[ \texttt{<bos> thank you \underline{for inviting} me to your party \underline{last} week <eos>} \]
Some spans of text are masked with placeholder tokens as follows:
\[ \texttt{<bos> thank you <X> me to your party <Y> week <eos>} \]
\[ \texttt{<bos> thank you \underline{<X>} me to your party \underline{<Y>} week <eos>} \]
The masked sequence is passed through the encoder, while the decoder has to predict the masked tokens:
\[ \texttt{<bos> <X> for inviting <Y> last <Z> <eos>} \]
\[ \texttt{<bos> <X> \underline{for inviting} <Y> \underline{last} <Z> <eos>} \]
\end{example}
\end{description}

View File

@ -21,7 +21,7 @@
\end{figure}
\begin{remark}
If performed correctly, after performing instruction tuning on a model, it is able to also solve tasks that were not present in the tuning dataset.
If done correctly, after instruction tuning, the model should also be able to solve tasks that were not present in the tuning dataset.
\end{remark}
\begin{figure}[H]

View File

@ -44,7 +44,7 @@
\begin{description}
\item[Mixture of experts] \marginnote{Mixture of experts}
Specialize smaller models on subset of data and train a router to forward the input to the correct expert.
Specialize smaller models on subsets of data and train a router to forward the input to the correct expert.
\begin{remark}
This approach can be easily deployed on distributed systems.
@ -79,7 +79,7 @@
\[
\forall t_i \in V_\text{dom}: \matr{E}_\text{dom}(t_i) = \frac{1}{|\mathcal{T}_\text{s}(t_i)|} \sum_{t_j \in \mathcal{T}_\text{s}(t_i)}\matr{E}_\text{s}(t_j)
\]
In other words, each token in $V_\text{dom}$ is encoded as the average of embeddings of the tokens that compose it in the starting embedding model (if the token appear in both vocabularies, the embedding is the same).
In other words, each token in $V_\text{dom}$ is encoded as the average of the embeddings of the tokens that compose it in the starting embedding model (if the token appears in both vocabularies, the embedding is the same).
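As a purely hypothetical illustration: if the domain token \texttt{cardiomyopathy} is split by the source tokenizer into \texttt{cardio} and \texttt{myopathy}, then $\matr{E}_\text{dom}(\texttt{cardiomyopathy}) = \frac{1}{2}\left( \matr{E}_\text{s}(\texttt{cardio}) + \matr{E}_\text{s}(\texttt{myopathy}) \right)$.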
\end{description}
\end{description}
@ -125,7 +125,7 @@
\begin{remark}
Some studies show that an explanation for in-context learning is that causal attention has the same effect as gradient updates (i.e., the left part of the prompt influences the right part).
Another possible explanation is based on the concept of induction heads which are attention heads that specialize in predicting repeated sequences (i.e., in-context learning is seen as the capability of imitating past data). Ablation studies show that by identifying and removing induction heads, in-context learning performance of a model drastically drops.
Another possible explanation is based on the concept of induction heads, which are attention heads that specialize in predicting repeated sequences (i.e., in-context learning is seen as the capability of imitating past data). Ablation studies show that by identifying and removing induction heads, the in-context learning performance of a model drastically drops.
\begin{figure}[H]
\centering
@ -144,7 +144,7 @@
\end{figure}
\item[Chain-of-thought prompting] \marginnote{Chain-of-thought prompting}
Provide in the prompt examples of reasoning to make the model provide the output step-by-step.
Provide in the prompt examples of reasoning to make the model generate the output step-by-step\footnote{\includegraphics[width=1.2cm]{img/meme.jpg}}.
\begin{remark}
Empirical results show that the most effective chain-of-thought prompt is obtained by simply adding \texttt{think step by step} to the prompt.

View File

@ -16,7 +16,7 @@
\end{figure}
\item[Inverted index] \marginnote{Inverted index}
Mapping from terms to documents with pre-computed term frequency (and/or term positions). It allows narrowing down the document search space by considering only those that match the terms in the query.
Mapping from terms to documents with pre-computed term frequencies (and/or term positions). It allows narrowing down the document search space by considering only those that match the terms in the query.
\end{description}
@ -28,7 +28,7 @@
\begin{description}
\item[Document scoring]
Given a document $d$ a query $q$, and their respective embeddings $\vec{d}$ and $\vec{q}$, their similarity score is computed as their cosine similarity:
Given a document $d$, a query $q$, and their respective embeddings $\vec{d}$ and $\vec{q}$, their similarity score is computed as their cosine similarity:
\[
\begin{split}
\texttt{score}(q, d) &= \frac{\vec{q} \cdot \vec{d}}{|\vec{q}| \cdot |\vec{d}|} \\
@ -56,18 +56,18 @@
\begin{table}[H]
\centering
\footnotesize
\begin{tabular}{c|ccc|ccc}
\begin{tabular}{c|cccc|cccc}
\toprule
\textbf{Word} & \textbf{Count} & \textbf{TF} & $\textbf{TF-IDF}/|q|$ & \textbf{Count} & \textbf{TF} & $\textbf{TF-IDF}/|d_1|$ \\
\textbf{Word} & \textbf{Count} & \texttt{tf} & \texttt{tf-idf} & $\texttt{tf-idf}/|q|$ & \textbf{Count} & \texttt{tf} & \texttt{tf-idf} & $\texttt{tf-idf}/|d_1|$ \\
\midrule
\texttt{sweet} & $1$ & $1.000$ & $0.383$ & $2$ & $1.301$ & $0.357$ \\
\texttt{nurse} & $0$ & $0$ & $0$ & $1$ & $1.000$ & $0.661$ \\
\texttt{love} & $1$ & $1.000$ & $0.924$ & $1$ & $1.000$ & $0.661$ \\
\texttt{how} & $0$ & $0$ & $0$ & $0$ & $0$ & $0$ \\
\texttt{sorrow} & $0$ & $0$ & $0$ & $0$ & $0$ & $0$ \\
\texttt{is} & $0$ & $0$ & $0$ & $0$ & $0$ & $0$ \\
\texttt{sweet} & $1$ & $1.000$ & $0.125$ & $0.383$ & $2$ & $1.301$ & $0.163$ & $0.357$ \\
\texttt{nurse} & $0$ & $0$ & $0$ & $0$ & $1$ & $1.000$ & $0.301$ & $0.661$ \\
\texttt{love} & $1$ & $1.000$ & $0.301$ & $0.924$ & $1$ & $1.000$ & $0.301$ & $0.661$ \\
\texttt{how} & $0$ & $0$ & $0$ & $0$ & $0$ & $0$ & $0$ & $0$ \\
\texttt{sorrow} & $0$ & $0$ & $0$ & $0$ & $0$ & $0$ & $0$ & $0$ \\
\texttt{is} & $0$ & $0$ & $0$ & $0$ & $0$ & $0$ & $0$ & $0$ \\
\midrule
& \multicolumn{3}{l|}{$|d_1| = \sqrt{0.163^2 + 0.301^2 + 0.301^2} = 0.456$} & \multicolumn{3}{l}{$|q| = \sqrt{0.125^2 + 0.301^2} = 0.326$} \\
& \multicolumn{4}{l|}{$|q| = \sqrt{0.125^2 + 0.301^2} = 0.326$} & \multicolumn{4}{l}{$|d_1| = \sqrt{0.163^2 + 0.301^2 + 0.301^2} = 0.456$} \\
\bottomrule
\end{tabular}
\end{table}
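As a sanity check of the entries (assuming the log-scaled term frequency $\texttt{tf} = 1 + \log_{10}(\texttt{count})$, which matches the values above): for \texttt{sweet} in $d_1$, $\texttt{tf} = 1 + \log_{10} 2 = 1.301$, $\texttt{tf-idf} = 1.301 \cdot 0.125 = 0.163$ (where $0.125$ is the \texttt{idf} of \texttt{sweet}, as in the query columns), and $0.163 / 0.456 = 0.357$ after normalization.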
@ -117,7 +117,7 @@
\begin{description}
\item[Document scoring]
Use nearest neighbor using the dot product as distance. To speed-up search, approximate k-NN algorithms can be used.
Use nearest neighbor with the dot product as the similarity measure. To speed up the search, approximate k-NN algorithms can be used.
\end{description}
\end{description}
@ -203,7 +203,7 @@
22 & N & 0.36 & 0.88 \\
23 & N & 0.35 & 0.88 \\
24 & N & 0.33 & 0.88 \\
25 & Y & 0.36 & 01.0 \\
25 & Y & 0.36 & 1.00 \\
\bottomrule
\end{tabular}
\end{table}
@ -220,7 +220,7 @@
\end{figure}
Average precision computed over all the predictions (i.e., $t=25$) is:
\[ \texttt{AP} = 0.6 \]
\[ \texttt{AP} = \frac{1.0 + 0.66 + 0.60 + 0.66 + 0.63 + 0.55 + 0.47 + 0.44 + 0.36}{9} = 0.6 \]
\end{example}
@ -242,12 +242,12 @@
\begin{figure}[H]
\centering
\includegraphics[width=0.4\linewidth]{./img/_qa_bert.pdf}
\includegraphics[width=0.35\linewidth]{./img/_qa_bert.pdf}
\caption{Example of span labeling architecture}
\end{figure}
\begin{remark}
A sliding window can be used if the passage is too long for the context length of the model.
A sliding window over the whole passage can be used if it is too long for the context length of the model.
\end{remark}
\begin{remark}
@ -266,7 +266,7 @@
Macro F1 score computed by considering predictions and ground-truth as bags of tokens (i.e., average token overlap).
\item[Mean reciprocal rank] \marginnote{Mean reciprocal rank}
Given a system that provides a ranked list of answers to a question $q$, the reciprocal rank for $q$ is:
Given a system that provides a ranked list of answers to a question $q_i$, the reciprocal rank for $q_i$ is:
\[ \frac{1}{\texttt{rank}_i} \]
where $\texttt{rank}_i$ is the index of the first correct answer in the provided ranked list.
Mean reciprocal rank is computed over a set of queries $Q$:

View File

@ -21,9 +21,9 @@
\begin{enumerate}
\item Compute the embedding $\vec{e}^{(t)}$ of $w^{(t)}$.
\item Compute the hidden state $\vec{h}^{(t)}$ considering the hidden state $\vec{h}^{(t-1)}$ of the previous step:
\[ \vec{h}^{(t)} = f(\matr{W}_e \vec{e}^{(t)} + \matr{W}_h \vec{h}^{(t-1)} + b_1) \]
\[ \vec{h}^{(t)} = f(\matr{W}_e \vec{e}^{(t)} + \matr{W}_h \vec{h}^{(t-1)} + \vec{b}_1) \]
\item Compute the output vocabulary distribution $\hat{\vec{y}}^{(t)}$:
\[ \hat{\vec{y}}^{(t)} = \texttt{softmax}(\matr{U}\vec{h}^{(t)} + b_2) \]
\[ \hat{\vec{y}}^{(t)} = \texttt{softmax}(\matr{U}\vec{h}^{(t)} + \vec{b}_2) \]
\item Repeat for the next token.
\end{enumerate}
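A minimal NumPy sketch of a single such step (names and the choice $f = \tanh$ are illustrative, not prescribed by the notes) is:
\begin{verbatim}
import numpy as np

def rnn_lm_step(e_t, h_prev, W_e, W_h, b1, U, b2):
    # Hidden state update: h_t = f(W_e e_t + W_h h_{t-1} + b_1)
    h_t = np.tanh(W_e @ e_t + W_h @ h_prev + b1)
    # Output distribution over the vocabulary: softmax(U h_t + b_2)
    logits = U @ h_t + b2
    y_hat = np.exp(logits - logits.max())
    return h_t, y_hat / y_hat.sum()
\end{verbatim}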
@ -46,7 +46,7 @@
During training, as the ground-truth is known, the input at each step is the correct token even if the previous step outputted the wrong value.
\begin{remark}
This allows to stay close to the ground-truth and avoid completely wrong training steps.
This allows the model to stay closer to the ground-truth and avoid completely wrong training steps.
\end{remark}
\end{description}
\end{description}
@ -60,7 +60,7 @@
\subsection{Long short-term memory}
\begin{remark}[Vanishing gradient]
In RNNS, the gradient of distant tokens vanishes through time. Therefore, long-term effects are hard to model.
In RNNs, the gradient of distant tokens vanishes through time. Therefore, long-term effects are hard to model.
\end{remark}
\begin{description}
@ -112,7 +112,7 @@
\begin{description}
\item[Gated recurrent units (GRU)] \marginnote{Gated recurrent units (GRU)}
Architecture simpler than LSTM with fewer gates and without the cell state.
Architecture simpler than LSTMs with fewer gates and without the cell state.
\begin{description}
\item[Gates] \phantom{}
@ -222,6 +222,6 @@
\end{itemize}
\begin{example}[Question answering]
The RNN encoder embeds the question that is used alongside the context (i.e., source from which the answer has to be extracted) to solve a labelling task (i.e., classify each token of the context as non-relevant or relevant).
The RNN encoder embeds the question that is used alongside the context (i.e., source from which the answer has to be extracted) to solve a labeling task (i.e., classify each token of the context as non-relevant or relevant).
\end{example}
\end{description}

View File

@ -261,11 +261,11 @@
\item[Inverse document frequency (\texttt{idf})]
Inverse occurrence count of a word $t$ across all documents:
\[ \texttt{idf}(t) = \log_{10}\left( \frac{N}{\texttt{df}_t} \right) \]
where $N$ is the total number of documents in the collection and $\texttt{df}_t$ is the number of documents in which the term $t$ occurs.
\begin{remark}
Words that occur in a few documents have a high \texttt{idf}. Therefore, stop words, which appear often, have a low \texttt{idf}.
Words that occur in a few documents have a high \texttt{idf}. Therefore, stop words, which appear often, have a low \texttt{idf} and are down-weighted.
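For instance, with a (made-up) collection of $N = 1000$ documents, a term appearing in $10$ of them has $\texttt{idf} = \log_{10}(1000 / 10) = 2$, while a stop word appearing in all of them has $\texttt{idf} = \log_{10}(1) = 0$.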
\end{remark}
\end{descriptionlist}
@ -414,7 +414,7 @@
\begin{figure}[H]
\centering
\includegraphics[width=0.6\linewidth]{./img/_neural_language_model_example.pdf}
\includegraphics[width=0.5\linewidth]{./img/_neural_language_model_example.pdf}
\caption{Example of neural language model with a context of $3$ tokens}
\end{figure}
@ -445,7 +445,7 @@
\begin{figure}[H]
\centering
\includegraphics[width=0.5\linewidth]{./img/word2vec_alternatives.png}
\includegraphics[width=0.45\linewidth]{./img/word2vec_alternatives.png}
\end{figure}
\begin{description}
@ -509,13 +509,13 @@
A word is represented both as itself and as a bag of $n$-grams. Both whole words and $n$-grams have an embedding. The overall embedding of a word is the sum of the embeddings of the word itself and of its constituent $n$-grams.
\begin{example}
With $n=3$, the word \texttt{where} is represented both as \texttt{<where>} and \texttt{<wh, whe, her, ere, re>} (\texttt{<} and \texttt{>} are boundary characters).
With $n=3$, the word \texttt{where} is represented both as \texttt{<where>} and \texttt{<wh}, \texttt{whe}, \texttt{her}, \texttt{ere}, \texttt{re>} (\texttt{<} and \texttt{>} are boundary characters).
\end{example}
\item[GloVe] \marginnote{GloVe}
Based on the term-term co-occurrence probability matrix (computed within a window) that indicates, for each word, its probability of co-occurring with the other words.
Similarly to Word2vec, the objective is to learn two sets of embeddings $\matr{\theta} = \langle\matr{W}, \matr{C}\rangle$ such that their similarity is close to their log-probability of co-occurring. Given the term-term matrix $\matr{X}$, the loss for a target word $w$ and a context word $c$ is defined as:
The objective is to learn two sets of embeddings $\matr{\theta} = \langle\matr{W}, \matr{C}\rangle$ such that their similarity is close to their log-probability of co-occurring. Given the term-term matrix $\matr{X}$, the loss for a target word $w$ and a context word $c$ is defined as:
\[ \mathcal{L}(\matr{\theta}) = \left( \vec{c} \cdot \vec{w} - \log( \matr{X}[c, w] ) \right)^2 \]
\begin{remark}
@ -599,7 +599,7 @@
\centering
\includegraphics[width=0.7\linewidth]{./img/_embedding_history.png}
\caption{
\parbox[t]{0.7\linewidth}{
\parbox[t]{0.75\linewidth}{
Neighboring embeddings of the same words encoded using Word2vec trained on different corpora from different decades
}
}
@ -629,7 +629,7 @@
\begin{example}
Using the parallelogram model to solve:
\[ \texttt{father} : \texttt{doctor} :: \texttt{mother} : x \]
finds as the closest words $x =$ \texttt{homemaker}, \texttt{nurse}, \texttt{receptionist}, \dots
The closest words for $x$ are \texttt{homemaker}, \texttt{nurse}, \texttt{receptionist}, \dots
\end{example}
\begin{example}
@ -637,7 +637,7 @@
\end{example}
\begin{example}
Using the Google News dataset as training corpus, there is a correlation between the women bias of the jobs embeddings and the percentage of women over men in those jobs.
Using the Google News dataset as training corpus, there is a correlation between the women bias in job words embeddings and the percentage of women over men in those jobs.
Woman bias for a word $w$ is computed as:
\[ d_\text{women}(w) - d_\text{men}(w) \]

View File

@ -45,7 +45,7 @@
Waveform as-is in the real world.
\item[Digital signal] \marginnote{Digital signal}
Sampled (i.e., measure uniform time steps) and quantized (discretize values) version of an analog waveform.
Sampled (i.e., measure uniform time steps) and quantized (i.e., discretize values) version of an analog waveform.
\end{description}
\item[Fourier transform] \marginnote{Fourier transform}