Fix errors and typos <noupdate>
BIN src/year2/natural-language-processing/img/meme.jpg (new binary file, 371 KiB; content not shown)
@@ -34,16 +34,16 @@
The overall flow is the following:
\begin{enumerate}
\item The encoder computes its hidden states $\vec{h}^{(1)}, \dots, \vec{h}^{(N)} \in \mathbb{R}^{h}$.
-\item The decoder processes the input tokens one at the time. Its hidden state is initialized with $\vec{h}^{(N)}$. Consider the token in position $t$, the output is determined as follows:
+\item The decoder processes the input tokens one at a time, beginning with a \texttt{<start>} token. Its hidden state is initialized with $\vec{h}^{(N)}$. Considering the token at position $t$, the output is determined as follows:
\begin{enumerate}
\item The decoder outputs the hidden state $\vec{s}^{(t)}$.
\item Attention scores $\vec{e}^{(t)}$ are determined as the dot product between $\vec{s}^{(t)}$ and $\vec{h}^{(i)}$:
\[
\vec{e}^{(t)} =
\begin{bmatrix}
-\vec{s}^{(t)} \odot \vec{h}^{(1)} &
+\vec{s}^{(t)} \cdot \vec{h}^{(1)} &
\cdots &
-\vec{s}^{(t)} \odot \vec{h}^{(N)}
+\vec{s}^{(t)} \cdot \vec{h}^{(N)}
\end{bmatrix} \in \mathbb{R}^{N}
\]
$\vec{e}^{(t)}$ is used to determine the attention distribution $\vec{\alpha}^{(t)}$, which is required to obtain the attention output $\vec{a}^{(t)}$ as the weighted sum of the encoder hidden states:
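As an illustrative aside to the hunk above, a minimal NumPy sketch of one decoder step of dot-product attention over the encoder states; the dimensions and random states are toy placeholders, not values from the notes.

import numpy as np

def softmax(x):
    x = x - x.max()                       # numerical stability
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(0)
N, h = 5, 8                               # encoder steps, hidden size
enc_states = rng.normal(size=(N, h))      # h^(1), ..., h^(N)
dec_state = rng.normal(size=h)            # s^(t)

scores = enc_states @ dec_state           # e^(t): dot product with each h^(i)
alpha = softmax(scores)                   # attention distribution alpha^(t)
attn_output = alpha @ enc_states          # a^(t): weighted sum of encoder states
print(alpha.round(3), attn_output.shape)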
@@ -104,7 +104,7 @@
\begin{example}[Character-aware neural LM]
\phantom{}\\
\begin{minipage}{0.6\linewidth}
-RNN-LM that works on a character level:
+RNN-LM that works on the character level:
\begin{itemize}
\item Given a token, each character is embedded and concatenated.
\item Convolutions are used to refine the representation.
@@ -147,8 +147,8 @@
Then, the attention weights $\alpha_{i,j}$ between two embeddings $\vec{x}_i$ and $\vec{x}_j$ are computed as:
\[
\begin{gathered}
-\texttt{scores}(\vec{x}_i, \vec{x}_j) = \frac{\vec{q}_i \vec{k}_j}{\sqrt{d_k}} \\
-\alpha_{i,j} = \texttt{softmax}_j\left( \left[\texttt{scores}(\vec{x}_i, \vec{x}_1), \dots, \texttt{scores}(\vec{x}_i, \vec{x}_T)\right] \right) \\
+\texttt{score}(\vec{x}_i, \vec{x}_j) = \frac{\vec{q}_i \vec{k}_j}{\sqrt{d_k}} \\
+\alpha_{i,j} = \texttt{softmax}_j\left( \left[\texttt{score}(\vec{x}_i, \vec{x}_1), \dots, \texttt{score}(\vec{x}_i, \vec{x}_T)\right] \right) \\
\end{gathered}
\]
The output $\vec{a}_i \in \mathbb{R}^{1 \times d_v}$ is a weighted sum of the values of each token:
@@ -163,12 +163,12 @@


\item[Causal attention] \marginnote{Causal attention}
-Self-attention mechanism where only past tokens can be used to determine the representation of a token at a specific position. It is computed by modifying the standard self-attention as:
+Self-attention mechanism where only past tokens can be used to determine the representation of a token at a specific position. It is computed by modifying the standard self-attention as follows:
\[
\begin{gathered}
-\forall j \leq i: \texttt{scores}(\vec{x}_i, \vec{x}_j) = \frac{\vec{q}_i \vec{k}_j}{\sqrt{d_k}} \qquad
-\forall j > i: \texttt{scores}(\vec{x}_i, \vec{x}_j) = -\infty \\
-\alpha_{i,j} = \texttt{softmax}_j\left( \left[\texttt{scores}(\vec{x}_i, \vec{x}_1), \dots, \texttt{scores}(\vec{x}_i, \vec{x}_T)\right] \right) \\
+\forall j \leq i: \texttt{score}(\vec{x}_i, \vec{x}_j) = \frac{\vec{q}_i \vec{k}_j}{\sqrt{d_k}} \qquad
+\forall j > i: \texttt{score}(\vec{x}_i, \vec{x}_j) = -\infty \\
+\alpha_{i,j} = \texttt{softmax}_j\left( \left[\texttt{score}(\vec{x}_i, \vec{x}_1), \dots, \texttt{score}(\vec{x}_i, \vec{x}_T)\right] \right) \\
\vec{a}_i = \sum_{t: t \leq i} \alpha_{i,t} \vec{v}_t
\end{gathered}
\]
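A possible NumPy rendering of the scaled dot-product scores with the causal mask from the formulas above; the toy Q, K, V matrices are assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(0)
T, d_k, d_v = 4, 8, 8
Q = rng.normal(size=(T, d_k))                        # queries q_i
K = rng.normal(size=(T, d_k))                        # keys k_j
V = rng.normal(size=(T, d_v))                        # values v_j

scores = Q @ K.T / np.sqrt(d_k)                      # score(x_i, x_j)
scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf   # forall j > i: score = -inf

alpha = np.exp(scores - scores.max(axis=-1, keepdims=True))
alpha /= alpha.sum(axis=-1, keepdims=True)           # softmax over j
A = alpha @ V                                        # a_i = sum_{j <= i} alpha_ij v_j
print(np.allclose(np.triu(alpha, k=1), 0))           # True: no attention to future tokens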
@@ -181,11 +181,11 @@
\end{description}


-\subsection{Components}
+\subsection{Embeddings}

\begin{description}
\item[Input embedding] \marginnote{Input embedding}
-The input is tokenized using standard tokenizers (e.g., BPE, SentencePiece, \dots). Each token is encoded using a learned embedding matrix.
+The input is tokenized using standard tokenizers (e.g., BPE, SentencePiece, \dots). Each token is then encoded using a learned embedding matrix.

\begin{figure}[H]
\centering
@@ -203,7 +203,12 @@
\centering
\includegraphics[width=0.45\linewidth]{./img/_positional_encoding.pdf}
\end{figure}
+\end{description}
+

+\subsection{Transformer block}
+
+\begin{description}
\item[Transformer block] \marginnote{Transformer block}
Module with the same input and output dimensionality (i.e., allows stacking multiple blocks) composed of:
\begin{descriptionlist}
@@ -221,13 +226,13 @@
Where the hidden dimension $d_\text{ff}$ is usually larger than $d_\text{model}$.

\item[Normalization layer]
-Applies token-wise normalization (i.e., layer norm) to help training stability.
+Applies layer normalization (i.e., each token's features are normalized independently of the other tokens and of the rest of the batch) to help training stability.

\item[Residual connection]
Helps to propagate information during training.

\begin{remark}[Residual stream]
-An interpretation of residual connections is the residual stream where the input token in enhanced by the output of multi-head attention and the feedforward network.
+An interpretation of residual connections is that of a residual stream, where the input token is enhanced by the output of multi-head attention and the feedforward network.

\begin{figure}[H]
\centering
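A rough NumPy sketch of the transformer block structure described in this hunk (single attention head, no biases, post-norm variant); it only illustrates the residual-plus-normalization wiring and is not the exact architecture from the notes.

import numpy as np

rng = np.random.default_rng(0)
T, d_model, d_ff = 4, 16, 64

def layer_norm(x, eps=1e-5):
    # token-wise normalization: each row (token) is normalized over its features
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def self_attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    s = q @ k.T / np.sqrt(q.shape[-1])
    a = np.exp(s - s.max(-1, keepdims=True))
    a /= a.sum(-1, keepdims=True)
    return a @ v

def ffn(x, W1, W2):
    # position-wise feedforward with hidden size d_ff larger than d_model
    return np.maximum(0, x @ W1) @ W2

Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3))
W1 = rng.normal(scale=0.1, size=(d_model, d_ff))
W2 = rng.normal(scale=0.1, size=(d_ff, d_model))

x = rng.normal(size=(T, d_model))                    # residual stream entering the block
x = layer_norm(x + self_attention(x, Wq, Wk, Wv))    # residual connection around attention
x = layer_norm(x + ffn(x, W1, W2))                   # residual connection around the FFN
print(x.shape)                                       # same input and output dimensionality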
@@ -242,7 +247,7 @@
\caption{Overall attention block}
\end{figure}

-\item[Language modelling head] \marginnote{Language modelling head}
+\item[Language modeling head] \marginnote{Language modeling head}
Takes as input the output of the stack of transformer blocks at a given token position and outputs a distribution over the vocabulary.

\begin{figure}[H]
@@ -9,7 +9,7 @@
\begin{itemize}
\item The holder of attitude (i.e., the source).
\item The target of attitude (i.e., the aspect).
-\item The type of attitude (e.g., positive and negative).
+\item The type of attitude (e.g., positive or negative).
\item The text containing the attitude.
\end{itemize}

@@ -43,12 +43,12 @@
\end{example}

\item[Supervised machine learning] \marginnote{Supervised machine learning}
-Use a training set of $N$ labeled data $\{ (d_i, c_i) \}$ to fit a classifier.
+Use a training set of $N$ labeled document-class data points $\{ (d_i, c_i) \}$ to fit a classifier.

An ML model can be:
\begin{descriptionlist}
-\item[Generative] Informally, it learns the distribution of the data.
-\item[Discriminative] Informally, it learns to exploit the features to determine the class.
+\item[Generative] Informally, it learns the distribution of the data (i.e., $\prob{d_i | c_i}$).
+\item[Discriminative] Informally, it learns to exploit the features to determine the class (i.e., $\prob{c_i | d_i}$).
\end{descriptionlist}
\end{descriptionlist}
\end{description}
@@ -58,9 +58,9 @@

\begin{description}
\item[Bag-of-words (BoW)] \marginnote{Bag-of-words (BoW)}
-Represents a document using the frequencies of its words.
+Representation of a document using the frequency of its words.

-Given a vocabulary $V$ and a document $d$, the bag-of-words embedding of $d$ is a vector in $\mathbb{N}^{\vert V \vert}$ where the $i$-th position contains the number of occurrences of the $i$-th token in $d$.
+Given a vocabulary $V$ and a document $d$, the bag-of-words embedding of $d$ is a vector in $\mathbb{N}^{\vert V \vert}$ where the $i$-th position contains the number of occurrences of the $i$-th token of $V$ in $d$.

\item[Multinomial naive Bayes classifier] \marginnote{Multinomial naive Bayes classifier}
Generative probabilistic classifier based on the assumption that features are independent given the class.
@@ -76,13 +76,13 @@
\end{split}
\]

-Given a training set $D$ with $N_c$ classes and a vocabulary $V$, $\prob{w_i | c}$ and $\prob{c}$ are determined during training by maximum likelihood estimation as follows:
+Given a training set $D$ and a vocabulary $V$, $\prob{w_i | c}$ and $\prob{c}$ are determined during training by maximum likelihood estimation as follows:
\[
\prob{c} = \frac{N_c}{\vert D \vert}
\qquad
\prob{w_i | c} = \frac{\texttt{count}(w_i, c)}{\sum_{v \in V} \texttt{count}(v, c)}
\]
-where $\texttt{count}(w, c)$ counts the occurrences of the word $w$ in the training samples with class $c$.
+where $N_c$ is the number of documents with class $c$ and $\texttt{count}(w, c)$ counts the occurrences of the word $w$ in the training samples with class $c$.

\begin{remark}
Laplace smoothing is used to avoid zero probabilities.
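For illustration, a small Python sketch of these maximum likelihood estimates with Laplace (add-one) smoothing on a made-up toy corpus; the data and variable names are placeholders, not material from the notes.

from collections import Counter
import math

train = [(["good", "great", "film"], "pos"),
         (["bad", "boring", "film"], "neg"),
         (["great", "acting"], "pos")]          # toy (document, class) pairs

V = sorted({w for doc, _ in train for w in doc})
classes = sorted({c for _, c in train})
prior = {c: sum(1 for _, y in train if y == c) / len(train) for c in classes}   # P(c) = N_c / |D|
counts = {c: Counter() for c in classes}
for doc, c in train:
    counts[c].update(doc)

def p_word_given_class(w, c, alpha=1.0):
    # Laplace smoothing avoids zero probabilities for unseen (word, class) pairs
    return (counts[c][w] + alpha) / (sum(counts[c].values()) + alpha * len(V))

def predict(doc):
    # argmax_c log P(c) + sum_i log P(w_i | c)
    return max(classes, key=lambda c: math.log(prior[c])
               + sum(math.log(p_word_given_class(w, c)) for w in doc if w in V))

print(predict(["great", "film"]))               # expected: pos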
@@ -149,7 +149,7 @@ Possible optimizations for naive Bayes applied to sentiment analysis are the fol
\end{example}

\item[Parse tree]
-Build a tree to encode the sentiment and interactions of the words. By propagating the sentiments bottom-up it is possible to determine the overall sentiment of the sequence.
+Build a tree to encode the sentiment and interactions of the words. By propagating the sentiments bottom-up, it is possible to determine the overall sentiment of the sequence.

\begin{example}
The parse tree for the sentence ``\texttt{This film doesn't care about cleverness, wit or any other kind of intelligent humor.}'' is the following:
@@ -208,11 +208,11 @@ Naive Bayes has the following properties:


\item[Binary logistic regression] \marginnote{Binary logistic regression}
-Discriminative probabilistic model that computes the joint distribution $\prob{c | d}$ of a class $c$ and a document $d$.
+Discriminative probabilistic model that computes the conditional distribution $\prob{c | d}$ of the class $c$ given the document $d$.

Given the input features $\vec{x} = [x_1, \dots, x_n]$, logistic regression computes the following:
\[
-\sigma\left( \sum_{i=1}^{n} w_i x_i \right) + b = \sigma(\vec{w}\vec{x} + b)
+\sigma\left( \sum_{i=1}^{n} w_i x_i + b \right) = \sigma(\vec{w}\vec{x} + b)
\]
where $\sigma$ is the sigmoid function.

@@ -233,8 +233,8 @@ Naive Bayes has the following properties:
\[ \arg\min_{\vec{\theta}} \sum_{i=1}^{m} \mathcal{L}_\text{BCE}(\hat{y}^{(i)}, f(x^{(i)}; \vec{\theta})) + \alpha \mathcal{R}(\vec{\theta}) \]
where $\alpha$ is the regularization factor and $\mathcal{R}(\vec{\theta})$ is the regularization term. Typical regularization approaches are:
\begin{descriptionlist}
-\item[Ridge regression (L1)] $\mathcal{R}(\vec{\theta}) = \Vert \vec{\theta} \Vert^2_2 = \sum_{j=1}^{n} \vec{\theta}_j^2$.
-\item[Lasso regression (L2)] $\mathcal{R}(\vec{\theta}) = \Vert \vec{\theta} \Vert_1 = \sum_{j=1}^{n} \vert \vec{\theta}_j \vert$.
+\item[Lasso regression (L1)] $\mathcal{R}(\vec{\theta}) = \Vert \vec{\theta} \Vert_1 = \sum_{j=1}^{n} \vert \vec{\theta}_j \vert$.
+\item[Ridge regression (L2)] $\mathcal{R}(\vec{\theta}) = \Vert \vec{\theta} \Vert^2_2 = \sum_{j=1}^{n} \vec{\theta}_j^2$.
\end{descriptionlist}
\end{description}

@@ -242,7 +242,7 @@ Naive Bayes has the following properties:
Extension of logistic regression to the multi-class case. The conditional probability becomes $\prob{y = c | x}$ and softmax is used in place of the sigmoid.

Cross-entropy is extended over the classes $C$:
-\[ \mathcal{L}_\text{CE}(\hat{y}, y) = - \sum_{c \in C} \mathbbm{1}\{y = c\} \log \prob{y = c | x} \]
+\[ \mathcal{L}_\text{CE}(\hat{y}, y) = - \sum_{c \in C} \mathbbm{1}\{y = c\} \log(\prob{y = c | x}) \]
\end{description}


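A brief NumPy fragment illustrating the softmax and the multi-class cross-entropy above; the weights, features, and label are toy values chosen only for the example.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

W = np.array([[0.2, -0.1, 0.4],
              [0.5, 0.3, -0.2]])     # (n_features, n_classes)
b = np.zeros(3)
x = np.array([1.0, 2.0])             # input features
y = 2                                # ground-truth class index

p = softmax(x @ W + b)               # P(y = c | x) for every class c
loss_ce = -np.log(p[y])              # only the true-class term of the sum survives
print(p.round(3), round(float(loss_ce), 3))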
@@ -34,9 +34,9 @@
\end{enumerate}

\item[ALERT benchmark] \marginnote{ALERT benchmark}
-Benchmark to test the safeness of an LLM based on 32 risk categories. The testing data are created as follows:
+Benchmark to test the safety of an LLM based on 32 risk categories. The testing data were created as follows:
\begin{enumerate}
-\item Filter the ``\textit{Helpfulness \& Harmlessness-RLHF}'' dataset of \textit{Anthropic} by considering for each example the first prompt and red team attacks only.
+\item Filter the ``\textit{Helpfulness \& Harmlessness-RLHF}'' dataset of \textit{Anthropic} by considering for each example the first prompt and red team (i.e., malicious) attacks only.
\item Use templates to automatically generate additional prompts.
\item Augment the prompts by formatting them as adversarial attacks. Examples of attacks are:
\begin{descriptionlist}
@@ -60,7 +60,7 @@
\end{remark}

\item[Beam search] \marginnote{Beam search}
-Given a beam width $k$, perform a breadth-first search keeping at each branching level the top-$k$ tokens based on the probability of that sequence:
+Given a beam width $k$, perform a breadth-first search keeping at each branching level the top-$k$ tokens based on the probability of that sequence, computed as:
\[ \log\left( \prob{y \mid x} \right) = \sum_{i=1}^{t} \log\left( \prob{ y_i \mid x, y_1, \dots, y_{i-1} } \right) \]

\begin{example}
@@ -93,7 +93,7 @@
\end{remark}

\item[Temperature sampling]
-Skew the distribution to emphasize the most likely words and decrease the probability of less likely words. Given the logits $\vec{u}$ and the temperature $\tau$, the output distribution $\vec{y}$ is determined as:
+Skew the distribution to emphasize the most likely words and decrease the probability of less likely ones. Given the logits $\vec{u}$ and the temperature $\tau$, the output distribution $\vec{y}$ is determined as:
\[ \vec{y} = \texttt{softmax}\left( \frac{\vec{u}}{\tau} \right) \]
where:
\begin{itemize}
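A quick NumPy illustration of how the temperature reshapes the output distribution of the formula above (toy logits; values below 1 sharpen the distribution, values above 1 flatten it).

import numpy as np

def temperature_softmax(logits, tau):
    z = logits / tau
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

u = np.array([2.0, 1.0, 0.1])        # toy logits
for tau in (0.5, 1.0, 2.0):
    print(tau, temperature_softmax(u, tau).round(3))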
@@ -171,11 +171,11 @@
Specialize a model by adding new learnable parameters.

\begin{description}
-\item[Parameter-efficient fine-tuning (PEFT)] \marginnote{Parameter-efficient fine-tuning (PEFT)}
-Continue training a selected subset of parameters (e.g., LoRA \Cref{sec:lora}).
-
\item[Task-specific fine-tuning] \marginnote{Task-specific fine-tuning}
Add a new trainable head on top of the model.
+
+\item[Parameter-efficient fine-tuning (PEFT)] \marginnote{Parameter-efficient fine-tuning (PEFT)}
+Continue training a selected subset of parameters (e.g., LoRA \Cref{sec:lora}).
\end{description}

\item[Supervised fine-tuning] \marginnote{Supervised fine-tuning}
@@ -215,7 +215,7 @@
\begin{example}[Word sense disambiguation]
Task of determining the sense of each word of a sequence. Senses usually come from an existing ontology (e.g., WordNet). An approach to solve the problem is the following:
\begin{enumerate}
-\item Compute the embedding $\vec{v}_i$ of words using a pre-trained encoder (e.g., BERT).
+\item Compute the embeddings $\vec{v}_i$ of the words using a pre-trained encoder (e.g., BERT).
\item Represent the embedding of a sense as the average of the tokens of that sense:
\[ \vec{v}_s = \frac{1}{n} \sum_i \vec{v}_i \]
\item Predict the sense of a word $\vec{t}$ as:
@@ -240,7 +240,7 @@
\subsection{Pre-training}

\begin{description}
-\item[Masked language modelling] \marginnote{Masked language modelling}
+\item[Masked language modeling] \marginnote{Masked language modeling}
Task of predicting missing or corrupted tokens in a sequence.

\begin{remark}
@@ -287,8 +287,7 @@
\item[Fine-tuning for sequence labeling]
Add a classification head on top of each token. A conditional random field (CRF) layer can also be added to produce globally more coherent tags.

-\begin{description}
-\item[Named entity recognition (NER)] \marginnote{Named entity recognition (NER)}
+\begin{example}[Named entity recognition (NER)]
Task of assigning to each word of a sequence its entity class. NER taggers usually also capture concepts spanning across multiple tokens. To achieve this, additional information is provided with the entity class:
\begin{descriptionlist}
\item[Begin] Starting token of a concept.
@@ -308,7 +307,7 @@
The entity (so, also a span of text) is the atomic unit for NER metrics.
\end{remark}
\end{description}
-\end{description}
+\end{example}
\end{description}


@@ -356,8 +355,8 @@
Given the sequence:
\[ \texttt{<bos> thank you \underline{for inviting} me to your party \underline{last} week <eos>} \]
Some spans of text are masked with placeholder tokens as follows:
-\[ \texttt{<bos> thank you <X> me to your party <Y> week <eos>} \]
+\[ \texttt{<bos> thank you \underline{<X>} me to your party \underline{<Y>} week <eos>} \]
The masked sequence is passed through the encoder, while the decoder has to predict the masked tokens:
-\[ \texttt{<bos> <X> for inviting <Y> last <Z> <eos>} \]
+\[ \texttt{<bos> <X> \underline{for inviting} <Y> \underline{last} <Z> <eos>} \]
\end{example}
\end{description}
@@ -21,7 +21,7 @@
\end{figure}

\begin{remark}
-If performed correctly, after performing instruction tuning on a model, it is able to also solve tasks that were not present in the tuning dataset.
+If done correctly, after instruction tuning, the model should also be able to solve tasks that were not present in the tuning dataset.
\end{remark}

\begin{figure}[H]
@@ -44,7 +44,7 @@

\begin{description}
\item[Mixture of experts] \marginnote{Mixture of experts}
-Specialize smaller models on subset of data and train a router to forward the input to the correct expert.
+Specialize smaller models on subsets of data and train a router to forward the input to the correct expert.

\begin{remark}
This approach can be easily deployed on distributed systems.
@@ -79,7 +79,7 @@
\[
\forall t_i \in V_\text{dom}: \matr{E}_\text{dom}(t_i) = \frac{1}{|\mathcal{T}_\text{s}(t_i)|} \sum_{t_j \in \mathcal{T}_\text{s}(t_i)}\matr{E}_\text{s}(t_j)
\]
-In other words, each token in $V_\text{dom}$ is encoded as the average of embeddings of the tokens that compose it in the starting embedding model (if the token appear in both vocabularies, the embedding is the same).
+In other words, each token in $V_\text{dom}$ is encoded as the average of the embeddings of the tokens that compose it in the starting embedding model (if the token appears in both vocabularies, the embedding is the same).
\end{description}
\end{description}

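A toy NumPy sketch of this initialization: the source vocabulary, the embedding matrix, and the fake tokenizer below are all invented for illustration and are not part of the notes.

import numpy as np

rng = np.random.default_rng(0)
src_vocab = {"ele": 0, "ctro": 1, "cardio": 2, "gram": 3}      # hypothetical source sub-tokens
E_src = rng.normal(size=(len(src_vocab), 16))                  # starting embedding matrix E_s

def init_domain_embedding(domain_token, src_tokenize, E_src, src_vocab):
    # E_dom(t) = mean of the source embeddings of the sub-tokens that compose t
    ids = [src_vocab[p] for p in src_tokenize(domain_token)]
    return E_src[ids].mean(axis=0)

def fake_tokenize(word):
    return ["ele", "ctro", "cardio", "gram"]                   # stand-in tokenizer output

e_new = init_domain_embedding("electrocardiogram", fake_tokenize, E_src, src_vocab)
print(e_new.shape)                                             # same size as the source embeddings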
@@ -125,7 +125,7 @@
\begin{remark}
Some studies show that an explanation for in-context learning is that causal attention has the same effect as gradient updates (i.e., the left part of the prompt influences the right part).

-Another possible explanation is based on the concept of induction heads which are attention heads that specialize in predicting repeated sequences (i.e., in-context learning is seen as the capability of imitating past data). Ablation studies show that by identifying and removing induction heads, in-context learning performance of a model drastically drops.
+Another possible explanation is based on the concept of induction heads, which are attention heads that specialize in predicting repeated sequences (i.e., in-context learning is seen as the capability of imitating past data). Ablation studies show that by identifying and removing induction heads, the in-context learning performance of a model drastically drops.

\begin{figure}[H]
\centering
@@ -144,7 +144,7 @@
\end{figure}

\item[Chain-of-thought prompting] \marginnote{Chain-of-thought prompting}
-Provide in the prompt examples of reasoning to make the model provide the output step-by-step.
+Provide in the prompt examples of reasoning to make the model generate the output step-by-step\footnote{\includegraphics[width=1.2cm]{img/meme.jpg}}.

\begin{remark}
Empirical results show that the best prompt for chain-of-thought is to add \texttt{think step by step} to the prompt.
@@ -16,7 +16,7 @@
\end{figure}

\item[Inverted index] \marginnote{Inverted index}
-Mapping from terms to documents with pre-computed term frequency (and/or term positions). It allows narrowing down the document search space by considering only those that match the terms in the query.
+Mapping from terms to documents with pre-computed term frequencies (and/or term positions). It allows narrowing down the document search space by considering only those that match the terms in the query.
\end{description}


@@ -28,7 +28,7 @@

\begin{description}
\item[Document scoring]
-Given a document $d$ a query $q$, and their respective embeddings $\vec{d}$ and $\vec{q}$, their similarity score is computed as their cosine similarity:
+Given a document $d$, a query $q$, and their respective embeddings $\vec{d}$ and $\vec{q}$, their similarity score is computed as their cosine similarity:
\[
\begin{split}
\texttt{score}(q, d) &= \frac{\vec{q} \cdot \vec{d}}{|\vec{q}| \cdot |\vec{d}|} \\
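For reference, a minimal NumPy version of this cosine-similarity score between a query embedding and a document embedding (the two vectors are toy values).

import numpy as np

def cosine_score(q, d):
    # score(q, d) = (q . d) / (|q| |d|)
    return float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))

q = np.array([0.0, 1.0, 1.0])        # toy query embedding
d = np.array([1.0, 2.0, 0.5])        # toy document embedding
print(round(cosine_score(q, d), 3))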
@@ -56,18 +56,18 @@
\begin{table}[H]
\centering
\footnotesize
-\begin{tabular}{c|ccc|ccc}
+\begin{tabular}{c|cccc|cccc}
\toprule
-\textbf{Word} & \textbf{Count} & \textbf{TF} & $\textbf{TF-IDF}/|q|$ & \textbf{Count} & \textbf{TF} & $\textbf{TF-IDF}/|d_1|$ \\
+\textbf{Word} & \textbf{Count} & \texttt{tf} & \texttt{tf-idf} & $\texttt{tf-idf}/|q|$ & \textbf{Count} & \texttt{tf} & \texttt{tf-idf} & $\texttt{tf-idf}/|d_1|$ \\
\midrule
-\texttt{sweet} & $1$ & $1.000$ & $0.383$ & $2$ & $1.301$ & $0.357$ \\
-\texttt{nurse} & $0$ & $0$ & $0$ & $1$ & $1.000$ & $0.661$ \\
-\texttt{love} & $1$ & $1.000$ & $0.924$ & $1$ & $1.000$ & $0.661$ \\
-\texttt{how} & $0$ & $0$ & $0$ & $0$ & $0$ & $0$ \\
-\texttt{sorrow} & $0$ & $0$ & $0$ & $0$ & $0$ & $0$ \\
-\texttt{is} & $0$ & $0$ & $0$ & $0$ & $0$ & $0$ \\
+\texttt{sweet} & $1$ & $1.000$ & $0.125$ & $0.383$ & $2$ & $1.301$ & $0.163$ & $0.357$ \\
+\texttt{nurse} & $0$ & $0$ & $0$ & $0$ & $1$ & $1.000$ & $0.301$ & $0.661$ \\
+\texttt{love} & $1$ & $1.000$ & $0.301$ & $0.924$ & $1$ & $1.000$ & $0.301$ & $0.661$ \\
+\texttt{how} & $0$ & $0$ & $0$ & $0$ & $0$ & $0$ & $0$ & $0$ \\
+\texttt{sorrow} & $0$ & $0$ & $0$ & $0$ & $0$ & $0$ & $0$ & $0$ \\
+\texttt{is} & $0$ & $0$ & $0$ & $0$ & $0$ & $0$ & $0$ & $0$ \\
\midrule
-& \multicolumn{3}{l|}{$|d_1| = \sqrt{0.163^2 + 0.301^2 + 0.301^2} = 0.456$} & \multicolumn{3}{l}{$|q| = \sqrt{0.125^2 + 0.301^2} = 0.326$} \\
+& \multicolumn{4}{l|}{$|q| = \sqrt{0.125^2 + 0.301^2} = 0.326$} & \multicolumn{4}{l}{$|d_1| = \sqrt{0.163^2 + 0.301^2 + 0.301^2} = 0.456$} \\
\bottomrule
\end{tabular}
\end{table}
@@ -117,7 +117,7 @@

\begin{description}
\item[Document scoring]
-Use nearest neighbor using the dot product as distance. To speed-up search, approximate k-NN algorithms can be used.
+Use nearest neighbor with the dot product as distance. To speed up search, approximate k-NN algorithms can be used.
\end{description}
\end{description}

@@ -203,7 +203,7 @@
22 & N & 0.36 & 0.88 \\
23 & N & 0.35 & 0.88 \\
24 & N & 0.33 & 0.88 \\
-25 & Y & 0.36 & 01.0 \\
+25 & Y & 0.36 & 1.00 \\
\bottomrule
\end{tabular}
\end{table}
@@ -220,7 +220,7 @@
\end{figure}

Average precision computed over all the predictions (i.e., $t=25$) is:
-\[ \texttt{AP} = 0.6 \]
+\[ \texttt{AP} = \frac{1.0 + 0.66 + 0.60 + 0.66 + 0.63 + 0.55 + 0.47 + 0.44 + 0.36}{9} = 0.6 \]
\end{example}


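A small Python helper mirroring this computation: average precision as the mean of the precision values measured at each relevant result (the relevance list below is a toy example, not the one from the table above).

def average_precision(relevance):
    # relevance: booleans for the ranked predictions, best first
    precisions, hits = [], 0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)        # precision at this relevant result
    return sum(precisions) / len(precisions) if precisions else 0.0

print(round(average_precision([True, False, True, True, False]), 3))   # (1/1 + 2/3 + 3/4) / 3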
@@ -242,12 +242,12 @@

\begin{figure}[H]
\centering
-\includegraphics[width=0.4\linewidth]{./img/_qa_bert.pdf}
+\includegraphics[width=0.35\linewidth]{./img/_qa_bert.pdf}
\caption{Example of span labeling architecture}
\end{figure}

\begin{remark}
-A sliding window can be used if the passage is too long for the context length of the model.
+A sliding window over the whole passage can be used if it is too long for the context length of the model.
\end{remark}

\begin{remark}
@@ -266,7 +266,7 @@
Macro F1 score computed by considering predictions and ground-truth as bags of tokens (i.e., average token overlap).

\item[Mean reciprocal rank] \marginnote{Mean reciprocal rank}
-Given a system that provides a ranked list of answers to a question $q$, the reciprocal rank for $q$ is:
+Given a system that provides a ranked list of answers to a question $q_i$, the reciprocal rank for $q_i$ is:
\[ \frac{1}{\texttt{rank}_i} \]
where $\texttt{rank}_i$ is the index of the first correct answer in the provided ranked list.
Mean reciprocal rank is computed over a set of queries $Q$:
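A compact Python sketch of mean reciprocal rank over a toy set of queries; each inner list marks which ranked answers are correct (queries with no correct answer simply contribute 0 here).

def mean_reciprocal_rank(ranked_correctness):
    total = 0.0
    for answers in ranked_correctness:
        rank = next((i for i, ok in enumerate(answers, start=1) if ok), None)
        total += 1.0 / rank if rank is not None else 0.0
    return total / len(ranked_correctness)

queries = [[False, True, False],    # first correct answer at rank 2 -> 1/2
           [True, False],           # rank 1 -> 1
           [False, False, True]]    # rank 3 -> 1/3
print(round(mean_reciprocal_rank(queries), 3))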
@@ -21,9 +21,9 @@
\begin{enumerate}
\item Compute the embedding $\vec{e}^{(t)}$ of $w^{(t)}$.
\item Compute the hidden state $\vec{h}^{(t)}$ considering the hidden state $\vec{h}^{(t-1)}$ of the previous step:
-\[ \vec{h}^{(t)} = f(\matr{W}_e \vec{e}^{(t)} + \matr{W}_h \vec{h}^{(t-1)} + b_1) \]
+\[ \vec{h}^{(t)} = f(\matr{W}_e \vec{e}^{(t)} + \matr{W}_h \vec{h}^{(t-1)} + \vec{b}_1) \]
\item Compute the output vocabulary distribution $\hat{\vec{y}}^{(t)}$:
-\[ \hat{\vec{y}}^{(t)} = \texttt{softmax}(\matr{U}\vec{h}^{(t)} + b_2) \]
+\[ \hat{\vec{y}}^{(t)} = \texttt{softmax}(\matr{U}\vec{h}^{(t)} + \vec{b}_2) \]
\item Repeat for the next token.
\end{enumerate}

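One forward step of this recurrence in NumPy, using tanh as the non-linearity $f$; the dimensions and random weights are placeholders for illustration.

import numpy as np

rng = np.random.default_rng(0)
d_e, d_h, vocab = 8, 16, 100
W_e = rng.normal(scale=0.1, size=(d_h, d_e))
W_h = rng.normal(scale=0.1, size=(d_h, d_h))
U = rng.normal(scale=0.1, size=(vocab, d_h))
b1, b2 = np.zeros(d_h), np.zeros(vocab)

def rnn_lm_step(e_t, h_prev):
    h_t = np.tanh(W_e @ e_t + W_h @ h_prev + b1)   # hidden state h^(t)
    logits = U @ h_t + b2
    y_t = np.exp(logits - logits.max())
    y_t /= y_t.sum()                               # softmax over the vocabulary
    return h_t, y_t

h, _ = rnn_lm_step(rng.normal(size=d_e), np.zeros(d_h))
h, y = rnn_lm_step(rng.normal(size=d_e), h)        # repeat for the next token
print(round(float(y.sum()), 3), int(y.argmax()))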
@@ -46,7 +46,7 @@
During training, as the ground-truth is known, the input at each step is the correct token even if the previous step outputted the wrong value.

\begin{remark}
-This allows to stay close to the ground-truth and avoid completely wrong training steps.
+This allows the model to stay closer to the ground-truth and avoid completely wrong training steps.
\end{remark}
\end{description}
\end{description}
@@ -60,7 +60,7 @@
\subsection{Long short-term memory}

\begin{remark}[Vanishing gradient]
-In RNNS, the gradient of distant tokens vanishes through time. Therefore, long-term effects are hard to model.
+In RNNs, the gradient of distant tokens vanishes through time. Therefore, long-term effects are hard to model.
\end{remark}

\begin{description}
@@ -112,7 +112,7 @@

\begin{description}
\item[Gated recurrent units (GRU)] \marginnote{Gated recurrent units (GRU)}
-Architecture simpler than LSTM with fewer gates and without the cell state.
+Architecture simpler than LSTMs with fewer gates and without the cell state.

\begin{description}
\item[Gates] \phantom{}
@@ -222,6 +222,6 @@
\end{itemize}

\begin{example}[Question answering]
-The RNN encoder embeds the question that is used alongside the context (i.e., source from which the answer has to be extracted) to solve a labelling task (i.e., classify each token of the context as non-relevant or relevant).
+The RNN encoder embeds the question, which is used alongside the context (i.e., the source from which the answer has to be extracted) to solve a labeling task (i.e., classify each token of the context as non-relevant or relevant).
\end{example}
\end{description}
@@ -261,11 +261,11 @@

\item[Inverse document frequency (\texttt{idf})]
Inverse occurrence count of a word $t$ across all documents:
\[ \texttt{idf}(t) = \log_{10}\left( \frac{N}{\texttt{df}_t} \right) \]
where $N$ is the total number of documents and $\texttt{df}_t$ is the number of documents in which the term $t$ occurs.

\begin{remark}
-Words that occur in a few documents have a high \texttt{idf}. Therefore, stop words, which appear often, have a low \texttt{idf}.
+Words that occur in a few documents have a high \texttt{idf}. Therefore, stop words, which appear often, have a low \texttt{idf} and are down-weighted.
\end{remark}
\end{descriptionlist}

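A short Python sketch of these weights on a toy three-document corpus, using a log-scaled \texttt{tf} ($1 + \log_{10}$ of the count) and $\texttt{idf}(t) = \log_{10}(N / \texttt{df}_t)$; the documents are invented for the example.

import math

docs = [["sweet", "sorrow", "sweet"],
        ["love", "sweet", "nurse"],
        ["love", "how", "is"]]
N = len(docs)

def tf(term, doc):
    c = doc.count(term)
    return 1 + math.log10(c) if c > 0 else 0.0     # log-scaled term frequency

def idf(term):
    df = sum(1 for d in docs if term in d)         # documents containing the term
    return math.log10(N / df) if df else 0.0

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

print(round(tf_idf("sweet", docs[0]), 3), round(tf_idf("love", docs[1]), 3))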
@@ -414,7 +414,7 @@

\begin{figure}[H]
\centering
-\includegraphics[width=0.6\linewidth]{./img/_neural_language_model_example.pdf}
+\includegraphics[width=0.5\linewidth]{./img/_neural_language_model_example.pdf}
\caption{Example of neural language model with a context of $3$ tokens}
\end{figure}

@@ -445,7 +445,7 @@

\begin{figure}[H]
\centering
-\includegraphics[width=0.5\linewidth]{./img/word2vec_alternatives.png}
+\includegraphics[width=0.45\linewidth]{./img/word2vec_alternatives.png}
\end{figure}

\begin{description}
@@ -509,13 +509,13 @@
A word is represented both as itself and a bag of $n$-grams. Both whole words and $n$-grams have an embedding. The overall embedding of a word is the sum of the embeddings of its constituent $n$-grams.

\begin{example}
-With $n=3$, the word \texttt{where} is represented both as \texttt{<where>} and \texttt{<wh, whe, her, ere, re>} (\texttt{<} and \texttt{>} are boundary characters).
+With $n=3$, the word \texttt{where} is represented both as \texttt{<where>} and \texttt{<wh}, \texttt{whe}, \texttt{her}, \texttt{ere}, \texttt{re>} (\texttt{<} and \texttt{>} are boundary characters).
\end{example}

\item[GloVe] \marginnote{GloVe}
Based on the term-term co-occurrence (within a window) probability matrix that indicates for each word its probability of co-occurring with the other words.

-Similarly to Word2vec, the objective is to learn two sets of embeddings $\matr{\theta} = \langle\matr{W}, \matr{C}\rangle$ such that their similarity is close to their log-probability of co-occurring. Given the term-term matrix $\matr{X}$, the loss for a target word $w$ and a context word $c$ is defined as:
+The objective is to learn two sets of embeddings $\matr{\theta} = \langle\matr{W}, \matr{C}\rangle$ such that their similarity is close to their log-probability of co-occurring. Given the term-term matrix $\matr{X}$, the loss for a target word $w$ and a context word $c$ is defined as:
\[ \mathcal{L}(\matr{\theta}) = \left( \vec{c} \cdot \vec{w} - \log( \matr{X}[c, w] ) \right)^2 \]

\begin{remark}
@@ -599,7 +599,7 @@
\centering
\includegraphics[width=0.7\linewidth]{./img/_embedding_history.png}
\caption{
-\parbox[t]{0.7\linewidth}{
+\parbox[t]{0.75\linewidth}{
Neighboring embeddings of the same words encoded using Word2vec trained on different corpora from different decades
}
}
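A tiny Python helper that extracts character $n$-grams with boundary markers, reproducing the \texttt{where} example from the fastText hunk above; summing the embeddings of these $n$-grams (plus the whole-word embedding) would then give the word representation.

def char_ngrams(word, n=3):
    marked = f"<{word}>"                                  # add boundary characters
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

print(char_ngrams("where"))   # ['<wh', 'whe', 'her', 'ere', 're>']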
@@ -629,7 +629,7 @@
\begin{example}
Using the parallelogram model to solve:
\[ \texttt{father} : \texttt{doctor} :: \texttt{mother} : x \]
-finds as the closest words $x =$ \texttt{homemaker}, \texttt{nurse}, \texttt{receptionist}, \dots
+The closest words for $x$ are \texttt{homemaker}, \texttt{nurse}, \texttt{receptionist}, \dots
\end{example}

\begin{example}
@@ -637,7 +637,7 @@
\end{example}

\begin{example}
-Using the Google News dataset as training corpus, there is a correlation between the women bias of the jobs embeddings and the percentage of women over men in those jobs.
+Using the Google News dataset as training corpus, there is a correlation between the woman bias of job word embeddings and the percentage of women over men in those jobs.

Woman bias for a word $w$ is computed as:
\[ d_\text{women}(w) - d_\text{men}(w) \]
@@ -45,7 +45,7 @@
Waveform as-is in the real world.

\item[Digital signal] \marginnote{Digital signal}
-Sampled (i.e., measure uniform time steps) and quantized (discretize values) version of an analog waveform.
+Sampled (i.e., measured at uniform time steps) and quantized (i.e., with discretized values) version of an analog waveform.
\end{description}

\item[Fourier transform] \marginnote{Fourier transform}
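As a closing aside to the digital-signal hunk above, a minimal NumPy sketch of sampling and quantizing a waveform; the sine wave, sample rate, and bit depth are arbitrary stand-ins for a real analog signal.

import numpy as np

sample_rate = 8000                                   # samples per second (uniform time steps)
duration, freq = 0.01, 440.0                         # 10 ms of a 440 Hz tone
t = np.arange(0, duration, 1 / sample_rate)
waveform = np.sin(2 * np.pi * freq * t)              # stands in for the analog waveform

bits = 8
levels = 2 ** bits
quantized = np.round((waveform + 1) / 2 * (levels - 1)).astype(np.uint8)   # discretized values
print(len(t), int(quantized.min()), int(quantized.max()))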