Mirror of https://github.com/NotXia/unibo-ai-notes.git, synced 2025-12-15 19:12:22 +01:00
Fix typos <noupdate>
@@ -46,7 +46,7 @@
 \vec{s}^{(t)} \cdot \vec{h}^{(N)}
 \end{bmatrix} \in \mathbb{R}^{N}
 \]
-$\vec{e}^{(t)}$ is used to determine the attention distribution $\vec{\alpha}^{(t)}$ that is required to obtain the attention output $\vec{a}^{(t)}$ as the weighted encoder hidden states :
+$\vec{e}^{(t)}$ is used to determine the attention distribution $\vec{\alpha}^{(t)}$ that is required to obtain the attention output $\vec{a}^{(t)}$ as the weighted sum of the encoder hidden states:
 \[
 \begin{gathered}
 \mathbb{R}^{N} \ni \vec{\alpha}^{(t)} = \texttt{softmax}(\vec{e}^{(t)}) \\
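As a quick numerical check of the softmax step in the hunk above (the attention scores are invented for illustration): for $N = 3$ encoder states with scores $\vec{e}^{(t)} = (2.0, 1.0, 0.1)$,
\[ \vec{\alpha}^{(t)} = \texttt{softmax}(\vec{e}^{(t)}) \approx (0.66, 0.24, 0.10) \qquad \vec{a}^{(t)} \approx 0.66\,\vec{h}^{(1)} + 0.24\,\vec{h}^{(2)} + 0.10\,\vec{h}^{(3)} \]
i.e., the attention output is dominated by the encoder hidden state with the highest score.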
@@ -71,7 +71,7 @@
 \hat{c} &= \arg\max_{c \in C} \prob{c | d} \\
 &= \arg\max_{c \in C} \underbrace{\prob{d | c}}_{\text{likelihood}} \underbrace{\prob{c}}_{\text{prior}} \\
 &= \arg\max_{c \in C} \prob{w_1, \dots, w_n | c} \prob{c} \\
-&= \arg\max_{c \in C} \prod_{i} \prob{w_i | c} \prob{c} \\
+&\approx \arg\max_{c \in C} \prod_{i} \prob{w_i | c} \prob{c} \\
 &= \arg\max_{c \in C} \sum_{i} \log\prob{w_i | c} + \log\prob{c} \\
 \end{split}
 \]
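To make the final log-space decision rule of this hunk concrete (all probabilities are made up for illustration, natural logarithms), consider two classes and a three-word document:
\[
\begin{split}
\log\prob{c_1} + \sum_{i} \log\prob{w_i | c_1} &= \log 0.6 + \log 0.1 + \log 0.2 + \log 0.05 \approx -7.42 \\
\log\prob{c_2} + \sum_{i} \log\prob{w_i | c_2} &= \log 0.4 + \log 0.05 + \log 0.1 + \log 0.1 \approx -8.52
\end{split}
\]
so $\hat{c} = c_1$; the scores only matter through the $\arg\max$ comparison.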
@@ -82,7 +82,7 @@
 \qquad
 \prob{w_i | c} = \frac{\texttt{count}(w_i, c)}{\sum_{v \in V} \texttt{count}(v, c)}
 \]
-where $N_c$ is the number of documents with class $c$ and $\texttt{count}(w, c)$ counts the occurrences of the word $w$ in the training samples with class $c$.
+where $N_c$ is the number of documents of class $c$ and $\texttt{count}(w, c)$ counts the occurrences of the word $w$ in the training samples of class $c$.
 
 \begin{remark}
 Laplace smoothing is used to avoid zero probabilities.
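As a sanity check of the remark on Laplace smoothing (counts and sizes invented for illustration), the usual add-one variant of the estimate above is
\[ \prob{w_i | c} = \frac{\texttt{count}(w_i, c) + 1}{\sum_{v \in V} \texttt{count}(v, c) + |V|} \]
so an unseen word with $\texttt{count}(w_i, c) = 0$, $\sum_{v} \texttt{count}(v, c) = 1000$ and $|V| = 5000$ gets probability $\frac{1}{6000} \approx 1.7 \times 10^{-4}$ instead of $0$.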
@@ -226,7 +226,7 @@ Naive Bayes has the following properties:
 \end{cases}
 \]
 By applying a log-transformation and inverting the sign, this corresponds to the cross-entropy loss in the binary case:
-\[ \mathcal{L}_{\text{BCE}}(\hat{y}, y) = -\log \prob{y | x} = -[y \log(\hat{y}) + (1-y)\log(1-\hat{y})] \]
+\[ \mathcal{L}_{\text{BCE}}(\hat{y}, y) = -\log(\prob{y | x}) = -[y \log(\hat{y}) + (1-y)\log(1-\hat{y})] \]
 
 \item[Optimization]
 As cross-entropy is convex, SGD is well suited to find the parameters $\vec{\theta}$ of a logistic regressor $f$ over batches of $m$ examples by solving:
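To see the behaviour of the loss in the hunk above on concrete numbers (values invented, natural logarithms): for a positive example $y = 1$,
\[ \mathcal{L}_{\text{BCE}}(0.9, 1) = -\log(0.9) \approx 0.105 \qquad \mathcal{L}_{\text{BCE}}(0.1, 1) = -\log(0.1) \approx 2.303 \]
i.e., a confidently wrong prediction is penalized far more heavily than a confident correct one.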
@@ -255,7 +255,7 @@ Logistic regression has the following properties:
 \end{itemize}
 
 \begin{remark}
-Logistic regression is also a good baseline when experimenting with text classifications.
+Logistic regression is also a good baseline when experimenting with text classification.
 
 As they are lightweight to train, it is a good idea to test both naive Bayes and logistic regression to determine the best baseline for other experiments.
 \end{remark}
@@ -60,7 +60,7 @@
 \end{remark}
 
 \item[Beam search] \marginnote{Beam search}
-Given a beam width $k$, perform a breadth-first search keeping at each branching level the top-$k$ tokens based on the probability of that sequence computes as:
+Given a beam width $k$, perform a breadth-first search keeping at each branching level the top-$k$ tokens based on the probability of that sequence computed as:
 \[ \log\left( \prob{y \mid x} \right) = \sum_{i=1}^{t} \log\left( \prob{ y_i \mid x, y_1, \dots, y_{i-1} } \right) \]
 
 \begin{example}
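A tiny worked instance of the scoring rule above (token probabilities invented, natural logarithms): comparing two length-2 hypotheses kept in a beam of width $k = 2$,
\[ \log\left( \prob{y_1, y_2 \mid x} \right) = \log 0.4 + \log 0.5 \approx -1.61 \qquad \log\left( \prob{y'_1, y'_2 \mid x} \right) = \log 0.3 + \log 0.2 \approx -2.81 \]
so the first hypothesis ranks higher in the beam at this branching level.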
@@ -192,7 +192,7 @@
 Architecture that produces contextual embeddings by considering both left-to-right and right-to-left context.
 
 \begin{remark}
-This architecture does feature extraction and is more suited for classification tasks.
+This architecture performs feature extraction and is more suited for classification tasks.
 \end{remark}
 
 \begin{description}
@@ -206,7 +206,7 @@
 \end{description}
 
 \item[Contextual embedding] \marginnote{Contextual embedding}
-Represent the meaning of word instances (i.e., dynamically depending on the surroundings).
+Represent the meaning of word instances (i.e., in a dynamic manner depending on the surroundings).
 
 \begin{remark}[Sequence embedding]
 Encoders usually have a classifier token (e.g., \texttt{[CLS]}) to model the whole sentence.
@@ -61,7 +61,7 @@
 
 \item Fine-tune the language model (i.e., train the policy) using an RL algorithm (e.g., PPO) and the learned reward model.
 
-Given a prompt $x$ and an answer $y$, the reward $r$ used for RL update is computed as:
+Given a prompt $x$ and an answer $y$, the reward $r$ used for the RL update is computed as:
 \[ r = r_\theta(y \mid x) - \lambda_\text{KL} D_\text{KL}(\pi_{\text{PPO}}(y \mid x) \Vert \pi_{\text{base}}(y \mid x)) \]
 where:
 \begin{itemize}
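To make the penalized reward of this hunk concrete (all values fabricated for illustration): if the reward model scores the answer with $r_\theta(y \mid x) = 2.0$, the KL divergence from the base model is $0.5$, and $\lambda_\text{KL} = 0.2$, then
\[ r = 2.0 - 0.2 \cdot 0.5 = 1.9 \]
so answers that drift further from the base policy receive a larger penalty.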
@@ -207,7 +207,7 @@
 Provide an initial hidden state to the RNN (e.g., speech-to-text).
 \end{description}
 
-\item[Sequence labelling] \marginnote{Sequence labelling}
+\item[Sequence labeling] \marginnote{Sequence labeling}
 Assign a class to each input token (e.g., POS-tagging, named-entity recognition, structure prediction, \dots).
 
 \item[Sequence classification] \marginnote{Sequence classification}
@@ -406,7 +406,7 @@
 \item[Neural language modeling] \marginnote{Neural language modeling}
 Use a neural network to predict the next word $w_{n+1}$ given an input sequence $w_{1..n}$. The general flow is the following:
 \begin{enumerate}
-\item Encode the input words into one-hot vectors ($\mathbb{R}^{|V| \times n}$).
+\item Encode the input words into one-hot vectors ($\mathbb{N}^{|V| \times n}$).
 \item Project the input vectors with an embedding matrix $\matr{E} \in \mathbb{R}^{d \times |V|}$ that encodes them into $d$-dimensional vectors.
 \item Pass the embedding into the hidden layers.
 \item The final layer is a probability distribution over the vocabulary ($\mathbb{R}^{|V| \times 1}$).
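As a shape check for the flow enumerated above (sizes chosen arbitrarily): with $|V| = 10\,000$, $n = 3$ and $d = 300$, the one-hot input has shape $10\,000 \times 3$, the projection with $\matr{E} \in \mathbb{R}^{300 \times 10\,000}$ gives
\[ \matr{E}\vec{x}_i \in \mathbb{R}^{300} \quad (i = 1, \dots, 3) \]
and the output layer produces a $10\,000 \times 1$ vector of next-word probabilities that sums to $1$.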
@@ -478,7 +478,7 @@
 Use a binary logistic regressor as classifier. The two classes are:
 \begin{itemize}
 \item Context words within the context window (positive label).
-\item Words randomly sampled (negative label).
+\item Randomly sampled words (negative label).
 \end{itemize}
 The probabilities can be computed as:
 \[
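The hunk is cut before the probability itself; in the standard skip-gram with negative sampling formulation it is the sigmoid of the dot product between target and context embeddings (a worked instance with an invented dot product):
\[ \prob{+ \mid w, c} = \sigma(\vec{c} \cdot \vec{w}) = \frac{1}{1 + e^{-\vec{c} \cdot \vec{w}}} \qquad \vec{c} \cdot \vec{w} = 2 \;\implies\; \prob{+ \mid w, c} \approx 0.88 \]
with $\prob{- \mid w, c} = 1 - \prob{+ \mid w, c} \approx 0.12$ for the negative class.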