Add NLP logistic regression, metrics, and emotion

2024-10-15 19:41:51 +02:00
parent 2170040f15
commit c10defa867
4 changed files with 325 additions and 2 deletions

@ -63,7 +63,7 @@
Given a vocabulary $V$ and a document $d$, the bag-of-words embedding of $d$ is a vector in $\mathbb{N}^{\vert V \vert}$ where the $i$-th position contains the number of occurrences of the $i$-th token in $d$.
\item[Multinomial naive Bayes classifier] \marginnote{Multinomial naive Bayes classifier}
Probabilistic (i.e., generative) classifier based on the assumption that features are independent given the class.
Generative probabilistic classifier based on the assumption that features are independent given the class.
Given a document $d = \{ w_1, \dots, w_n \}$, a naive Bayes classifier returns the class $\hat{c}$ with maximum posterior probability:
\[
@ -128,4 +128,327 @@
\prob{\texttt{-}} = \frac{3}{5} \qquad \prob{\texttt{predictable} | \texttt{-}} = \frac{1+1}{14+20} \quad \prob{\texttt{no} | \texttt{-}} = \frac{1+1}{14+20} \quad \prob{\texttt{fun} | \texttt{-}} = \frac{0+1}{14+20}
\end{gathered}
\]
\end{example}
\subsection{Optimizations}
Possible optimizations for naive Bayes applied to sentiment analysis are the following:
\begin{descriptionlist}
\item[Binarization] \marginnote{Binarization}
Generally, the information regarding the occurrence of a word is more important than its frequency. Therefore, instead of applying bag-of-words by counting, it is possible to produce a one-hot encoded vector to indicate which words are in the document.
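\begin{example}
A minimal Python sketch (function names and the toy vocabulary are illustrative, not from these notes) comparing count-based and binarized bag-of-words:
\begin{verbatim}
from collections import Counter

def bag_of_words(tokens, vocabulary):
    """Count-based bag-of-words vector over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]

def binarized_bag_of_words(tokens, vocabulary):
    """Binarized variant: 1 if the word occurs at least once, else 0."""
    present = set(tokens)
    return [1 if w in present else 0 for w in vocabulary]

vocabulary = ["fun", "no", "predictable", "great"]
tokens = "no fun no fun at all".split()
print(bag_of_words(tokens, vocabulary))            # [2, 2, 0, 0]
print(binarized_bag_of_words(tokens, vocabulary))  # [1, 1, 0, 0]
\end{verbatim}
\end{example}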
\item[Negation encoding] \marginnote{Negation encoding}
To encode negations, two approaches can be taken:
\begin{description}
\item[Negation annotation]
Add an annotation to negated words so that they are treated as new words.
\begin{example}
Prepend \texttt{NOT\char`_} to each word between a negation and the next punctuation:
\[ \text{didn't like this movie.} \mapsto \text{didn't \texttt{NOT\char`_}like \texttt{NOT\char`_}this \texttt{NOT\char`_}movie.} \]
\end{example}
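\begin{example}
A possible Python sketch of the annotation rule above (the token-level heuristics are illustrative and far simpler than a real tokenizer):
\begin{verbatim}
import re

def annotate_negation(text):
    """Prepend NOT_ to every token between a negation word and the
    next punctuation mark (a simple illustrative rule)."""
    out, negating = [], False
    for tok in text.split():
        if negating and not re.match(r"^[.,!?;:]", tok):
            out.append("NOT_" + tok)
        else:
            out.append(tok)
        if re.search(r"n't$|^not$|^no$|^never$", tok.lower()):
            negating = True
        if re.search(r"[.,!?;:]$", tok):
            negating = False
    return " ".join(out)

print(annotate_negation("didn't like this movie."))
# didn't NOT_like NOT_this NOT_movie.
\end{verbatim}
\end{example}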
\item[Parse tree]
Build a tree to encode the sentiment and interactions of the words. By propagating the sentiments bottom-up it is possible to determine the overall sentiment of the sequence.
\begin{example}
The parse tree for the sentence ``\texttt{This film doesn't care about cleverness, wit or any other kind of intelligent humor.}'' is the following:
\begin{figure}[H]
\centering
\includegraphics[width=0.6\linewidth]{./img/_sentiment_parse_tree.pdf}
\end{figure}
Due to the negation (\texttt{doesn't}), the whole positive sequence is negated.
\end{example}
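\begin{example}
A toy Python sketch of bottom-up sentiment propagation over such a tree (the node representation and combination rule are illustrative assumptions, not an actual parser):
\begin{verbatim}
def propagate(node):
    """Bottom-up sentiment propagation over a toy parse tree.
    A node is (sentiment, negates, children); leaves have no children.
    Sentiment is +1 (positive), -1 (negative), or 0 (neutral)."""
    sentiment, negates, children = node
    if children:
        # Simple combination rule: sum the children's sentiments.
        sentiment = sum(propagate(child) for child in children)
    if negates:
        sentiment = -sentiment  # a negation flips the whole subtree
    return sentiment

# "doesn't [care about cleverness, wit, ... intelligent humor]"
positive_phrase = (0, False, [(+1, False, []), (+1, False, [])])
sentence = (0, True, [positive_phrase])  # "doesn't" negates the phrase
print(propagate(sentence))  # -2: the positive subtree becomes negative
\end{verbatim}
\end{example}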
\end{description}
\item[Sentiment lexicon] \marginnote{Sentiment lexicon}
If training data is insufficient, external domain knowledge, such as a sentiment lexicon, can be used.
\begin{example}
A possible way to use a lexicon is to count the number of positive and negative words of the document according to the lexicon.
\end{example}
\begin{remark}
Possible ways to create a lexicon are through:
\begin{itemize}
\item Expert annotators.
\item Crowdsourcing in a two-step procedure:
\begin{enumerate}
\item Ask questions related to synonyms (e.g., which word is closest in meaning to \textit{startle}?).
\item Rate the association of words with emotions (e.g., how does \textit{startle} associate with \textit{joy}, \textit{fear}, \textit{anger}, \dots?).
\end{enumerate}
\item Semi-supervised induction of labels from a small set of annotated data (i.e., seed labels). It works by looking for words that co-occur with words whose sentiment is already known.
\item Supervised learning using annotated data.
\end{itemize}
\end{remark}
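\begin{example}
A minimal Python sketch of the counting approach from the example above (the lexicon entries are made up for illustration):
\begin{verbatim}
# Toy sentiment lexicon (illustrative entries, not from a real resource).
POSITIVE = {"fun", "great", "love", "enjoyable"}
NEGATIVE = {"boring", "predictable", "hate", "awful"}

def lexicon_features(tokens):
    """Count how many tokens appear in the positive/negative lexicon."""
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    return {"n_positive": pos, "n_negative": neg}

print(lexicon_features("a predictable but fun and enjoyable movie".split()))
# {'n_positive': 2, 'n_negative': 1}
\end{verbatim}
\end{example}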
\end{descriptionlist}
\subsection{Properties}
Naive Bayes has the following properties:
\begin{itemize}
\item It is generally effective with short documents and small training sets.
\item It is robust to irrelevant features (i.e., words that appear in both negative and positive sentences) as they cancel each other out.
\item It has good performance in domains with many equally important features (contrary to decision trees).
\item The independence assumption might produce overconfident probability estimates.
\end{itemize}
\begin{remark}
Naive Bayes is a good baseline when experimenting with text classification.
\end{remark}
\section{Logistic regression}
\begin{description}
\item[Feature engineering] \marginnote{Feature engineering}
Determine features by hand from the data (e.g., the number of positive and negative lexicon words).
\item[Binary logistic regression] \marginnote{Binary logistic regression}
Discriminative probabilistic model that computes the conditional probability $\prob{c | d}$ of a class $c$ given a document $d$.
Given the input features $\vec{x} = [x_1, \dots, x_n]$, logistic regression computes the following:
\[
\sigma\left( \sum_{i=1}^{n} w_i x_i + b \right) = \sigma(\vec{w}\vec{x} + b)
\]
where $\sigma$ is the sigmoid function.
\begin{description}
\item[Loss]
The loss function should aim to maximize the probability of the true label $y$ given the observation $\vec{x}$. This can be expressed as a Bernoulli distribution:
\[
\prob{y | x} = \hat{y}^y (1-\hat{y})^{1-y} = \begin{cases}
1 - \hat{y} & \text{if $y=0$} \\
\hat{y} & \text{if $y=1$} \\
\end{cases}
\]
By applying a log-transformation and inverting the sign, this corresponds to the cross-entropy loss in the binary case:
\[ \mathcal{L}_{\text{BCE}}(\hat{y}, y) = -\log \prob{y | x} = -[y \log(\hat{y}) + (1-y)\log(1-\hat{y})] \]
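\begin{example}
Assume (illustratively) that the regressor outputs $\hat{y} = 0.7$ for a document whose true label is $y = 1$. Then:
\[ \mathcal{L}_\text{BCE}(0.7, 1) = -[1 \cdot \log(0.7) + 0 \cdot \log(0.3)] = -\log(0.7) \approx 0.357 \]
If the true label were instead $y = 0$, the loss would be $-\log(0.3) \approx 1.204$: confident wrong predictions are penalized more heavily.
\end{example}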
\item[Optimization]
As the cross-entropy loss is convex in the parameters, SGD is well suited to find the parameters $\vec{\theta}$ of a logistic regressor $f$ over batches of $m$ examples by solving:
\[ \arg\min_{\vec{\theta}} \sum_{i=1}^{m} \mathcal{L}_\text{BCE}(f(x^{(i)}; \vec{\theta}), y^{(i)}) + \alpha \mathcal{R}(\vec{\theta}) \]
where $\alpha$ is the regularization factor and $\mathcal{R}(\vec{\theta})$ is the regularization term. Typical regularization approaches are:
\begin{descriptionlist}
\item[Ridge regression (L2)] $\mathcal{R}(\vec{\theta}) = \Vert \vec{\theta} \Vert^2_2 = \sum_{j=1}^{n} \vec{\theta}_j^2$.
\item[Lasso regression (L1)] $\mathcal{R}(\vec{\theta}) = \Vert \vec{\theta} \Vert_1 = \sum_{j=1}^{n} \vert \vec{\theta}_j \vert$.
\end{descriptionlist}
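\begin{example}
A minimal NumPy sketch of binary logistic regression trained with SGD and L2 regularization (the function names, learning rate, and toy features are illustrative assumptions):
\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, alpha=0.01, epochs=100):
    """SGD on the binary cross-entropy loss with an L2 penalty."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = sigmoid(w @ x_i + b)
            grad = y_hat - y_i   # gradient of BCE w.r.t. the logit
            w -= lr * (grad * x_i + alpha * w)  # L2 regularization term
            b -= lr * grad
    return w, b

# Two hand-crafted features per document, e.g. (#positive, #negative) words.
X = np.array([[3.0, 0.0], [2.0, 1.0], [0.0, 2.0], [1.0, 3.0]])
y = np.array([1, 1, 0, 0])
w, b = train_logistic_regression(X, y)
print(sigmoid(X @ w + b).round(2))  # higher probabilities for the first two
\end{verbatim}
\end{example}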
\end{description}
\item[Multinomial logistic regression] \marginnote{Multinomial logistic regression}
Extension of logistic regression to the multi-class case. The conditional probability becomes $\prob{y = c | x}$ and softmax is used in place of the sigmoid.
Cross-entropy is extended over the classes $C$:
\[ \mathcal{L}_\text{CE}(\hat{y}, y) = - \sum_{c \in C} \mathbbm{1}\{y = c\} \log \prob{y = c | x} \]
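\begin{example}
Assume (illustratively) three classes with logits $[2, 1, 0]$ for a document of true class $c_1$. The softmax yields probabilities of approximately $[0.665, 0.245, 0.090]$, so the cross-entropy loss is $-\log(0.665) \approx 0.408$.
\end{example}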
\end{description}
\subsection{Properties}
Logistic regression has the following properties:
\begin{itemize}
\item It is generally effective with large documents or datasets.
\item It is robust to correlated features.
\end{itemize}
\begin{remark}
Logistic regression is also a good baseline when experimenting with text classification.
As both are lightweight to train, it is a good idea to test naive Bayes and logistic regression to determine the best baseline for other experiments.
\end{remark}
\section{Metrics}
\subsection{Binary classification}
\begin{description}
\item[Contingency table] \marginnote{Contingency table}
$2 \times 2$ table matching predictions to ground truths. It contains true positives (\texttt{TP}), false positives (\texttt{FP}), false negatives (\texttt{FN}), and true negatives (\texttt{TN}).
\item[Recall] \marginnote{Recall}
$\frac{\texttt{TP}}{\texttt{TP} + \texttt{FN}}$.
\item[Precision] \marginnote{Precision}
$\frac{\texttt{TP}}{\texttt{TP} + \texttt{FP}}$.
\item[Accuracy] \marginnote{Accuracy}
$\frac{\texttt{TP} + \texttt{TN}}{\texttt{TP} + \texttt{FP} + \texttt{FN} + \texttt{TN}}$.
\begin{remark}
Accuracy is a reasonable metric only when classes are balanced.
\end{remark}
\item[F1 score] \marginnote{F1 score}
$\frac{2 \cdot \texttt{recall} \cdot \texttt{precision}}{\texttt{recall} + \texttt{precision}}$.
\end{description}
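\begin{example}
A short Python sketch of the metrics above, computed on a hypothetical contingency table:
\begin{verbatim}
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, accuracy, and F1 from a 2x2 contingency table."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1

# Hypothetical counts with a dominant negative class.
print(binary_metrics(tp=30, fp=10, fn=20, tn=940))
# (0.75, 0.6, 0.97, 0.666...): accuracy looks high only because
# the negative class dominates.
\end{verbatim}
\end{example}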
\subsection{Multi-class classification}
\begin{description}
\item[Confusion matrix] \marginnote{Confusion matrix}
$c \times c$ table matching predictions to ground truths, where $c$ is the number of classes.
\item[Precision/Recall] \marginnote{Precision/Recall}
Precision and recall can be defined class-wise (i.e., treat one class as the positive label and all the others as negative).
\item[Micro-average precision/recall] \marginnote{Micro-average precision/recall}
Compute the contingency table of each class and collapse them into a single table. Compute precision or recall on the pooled contingency table.
\begin{remark}
This approach is sensitive to the most frequent class.
\end{remark}
\item[Macro-average precision/recall] \marginnote{Macro-average precision/recall}
Compute precision or recall class-wise and then average over the classes.
\begin{remark}
This approach is reasonable if the classes are equally important.
\end{remark}
\begin{remark}
Macro-average is more common in NLP.
\end{remark}
\end{description}
\begin{example}
\phantom{}
\begin{figure}[H]
\centering
\includegraphics[width=0.5\linewidth]{./img/_confusion_matrix_example.pdf}
\caption{Confusion matrix}
\end{figure}
\begin{figure}[H]
\centering
\includegraphics[width=0.85\linewidth]{./img/_micro_macro_average_example.pdf}
\caption{
\parbox[t]{0.6\linewidth}{
Class-wise contingency tables, pooled contingency table, and micro/macro-average precision
}
}
\end{figure}
\end{example}
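\begin{example}
A short Python sketch of micro- versus macro-averaged precision from per-class one-vs-rest tables (the counts are hypothetical):
\begin{verbatim}
def micro_macro_precision(tables):
    """tables: list of per-class (TP, FP) pairs from one-vs-rest tables."""
    # Macro-average: precision per class, then mean over the classes.
    macro = sum(tp / (tp + fp) for tp, fp in tables) / len(tables)
    # Micro-average: pool the tables, then compute a single precision.
    tp_sum = sum(tp for tp, _ in tables)
    fp_sum = sum(fp for _, fp in tables)
    micro = tp_sum / (tp_sum + fp_sum)
    return micro, macro

# One frequent, well-predicted class and two rare, poorly-predicted ones.
tables = [(100, 20), (5, 5), (2, 8)]
micro, macro = micro_macro_precision(tables)
print(round(micro, 2), round(macro, 2))  # 0.76 0.51: micro is pulled up
                                         # by the frequent class
\end{verbatim}
\end{example}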
\subsection{Cross-validation}
\begin{description}
\item[$\mathbf{n}$-fold cross-validation] \marginnote{$n$-fold cross-validation}
Tune a classifier on different sections of the training data (a sketch follows this list):
\begin{enumerate}
\item Split the data into $n$ folds.
\item Hold out one fold and train the classifier on the remaining $n-1$ folds.
\item Evaluate the classifier on the held-out fold.
\item Repeat for each of the $n$ folds and average the results.
\end{enumerate}
\end{description}
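\begin{example}
A minimal Python sketch of the procedure above (the majority-class baseline and the toy data are illustrative assumptions):
\begin{verbatim}
import random

def n_fold_cross_validation(data, n, train_and_eval):
    """Split data into n folds; train on n-1 folds, evaluate on the
    held-out fold, repeat n times, and average the scores."""
    data = data[:]
    random.shuffle(data)
    folds = [data[i::n] for i in range(n)]
    scores = []
    for i in range(n):
        held_out = folds[i]
        training = [x for j, f in enumerate(folds) if j != i for x in f]
        scores.append(train_and_eval(training, held_out))
    return sum(scores) / n

# Toy usage: a majority-class "classifier" on labeled (doc, label) pairs.
def majority_baseline(training, held_out):
    labels = [y for _, y in training]
    majority = max(set(labels), key=labels.count)
    return sum(1 for _, y in held_out if y == majority) / len(held_out)

data = [("doc%d" % i, i % 3 == 0) for i in range(30)]  # ~1/3 positive
print(n_fold_cross_validation(data, n=5, train_and_eval=majority_baseline))
# around 0.67: the accuracy of always predicting the majority class
\end{verbatim}
\end{example}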
\subsection{Statistical significance}
\begin{description}
\item[$\mathbf{p}$-value] \marginnote{$p$-value}
Measure to determine whether a model $A$ is outperforming a model $B$ on a given test set by chance (i.e., test the null hypothesis $H_0$ that $A$ is not actually better than $B$ and that the observed difference arose by chance).
Given:
\begin{itemize}
\item A test set $x$,
\item A random variable $X$ over the test sets (i.e., another test set),
\item Two models $A$ and $B$, such that $A$ is better than $B$ by $\delta(x)$ on the test set $x$,
\end{itemize}
the $p$-value is defined as:
\[ p\text{-value}(x) = \prob{\delta(X) > \delta(x) | H_0} \]
There are two cases:
\begin{itemize}
\item $p\text{-value}(x)$ is large: the null hypothesis cannot be rejected (i.e., $\prob{\delta(X) > \delta(x)}$ is high even assuming that $A$ is not better than $B$), so the observed advantage of $A$ might be due to chance.
\item $p\text{-value}(x)$ is small (typically $< 0.05$ or $< 0.01$): the null hypothesis is rejected, so $A$ actually outperforms $B$.
\end{itemize}
\item[Bootstrapping test] \marginnote{Bootstrapping test}
Approach to compute $p$-values.
Given a test set $x$, multiple virtual test sets $\bar{x}^{(i)}$ are created by sampling with replacement (it is assumed that the new sets are representative). The performance difference $\delta(\cdot)$ between the two models is computed on each virtual set, and the $p$-value is estimated as the fraction of virtual test sets for which:
\[ \delta(\bar{x}^{(i)}) > 2\delta(x) \]
\begin{remark}
$\delta(x)$ is doubled because the bootstrap samples are drawn from a distribution centered on $\delta(x)$ rather than $0$: the test checks whether the difference exceeds its expected value by at least $\delta(x)$.
\end{remark}
\begin{example}
Consider two models $A$ and $B$, and a test set $x$ with 10 samples. From $x$, multiple new sets (in this case of the same size) can be sampled. In the following table, each cell indicates which model(s) correctly classified the sample (\texttt{--} means neither did):
\begin{center}
\footnotesize
\begin{tabular}{ccccccccccc|ccc}
\toprule
& 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & $A\%$ & $B\%$ & $\delta(\cdot)$ \\
\midrule
$x$ & $AB$ & $A$ & $AB$ & $B$ & $A$ & $B$ & $A$ & $AB$ & -- & $A$ & $0.7$ & $0.5$ & $0.2$ \\
$\bar{x}^{(1)}$ & $A$ & $AB$ & $A$ & $B$ & $B$ & $A$ & $B$ & $AB$ & -- & $AB$ & $0.6$ & $0.6$ & $0.0$ \\
$\vdots$ & $\vdots$ & $\vdots$ & $\vdots$ & $\vdots$ & $\vdots$ & $\vdots$ & $\vdots$ & $\vdots$ & $\vdots$ & $\vdots$ & $\vdots$ & $\vdots$ & $\vdots$ \\
\bottomrule
\end{tabular}
\end{center}
A possible way to sample $\bar{x}^{(1)}$ is (w.r.t. the indexes of the examples in $x$) $[2, 3, 3, 2, 4, 6, 2, 4, 1, 9, 1]$.
\end{example}
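\begin{example}
A possible Python sketch of the bootstrapping test on per-example correctness vectors (the 0/1 vectors below reproduce the $A\% = 0.7$ and $B\% = 0.5$ of the table above; everything else is illustrative):
\begin{verbatim}
import random

def paired_bootstrap_pvalue(correct_a, correct_b, n_samples=10000):
    """Paired bootstrap test on 0/1 correctness vectors of models A and B
    evaluated on the same test set."""
    n = len(correct_a)
    delta_x = (sum(correct_a) - sum(correct_b)) / n  # observed difference
    exceeds = 0
    for _ in range(n_samples):
        idx = [random.randrange(n) for _ in range(n)]  # with replacement
        delta_i = sum(correct_a[i] - correct_b[i] for i in idx) / n
        if delta_i > 2 * delta_x:
            exceeds += 1
    return exceeds / n_samples

a = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]  # A% = 0.7
b = [1, 0, 1, 1, 0, 1, 0, 1, 0, 0]  # B% = 0.5
print(paired_bootstrap_pvalue(a, b))
# if the p-value is small (< 0.05), reject H0: A truly outperforms B
\end{verbatim}
\end{example}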
\end{description}
\section{Affective meaning}
The affective meaning of a text corpus can vary depending on:
\begin{descriptionlist}
\item[Personality traits]
Stable behavior and personality of a person (e.g., nervous, anxious, reckless, \dots).
\item[Attitude]
Enduring sentiment towards objects or people (e.g., liking, loving, hating, \dots).
\item[Interpersonal stance]
Affective stance taken in a specific interaction (e.g., distant, cold, warm, \dots).
\item[Mood]
Affective state of low intensity and long duration often without an apparent cause (e.g., cheerful, gloomy, irritable, \dots).
\item[Emotion]
Brief response to an external or internal event of major significance (e.g., angry, sad, joyful, \dots).
\begin{remark}
Emotion is the most common subject in affective computing.
\end{remark}
\end{descriptionlist}
\subsection{Emotion}
\begin{description}
\item[Theory of emotion]
There are two main theories of emotion:
\begin{descriptionlist}
\item[Basic emotions] \marginnote{Basic emotions}
Discrete and fixed range of atomic emotions.
\begin{remark}
Emotions associated with a word might be in contrast. For instance, in the NRC Word-Emotion Association Lexicon the word \texttt{thirst} is associated with both \textit{anticipation} and \textit{surprise}.
\end{remark}
\item[Continuum emotions] \marginnote{Continuum emotions}
Describe emotions in a two- or three-dimensional space with the following features:
\begin{descriptionlist}
\item[Valence] Pleasantness of a stimulus.
\item[Arousal] Intensity of emotion.
\item[Dominance] Degree of control on the stimulus.
\end{descriptionlist}
\begin{remark}
Valence is often used as a measure of sentiment.
\end{remark}
\end{descriptionlist}
\end{description}