Add DL object detection

2024-04-26 13:39:34 +02:00
parent dde9a66b67
commit 95b614b172
7 changed files with 209 additions and 2 deletions

@@ -6,7 +6,7 @@
\usepackage{geometry}
\usepackage{graphicx, xcolor}
-\usepackage{amsmath, amsfonts, amssymb, amsthm, mathtools, bm, upgreek, cancel}
+\usepackage{amsmath, amsfonts, amssymb, amsthm, mathtools, bm, upgreek, cancel, bbm}
\usepackage[bottom]{footmisc}
\usepackage[pdfusetitle]{hyperref}
\usepackage[nameinlink]{cleveref}

5 binary image files added (content not shown): 258 KiB, 1.2 MiB, 159 KiB, 16 KiB, 312 KiB.

@@ -656,4 +656,211 @@ The architecture is composed of two steps:
Segmentation was therefore done on a cropped portion of the input image.
Another approach is to use padding to maintain the same shape of the input in the output.
\end{remark}
\section{Object detection}
\begin{description}
\item[Intersection over union] \marginnote{Intersection over union}
Metric used to determine the quality of a bounding box w.r.t. a ground truth:
\[ \texttt{IoU}(A, B) = \frac{\vert A \cap B \vert}{\vert A \cup B \vert} \]
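A minimal Python sketch of this metric, assuming axis-aligned boxes given as $(x_1, y_1, x_2, y_2)$ corners:
\begin{verbatim}
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) corners."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])  # top-left of the intersection
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])  # bottom-right of the intersection
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # intersection 1, union 7 -> 1/7
\end{verbatim}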
\item[Object detection] \marginnote{Object detection}
Find bounding boxes containing a specific object or category.
There are two main strategies:
\begin{description}
\item[Region proposal] \marginnote{Region proposal}
Object-independent methods that use algorithms such as selective search to exploit the texture and structure of the image and propose regions of interest.
\item[Single-shot] \marginnote{Single shot}
Fast methods that predict bounding boxes and class scores in a single forward pass, oriented towards real-time applications.
\end{description}
\begin{figure}[H]
\centering
\includegraphics[width=0.35\linewidth]{./img/object_detection.png}
\caption{Example of bounding boxes}
\end{figure}
\end{description}
\subsection{YOLOv3}
YOLO is a fully convolutional neural network belonging to the family of single-shot methods.
% Given an image, YOLO downsamples it to obtain a feature map .
% Each cell of the feature map makes bounding box predictions.
\begin{description}
\item[Anchor box] \marginnote{Anchor box}
It has been shown that directly predicting the width and height of the bounding boxes leads to unstable gradients during training.
A common solution to this problem is to use pre-defined bounding boxes (anchors).
Anchors are selected by k-means clustering of the training set bounding boxes, using \texttt{IoU} as the similarity metric (i.e. the most common box shapes are identified).
Then, the network learns to draw bounding boxes by placing and scaling the anchors.
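A minimal NumPy sketch of this clustering, assuming the training boxes are reduced to their $(w, h)$ pairs; the median is used as the cluster representative here, which is one common choice:
\begin{verbatim}
import numpy as np

def iou_wh(wh, anchors):
    # IoU between a (w, h) box and each anchor, all boxes centred at the origin.
    inter = np.minimum(wh[0], anchors[:, 0]) * np.minimum(wh[1], anchors[:, 1])
    return inter / (wh[0] * wh[1] + anchors[:, 0] * anchors[:, 1] - inter)

def kmeans_anchors(boxes_wh, k, iters=100):
    boxes_wh = np.asarray(boxes_wh, dtype=float)
    rng = np.random.default_rng(0)
    anchors = boxes_wh[rng.choice(len(boxes_wh), size=k, replace=False)]
    for _ in range(iters):
        # Assign each box to the anchor with the highest IoU (lowest 1 - IoU).
        assign = np.array([np.argmax(iou_wh(wh, anchors)) for wh in boxes_wh])
        # Update each anchor with the median shape of its cluster.
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = np.median(boxes_wh[assign == j], axis=0)
    return anchors
\end{verbatim}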
\item[Architecture] \marginnote{YOLO architecture}
An input image is progressively downsampled through convolutions by a factor of $2^5$ to obtain a feature map of $S \times S$ cells
(e.g. a $416 \times 416$ image is downsampled into a $13 \times 13$ grid).
Each entry of the feature map has a depth of $(B \times (5+C))$ where:
\begin{itemize}
\item $B$ is the number of bounding boxes (one per anchor) the cell proposes.
\item $C$ is the number of object classes.
\end{itemize}
Therefore, each bounding box prediction carries $5+C$ attributes:
\begin{itemize}
\item $t_x$ and $t_y$ describe the center coordinates of the box (relative to the predicting cell).
\item $t_w$ and $t_h$ describe the width and height of the box (relative to the anchor).
\item $p_o$ is an objectness score that indicates the probability that an object is contained in the predicted bounding box (useful for thresholding).
\item $p_1, \dots, p_C$ are the probabilities associated to each class.
Since YOLOv3, the probability of each class is given by an independent sigmoid instead of a softmax over all classes.
This makes it possible to associate an object with multiple categories.
\end{itemize}
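For instance, with $B = 3$ anchors per scale and the $C = 80$ classes of COCO, each cell outputs $3 \times (5 + 80) = 255$ values, which is the depth of the YOLOv3 detection layers.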
\begin{figure}[H]
\centering
\includegraphics[width=0.6\linewidth]{./img/yolo_architecture.png}
\end{figure}
\item[Inference] \marginnote{YOLO inference}
\begin{remark}
Each cell of the feature map is identified by a set of coordinates relative to the feature map itself
(e.g. the first cell is at coordinates $(0,0)$ and the one to its right is at $(1, 0)$).
\end{remark}
Given a cell of the feature map at coordinates $(c_x, c_y)$, consider its $i$-th bounding box prediction.
The bounding box is computed using the following parameters:
\begin{itemize}
\item The predicted relative position and dimension $\langle t_x, t_y, t_w, t_h \rangle$ of the box.
\item The width $p_w$ and height $p_h$ of the anchor associated with the $i$-th prediction of the cell.
\end{itemize}
Then, the bounding box position and dimension (relative to the feature map) are computed as follows:
\[
\begin{split}
b_x &= c_x + \sigma(t_x) \\
b_y &= c_y + \sigma(t_y) \\
b_w &= p_w \cdot e^{t_w} \\
b_h &= p_h \cdot e^{t_h}
\end{split}
\]
where:
\begin{itemize}
\item $(b_x, b_y)$ are the coordinates of the center of the box.
\item $b_w$ and $b_h$ are the width and height of the box.
\item $\sigma$ is the sigmoid function.
\end{itemize}
\begin{figure}[H]
\centering
\includegraphics[width=0.5\linewidth]{./img/yolo_anchor.png}
\end{figure}
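A minimal Python sketch of this decoding step; the input values in the example are arbitrary:
\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode_box(t, cell_xy, anchor_wh):
    # t = (t_x, t_y, t_w, t_h); cell_xy = (c_x, c_y); anchor_wh = (p_w, p_h).
    # All quantities are expressed in feature-map units.
    b_x = cell_xy[0] + sigmoid(t[0])   # the center stays inside the predicting cell
    b_y = cell_xy[1] + sigmoid(t[1])
    b_w = anchor_wh[0] * np.exp(t[2])  # the anchor is rescaled exponentially
    b_h = anchor_wh[1] * np.exp(t[3])
    return b_x, b_y, b_w, b_h

# Cell (6, 6) of a 13x13 grid with an arbitrary 3.6 x 5.2 anchor.
print(decode_box((0.2, -0.1, 0.3, 0.1), (6, 6), (3.6, 5.2)))
\end{verbatim}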
\item[Training] \marginnote{YOLO training}
During training, for each ground truth bounding box,
only the cell containing its center and the anchor with the highest \texttt{IoU} with that box are responsible for predicting it.
In other words, only that combination of cell and anchor contributes the localization and classification terms of the loss for that object (the other predictions only contribute to the no-object term).
Given an $S \times S$ feature map and $B$ anchors, YOLO combines two losses over all predictions:
\begin{descriptionlist}
\item[Localization loss]
Measures the positioning of the bounding boxes:
\[
\mathcal{L}_\text{loc} = \lambda_\text{coord} \sum_{i=0}^{S \times S} \sum_{j=0}^{B} \mathbbm{1}_{ij}^\text{obj} \Big(
(x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 +
(\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2
\Big)
\]
where:
\begin{itemize}
\item $\mathbbm{1}_{ij}^\text{obj}$ is an indicator function that is 1 if the $j$-th anchor of the $i$-th cell is responsible for detecting the object (and 0 otherwise).
\item $(x_i, y_i)$ are the predicted coordinates of the box. $(\hat{x}_i, \hat{y}_i)$ are the ground truth coordinates.
\item $w_i$ and $h_i$ are the predicted width and height of the box. $\hat{w}_i$ and $\hat{h}_i$ are the ground truth dimensions.
\item $\lambda_\text{coord}$ is a hyperparameter (the default is 5).
\end{itemize}
\item[Classification loss]
Considers the objectness score and the predicted classes:
\[
\begin{split}
\mathcal{L}_\text{cls} = &\sum_{i=0}^{S \times S} \sum_{j=0}^{B} (\mathbbm{1}_{ij}^\text{obj} + \lambda_\text{no-obj}(1-\mathbbm{1}_{ij}^\text{obj}))(C_{ij} - \hat{C}_{ij})^2 \\
&+ \sum_{i=0}^{S \times S} \sum_{c \in \mathcal{C}} \mathbbm{1}_{i}^\text{obj} (p_i(c) - \hat{p}_i(c))^2
\end{split}
\]
where:
\begin{itemize}
\item $\mathbbm{1}_{ij}^\text{obj}$ is defined as above.
\item $\mathbbm{1}_{i}^\text{obj}$ is 1 if the $i$-th cell is responsible for classifying the object.
\item $C_{ij}$ is the predicted objectness score. $\hat{C}_{ij}$ is the ground truth.
\item $p_i(c)$ is the predicted probability of belonging to class $c$. $\hat{p}_i(c)$ is the ground truth.
\item $\lambda_\text{no-obj}$ is a hyperparameter (the default is 0.5).
It is used to down-weight the objectness error of the predictions that are not responsible for detecting any object.
\end{itemize}
\end{descriptionlist}
The final loss is the sum of the two losses:
\[\mathcal{L} = \mathcal{L}_\text{loc} + \mathcal{L}_\text{cls} \]
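A minimal NumPy sketch of this loss for a single scale; as a simplification, the class term is computed per responsible prediction rather than per cell, and the predictions are assumed to be already decoded and matched to the ground truth:
\begin{verbatim}
import numpy as np

def yolo_loss(pred, target, obj_mask, lambda_coord=5.0, lambda_noobj=0.5):
    # pred, target: arrays of shape (S*S, B, 5 + C) holding
    #   (x, y, w, h, objectness, class probabilities); widths and heights
    #   are assumed to be already decoded (positive).
    # obj_mask: boolean array of shape (S*S, B), True where a prediction is
    #   responsible for a ground-truth box.
    noobj_mask = ~obj_mask

    # Localization: only responsible predictions, square roots on w and h.
    xy_err = np.sum((pred[obj_mask, 0:2] - target[obj_mask, 0:2]) ** 2)
    wh_err = np.sum((np.sqrt(pred[obj_mask, 2:4]) - np.sqrt(target[obj_mask, 2:4])) ** 2)
    loss_loc = lambda_coord * (xy_err + wh_err)

    # Objectness: responsible predictions at full weight, the others down-weighted.
    obj_err = np.sum((pred[obj_mask, 4] - target[obj_mask, 4]) ** 2)
    noobj_err = lambda_noobj * np.sum((pred[noobj_mask, 4] - target[noobj_mask, 4]) ** 2)

    # Classification: only predictions responsible for an object.
    cls_err = np.sum((pred[obj_mask, 5:] - target[obj_mask, 5:]) ** 2)

    return loss_loc + obj_err + noobj_err + cls_err
\end{verbatim}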
\end{description}
\subsection{Multi-scale processing}
\begin{description}
\item[Feature pyramid] \marginnote{Feature pyramid}
Techniques to represent the input image (or its features) at multiple scales in order to detect objects of different sizes.
Possible approaches are:
\begin{descriptionlist}
\item[Featurized image pyramid]
A pyramid of images at different scales is built. The features at each scale are computed independently (which makes this approach slow).
\item[Single feature map]
Progressively extract features from a single image and only use features at the highest level.
\item[Pyramidal feature hierarchy]
Reuse the hierarchical features extracted by a convolutional network and use them as in the featurized image pyramid approach.
\item[Feature Pyramid Network]
Progressively extract higher-level features in a bottom-up forward pass and then inject them back into the lower pyramid levels through a top-down pathway with lateral connections (a minimal sketch of the merging step is given after this list).
\begin{figure}[H]
\centering
\includegraphics[width=0.4\linewidth]{./img/pyramid_network.png}
% \caption{Feature pyramid network workflow}
\end{figure}
\end{descriptionlist}
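A minimal NumPy sketch of the top-down merging step of a feature pyramid network, assuming three backbone maps \texttt{c3} (finest) to \texttt{c5} (coarsest) of shape $(C, H, W)$ and $1 \times 1$ channel-mixing matrices \texttt{w3}, \texttt{w4}, \texttt{w5} as the lateral connections:
\begin{verbatim}
import numpy as np

def upsample2x(x):
    # Nearest-neighbour upsampling of a (C, H, W) feature map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def conv1x1(x, w):
    # 1x1 convolution: (C_in, H, W) -> (C_out, H, W) via the mixing matrix w.
    c, h, wd = x.shape
    return (w @ x.reshape(c, -1)).reshape(w.shape[0], h, wd)

def fpn_top_down(c3, c4, c5, w3, w4, w5):
    # Merge backbone features into pyramid maps (the 3x3 smoothing
    # convolutions of the original FPN are omitted here).
    p5 = conv1x1(c5, w5)
    p4 = conv1x1(c4, w4) + upsample2x(p5)  # lateral connection + coarser map
    p3 = conv1x1(c3, w3) + upsample2x(p4)
    return p3, p4, p5
\end{verbatim}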
\begin{remark}
YOLOv3 makes predictions at three scales ($13 \times 13$, $26 \times 26$ and $52 \times 52$ grids for a $416 \times 416$ input) using a feature pyramid network.
\end{remark}
\begin{figure}[H]
\centering
\includegraphics[width=0.95\linewidth]{./img/feature_pyramid.png}
\caption{Feature pyramid recap}
\end{figure}
\end{description}
\subsection{Non-maximum suppression}
\begin{description}
\item[Non-maximum suppression] \marginnote{Non-maximum suppression}
Method to remove multiple detections of the same object.
Given the bounding boxes $BB_c$ of a class $c$ and a threshold $t$, NMS does the following:
\begin{enumerate}
\item Sort $BB_c$ by decreasing objectness score.
\item While $BB_c$ is not empty:
\begin{enumerate}
\item Pop the first box $p$ from $BB_c$.
\item Keep $p$ as a final detection.
\item Remove from $BB_c$ all the boxes $s$ with $\texttt{IoU}(p, s) > t$.
\end{enumerate}
\end{enumerate}
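A minimal Python sketch of this procedure, reusing the \texttt{iou} function from the earlier sketch:
\begin{verbatim}
def nms(boxes, scores, t):
    # Indices of the boxes sorted by decreasing objectness score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        p = order.pop(0)  # highest-scoring box left: keep it
        keep.append(p)
        # Discard every remaining box that overlaps p more than the threshold.
        order = [i for i in order if iou(boxes[p], boxes[i]) <= t]
    return keep
\end{verbatim}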
\end{description}