@@ -15,5 +15,6 @@
\include{./sections/_object_detection.tex}
\include{./sections/_segmentation.tex}
\include{./sections/_depth_estimation.tex}
\include{./sections/_metric_learning.tex}

\end{document}
@ -70,7 +70,7 @@
|
||||
\end{description}
|
||||
|
||||
\begin{description}
|
||||
\item[Monodepth]
|
||||
\item[Monodepth (no left-right)] \marginnote{Monodepth (no left-right)}
|
||||
Network that takes as input the left (or right) image of a stereo vision system and predicts the left (or right) disparity.
|
||||
|
||||
\begin{description}
|
||||
@@ -116,6 +116,111 @@
\includegraphics[width=0.7\linewidth]{./img/_monodepth_correct.pdf}
\caption{Actual training flow}
\end{figure}

\begin{description}
\item[Reconstruction loss] \marginnote{Reconstruction loss}
A mix of the structural similarity index (SSIM), which measures a perceptual distance, and the L1 norm:
\[ \mathcal{L}_{\text{ap}}(x^{(i, L)}) = \frac{1}{N} \sum_{(u, v)} \alpha \frac{1-\texttt{SSIM}(x_{u, v}^{(i, L)}, \hat{x}_{u, v}^{(i, L)})}{2} + (1-\alpha) \left\Vert x_{u, v}^{(i, L)} - \hat{x}_{u, v}^{(i, L)} \right\Vert_1 \]
where $x^{(i, L)}$ is the $i$-th input left image and $\hat{x}^{(i, L)}$ the reconstructed left image.

\item[Disparity smoothness] \marginnote{Disparity smoothness}
Loss penalty exploiting the fact that disparity tends to be locally smooth and to change only at edges:
\[ \mathcal{L}_{\text{ds}}(x^{(i, L)}) = \frac{1}{N} \sum_{(u, v)} \left(\left\vert \partial_u d_{u, v}^{(i, L)} \right\vert e^{- \Vert \partial_u x_{u, v}^{(i, L)} \Vert_1} + \left\vert \partial_v d_{u, v}^{(i, L)} \right\vert e^{- \Vert \partial_v x_{u, v}^{(i, L)} \Vert_1} \right) \]
where $x^{(i, L)}$ is the $i$-th input left image and $d^{(i, L)}$ the predicted left disparity.

In this way:
\begin{itemize}
\item If the gradient of $x_{u, v}^{(i, L)}$ is small (i.e., $e^{- \Vert \partial_* x_{u, v}^{(i, L)} \Vert_1} \rightarrow 1$), the gradient of $d_{u, v}^{(i, L)}$ is forced to be small too.
\item If the gradient of $x_{u, v}^{(i, L)}$ is large (i.e., $e^{- \Vert \partial_* x_{u, v}^{(i, L)} \Vert_1} \rightarrow 0$), the gradient of $d_{u, v}^{(i, L)}$ is unconstrained and can be either large or small.
\end{itemize}
A code sketch of both losses is given below.
\end{description}
\end{description}
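
A minimal PyTorch sketch of these two losses, assuming NCHW tensors; the function names and the simplified $3 \times 3$ average-pooling SSIM are illustrative assumptions, not the reference implementation:
\begin{verbatim}
import torch
import torch.nn.functional as F

def ssim(x, y, C1=0.01**2, C2=0.03**2):
    # Simplified SSIM with 3x3 average-pooling windows.
    mu_x = F.avg_pool2d(x, 3, 1, padding=1)
    mu_y = F.avg_pool2d(y, 3, 1, padding=1)
    var_x = F.avg_pool2d(x**2, 3, 1, padding=1) - mu_x**2
    var_y = F.avg_pool2d(y**2, 3, 1, padding=1) - mu_y**2
    cov_xy = F.avg_pool2d(x*y, 3, 1, padding=1) - mu_x*mu_y
    num = (2*mu_x*mu_y + C1) * (2*cov_xy + C2)
    den = (mu_x**2 + mu_y**2 + C1) * (var_x + var_y + C2)
    return num / den

def reconstruction_loss(x, x_hat, alpha=0.85):
    # L_ap: alpha*(1 - SSIM)/2 + (1 - alpha)*L1, averaged over pixels.
    ssim_term = (1 - ssim(x, x_hat)) / 2
    l1_term = (x - x_hat).abs()
    return (alpha*ssim_term + (1 - alpha)*l1_term).mean()

def smoothness_loss(disp, img):
    # L_ds: disparity gradients are penalized, except where the image
    # itself has strong gradients (i.e., at edges).
    d_du = (disp[:, :, :, 1:] - disp[:, :, :, :-1]).abs()
    d_dv = (disp[:, :, 1:, :] - disp[:, :, :-1, :]).abs()
    i_du = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean(1, keepdim=True)
    i_dv = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean(1, keepdim=True)
    return (d_du*torch.exp(-i_du)).mean() + (d_dv*torch.exp(-i_dv)).mean()
\end{verbatim}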

\begin{remark}
Monodepth without left-right processing achieves fairly good results, but it exhibits texture-copy artifacts and errors at depth discontinuities.

\begin{figure}[H]
\centering
\includegraphics[width=0.9\linewidth]{./img/monodepth_no_lr_results.png}
\end{figure}
\end{remark}

\begin{remark}
Monodepth in this form requires stereo pairs for training but exploits only one of the two images.
\end{remark}

\item[Monodepth (left-right)] \marginnote{Monodepth (left-right)}
The network predicts both the left and the right disparity and reconstructs both the left and the right image.

\begin{description}
\item[Disparity consistency loss] \marginnote{Disparity consistency loss}
Enforces that the shifts of the two estimated disparities are consistent:
\[ \mathcal{L}_{\text{lr}}(x^{(i, L)}, x^{(i, R)}) = \frac{1}{N} \sum_{(u, v)} \left| d_{u, v}^{(i, L)} - d^{(i, R)}_{u+d_{u, v}^{(i, L)}, v} \right| + \frac{1}{N} \sum_{(u, v)} \left| d_{u+d_{u, v}^{(i, R)}, v}^{(i, L)} - d^{(i, R)}_{u, v} \right| \]
where $d^{(i, L)}$ and $d^{(i, R)}$ are the $i$-th predicted left and right disparities, respectively.

The overall loss is the following:
\[
\begin{split}
\mathcal{L}(x^{(i, L)}, x^{(i, R)}) = &\,\alpha_\text{ap} \left( \mathcal{L}_\text{ap}(x^{(i, L)}) + \mathcal{L}_\text{ap}(x^{(i, R)}) \right) \\
&+ \alpha_\text{ds} \left( \mathcal{L}_\text{ds}(x^{(i, L)}) + \mathcal{L}_\text{ds}(x^{(i, R)}) \right) \\
&+ \alpha_\text{lr} \mathcal{L}_\text{lr}(x^{(i, L)}, x^{(i, R)})
\end{split}
\]

\begin{figure}[H]
\centering
\includegraphics[width=0.7\linewidth]{./img/_monodepth_lr.pdf}
\end{figure}
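
A sketch of one direction of the consistency term, assuming pixel-unit disparities and nearest-pixel (rounded) indexing; an actual implementation would use differentiable bilinear sampling, and the function name is an assumption:
\begin{verbatim}
import torch

def lr_consistency(disp_l, disp_r):
    # disp_l, disp_r: (H, W) predicted disparities in pixel units.
    H, W = disp_l.shape
    u = torch.arange(W).expand(H, W)
    # Sample the right disparity at column u + d^L(u, v), clamped to
    # the image; rounding stands in for bilinear sampling here.
    u_shift = (u + disp_l.round().long()).clamp(0, W - 1)
    disp_r_at_l = torch.gather(disp_r, 1, u_shift)
    return (disp_l - disp_r_at_l).abs().mean()

# The symmetric right-to-left term is analogous, and the overall loss is
# a_ap*(L_ap^L + L_ap^R) + a_ds*(L_ds^L + L_ds^R) + a_lr*L_lr.
\end{verbatim}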

\item[Inference]
Only the left disparity is used to determine depth; the rest of the pipeline is not needed at test time.

\item[Architecture]
Monodepth is implemented as a U-Net-like network:
\begin{itemize}
\item Up-convolutions are substituted with bilinear up-sampling to avoid checkerboard artifacts.
\item Disparity maps are computed at several resolutions and all of them are processed by the loss, for alignment reasons.
\end{itemize}

\begin{remark}
Empirically, better encoders improve performance.
\end{remark}

\begin{figure}[H]
\centering
\includegraphics[width=0.7\linewidth]{./img/monodepth_lr_results.png}
\caption{Comparison of Monodepth with and without left-right processing}
\end{figure}
\end{description}
\end{description}


\subsection{Structure from motion learner}

\begin{description}
\item[Structure from motion learner (SfMLearner)] \marginnote{Structure from motion learner (SfMLearner)}
Relaxes the stereo-image assumption by using monocular video frames.

The network takes as input a target image and nearby image(s) and is composed of two branches:
\begin{descriptionlist}
\item[Depth CNN] Takes as input the target image and estimates its depth map.

\item[Pose CNN] Takes as input the target and nearby images and estimates the camera poses that project from the target to the nearby images.
\end{descriptionlist}
The outputs of both networks are used to reconstruct the target image (as sketched below), and a reconstruction loss is used for training.

\begin{figure}[H]
\centering
\includegraphics[width=0.5\linewidth]{./img/_sfmlearner.pdf}
\caption{SfMLearner with two nearby images}
\end{figure}
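
A minimal sketch of the view synthesis at the core of the reconstruction loss: target pixels are back-projected with the predicted depth, moved into the nearby (source) frame with the predicted pose, and the source image is sampled there. Names, shapes, and the single-source setting are assumptions:
\begin{verbatim}
import torch
import torch.nn.functional as F

def reconstruct_target(src_img, tgt_depth, T_tgt_to_src, K):
    # src_img: (1, 3, H, W), tgt_depth: (1, 1, H, W),
    # T_tgt_to_src: (4, 4) camera pose, K: (3, 3) intrinsics.
    _, _, H, W = src_img.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32),
                          indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).reshape(3, -1)
    # Back-project target pixels to 3D using the predicted depth.
    cam = torch.linalg.inv(K) @ pix * tgt_depth.reshape(1, -1)
    cam_h = torch.cat([cam, torch.ones(1, H * W)], dim=0)  # homogeneous
    # Move the points into the source frame and project them.
    src = (T_tgt_to_src @ cam_h)[:3]
    proj = K @ src
    uv = proj[:2] / proj[2].clamp(min=1e-6)
    # Normalize to [-1, 1] and warp the source image (bilinear sampling).
    grid = torch.stack([2 * uv[0] / (W - 1) - 1,
                        2 * uv[1] / (H - 1) - 1], dim=-1).reshape(1, H, W, 2)
    return F.grid_sample(src_img, grid, align_corners=True)

# Training minimizes e.g. (tgt_img - reconstruct_target(...)).abs().mean().
\end{verbatim}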
\end{description}


\subsection{Depth Pro}

\begin{description}
\item[Depth Pro] \marginnote{Depth Pro}
Extension of SfMLearner trained using more datasets.
\end{description}

@@ -0,0 +1,124 @@
\chapter{Metric learning}


\begin{description}
\item[Metric learning] \marginnote{Metric learning}
Task of training a network that produces discriminative embeddings (i.e., with a clustered structure) such that:
\begin{itemize}
\item The distance between related objects (i.e., the intra-class distance) is minimized.
\item The distance between different objects (i.e., the inter-class distance) is maximized.
\end{itemize}
\end{description}


\section{Face recognition}

\begin{description}
\item[Face recognition] \marginnote{Face recognition}
Given a database of identities, classify a query face.

\begin{description}
\item[Open-world setting]
Setting in which identities can easily be added or removed.
\end{description}
\end{description}


\subsection{Face recognition as classification}

\begin{description}
\item[Plain classifier] \marginnote{Plain classifier}
Consider each identity as a class and use a CNN with a softmax head to classify the input image.

\begin{remark}
This approach requires a large softmax head and does not allow adding or removing identities without retraining.
\end{remark}

\item[kNN classifier] \marginnote{kNN classifier}
Use a feature extractor to embed faces and a kNN classifier to recognize them.

\begin{description}
\item[Gallery] \marginnote{Gallery}
Set of embeddings of known identities.
\end{description}

\begin{remark}
This approach makes it easy to add or remove identities.
\end{remark}
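
A minimal sketch of gallery-based recognition with a 1-NN rule; \texttt{embed} stands for any pretrained feature extractor, and all names are illustrative assumptions:
\begin{verbatim}
import numpy as np

def build_gallery(embed, faces, identities):
    # One embedding per known face; adding or removing an identity
    # only edits these arrays (no retraining of the extractor).
    return np.stack([embed(f) for f in faces]), list(identities)

def recognize(embed, query_face, gallery_embs, gallery_ids):
    # 1-NN over Euclidean distances in embedding space.
    dists = np.linalg.norm(gallery_embs - embed(query_face), axis=1)
    return gallery_ids[int(np.argmin(dists))]
\end{verbatim}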

\begin{figure}[H]
\centering
\includegraphics[width=0.7\linewidth]{./img/_cnn_knn_face_recognition.pdf}
\end{figure}

\begin{remark}
Feature extractors for classification are trained with the cross-entropy loss and learn semantically rich embeddings. However, when classifying, these embeddings are passed through a final linear layer, so it is sufficient for them to be linearly separable.

In other words, the distance between elements of the same class can be arbitrarily large and the distance between different classes can be arbitrarily small, as long as the classes remain linearly separable.

\begin{figure}[H]
\centering
\includegraphics[width=0.45\linewidth]{./img/_mnist_embeddings.pdf}
\caption{MNIST embeddings in 2D}
\end{figure}
\end{remark}
\end{description}


\section{Face verification}

\begin{description}
\item[Face verification] \marginnote{Face verification}
Task of confirming that two faces represent the same identity. This problem can be solved by either:
\begin{itemize}
\item Using better metrics than the Euclidean distance (e.g., as done in DeepFace).
\item Using better embeddings (e.g., as done in DeepID or FaceNet).
\end{itemize}

\begin{remark}
This task can be used to solve face recognition (e.g., by verifying the query face against each identity in the gallery).
\end{remark}
\end{description}


\begin{description}
\item[Siamese network training] \marginnote{Siamese network training}
Train a network by comparing its outputs on two different inputs. This can be seen as training two copies of the same network with shared weights.

\begin{figure}[H]
\centering
\includegraphics[width=0.7\linewidth]{./img/_siamese_network.pdf}
\end{figure}

\item[Contrastive loss] \marginnote{Contrastive loss}
Loss to enforce clustered embeddings. It is defined as follows:
\[
\mathcal{L}\left( f(x^{(i)}), f(x^{(j)}) \right) =
\begin{cases}
\Vert f(x^{(i)}) - f(x^{(j)}) \Vert_2^2 & \text{if $y^{(i, j)} = +1$ (i.e., same class)} \\
- \Vert f(x^{(i)}) - f(x^{(j)}) \Vert_2^2 & \text{if $y^{(i, j)} = 0$ (i.e., different class)} \\
\end{cases}
\]
As the second term is not lower-bounded (i.e., the loss keeps decreasing as different-class embeddings are pushed arbitrarily far apart), a margin $m$ is included so that sufficiently distant negative pairs stop contributing:
\[
\begin{split}
\mathcal{L}\left( f(x^{(i)}), f(x^{(j)}) \right) &=
\begin{cases}
\Vert f(x^{(i)}) - f(x^{(j)}) \Vert_2^2 & \text{if $y^{(i, j)} = +1$} \\
\max\left\{0, m - \Vert f(x^{(i)}) - f(x^{(j)}) \Vert_2\right\}^2 & \text{if $y^{(i, j)} = 0$} \\
\end{cases} \\
&= y^{(i, j)} \Vert f(x^{(i)}) - f(x^{(j)}) \Vert_2^2 + (1-y^{(i, j)}) \max\left\{0, m - \Vert f(x^{(i)}) - f(x^{(j)}) \Vert_2\right\}^2
\end{split}
\]
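
A minimal sketch of a siamese training step with the margin contrastive loss above; \texttt{net} stands for any embedding network, and all names are assumptions:
\begin{verbatim}
import torch
import torch.nn.functional as F

def contrastive_loss(z_i, z_j, y, m=1.0):
    # z_i, z_j: (B, D) embeddings of the two inputs of each pair;
    # y: (B,) with 1 for same-class pairs and 0 otherwise.
    d = F.pairwise_distance(z_i, z_j)        # Euclidean distances
    pos = y * d.pow(2)                       # pull positives together
    neg = (1 - y) * F.relu(m - d).pow(2)     # push negatives beyond m
    return (pos + neg).mean()

# Siamese usage: the same network (shared weights) embeds both inputs.
# z_i, z_j = net(x_i), net(x_j)
# loss = contrastive_loss(z_i, z_j, y)
\end{verbatim}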

\begin{remark}
A margin $m^+$ can also be added to the positive branch to prevent collapsing all embeddings of the same class to the same point.
\end{remark}

\begin{remark}
The negative branch $\max\left\{0, m - \Vert f(x^{(i)}) - f(x^{(j)}) \Vert_2\right\}$ is the hinge loss, which is also used in SVMs.
\end{remark}

\begin{remark}
With L2 regularization, the fact that the second term is not lower-bounded is not strictly a problem: the weights lie on a hyper-sphere and therefore bound the outputs. Still, there is no need to push the embeddings excessively far away.
\end{remark}
\end{description}