mirror of https://github.com/NotXia/unibo-ai-notes.git
synced 2025-12-14 18:51:52 +01:00
Fix typos <noupdate>
@@ -120,7 +120,7 @@
 \begin{description}
 \item[Reconstruction loss] \marginnote{Reconstruction loss}
 Mix between the structural similarity index (SSIM) (which measures a perceptual distance) and L1 norm:
-\[ \mathcal{L}_{\text{ap}}(x^{(i, L)}) = \frac{1}{N} \sum_{(u, v)} \alpha \frac{1-\texttt{SSIM}(x_{u, v}^{(i, L)}, \hat{x}_{u, v}^{(i, L)})}{2} + (1-\alpha) \left\Vert x_{u, v}^{(i, L)} - \hat{x}_{u, v}^{(i, L)} \right\Vert_1 \]
+\[ \mathcal{L}_{\text{ap}}(x^{(i, L)}) = \frac{1}{N} \sum_{(u, v)} \left( \alpha \frac{1-\texttt{SSIM}(x_{u, v}^{(i, L)}, \hat{x}_{u, v}^{(i, L)})}{2} + (1-\alpha) \left\Vert x_{u, v}^{(i, L)} - \hat{x}_{u, v}^{(i, L)} \right\Vert_1 \right) \]
 where $x^{(i, L)}$ is the $i$-th input left image and $\hat{x}^{(i, L)}$ the reconstructed left image.
 
 \item[Disparity smoothness] \marginnote{Disparity smoothness}
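A minimal NumPy sketch of the reconstruction loss in the hunk above (not from the notes): it uses a single global SSIM value instead of the per-pixel windowed SSIM used in practice, and alpha is a free hyperparameter (0.85 is a common choice in the monodepth literature).

import numpy as np

def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    # Simplified, image-level SSIM for images scaled to [0, 1].
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def reconstruction_loss(x, x_hat, alpha=0.85):
    # alpha * (1 - SSIM) / 2  +  (1 - alpha) * mean L1, as in L_ap above.
    dssim = (1.0 - ssim_global(x, x_hat)) / 2.0
    l1 = np.abs(x - x_hat).mean()
    return alpha * dssim + (1.0 - alpha) * l1

x = np.random.rand(64, 64)
print(reconstruction_loss(x, np.clip(x + 0.05 * np.random.randn(64, 64), 0, 1)))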
@@ -206,7 +206,7 @@
 \begin{descriptionlist}
 \item[Depth CNN] Takes as input the target image and estimates its depth map.
 
-\item[Pose CNN] Takes as input the target and nearby images and estimates the camera poses to project from target to nearby image.
+\item[Pose CNN] Takes as input the target and nearby images, and estimates the camera poses to project from target to nearby image.
 \end{descriptionlist}
 The outputs of both networks are used to reconstruct the target image and a reconstruction loss is used for training.
 
@@ -3,7 +3,7 @@
 
 \begin{description}
 \item[Generative task] \marginnote{Generative task}
-Given the training data $\{ x^{(i)} \}$, learn the distribution of the data so that a model can sample new examples:
+Given the training data $\{ x^{(i)} \}$, learn its distribution so that a model can sample new examples:
 \[ \hat{x}^{(i)} \sim p_\text{gen}(x; \matr{\theta}) \]
 
 \begin{figure}[H]
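As a toy illustration of the definition above (an added example, not from the notes): fit a Gaussian p_gen(x; theta) to 1-D training data by maximum likelihood and sample new examples from it.

import numpy as np

rng = np.random.default_rng(0)
x_train = rng.normal(loc=2.0, scale=0.5, size=1000)   # training data {x^(i)}

# "Learn the distribution": here theta = (mu, sigma), fitted by maximum likelihood.
mu, sigma = x_train.mean(), x_train.std()

# Sample new examples  x_hat ~ p_gen(x; theta).
x_new = rng.normal(loc=mu, scale=sigma, size=5)
print(mu, sigma, x_new)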
@@ -421,7 +421,7 @@
 \begin{remark}[Mode dropping/collapse]
 Only some modes of the distribution of the real data are represented by the mass of the generator.
 
-Consider the training objective of the optimal generator. Its main terms model coverage and quality, respectively:
+Consider the training objective of the optimal generator. The two terms model coverage and quality, respectively:
 \[
 \begin{gathered}
 -\frac{1}{I} \sum_{i=1}^I \log \left( D(x_i; \phi) \right) - \frac{1}{J} \sum_{j=1}^J \log \left( 1- D(G(z_j; \theta); \phi) \right) \\
@@ -592,7 +592,7 @@
 \[
 \begin{gathered}
 \x_t = \sqrt{1-\beta_t} \x_{t-1} + \sqrt{\beta_t}\noise_t \\
-\x_t \sim q(\x_t \mid \x_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}\x_{t-1}, \beta_t\matr{I})
+\x_t \sim q(\x_t \mid \x_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}\x_{t-1}; \beta_t\matr{I})
 \end{gathered}
 \]
 where:
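A small NumPy sketch of the forward diffusion step above; the linear beta_t schedule is an assumption for illustration.

import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)          # assumed linear noise schedule

def forward_step(x_prev, t):
    # x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps_t,  eps_t ~ N(0, I)
    eps = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - betas[t]) * x_prev + np.sqrt(betas[t]) * eps

x = rng.standard_normal((8, 8))             # stand-in for an image x_0
for t in range(T):
    x = forward_step(x, t)
print(x.std())   # approaches 1: x_T is approximately standard Gaussian noise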
@@ -1020,7 +1020,7 @@
 \begin{description}
 \item[Forward process]
 Use a family of non-Markovian forward distributions conditioned on the real image $\x_0$ and parametrized by a positive standard deviation $\vec{\sigma}$ defined as:
-\[ q_\vec{\sigma}(\x_1, \dots, \x_T \mid x_0) = q_{\sigma_T}(\x_T \mid \x_0) \prod_{t=2}^{T} q_{\sigma_t}(\x_{t-1} \mid \x_t, \x_0) \]
+\[ q_\vec{\sigma}(\x_1, \dots, \x_T \mid \x_0) = q_{\sigma_T}(\x_T \mid \x_0) \prod_{t=2}^{T} q_{\sigma_t}(\x_{t-1} \mid \x_t, \x_0) \]
 where:
 \[
 \begin{gathered}
@@ -1052,7 +1052,7 @@
 \item[Reverse process]
 Given a latent $\x_t$ and a DDPM model $\varepsilon_t(\cdot; \params)$, generation at time step $t$ is done as follows:
 \begin{enumerate}
-\item Compute an estimate for the current time step $t$ of the real image:
+\item Compute an estimate of the real image for the current time step $t$:
 \[ \hat{\x}_0 = \frac{\x_t - \sqrt{\alpha_{t-1}} \varepsilon_t(\x_t; \params)}{\sqrt{\alpha_t}} = f_\params(\x_t) \]
 Note that the formula comes from the usual $\x_t = \sqrt{\alpha_t}\x_0 + \sqrt{1-\alpha_t}\noise_t$.
 \item Sample the next image from:
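A sketch of the reverse step above with a dummy noise predictor standing in for eps_t(.; theta). Following the quoted relation x_t = sqrt(alpha_t) x_0 + sqrt(1 - alpha_t) eps_t, the estimate is written as (x_t - sqrt(1 - alpha_t) eps) / sqrt(alpha_t); the update for the next latent is the standard DDIM form (sigma_t = 0 gives the deterministic sampler) and is an assumption, since this excerpt cuts off before showing it.

import numpy as np

rng = np.random.default_rng(0)

def eps_model(x_t, t):
    # Stand-in for the trained noise predictor eps_t(x_t; theta).
    return rng.standard_normal(x_t.shape)

def ddim_step(x_t, t, alphas, sigma_t=0.0):
    # alphas[t] plays the role of alpha_t in x_t = sqrt(alpha_t) x_0 + sqrt(1 - alpha_t) eps_t.
    eps = eps_model(x_t, t)
    # 1. Estimate of the real image at the current step (from the relation above).
    x0_hat = (x_t - np.sqrt(1.0 - alphas[t]) * eps) / np.sqrt(alphas[t])
    # 2. Standard DDIM update for the next latent (assumed form; sigma_t = 0 is deterministic).
    dir_xt = np.sqrt(1.0 - alphas[t - 1] - sigma_t ** 2) * eps
    noise = sigma_t * rng.standard_normal(x_t.shape)
    return np.sqrt(alphas[t - 1]) * x0_hat + dir_xt + noise

alphas = np.linspace(0.9999, 0.02, 1000)    # assumed decreasing cumulative schedule
x = rng.standard_normal((8, 8))
x = ddim_step(x, t=999, alphas=alphas)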
@@ -191,7 +191,7 @@
 \end{remark}
 
 \begin{remark}
-By considering an anchor, a face recognition dataset is composed of a few positive images and lots of negative images. If the embedding of a negative image is already far enough, the loss will be $0$ and does not affect training, making convergence slow.
+By considering an anchor, a face recognition dataset is composed of a few positive images and lots of negative images. However, if the embedding of a negative image is already far enough, the loss will be $0$ and does not affect training, making convergence slow.
 \end{remark}
 \end{description}
 
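A tiny numeric illustration of the remark above (the margin value is arbitrary): an already-far negative gives exactly zero loss and hence no gradient, which is what motivates (semi-)hard negative mining.

import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    d_ap = np.sum((anchor - positive) ** 2)
    d_an = np.sum((anchor - negative) ** 2)
    return max(d_ap - d_an + margin, 0.0)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])                          # positive close to the anchor
print(triplet_loss(a, p, np.array([5.0, 5.0])))   # easy negative -> 0, no training signal
print(triplet_loss(a, p, np.array([0.2, 0.0])))   # hard negative -> positive loss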
@@ -282,7 +282,7 @@
 }
 \end{figure}
 
-To enforce that the embedding must be closer to the correct template, the logit $\vec{s}_y$ of the true identity $y$ can be modified with a penalty $m$:
+To enforce the embeddings closer to the correct template, the logit $\vec{s}_y$ of the true identity $y$ can be modified with a penalty $m$:
 \[ \vec{s}_y = \cos\left( \arccos(\langle \tilde{\matr{W}}_y, \tilde{f}(x) \rangle) + m \right) \]
 Intuitively, the penalty makes softmax ``think'' that the error is bigger than what it really is, making it push the embedding closer to the correct template.
 
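A minimal sketch of the penalized logit above, assuming the template W_y and the embedding f(x) are already L2-normalized; the margin m is a hyperparameter (0.5 here is only an example value).

import numpy as np

def margin_logit(w_y, f_x, m=0.5):
    # s_y = cos( arccos(<w_y, f_x>) + m ), with w_y and f_x L2-normalized.
    cos_theta = np.clip(np.dot(w_y, f_x), -1.0, 1.0)
    return np.cos(np.arccos(cos_theta) + m)

w = np.array([1.0, 0.0])
f = np.array([np.cos(0.3), np.sin(0.3)])     # embedding at angle 0.3 rad from the template
print(np.dot(w, f), margin_logit(w, f))      # the penalized logit is smaller than cos(0.3)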
@@ -423,7 +423,7 @@
 \end{figure}
 
 \begin{remark}
-It has been seen that as bag-of-words approach for the text encoder is better than using transformers.
+It has been seen that a bag-of-words approach for the text encoder is better than using transformers.
 \end{remark}
 
 \begin{remark}
@@ -812,17 +812,25 @@
 \begin{gathered}
 p_t = \begin{cases}
 p & \text{if $y=1$} \\
-1-p & \text{otherwise}
+1-p & \text{if $y=0$}
 \end{cases}
 \\
 \texttt{BCE}(p, y) = \texttt{BCE}(p_t) = -\ln(p_t) = \begin{cases}
 -\ln(p) & \text{if $y=1$} \\
--\ln(1-p) & \text{otherwise}
+-\ln(1-p) & \text{if $y=0$}
 \end{cases}
 \end{gathered}
 \]
 Binary focal loss down-weighs the loss as follows:
-\[ \texttt{BFL}_\gamma(p_t) = (1-p_t)^\gamma\texttt{BCE}(p_t) = -(1-p_t)^\gamma \ln(p_t) \]
+\[
+\texttt{BFL}_\gamma(p_t) =
+(1-p_t)^\gamma\texttt{BCE}(p_t) =
+-(1-p_t)^\gamma \ln(p_t) =
+\begin{cases}
+-(1-p)^\gamma \ln(p) & \text{if $y=1$} \\
+-(p)^\gamma \ln(1-p) & \text{if $y=0$} \\
+\end{cases}
+\]
 where $\gamma$ is a hyperparameter ($\gamma=0$ is equivalent to the standard unweighted \texttt{BCE}).
 
 \begin{figure}[H]
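A direct NumPy transcription of the binary focal loss above (the clipping is only for numerical safety; gamma = 2 is just an example value).

import numpy as np

def binary_focal_loss(p, y, gamma=2.0, eps=1e-7):
    # p_t = p if y == 1 else 1 - p;  BFL = -(1 - p_t)^gamma * ln(p_t)
    p = np.clip(p, eps, 1.0 - eps)
    p_t = np.where(y == 1, p, 1.0 - p)
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

p = np.array([0.95, 0.6, 0.95, 0.6])
y = np.array([1, 1, 0, 0])
print(binary_focal_loss(p, y, gamma=0.0))   # gamma = 0: plain BCE
print(binary_focal_loss(p, y, gamma=2.0))   # well-classified examples are down-weighted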
@@ -855,13 +863,29 @@
 Consider the following notation:
 \[ \alpha_t = \begin{cases}
 \alpha & \text{if $y=1$} \\
-1-\alpha & \text{otherwise}
+1-\alpha & \text{if $y=0$}
 \end{cases} \]
 Binary cross-entropy with class weights can be defined as:
-\[ \texttt{WBCE}(p_t) = \alpha_t \texttt{BCE}(p_t) = -\alpha_t \ln(p_t) \]
+\[
+\texttt{WBCE}(p_t) =
+\alpha_t \texttt{BCE}(p_t) =
+-\alpha_t \ln(p_t) =
+\begin{cases}
+-\alpha \ln(p) & \text{if $y=1$} \\
+-(1-\alpha) \ln(1-p) & \text{if $y=0$} \\
+\end{cases}
+\]
 
 On the same note, $\alpha$-balanced binary focal loss can be defined as:
-\[ \texttt{WBFL}_\gamma(p_t) = \alpha_t \texttt{BFL}_\gamma(p_t) = -\alpha_t (1-p_t)^\gamma \ln(p_t) \]
+\[
+\texttt{WBFL}_\gamma(p_t) =
+\alpha_t \texttt{BFL}_\gamma(p_t) =
+-\alpha_t (1-p_t)^\gamma \ln(p_t) =
+\begin{cases}
+-\alpha (1-p)^\gamma \ln(p) & \text{if $y=1$} \\
+-(1-\alpha) (p)^\gamma \ln(1-p) & \text{if $y=0$} \\
+\end{cases}
+\]
 
 \begin{remark}
 Class weights and focal loss are complementary: the former balances the importance of positive and negative classification errors, while the latter focuses on the hard examples of each class.
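Extending the focal-loss sketch after the previous hunk, a NumPy sketch of the class-weighted variants above (alpha and gamma are hyperparameters; the values below are examples). Setting gamma = 0 recovers WBCE.

import numpy as np

def weighted_binary_focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    # alpha_t = alpha if y == 1 else 1 - alpha;  WBFL = -alpha_t * (1 - p_t)^gamma * ln(p_t)
    p = np.clip(p, eps, 1.0 - eps)
    p_t = np.where(y == 1, p, 1.0 - p)
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * ((1.0 - p_t) ** gamma) * np.log(p_t)

p = np.array([0.6, 0.6])
y = np.array([1, 0])
print(weighted_binary_focal_loss(p, y, gamma=0.0))   # gamma = 0: WBCE (class weights only)
print(weighted_binary_focal_loss(p, y))              # full alpha-balanced focal loss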
@@ -1012,7 +1036,7 @@
 
 \begin{figure}[H]
 \centering
-\includegraphics[width=0.75\linewidth]{./img/_object_detection_map_speed_plot.jpg}
+\includegraphics[width=0.55\linewidth]{./img/_object_detection_map_speed_plot.jpg}
 \caption{
 mAP -- speed comparison of the various object detection approaches
 }
@@ -71,7 +71,7 @@
 \item[Depth-invariant offsets]
 The denominator ($D[x]$) applied to the offsets allows obtaining depth-invariant offsets.
 
-Consider two objects at different depths. The focal length of the camera is $f$ and the world offset we want to apply is $o_w$. By changing depth $d$, the offset in the image plane $o_{di}$ changes due to the perspective projection rules. Therefore, to obtain the offset in the image place, we have that:
+Consider two objects at different depths. The focal length of the camera is $f$ and the world offset we want to apply is $o_w$. By changing depth $d$, the offset in the image plane $o_{di}$ changes due to the perspective projection rules. Therefore, to obtain the offset in the image plane, we have that:
 \[ o_{di} : f = o_w : d \,\Rightarrow\, o_{di} = \frac{o_w f}{d} = \frac{\Delta p}{d} \]
 
 \begin{figure}[H]
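A quick numeric check of the proportion above, with made-up focal length and depths: the same world offset o_w maps to a smaller image-plane offset as depth grows.

f = 700.0        # assumed focal length in pixels
o_w = 0.5        # desired world offset (metres)

for d in (2.0, 10.0):                 # two objects at different depths
    o_di = o_w * f / d                # o_di : f = o_w : d
    print(f"depth {d} m -> image-plane offset {o_di:.1f} px")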
@@ -118,7 +118,7 @@
 \begin{remark}
 Random forests are:
 \begin{itemize}
-\item Fast and parallelizable in both training and inference.
+\item Fast and parallelizable at both training and inference.
 \item Robust to hyperparameters change.
 \item Interpretable.
 \end{itemize}
@@ -134,7 +134,7 @@
 Slide a window across each pixel of the input image. Each slice is passed through an R-CNN to determine the class of the pixel at the center of the window.
 
 The loss for an example $i$ is the sum of the cross-entropy losses at each pixel $(u, v)$:
-\[ \mathcal{L}^{(i)} = \sum_{(u, v)} \mathbbm{1}\left( c_{(u, v)}^{(i)} \right) \mathcal{L}_\text{CE}\left( \texttt{softmax}(\texttt{scores}_{(u, v)}) \right) \]
+\[ \mathcal{L}^{(i)} = \sum_{(u, v)} \mathcal{L}_\text{CE}\left( \texttt{softmax}(\texttt{scores}_{(u, v)}), \mathbbm{1}\left( c_{(u, v)}^{(i)} \right) \right) \]
 
 \begin{figure}[H]
 \centering
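A NumPy sketch of the per-pixel loss above, assuming scores of shape (H, W, C) and a ground-truth class map c of integer labels; the indicator 1(c_{(u,v)}) is materialized as a one-hot target.

import numpy as np

def pixelwise_ce_loss(scores, c, eps=1e-12):
    # scores: (H, W, C) raw class scores; c: (H, W) integer ground-truth classes.
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)             # softmax over classes
    onehot = np.eye(scores.shape[-1])[c]                  # one-hot indicator of c_{(u,v)}
    ce = -(onehot * np.log(probs + eps)).sum(axis=-1)     # cross-entropy at each pixel (u, v)
    return ce.sum()                                       # summed over the image

rng = np.random.default_rng(0)
scores = rng.standard_normal((4, 4, 3))
c = rng.integers(0, 3, size=(4, 4))
print(pixelwise_ce_loss(scores, c))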
@@ -222,7 +222,7 @@
 
 Given a convolution with convolutional matrix $K$, a transposed convolution is obtained by applying $K^T$ to the image.
 
-In practice, a transposed convolution can be obtained by sliding the kernel on the output activation (instead of input). The values at the output activation correspond to the product between the input pixel and the kernel. If multiple kernels overlap at the same output pixel, its value is obtained as the sum of all the values that end up in that position.
+In practice, a transposed convolution can be obtained by sliding the kernel on the output activation (instead of input). The values at the output activation correspond to the product between the input pixels and the kernel. If multiple kernels overlap at the same output pixel, its value is obtained as the sum of all the values that end up in that position.
 
 \begin{example}
 Consider images with $1$ channel. Given a $3 \times 3$ input image and a $3 \times 3$ transposed convolution kernel with stride $2$, the output activation has spatial dimension $5 \times 5$ and is obtained as follows:
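A NumPy sketch of the procedure described above: each input pixel stamps a scaled copy of the kernel onto the output, stride positions apart, and overlapping contributions are summed. Cropping one border row/column (the transposed counterpart of padding 1) gives the 5x5 output of the example; without cropping the full output would be 7x7.

import numpy as np

def conv_transpose2d(x, k, stride=2, padding=1):
    H, W = x.shape
    K = k.shape[0]
    out = np.zeros(((H - 1) * stride + K, (W - 1) * stride + K))
    for i in range(H):
        for j in range(W):
            # Scaled copy of the kernel placed at the output position of input pixel (i, j);
            # overlapping placements are summed.
            out[i * stride:i * stride + K, j * stride:j * stride + K] += x[i, j] * k
    if padding:
        out = out[padding:-padding, padding:-padding]
    return out

x = np.arange(1, 10, dtype=float).reshape(3, 3)
print(conv_transpose2d(x, np.ones((3, 3))).shape)   # (5, 5), as in the example above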
@@ -438,7 +438,7 @@
 \centering
 \includegraphics[width=0.7\linewidth]{./img/_roi_align1.jpg}
 \end{figure}
-\item Sample some values following a regular grid within each subregion. Use bilinear interpolation to determine the values of the sampled points (as they are most likely not be pixel-perfect).
+\item Sample some values following a regular grid within each subregion. Use bilinear interpolation to determine the values of the sampled points (as they are most likely not pixel-perfect).
 \begin{figure}[H]
 \centering
 \includegraphics[width=0.7\linewidth]{./img/_roi_align2.jpg}
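A minimal bilinear-interpolation sketch for the sampling step above (single-channel feature map; the sampled location is continuous, hence generally not pixel-perfect).

import numpy as np

def bilinear_sample(fmap, y, x):
    # Interpolate fmap at the continuous location (y, x) from its 4 neighbouring pixels.
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, fmap.shape[0] - 1), min(x0 + 1, fmap.shape[1] - 1)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * fmap[y0, x0] + dx * fmap[y0, x1]
    bottom = (1 - dx) * fmap[y1, x0] + dx * fmap[y1, x1]
    return (1 - dy) * top + dy * bottom

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(bilinear_sample(fmap, 1.5, 2.25))   # value interpolated from the 4 surrounding pixels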
@@ -477,7 +477,7 @@
 
 Given the ground-truth class $c$ and mask $m$, and the predicted mask $\hat{m}$, the mask head is trained using the following loss:
 \[ \mathcal{L}_\text{mask}(c, m) = \frac{1}{28 \times 28} \sum_{u=0}^{27} \sum_{v=0}^{27} \mathcal{L}_\text{BCE}\left( \hat{m}_{c}[u, v], m[u, v] \right) \]
-In other words, the ground-truth mask is compared against the predicted mask for the correct class.
+In other words, the ground-truth mask is compared against the predicted mask at the correct class.
 
 \item[Inference]
 The class prediction of R-CNN is used to select the correct channel of the mask head. The bounding box of R-CNN is used to decide how to warp the segmentation mask onto the image.
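A NumPy sketch of the mask loss above, assuming the mask head outputs one 28x28 probability map per class, so that only the channel of the ground-truth class c enters the per-pixel binary cross-entropy (the number of classes below is illustrative).

import numpy as np

def mask_loss(m_hat, m, c, eps=1e-7):
    # m_hat: (num_classes, 28, 28) predicted mask probabilities; m: (28, 28) binary GT mask;
    # c: ground-truth class index -> only channel m_hat[c] contributes to the loss.
    p = np.clip(m_hat[c], eps, 1.0 - eps)
    bce = -(m * np.log(p) + (1.0 - m) * np.log(1.0 - p))
    return bce.mean()                         # average over the 28 x 28 pixels

rng = np.random.default_rng(0)
m_hat = rng.uniform(size=(80, 28, 28))        # e.g. 80 classes (illustrative)
m = (rng.uniform(size=(28, 28)) > 0.5).astype(float)
print(mask_loss(m_hat, m, c=3))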
@@ -589,7 +589,7 @@
 \end{description}
 
 \begin{remark}
-Although it is able to solve all types of segmentation, MaskFormer do not have state-of-the-art results and it is hard to train.
+Although it is able to solve all types of segmentation, MaskFormer does not have state-of-the-art results and it is hard to train.
 \end{remark}
 \end{description}
 