\section{Variational autoencoder (VAE)}
\marginnote{Variational autoencoder (VAE)}

Approach belonging to the family of compressive models.


\subsection{Training}

\subsection{Problems}

\begin{itemize}
\item Balancing the two losses (reconstruction and regularization) is difficult.
\item It is subject to the posterior collapse problem, where the model learns to ignore a subset of latent variables.
\item There might be a mismatch between the prior distribution and the learned latent distribution.
\item Generated images are blurry.
\end{itemize}

\section{Generative adversarial network (GAN)}
\marginnote{Generative adversarial network (GAN)}

Approach belonging to the family of compressive models.


\subsection{Training}
During training, the generator $G$ is paired with a discriminator $D$ that learns to distinguish between real and generated data.

The loss function is the following:
\[
\mathcal{V}(D, G) = \mathbb{E}_{x \sim p_\text{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log\big( 1 - D(G(z)) \big)]
\]
where:
\begin{itemize}
\item $\mathbb{E}_{x \sim p_\text{data}(x)}[\log D(x)]$ is the negative cross-entropy of the discriminator w.r.t. the true data distribution $p_\text{data}$
(i.e. how well the discriminator recognizes real data).
\item $\mathbb{E}_{z \sim p_z(z)}[\log\big( 1 - D(G(z)) \big)]$ is the negative cross-entropy of the discriminator w.r.t. the generator
(i.e. how well the discriminator is able to detect generated data).
\end{itemize}
In other words, the loss aims to:
\begin{itemize}
\item Instruct the discriminator to spot the generator ($\max_D \mathcal{V}(D, G)$).
\item Instruct the generator to fool the discriminator ($\min_G \mathcal{V}(D, G)$).
\end{itemize}

\begin{figure}[H]
\centering
\includegraphics[width=0.5\linewidth]{./img/gan.png}
\end{figure}

For more stability, training is done alternately: the discriminator is trained with the generator frozen, and vice versa.
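The alternating scheme can be sketched on a toy 1-D problem. Everything below (the data distribution, the linear generator, the logistic discriminator, and the learning rate) is an illustrative assumption, not from these notes, and gradients are taken by finite differences to keep the sketch self-contained:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-np.clip(t, -30, 30)))

def V(d_params, g_params, x_real, z):
    # Value function V(D, G): the discriminator ascends it, the generator descends it.
    w, c = d_params            # discriminator D(x) = sigmoid(w*x + c)
    a, b = g_params            # generator G(z) = a*z + b
    x_fake = a * z + b
    return np.mean(np.log(sigmoid(w * x_real + c) + 1e-8)) \
         + np.mean(np.log(1.0 - sigmoid(w * x_fake + c) + 1e-8))

def grad(f, p, h=1e-5):
    # Central finite-difference gradient (avoids writing out derivatives by hand).
    out = np.zeros_like(p)
    for i in range(len(p)):
        p1, p2 = p.copy(), p.copy()
        p1[i] += h
        p2[i] -= h
        out[i] = (f(p1) - f(p2)) / (2 * h)
    return out

d = np.array([0.1, 0.0])       # discriminator parameters (w, c)
g = np.array([1.0, 0.0])       # generator parameters (a, b): fakes start around 0
lr = 0.05

for _ in range(2000):
    x_real = rng.normal(3.0, 0.5, size=64)   # real data concentrated around 3
    z = rng.normal(0.0, 1.0, size=64)
    # Discriminator step (generator frozen): gradient ASCENT on V.
    d = d + lr * grad(lambda p: V(p, g, x_real, z), d)
    # Generator step (discriminator frozen): gradient DESCENT on V.
    g = g - lr * grad(lambda p: V(d, p, x_real, z), g)
```

With these (hypothetical) settings the generated samples $a z + b$ drift towards the real data around $3$; in practice both players are deep networks trained with backpropagation rather than finite differences.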
\begin{remark}
GANs have the property of pushing the reconstruction towards the natural image manifold.
\begin{figure}[H]
\centering
\includegraphics[width=0.4\linewidth]{./img/gan_manifold.png}
\caption{Comparison of GAN and MSE generated images. MSE is obtained as the pixel-wise average of the natural images.}
\end{figure}
\end{remark}

\subsection{Problems}

\begin{itemize}
\item A generator able to fool the discriminator does not necessarily mean that the generated images are good.
\item There are problems related to counting, perspective and global structure.
\item The generator tends to specialize on fixed samples (mode collapse).
\end{itemize}

\section{Normalizing flows}
\marginnote{Normalizing flows}

Approach belonging to the family of dimension-preserving models.

The generator is split into a chain of invertible transformations.
During training, the log-likelihood is maximized.
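This objective can be made explicit (a standard fact about invertible maps, not spelled out in these notes): writing the generator as $x = f_K \circ \dots \circ f_1(z)$ and denoting by $z_i$ the output of $f_i$ (with $z_0 = z$), the change-of-variables formula gives an exact log-likelihood to maximize:
\[
\log p_X(x) = \log p_Z(z_0) - \sum_{i=1}^{K} \log \left\vert \det \frac{\partial f_i}{\partial z_{i-1}} \right\vert
\]
Each $f_i$ is therefore chosen so that its inverse and the determinant of its Jacobian are cheap to compute.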
\section{Diffusion model}
\marginnote{Diffusion model}

Approach belonging to the family of dimension-preserving models.

\subsection{Training (forward diffusion process)}

Given an image $x_0$ and a signal ratio $\alpha_t$ (that indicates how much original data is in the noisy image),
the generator $G$ is considered as a denoiser and a training step $t$ does the following:
\begin{enumerate}
\item Normalize $x_0$.
\item Generate a Gaussian noise $\varepsilon \sim \mathcal{N}(0, 1)$.
\item Generate a noisy version $x_t$ of $x_0$ by injecting the noise as follows:
\[ x_t = \sqrt{\alpha_t} \cdot x_0 + \sqrt{1-\alpha_t} \cdot \varepsilon \]
\item Make the network predict the noise $G(x_t, \alpha_t)$ and train it to minimize the prediction error:
\[ \Vert \varepsilon - G(x_t, \alpha_t) \Vert \]
\end{enumerate}
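A single training step can be sketched as follows. The image, the linear $\alpha$ schedule, and all sizes are illustrative stand-ins (the notes only require that some scheduler fixes the $\alpha_t$ values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical signal-ratio schedule: alpha_1 ~ 1 (almost clean) down to alpha_T ~ 0.
T = 10
alphas = np.linspace(0.99, 0.01, T)

def forward_diffusion_step(x0, alpha_t, rng):
    """Noise a (normalized) image x0 with signal ratio alpha_t."""
    eps = rng.normal(0.0, 1.0, size=x0.shape)                 # Gaussian noise
    x_t = np.sqrt(alpha_t) * x0 + np.sqrt(1 - alpha_t) * eps  # noise injection
    return x_t, eps

x0 = rng.normal(0.0, 1.0, size=(8, 8))   # stand-in for a normalized image
alpha_t = alphas[5]
x_t, eps = forward_diffusion_step(x0, alpha_t, rng)

# Training would now minimize || eps - G(x_t, alpha_t) || for a denoiser network G.
```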
\begin{remark}
The values of $\alpha_t$ are fixed by a scheduler.
\end{remark}


\subsection{Inference (reverse diffusion process)}

The generation process is split into a finite chain of $T$ denoising steps that attempt to remove a Gaussian noise with varying $\sigma$
(i.e. it is assumed that the latent space is a noisy version of the image).

Given a generator $G$ and a fixed signal ratio scheduling $\alpha_1 > \dots > \alpha_T$, an image is sampled as follows:
\begin{enumerate}
\item Start from some random noise $x_T \sim \mathcal{N}(0, 1)$.
\item For $t$ in $T, \dots, 1$:
\begin{enumerate}
\item Estimate the noise using the generator $G(x_t, \alpha_t)$.
\item Compute the denoised image $\hat{x}_0$:
\[ \hat{x}_0 = \frac{x_t - \sqrt{1-\alpha_t} \cdot G(x_t, \alpha_t)}{\sqrt{\alpha_t}} \]
\item Compute a new noisy image for the next iteration by re-injecting some noise with signal ratio $\alpha_{t-1}$ (i.e. inject less noise than at the current iteration):
\[ x_{t-1} = \sqrt{\alpha_{t-1}} \cdot \hat{x}_0 + \sqrt{1 - \alpha_{t-1}} \cdot \varepsilon \]
\end{enumerate}
\end{enumerate}
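The sampling loop can be sketched with an oracle denoiser standing in for the trained network. The oracle and the schedule are assumptions for demonstration only, and the re-injected noise is taken to be the estimated one (a deterministic choice; the notes leave $\varepsilon$ unspecified at this step):

```python
import numpy as np

rng = np.random.default_rng(0)

target = rng.normal(0.0, 1.0, size=(8, 8))   # image the oracle denoiser "knows"

def G(x_t, alpha_t):
    # Oracle denoiser: recovers the exact noise such that
    # x_t = sqrt(alpha_t) * target + sqrt(1 - alpha_t) * noise.
    return (x_t - np.sqrt(alpha_t) * target) / np.sqrt(1 - alpha_t)

T = 10
alphas = np.linspace(0.9, 0.01, T)           # alpha_1 > ... > alpha_T

x_t = rng.normal(0.0, 1.0, size=(8, 8))      # x_T: start from pure noise
for t in range(T - 1, -1, -1):               # t = T, ..., 1 (0-indexed here)
    eps_hat = G(x_t, alphas[t])                                             # estimate the noise
    x0_hat = (x_t - np.sqrt(1 - alphas[t]) * eps_hat) / np.sqrt(alphas[t])  # denoised image
    if t > 0:
        # Re-inject noise with the next (larger) signal ratio alpha_{t-1}.
        x_t = np.sqrt(alphas[t - 1]) * x0_hat + np.sqrt(1 - alphas[t - 1]) * eps_hat
```

With the oracle, $\hat{x}_0$ matches the target exactly; a real network only approximates it, which is why the gradual re-noising chain is needed.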
\begin{figure}[H]
\centering
\includegraphics[width=0.65\linewidth]{./img/diffusion_model.png}
\end{figure}

\begin{remark}
A conditional U-net for denoising works well as the generator.
\end{remark}

\section{Latent space exploration}

\begin{description}
\item[Representation learning] \marginnote{Representation learning}
Learning a latent space in such a way that particular changes reflect a desired alteration of the visible space.

\begin{remark}
Real-world data depend on a relatively small set of latent features.
\end{remark}

\item[Disentanglement] \marginnote{Disentanglement}
The latent space learned by a model is usually entangled (i.e. a change in one attribute might affect the others).

Through linear maps, it is possible to pass from one latent space to another.
This can be done by finding a small set of points common to the starting and destination spaces (support set)
and defining a map based on those points.

\begin{remark}
The latent space seems to be independent of:
\begin{itemize}
\item The training process.
\item The training architecture.
\item The learning objective (i.e. a GAN and a VAE might have the same latent space).
\end{itemize}
\end{remark}

\begin{figure}[H]
\centering
\includegraphics[width=0.3\linewidth]{./img/latent_mapping.png}
\caption{
\parbox[t]{0.72\linewidth}{
Example of mapping from a latent space $Z_1$ to a space $Z_2$ through $M$.
The two spaces are evaluated on the visible space $V$.
}
}
\end{figure}
\end{description}
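The support-set construction can be sketched with synthetic data. The dimensions, the number of anchor points, and the use of plain least squares to fit $M$ are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two latent spaces Z1, Z2 assumed to be related by a linear map M.
d1, d2, n_support = 16, 16, 50

# Ground-truth relation, used here only to synthesize the example data.
M_true = rng.normal(size=(d1, d2))

# Support set: a small set of points whose encodings are known in BOTH spaces.
Z1_support = rng.normal(size=(n_support, d1))
Z2_support = Z1_support @ M_true

# Estimate the map by least squares on the support set.
M_hat, *_ = np.linalg.lstsq(Z1_support, Z2_support, rcond=None)

# The estimated map now transports any new Z1 point into Z2.
z1_new = rng.normal(size=(1, d1))
z2_new = z1_new @ M_hat
```

In this noiseless setup the least-squares fit recovers the map exactly; with real encoders the support encodings are noisy and the map is only approximate.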