https://github.com/NotXia/unibo-ai-notes.git
Fix typos <noupdate>
@@ -80,7 +80,7 @@
 \item \marginnote{Frequent itemset generation}
 Determine the itemsets with $\text{support} \geq \texttt{min\_sup}$ (frequent itemsets).
 \item \marginnote{Rule generation}
-Determine the the association rules with $\text{confidence} \geq \texttt{min\_conf}$.
+Determine the association rules with $\text{confidence} \geq \texttt{min\_conf}$.
 \end{enumerate}
 \end{description}
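The hunk above touches the two mining phases (frequent itemset generation, then rule generation). As a minimal sketch of how the two thresholds are applied, assuming set-valued transactions (data, thresholds and the candidate rule are illustrative, not from the notes):

    # Support/confidence filtering for a single candidate rule A -> B.
    def support(itemset, transactions):
        # fraction of transactions that contain every item of the itemset
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(antecedent, consequent, transactions):
        # conf(A -> B) = support(A u B) / support(A)
        return support(antecedent | consequent, transactions) / support(antecedent, transactions)

    transactions = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]
    min_sup, min_conf = 0.5, 0.6           # illustrative thresholds
    rule = ({"bread"}, {"milk"})            # candidate rule: bread -> milk
    sup = support(rule[0] | rule[1], transactions)
    conf = confidence(rule[0], rule[1], transactions)
    if sup >= min_sup and conf >= min_conf:
        print("rule kept:", rule, sup, conf)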
@@ -250,7 +250,7 @@ Measures that take into account the statistical independence of the items.
 \hline
 High support & The rule applies to many transactions. \\
 \hline
-High confidence & The chance that the rule is true for some transaction is high. \\
+High confidence & The chance that the rule is true for some transactions is high. \\
 \hline
 High lift & Low chance that the rule is just a coincidence. \\
 \hline
@@ -329,7 +329,7 @@ Measures that take into account the statistical independence of the items.
 
 
 \section{Multi-level association rules}
-Organize items into an hierarchy.
+Organize items into a hierarchy.
 
 \begin{description}
 \item[Specialized to general] \marginnote{Specialized to general}
@@ -345,7 +345,7 @@ Organize items into an hierarchy.
 \end{example}
 
 \item[Redundant level] \marginnote{Redundant level}
-A more specialized rule in the hierarchy is redundant if its confidence is similar to the one of the more general rule.
+A more specialized rule in the hierarchy is redundant if its confidence is similar to the one of a more general rule.
 
 \item[Multi-level association rule mining] \marginnote{Multi-level association rule mining}
 Run association rule mining on different levels of abstraction (general to specialized).
@@ -65,9 +65,9 @@
 A supervised dataset can be randomly split into:
 \begin{descriptionlist}
 \item[Train set] \marginnote{Train set}
-Used to learn the model. Usually the largest split. Can be seen as an upper-bound of the model performance.
+Used to learn the model. Usually the largest split. Can be seen as an upper bound of the model performance.
 \item[Test set] \marginnote{Test set}
-Used to evaluate the trained model. Can be seen as a lower-bound of the model performance.
+Used to evaluate the trained model. Can be seen as a lower bound of the model performance.
 \item[Validation set] \marginnote{Validation set}
 Used to evaluate the model during training and/or for tuning parameters.
 \end{descriptionlist}
@@ -93,7 +93,7 @@
 
 \subsection{Test set error}
 \textbf{\underline{Disclaimer: I'm very unsure about this part}}\\
-The error on the test set can be seen as a lower-bound error of the model.
+The error on the test set can be seen as a lower bound error of the model.
 If the test set error ratio is $x$, we can expect an error of $(x \pm \text{confidence interval})$.
 
 Predicting the elements of the test set can be seen as a binomial process (i.e. a series of $N$ Bernoulli processes).
@@ -114,7 +114,7 @@ be between the $\frac{\alpha}{2}$ and $(1-\frac{\alpha}{2})$ quantiles of the ga
 We can estimate $p$ using the Wilson score interval\footnote{\url{https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval}}:
 \[ p = \frac{1}{1+\frac{1}{N}z^2} \left( f + \frac{1}{2N}z^2 \pm z\sqrt{\frac{1}{N}f(1-f) + \frac{z^2}{4N^2}} \right) \]
 where $z$ depends on the value of $\alpha$.
-For a pessimistic estimate, $\pm$ becomes a $+$. Vice versa, for a optimistic estimate, $\pm$ becomes a $-$.
+For a pessimistic estimate, $\pm$ becomes a $+$. Vice versa, for an optimistic estimate, $\pm$ becomes a $-$.
 
 As $N$ is at the denominator, this means that for large values of $N$, the uncertainty becomes smaller.
 \begin{center}
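A minimal sketch of the Wilson score interval shown in this hunk, where f is the observed error ratio over N test samples and z is the normal quantile chosen from alpha (z = 1.96 is an illustrative value for a 95% interval):

    import math

    def wilson_interval(f, N, z=1.96):
        center = f + z**2 / (2 * N)
        spread = z * math.sqrt(f * (1 - f) / N + z**2 / (4 * N**2))
        scale = 1 / (1 + z**2 / N)
        # (optimistic, pessimistic): minus gives the optimistic estimate, plus the pessimistic one
        return scale * (center - spread), scale * (center + spread)

    print(wilson_interval(0.1, 100))     # wider interval
    print(wilson_interval(0.1, 10000))   # narrower: uncertainty shrinks as N grows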
@@ -127,25 +127,25 @@ As $N$ is at the denominator, this means that for large values of $N$, the uncer
 \item[Holdout] \marginnote{Holdout}
 The dataset is split into train, test and, if needed, validation.
 
-\item[Cross validation] \marginnote{Cross validation}
+\item[Cross-validation] \marginnote{Cross-validation}
 The training data is partitioned into $k$ chunks.
-For $k$ iterations, one of the chunks if used to test and the others to train a new model.
+For $k$ iterations, one of the chunks is used to test and the others to train a new model.
 The overall error is obtained as the average of the errors of the $k$ iterations.
 
-At the end, the final model is still trained on the entire training data,
-while cross validation results are used as an evaluation and comparison metric.
-Note that cross validation is done on the training set, so a final test set can still be used to
-evaluate the final model.
+In the end, the final model is still trained on the entire training data,
+while cross-validation results are used as an evaluation and comparison metric.
+Note that cross-validation is done on the training set, so a final test set can still be used to
+evaluate the resulting model.
 
 \begin{figure}[h]
 \centering
 \includegraphics[width=0.6\textwidth]{img/cross_validation.png}
-\caption{Cross validation example}
+\caption{Cross-validation example}
 \end{figure}
 
 \item[Leave-one-out] \marginnote{Leave-one-out}
-Extreme case of cross validation with $k=N$, the size of the training set.
-In this case the whole dataset but one element is used for training and the remaining entry for testing.
+Extreme case of cross-validation with $k=N$, the size of the training set.
+In this case, the whole dataset but one element is used for training and the remaining entry for testing.
 
 \item[Bootstrap] \marginnote{Bootstrap}
 Statistical sampling of the dataset with replacement (i.e. an entry can be selected multiple times).
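A minimal sketch of the k-fold cross-validation procedure described in this hunk; train_fn and error_fn are hypothetical placeholders for the model training and evaluation routines:

    import random

    def cross_validation(data, k, train_fn, error_fn):
        data = data[:]                      # work on a shuffled copy of the training data
        random.shuffle(data)
        folds = [data[i::k] for i in range(k)]
        errors = []
        for i in range(k):
            held_out = folds[i]             # this chunk is used to test
            train = [x for j, fold in enumerate(folds) if j != i for x in fold]
            model = train_fn(train)         # a new model is trained on the other chunks
            errors.append(error_fn(model, held_out))
        return sum(errors) / k              # average error over the k iterations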
@@ -192,7 +192,7 @@ Given a test set of $N$ element, possible metrics are:
 
 \item[Recall/Sensitivity] \marginnote{Recall}
 Number of true positives among the real positives
-(i.e. how many real positive the model predicted).
+(i.e. how many real positives the model predicted).
 \[ \text{recall} = \frac{TP}{TP + FN} \]
 
 \item[Specificity] \marginnote{Specificity}
@@ -296,7 +296,7 @@ a macro (unweighted) average or a class-weighted average.
 \item[ROC curve] \marginnote{ROC curve}
 The ROC curve can be seen as a way to represent multiple confusion matrices of a classifier
 that uses different thresholds.
-The x-axis of a ROC curve represent the false positive rate while the y-axis represent the true positive rate.
+The x-axis of a ROC curve represents the false positive rate while the y-axis represents the true positive rate.
 
 A straight line is used to represent a random classifier.
 A threshold can be considered good if it is high on the y-axis and low on the x-axis.
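A minimal sketch of how one (FPR, TPR) point per threshold can be computed, i.e. one confusion matrix per threshold as described in the hunk above (assumes both classes appear in the labels; names and data handling are illustrative):

    def roc_points(scores, labels, thresholds):
        points = []
        for t in thresholds:
            preds = [s >= t for s in scores]
            tp = sum(p and y for p, y in zip(preds, labels))
            fp = sum(p and not y for p, y in zip(preds, labels))
            fn = sum((not p) and y for p, y in zip(preds, labels))
            tn = sum((not p) and (not y) for p, y in zip(preds, labels))
            points.append((fp / (fp + tn), tp / (tp + fn)))   # (FPR, TPR)
        return points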
@@ -314,7 +314,7 @@ A classifier may not perform well when predicting a minority class of the traini
 Possible solutions are:
 \begin{descriptionlist}
 \item[Undersampling] \marginnote{Undersampling}
-Randomly reduce the number of example of the majority classes.
+Randomly reduce the number of examples of the majority classes.
 
 \item[Oversampling] \marginnote{Oversampling}
 Increase the examples of the minority classes.
@@ -324,7 +324,7 @@ Possible solutions are:
 \begin{enumerate}
 \item Randomly select an example $x$ belonging to the minority class.
 \item Select a random neighbor $z_i$ among its $k$-nearest neighbors $z_1, \dots, z_k$.
-\item Synthetize a new example by selecting a random point of the feature space between $x$ and $z_i$.
+\item Synthesize a new example by selecting a random point of the feature space between $x$ and $z_i$.
 \end{enumerate}
 \end{description}
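A minimal sketch of the three synthetic-oversampling steps enumerated above (brute-force nearest neighbours with Euclidean distance; names and data are illustrative, not from the notes):

    import random

    def smote_sample(minority, k=5):
        x = random.choice(minority)                              # step 1: random minority example
        neighbors = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        z = random.choice(neighbors)                             # step 2: random k-nearest neighbor
        lam = random.random()                                    # step 3: random point between x and z
        return tuple(a + lam * (b - a) for a, b in zip(x, z))

    minority = [(1.0, 2.0), (1.2, 1.9), (0.8, 2.2), (1.1, 2.1)]
    print(smote_sample(minority, k=2))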
@@ -346,7 +346,7 @@ Possible solutions are:
 \begin{description}
 \item[Shannon theorem] \marginnote{Shannon theorem}
 Let $\matr{X} = \{ \vec{v}_1, \dots, \vec{v}_V \}$ be a data source where
-each of the possible value has probability $p_i = \prob{\vec{v}_i}$.
+each of the possible values has probability $p_i = \prob{\vec{v}_i}$.
 The best encoding allows to transmit $\matr{X}$ with
 an average number of bits given by the \textbf{entropy} of $X$: \marginnote{Entropy}
 \[ H(\matr{X}) = - \sum_j p_j \log_2(p_j) \]
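A minimal sketch of the entropy formula in this hunk:

    import math

    def entropy(probabilities):
        # H(X) = - sum_j p_j log2(p_j), skipping zero-probability values
        return -sum(p * math.log2(p) for p in probabilities if p > 0)

    print(entropy([0.5, 0.5]))      # 1.0 bit: uniform binary source
    print(entropy([0.99, 0.01]))    # ~0.08 bits: almost constant source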
@@ -354,7 +354,7 @@ Possible solutions are:
 If $p_j \sim 1$, then the surprise of observing $\vec{v}_j$ is low, vice versa,
 if $p_j \sim 0$, the surprise of observing $\vec{v}_j$ is high.
 
-Therefore, when $H(\matr{X})$ is high, $\matr{X}$ is close to an uniform distribution.
+Therefore, when $H(\matr{X})$ is high, $\matr{X}$ is close to a uniform distribution.
 When $H(\matr{X})$ is low, $\matr{X}$ is close to a constant.
 
 \begin{example}[Binary source] \phantom{}\\
@@ -382,7 +382,7 @@ Possible solutions are:
 It is computed as:
 \[ IG(c \,\vert\, d \,:\, t) = H(c) - H(c \,\vert\, d \,:\, t) \]
 When $H(c \,\vert\, d \,:\, t)$ is low, $IG(c \,\vert\, d \,:\, t)$ is high
-as splitting with threshold $t$ result in purer groups.
+as splitting with threshold $t$ results in purer groups.
 Vice versa, when $H(c \,\vert\, d \,:\, t)$ is high, $IG(c \,\vert\, d \,:\, t)$ is low
 as splitting with threshold $t$ is not very useful.
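A minimal sketch of the information gain of a threshold split, assuming the usual weighted-entropy definition of H(c | d : t) (that definition is not shown in this excerpt):

    import math
    from collections import Counter

    def H(labels):
        n = len(labels)
        return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

    def information_gain(d, c, t):
        # split the class labels c according to whether attribute d is <= t or > t
        left = [ci for di, ci in zip(d, c) if di <= t]
        right = [ci for di, ci in zip(d, c) if di > t]
        h_cond = (len(left) * H(left) + len(right) * H(right)) / len(c)
        return H(c) - h_cond

    # a pure split: conditional entropy is 0, so IG equals H(c) = 1.0
    print(information_gain([1, 2, 8, 9], ["a", "a", "b", "b"], t=5))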
@@ -520,7 +520,7 @@ each node requires to process all the attributes.
 Assuming an average height of $O(\log N)$,
 the overall complexity for induction (parameters search) is $O(DN \log N)$.
 
-Moreover, The other operations of a binary tree have complexity:
+Moreover, the other operations of a binary tree have complexity:
 \begin{itemize}
 \item Threshold search and binary split: $O(N \log N)$ (scan the dataset for the threshold).
 \item Pruning: $O(N \log N)$ (requires to scan the dataset).
@@ -584,8 +584,8 @@ This has complexity $O(h)$, with $h$ the height of the tree.
 \item[Smooting]
 If the value $e_{ij}$ of the domain of a feature $E_i$ never appears in the dataset,
 its probability $\prob{e_{ij} \mid c}$ will be 0 for all classes.
-This nullifies all the probabilities that uses this feature when
-computing the products chain during inference.
+This nullifies all the probabilities that use this feature when
+computing the product chain during inference.
 Smoothing methods can be used to avoid this problem.
 
 \begin{description}
@@ -597,14 +597,14 @@ This has complexity $O(h)$, with $h$ the height of the tree.
 \item[$\vert \mathbb{D}_{E_i} \vert$] The number of distinct values in the domain of $E_i$.
 \item[\normalfont$\text{af}_{c}$] The absolute frequency of the class $c$.
 \end{descriptionlist}
-the smoothed frequency is computed as:
+The smoothed frequency is computed as:
 \[
 \prob{e_{ij} \mid c} = \frac{\text{af}_{e_{ij}, c} + \alpha}{\text{af}_{c} + \alpha \vert \mathbb{D}_{E_i} \vert}
 \]
 
 A common value of $\alpha$ is 1.
 When $\alpha = 0$, there is no smoothing.
-For higher values of $\alpha$, the smoothed feature gain more importance when computing the priors.
+For higher values of $\alpha$, the smoothed feature gains more importance when computing the priors.
 \end{description}
 
 \item[Missing values] \marginnote{Missing values}
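A minimal sketch of the smoothed frequency defined in this hunk; the argument names mirror af_{e_ij,c}, af_c and |D_{E_i}|:

    def smoothed_prob(af_value_class, af_class, domain_size, alpha=1.0):
        # P(e_ij | c) = (af_{e_ij,c} + alpha) / (af_c + alpha * |D_{E_i}|)
        return (af_value_class + alpha) / (af_class + alpha * domain_size)

    # an unseen value no longer gets probability 0:
    print(smoothed_prob(0, 50, domain_size=3))            # 1/53 instead of 0
    print(smoothed_prob(0, 50, domain_size=3, alpha=0))   # 0.0: no smoothing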
@@ -704,7 +704,7 @@ In practice, a maximum number of iterations is set.
 \end{split}
 \]
 where $M$ is the margin, $w_i$ are the weights of the hyperplane and $c_i = \{-1, 1 \}$ is the class.
-The second constraint imposes the hyperplane to have a large margine.
+The second constraint imposes the hyperplane to have a large margin.
 For positive labels ($c_i=1$), this is true when the hyperplane is positive.
 For negative labels ($c_i=-1$), this is true when the hyperplane is negative.
@@ -736,7 +736,7 @@ Then, the data and the boundary is mapped back into the original space.
 \caption{Example of mapping from $\mathbb{R}^2$ to $\mathbb{R}^3$}
 \end{figure}
 
-The kernel trick allows to avoid to explicitly map the dataset into the new space by using kernel functions.
+The kernel trick allows to avoid explicitly mapping the dataset into the new space by using kernel functions.
 Known kernel functions are:
 \begin{descriptionlist}
 \item[Linear] $K(x, y) = \langle x, y \rangle$.
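A minimal sketch of two kernel functions: the linear kernel listed in the hunk and, as an extra illustration not shown in this excerpt, the Gaussian (RBF) kernel. Both are evaluated directly on the original vectors, without mapping the data explicitly:

    import math

    def linear_kernel(x, y):
        # K(x, y) = <x, y>
        return sum(a * b for a, b in zip(x, y))

    def rbf_kernel(x, y, gamma=1.0):
        # K(x, y) = exp(-gamma * ||x - y||^2), gamma is an illustrative parameter
        return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

    print(linear_kernel((1, 2), (3, 4)))   # 11
    print(rbf_kernel((1, 2), (3, 4)))      # exp(-8)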
@@ -755,7 +755,7 @@ depending on the effectiveness of data caching.
 \begin{itemize}
 \item Training an SVM model is generally slower.
 \item SVM is not affected by local minimums.
-\item SVM do not suffer the curse of dimensionality.
+\item SVM does not suffer the curse of dimensionality.
 \item SVM does not directly provide probability estimates.
 If needed, these can be computed using a computationally expensive method.
 \end{itemize}
@@ -771,10 +771,12 @@ depending on the effectiveness of data caching.
 \item[Activation function] \marginnote{Activation function}
 Activation functions are useful to add non-linearity.
 
 \begin{remark}
 In a linear system, if there is noise in the input, it is transferred to the output
+(i.e. linearity implies that $f(x + \text{noise}) = f(x) + f(\text{noise})$).
 On the other hand, a non-linear system is generally more robust
+(i.e. non-linearity generally implies that $f(x + \text{noise}) \neq f(x) + f(\text{noise})$)
 \end{remark}
 
 \item[Feedforward neural network] \marginnote{Feedforward neural network}
 Network with the following flow:
@@ -791,7 +793,7 @@ Inputs are fed to the network and backpropagation is used to update the weights.
 Size of the step for gradient descent.
 
 \item[Epoch] \marginnote{Epoch}
-A round of training where the entire dataset has been processed.
+A round of training where the entire dataset is processed.
 
 \item[Stopping criteria] \marginnote{Stopping criteria}
 Possible conditions to stop the training are:
@@ -874,7 +876,7 @@ Different strategies to train an ensemble classifier can be used:
 \subsection{Random forests}
 \marginnote{Random forests}
 
-Different decision trees trained on a different random sampling of the training set and different subset of features.
+Multiple decision trees trained on a different random sampling of the training set and different subsets of features.
 A prediction is made by averaging the output of each tree.
 
 \begin{description}
@@ -10,7 +10,7 @@
 
 \item[Dissimilarity] \marginnote{Dissimilarity}
 Measures how two objects differ.
-0 indicates no difference while the upper-bound varies.
+0 indicates no difference while the upper bound varies.
 \end{description}
 
 \begin{table}[ht]
@@ -119,7 +119,7 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar
 \begin{description}
 \item[Pearson's correlation] \marginnote{Pearson's correlation}
 Measure of linear relationship between a pair of quantitative attributes $e_1$ and $e_2$.
-To compute the Pearson's correlation, the values of $e_1$ and $e_2$ are first standardized and then ordered to obtain the vectors $\vec{e}_1$ and $\vec{e}_2$.
+To compute Pearson's correlation, the values of $e_1$ and $e_2$ are first standardized and then ordered to obtain the vectors $\vec{e}_1$ and $\vec{e}_2$.
 The correlation is then computed as the dot product between $\vec{e}_1$ and $\vec{e}_2$:
 \[ \texttt{corr}(e_1, e_2) = \langle \vec{e}_1, \vec{e}_2 \rangle \]
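A minimal sketch of the construction above; here the standardized vectors are additionally scaled by 1/sqrt(n) (an assumption about the standardization used in the notes) so that the dot product lands in [-1, 1] like the usual Pearson coefficient:

    import math

    def standardize(values):
        n = len(values)
        mean = sum(values) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
        # z-scores scaled by 1/sqrt(n) so that the dot product gives the correlation
        return [(v - mean) / (std * math.sqrt(n)) for v in values]

    def pearson(e1, e2):
        return sum(a * b for a, b in zip(standardize(e1), standardize(e2)))

    print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0: perfectly linear
    print(pearson([1, 2, 3, 4], [4, 3, 2, 1]))   # -1.0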
@@ -202,10 +202,10 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar
 Given the global centroid of the dataset $\vec{c}$ and
 $K$ clusters each with $N_i$ objects,
 the sum of squares between clusters is given by:
-\[ \texttt{SSB} = \sum_{i=1}^{K} N_i \texttt{dist}(\vec{c}_i, \vec{c})^2 \]
+\[ \texttt{SSB} = \sum_{i=1}^{K} N_i \cdot \texttt{dist}(\vec{c}_i, \vec{c})^2 \]
 
 \item[Total sum of squares] \marginnote{Total sum of squares}
-Sum of the squared distances between the point of the dataset and the global centroid.
+Sum of the squared distances between the points of the dataset and the global centroid.
 It can be shown that the total sum of squares can be computed as:
 \[ \texttt{TSS} = \texttt{SSE} + \texttt{SSB} \]
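A minimal sketch of SSB and SSE (assuming the standard definition of SSE as the within-cluster sum of squared distances to the cluster centroid), so that TSS = SSE + SSB can be checked numerically on toy clusters with Euclidean distance:

    def centroid(points):
        return [sum(c) / len(points) for c in zip(*points)]

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def ssb_sse(clusters):
        global_c = centroid([p for cl in clusters for p in cl])
        ssb = sum(len(cl) * sq_dist(centroid(cl), global_c) for cl in clusters)
        sse = sum(sq_dist(p, centroid(cl)) for cl in clusters for p in cl)
        return ssb, sse

    clusters = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
    ssb, sse = ssb_sse(clusters)
    print(ssb, sse, ssb + sse)   # the sum equals the TSS of the whole dataset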
@@ -217,7 +217,7 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar
 The Silhouette score of a data point $\vec{x}_i$ belonging to a cluster $K_i$ is given by two components:
 \begin{description}
 \item[Sparsity contribution]
-The average distance of $\vec{x}_i$ to all other points in $K_i$:
+The average distance of $\vec{x}_i$ to the other points in $K_i$:
 \[ a(\vec{x}_i) = \frac{1}{\vert K_i \vert - 1} \sum_{\vec{x}_j \in K_i, \vec{x}_j \neq \vec{x}_i} \texttt{dist}(\vec{x}_i, \vec{x}_j) \]
 
 \item[Separation contribution]
@@ -278,7 +278,7 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar
 \item an encoding function $\texttt{encode}: \mathbb{R}^D \rightarrow [1, K]$;
 \item a decoding function $\texttt{decode}: [1, K] \rightarrow \mathbb{R}^D$.
 \end{itemize}
-Distortion (or inertia) is defines as:
+Distortion (or inertia) is defined as:
 \[ \texttt{distortion} = \sum_{i=1}^{N} \big(\vec{x}_i - \texttt{decode}(\texttt{encode}(\vec{x_i})) \big)^2 \]
 
 \begin{theorem}
@@ -288,7 +288,7 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar
 \item The center of a point is the centroid of the cluster it belongs to.
 \end{enumerate}
 
-Note that k-means alternates point 1 and 2.
+Note that k-means alternates points 1 and 2.
 
 \begin{proof}
 The second point is derived by imposing the derivative of \texttt{distortion} to 0.
@@ -311,7 +311,7 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar
 \begin{description}
 \item[Termination]
 There are a finite number of ways to cluster $N$ objects into $K$ clusters.
-By construction, at each iteration the \texttt{distortion} is reduced.
+By construction, at each iteration, the \texttt{distortion} is reduced.
 Therefore, k-means is guaranteed to terminate.
 
 \item[Non-optimality]
@@ -320,11 +320,11 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar
 The starting configuration is usually composed of points distant as far as possible.
 
 \item[Noise]
-Outliers heavily influences the clustering result. Sometimes, it is useful to remove them.
+Outliers heavily influence the clustering result. Sometimes, it is useful to remove them.
 
 \item[Complexity]
 Given a $D$-dimensional dataset of $N$ points,
-Running k-means for $T$ iterations to find $K$ clusters has complexity $O(TKND)$.
+running k-means for $T$ iterations to find $K$ clusters has complexity $O(TKND)$.
 \end{description}
 \end{description}
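A minimal sketch of the k-means loop discussed in these hunks, alternating the two steps of the theorem (assign each point to its nearest center, recompute each centroid) with a cap on the number of iterations:

    import random

    def kmeans(points, k, iters=100):
        centers = random.sample(points, k)
        clusters = [[] for _ in range(k)]
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for p in points:                  # step 1: nearest center
                i = min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
                clusters[i].append(p)
            new_centers = [                   # step 2: centroid of each cluster
                tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
                for i, cl in enumerate(clusters)
            ]
            if new_centers == centers:        # no change: distortion can no longer decrease
                break
            centers = new_centers
        return centers, clusters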
@@ -333,9 +333,9 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar
 \section{Hierarchical clustering}
 
 \begin{description}
-\item[Dendogram] \marginnote{Dendogram}
-Tree-like structure where the root is a cluster of all data points and
-the leaves are clusters with a single data points.
+\item[Dendrogram] \marginnote{Dendrogram}
+Tree-like structure where the root is a cluster of all the data points and
+the leaves are clusters with a single data point.
 
 \item[Agglomerative] \marginnote{Agglomerative}
 Starts with a cluster per data point and iteratively merges them (leaves to root).
@@ -380,12 +380,12 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar
 \begin{enumerate}
 \item Initialize a cluster for each data point.
 \item Compute the distance matrix between each cluster.
-\item Merge the two clusters with lowest separation,
-drop their values from the distance matrix and add an row/column for the newly created cluster.
+\item Merge the two clusters with the lowest separation,
+drop their values from the distance matrix and add a row/column for the newly created cluster.
 \item Go to point 2. if the number of clusters is greater than one.
 \end{enumerate}
 
-After the construction of the dendogram, a cut \marginnote{Cut} can be performed at a user define level.
+After the construction of the dendrogram, a cut \marginnote{Cut} can be performed at a user-defined level.
 A cut near the root will result in few bigger clusters.
 A cut near the leaves will result in numerous smaller clusters.
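A minimal sketch of the agglomerative procedure enumerated above; for brevity it recomputes pairwise separations instead of maintaining the distance matrix, and single linkage is an assumed choice of separation:

    def single_link(c1, c2):
        # separation between two clusters: minimum pairwise Euclidean distance
        return min(sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5 for p in c1 for q in c2)

    def agglomerative(points):
        clusters = [[p] for p in points]        # one cluster per data point
        merges = []
        while len(clusters) > 1:
            i, j = min(
                ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
                key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
            )                                   # pair with the lowest separation
            merged = clusters[i] + clusters[j]
            clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
            merges.append(merged)               # record one dendrogram level per merge
        return merges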
@@ -441,9 +441,9 @@ Consider as clusters the high-density areas of the data space.
 \item $\vec{q}$ is a core point.
 \item There exists a sequence of points $\vec{s}_1, \dots, \vec{s}_z$ such that:
 \begin{itemize}
-\item $\vec{s}_1$ is directly density reachable from $\vec{p}$.
+\item $\vec{s}_1$ is directly density reachable from $\vec{q}$.
 \item $\vec{s}_{i+1}$ is directly density reachable from $\vec{s}_i$.
-\item $\vec{q}$ is directly density reachable from $\vec{s}_z$.
+\item $\vec{p}$ is directly density reachable from $\vec{s}_z$.
 \end{itemize}
 \end{itemize}
@@ -455,7 +455,7 @@ Consider as clusters the high-density areas of the data space.
 Determine clusters as maximal sets of density connected points.
 Border points not density connected to any core point are labeled as noise.
 
-In other words, what happens it the following:
+In other words, what happens is the following:
 \begin{itemize}
 \item Neighboring core points are part of the same cluster.
 \item Border points are part of the cluster of their nearest core point neighbor.
@@ -480,7 +480,7 @@ Consider as clusters the high-density areas of the data space.
 \end{description}
 
 \item[Complexity]
-Complexity of $O(N^2)$ reduced to $O(N \log N)$ if using spatial indexing.
+Complexity of $O(N^2)$, reduced to $O(N \log N)$ if using spatial indexing.
 \end{description}
 \end{description}
@@ -493,7 +493,7 @@ Consider as clusters the high-density areas of the data space.
 
 \begin{description}
 \item[Kernel function] \marginnote{Kernel function}
-Symmetric and monotonically decreasing function to describe the influence of a data point to its neighbors.
+Symmetric and monotonically decreasing function to describe the influence of a data point on its neighbors.
 
 A typical kernel function is the Gaussian.
@@ -514,7 +514,7 @@ Consider as clusters the high-density areas of the data space.
 \item Derive a density function of the dataset.
 \item Identify local maximums and consider them as density attractors.
 \item Associate to each data point the density attractor in the direction of maximum increase.
-\item Points associated to the same density attractor are part of the same cluster.
+\item Points associated with the same density attractor are part of the same cluster.
 \item Remove clusters with a density attractor lower than $\xi$.
 \item Merge clusters connected through a path of points whose density is greater or equal to $\xi$
 (e.g. in \Cref{img:denclue} the center area will result in many small clusters that can be merged with an appropriate $\xi$).
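A minimal sketch of a Gaussian-kernel density function over the dataset, the kind of quantity whose local maxima act as density attractors in the steps above; the bandwidth sigma and the data are illustrative assumptions:

    import math

    def density(x, data, sigma=1.0):
        # sum of Gaussian kernel influences of every data point on x
        return sum(
            math.exp(-sum((a - b) ** 2 for a, b in zip(x, p)) / (2 * sigma ** 2))
            for p in data
        )

    data = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
    print(density((0.05, 0.0), data))   # high: inside a dense area
    print(density((2.5, 2.5), data))    # low: between the two groups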
@@ -34,10 +34,10 @@
 \item Data transformations.
 \end{itemize}
 
-\section{Modelling}
+\section{Modeling}
 \begin{itemize}
-\item Select modelling technique.
-\marginnote{Modelling}
+\item Select modeling technique.
+\marginnote{Modeling}
 \item Build/train the model.
 \end{itemize}
@@ -18,7 +18,7 @@
 Stored data can be classified as:
 \begin{descriptionlist}
 \item[Hot] \marginnote{Hot storage}
-A low volume of highly requested data that require low latency.
+A low volume of highly requested data that requires low latency.
 More expensive HW/SW.
 \item[Cold] \marginnote{Cold storage}
 A large amount of data that does not have latency requirements.
@@ -95,9 +95,8 @@
 \section{Components}
 
 \subsection{Data ingestion}
-\marginnote{Data ingestion}
 \begin{descriptionlist}
-\item[Workload migration]
+\item[Workload migration] \marginnote{Data ingestion}
 Inserting all the data from an existing source.
 \item[Incremental ingestion]
 Inserting changes since the last ingestion.
@@ -123,7 +122,7 @@
 \begin{description}
 \item[Columnar storage] \phantom{}
 \begin{itemize}
-\item Homogenous data are stores contiguously.
+\item Homogenous data are stored contiguously.
 \item Speeds up methods that process entire columns (i.e. all the values of a feature).
 \item Insertion becomes slower.
 \end{itemize}
@@ -134,9 +133,8 @@
 \end{description}
 
 \subsection{Processing and analytics}
-\marginnote{Processing and analytics}
 \begin{descriptionlist}
-\item[Interactive analytics]
+\item[Interactive analytics] \marginnote{Processing and analytics}
 Interactive queries to large volumes of data.
 The results are stored back in the data lake.
 \item[Big data analytics]
@@ -149,11 +147,13 @@
 \section{Architectures}
 
 \subsection{Lambda lake}
-\marginnote{Lambda lake}
 \begin{description}
-\item[Batch layer] Receives and stores the data. Prepares the batch views for the serving layer.
-\item[Serving layer] Indexes batch views for faster queries.
-\item[Speed layer] Receives the data and prepares real-time views. The views are also stored in the serving layer.
+\item[Batch layer] \marginnote{Lambda lake}
+Receives and stores the data. Prepares the batch views for the serving layer.
+\item[Serving layer]
+Indexes batch views for faster queries.
+\item[Speed layer]
+Receives the data and prepares real-time views. The views are also stored in the serving layer.
 \end{description}
 \begin{figure}[ht]
 \centering
@@ -190,7 +190,7 @@ Framework that adds features on top of an existing data lake.
 
 \section{Metadata}
 \marginnote{Metadata}
-Metadata are used to organize a data lake.
+Metadata is used to organize a data lake.
 Useful metadata are:
 \begin{descriptionlist}
 \item[Source] Origin of the data.
@@ -21,9 +21,9 @@ Useful for:
 \section{Sampling}
 \marginnote{Sampling}
 Sampling can be used when the full dataset is too expensive to obtain or too expensive to process.
-Obviously a sample has to be representative.
+Obviously, a sample has to be representative.
 
-Type of sampling techniques are:
+The types of sampling techniques are:
 \begin{descriptionlist}
 \item[Simple random] \marginnote{Simple random}
 Extraction of a single element following a given probability distribution.
@@ -45,7 +45,7 @@ Type of sampling techniques are:
 \begin{description}
 \item[Sample size]
 The sampling size represents a tradeoff between data reduction and precision.
-In a labeled dataset, it is important to consider the probability of sampling data of all the possible classes.
+In a labeled dataset, it is important to consider the probability of sampling data from all the possible classes.
 \end{description}
@@ -121,7 +121,7 @@ Possible approaches are:
 For each entry, if its feature $E$ has value $e_i$, then $H_{e_i} = \texttt{true}$ and the rests are \texttt{false}.
 
 \subsection{Ordinal encoding} \marginnote{Ordinal encoding}
-A feature whose values have an ordering can be converted in a consecutive sequence of integers
+A feature whose values have an ordering can be converted into a consecutive sequence of integers
 (e.g. ["good", "neutral", "bad"] $\mapsto$ [1, 0, -1]).
 
 \subsection{Discretization} \marginnote{Discretization}
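A minimal sketch of the ordinal encoding mentioned in this hunk, reusing the example mapping from the notes:

    # map ordered categorical values to consecutive integers
    order = {"good": 1, "neutral": 0, "bad": -1}
    values = ["good", "bad", "neutral", "good"]
    encoded = [order[v] for v in values]
    print(encoded)   # [1, -1, 0, 1]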
@@ -7,13 +7,13 @@
 Deliver the right information to the right people at the right time through the right channel.
 
 \item[\Ac{dwh}] \marginnote{\Acl{dwh}}
-Optimized repository that stores information for decision making processes.
+Optimized repository that stores information for decision-making processes.
 \Acp{dwh} are a specific type of \ac{dss}.
 
 Features:
 \begin{itemize}
-\item Subject-oriented: focused on enterprise specific concepts.
-\item Integrates data from different sources and provides an unified view.
+\item Subject-oriented: focused on enterprise-specific concepts.
+\item Integrates data from different sources and provides a unified view.
 \item Non-volatile storage with change tracking.
 \end{itemize}
@@ -143,7 +143,7 @@ Operational data may contain:
 \item[Missing data]
 \item[Improper use of fields] (e.g. saving the phone number in the \texttt{notes} field)
 \item[Wrong values] (e.g. 30th of February)
-\item[Inconsistency] (e.g. use of different abbreviations)
+\item[Inconsistencies] (e.g. use of different abbreviations)
 \item[Typos]
 \end{descriptionlist}
@@ -154,10 +154,10 @@ Methods to clean and increase the quality of the data are:
 Applicable if the domain is known and limited.
 
 \item[Approximate merging] \marginnote{Approximate merging}
-Merging data that do not have a common key.
+Methods to merge data that do not have a common key.
 \begin{description}
 \item[Approximate join]
-Use non-key attributes to join two tables (e.g. using the name and surname instead of an unique identifier).
+Use non-key attributes to join two tables (e.g. using the name and surname instead of a unique identifier).
 
 \item[Similarity approach]
 Use similarity functions (e.g. edit distance) to merge multiple instances of the same information
@@ -178,7 +178,7 @@ Data are transformed to respect the format of the data warehouse:
 Creating new information by using existing attributes (e.g. compute profit from receipts and expenses)
 
 \item[Separation and concatenation] \marginnote{Separation and concatenation}
-Denormalization of the data: introduces redundances (i.e. breaks normal form\footnote{\url{https://en.wikipedia.org/wiki/Database_normalization}})
+Denormalization of the data: introduces redundancies (i.e. breaks normal form\footnote{\url{https://en.wikipedia.org/wiki/Database_normalization}})
 to speed up operations.
 \end{descriptionlist}
@@ -332,13 +332,14 @@ Aggregation operators can be classified as:
 \end{description}
 
 
-\subsection{Logical design}
+
+\section{Logical design}
 \marginnote{Logical design}
 Defining the data structures (e.g. tables and relationships) according to a conceptual model.
-There are mainly two strategies:
+There are two main strategies:
 \begin{descriptionlist}
 \item[Star schema] \marginnote{Star schema}
-A fact table that contains all the measures and linked to dimensional tables.
+A fact table that contains all the measures is linked to dimensional tables.
 \begin{figure}[ht]
 \centering
 \includegraphics[width=\textwidth]{img/logical_star_schema.png}
@@ -346,8 +347,8 @@ There are mainly two strategies:
 \end{figure}
 
 \item[Snowflake schema] \marginnote{Snowflake schema}
-A star schema variant with partially normalized dimension tables.
-\begin{figure}[ht]
+A star schema variant with partially normalized dimensional tables.
+\begin{figure}[H]
 \centering
 \includegraphics[width=\textwidth]{img/logical_snowflake_schema.png}
 \caption{Example of snowflake schema}
@@ -30,7 +30,7 @@
 \subsection{Software}
 \begin{description}
 \item[\Ac{oltp}] \marginnote{\Acl{oltp}}
-Class of programs to support transaction oriented applications and data storage.
+Class of programs to support transaction-oriented applications and data storage.
 Suitable for real-time applications.
 
 \item[\Ac{erp}] \marginnote{\Acl{erp}}
@@ -41,10 +41,10 @@
 
 
 \subsection{Insight}
-Decision can be classified as:
+Decisions can be classified as:
 \begin{descriptionlist}
 \item[Structured] \marginnote{Structured decision}
-Established and well understood situations.
+Established and well-understood situations.
 What is needed is known.
 \item[Unstructured] \marginnote{Unstructured decision}
 Unplanned and unclear situations.
@@ -54,18 +54,18 @@ Decision can be classified as:
 Different levels of insight can be extracted by:
 \begin{descriptionlist}
 \item[\Ac{mis}] \marginnote{\Acl{mis}}
-Standardized reporting system built on existing \ac{oltp}.
+Standardized reporting system built on an existing \ac{oltp}.
 Used for structured decisions.
 
 \item[\Ac{dss}] \marginnote{\Acl{dss}}
 Analytical system to provide support for unstructured decisions.
 
 \item[\Ac{eis}] \marginnote{\Acl{eis}}
-Formulate high level decisions that impact the organization.
+Formulate high-level decisions that impact the organization.
 
 \item[\Ac{olap}] \marginnote{\Acl{olap}}
 Grouped analysis of multidimensional data.
-Involves large amount of data.
+Involves a large amount of data.
 
 \item[\Ac{bi}] \marginnote{\Acl{bi}}
 Applications, infrastructure, tools and best practices to analyze information.
@@ -75,7 +75,7 @@ Different levels of insight can be extracted by:
 
 \begin{description}
 \item[Big data] \marginnote{Big data}
-Large and/or complex and/or fast changing collection of data that traditional DBMSs are unable to process.
+Large and/or complex and/or fast-changing collection of data that traditional DBMSs are unable to process.
 \begin{description}
 \item[Structured] e.g. relational tables.
 \item[Unstructured] e.g. videos.
@@ -11,7 +11,7 @@
 \item[Regression] Estimation of a numeric value.
 \item[Similarity matching] Identify similar individuals.
 \item[Clustering] Grouping individuals based on their similarities.
-\item[Co-occurrence groupping] Identify associations between entities based on the transactions in which they appear together.
+\item[Co-occurrence grouping] Identify associations between entities based on the transactions in which they appear together.
 \item[Profiling] Behavior description.
 \item[Link analysis] Analysis of connections (e.g. in a graph).
 \item[Data reduction] Reduce the dimensionality of data with minimal information loss.
@@ -69,7 +69,7 @@
 
 \textbf{Operators.} $=$, $\neq$, $<$, $>$, $\leq$, $\geq$, $+$, $-$
 \begin{example}
-Celsius and Fahrenheit temperature scales, CGPA, time.
+Celsius and Fahrenheit temperature scales, CGPA, time, \dots.
 
 For instance, there is a $6.25\%$ increase from $16\text{°C}$ to $17\text{°C}$, but
 converted in Fahrenheit, the increase is of $2.96\%$ (from $60.8\text{°F}$ to $62.6\text{°F}$).
@@ -157,14 +157,14 @@
 \item[Missing values] \marginnote{Missing values}
 Data that have not been collected.
 Sometimes they are not easily recognizable
-(e.g. when special values are used, instead of \texttt{null}, to mark missing data).
+(e.g. when special values are used to mark missing data instead of \texttt{null}).
 
 Can be handled in different ways:
 \begin{itemize}
 \item Ignore the records with missing values.
 \item Estimate or default missing values.
 \item Ignore the fact that some values are missing (not always applicable).
-\item Insert all the possible values and weight them by their probability.
+\item Insert all the possible values and weigh them by their probability.
 \end{itemize}
 
 \item[Duplicated data] \marginnote{Duplicated data}
@@ -27,7 +27,7 @@
 \begin{itemize}
 \item MSE is influenced by the magnitude of the data.
 \item It measures the fitness of a model in absolute terms.
-\item It is suited to compare different models.
+% \item It is suited to compare different models.
 \end{itemize}
 
 \item[Coefficient of determination] \marginnote{Coefficient of determination}