Fix typos <noupdate>

2024-01-10 11:24:05 +01:00
parent 73fe58ed0b
commit e48a993ccc
10 changed files with 105 additions and 102 deletions

View File

@ -80,7 +80,7 @@
\item \marginnote{Frequent itemset generation}
Determine the itemsets with $\text{support} \geq \texttt{min\_sup}$ (frequent itemsets).
\item \marginnote{Rule generation}
Determine the the association rules with $\text{confidence} \geq \texttt{min\_conf}$.
Determine the association rules with $\text{confidence} \geq \texttt{min\_conf}$.
\end{enumerate}
\end{description}
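A minimal Python illustration of the support and confidence used in the two steps above (toy transactions and helper names are mine, not from the notes):

transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk"}]

def support(itemset):
    # Fraction of transactions that contain the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # Support of the whole rule divided by the support of its antecedent.
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "milk"}))       # 0.5  -> frequent if min_sup <= 0.5
print(confidence({"bread"}, {"milk"}))  # 0.5 / 0.75 ≈ 0.67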
@ -250,7 +250,7 @@ Measures that take into account the statistical independence of the items.
\hline
High support & The rule applies to many transactions. \\
\hline
High confidence & The chance that the rule is true for some transaction is high. \\
High confidence & The chance that the rule is true for some transactions is high. \\
\hline
High lift & Low chance that the rule is just a coincidence. \\
\hline
@ -329,7 +329,7 @@ Measures that take into account the statistical independence of the items.
\section{Multi-level association rules}
Organize items into an hierarchy.
Organize items into a hierarchy.
\begin{description}
\item[Specialized to general] \marginnote{Specialized to general}
@ -345,7 +345,7 @@ Organize items into an hierarchy.
\end{example}
\item[Redundant level] \marginnote{Redundant level}
A more specialized rule in the hierarchy is redundant if its confidence is similar to the one of the more general rule.
A more specialized rule in the hierarchy is redundant if its confidence is similar to the one of a more general rule.
\item[Multi-level association rule mining] \marginnote{Multi-level association rule mining}
Run association rule mining on different levels of abstraction (general to specialized).

View File

@ -65,9 +65,9 @@
A supervised dataset can be randomly split into:
\begin{descriptionlist}
\item[Train set] \marginnote{Train set}
Used to learn the model. Usually the largest split. Can be seen as an upper-bound of the model performance.
Used to learn the model. Usually the largest split. Can be seen as an upper bound of the model performance.
\item[Test set] \marginnote{Test set}
Used to evaluate the trained model. Can be seen as a lower-bound of the model performance.
Used to evaluate the trained model. Can be seen as a lower bound of the model performance.
\item[Validation set] \marginnote{Validation set}
Used to evaluate the model during training and/or for tuning parameters.
\end{descriptionlist}
@ -93,7 +93,7 @@
\subsection{Test set error}
\textbf{\underline{Disclaimer: I'm very unsure about this part}}\\
The error on the test set can be seen as a lower-bound error of the model.
The error on the test set can be seen as a lower bound error of the model.
If the test set error ratio is $x$, we can expect an error of $(x \pm \text{confidence interval})$.
Predicting the elements of the test set can be seen as a binomial process (i.e. a series of $N$ Bernoulli trials).
@ -114,7 +114,7 @@ be between the $\frac{\alpha}{2}$ and $(1-\frac{\alpha}{2})$ quantiles of the ga
We can estimate $p$ using the Wilson score interval\footnote{\url{https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval}}:
\[ p = \frac{1}{1+\frac{1}{N}z^2} \left( f + \frac{1}{2N}z^2 \pm z\sqrt{\frac{1}{N}f(1-f) + \frac{z^2}{4N^2}} \right) \]
where $z$ depends on the value of $\alpha$.
For a pessimistic estimate, $\pm$ becomes a $+$. Vice versa, for a optimistic estimate, $\pm$ becomes a $-$.
For a pessimistic estimate, $\pm$ becomes a $+$. Vice versa, for an optimistic estimate, $\pm$ becomes a $-$.
As $N$ appears in the denominator, the uncertainty becomes smaller for large values of $N$.
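A small sketch of the interval above, assuming $z = 1.96$ for $\alpha = 0.05$ (the helper name is mine):

import math

def wilson_interval(f, N, z=1.96):
    # f: observed error ratio on the test set, N: test set size, z: depends on alpha.
    center = (f + z**2 / (2 * N)) / (1 + z**2 / N)
    half = (z / (1 + z**2 / N)) * math.sqrt(f * (1 - f) / N + z**2 / (4 * N**2))
    return center - half, center + half  # optimistic (-) and pessimistic (+) estimates

print(wilson_interval(f=0.15, N=200))    # ≈ (0.107, 0.206); the interval narrows as N grows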
\begin{center}
@ -127,25 +127,25 @@ As $N$ is at the denominator, this means that for large values of $N$, the uncer
\item[Holdout] \marginnote{Holdout}
The dataset is split into train, test and, if needed, validation.
\item[Cross validation] \marginnote{Cross validation}
\item[Cross-validation] \marginnote{Cross-validation}
The training data is partitioned into $k$ chunks.
For $k$ iterations, one of the chunks if used to test and the others to train a new model.
For $k$ iterations, one of the chunks is used to test and the others to train a new model.
The overall error is obtained as the average of the errors of the $k$ iterations.
At the end, the final model is still trained on the entire training data,
while cross validation results are used as an evaluation and comparison metric.
Note that cross validation is done on the training set, so a final test set can still be used to
evaluate the final model.
In the end, the final model is still trained on the entire training data,
while cross-validation results are used as an evaluation and comparison metric.
Note that cross-validation is done on the training set, so a final test set can still be used to
evaluate the resulting model.
\begin{figure}[h]
\centering
\includegraphics[width=0.6\textwidth]{img/cross_validation.png}
\caption{Cross validation example}
\caption{Cross-validation example}
\end{figure}
\item[Leave-one-out] \marginnote{Leave-one-out}
Extreme case of cross validation with $k=N$, the size of the training set.
In this case the whole dataset but one element is used for training and the remaining entry for testing.
Extreme case of cross-validation with $k=N$, the size of the training set.
In this case, the whole dataset but one element is used for training and the remaining entry for testing.
\item[Bootstrap] \marginnote{Bootstrap}
Statistical sampling of the dataset with replacement (i.e. an entry can be selected multiple times).
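A compact sketch of the cross-validation scheme described above, using plain NumPy arrays; the `train_and_score` callable is a placeholder for training a model and returning its error on the held-out chunk:

import numpy as np

def k_fold_cv(X, y, train_and_score, k=5, seed=0):
    # Partition the training data into k chunks; each chunk is used once as the test fold.
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        errors.append(train_and_score(X[train_idx], y[train_idx],
                                      X[test_idx], y[test_idx]))
    return float(np.mean(errors))  # overall error = average over the k iterations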
@ -192,7 +192,7 @@ Given a test set of $N$ element, possible metrics are:
\item[Recall/Sensitivity] \marginnote{Recall}
Number of true positives among the real positives
(i.e. how many real positive the model predicted).
(i.e. how many of the real positives the model correctly identified).
\[ \text{recall} = \frac{TP}{TP + FN} \]
\item[Specificity] \marginnote{Specificity}
@ -296,7 +296,7 @@ a macro (unweighted) average or a class-weighted average.
\item[ROC curve] \marginnote{ROC curve}
The ROC curve can be seen as a way to represent multiple confusion matrices of a classifier
that uses different thresholds.
The x-axis of a ROC curve represent the false positive rate while the y-axis represent the true positive rate.
The x-axis of a ROC curve represents the false positive rate while the y-axis represents the true positive rate.
A straight line is used to represent a random classifier.
A threshold can be considered good if it is high on the y-axis and low on the x-axis.
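For reference, a ROC curve can be obtained from classifier scores with scikit-learn (assuming it is installed; the toy labels and scores below are mine):

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 1, 0]                        # real classes
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.45]    # classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)        # one (FPR, TPR) point per threshold
print(list(zip(thresholds, fpr, tpr)))
print(roc_auc_score(y_true, y_score))                    # area under the ROC curve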
@ -314,7 +314,7 @@ A classifier may not perform well when predicting a minority class of the traini
Possible solutions are:
\begin{descriptionlist}
\item[Undersampling] \marginnote{Undersampling}
Randomly reduce the number of example of the majority classes.
Randomly reduce the number of examples of the majority classes.
\item[Oversampling] \marginnote{Oversampling}
Increase the number of examples of the minority classes.
@ -324,7 +324,7 @@ Possible solutions are:
\begin{enumerate}
\item Randomly select an example $x$ belonging to the minority class.
\item Select a random neighbor $z_i$ among its $k$-nearest neighbors $z_1, \dots, z_k$.
\item Synthetize a new example by selecting a random point of the feature space between $x$ and $z_i$.
\item Synthesize a new example by selecting a random point of the feature space between $x$ and $z_i$.
\end{enumerate}
\end{description}
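A minimal sketch of the SMOTE steps listed above (function and variable names are mine):

import numpy as np

def smote_sample(X_min, k=5, rng=None):
    # X_min: minority-class points (NumPy array, one row per example, at least k+1 rows).
    if rng is None:
        rng = np.random.default_rng(0)
    x = X_min[rng.integers(len(X_min))]              # 1. random minority example x
    dists = np.linalg.norm(X_min - x, axis=1)
    neighbors = X_min[np.argsort(dists)[1:k + 1]]    # 2. its k nearest neighbors
    z = neighbors[rng.integers(len(neighbors))]      #    pick a random neighbor z_i
    return x + rng.random() * (z - x)                # 3. random point between x and z_i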
@ -346,7 +346,7 @@ Possible solutions are:
\begin{description}
\item[Shannon theorem] \marginnote{Shannon theorem}
Let $\matr{X} = \{ \vec{v}_1, \dots, \vec{v}_V \}$ be a data source where
each of the possible value has probability $p_i = \prob{\vec{v}_i}$.
each of the possible values has probability $p_i = \prob{\vec{v}_i}$.
The best encoding transmits $\matr{X}$ with
an average number of bits given by the \textbf{entropy} of $\matr{X}$: \marginnote{Entropy}
\[ H(\matr{X}) = - \sum_j p_j \log_2(p_j) \]
@ -354,7 +354,7 @@ Possible solutions are:
If $p_j \sim 1$, then the surprise of observing $\vec{v}_j$ is low, vice versa,
if $p_j \sim 0$, the surprise of observing $\vec{v}_j$ is high.
Therefore, when $H(\matr{X})$ is high, $\matr{X}$ is close to an uniform distribution.
Therefore, when $H(\matr{X})$ is high, $\matr{X}$ is close to a uniform distribution.
When $H(\matr{X})$ is low, $\matr{X}$ is close to a constant.
\begin{example}[Binary source] \phantom{}\\
@ -382,7 +382,7 @@ Possible solutions are:
It is computed as:
\[ IG(c \,\vert\, d \,:\, t) = H(c) - H(c \,\vert\, d \,:\, t) \]
When $H(c \,\vert\, d \,:\, t)$ is low, $IG(c \,\vert\, d \,:\, t)$ is high
as splitting with threshold $t$ result in purer groups.
as splitting with threshold $t$ results in purer groups.
Vice versa, when $H(c \,\vert\, d \,:\, t)$ is high, $IG(c \,\vert\, d \,:\, t)$ is low
as splitting with threshold $t$ is not very useful.
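A small sketch of the information gain of a threshold split, following the formula above (helper names are mine; `values` and `labels` are NumPy arrays):

import numpy as np

def entropy(labels):
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(values, labels, t):
    # IG(c | d : t) = H(c) - H(c | d : t), splitting on values <= t vs values > t.
    left, right = labels[values <= t], labels[values > t]
    h_cond = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - h_cond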
@ -520,7 +520,7 @@ each node requires to process all the attributes.
Assuming an average height of $O(\log N)$,
the overall complexity for induction (parameter search) is $O(DN \log N)$.
Moreover, The other operations of a binary tree have complexity:
Moreover, the other operations of a binary tree have complexity:
\begin{itemize}
\item Threshold search and binary split: $O(N \log N)$ (scan the dataset for the threshold).
\item Pruning: $O(N \log N)$ (requires to scan the dataset).
@ -584,8 +584,8 @@ This has complexity $O(h)$, with $h$ the height of the tree.
\item[Smoothing]
If the value $e_{ij}$ of the domain of a feature $E_i$ never appears in the dataset,
its probability $\prob{e_{ij} \mid c}$ will be 0 for all classes.
This nullifies all the probabilities that uses this feature when
computing the products chain during inference.
This nullifies all the probabilities that use this feature when
computing the product chain during inference.
Smoothing methods can be used to avoid this problem.
\begin{description}
@ -597,14 +597,14 @@ This has complexity $O(h)$, with $h$ the height of the tree.
\item[$\vert \mathbb{D}_{E_i} \vert$] The number of distinct values in the domain of $E_i$.
\item[\normalfont$\text{af}_{c}$] The absolute frequency of the class $c$.
\end{descriptionlist}
the smoothed frequency is computed as:
The smoothed frequency is computed as:
\[
\prob{e_{ij} \mid c} = \frac{\text{af}_{e_{ij}, c} + \alpha}{\text{af}_{c} + \alpha \vert \mathbb{D}_{E_i} \vert}
\]
A common value of $\alpha$ is 1.
When $\alpha = 0$, there is no smoothing.
For higher values of $\alpha$, the smoothed feature gain more importance when computing the priors.
For higher values of $\alpha$, the smoothed feature gains more importance when computing the priors.
\end{description}
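Returning to the smoothing formula above, a small numerical sketch (names are mine):

def smoothed_likelihood(af_value_class, af_class, domain_size, alpha=1.0):
    # P(e_ij | c) = (af_{e_ij,c} + alpha) / (af_c + alpha * |D_{E_i}|)
    return (af_value_class + alpha) / (af_class + alpha * domain_size)

# A value never observed together with class c still gets a non-zero probability:
print(smoothed_likelihood(af_value_class=0, af_class=40, domain_size=5))  # 1/45 ≈ 0.022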
\item[Missing values] \marginnote{Missing values}
@ -704,7 +704,7 @@ In practice, a maximum number of iterations is set.
\end{split}
\]
where $M$ is the margin, $w_i$ are the weights of the hyperplane and $c_i = \{-1, 1 \}$ is the class.
The second constraint imposes the hyperplane to have a large margine.
The second constraint forces the hyperplane to have a large margin.
For positive labels ($c_i=1$), this holds when the hyperplane function is positive.
For negative labels ($c_i=-1$), it holds when the hyperplane function is negative.
@ -736,7 +736,7 @@ Then, the data and the boundary is mapped back into the original space.
\caption{Example of mapping from $\mathbb{R}^2$ to $\mathbb{R}^3$}
\end{figure}
The kernel trick allows to avoid to explicitly map the dataset into the new space by using kernel functions.
The kernel trick avoids explicitly mapping the dataset into the new space by using kernel functions.
Known kernel functions are:
\begin{descriptionlist}
\item[Linear] $K(x, y) = \langle x, y \rangle$.
@ -755,7 +755,7 @@ depending on the effectiveness of data caching.
\begin{itemize}
\item Training an SVM model is generally slower.
\item SVM is not affected by local minima.
\item SVM do not suffer the curse of dimensionality.
\item SVM does not suffer the curse of dimensionality.
\item SVM does not directly provide probability estimates.
If needed, these can be computed using a computationally expensive method.
\end{itemize}
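As an illustration of the kernel trick mentioned above, the Gaussian (RBF) kernel computes an inner product in an implicit feature space without ever constructing that space explicitly (sketch; the gamma value is arbitrary):

import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))

print(rbf_kernel([1.0, 2.0], [1.5, 1.0]))  # similarity of the two points in the implicit space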
@ -771,10 +771,12 @@ depending on the effectiveness of data caching.
\item[Activation function] \marginnote{Activation function}
Activation functions are useful to add non-linearity.
\begin{remark}
In a linear system, if there is noise in the input, it is transferred to the output
(i.e. linearity implies that $f(x + \text{noise}) = f(x) + f(\text{noise})$).
On the other hand, a non-linear system is generally more robust
(i.e. non-linearity generally implies that $f(x + \text{noise}) \neq f(x) + f(\text{noise})$).
\end{remark}
\item[Feedforward neural network] \marginnote{Feedforward neural network}
Network with the following flow:
@ -791,7 +793,7 @@ Inputs are fed to the network and backpropagation is used to update the weights.
Size of the step for gradient descent.
\item[Epoch] \marginnote{Epoch}
A round of training where the entire dataset has been processed.
A round of training where the entire dataset is processed.
\item[Stopping criteria] \marginnote{Stopping criteria}
Possible conditions to stop the training are:
@ -874,7 +876,7 @@ Different strategies to train an ensemble classifier can be used:
\subsection{Random forests}
\marginnote{Random forests}
Different decision trees trained on a different random sampling of the training set and different subset of features.
Multiple decision trees trained on a different random sampling of the training set and different subsets of features.
A prediction is made by averaging the output of each tree.
\begin{description}

View File

@ -10,7 +10,7 @@
\item[Dissimilarity] \marginnote{Dissimilarity}
Measures how two objects differ.
0 indicates no difference while the upper-bound varies.
0 indicates no difference while the upper bound varies.
\end{description}
\begin{table}[ht]
@ -119,7 +119,7 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar
\begin{description}
\item[Pearson's correlation] \marginnote{Pearson's correlation}
Measure of linear relationship between a pair of quantitative attributes $e_1$ and $e_2$.
To compute the Pearson's correlation, the values of $e_1$ and $e_2$ are first standardized and then ordered to obtain the vectors $\vec{e}_1$ and $\vec{e}_2$.
To compute Pearson's correlation, the values of $e_1$ and $e_2$ are first standardized and then ordered to obtain the vectors $\vec{e}_1$ and $\vec{e}_2$.
The correlation is then computed as the dot product between $\vec{e}_1$ and $\vec{e}_2$:
\[ \texttt{corr}(e_1, e_2) = \langle \vec{e}_1, \vec{e}_2 \rangle \]
@ -202,10 +202,10 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar
Given the global centroid of the dataset $\vec{c}$ and
$K$ clusters each with $N_i$ objects,
the sum of squares between clusters is given by:
\[ \texttt{SSB} = \sum_{i=1}^{K} N_i \texttt{dist}(\vec{c}_i, \vec{c})^2 \]
\[ \texttt{SSB} = \sum_{i=1}^{K} N_i \cdot \texttt{dist}(\vec{c}_i, \vec{c})^2 \]
\item[Total sum of squares] \marginnote{Total sum of squares}
Sum of the squared distances between the point of the dataset and the global centroid.
Sum of the squared distances between the points of the dataset and the global centroid.
It can be shown that the total sum of squares can be computed as:
\[ \texttt{TSS} = \texttt{SSE} + \texttt{SSB} \]
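A small sketch checking the decomposition $\texttt{TSS} = \texttt{SSE} + \texttt{SSB}$ on a clustered dataset (helper name is mine; `X` is an N×D NumPy array, `labels` the cluster assignment):

import numpy as np

def sse_ssb_tss(X, labels):
    c_global = X.mean(axis=0)
    sse = ssb = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        ck = Xk.mean(axis=0)
        sse += np.sum((Xk - ck) ** 2)                  # cohesion: points vs. own centroid
        ssb += len(Xk) * np.sum((ck - c_global) ** 2)  # separation: centroids vs. global centroid
    tss = float(np.sum((X - c_global) ** 2))
    return sse, ssb, tss                               # tss == sse + ssb (up to rounding)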
@ -217,7 +217,7 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar
The Silhouette score of a data point $\vec{x}_i$ belonging to a cluster $K_i$ is given by two components:
\begin{description}
\item[Sparsity contribution]
The average distance of $\vec{x}_i$ to all other points in $K_i$:
The average distance of $\vec{x}_i$ to the other points in $K_i$:
\[ a(\vec{x}_i) = \frac{1}{\vert K_i \vert - 1} \sum_{\vec{x}_j \in K_i, \vec{x}_j \neq \vec{x}_i} \texttt{dist}(\vec{x}_i, \vec{x}_j) \]
\item[Separation contribution]
@ -278,7 +278,7 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar
\item an encoding function $\texttt{encode}: \mathbb{R}^D \rightarrow [1, K]$;
\item a decoding function $\texttt{decode}: [1, K] \rightarrow \mathbb{R}^D$.
\end{itemize}
Distortion (or inertia) is defines as:
Distortion (or inertia) is defined as:
\[ \texttt{distortion} = \sum_{i=1}^{N} \big(\vec{x}_i - \texttt{decode}(\texttt{encode}(\vec{x_i})) \big)^2 \]
\begin{theorem}
@ -288,7 +288,7 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar
\item The center of a point is the centroid of the cluster it belongs to.
\end{enumerate}
Note that k-means alternates point 1 and 2.
Note that k-means alternates points 1 and 2.
\begin{proof}
The second point is derived by setting the derivative of \texttt{distortion} to 0.
@ -311,7 +311,7 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar
\begin{description}
\item[Termination]
There are a finite number of ways to cluster $N$ objects into $K$ clusters.
By construction, at each iteration the \texttt{distortion} is reduced.
By construction, at each iteration, the \texttt{distortion} is reduced.
Therefore, k-means is guaranteed to terminate.
\item[Non-optimality]
@ -320,11 +320,11 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar
The starting configuration is usually composed of points that are as far apart as possible.
\item[Noise]
Outliers heavily influences the clustering result. Sometimes, it is useful to remove them.
Outliers heavily influence the clustering result. Sometimes, it is useful to remove them.
\item[Complexity]
Given a $D$-dimensional dataset of $N$ points,
Running k-means for $T$ iterations to find $K$ clusters has complexity $O(TKND)$.
running k-means for $T$ iterations to find $K$ clusters has complexity $O(TKND)$.
\end{description}
\end{description}
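A compact sketch of the alternation described above (assign each point to its nearest centroid, then move each centroid to the mean of its points); names and initialization are mine:

import numpy as np

def k_means(X, K, T=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(T):
        # Step 1: encode each point as the index of its nearest centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: each centroid becomes the mean of the points assigned to it.
        new_centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                  else centroids[k] for k in range(K)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels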
@ -333,9 +333,9 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar
\section{Hierarchical clustering}
\begin{description}
\item[Dendogram] \marginnote{Dendogram}
Tree-like structure where the root is a cluster of all data points and
the leaves are clusters with a single data points.
\item[Dendrogram] \marginnote{Dendrogram}
Tree-like structure where the root is a cluster of all the data points and
the leaves are clusters with a single data point.
\item[Agglomerative] \marginnote{Agglomerative}
Starts with a cluster per data point and iteratively merges them (leaves to root).
@ -380,12 +380,12 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar
\begin{enumerate}
\item Initialize a cluster for each data point.
\item Compute the distance matrix between the clusters.
\item Merge the two clusters with lowest separation,
drop their values from the distance matrix and add an row/column for the newly created cluster.
\item Merge the two clusters with the lowest separation,
drop their values from the distance matrix and add a row/column for the newly created cluster.
\item Go to point 2 if the number of clusters is greater than one.
\end{enumerate}
After the construction of the dendogram, a cut \marginnote{Cut} can be performed at a user define level.
After the construction of the dendrogram, a cut \marginnote{Cut} can be performed at a user-defined level.
A cut near the root will result in few bigger clusters.
A cut near the leaves will result in numerous smaller clusters.
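For reference, the agglomerative procedure and the cut can be reproduced with SciPy (assuming it is available; data and parameters below are arbitrary):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).random((10, 2))     # toy 2-D dataset
Z = linkage(X, method="complete")                # merge history (the dendrogram)
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the dendrogram into 3 clusters
print(labels)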
@ -441,9 +441,9 @@ Consider as clusters the high-density areas of the data space.
\item $\vec{q}$ is a core point.
\item There exists a sequence of points $\vec{s}_1, \dots, \vec{s}_z$ such that:
\begin{itemize}
\item $\vec{s}_1$ is directly density reachable from $\vec{p}$.
\item $\vec{s}_1$ is directly density reachable from $\vec{q}$.
\item $\vec{s}_{i+1}$ is directly density reachable from $\vec{s}_i$.
\item $\vec{q}$ is directly density reachable from $\vec{s}_z$.
\item $\vec{p}$ is directly density reachable from $\vec{s}_z$.
\end{itemize}
\end{itemize}
@ -455,7 +455,7 @@ Consider as clusters the high-density areas of the data space.
Determine clusters as maximal sets of density connected points.
Border points not density connected to any core point are labeled as noise.
In other words, what happens it the following:
In other words, what happens is the following:
\begin{itemize}
\item Neighboring core points are part of the same cluster.
\item Border points are part of the cluster of their nearest core point neighbor.
@ -480,7 +480,7 @@ Consider as clusters the high-density areas of the data space.
\end{description}
\item[Complexity]
Complexity of $O(N^2)$ reduced to $O(N \log N)$ if using spatial indexing.
Complexity of $O(N^2)$, reduced to $O(N \log N)$ if using spatial indexing.
\end{description}
\end{description}
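For reference, the procedure above is implemented in scikit-learn (assuming it is installed; eps and min_samples below are arbitrary):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).random((50, 2))           # toy 2-D dataset
labels = DBSCAN(eps=0.15, min_samples=4).fit(X).labels_
print(labels)                                          # -1 marks points labeled as noise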
@ -493,7 +493,7 @@ Consider as clusters the high-density areas of the data space.
\begin{description}
\item[Kernel function] \marginnote{Kernel function}
Symmetric and monotonically decreasing function to describe the influence of a data point to its neighbors.
Symmetric and monotonically decreasing function to describe the influence of a data point on its neighbors.
A typical kernel function is the Gaussian.
@ -514,7 +514,7 @@ Consider as clusters the high-density areas of the data space.
\item Derive a density function of the dataset.
\item Identify local maxima and consider them as density attractors.
\item Associate each data point with the density attractor in the direction of maximum increase.
\item Points associated to the same density attractor are part of the same cluster.
\item Points associated with the same density attractor are part of the same cluster.
\item Remove clusters whose density attractor has a density lower than $\xi$.
\item Merge clusters connected through a path of points whose density is greater than or equal to $\xi$
(e.g. in \Cref{img:denclue} the center area will result in many small clusters that can be merged with an appropriate $\xi$).

View File

@ -34,10 +34,10 @@
\item Data transformations.
\end{itemize}
\section{Modelling}
\section{Modeling}
\begin{itemize}
\item Select modelling technique.
\marginnote{Modelling}
\item Select modeling technique.
\marginnote{Modeling}
\item Build/train the model.
\end{itemize}

View File

@ -18,7 +18,7 @@
Stored data can be classified as:
\begin{descriptionlist}
\item[Hot] \marginnote{Hot storage}
A low volume of highly requested data that require low latency.
A low volume of highly requested data that requires low latency.
More expensive HW/SW.
\item[Cold] \marginnote{Cold storage}
A large amount of data that does not have latency requirements.
@ -95,9 +95,8 @@
\section{Components}
\subsection{Data ingestion}
\marginnote{Data ingestion}
\begin{descriptionlist}
\item[Workload migration]
\item[Workload migration] \marginnote{Data ingestion}
Inserting all the data from an existing source.
\item[Incremental ingestion]
Inserting changes since the last ingestion.
@ -123,7 +122,7 @@
\begin{description}
\item[Columnar storage] \phantom{}
\begin{itemize}
\item Homogenous data are stores contiguously.
\item Homogeneous data are stored contiguously.
\item Speeds up methods that process entire columns (i.e. all the values of a feature).
\item Insertion becomes slower.
\end{itemize}
@ -134,9 +133,8 @@
\end{description}
\subsection{Processing and analytics}
\marginnote{Processing and analytics}
\begin{descriptionlist}
\item[Interactive analytics]
\item[Interactive analytics] \marginnote{Processing and analytics}
Interactive queries to large volumes of data.
The results are stored back in the data lake.
\item[Big data analytics]
@ -149,11 +147,13 @@
\section{Architectures}
\subsection{Lambda lake}
\marginnote{Lambda lake}
\begin{description}
\item[Batch layer] Receives and stores the data. Prepares the batch views for the serving layer.
\item[Serving layer] Indexes batch views for faster queries.
\item[Speed layer] Receives the data and prepares real-time views. The views are also stored in the serving layer.
\item[Batch layer] \marginnote{Lambda lake}
Receives and stores the data. Prepares the batch views for the serving layer.
\item[Serving layer]
Indexes batch views for faster queries.
\item[Speed layer]
Receives the data and prepares real-time views. The views are also stored in the serving layer.
\end{description}
\begin{figure}[ht]
\centering
@ -190,7 +190,7 @@ Framework that adds features on top of an existing data lake.
\section{Metadata}
\marginnote{Metadata}
Metadata are used to organize a data lake.
Metadata is used to organize a data lake.
Useful metadata are:
\begin{descriptionlist}
\item[Source] Origin of the data.

View File

@ -21,9 +21,9 @@ Useful for:
\section{Sampling}
\marginnote{Sampling}
Sampling can be used when the full dataset is too expensive to obtain or too expensive to process.
Obviously a sample has to be representative.
Obviously, a sample has to be representative.
Type of sampling techniques are:
The types of sampling techniques are:
\begin{descriptionlist}
\item[Simple random] \marginnote{Simple random}
Extraction of a single element following a given probability distribution.
@ -45,7 +45,7 @@ Type of sampling techniques are:
\begin{description}
\item[Sample size]
The sample size represents a tradeoff between data reduction and precision.
In a labeled dataset, it is important to consider the probability of sampling data of all the possible classes.
In a labeled dataset, it is important to consider the probability of sampling data from all the possible classes.
\end{description}
@ -121,7 +121,7 @@ Possible approaches are:
For each entry, if its feature $E$ has value $e_i$, then $H_{e_i} = \texttt{true}$ and the rest are \texttt{false}.
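A tiny sketch of the encoding described above (hypothetical helper):

def one_hot(value, domain):
    # One indicator per value of the domain; only the observed value is true.
    return {f"H_{d}": d == value for d in domain}

print(one_hot("red", ["red", "green", "blue"]))
# {'H_red': True, 'H_green': False, 'H_blue': False}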
\subsection{Ordinal encoding} \marginnote{Ordinal encoding}
A feature whose values have an ordering can be converted in a consecutive sequence of integers
A feature whose values have an ordering can be converted into a consecutive sequence of integers
(e.g. ["good", "neutral", "bad"] $\mapsto$ [1, 0, -1]).
\subsection{Discretization} \marginnote{Discretization}

View File

@ -7,13 +7,13 @@
Deliver the right information to the right people at the right time through the right channel.
\item[\Ac{dwh}] \marginnote{\Acl{dwh}}
Optimized repository that stores information for decision making processes.
Optimized repository that stores information for decision-making processes.
\Acp{dwh} are a specific type of \ac{dss}.
Features:
\begin{itemize}
\item Subject-oriented: focused on enterprise specific concepts.
\item Integrates data from different sources and provides an unified view.
\item Subject-oriented: focused on enterprise-specific concepts.
\item Integrates data from different sources and provides a unified view.
\item Non-volatile storage with change tracking.
\end{itemize}
@ -143,7 +143,7 @@ Operational data may contain:
\item[Missing data]
\item[Improper use of fields] (e.g. saving the phone number in the \texttt{notes} field)
\item[Wrong values] (e.g. 30th of February)
\item[Inconsistency] (e.g. use of different abbreviations)
\item[Inconsistencies] (e.g. use of different abbreviations)
\item[Typos]
\end{descriptionlist}
@ -154,10 +154,10 @@ Methods to clean and increase the quality of the data are:
Applicable if the domain is known and limited.
\item[Approximate merging] \marginnote{Approximate merging}
Merging data that do not have a common key.
Methods to merge data that do not have a common key.
\begin{description}
\item[Approximate join]
Use non-key attributes to join two tables (e.g. using the name and surname instead of an unique identifier).
Use non-key attributes to join two tables (e.g. using the name and surname instead of a unique identifier).
\item[Similarity approach]
Use similarity functions (e.g. edit distance) to merge multiple instances of the same information
@ -178,7 +178,7 @@ Data are transformed to respect the format of the data warehouse:
Creating new information by using existing attributes (e.g. compute profit from receipts and expenses)
\item[Separation and concatenation] \marginnote{Separation and concatenation}
Denormalization of the data: introduces redundances (i.e. breaks normal form\footnote{\url{https://en.wikipedia.org/wiki/Database_normalization}})
Denormalization of the data: introduces redundancies (i.e. breaks normal form\footnote{\url{https://en.wikipedia.org/wiki/Database_normalization}})
to speed up operations.
\end{descriptionlist}
@ -332,13 +332,14 @@ Aggregation operators can be classified as:
\end{description}
\subsection{Logical design}
\section{Logical design}
\marginnote{Logical design}
Defining the data structures (e.g. tables and relationships) according to a conceptual model.
There are mainly two strategies:
There are two main strategies:
\begin{descriptionlist}
\item[Star schema] \marginnote{Star schema}
A fact table that contains all the measures and linked to dimensional tables.
A fact table that contains all the measures is linked to dimensional tables.
\begin{figure}[ht]
\centering
\includegraphics[width=\textwidth]{img/logical_star_schema.png}
@ -346,8 +347,8 @@ There are mainly two strategies:
\end{figure}
\item[Snowflake schema] \marginnote{Snowflake schema}
A star schema variant with partially normalized dimension tables.
\begin{figure}[ht]
A star schema variant with partially normalized dimensional tables.
\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{img/logical_snowflake_schema.png}
\caption{Example of snowflake schema}

View File

@ -30,7 +30,7 @@
\subsection{Software}
\begin{description}
\item[\Ac{oltp}] \marginnote{\Acl{oltp}}
Class of programs to support transaction oriented applications and data storage.
Class of programs to support transaction-oriented applications and data storage.
Suitable for real-time applications.
\item[\Ac{erp}] \marginnote{\Acl{erp}}
@ -41,10 +41,10 @@
\subsection{Insight}
Decision can be classified as:
Decisions can be classified as:
\begin{descriptionlist}
\item[Structured] \marginnote{Structured decision}
Established and well understood situations.
Established and well-understood situations.
What is needed is known.
\item[Unstructured] \marginnote{Unstructured decision}
Unplanned and unclear situations.
@ -54,18 +54,18 @@ Decision can be classified as:
Different levels of insight can be extracted by:
\begin{descriptionlist}
\item[\Ac{mis}] \marginnote{\Acl{mis}}
Standardized reporting system built on existing \ac{oltp}.
Standardized reporting system built on an existing \ac{oltp}.
Used for structured decisions.
\item[\Ac{dss}] \marginnote{\Acl{dss}}
Analytical system to provide support for unstructured decisions.
\item[\Ac{eis}] \marginnote{\Acl{eis}}
Formulate high level decisions that impact the organization.
Formulate high-level decisions that impact the organization.
\item[\Ac{olap}] \marginnote{\Acl{olap}}
Grouped analysis of multidimensional data.
Involves large amount of data.
Involves a large amount of data.
\item[\Ac{bi}] \marginnote{\Acl{bi}}
Applications, infrastructure, tools and best practices to analyze information.
@ -75,7 +75,7 @@ Different levels of insight can be extracted by:
\begin{description}
\item[Big data] \marginnote{Big data}
Large and/or complex and/or fast changing collection of data that traditional DBMSs are unable to process.
Large and/or complex and/or fast-changing collection of data that traditional DBMSs are unable to process.
\begin{description}
\item[Structured] e.g. relational tables.
\item[Unstructured] e.g. videos.

View File

@ -11,7 +11,7 @@
\item[Regression] Estimation of a numeric value.
\item[Similarity matching] Identify similar individuals.
\item[Clustering] Grouping individuals based on their similarities.
\item[Co-occurrence groupping] Identify associations between entities based on the transactions in which they appear together.
\item[Co-occurrence grouping] Identify associations between entities based on the transactions in which they appear together.
\item[Profiling] Behavior description.
\item[Link analysis] Analysis of connections (e.g. in a graph).
\item[Data reduction] Reduce the dimensionality of data with minimal information loss.
@ -69,7 +69,7 @@
\textbf{Operators.} $=$, $\neq$, $<$, $>$, $\leq$, $\geq$, $+$, $-$
\begin{example}
Celsius and Fahrenheit temperature scales, CGPA, time.
Celsius and Fahrenheit temperature scales, CGPA, time, \dots.
For instance, there is a $6.25\%$ increase from $16\text{°C}$ to $17\text{°C}$, but
converted to Fahrenheit, the increase is $2.96\%$ (from $60.8\text{°F}$ to $62.6\text{°F}$).
@ -157,14 +157,14 @@
\item[Missing values] \marginnote{Missing values}
Data that have not been collected.
Sometimes they are not easily recognizable
(e.g. when special values are used, instead of \texttt{null}, to mark missing data).
(e.g. when special values are used to mark missing data instead of \texttt{null}).
Can be handled in different ways:
\begin{itemize}
\item Ignore the records with missing values.
\item Estimate or default missing values.
\item Ignore the fact that some values are missing (not always applicable).
\item Insert all the possible values and weight them by their probability.
\item Insert all the possible values and weigh them by their probability.
\end{itemize}
\item[Duplicated data] \marginnote{Duplicated data}

View File

@ -27,7 +27,7 @@
\begin{itemize}
\item MSE is influenced by the magnitude of the data.
\item It measures the fitness of a model in absolute terms.
\item It is suited to compare different models.
% \item It is suited to compare different models.
\end{itemize}
\item[Coefficient of determination] \marginnote{Coefficient of determination}