diff --git a/src/machine-learning-and-data-mining/sections/_association_rules.tex b/src/machine-learning-and-data-mining/sections/_association_rules.tex index 4e9aa8e..fbbb58e 100644 --- a/src/machine-learning-and-data-mining/sections/_association_rules.tex +++ b/src/machine-learning-and-data-mining/sections/_association_rules.tex @@ -80,7 +80,7 @@ \item \marginnote{Frequent itemset generation} Determine the itemsets with $\text{support} \geq \texttt{min\_sup}$ (frequent itemsets). \item \marginnote{Rule generation} - Determine the the association rules with $\text{confidence} \geq \texttt{min\_conf}$. + Determine the association rules with $\text{confidence} \geq \texttt{min\_conf}$. \end{enumerate} \end{description} @@ -250,7 +250,7 @@ Measures that take into account the statistical independence of the items. \hline High support & The rule applies to many transactions. \\ \hline - High confidence & The chance that the rule is true for some transaction is high. \\ + High confidence & The chance that the rule is true for some transactions is high. \\ \hline High lift & Low chance that the rule is just a coincidence. \\ \hline @@ -329,7 +329,7 @@ Measures that take into account the statistical independence of the items. \section{Multi-level association rules} -Organize items into an hierarchy. +Organize items into a hierarchy. \begin{description} \item[Specialized to general] \marginnote{Specialized to general} @@ -345,7 +345,7 @@ Organize items into an hierarchy. \end{example} \item[Redundant level] \marginnote{Redundant level} - A more specialized rule in the hierarchy is redundant if its confidence is similar to the one of the more general rule. + A more specialized rule in the hierarchy is redundant if its confidence is similar to the one of a more general rule. \item[Multi-level association rule mining] \marginnote{Multi-level association rule mining} Run association rule mining on different levels of abstraction (general to specialized). diff --git a/src/machine-learning-and-data-mining/sections/_classification.tex b/src/machine-learning-and-data-mining/sections/_classification.tex index 33310b2..d46bddf 100644 --- a/src/machine-learning-and-data-mining/sections/_classification.tex +++ b/src/machine-learning-and-data-mining/sections/_classification.tex @@ -65,9 +65,9 @@ A supervised dataset can be randomly split into: \begin{descriptionlist} \item[Train set] \marginnote{Train set} - Used to learn the model. Usually the largest split. Can be seen as an upper-bound of the model performance. + Used to learn the model. Usually the largest split. Can be seen as an upper bound of the model performance. \item[Test set] \marginnote{Test set} - Used to evaluate the trained model. Can be seen as a lower-bound of the model performance. + Used to evaluate the trained model. Can be seen as a lower bound of the model performance. \item[Validation set] \marginnote{Validation set} Used to evaluate the model during training and/or for tuning parameters. \end{descriptionlist} @@ -93,7 +93,7 @@ \subsection{Test set error} \textbf{\underline{Disclaimer: I'm very unsure about this part}}\\ -The error on the test set can be seen as a lower-bound error of the model. +The error on the test set can be seen as a lower bound error of the model. If the test set error ratio is $x$, we can expect an error of $(x \pm \text{confidence interval})$. Predicting the elements of the test set can be seen as a binomial process (i.e. a series of $N$ Bernoulli processes). 
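To make the support/confidence/lift table in the association-rules hunk above concrete, a small worked example follows; the transaction counts are illustrative and the definitions of support, confidence and lift are assumed to be the standard ones (they are stated earlier in the notes, outside this diff).

\begin{example}[Support, confidence and lift] \phantom{}\\
    Assume $1000$ transactions: $200$ contain \texttt{bread}, $250$ contain \texttt{butter} and $100$ contain both.
    For the rule $\texttt{bread} \rightarrow \texttt{butter}$:
    \[ \text{support} = \frac{100}{1000} = 0.1 \qquad
       \text{confidence} = \frac{100}{200} = 0.5 \qquad
       \text{lift} = \frac{0.5}{250/1000} = 2 \]
    A lift of $2 > 1$ indicates that the two items co-occur twice as often as expected under independence,
    so the rule is unlikely to be just a coincidence.
\end{example}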
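Returning to the binomial view of the test error, a quick numeric illustration with made-up counts, using the standard normal approximation for a binomial proportion (the Wilson interval in the next hunk refines this estimate):

\begin{example}[Test error as a binomial proportion] \phantom{}\\
    Suppose a classifier misclassifies $20$ out of $N = 200$ test examples, so $f = 0.1$.
    Treating each prediction as a Bernoulli trial, the standard error of $f$ is about
    \[ \sqrt{\frac{f(1-f)}{N}} = \sqrt{\frac{0.1 \cdot 0.9}{200}} \approx 0.021 \]
    so, with $z \approx 1.96$ (i.e. $\alpha = 0.05$), the expected error is roughly $0.1 \pm 0.04$.
    With $N = 2000$ and the same error ratio, the half-width shrinks to about $0.013$.
\end{example}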
@@ -114,7 +114,7 @@ be between the $\frac{\alpha}{2}$ and $(1-\frac{\alpha}{2})$ quantiles of the ga We can estimate $p$ using the Wilson score interval\footnote{\url{https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval}}: \[ p = \frac{1}{1+\frac{1}{N}z^2} \left( f + \frac{1}{2N}z^2 \pm z\sqrt{\frac{1}{N}f(1-f) + \frac{z^2}{4N^2}} \right) \] where $z$ depends on the value of $\alpha$. -For a pessimistic estimate, $\pm$ becomes a $+$. Vice versa, for a optimistic estimate, $\pm$ becomes a $-$. +For a pessimistic estimate, $\pm$ becomes a $+$. Vice versa, for an optimistic estimate, $\pm$ becomes a $-$. As $N$ is at the denominator, this means that for large values of $N$, the uncertainty becomes smaller. \begin{center} @@ -127,25 +127,25 @@ As $N$ is at the denominator, this means that for large values of $N$, the uncer \item[Holdout] \marginnote{Holdout} The dataset is split into train, test and, if needed, validation. - \item[Cross validation] \marginnote{Cross validation} + \item[Cross-validation] \marginnote{Cross-validation} The training data is partitioned into $k$ chunks. - For $k$ iterations, one of the chunks if used to test and the others to train a new model. + For $k$ iterations, one of the chunks is used to test and the others to train a new model. The overall error is obtained as the average of the errors of the $k$ iterations. - At the end, the final model is still trained on the entire training data, - while cross validation results are used as an evaluation and comparison metric. - Note that cross validation is done on the training set, so a final test set can still be used to - evaluate the final model. + In the end, the final model is still trained on the entire training data, + while cross-validation results are used as an evaluation and comparison metric. + Note that cross-validation is done on the training set, so a final test set can still be used to + evaluate the resulting model. \begin{figure}[h] \centering \includegraphics[width=0.6\textwidth]{img/cross_validation.png} - \caption{Cross validation example} + \caption{Cross-validation example} \end{figure} \item[Leave-one-out] \marginnote{Leave-one-out} - Extreme case of cross validation with $k=N$, the size of the training set. - In this case the whole dataset but one element is used for training and the remaining entry for testing. + Extreme case of cross-validation with $k=N$, the size of the training set. + In this case, the whole dataset but one element is used for training and the remaining entry for testing. \item[Bootstrap] \marginnote{Bootstrap} Statistical sampling of the dataset with replacement (i.e. an entry can be selected multiple times). @@ -192,7 +192,7 @@ Given a test set of $N$ element, possible metrics are: \item[Recall/Sensitivity] \marginnote{Recall} Number of true positives among the real positives - (i.e. how many real positive the model predicted). + (i.e. how many real positives the model predicted). \[ \text{recall} = \frac{TP}{TP + FN} \] \item[Specificity] \marginnote{Specificity} @@ -296,7 +296,7 @@ a macro (unweighted) average or a class-weighted average. \item[ROC curve] \marginnote{ROC curve} The ROC curve can be seen as a way to represent multiple confusion matrices of a classifier that uses different thresholds. - The x-axis of a ROC curve represent the false positive rate while the y-axis represent the true positive rate. + The x-axis of a ROC curve represents the false positive rate while the y-axis represents the true positive rate. 
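To make the ROC axes concrete, a worked confusion matrix with illustrative counts; only recall is spelled out in this hunk, so the false positive rate is taken with its usual definition $\text{FPR} = \frac{FP}{FP + TN}$.

\begin{example}[From a confusion matrix to a ROC point] \phantom{}\\
    Suppose a given threshold yields $TP = 80$, $FN = 20$, $FP = 30$, $TN = 70$.
    Then
    \[ \text{TPR} = \text{recall} = \frac{80}{80 + 20} = 0.8 \qquad
       \text{FPR} = \frac{30}{30 + 70} = 0.3 \]
    so this threshold corresponds to the point $(0.3, 0.8)$ of the ROC curve;
    varying the threshold moves the point and traces the whole curve.
\end{example}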
A straight line is used to represent a random classifier. A threshold can be considered good if it is high on the y-axis and low on the x-axis. @@ -314,7 +314,7 @@ A classifier may not perform well when predicting a minority class of the traini Possible solutions are: \begin{descriptionlist} \item[Undersampling] \marginnote{Undersampling} - Randomly reduce the number of example of the majority classes. + Randomly reduce the number of examples of the majority classes. \item[Oversampling] \marginnote{Oversampling} Increase the examples of the minority classes. @@ -324,7 +324,7 @@ Possible solutions are: \begin{enumerate} \item Randomly select an example $x$ belonging to the minority class. \item Select a random neighbor $z_i$ among its $k$-nearest neighbors $z_1, \dots, z_k$. - \item Synthetize a new example by selecting a random point of the feature space between $x$ and $z_i$. + \item Synthesize a new example by selecting a random point of the feature space between $x$ and $z_i$. \end{enumerate} \end{description} @@ -346,7 +346,7 @@ Possible solutions are: \begin{description} \item[Shannon theorem] \marginnote{Shannon theorem} Let $\matr{X} = \{ \vec{v}_1, \dots, \vec{v}_V \}$ be a data source where - each of the possible value has probability $p_i = \prob{\vec{v}_i}$. + each of the possible values has probability $p_i = \prob{\vec{v}_i}$. The best encoding allows to transmit $\matr{X}$ with an average number of bits given by the \textbf{entropy} of $X$: \marginnote{Entropy} \[ H(\matr{X}) = - \sum_j p_j \log_2(p_j) \] @@ -354,7 +354,7 @@ Possible solutions are: If $p_j \sim 1$, then the surprise of observing $\vec{v}_j$ is low, vice versa, if $p_j \sim 0$, the surprise of observing $\vec{v}_j$ is high. - Therefore, when $H(\matr{X})$ is high, $\matr{X}$ is close to an uniform distribution. + Therefore, when $H(\matr{X})$ is high, $\matr{X}$ is close to a uniform distribution. When $H(\matr{X})$ is low, $\matr{X}$ is close to a constant. \begin{example}[Binary source] \phantom{}\\ @@ -382,7 +382,7 @@ Possible solutions are: It is computed as: \[ IG(c \,\vert\, d \,:\, t) = H(c) - H(c \,\vert\, d \,:\, t) \] When $H(c \,\vert\, d \,:\, t)$ is low, $IG(c \,\vert\, d \,:\, t)$ is high - as splitting with threshold $t$ result in purer groups. + as splitting with threshold $t$ results in purer groups. Vice versa, when $H(c \,\vert\, d \,:\, t)$ is high, $IG(c \,\vert\, d \,:\, t)$ is low as splitting with threshold $t$ is not very useful. @@ -520,7 +520,7 @@ each node requires to process all the attributes. Assuming an average height of $O(\log N)$, the overall complexity for induction (parameters search) is $O(DN \log N)$. -Moreover, The other operations of a binary tree have complexity: +Moreover, the other operations of a binary tree have complexity: \begin{itemize} \item Threshold search and binary split: $O(N \log N)$ (scan the dataset for the threshold). \item Pruning: $O(N \log N)$ (requires to scan the dataset). @@ -584,8 +584,8 @@ This has complexity $O(h)$, with $h$ the height of the tree. \item[Smooting] If the value $e_{ij}$ of the domain of a feature $E_i$ never appears in the dataset, its probability $\prob{e_{ij} \mid c}$ will be 0 for all classes. - This nullifies all the probabilities that uses this feature when - computing the products chain during inference. + This nullifies all the probabilities that use this feature when + computing the product chain during inference. Smoothing methods can be used to avoid this problem. 
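A tiny numeric illustration of the zero-frequency problem just described (the probabilities are hypothetical):

\begin{example}[Zero-frequency problem] \phantom{}\\
    Suppose that, for a class $c$, three features have conditional probabilities
    $\prob{e_{1} \mid c} = 0.9$, $\prob{e_{2} \mid c} = 0.8$ and $\prob{e_{3} \mid c} = 0$
    (the third value never appears together with $c$ in the training set).
    The product computed at inference time is
    \[ \prob{c} \cdot 0.9 \cdot 0.8 \cdot 0 = 0 \]
    so $c$ can never be predicted for such an entry, no matter how strongly the other features support it.
\end{example}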
\begin{description} @@ -597,14 +597,14 @@ This has complexity $O(h)$, with $h$ the height of the tree. \item[$\vert \mathbb{D}_{E_i} \vert$] The number of distinct values in the domain of $E_i$. \item[\normalfont$\text{af}_{c}$] The absolute frequency of the class $c$. \end{descriptionlist} - the smoothed frequency is computed as: + The smoothed frequency is computed as: \[ \prob{e_{ij} \mid c} = \frac{\text{af}_{e_{ij}, c} + \alpha}{\text{af}_{c} + \alpha \vert \mathbb{D}_{E_i} \vert} \] A common value of $\alpha$ is 1. When $\alpha = 0$, there is no smoothing. - For higher values of $\alpha$, the smoothed feature gain more importance when computing the priors. + For higher values of $\alpha$, the smoothed feature gains more importance when computing the priors. \end{description} \item[Missing values] \marginnote{Missing values} @@ -704,7 +704,7 @@ In practice, a maximum number of iterations is set. \end{split} \] where $M$ is the margin, $w_i$ are the weights of the hyperplane and $c_i = \{-1, 1 \}$ is the class. - The second constraint imposes the hyperplane to have a large margine. + The second constraint imposes the hyperplane to have a large margin. For positive labels ($c_i=1$), this is true when the hyperplane is positive. For negative labels ($c_i=-1$), this is true when the hyperplane is negative. @@ -736,7 +736,7 @@ Then, the data and the boundary is mapped back into the original space. \caption{Example of mapping from $\mathbb{R}^2$ to $\mathbb{R}^3$} \end{figure} -The kernel trick allows to avoid to explicitly map the dataset into the new space by using kernel functions. +The kernel trick allows to avoid explicitly mapping the dataset into the new space by using kernel functions. Known kernel functions are: \begin{descriptionlist} \item[Linear] $K(x, y) = \langle x, y \rangle$. @@ -755,7 +755,7 @@ depending on the effectiveness of data caching. \begin{itemize} \item Training an SVM model is generally slower. \item SVM is not affected by local minimums. - \item SVM do not suffer the curse of dimensionality. + \item SVM does not suffer the curse of dimensionality. \item SVM does not directly provide probability estimates. If needed, these can be computed using a computationally expensive method. \end{itemize} @@ -771,10 +771,12 @@ depending on the effectiveness of data caching. \item[Activation function] \marginnote{Activation function} Activation functions are useful to add non-linearity. - In a linear system, if there is noise in the input, it is transferred to the output - (i.e. linearity implies that $f(x + \text{noise}) = f(x) + f(\text{noise})$). - On the other hand, a non-linear system is generally more robust - (i.e. non-linearity generally implies that $f(x + \text{noise}) \neq f(x) + f(\text{noise})$) + \begin{remark} + In a linear system, if there is noise in the input, it is transferred to the output + (i.e. linearity implies that $f(x + \text{noise}) = f(x) + f(\text{noise})$). + On the other hand, a non-linear system is generally more robust + (i.e. non-linearity generally implies that $f(x + \text{noise}) \neq f(x) + f(\text{noise})$) + \end{remark} \item[Feedforward neural network] \marginnote{Feedforward neural network} Network with the following flow: @@ -791,7 +793,7 @@ Inputs are fed to the network and backpropagation is used to update the weights. Size of the step for gradient descent. \item[Epoch] \marginnote{Epoch} - A round of training where the entire dataset has been processed. + A round of training where the entire dataset is processed. 
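To show the role of the learning rate concretely, the usual gradient-descent update is written out below; the update rule is the standard formulation (it is not spelled out in this hunk) and the numbers are illustrative.

\begin{example}[Learning rate as step size] \phantom{}\\
    Backpropagation updates each weight with the gradient-descent step
    \[ w \leftarrow w - \eta \frac{\partial E}{\partial w} \]
    where $E$ is the training error and $\eta$ the learning rate.
    If $\frac{\partial E}{\partial w} = 2$, the weight moves by $-0.2$ with $\eta = 0.1$ and by $-0.02$ with $\eta = 0.01$:
    a larger learning rate takes bigger (faster but possibly unstable) steps, a smaller one takes smaller (slower but more precise) steps.
\end{example}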
\item[Stopping criteria] \marginnote{Stopping criteria} Possible conditions to stop the training are: @@ -874,7 +876,7 @@ Different strategies to train an ensemble classifier can be used: \subsection{Random forests} \marginnote{Random forests} -Different decision trees trained on a different random sampling of the training set and different subset of features. +Multiple decision trees trained on a different random sampling of the training set and different subsets of features. A prediction is made by averaging the output of each tree. \begin{description} diff --git a/src/machine-learning-and-data-mining/sections/_clustering.tex b/src/machine-learning-and-data-mining/sections/_clustering.tex index 7474ec0..99178dd 100644 --- a/src/machine-learning-and-data-mining/sections/_clustering.tex +++ b/src/machine-learning-and-data-mining/sections/_clustering.tex @@ -10,7 +10,7 @@ \item[Dissimilarity] \marginnote{Dissimilarity} Measures how two objects differ. - 0 indicates no difference while the upper-bound varies. + 0 indicates no difference while the upper bound varies. \end{description} \begin{table}[ht] @@ -119,7 +119,7 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar \begin{description} \item[Pearson's correlation] \marginnote{Pearson's correlation} Measure of linear relationship between a pair of quantitative attributes $e_1$ and $e_2$. - To compute the Pearson's correlation, the values of $e_1$ and $e_2$ are first standardized and then ordered to obtain the vectors $\vec{e}_1$ and $\vec{e}_2$. + To compute Pearson's correlation, the values of $e_1$ and $e_2$ are first standardized and then ordered to obtain the vectors $\vec{e}_1$ and $\vec{e}_2$. The correlation is then computed as the dot product between $\vec{e}_1$ and $\vec{e}_2$: \[ \texttt{corr}(e_1, e_2) = \langle \vec{e}_1, \vec{e}_2 \rangle \] @@ -202,10 +202,10 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar Given the global centroid of the dataset $\vec{c}$ and $K$ clusters each with $N_i$ objects, the sum of squares between clusters is given by: - \[ \texttt{SSB} = \sum_{i=1}^{K} N_i \texttt{dist}(\vec{c}_i, \vec{c})^2 \] + \[ \texttt{SSB} = \sum_{i=1}^{K} N_i \cdot \texttt{dist}(\vec{c}_i, \vec{c})^2 \] \item[Total sum of squares] \marginnote{Total sum of squares} - Sum of the squared distances between the point of the dataset and the global centroid. + Sum of the squared distances between the points of the dataset and the global centroid. It can be shown that the total sum of squares can be computed as: \[ \texttt{TSS} = \texttt{SSE} + \texttt{SSB} \] @@ -217,7 +217,7 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar The Silhouette score of a data point $\vec{x}_i$ belonging to a cluster $K_i$ is given by two components: \begin{description} \item[Sparsity contribution] - The average distance of $\vec{x}_i$ to all other points in $K_i$: + The average distance of $\vec{x}_i$ to the other points in $K_i$: \[ a(\vec{x}_i) = \frac{1}{\vert K_i \vert - 1} \sum_{\vec{x}_j \in K_i, \vec{x}_j \neq \vec{x}_i} \texttt{dist}(\vec{x}_i, \vec{x}_j) \] \item[Separation contribution] @@ -278,7 +278,7 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar \item an encoding function $\texttt{encode}: \mathbb{R}^D \rightarrow [1, K]$; \item a decoding function $\texttt{decode}: [1, K] \rightarrow \mathbb{R}^D$. 
\end{itemize} - Distortion (or inertia) is defines as: + Distortion (or inertia) is defined as: \[ \texttt{distortion} = \sum_{i=1}^{N} \big(\vec{x}_i - \texttt{decode}(\texttt{encode}(\vec{x_i})) \big)^2 \] \begin{theorem} @@ -288,7 +288,7 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar \item The center of a point is the centroid of the cluster it belongs to. \end{enumerate} - Note that k-means alternates point 1 and 2. + Note that k-means alternates points 1 and 2. \begin{proof} The second point is derived by imposing the derivative of \texttt{distortion} to 0. @@ -311,7 +311,7 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar \begin{description} \item[Termination] There are a finite number of ways to cluster $N$ objects into $K$ clusters. - By construction, at each iteration the \texttt{distortion} is reduced. + By construction, at each iteration, the \texttt{distortion} is reduced. Therefore, k-means is guaranteed to terminate. \item[Non-optimality] @@ -320,11 +320,11 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar The starting configuration is usually composed of points distant as far as possible. \item[Noise] - Outliers heavily influences the clustering result. Sometimes, it is useful to remove them. + Outliers heavily influence the clustering result. Sometimes, it is useful to remove them. \item[Complexity] Given a $D$-dimensional dataset of $N$ points, - Running k-means for $T$ iterations to find $K$ clusters has complexity $O(TKND)$. + running k-means for $T$ iterations to find $K$ clusters has complexity $O(TKND)$. \end{description} \end{description} @@ -333,9 +333,9 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar \section{Hierarchical clustering} \begin{description} - \item[Dendogram] \marginnote{Dendogram} - Tree-like structure where the root is a cluster of all data points and - the leaves are clusters with a single data points. + \item[Dendrogram] \marginnote{Dendrogram} + Tree-like structure where the root is a cluster of all the data points and + the leaves are clusters with a single data point. \item[Agglomerative] \marginnote{Agglomerative} Starts with a cluster per data point and iteratively merges them (leaves to root). @@ -380,12 +380,12 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar \begin{enumerate} \item Initialize a cluster for each data point. \item Compute the distance matrix between each cluster. - \item Merge the two clusters with lowest separation, - drop their values from the distance matrix and add an row/column for the newly created cluster. + \item Merge the two clusters with the lowest separation, + drop their values from the distance matrix and add a row/column for the newly created cluster. \item Go to point 2. if the number of clusters is greater than one. \end{enumerate} - After the construction of the dendogram, a cut \marginnote{Cut} can be performed at a user define level. + After the construction of the dendrogram, a cut \marginnote{Cut} can be performed at a user-defined level. A cut near the root will result in few bigger clusters. A cut near the leaves will result in numerous smaller clusters. @@ -441,9 +441,9 @@ Consider as clusters the high-density areas of the data space. \item $\vec{q}$ is a core point. \item There exists a sequence of points $\vec{s}_1, \dots, \vec{s}_z$ such that: \begin{itemize} - \item $\vec{s}_1$ is directly density reachable from $\vec{p}$. 
+ \item $\vec{s}_1$ is directly density reachable from $\vec{q}$. \item $\vec{s}_{i+1}$ is directly density reachable from $\vec{s}_i$. - \item $\vec{q}$ is directly density reachable from $\vec{s}_z$. + \item $\vec{p}$ is directly density reachable from $\vec{s}_z$. \end{itemize} \end{itemize} @@ -455,7 +455,7 @@ Consider as clusters the high-density areas of the data space. Determine clusters as maximal sets of density connected points. Border points not density connected to any core point are labeled as noise. - In other words, what happens it the following: + In other words, what happens is the following: \begin{itemize} \item Neighboring core points are part of the same cluster. \item Border points are part of the cluster of their nearest core point neighbor. @@ -480,7 +480,7 @@ Consider as clusters the high-density areas of the data space. \end{description} \item[Complexity] - Complexity of $O(N^2)$ reduced to $O(N \log N)$ if using spatial indexing. + Complexity of $O(N^2)$, reduced to $O(N \log N)$ if using spatial indexing. \end{description} \end{description} @@ -493,7 +493,7 @@ Consider as clusters the high-density areas of the data space. \begin{description} \item[Kernel function] \marginnote{Kernel function} - Symmetric and monotonically decreasing function to describe the influence of a data point to its neighbors. + Symmetric and monotonically decreasing function to describe the influence of a data point on its neighbors. A typical kernel function is the Gaussian. @@ -514,7 +514,7 @@ Consider as clusters the high-density areas of the data space. \item Derive a density function of the dataset. \item Identify local maximums and consider them as density attractors. \item Associate to each data point the density attractor in the direction of maximum increase. - \item Points associated to the same density attractor are part of the same cluster. + \item Points associated with the same density attractor are part of the same cluster. \item Remove clusters with a density attractor lower than $\xi$. \item Merge clusters connected through a path of points whose density is greater or equal to $\xi$ (e.g. in \Cref{img:denclue} the center area will result in many small clusters that can be merged with an appropriate $\xi$). diff --git a/src/machine-learning-and-data-mining/sections/_crisp.tex b/src/machine-learning-and-data-mining/sections/_crisp.tex index aa8d5d6..973741a 100644 --- a/src/machine-learning-and-data-mining/sections/_crisp.tex +++ b/src/machine-learning-and-data-mining/sections/_crisp.tex @@ -34,10 +34,10 @@ \item Data transformations. \end{itemize} -\section{Modelling} +\section{Modeling} \begin{itemize} - \item Select modelling technique. - \marginnote{Modelling} + \item Select modeling technique. + \marginnote{Modeling} \item Build/train the model. \end{itemize} diff --git a/src/machine-learning-and-data-mining/sections/_data_lake.tex b/src/machine-learning-and-data-mining/sections/_data_lake.tex index 4309164..c7c9cdc 100644 --- a/src/machine-learning-and-data-mining/sections/_data_lake.tex +++ b/src/machine-learning-and-data-mining/sections/_data_lake.tex @@ -18,7 +18,7 @@ Stored data can be classified as: \begin{descriptionlist} \item[Hot] \marginnote{Hot storage} - A low volume of highly requested data that require low latency. + A low volume of highly requested data that requires low latency. More expensive HW/SW. \item[Cold] \marginnote{Cold storage} A large amount of data that does not have latency requirements. 
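Stepping back to the cluster-validity measures in the clustering hunks above, a 1-D numeric check of the identity $\texttt{TSS} = \texttt{SSE} + \texttt{SSB}$; the points and the clustering are illustrative, and $\texttt{SSE}$ is taken with its usual definition as the within-cluster sum of squared distances to the centroids (its formula lies outside this diff).

\begin{example}[Checking $\texttt{TSS} = \texttt{SSE} + \texttt{SSB}$] \phantom{}\\
    Take the points $\{0, 2, 10, 12\}$ clustered as $\{0, 2\}$ and $\{10, 12\}$,
    with centroids $1$ and $11$ and global centroid $6$:
    \[ \texttt{SSE} = 1 + 1 + 1 + 1 = 4 \qquad
       \texttt{SSB} = 2 \cdot (1 - 6)^2 + 2 \cdot (11 - 6)^2 = 100 \]
    \[ \texttt{TSS} = 36 + 16 + 16 + 36 = 104 = \texttt{SSE} + \texttt{SSB} \]
\end{example}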
@@ -95,9 +95,8 @@ \section{Components} \subsection{Data ingestion} -\marginnote{Data ingestion} \begin{descriptionlist} - \item[Workload migration] + \item[Workload migration] \marginnote{Data ingestion} Inserting all the data from an existing source. \item[Incremental ingestion] Inserting changes since the last ingestion. @@ -123,7 +122,7 @@ \begin{description} \item[Columnar storage] \phantom{} \begin{itemize} - \item Homogenous data are stores contiguously. + \item Homogenous data are stored contiguously. \item Speeds up methods that process entire columns (i.e. all the values of a feature). \item Insertion becomes slower. \end{itemize} @@ -134,9 +133,8 @@ \end{description} \subsection{Processing and analytics} -\marginnote{Processing and analytics} \begin{descriptionlist} - \item[Interactive analytics] + \item[Interactive analytics] \marginnote{Processing and analytics} Interactive queries to large volumes of data. The results are stored back in the data lake. \item[Big data analytics] @@ -149,11 +147,13 @@ \section{Architectures} \subsection{Lambda lake} -\marginnote{Lambda lake} \begin{description} - \item[Batch layer] Receives and stores the data. Prepares the batch views for the serving layer. - \item[Serving layer] Indexes batch views for faster queries. - \item[Speed layer] Receives the data and prepares real-time views. The views are also stored in the serving layer. + \item[Batch layer] \marginnote{Lambda lake} + Receives and stores the data. Prepares the batch views for the serving layer. + \item[Serving layer] + Indexes batch views for faster queries. + \item[Speed layer] + Receives the data and prepares real-time views. The views are also stored in the serving layer. \end{description} \begin{figure}[ht] \centering @@ -190,7 +190,7 @@ Framework that adds features on top of an existing data lake. \section{Metadata} \marginnote{Metadata} -Metadata are used to organize a data lake. +Metadata is used to organize a data lake. Useful metadata are: \begin{descriptionlist} \item[Source] Origin of the data. diff --git a/src/machine-learning-and-data-mining/sections/_data_prepro.tex b/src/machine-learning-and-data-mining/sections/_data_prepro.tex index 95a1d88..f75a91e 100644 --- a/src/machine-learning-and-data-mining/sections/_data_prepro.tex +++ b/src/machine-learning-and-data-mining/sections/_data_prepro.tex @@ -21,9 +21,9 @@ Useful for: \section{Sampling} \marginnote{Sampling} Sampling can be used when the full dataset is too expensive to obtain or too expensive to process. -Obviously a sample has to be representative. +Obviously, a sample has to be representative. -Type of sampling techniques are: +The types of sampling techniques are: \begin{descriptionlist} \item[Simple random] \marginnote{Simple random} Extraction of a single element following a given probability distribution. @@ -45,7 +45,7 @@ Type of sampling techniques are: \begin{description} \item[Sample size] The sampling size represents a tradeoff between data reduction and precision. - In a labeled dataset, it is important to consider the probability of sampling data of all the possible classes. + In a labeled dataset, it is important to consider the probability of sampling data from all the possible classes. \end{description} @@ -121,7 +121,7 @@ Possible approaches are: For each entry, if its feature $E$ has value $e_i$, then $H_{e_i} = \texttt{true}$ and the rests are \texttt{false}. 
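A concrete illustration of the one-hot encoding just described; the feature and its domain are made up.

\begin{example}[One-hot encoding] \phantom{}\\
    A feature $\texttt{color}$ with domain $\{\texttt{red}, \texttt{green}, \texttt{blue}\}$ is replaced by
    three binary features $H_{\texttt{red}}$, $H_{\texttt{green}}$, $H_{\texttt{blue}}$.
    An entry with $\texttt{color} = \texttt{green}$ is encoded as
    $(H_{\texttt{red}}, H_{\texttt{green}}, H_{\texttt{blue}}) = (\texttt{false}, \texttt{true}, \texttt{false})$.
\end{example}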
\subsection{Ordinal encoding} \marginnote{Ordinal encoding} - A feature whose values have an ordering can be converted in a consecutive sequence of integers + A feature whose values have an ordering can be converted into a consecutive sequence of integers (e.g. ["good", "neutral", "bad"] $\mapsto$ [1, 0, -1]). \subsection{Discretization} \marginnote{Discretization} diff --git a/src/machine-learning-and-data-mining/sections/_data_warehouse.tex b/src/machine-learning-and-data-mining/sections/_data_warehouse.tex index 77a7379..8c80ebf 100644 --- a/src/machine-learning-and-data-mining/sections/_data_warehouse.tex +++ b/src/machine-learning-and-data-mining/sections/_data_warehouse.tex @@ -7,13 +7,13 @@ Deliver the right information to the right people at the right time through the right channel. \item[\Ac{dwh}] \marginnote{\Acl{dwh}} - Optimized repository that stores information for decision making processes. + Optimized repository that stores information for decision-making processes. \Acp{dwh} are a specific type of \ac{dss}. Features: \begin{itemize} - \item Subject-oriented: focused on enterprise specific concepts. - \item Integrates data from different sources and provides an unified view. + \item Subject-oriented: focused on enterprise-specific concepts. + \item Integrates data from different sources and provides a unified view. \item Non-volatile storage with change tracking. \end{itemize} @@ -143,7 +143,7 @@ Operational data may contain: \item[Missing data] \item[Improper use of fields] (e.g. saving the phone number in the \texttt{notes} field) \item[Wrong values] (e.g. 30th of February) - \item[Inconsistency] (e.g. use of different abbreviations) + \item[Inconsistencies] (e.g. use of different abbreviations) \item[Typos] \end{descriptionlist} @@ -154,10 +154,10 @@ Methods to clean and increase the quality of the data are: Applicable if the domain is known and limited. \item[Approximate merging] \marginnote{Approximate merging} - Merging data that do not have a common key. + Methods to merge data that do not have a common key. \begin{description} \item[Approximate join] - Use non-key attributes to join two tables (e.g. using the name and surname instead of an unique identifier). + Use non-key attributes to join two tables (e.g. using the name and surname instead of a unique identifier). \item[Similarity approach] Use similarity functions (e.g. edit distance) to merge multiple instances of the same information @@ -178,7 +178,7 @@ Data are transformed to respect the format of the data warehouse: Creating new information by using existing attributes (e.g. compute profit from receipts and expenses) \item[Separation and concatenation] \marginnote{Separation and concatenation} - Denormalization of the data: introduces redundances (i.e. breaks normal form\footnote{\url{https://en.wikipedia.org/wiki/Database_normalization}}) + Denormalization of the data: introduces redundancies (i.e. breaks normal form\footnote{\url{https://en.wikipedia.org/wiki/Database_normalization}}) to speed up operations. \end{descriptionlist} @@ -332,13 +332,14 @@ Aggregation operators can be classified as: \end{description} -\subsection{Logical design} + +\section{Logical design} \marginnote{Logical design} Defining the data structures (e.g. tables and relationships) according to a conceptual model. -There are mainly two strategies: +There are two main strategies: \begin{descriptionlist} \item[Star schema] \marginnote{Star schema} - A fact table that contains all the measures and linked to dimensional tables. 
+ A fact table that contains all the measures is linked to dimensional tables. \begin{figure}[ht] \centering \includegraphics[width=\textwidth]{img/logical_star_schema.png} @@ -346,8 +347,8 @@ There are mainly two strategies: \end{figure} \item[Snowflake schema] \marginnote{Snowflake schema} - A star schema variant with partially normalized dimension tables. - \begin{figure}[ht] + A star schema variant with partially normalized dimensional tables. + \begin{figure}[H] \centering \includegraphics[width=\textwidth]{img/logical_snowflake_schema.png} \caption{Example of snowflake schema} diff --git a/src/machine-learning-and-data-mining/sections/_intro.tex b/src/machine-learning-and-data-mining/sections/_intro.tex index 2c372d5..5847519 100644 --- a/src/machine-learning-and-data-mining/sections/_intro.tex +++ b/src/machine-learning-and-data-mining/sections/_intro.tex @@ -30,7 +30,7 @@ \subsection{Software} \begin{description} \item[\Ac{oltp}] \marginnote{\Acl{oltp}} - Class of programs to support transaction oriented applications and data storage. + Class of programs to support transaction-oriented applications and data storage. Suitable for real-time applications. \item[\Ac{erp}] \marginnote{\Acl{erp}} @@ -41,10 +41,10 @@ \subsection{Insight} -Decision can be classified as: +Decisions can be classified as: \begin{descriptionlist} \item[Structured] \marginnote{Structured decision} - Established and well understood situations. + Established and well-understood situations. What is needed is known. \item[Unstructured] \marginnote{Unstructured decision} Unplanned and unclear situations. @@ -54,18 +54,18 @@ Decision can be classified as: Different levels of insight can be extracted by: \begin{descriptionlist} \item[\Ac{mis}] \marginnote{\Acl{mis}} - Standardized reporting system built on existing \ac{oltp}. + Standardized reporting system built on an existing \ac{oltp}. Used for structured decisions. \item[\Ac{dss}] \marginnote{\Acl{dss}} Analytical system to provide support for unstructured decisions. \item[\Ac{eis}] \marginnote{\Acl{eis}} - Formulate high level decisions that impact the organization. + Formulate high-level decisions that impact the organization. \item[\Ac{olap}] \marginnote{\Acl{olap}} Grouped analysis of multidimensional data. - Involves large amount of data. + Involves a large amount of data. \item[\Ac{bi}] \marginnote{\Acl{bi}} Applications, infrastructure, tools and best practices to analyze information. @@ -75,7 +75,7 @@ Different levels of insight can be extracted by: \begin{description} \item[Big data] \marginnote{Big data} - Large and/or complex and/or fast changing collection of data that traditional DBMSs are unable to process. + Large and/or complex and/or fast-changing collection of data that traditional DBMSs are unable to process. \begin{description} \item[Structured] e.g. relational tables. \item[Unstructured] e.g. videos. diff --git a/src/machine-learning-and-data-mining/sections/_machine_learning.tex b/src/machine-learning-and-data-mining/sections/_machine_learning.tex index 245e0bc..27cd7cb 100644 --- a/src/machine-learning-and-data-mining/sections/_machine_learning.tex +++ b/src/machine-learning-and-data-mining/sections/_machine_learning.tex @@ -11,7 +11,7 @@ \item[Regression] Estimation of a numeric value. \item[Similarity matching] Identify similar individuals. \item[Clustering] Grouping individuals based on their similarities. - \item[Co-occurrence groupping] Identify associations between entities based on the transactions in which they appear together. 
+ \item[Co-occurrence grouping] Identify associations between entities based on the transactions in which they appear together. \item[Profiling] Behavior description. \item[Link analysis] Analysis of connections (e.g. in a graph). \item[Data reduction] Reduce the dimensionality of data with minimal information loss. @@ -69,7 +69,7 @@ \textbf{Operators.} $=$, $\neq$, $<$, $>$, $\leq$, $\geq$, $+$, $-$ \begin{example} - Celsius and Fahrenheit temperature scales, CGPA, time. + Celsius and Fahrenheit temperature scales, CGPA, time, \dots. For instance, there is a $6.25\%$ increase from $16\text{°C}$ to $17\text{°C}$, but converted in Fahrenheit, the increase is of $2.96\%$ (from $60.8\text{°F}$ to $62.6\text{°F}$). @@ -157,14 +157,14 @@ \item[Missing values] \marginnote{Missing values} Data that have not been collected. Sometimes they are not easily recognizable - (e.g. when special values are used, instead of \texttt{null}, to mark missing data). + (e.g. when special values are used to mark missing data instead of \texttt{null}). Can be handled in different ways: \begin{itemize} \item Ignore the records with missing values. \item Estimate or default missing values. \item Ignore the fact that some values are missing (not always applicable). - \item Insert all the possible values and weight them by their probability. + \item Insert all the possible values and weigh them by their probability. \end{itemize} \item[Duplicated data] \marginnote{Duplicated data} diff --git a/src/machine-learning-and-data-mining/sections/_regression.tex b/src/machine-learning-and-data-mining/sections/_regression.tex index 1b65d57..23d42ec 100644 --- a/src/machine-learning-and-data-mining/sections/_regression.tex +++ b/src/machine-learning-and-data-mining/sections/_regression.tex @@ -27,7 +27,7 @@ \begin{itemize} \item MSE is influenced by the magnitude of the data. \item It measures the fitness of a model in absolute terms. - \item It is suited to compare different models. + % \item It is suited to compare different models. \end{itemize} \item[Coefficient of determination] \marginnote{Coefficient of determination}
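To make the remark on the magnitude-dependence of MSE concrete, a small worked example; the values are illustrative and MSE is taken with its standard definition $\text{MSE} = \frac{1}{N} \sum_i (y_i - \hat{y}_i)^2$ (the formula itself lies outside this diff).

\begin{example}[MSE and the magnitude of the data] \phantom{}\\
    With targets $(1, 2, 3)$ and predictions $(1.5, 2.5, 2.5)$, every squared error is $0.25$, so $\text{MSE} = 0.25$.
    Scaling both targets and predictions by $10$ gives targets $(10, 20, 30)$ and predictions $(15, 25, 25)$:
    the fit is relatively unchanged, but every squared error becomes $25$ and $\text{MSE} = 25$, i.e. $100$ times larger.
    This is why MSE measures the fitness of a model only in absolute terms.
\end{example}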