Fix typos <noupdate>

2024-01-10 11:24:05 +01:00
parent 73fe58ed0b
commit e48a993ccc
10 changed files with 105 additions and 102 deletions

View File

@ -80,7 +80,7 @@
\item \marginnote{Frequent itemset generation}
Determine the itemsets with $\text{support} \geq \texttt{min\_sup}$ (frequent itemsets).
\item \marginnote{Rule generation}
Determine the the association rules with $\text{confidence} \geq \texttt{min\_conf}$.
Determine the association rules with $\text{confidence} \geq \texttt{min\_conf}$.
\end{enumerate}
\end{description}
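A minimal Python illustration of the support and confidence used in the two steps above (toy transactions and helper names are mine, not from the notes):

transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk"}]

def support(itemset):
    # Fraction of transactions that contain the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # Support of the whole rule divided by the support of its antecedent.
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "milk"}))       # 0.5  -> frequent if min_sup <= 0.5
print(confidence({"bread"}, {"milk"}))  # 0.5 / 0.75 ≈ 0.67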
@ -250,7 +250,7 @@ Measures that take into account the statistical independence of the items.
\hline
High support & The rule applies to many transactions. \\
\hline
High confidence & The chance that the rule is true for some transaction is high. \\
High confidence & The chance that the rule is true for some transactions is high. \\
\hline
High lift & Low chance that the rule is just a coincidence. \\
\hline
@ -329,7 +329,7 @@ Measures that take into account the statistical independence of the items.
\section{Multi-level association rules}
Organize items into an hierarchy.
Organize items into a hierarchy.
\begin{description}
\item[Specialized to general] \marginnote{Specialized to general}
@ -345,7 +345,7 @@ Organize items into an hierarchy.
\end{example}
\item[Redundant level] \marginnote{Redundant level}
A more specialized rule in the hierarchy is redundant if its confidence is similar to the one of the more general rule.
A more specialized rule in the hierarchy is redundant if its confidence is similar to the one of a more general rule.
\item[Multi-level association rule mining] \marginnote{Multi-level association rule mining}
Run association rule mining on different levels of abstraction (general to specialized).

View File

@ -65,9 +65,9 @@
A supervised dataset can be randomly split into:
\begin{descriptionlist}
\item[Train set] \marginnote{Train set}
Used to learn the model. Usually the largest split. Can be seen as an upper-bound of the model performance.
Used to learn the model. Usually the largest split. Can be seen as an upper bound of the model performance.
\item[Test set] \marginnote{Test set}
Used to evaluate the trained model. Can be seen as a lower-bound of the model performance.
Used to evaluate the trained model. Can be seen as a lower bound of the model performance.
\item[Validation set] \marginnote{Validation set}
Used to evaluate the model during training and/or for tuning parameters.
\end{descriptionlist}
@ -93,7 +93,7 @@
\subsection{Test set error}
\textbf{\underline{Disclaimer: I'm very unsure about this part}}\\
The error on the test set can be seen as a lower-bound error of the model.
The error on the test set can be seen as a lower bound error of the model.
If the test set error ratio is $x$, we can expect an error of $(x \pm \text{confidence interval})$.
Predicting the elements of the test set can be seen as a binomial process (i.e. a series of $N$ Bernoulli trials).
@ -114,7 +114,7 @@ be between the $\frac{\alpha}{2}$ and $(1-\frac{\alpha}{2})$ quantiles of the ga
We can estimate $p$ using the Wilson score interval\footnote{\url{https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval}}:
\[ p = \frac{1}{1+\frac{1}{N}z^2} \left( f + \frac{1}{2N}z^2 \pm z\sqrt{\frac{1}{N}f(1-f) + \frac{z^2}{4N^2}} \right) \]
where $z$ depends on the value of $\alpha$.
For a pessimistic estimate, $\pm$ becomes a $+$. Vice versa, for a optimistic estimate, $\pm$ becomes a $-$.
For a pessimistic estimate, $\pm$ becomes a $+$. Vice versa, for an optimistic estimate, $\pm$ becomes a $-$.
As $N$ appears in the denominator, the uncertainty becomes smaller for large values of $N$.
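A small sketch of the interval above, assuming $z = 1.96$ for $\alpha = 0.05$ (the helper name is mine):

import math

def wilson_interval(f, N, z=1.96):
    # f: observed error ratio on the test set, N: test set size, z: depends on alpha.
    center = (f + z**2 / (2 * N)) / (1 + z**2 / N)
    half = (z / (1 + z**2 / N)) * math.sqrt(f * (1 - f) / N + z**2 / (4 * N**2))
    return center - half, center + half  # optimistic (-) and pessimistic (+) estimates

print(wilson_interval(f=0.15, N=200))    # ≈ (0.107, 0.206); the interval narrows as N grows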
\begin{center}
@ -127,25 +127,25 @@ As $N$ is at the denominator, this means that for large values of $N$, the uncer
\item[Holdout] \marginnote{Holdout}
The dataset is split into train, test and, if needed, validation.
\item[Cross validation] \marginnote{Cross validation}
\item[Cross-validation] \marginnote{Cross-validation}
The training data is partitioned into $k$ chunks.
For $k$ iterations, one of the chunks if used to test and the others to train a new model.
For $k$ iterations, one of the chunks is used to test and the others to train a new model.
The overall error is obtained as the average of the errors of the $k$ iterations.
At the end, the final model is still trained on the entire training data,
while cross validation results are used as an evaluation and comparison metric.
Note that cross validation is done on the training set, so a final test set can still be used to
evaluate the final model.
In the end, the final model is still trained on the entire training data,
while cross-validation results are used as an evaluation and comparison metric.
Note that cross-validation is done on the training set, so a final test set can still be used to
evaluate the resulting model.
\begin{figure}[h]
\centering
\includegraphics[width=0.6\textwidth]{img/cross_validation.png}
\caption{Cross validation example}
\caption{Cross-validation example}
\end{figure}
\item[Leave-one-out] \marginnote{Leave-one-out}
Extreme case of cross validation with $k=N$, the size of the training set.
In this case the whole dataset but one element is used for training and the remaining entry for testing.
Extreme case of cross-validation with $k=N$, the size of the training set.
In this case, the whole dataset but one element is used for training and the remaining entry for testing.
\item[Bootstrap] \marginnote{Bootstrap}
Statistical sampling of the dataset with replacement (i.e. an entry can be selected multiple times).
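A compact sketch of the cross-validation scheme described above, using plain NumPy arrays; the `train_and_score` callable is a placeholder for training a model and returning its error on the held-out chunk:

import numpy as np

def k_fold_cv(X, y, train_and_score, k=5, seed=0):
    # Partition the training data into k chunks; each chunk is used once as the test fold.
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        errors.append(train_and_score(X[train_idx], y[train_idx],
                                      X[test_idx], y[test_idx]))
    return float(np.mean(errors))  # overall error = average over the k iterations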
@ -192,7 +192,7 @@ Given a test set of $N$ element, possible metrics are:
\item[Recall/Sensitivity] \marginnote{Recall}
Number of true positives among the real positives
(i.e. how many real positive the model predicted).
(i.e. how many of the real positives the model correctly identified).
\[ \text{recall} = \frac{TP}{TP + FN} \]
\item[Specificity] \marginnote{Specificity}
@ -296,7 +296,7 @@ a macro (unweighted) average or a class-weighted average.
\item[ROC curve] \marginnote{ROC curve}
The ROC curve can be seen as a way to represent multiple confusion matrices of a classifier
that uses different thresholds.
The x-axis of a ROC curve represent the false positive rate while the y-axis represent the true positive rate.
The x-axis of a ROC curve represents the false positive rate while the y-axis represents the true positive rate.
A straight line is used to represent a random classifier.
A threshold can be considered good if it is high on the y-axis and low on the x-axis.
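For reference, a ROC curve can be obtained from classifier scores with scikit-learn (assuming it is installed; the toy labels and scores below are mine):

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 1, 0]                        # real classes
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.45]    # classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)        # one (FPR, TPR) point per threshold
print(list(zip(thresholds, fpr, tpr)))
print(roc_auc_score(y_true, y_score))                    # area under the ROC curve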
@ -314,7 +314,7 @@ A classifier may not perform well when predicting a minority class of the traini
Possible solutions are:
\begin{descriptionlist}
\item[Undersampling] \marginnote{Undersampling}
Randomly reduce the number of example of the majority classes.
Randomly reduce the number of examples of the majority classes.
\item[Oversampling] \marginnote{Oversampling}
Increase the number of examples of the minority classes.
@ -324,7 +324,7 @@ Possible solutions are:
\begin{enumerate}
\item Randomly select an example $x$ belonging to the minority class.
\item Select a random neighbor $z_i$ among its $k$-nearest neighbors $z_1, \dots, z_k$.
\item Synthetize a new example by selecting a random point of the feature space between $x$ and $z_i$.
\item Synthesize a new example by selecting a random point of the feature space between $x$ and $z_i$.
\end{enumerate}
\end{description}
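A minimal sketch of the SMOTE steps listed above (function and variable names are mine):

import numpy as np

def smote_sample(X_min, k=5, rng=None):
    # X_min: minority-class points (NumPy array, one row per example, at least k+1 rows).
    if rng is None:
        rng = np.random.default_rng(0)
    x = X_min[rng.integers(len(X_min))]              # 1. random minority example x
    dists = np.linalg.norm(X_min - x, axis=1)
    neighbors = X_min[np.argsort(dists)[1:k + 1]]    # 2. its k nearest neighbors
    z = neighbors[rng.integers(len(neighbors))]      #    pick a random neighbor z_i
    return x + rng.random() * (z - x)                # 3. random point between x and z_i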
@ -346,7 +346,7 @@ Possible solutions are:
\begin{description}
\item[Shannon theorem] \marginnote{Shannon theorem}
Let $\matr{X} = \{ \vec{v}_1, \dots, \vec{v}_V \}$ be a data source where
each of the possible value has probability $p_i = \prob{\vec{v}_i}$.
each of the possible values has probability $p_i = \prob{\vec{v}_i}$.
The best encoding transmits $\matr{X}$ with
an average number of bits given by the \textbf{entropy} of $\matr{X}$: \marginnote{Entropy}
\[ H(\matr{X}) = - \sum_j p_j \log_2(p_j) \]
@ -354,7 +354,7 @@ Possible solutions are:
If $p_j \sim 1$, then the surprise of observing $\vec{v}_j$ is low, vice versa,
if $p_j \sim 0$, the surprise of observing $\vec{v}_j$ is high.
Therefore, when $H(\matr{X})$ is high, $\matr{X}$ is close to an uniform distribution.
Therefore, when $H(\matr{X})$ is high, $\matr{X}$ is close to a uniform distribution.
When $H(\matr{X})$ is low, $\matr{X}$ is close to a constant.
\begin{example}[Binary source] \phantom{}\\
@ -382,7 +382,7 @@ Possible solutions are:
It is computed as:
\[ IG(c \,\vert\, d \,:\, t) = H(c) - H(c \,\vert\, d \,:\, t) \]
When $H(c \,\vert\, d \,:\, t)$ is low, $IG(c \,\vert\, d \,:\, t)$ is high
as splitting with threshold $t$ result in purer groups.
as splitting with threshold $t$ results in purer groups.
Vice versa, when $H(c \,\vert\, d \,:\, t)$ is high, $IG(c \,\vert\, d \,:\, t)$ is low
as splitting with threshold $t$ is not very useful.
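A small sketch of the information gain of a threshold split, following the formula above (helper names are mine; `values` and `labels` are NumPy arrays):

import numpy as np

def entropy(labels):
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(values, labels, t):
    # IG(c | d : t) = H(c) - H(c | d : t), splitting on values <= t vs values > t.
    left, right = labels[values <= t], labels[values > t]
    h_cond = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - h_cond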
@ -520,7 +520,7 @@ each node requires to process all the attributes.
Assuming an average height of $O(\log N)$,
the overall complexity for induction (parameter search) is $O(DN \log N)$.
Moreover, The other operations of a binary tree have complexity:
Moreover, the other operations of a binary tree have complexity:
\begin{itemize}
\item Threshold search and binary split: $O(N \log N)$ (scan the dataset for the threshold).
\item Pruning: $O(N \log N)$ (requires to scan the dataset).
@ -584,8 +584,8 @@ This has complexity $O(h)$, with $h$ the height of the tree.
\item[Smoothing]
If the value $e_{ij}$ of the domain of a feature $E_i$ never appears in the dataset,
its probability $\prob{e_{ij} \mid c}$ will be 0 for all classes.
This nullifies all the probabilities that uses this feature when
computing the products chain during inference.
This nullifies all the probabilities that use this feature when
computing the product chain during inference.
Smoothing methods can be used to avoid this problem.
\begin{description}
@ -597,14 +597,14 @@ This has complexity $O(h)$, with $h$ the height of the tree.
\item[$\vert \mathbb{D}_{E_i} \vert$] The number of distinct values in the domain of $E_i$.
\item[\normalfont$\text{af}_{c}$] The absolute frequency of the class $c$.
\end{descriptionlist}
the smoothed frequency is computed as:
The smoothed frequency is computed as:
\[
\prob{e_{ij} \mid c} = \frac{\text{af}_{e_{ij}, c} + \alpha}{\text{af}_{c} + \alpha \vert \mathbb{D}_{E_i} \vert}
\]
A common value of $\alpha$ is 1.
When $\alpha = 0$, there is no smoothing.
For higher values of $\alpha$, the smoothed feature gain more importance when computing the priors.
For higher values of $\alpha$, the smoothed feature gains more importance when computing the priors.
\end{description}
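Returning to the smoothing formula above, a small numerical sketch (names are mine):

def smoothed_likelihood(af_value_class, af_class, domain_size, alpha=1.0):
    # P(e_ij | c) = (af_{e_ij,c} + alpha) / (af_c + alpha * |D_{E_i}|)
    return (af_value_class + alpha) / (af_class + alpha * domain_size)

# A value never observed together with class c still gets a non-zero probability:
print(smoothed_likelihood(af_value_class=0, af_class=40, domain_size=5))  # 1/45 ≈ 0.022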
\item[Missing values] \marginnote{Missing values}
@ -704,7 +704,7 @@ In practice, a maximum number of iterations is set.
\end{split}
\]
where $M$ is the margin, $w_i$ are the weights of the hyperplane and $c_i = \{-1, 1 \}$ is the class.
The second constraint imposes the hyperplane to have a large margine.
The second constraint forces the hyperplane to have a large margin.
For positive labels ($c_i=1$), this holds when the hyperplane function is positive.
For negative labels ($c_i=-1$), it holds when the hyperplane function is negative.
@ -736,7 +736,7 @@ Then, the data and the boundary is mapped back into the original space.
\caption{Example of mapping from $\mathbb{R}^2$ to $\mathbb{R}^3$}
\end{figure}
The kernel trick allows to avoid to explicitly map the dataset into the new space by using kernel functions.
The kernel trick avoids explicitly mapping the dataset into the new space by using kernel functions.
Known kernel functions are:
\begin{descriptionlist}
\item[Linear] $K(x, y) = \langle x, y \rangle$.
@ -755,7 +755,7 @@ depending on the effectiveness of data caching.
\begin{itemize}
\item Training an SVM model is generally slower.
\item SVM is not affected by local minima.
\item SVM do not suffer the curse of dimensionality.
\item SVM does not suffer the curse of dimensionality.
\item SVM does not directly provide probability estimates.
If needed, these can be computed using a computationally expensive method.
\end{itemize}
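As an illustration of the kernel trick mentioned above, the Gaussian (RBF) kernel computes an inner product in an implicit feature space without ever constructing that space explicitly (sketch; the gamma value is arbitrary):

import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))

print(rbf_kernel([1.0, 2.0], [1.5, 1.0]))  # similarity of the two points in the implicit space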
@ -771,10 +771,12 @@ depending on the effectiveness of data caching.
\item[Activation function] \marginnote{Activation function}
Activation functions are useful to add non-linearity.
\begin{remark}
In a linear system, if there is noise in the input, it is transferred to the output
(i.e. linearity implies that $f(x + \text{noise}) = f(x) + f(\text{noise})$).
On the other hand, a non-linear system is generally more robust
(i.e. non-linearity generally implies that $f(x + \text{noise}) \neq f(x) + f(\text{noise})$).
\end{remark}
\item[Feedforward neural network] \marginnote{Feedforward neural network}
Network with the following flow:
@ -791,7 +793,7 @@ Inputs are fed to the network and backpropagation is used to update the weights.
Size of the step for gradient descent.
\item[Epoch] \marginnote{Epoch}
A round of training where the entire dataset has been processed.
A round of training where the entire dataset is processed.
\item[Stopping criteria] \marginnote{Stopping criteria}
Possible conditions to stop the training are:
@ -874,7 +876,7 @@ Different strategies to train an ensemble classifier can be used:
\subsection{Random forests}
\marginnote{Random forests}
Different decision trees trained on a different random sampling of the training set and different subset of features.
Multiple decision trees trained on a different random sampling of the training set and different subsets of features.
A prediction is made by averaging the output of each tree.
\begin{description}

View File

@ -10,7 +10,7 @@
\item[Dissimilarity] \marginnote{Dissimilarity}
Measures how two objects differ.
0 indicates no difference while the upper-bound varies.
0 indicates no difference while the upper bound varies.
\end{description}
\begin{table}[ht]
@ -119,7 +119,7 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar
\begin{description}
\item[Pearson's correlation] \marginnote{Pearson's correlation}
Measure of linear relationship between a pair of quantitative attributes $e_1$ and $e_2$.
To compute the Pearson's correlation, the values of $e_1$ and $e_2$ are first standardized and then ordered to obtain the vectors $\vec{e}_1$ and $\vec{e}_2$.
To compute Pearson's correlation, the values of $e_1$ and $e_2$ are first standardized and then ordered to obtain the vectors $\vec{e}_1$ and $\vec{e}_2$.
The correlation is then computed as the dot product between $\vec{e}_1$ and $\vec{e}_2$:
\[ \texttt{corr}(e_1, e_2) = \langle \vec{e}_1, \vec{e}_2 \rangle \]
@ -202,10 +202,10 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar
Given the global centroid of the dataset $\vec{c}$ and
$K$ clusters each with $N_i$ objects,
the sum of squares between clusters is given by:
\[ \texttt{SSB} = \sum_{i=1}^{K} N_i \texttt{dist}(\vec{c}_i, \vec{c})^2 \]
\[ \texttt{SSB} = \sum_{i=1}^{K} N_i \cdot \texttt{dist}(\vec{c}_i, \vec{c})^2 \]
\item[Total sum of squares] \marginnote{Total sum of squares}
Sum of the squared distances between the point of the dataset and the global centroid.
Sum of the squared distances between the points of the dataset and the global centroid.
It can be shown that the total sum of squares can be computed as:
\[ \texttt{TSS} = \texttt{SSE} + \texttt{SSB} \]
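A small sketch checking the decomposition $\texttt{TSS} = \texttt{SSE} + \texttt{SSB}$ on a clustered dataset (helper name is mine; `X` is an N×D NumPy array, `labels` the cluster assignment):

import numpy as np

def sse_ssb_tss(X, labels):
    c_global = X.mean(axis=0)
    sse = ssb = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        ck = Xk.mean(axis=0)
        sse += np.sum((Xk - ck) ** 2)                  # cohesion: points vs. own centroid
        ssb += len(Xk) * np.sum((ck - c_global) ** 2)  # separation: centroids vs. global centroid
    tss = float(np.sum((X - c_global) ** 2))
    return sse, ssb, tss                               # tss == sse + ssb (up to rounding)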
@ -217,7 +217,7 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar
The Silhouette score of a data point $\vec{x}_i$ belonging to a cluster $K_i$ is given by two components:
\begin{description}
\item[Sparsity contribution]
The average distance of $\vec{x}_i$ to all other points in $K_i$:
The average distance of $\vec{x}_i$ to the other points in $K_i$:
\[ a(\vec{x}_i) = \frac{1}{\vert K_i \vert - 1} \sum_{\vec{x}_j \in K_i, \vec{x}_j \neq \vec{x}_i} \texttt{dist}(\vec{x}_i, \vec{x}_j) \]
\item[Separation contribution]
@ -278,7 +278,7 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar
\item an encoding function $\texttt{encode}: \mathbb{R}^D \rightarrow [1, K]$;
\item a decoding function $\texttt{decode}: [1, K] \rightarrow \mathbb{R}^D$.
\end{itemize}
Distortion (or inertia) is defines as:
Distortion (or inertia) is defined as:
\[ \texttt{distortion} = \sum_{i=1}^{N} \big(\vec{x}_i - \texttt{decode}(\texttt{encode}(\vec{x_i})) \big)^2 \]
\begin{theorem}
@ -288,7 +288,7 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar
\item The center of a point is the centroid of the cluster it belongs to.
\end{enumerate}
Note that k-means alternates point 1 and 2.
Note that k-means alternates points 1 and 2.
\begin{proof}
The second point is derived by setting the derivative of \texttt{distortion} to 0.
@ -311,7 +311,7 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar
\begin{description}
\item[Termination]
There are a finite number of ways to cluster $N$ objects into $K$ clusters.
By construction, at each iteration the \texttt{distortion} is reduced.
By construction, at each iteration, the \texttt{distortion} is reduced.
Therefore, k-means is guaranteed to terminate.
\item[Non-optimality]
@ -320,11 +320,11 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar
The starting configuration is usually composed of points that are as far apart as possible.
\item[Noise]
Outliers heavily influences the clustering result. Sometimes, it is useful to remove them.
Outliers heavily influence the clustering result. Sometimes, it is useful to remove them.
\item[Complexity]
Given a $D$-dimensional dataset of $N$ points,
Running k-means for $T$ iterations to find $K$ clusters has complexity $O(TKND)$.
running k-means for $T$ iterations to find $K$ clusters has complexity $O(TKND)$.
\end{description}
\end{description}
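A compact sketch of the alternation described above (assign each point to its nearest centroid, then move each centroid to the mean of its points); names and initialization are mine:

import numpy as np

def k_means(X, K, T=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(T):
        # Step 1: encode each point as the index of its nearest centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: each centroid becomes the mean of the points assigned to it.
        new_centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                  else centroids[k] for k in range(K)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels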
@ -333,9 +333,9 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar
\section{Hierarchical clustering}
\begin{description}
\item[Dendogram] \marginnote{Dendogram}
Tree-like structure where the root is a cluster of all data points and
the leaves are clusters with a single data points.
\item[Dendrogram] \marginnote{Dendrogram}
Tree-like structure where the root is a cluster of all the data points and
the leaves are clusters with a single data point.
\item[Agglomerative] \marginnote{Agglomerative}
Starts with a cluster per data point and iteratively merges them (leaves to root).
@ -380,12 +380,12 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar
\begin{enumerate}
\item Initialize a cluster for each data point.
\item Compute the distance matrix between the clusters.
\item Merge the two clusters with lowest separation,
drop their values from the distance matrix and add an row/column for the newly created cluster.
\item Merge the two clusters with the lowest separation,
drop their values from the distance matrix and add a row/column for the newly created cluster.
\item Go to point 2 if the number of clusters is greater than one.
\end{enumerate}
After the construction of the dendogram, a cut \marginnote{Cut} can be performed at a user define level.
After the construction of the dendrogram, a cut \marginnote{Cut} can be performed at a user-defined level.
A cut near the root will result in few bigger clusters.
A cut near the leaves will result in numerous smaller clusters.
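For reference, the agglomerative procedure and the cut can be reproduced with SciPy (assuming it is available; data and parameters below are arbitrary):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).random((10, 2))     # toy 2-D dataset
Z = linkage(X, method="complete")                # merge history (the dendrogram)
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the dendrogram into 3 clusters
print(labels)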
@ -441,9 +441,9 @@ Consider as clusters the high-density areas of the data space.
\item $\vec{q}$ is a core point.
\item There exists a sequence of points $\vec{s}_1, \dots, \vec{s}_z$ such that:
\begin{itemize}
\item $\vec{s}_1$ is directly density reachable from $\vec{p}$.
\item $\vec{s}_1$ is directly density reachable from $\vec{q}$.
\item $\vec{s}_{i+1}$ is directly density reachable from $\vec{s}_i$.
\item $\vec{q}$ is directly density reachable from $\vec{s}_z$.
\item $\vec{p}$ is directly density reachable from $\vec{s}_z$.
\end{itemize}
\end{itemize}
@ -455,7 +455,7 @@ Consider as clusters the high-density areas of the data space.
Determine clusters as maximal sets of density connected points.
Border points not density connected to any core point are labeled as noise.
In other words, what happens it the following:
In other words, what happens is the following:
\begin{itemize}
\item Neighboring core points are part of the same cluster.
\item Border points are part of the cluster of their nearest core point neighbor.
@ -480,7 +480,7 @@ Consider as clusters the high-density areas of the data space.
\end{description}
\item[Complexity]
Complexity of $O(N^2)$ reduced to $O(N \log N)$ if using spatial indexing.
Complexity of $O(N^2)$, reduced to $O(N \log N)$ if using spatial indexing.
\end{description}
\end{description}
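For reference, the procedure above is implemented in scikit-learn (assuming it is installed; eps and min_samples below are arbitrary):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).random((50, 2))           # toy 2-D dataset
labels = DBSCAN(eps=0.15, min_samples=4).fit(X).labels_
print(labels)                                          # -1 marks points labeled as noise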
@ -493,7 +493,7 @@ Consider as clusters the high-density areas of the data space.
\begin{description}
\item[Kernel function] \marginnote{Kernel function}
Symmetric and monotonically decreasing function to describe the influence of a data point to its neighbors.
Symmetric and monotonically decreasing function to describe the influence of a data point on its neighbors.
A typical kernel function is the Gaussian.
@ -514,7 +514,7 @@ Consider as clusters the high-density areas of the data space.
\item Derive a density function of the dataset.
\item Identify local maxima and consider them as density attractors.
\item Associate each data point with the density attractor in the direction of maximum increase.
\item Points associated to the same density attractor are part of the same cluster.
\item Points associated with the same density attractor are part of the same cluster.
\item Remove clusters whose density attractor has a density lower than $\xi$.
\item Merge clusters connected through a path of points whose density is greater than or equal to $\xi$
(e.g. in \Cref{img:denclue} the center area will result in many small clusters that can be merged with an appropriate $\xi$).

View File

@ -34,10 +34,10 @@
\item Data transformations.
\end{itemize}
\section{Modelling}
\section{Modeling}
\begin{itemize}
\item Select modelling technique.
\marginnote{Modelling}
\item Select modeling technique.
\marginnote{Modeling}
\item Build/train the model.
\end{itemize}

View File

@ -18,7 +18,7 @@
Stored data can be classified as:
\begin{descriptionlist}
\item[Hot] \marginnote{Hot storage}
A low volume of highly requested data that require low latency.
A low volume of highly requested data that requires low latency.
More expensive HW/SW.
\item[Cold] \marginnote{Cold storage}
A large amount of data that does not have latency requirements.
@ -95,9 +95,8 @@
\section{Components}
\subsection{Data ingestion}
\marginnote{Data ingestion}
\begin{descriptionlist}
\item[Workload migration]
\item[Workload migration] \marginnote{Data ingestion}
Inserting all the data from an existing source.
\item[Incremental ingestion]
Inserting changes since the last ingestion.
@ -123,7 +122,7 @@
\begin{description}
\item[Columnar storage] \phantom{}
\begin{itemize}
\item Homogenous data are stores contiguously.
\item Homogeneous data are stored contiguously.
\item Speeds up methods that process entire columns (i.e. all the values of a feature).
\item Insertion becomes slower.
\end{itemize}
@ -134,9 +133,8 @@
\end{description}
\subsection{Processing and analytics}
\marginnote{Processing and analytics}
\begin{descriptionlist}
\item[Interactive analytics]
\item[Interactive analytics] \marginnote{Processing and analytics}
Interactive queries to large volumes of data.
The results are stored back in the data lake.
\item[Big data analytics]
@ -149,11 +147,13 @@
\section{Architectures}
\subsection{Lambda lake}
\marginnote{Lambda lake}
\begin{description}
\item[Batch layer] Receives and stores the data. Prepares the batch views for the serving layer.
\item[Serving layer] Indexes batch views for faster queries.
\item[Speed layer] Receives the data and prepares real-time views. The views are also stored in the serving layer.
\item[Batch layer] \marginnote{Lambda lake}
Receives and stores the data. Prepares the batch views for the serving layer.
\item[Serving layer]
Indexes batch views for faster queries.
\item[Speed layer]
Receives the data and prepares real-time views. The views are also stored in the serving layer.
\end{description}
\begin{figure}[ht]
\centering
@ -190,7 +190,7 @@ Framework that adds features on top of an existing data lake.
\section{Metadata}
\marginnote{Metadata}
Metadata are used to organize a data lake.
Metadata is used to organize a data lake.
Useful metadata are:
\begin{descriptionlist}
\item[Source] Origin of the data.

View File

@ -21,9 +21,9 @@ Useful for:
\section{Sampling}
\marginnote{Sampling}
Sampling can be used when the full dataset is too expensive to obtain or too expensive to process.
Obviously a sample has to be representative.
Obviously, a sample has to be representative.
Type of sampling techniques are:
The types of sampling techniques are:
\begin{descriptionlist}
\item[Simple random] \marginnote{Simple random}
Extraction of a single element following a given probability distribution.
@ -45,7 +45,7 @@ Type of sampling techniques are:
\begin{description}
\item[Sample size]
The sample size represents a tradeoff between data reduction and precision.
In a labeled dataset, it is important to consider the probability of sampling data of all the possible classes.
In a labeled dataset, it is important to consider the probability of sampling data from all the possible classes.
\end{description}
@ -121,7 +121,7 @@ Possible approaches are:
For each entry, if its feature $E$ has value $e_i$, then $H_{e_i} = \texttt{true}$ and the rest are \texttt{false}.
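A tiny sketch of the encoding described above (hypothetical helper):

def one_hot(value, domain):
    # One indicator per value of the domain; only the observed value is true.
    return {f"H_{d}": d == value for d in domain}

print(one_hot("red", ["red", "green", "blue"]))
# {'H_red': True, 'H_green': False, 'H_blue': False}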
\subsection{Ordinal encoding} \marginnote{Ordinal encoding}
A feature whose values have an ordering can be converted in a consecutive sequence of integers
A feature whose values have an ordering can be converted into a consecutive sequence of integers
(e.g. ["good", "neutral", "bad"] $\mapsto$ [1, 0, -1]).
\subsection{Discretization} \marginnote{Discretization}

View File

@ -7,13 +7,13 @@
Deliver the right information to the right people at the right time through the right channel.
\item[\Ac{dwh}] \marginnote{\Acl{dwh}}
Optimized repository that stores information for decision making processes.
Optimized repository that stores information for decision-making processes.
\Acp{dwh} are a specific type of \ac{dss}.
Features:
\begin{itemize}
\item Subject-oriented: focused on enterprise specific concepts.
\item Integrates data from different sources and provides an unified view.
\item Subject-oriented: focused on enterprise-specific concepts.
\item Integrates data from different sources and provides a unified view.
\item Non-volatile storage with change tracking.
\end{itemize}
@ -143,7 +143,7 @@ Operational data may contain:
\item[Missing data]
\item[Improper use of fields] (e.g. saving the phone number in the \texttt{notes} field)
\item[Wrong values] (e.g. 30th of February)
\item[Inconsistency] (e.g. use of different abbreviations)
\item[Inconsistencies] (e.g. use of different abbreviations)
\item[Typos]
\end{descriptionlist}
@ -154,10 +154,10 @@ Methods to clean and increase the quality of the data are:
Applicable if the domain is known and limited.
\item[Approximate merging] \marginnote{Approximate merging}
Merging data that do not have a common key.
Methods to merge data that do not have a common key.
\begin{description}
\item[Approximate join]
Use non-key attributes to join two tables (e.g. using the name and surname instead of an unique identifier).
Use non-key attributes to join two tables (e.g. using the name and surname instead of a unique identifier).
\item[Similarity approach]
Use similarity functions (e.g. edit distance) to merge multiple instances of the same information
@ -178,7 +178,7 @@ Data are transformed to respect the format of the data warehouse:
Creating new information by using existing attributes (e.g. compute profit from receipts and expenses)
\item[Separation and concatenation] \marginnote{Separation and concatenation}
Denormalization of the data: introduces redundances (i.e. breaks normal form\footnote{\url{https://en.wikipedia.org/wiki/Database_normalization}})
Denormalization of the data: introduces redundancies (i.e. breaks normal form\footnote{\url{https://en.wikipedia.org/wiki/Database_normalization}})
to speed up operations.
\end{descriptionlist}
@ -332,13 +332,14 @@ Aggregation operators can be classified as:
\end{description}
\subsection{Logical design}
\section{Logical design}
\marginnote{Logical design}
Defining the data structures (e.g. tables and relationships) according to a conceptual model.
There are mainly two strategies:
There are two main strategies:
\begin{descriptionlist}
\item[Star schema] \marginnote{Star schema}
A fact table that contains all the measures and linked to dimensional tables.
A fact table that contains all the measures is linked to dimensional tables.
\begin{figure}[ht]
\centering
\includegraphics[width=\textwidth]{img/logical_star_schema.png}
@ -346,8 +347,8 @@ There are mainly two strategies:
\end{figure}
\item[Snowflake schema] \marginnote{Snowflake schema}
A star schema variant with partially normalized dimension tables.
\begin{figure}[ht]
A star schema variant with partially normalized dimensional tables.
\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{img/logical_snowflake_schema.png}
\caption{Example of snowflake schema}

View File

@ -30,7 +30,7 @@
\subsection{Software}
\begin{description}
\item[\Ac{oltp}] \marginnote{\Acl{oltp}}
Class of programs to support transaction oriented applications and data storage.
Class of programs to support transaction-oriented applications and data storage.
Suitable for real-time applications.
\item[\Ac{erp}] \marginnote{\Acl{erp}}
@ -41,10 +41,10 @@
\subsection{Insight}
Decision can be classified as:
Decisions can be classified as:
\begin{descriptionlist}
\item[Structured] \marginnote{Structured decision}
Established and well understood situations.
Established and well-understood situations.
What is needed is known.
\item[Unstructured] \marginnote{Unstructured decision}
Unplanned and unclear situations.
@ -54,18 +54,18 @@ Decision can be classified as:
Different levels of insight can be extracted by:
\begin{descriptionlist}
\item[\Ac{mis}] \marginnote{\Acl{mis}}
Standardized reporting system built on existing \ac{oltp}.
Standardized reporting system built on an existing \ac{oltp}.
Used for structured decisions.
\item[\Ac{dss}] \marginnote{\Acl{dss}}
Analytical system to provide support for unstructured decisions.
\item[\Ac{eis}] \marginnote{\Acl{eis}}
Formulate high level decisions that impact the organization.
Formulate high-level decisions that impact the organization.
\item[\Ac{olap}] \marginnote{\Acl{olap}}
Grouped analysis of multidimensional data.
Involves large amount of data.
Involves a large amount of data.
\item[\Ac{bi}] \marginnote{\Acl{bi}}
Applications, infrastructure, tools and best practices to analyze information.
@ -75,7 +75,7 @@ Different levels of insight can be extracted by:
\begin{description}
\item[Big data] \marginnote{Big data}
Large and/or complex and/or fast changing collection of data that traditional DBMSs are unable to process.
Large and/or complex and/or fast-changing collection of data that traditional DBMSs are unable to process.
\begin{description}
\item[Structured] e.g. relational tables.
\item[Unstructured] e.g. videos.

View File

@ -11,7 +11,7 @@
\item[Regression] Estimation of a numeric value.
\item[Similarity matching] Identify similar individuals.
\item[Clustering] Grouping individuals based on their similarities.
\item[Co-occurrence groupping] Identify associations between entities based on the transactions in which they appear together.
\item[Co-occurrence grouping] Identify associations between entities based on the transactions in which they appear together.
\item[Profiling] Behavior description.
\item[Link analysis] Analysis of connections (e.g. in a graph).
\item[Data reduction] Reduce the dimensionality of data with minimal information loss.
@ -69,7 +69,7 @@
\textbf{Operators.} $=$, $\neq$, $<$, $>$, $\leq$, $\geq$, $+$, $-$
\begin{example}
Celsius and Fahrenheit temperature scales, CGPA, time.
Celsius and Fahrenheit temperature scales, CGPA, time, \dots.
For instance, there is a $6.25\%$ increase from $16\text{°C}$ to $17\text{°C}$, but
converted to Fahrenheit, the increase is $2.96\%$ (from $60.8\text{°F}$ to $62.6\text{°F}$).
@ -157,14 +157,14 @@
\item[Missing values] \marginnote{Missing values}
Data that have not been collected.
Sometimes they are not easily recognizable
(e.g. when special values are used, instead of \texttt{null}, to mark missing data).
(e.g. when special values are used to mark missing data instead of \texttt{null}).
Can be handled in different ways:
\begin{itemize}
\item Ignore the records with missing values.
\item Estimate or default missing values.
\item Ignore the fact that some values are missing (not always applicable).
\item Insert all the possible values and weight them by their probability.
\item Insert all the possible values and weigh them by their probability.
\end{itemize}
\item[Duplicated data] \marginnote{Duplicated data}

View File

@ -27,7 +27,7 @@
\begin{itemize}
\item MSE is influenced by the magnitude of the data.
\item It measures the fitness of a model in absolute terms.
\item It is suited to compare different models.
% \item It is suited to compare different models.
\end{itemize}
\item[Coefficient of determination] \marginnote{Coefficient of determination}