diff --git a/src/machine-learning-and-data-mining/sections/_association_rules.tex b/src/machine-learning-and-data-mining/sections/_association_rules.tex index 4e9aa8e..fbbb58e 100644 --- a/src/machine-learning-and-data-mining/sections/_association_rules.tex +++ b/src/machine-learning-and-data-mining/sections/_association_rules.tex @@ -80,7 +80,7 @@ \item \marginnote{Frequent itemset generation} Determine the itemsets with $\text{support} \geq \texttt{min\_sup}$ (frequent itemsets). \item \marginnote{Rule generation} - Determine the the association rules with $\text{confidence} \geq \texttt{min\_conf}$. + Determine the association rules with $\text{confidence} \geq \texttt{min\_conf}$. \end{enumerate} \end{description} @@ -250,7 +250,7 @@ Measures that take into account the statistical independence of the items. \hline High support & The rule applies to many transactions. \\ \hline - High confidence & The chance that the rule is true for some transaction is high. \\ + High confidence & The chance that the rule is true for some transactions is high. \\ \hline High lift & Low chance that the rule is just a coincidence. \\ \hline @@ -329,7 +329,7 @@ Measures that take into account the statistical independence of the items. \section{Multi-level association rules} -Organize items into an hierarchy. +Organize items into a hierarchy. \begin{description} \item[Specialized to general] \marginnote{Specialized to general} @@ -345,7 +345,7 @@ Organize items into an hierarchy. \end{example} \item[Redundant level] \marginnote{Redundant level} - A more specialized rule in the hierarchy is redundant if its confidence is similar to the one of the more general rule. + A more specialized rule in the hierarchy is redundant if its confidence is similar to the one of a more general rule. \item[Multi-level association rule mining] \marginnote{Multi-level association rule mining} Run association rule mining on different levels of abstraction (general to specialized). diff --git a/src/machine-learning-and-data-mining/sections/_classification.tex b/src/machine-learning-and-data-mining/sections/_classification.tex index 33310b2..d46bddf 100644 --- a/src/machine-learning-and-data-mining/sections/_classification.tex +++ b/src/machine-learning-and-data-mining/sections/_classification.tex @@ -65,9 +65,9 @@ A supervised dataset can be randomly split into: \begin{descriptionlist} \item[Train set] \marginnote{Train set} - Used to learn the model. Usually the largest split. Can be seen as an upper-bound of the model performance. + Used to learn the model. Usually the largest split. Can be seen as an upper bound of the model performance. \item[Test set] \marginnote{Test set} - Used to evaluate the trained model. Can be seen as a lower-bound of the model performance. + Used to evaluate the trained model. Can be seen as a lower bound of the model performance. \item[Validation set] \marginnote{Validation set} Used to evaluate the model during training and/or for tuning parameters. \end{descriptionlist} @@ -93,7 +93,7 @@ \subsection{Test set error} \textbf{\underline{Disclaimer: I'm very unsure about this part}}\\ -The error on the test set can be seen as a lower-bound error of the model. +The error on the test set can be seen as a lower bound error of the model. If the test set error ratio is $x$, we can expect an error of $(x \pm \text{confidence interval})$. Predicting the elements of the test set can be seen as a binomial process (i.e. a series of $N$ Bernoulli processes). 
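To make the support/confidence/lift table in the association-rules hunk above concrete, a small worked example follows; the transaction counts are illustrative and the definitions of support, confidence and lift are assumed to be the standard ones (they are stated earlier in the notes, outside this diff).

\begin{example}[Support, confidence and lift] \phantom{}\\
    Assume $1000$ transactions: $200$ contain \texttt{bread}, $250$ contain \texttt{butter} and $100$ contain both.
    For the rule $\texttt{bread} \rightarrow \texttt{butter}$:
    \[ \text{support} = \frac{100}{1000} = 0.1 \qquad
       \text{confidence} = \frac{100}{200} = 0.5 \qquad
       \text{lift} = \frac{0.5}{250/1000} = 2 \]
    A lift of $2 > 1$ indicates that the two items co-occur twice as often as expected under independence,
    so the rule is unlikely to be just a coincidence.
\end{example}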
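Returning to the binomial view of the test error, a quick numeric illustration with made-up counts, using the standard normal approximation for a binomial proportion (the Wilson interval in the next hunk refines this estimate):

\begin{example}[Test error as a binomial proportion] \phantom{}\\
    Suppose a classifier misclassifies $20$ out of $N = 200$ test examples, so $f = 0.1$.
    Treating each prediction as a Bernoulli trial, the standard error of $f$ is about
    \[ \sqrt{\frac{f(1-f)}{N}} = \sqrt{\frac{0.1 \cdot 0.9}{200}} \approx 0.021 \]
    so, with $z \approx 1.96$ (i.e. $\alpha = 0.05$), the expected error is roughly $0.1 \pm 0.04$.
    With $N = 2000$ and the same error ratio, the half-width shrinks to about $0.013$.
\end{example}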
@@ -114,7 +114,7 @@ be between the $\frac{\alpha}{2}$ and $(1-\frac{\alpha}{2})$ quantiles of the ga We can estimate $p$ using the Wilson score interval\footnote{\url{https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval}}: \[ p = \frac{1}{1+\frac{1}{N}z^2} \left( f + \frac{1}{2N}z^2 \pm z\sqrt{\frac{1}{N}f(1-f) + \frac{z^2}{4N^2}} \right) \] where $z$ depends on the value of $\alpha$. -For a pessimistic estimate, $\pm$ becomes a $+$. Vice versa, for a optimistic estimate, $\pm$ becomes a $-$. +For a pessimistic estimate, $\pm$ becomes a $+$. Vice versa, for an optimistic estimate, $\pm$ becomes a $-$. As $N$ is at the denominator, this means that for large values of $N$, the uncertainty becomes smaller. \begin{center} @@ -127,25 +127,25 @@ As $N$ is at the denominator, this means that for large values of $N$, the uncer \item[Holdout] \marginnote{Holdout} The dataset is split into train, test and, if needed, validation. - \item[Cross validation] \marginnote{Cross validation} + \item[Cross-validation] \marginnote{Cross-validation} The training data is partitioned into $k$ chunks. - For $k$ iterations, one of the chunks if used to test and the others to train a new model. + For $k$ iterations, one of the chunks is used to test and the others to train a new model. The overall error is obtained as the average of the errors of the $k$ iterations. - At the end, the final model is still trained on the entire training data, - while cross validation results are used as an evaluation and comparison metric. - Note that cross validation is done on the training set, so a final test set can still be used to - evaluate the final model. + In the end, the final model is still trained on the entire training data, + while cross-validation results are used as an evaluation and comparison metric. + Note that cross-validation is done on the training set, so a final test set can still be used to + evaluate the resulting model. \begin{figure}[h] \centering \includegraphics[width=0.6\textwidth]{img/cross_validation.png} - \caption{Cross validation example} + \caption{Cross-validation example} \end{figure} \item[Leave-one-out] \marginnote{Leave-one-out} - Extreme case of cross validation with $k=N$, the size of the training set. - In this case the whole dataset but one element is used for training and the remaining entry for testing. + Extreme case of cross-validation with $k=N$, the size of the training set. + In this case, the whole dataset but one element is used for training and the remaining entry for testing. \item[Bootstrap] \marginnote{Bootstrap} Statistical sampling of the dataset with replacement (i.e. an entry can be selected multiple times). @@ -192,7 +192,7 @@ Given a test set of $N$ element, possible metrics are: \item[Recall/Sensitivity] \marginnote{Recall} Number of true positives among the real positives - (i.e. how many real positive the model predicted). + (i.e. how many real positives the model predicted). \[ \text{recall} = \frac{TP}{TP + FN} \] \item[Specificity] \marginnote{Specificity} @@ -296,7 +296,7 @@ a macro (unweighted) average or a class-weighted average. \item[ROC curve] \marginnote{ROC curve} The ROC curve can be seen as a way to represent multiple confusion matrices of a classifier that uses different thresholds. - The x-axis of a ROC curve represent the false positive rate while the y-axis represent the true positive rate. + The x-axis of a ROC curve represents the false positive rate while the y-axis represents the true positive rate. 
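To make the ROC axes concrete, a worked confusion matrix with illustrative counts; only recall is spelled out in this hunk, so the false positive rate is taken with its usual definition $\text{FPR} = \frac{FP}{FP + TN}$.

\begin{example}[From a confusion matrix to a ROC point] \phantom{}\\
    Suppose a given threshold yields $TP = 80$, $FN = 20$, $FP = 30$, $TN = 70$.
    Then
    \[ \text{TPR} = \text{recall} = \frac{80}{80 + 20} = 0.8 \qquad
       \text{FPR} = \frac{30}{30 + 70} = 0.3 \]
    so this threshold corresponds to the point $(0.3, 0.8)$ of the ROC curve;
    varying the threshold moves the point and traces the whole curve.
\end{example}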
A straight line is used to represent a random classifier. A threshold can be considered good if it is high on the y-axis and low on the x-axis. @@ -314,7 +314,7 @@ A classifier may not perform well when predicting a minority class of the traini Possible solutions are: \begin{descriptionlist} \item[Undersampling] \marginnote{Undersampling} - Randomly reduce the number of example of the majority classes. + Randomly reduce the number of examples of the majority classes. \item[Oversampling] \marginnote{Oversampling} Increase the examples of the minority classes. @@ -324,7 +324,7 @@ Possible solutions are: \begin{enumerate} \item Randomly select an example $x$ belonging to the minority class. \item Select a random neighbor $z_i$ among its $k$-nearest neighbors $z_1, \dots, z_k$. - \item Synthetize a new example by selecting a random point of the feature space between $x$ and $z_i$. + \item Synthesize a new example by selecting a random point of the feature space between $x$ and $z_i$. \end{enumerate} \end{description} @@ -346,7 +346,7 @@ Possible solutions are: \begin{description} \item[Shannon theorem] \marginnote{Shannon theorem} Let $\matr{X} = \{ \vec{v}_1, \dots, \vec{v}_V \}$ be a data source where - each of the possible value has probability $p_i = \prob{\vec{v}_i}$. + each of the possible values has probability $p_i = \prob{\vec{v}_i}$. The best encoding allows to transmit $\matr{X}$ with an average number of bits given by the \textbf{entropy} of $X$: \marginnote{Entropy} \[ H(\matr{X}) = - \sum_j p_j \log_2(p_j) \] @@ -354,7 +354,7 @@ Possible solutions are: If $p_j \sim 1$, then the surprise of observing $\vec{v}_j$ is low, vice versa, if $p_j \sim 0$, the surprise of observing $\vec{v}_j$ is high. - Therefore, when $H(\matr{X})$ is high, $\matr{X}$ is close to an uniform distribution. + Therefore, when $H(\matr{X})$ is high, $\matr{X}$ is close to a uniform distribution. When $H(\matr{X})$ is low, $\matr{X}$ is close to a constant. \begin{example}[Binary source] \phantom{}\\ @@ -382,7 +382,7 @@ Possible solutions are: It is computed as: \[ IG(c \,\vert\, d \,:\, t) = H(c) - H(c \,\vert\, d \,:\, t) \] When $H(c \,\vert\, d \,:\, t)$ is low, $IG(c \,\vert\, d \,:\, t)$ is high - as splitting with threshold $t$ result in purer groups. + as splitting with threshold $t$ results in purer groups. Vice versa, when $H(c \,\vert\, d \,:\, t)$ is high, $IG(c \,\vert\, d \,:\, t)$ is low as splitting with threshold $t$ is not very useful. @@ -520,7 +520,7 @@ each node requires to process all the attributes. Assuming an average height of $O(\log N)$, the overall complexity for induction (parameters search) is $O(DN \log N)$. -Moreover, The other operations of a binary tree have complexity: +Moreover, the other operations of a binary tree have complexity: \begin{itemize} \item Threshold search and binary split: $O(N \log N)$ (scan the dataset for the threshold). \item Pruning: $O(N \log N)$ (requires to scan the dataset). @@ -584,8 +584,8 @@ This has complexity $O(h)$, with $h$ the height of the tree. \item[Smooting] If the value $e_{ij}$ of the domain of a feature $E_i$ never appears in the dataset, its probability $\prob{e_{ij} \mid c}$ will be 0 for all classes. - This nullifies all the probabilities that uses this feature when - computing the products chain during inference. + This nullifies all the probabilities that use this feature when + computing the product chain during inference. Smoothing methods can be used to avoid this problem. 
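A tiny numeric illustration of the zero-frequency problem just described (the probabilities are hypothetical):

\begin{example}[Zero-frequency problem] \phantom{}\\
    Suppose that, for a class $c$, three features have conditional probabilities
    $\prob{e_{1} \mid c} = 0.9$, $\prob{e_{2} \mid c} = 0.8$ and $\prob{e_{3} \mid c} = 0$
    (the third value never appears together with $c$ in the training set).
    The product computed at inference time is
    \[ \prob{c} \cdot 0.9 \cdot 0.8 \cdot 0 = 0 \]
    so $c$ can never be predicted for such an entry, no matter how strongly the other features support it.
\end{example}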
\begin{description} @@ -597,14 +597,14 @@ This has complexity $O(h)$, with $h$ the height of the tree. \item[$\vert \mathbb{D}_{E_i} \vert$] The number of distinct values in the domain of $E_i$. \item[\normalfont$\text{af}_{c}$] The absolute frequency of the class $c$. \end{descriptionlist} - the smoothed frequency is computed as: + The smoothed frequency is computed as: \[ \prob{e_{ij} \mid c} = \frac{\text{af}_{e_{ij}, c} + \alpha}{\text{af}_{c} + \alpha \vert \mathbb{D}_{E_i} \vert} \] A common value of $\alpha$ is 1. When $\alpha = 0$, there is no smoothing. - For higher values of $\alpha$, the smoothed feature gain more importance when computing the priors. + For higher values of $\alpha$, the smoothed feature gains more importance when computing the priors. \end{description} \item[Missing values] \marginnote{Missing values} @@ -704,7 +704,7 @@ In practice, a maximum number of iterations is set. \end{split} \] where $M$ is the margin, $w_i$ are the weights of the hyperplane and $c_i = \{-1, 1 \}$ is the class. - The second constraint imposes the hyperplane to have a large margine. + The second constraint imposes the hyperplane to have a large margin. For positive labels ($c_i=1$), this is true when the hyperplane is positive. For negative labels ($c_i=-1$), this is true when the hyperplane is negative. @@ -736,7 +736,7 @@ Then, the data and the boundary is mapped back into the original space. \caption{Example of mapping from $\mathbb{R}^2$ to $\mathbb{R}^3$} \end{figure} -The kernel trick allows to avoid to explicitly map the dataset into the new space by using kernel functions. +The kernel trick allows to avoid explicitly mapping the dataset into the new space by using kernel functions. Known kernel functions are: \begin{descriptionlist} \item[Linear] $K(x, y) = \langle x, y \rangle$. @@ -755,7 +755,7 @@ depending on the effectiveness of data caching. \begin{itemize} \item Training an SVM model is generally slower. \item SVM is not affected by local minimums. - \item SVM do not suffer the curse of dimensionality. + \item SVM does not suffer the curse of dimensionality. \item SVM does not directly provide probability estimates. If needed, these can be computed using a computationally expensive method. \end{itemize} @@ -771,10 +771,12 @@ depending on the effectiveness of data caching. \item[Activation function] \marginnote{Activation function} Activation functions are useful to add non-linearity. - In a linear system, if there is noise in the input, it is transferred to the output - (i.e. linearity implies that $f(x + \text{noise}) = f(x) + f(\text{noise})$). - On the other hand, a non-linear system is generally more robust - (i.e. non-linearity generally implies that $f(x + \text{noise}) \neq f(x) + f(\text{noise})$) + \begin{remark} + In a linear system, if there is noise in the input, it is transferred to the output + (i.e. linearity implies that $f(x + \text{noise}) = f(x) + f(\text{noise})$). + On the other hand, a non-linear system is generally more robust + (i.e. non-linearity generally implies that $f(x + \text{noise}) \neq f(x) + f(\text{noise})$) + \end{remark} \item[Feedforward neural network] \marginnote{Feedforward neural network} Network with the following flow: @@ -791,7 +793,7 @@ Inputs are fed to the network and backpropagation is used to update the weights. Size of the step for gradient descent. \item[Epoch] \marginnote{Epoch} - A round of training where the entire dataset has been processed. + A round of training where the entire dataset is processed. 
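To show the role of the learning rate concretely, the usual gradient-descent update is written out below; the update rule is the standard formulation (it is not spelled out in this hunk) and the numbers are illustrative.

\begin{example}[Learning rate as step size] \phantom{}\\
    Backpropagation updates each weight with the gradient-descent step
    \[ w \leftarrow w - \eta \frac{\partial E}{\partial w} \]
    where $E$ is the training error and $\eta$ the learning rate.
    If $\frac{\partial E}{\partial w} = 2$, the weight moves by $-0.2$ with $\eta = 0.1$ and by $-0.02$ with $\eta = 0.01$:
    a larger learning rate takes bigger (faster but possibly unstable) steps, a smaller one takes smaller (slower but more precise) steps.
\end{example}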
\item[Stopping criteria] \marginnote{Stopping criteria} Possible conditions to stop the training are: @@ -874,7 +876,7 @@ Different strategies to train an ensemble classifier can be used: \subsection{Random forests} \marginnote{Random forests} -Different decision trees trained on a different random sampling of the training set and different subset of features. +Multiple decision trees trained on a different random sampling of the training set and different subsets of features. A prediction is made by averaging the output of each tree. \begin{description} diff --git a/src/machine-learning-and-data-mining/sections/_clustering.tex b/src/machine-learning-and-data-mining/sections/_clustering.tex index 7474ec0..99178dd 100644 --- a/src/machine-learning-and-data-mining/sections/_clustering.tex +++ b/src/machine-learning-and-data-mining/sections/_clustering.tex @@ -10,7 +10,7 @@ \item[Dissimilarity] \marginnote{Dissimilarity} Measures how two objects differ. - 0 indicates no difference while the upper-bound varies. + 0 indicates no difference while the upper bound varies. \end{description} \begin{table}[ht] @@ -119,7 +119,7 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar \begin{description} \item[Pearson's correlation] \marginnote{Pearson's correlation} Measure of linear relationship between a pair of quantitative attributes $e_1$ and $e_2$. - To compute the Pearson's correlation, the values of $e_1$ and $e_2$ are first standardized and then ordered to obtain the vectors $\vec{e}_1$ and $\vec{e}_2$. + To compute Pearson's correlation, the values of $e_1$ and $e_2$ are first standardized and then ordered to obtain the vectors $\vec{e}_1$ and $\vec{e}_2$. The correlation is then computed as the dot product between $\vec{e}_1$ and $\vec{e}_2$: \[ \texttt{corr}(e_1, e_2) = \langle \vec{e}_1, \vec{e}_2 \rangle \] @@ -202,10 +202,10 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar Given the global centroid of the dataset $\vec{c}$ and $K$ clusters each with $N_i$ objects, the sum of squares between clusters is given by: - \[ \texttt{SSB} = \sum_{i=1}^{K} N_i \texttt{dist}(\vec{c}_i, \vec{c})^2 \] + \[ \texttt{SSB} = \sum_{i=1}^{K} N_i \cdot \texttt{dist}(\vec{c}_i, \vec{c})^2 \] \item[Total sum of squares] \marginnote{Total sum of squares} - Sum of the squared distances between the point of the dataset and the global centroid. + Sum of the squared distances between the points of the dataset and the global centroid. It can be shown that the total sum of squares can be computed as: \[ \texttt{TSS} = \texttt{SSE} + \texttt{SSB} \] @@ -217,7 +217,7 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar The Silhouette score of a data point $\vec{x}_i$ belonging to a cluster $K_i$ is given by two components: \begin{description} \item[Sparsity contribution] - The average distance of $\vec{x}_i$ to all other points in $K_i$: + The average distance of $\vec{x}_i$ to the other points in $K_i$: \[ a(\vec{x}_i) = \frac{1}{\vert K_i \vert - 1} \sum_{\vec{x}_j \in K_i, \vec{x}_j \neq \vec{x}_i} \texttt{dist}(\vec{x}_i, \vec{x}_j) \] \item[Separation contribution] @@ -278,7 +278,7 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar \item an encoding function $\texttt{encode}: \mathbb{R}^D \rightarrow [1, K]$; \item a decoding function $\texttt{decode}: [1, K] \rightarrow \mathbb{R}^D$. 
\end{itemize} - Distortion (or inertia) is defines as: + Distortion (or inertia) is defined as: \[ \texttt{distortion} = \sum_{i=1}^{N} \big(\vec{x}_i - \texttt{decode}(\texttt{encode}(\vec{x_i})) \big)^2 \] \begin{theorem} @@ -288,7 +288,7 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar \item The center of a point is the centroid of the cluster it belongs to. \end{enumerate} - Note that k-means alternates point 1 and 2. + Note that k-means alternates points 1 and 2. \begin{proof} The second point is derived by imposing the derivative of \texttt{distortion} to 0. @@ -311,7 +311,7 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar \begin{description} \item[Termination] There are a finite number of ways to cluster $N$ objects into $K$ clusters. - By construction, at each iteration the \texttt{distortion} is reduced. + By construction, at each iteration, the \texttt{distortion} is reduced. Therefore, k-means is guaranteed to terminate. \item[Non-optimality] @@ -320,11 +320,11 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar The starting configuration is usually composed of points distant as far as possible. \item[Noise] - Outliers heavily influences the clustering result. Sometimes, it is useful to remove them. + Outliers heavily influence the clustering result. Sometimes, it is useful to remove them. \item[Complexity] Given a $D$-dimensional dataset of $N$ points, - Running k-means for $T$ iterations to find $K$ clusters has complexity $O(TKND)$. + running k-means for $T$ iterations to find $K$ clusters has complexity $O(TKND)$. \end{description} \end{description} @@ -333,9 +333,9 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar \section{Hierarchical clustering} \begin{description} - \item[Dendogram] \marginnote{Dendogram} - Tree-like structure where the root is a cluster of all data points and - the leaves are clusters with a single data points. + \item[Dendrogram] \marginnote{Dendrogram} + Tree-like structure where the root is a cluster of all the data points and + the leaves are clusters with a single data point. \item[Agglomerative] \marginnote{Agglomerative} Starts with a cluster per data point and iteratively merges them (leaves to root). @@ -380,12 +380,12 @@ Given two $D$-dimensional data entries $p$ and $q$, possible distance metrics ar \begin{enumerate} \item Initialize a cluster for each data point. \item Compute the distance matrix between each cluster. - \item Merge the two clusters with lowest separation, - drop their values from the distance matrix and add an row/column for the newly created cluster. + \item Merge the two clusters with the lowest separation, + drop their values from the distance matrix and add a row/column for the newly created cluster. \item Go to point 2. if the number of clusters is greater than one. \end{enumerate} - After the construction of the dendogram, a cut \marginnote{Cut} can be performed at a user define level. + After the construction of the dendrogram, a cut \marginnote{Cut} can be performed at a user-defined level. A cut near the root will result in few bigger clusters. A cut near the leaves will result in numerous smaller clusters. @@ -441,9 +441,9 @@ Consider as clusters the high-density areas of the data space. \item $\vec{q}$ is a core point. \item There exists a sequence of points $\vec{s}_1, \dots, \vec{s}_z$ such that: \begin{itemize} - \item $\vec{s}_1$ is directly density reachable from $\vec{p}$. 
+ \item $\vec{s}_1$ is directly density reachable from $\vec{q}$. \item $\vec{s}_{i+1}$ is directly density reachable from $\vec{s}_i$. - \item $\vec{q}$ is directly density reachable from $\vec{s}_z$. + \item $\vec{p}$ is directly density reachable from $\vec{s}_z$. \end{itemize} \end{itemize} @@ -455,7 +455,7 @@ Consider as clusters the high-density areas of the data space. Determine clusters as maximal sets of density connected points. Border points not density connected to any core point are labeled as noise. - In other words, what happens it the following: + In other words, what happens is the following: \begin{itemize} \item Neighboring core points are part of the same cluster. \item Border points are part of the cluster of their nearest core point neighbor. @@ -480,7 +480,7 @@ Consider as clusters the high-density areas of the data space. \end{description} \item[Complexity] - Complexity of $O(N^2)$ reduced to $O(N \log N)$ if using spatial indexing. + Complexity of $O(N^2)$, reduced to $O(N \log N)$ if using spatial indexing. \end{description} \end{description} @@ -493,7 +493,7 @@ Consider as clusters the high-density areas of the data space. \begin{description} \item[Kernel function] \marginnote{Kernel function} - Symmetric and monotonically decreasing function to describe the influence of a data point to its neighbors. + Symmetric and monotonically decreasing function to describe the influence of a data point on its neighbors. A typical kernel function is the Gaussian. @@ -514,7 +514,7 @@ Consider as clusters the high-density areas of the data space. \item Derive a density function of the dataset. \item Identify local maximums and consider them as density attractors. \item Associate to each data point the density attractor in the direction of maximum increase. - \item Points associated to the same density attractor are part of the same cluster. + \item Points associated with the same density attractor are part of the same cluster. \item Remove clusters with a density attractor lower than $\xi$. \item Merge clusters connected through a path of points whose density is greater or equal to $\xi$ (e.g. in \Cref{img:denclue} the center area will result in many small clusters that can be merged with an appropriate $\xi$). diff --git a/src/machine-learning-and-data-mining/sections/_crisp.tex b/src/machine-learning-and-data-mining/sections/_crisp.tex index aa8d5d6..973741a 100644 --- a/src/machine-learning-and-data-mining/sections/_crisp.tex +++ b/src/machine-learning-and-data-mining/sections/_crisp.tex @@ -34,10 +34,10 @@ \item Data transformations. \end{itemize} -\section{Modelling} +\section{Modeling} \begin{itemize} - \item Select modelling technique. - \marginnote{Modelling} + \item Select modeling technique. + \marginnote{Modeling} \item Build/train the model. \end{itemize} diff --git a/src/machine-learning-and-data-mining/sections/_data_lake.tex b/src/machine-learning-and-data-mining/sections/_data_lake.tex index 4309164..c7c9cdc 100644 --- a/src/machine-learning-and-data-mining/sections/_data_lake.tex +++ b/src/machine-learning-and-data-mining/sections/_data_lake.tex @@ -18,7 +18,7 @@ Stored data can be classified as: \begin{descriptionlist} \item[Hot] \marginnote{Hot storage} - A low volume of highly requested data that require low latency. + A low volume of highly requested data that requires low latency. More expensive HW/SW. \item[Cold] \marginnote{Cold storage} A large amount of data that does not have latency requirements. 
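Stepping back to the cluster-validity measures in the clustering hunks above, a 1-D numeric check of the identity $\texttt{TSS} = \texttt{SSE} + \texttt{SSB}$; the points and the clustering are illustrative, and $\texttt{SSE}$ is taken with its usual definition as the within-cluster sum of squared distances to the centroids (its formula lies outside this diff).

\begin{example}[Checking $\texttt{TSS} = \texttt{SSE} + \texttt{SSB}$] \phantom{}\\
    Take the points $\{0, 2, 10, 12\}$ clustered as $\{0, 2\}$ and $\{10, 12\}$,
    with centroids $1$ and $11$ and global centroid $6$:
    \[ \texttt{SSE} = 1 + 1 + 1 + 1 = 4 \qquad
       \texttt{SSB} = 2 \cdot (1 - 6)^2 + 2 \cdot (11 - 6)^2 = 100 \]
    \[ \texttt{TSS} = 36 + 16 + 16 + 36 = 104 = \texttt{SSE} + \texttt{SSB} \]
\end{example}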
@@ -95,9 +95,8 @@ \section{Components} \subsection{Data ingestion} -\marginnote{Data ingestion} \begin{descriptionlist} - \item[Workload migration] + \item[Workload migration] \marginnote{Data ingestion} Inserting all the data from an existing source. \item[Incremental ingestion] Inserting changes since the last ingestion. @@ -123,7 +122,7 @@ \begin{description} \item[Columnar storage] \phantom{} \begin{itemize} - \item Homogenous data are stores contiguously. + \item Homogenous data are stored contiguously. \item Speeds up methods that process entire columns (i.e. all the values of a feature). \item Insertion becomes slower. \end{itemize} @@ -134,9 +133,8 @@ \end{description} \subsection{Processing and analytics} -\marginnote{Processing and analytics} \begin{descriptionlist} - \item[Interactive analytics] + \item[Interactive analytics] \marginnote{Processing and analytics} Interactive queries to large volumes of data. The results are stored back in the data lake. \item[Big data analytics] @@ -149,11 +147,13 @@ \section{Architectures} \subsection{Lambda lake} -\marginnote{Lambda lake} \begin{description} - \item[Batch layer] Receives and stores the data. Prepares the batch views for the serving layer. - \item[Serving layer] Indexes batch views for faster queries. - \item[Speed layer] Receives the data and prepares real-time views. The views are also stored in the serving layer. + \item[Batch layer] \marginnote{Lambda lake} + Receives and stores the data. Prepares the batch views for the serving layer. + \item[Serving layer] + Indexes batch views for faster queries. + \item[Speed layer] + Receives the data and prepares real-time views. The views are also stored in the serving layer. \end{description} \begin{figure}[ht] \centering @@ -190,7 +190,7 @@ Framework that adds features on top of an existing data lake. \section{Metadata} \marginnote{Metadata} -Metadata are used to organize a data lake. +Metadata is used to organize a data lake. Useful metadata are: \begin{descriptionlist} \item[Source] Origin of the data. diff --git a/src/machine-learning-and-data-mining/sections/_data_prepro.tex b/src/machine-learning-and-data-mining/sections/_data_prepro.tex index 95a1d88..f75a91e 100644 --- a/src/machine-learning-and-data-mining/sections/_data_prepro.tex +++ b/src/machine-learning-and-data-mining/sections/_data_prepro.tex @@ -21,9 +21,9 @@ Useful for: \section{Sampling} \marginnote{Sampling} Sampling can be used when the full dataset is too expensive to obtain or too expensive to process. -Obviously a sample has to be representative. +Obviously, a sample has to be representative. -Type of sampling techniques are: +The types of sampling techniques are: \begin{descriptionlist} \item[Simple random] \marginnote{Simple random} Extraction of a single element following a given probability distribution. @@ -45,7 +45,7 @@ Type of sampling techniques are: \begin{description} \item[Sample size] The sampling size represents a tradeoff between data reduction and precision. - In a labeled dataset, it is important to consider the probability of sampling data of all the possible classes. + In a labeled dataset, it is important to consider the probability of sampling data from all the possible classes. \end{description} @@ -121,7 +121,7 @@ Possible approaches are: For each entry, if its feature $E$ has value $e_i$, then $H_{e_i} = \texttt{true}$ and the rests are \texttt{false}. 
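A concrete illustration of the one-hot encoding just described; the feature and its domain are made up.

\begin{example}[One-hot encoding] \phantom{}\\
    A feature $\texttt{color}$ with domain $\{\texttt{red}, \texttt{green}, \texttt{blue}\}$ is replaced by
    three binary features $H_{\texttt{red}}$, $H_{\texttt{green}}$, $H_{\texttt{blue}}$.
    An entry with $\texttt{color} = \texttt{green}$ is encoded as
    $(H_{\texttt{red}}, H_{\texttt{green}}, H_{\texttt{blue}}) = (\texttt{false}, \texttt{true}, \texttt{false})$.
\end{example}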
\subsection{Ordinal encoding} \marginnote{Ordinal encoding} - A feature whose values have an ordering can be converted in a consecutive sequence of integers + A feature whose values have an ordering can be converted into a consecutive sequence of integers (e.g. ["good", "neutral", "bad"] $\mapsto$ [1, 0, -1]). \subsection{Discretization} \marginnote{Discretization} diff --git a/src/machine-learning-and-data-mining/sections/_data_warehouse.tex b/src/machine-learning-and-data-mining/sections/_data_warehouse.tex index 77a7379..8c80ebf 100644 --- a/src/machine-learning-and-data-mining/sections/_data_warehouse.tex +++ b/src/machine-learning-and-data-mining/sections/_data_warehouse.tex @@ -7,13 +7,13 @@ Deliver the right information to the right people at the right time through the right channel. \item[\Ac{dwh}] \marginnote{\Acl{dwh}} - Optimized repository that stores information for decision making processes. + Optimized repository that stores information for decision-making processes. \Acp{dwh} are a specific type of \ac{dss}. Features: \begin{itemize} - \item Subject-oriented: focused on enterprise specific concepts. - \item Integrates data from different sources and provides an unified view. + \item Subject-oriented: focused on enterprise-specific concepts. + \item Integrates data from different sources and provides a unified view. \item Non-volatile storage with change tracking. \end{itemize} @@ -143,7 +143,7 @@ Operational data may contain: \item[Missing data] \item[Improper use of fields] (e.g. saving the phone number in the \texttt{notes} field) \item[Wrong values] (e.g. 30th of February) - \item[Inconsistency] (e.g. use of different abbreviations) + \item[Inconsistencies] (e.g. use of different abbreviations) \item[Typos] \end{descriptionlist} @@ -154,10 +154,10 @@ Methods to clean and increase the quality of the data are: Applicable if the domain is known and limited. \item[Approximate merging] \marginnote{Approximate merging} - Merging data that do not have a common key. + Methods to merge data that do not have a common key. \begin{description} \item[Approximate join] - Use non-key attributes to join two tables (e.g. using the name and surname instead of an unique identifier). + Use non-key attributes to join two tables (e.g. using the name and surname instead of a unique identifier). \item[Similarity approach] Use similarity functions (e.g. edit distance) to merge multiple instances of the same information @@ -178,7 +178,7 @@ Data are transformed to respect the format of the data warehouse: Creating new information by using existing attributes (e.g. compute profit from receipts and expenses) \item[Separation and concatenation] \marginnote{Separation and concatenation} - Denormalization of the data: introduces redundances (i.e. breaks normal form\footnote{\url{https://en.wikipedia.org/wiki/Database_normalization}}) + Denormalization of the data: introduces redundancies (i.e. breaks normal form\footnote{\url{https://en.wikipedia.org/wiki/Database_normalization}}) to speed up operations. \end{descriptionlist} @@ -332,13 +332,14 @@ Aggregation operators can be classified as: \end{description} -\subsection{Logical design} + +\section{Logical design} \marginnote{Logical design} Defining the data structures (e.g. tables and relationships) according to a conceptual model. -There are mainly two strategies: +There are two main strategies: \begin{descriptionlist} \item[Star schema] \marginnote{Star schema} - A fact table that contains all the measures and linked to dimensional tables. 
+ A fact table that contains all the measures is linked to dimensional tables. \begin{figure}[ht] \centering \includegraphics[width=\textwidth]{img/logical_star_schema.png} @@ -346,8 +347,8 @@ There are mainly two strategies: \end{figure} \item[Snowflake schema] \marginnote{Snowflake schema} - A star schema variant with partially normalized dimension tables. - \begin{figure}[ht] + A star schema variant with partially normalized dimensional tables. + \begin{figure}[H] \centering \includegraphics[width=\textwidth]{img/logical_snowflake_schema.png} \caption{Example of snowflake schema} diff --git a/src/machine-learning-and-data-mining/sections/_intro.tex b/src/machine-learning-and-data-mining/sections/_intro.tex index 2c372d5..5847519 100644 --- a/src/machine-learning-and-data-mining/sections/_intro.tex +++ b/src/machine-learning-and-data-mining/sections/_intro.tex @@ -30,7 +30,7 @@ \subsection{Software} \begin{description} \item[\Ac{oltp}] \marginnote{\Acl{oltp}} - Class of programs to support transaction oriented applications and data storage. + Class of programs to support transaction-oriented applications and data storage. Suitable for real-time applications. \item[\Ac{erp}] \marginnote{\Acl{erp}} @@ -41,10 +41,10 @@ \subsection{Insight} -Decision can be classified as: +Decisions can be classified as: \begin{descriptionlist} \item[Structured] \marginnote{Structured decision} - Established and well understood situations. + Established and well-understood situations. What is needed is known. \item[Unstructured] \marginnote{Unstructured decision} Unplanned and unclear situations. @@ -54,18 +54,18 @@ Decision can be classified as: Different levels of insight can be extracted by: \begin{descriptionlist} \item[\Ac{mis}] \marginnote{\Acl{mis}} - Standardized reporting system built on existing \ac{oltp}. + Standardized reporting system built on an existing \ac{oltp}. Used for structured decisions. \item[\Ac{dss}] \marginnote{\Acl{dss}} Analytical system to provide support for unstructured decisions. \item[\Ac{eis}] \marginnote{\Acl{eis}} - Formulate high level decisions that impact the organization. + Formulate high-level decisions that impact the organization. \item[\Ac{olap}] \marginnote{\Acl{olap}} Grouped analysis of multidimensional data. - Involves large amount of data. + Involves a large amount of data. \item[\Ac{bi}] \marginnote{\Acl{bi}} Applications, infrastructure, tools and best practices to analyze information. @@ -75,7 +75,7 @@ Different levels of insight can be extracted by: \begin{description} \item[Big data] \marginnote{Big data} - Large and/or complex and/or fast changing collection of data that traditional DBMSs are unable to process. + Large and/or complex and/or fast-changing collection of data that traditional DBMSs are unable to process. \begin{description} \item[Structured] e.g. relational tables. \item[Unstructured] e.g. videos. diff --git a/src/machine-learning-and-data-mining/sections/_machine_learning.tex b/src/machine-learning-and-data-mining/sections/_machine_learning.tex index 245e0bc..27cd7cb 100644 --- a/src/machine-learning-and-data-mining/sections/_machine_learning.tex +++ b/src/machine-learning-and-data-mining/sections/_machine_learning.tex @@ -11,7 +11,7 @@ \item[Regression] Estimation of a numeric value. \item[Similarity matching] Identify similar individuals. \item[Clustering] Grouping individuals based on their similarities. - \item[Co-occurrence groupping] Identify associations between entities based on the transactions in which they appear together. 
+ \item[Co-occurrence grouping] Identify associations between entities based on the transactions in which they appear together. \item[Profiling] Behavior description. \item[Link analysis] Analysis of connections (e.g. in a graph). \item[Data reduction] Reduce the dimensionality of data with minimal information loss. @@ -69,7 +69,7 @@ \textbf{Operators.} $=$, $\neq$, $<$, $>$, $\leq$, $\geq$, $+$, $-$ \begin{example} - Celsius and Fahrenheit temperature scales, CGPA, time. + Celsius and Fahrenheit temperature scales, CGPA, time, \dots. For instance, there is a $6.25\%$ increase from $16\text{°C}$ to $17\text{°C}$, but converted in Fahrenheit, the increase is of $2.96\%$ (from $60.8\text{°F}$ to $62.6\text{°F}$). @@ -157,14 +157,14 @@ \item[Missing values] \marginnote{Missing values} Data that have not been collected. Sometimes they are not easily recognizable - (e.g. when special values are used, instead of \texttt{null}, to mark missing data). + (e.g. when special values are used to mark missing data instead of \texttt{null}). Can be handled in different ways: \begin{itemize} \item Ignore the records with missing values. \item Estimate or default missing values. \item Ignore the fact that some values are missing (not always applicable). - \item Insert all the possible values and weight them by their probability. + \item Insert all the possible values and weigh them by their probability. \end{itemize} \item[Duplicated data] \marginnote{Duplicated data} diff --git a/src/machine-learning-and-data-mining/sections/_regression.tex b/src/machine-learning-and-data-mining/sections/_regression.tex index 1b65d57..23d42ec 100644 --- a/src/machine-learning-and-data-mining/sections/_regression.tex +++ b/src/machine-learning-and-data-mining/sections/_regression.tex @@ -27,7 +27,7 @@ \begin{itemize} \item MSE is influenced by the magnitude of the data. \item It measures the fitness of a model in absolute terms. - \item It is suited to compare different models. + % \item It is suited to compare different models. \end{itemize} \item[Coefficient of determination] \marginnote{Coefficient of determination}
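To make the remark on the magnitude-dependence of MSE concrete, a small worked example; the values are illustrative and MSE is taken with its standard definition $\text{MSE} = \frac{1}{N} \sum_i (y_i - \hat{y}_i)^2$ (the formula itself lies outside this diff).

\begin{example}[MSE and the magnitude of the data] \phantom{}\\
    With targets $(1, 2, 3)$ and predictions $(1.5, 2.5, 2.5)$, every squared error is $0.25$, so $\text{MSE} = 0.25$.
    Scaling both targets and predictions by $10$ gives targets $(10, 20, 30)$ and predictions $(15, 25, 25)$:
    the fit is relatively unchanged, but every squared error becomes $25$ and $\text{MSE} = 25$, i.e. $100$ times larger.
    This is why MSE measures the fitness of a model only in absolute terms.
\end{example}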