\begin{description}
\item[Training data]
Manually annotated terms of service.
\item[Tasks] Two tasks are solved:
\begin{description}
\item[Detection] Binary classification problem aimed at determining whether a sentence contains a potentially unfair clause.
\item[Sentence classification] Classification problem of determining the category of the unfair clause.
\end{description}
\item[Experimental setup]
Leave-one-out: one document is used as the test set, and the remaining documents are split into training ($\frac{4}{5}$) and validation ($\frac{1}{5}$) sets.
\item[Metrics] Precision, recall, F1.
\end{description}
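The leave-one-out setup and the metrics above can be sketched as follows; \texttt{loo\_splits} and \texttt{precision\_recall\_f1} are illustrative helpers, not part of the CLAUDETTE codebase.

```python
def loo_splits(documents):
    """Yield (train, val, test) splits: one document is held out as the
    test set, the rest is split 4/5 training / 1/5 validation."""
    for i, test_doc in enumerate(documents):
        rest = documents[:i] + documents[i + 1:]
        cut = (4 * len(rest)) // 5
        yield rest[:cut], rest[cut:], test_doc

def precision_recall_f1(y_true, y_pred):
    """Binary precision/recall/F1 for the unfair-clause detection task."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```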
\subsection{Base clause classifier}
The methods experimented with were:
\begin{itemize}
\item Bag-of-words,
\item Tree kernels,
\item CNN,
\item SVM,
\item \dots
\end{itemize}
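As an illustration of the bag-of-words baseline, a sentence can be mapped to a count vector over a fixed vocabulary; the vocabulary and sentence below are made up for the example.

```python
from collections import Counter

def bag_of_words(sentence, vocabulary):
    """Map a sentence to a count vector over a fixed vocabulary."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

vocab = ["liability", "damages", "may", "exclude"]
vec = bag_of_words("We may exclude liability for any damages", vocab)
```

The resulting vectors would then be fed to a classifier such as an SVM.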
\subsection{Background knowledge injection}
\begin{description}
\item[Memory-augmented neural network] \marginnote{Memory-augmented neural network}
Model that, given a query, retrieves relevant knowledge from the memory and combines it with the query to produce the prediction.
In CLAUDETTE, the knowledge base is composed of all the possible rationales for which a clause can be unfair. The workflow is the following:
\begin{enumerate}
\item The clause is used to query the knowledge base using a similarity score and the most relevant rationale is extracted.
\item The rationale is combined with the query.
\item Repeat the extraction step until the similarity score falls below a threshold.
\item Make the prediction and provide the rationales used as explanation.
\end{enumerate}
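The retrieval loop above can be sketched as follows; the Jaccard similarity is an illustrative stand-in for the learned similarity score, and the threshold value is arbitrary.

```python
def jaccard(a, b):
    """Word-overlap similarity, a stand-in for the learned score."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def retrieve_rationales(clause, knowledge_base, threshold=0.2):
    """Return the rationales supporting the prediction, most similar first."""
    query, used = clause, []
    remaining = list(knowledge_base)
    while remaining:
        best = max(remaining, key=lambda r: jaccard(query, r))
        if jaccard(query, best) < threshold:
            break  # similarity too low: stop extracting
        used.append(best)
        remaining.remove(best)
        query = query + " " + best  # combine the rationale with the query
    return used
```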
\end{description}
\begin{example}[Knowledge base for liability exclusion]
Rationales are divided into six classes of clauses:
\begin{itemize}
\item Kind of damage,
\item Standard of care,
\item Cause,
\item Causal link,
\item Liability theory,
\item Compensation amount.
\end{itemize}
\end{example}
\subsection{Multilingualism}
\begin{description}
\item[Training data]
The same terms of service as in the original CLAUDETTE corpus, selected according to the following criteria:
\begin{itemize}
\item The ToS is available in the target language,
\item There is a correspondence in terms of version or publication date between the documents in the two languages,
\item There are structural similarities between the documents in the two languages.
\end{itemize}
\end{description}
\begin{description}
\item[Approaches] Different strategies have been experimented with:
\begin{description}
\item[Novel corpus for target language] \marginnote{Novel corpus for target language}
Retrain CLAUDETTE from scratch with newly annotated data in the target language.
\item[Semi-automated creation of corpus through projection] \marginnote{Semi-automated creation of corpus through projection}
Method that works as follows:
\begin{enumerate}
\item Use machine translation to translate the annotated English document into the target language, carrying over the unfair clause annotations.
\item Match the machine-translated document with the original document in the target language and project the unfair clauses onto it (through human annotation).
\item Train CLAUDETTE from scratch.
\end{enumerate}
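The projection step can be sketched by matching each machine-translated annotated sentence to the closest sentence of the original target-language document; \texttt{difflib}'s ratio is an illustrative stand-in for a proper alignment method, and the sentences are made up.

```python
from difflib import SequenceMatcher

def project_annotations(annotated_mt, original_sentences):
    """Map each (machine-translated sentence, label) pair onto the closest
    sentence of the original target-language document."""
    projected = []
    for mt_sentence, label in annotated_mt:
        best = max(original_sentences,
                   key=lambda s: SequenceMatcher(None, mt_sentence, s).ratio())
        projected.append((best, label))
    return projected
```

In practice the matches would still be verified by a human annotator, as stated above.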
\item[Training set translation] \marginnote{Training set translation}
Translate the original document to the target language and train CLAUDETTE from scratch.
\begin{remark}
This method does not require human annotation.
\end{remark}
\item[Machine translation of queries] \marginnote{Machine translation of queries}
Method that works as follows:
\begin{enumerate}
\item Translate the document from the target language to English.
\item Feed the translated document to CLAUDETTE.
\item Translate CLAUDETTE's annotated English output back to the target language.
\end{enumerate}
\begin{remark}
This method does not require retraining.
\end{remark}
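The translate-classify-translate-back pipeline can be sketched as below; \texttt{translate\_to\_en}, \texttt{claudette\_detect}, and \texttt{translate\_from\_en} are hypothetical stand-ins for a machine-translation service and the trained English CLAUDETTE model.

```python
def detect_unfair_clauses(document, translate_to_en, claudette_detect,
                          translate_from_en):
    """Run the English model on a target-language document via translation."""
    english = translate_to_en(document)       # 1. target language -> English
    flagged = claudette_detect(english)       # 2. run English CLAUDETTE
    return [translate_from_en(c) for c in flagged]  # 3. back to target language
```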
\end{description}
\end{description}
\section{CLAUDETTE and GDPR}
\begin{description}
\item[CLAUDETTE for GDPR compliance]
To integrate CLAUDETTE as a tool to check GDPR compliance, three dimensions are checked, each comprising several categories (each ranked on three levels of achievement):
\begin{descriptionlist}
\item[Comprehensiveness of information] \marginnote{Comprehensiveness of information}
Whether the policy contains all the information required by articles 13 and 14 of the GDPR.
Categories of this dimension include:
\begin{itemize}
\item Contact information of the controller,
\item Contact information of the data protection officer,
\item Purpose and legal bases for processing,
\item Category of personal data processed,
\item \dots
\end{itemize}
\item[Substantive compliance] \marginnote{Substantive compliance}
Whether the personal data processing described in the policy complies with the GDPR.
Categories of this dimension include:
\begin{itemize}
\item Processing of sensitive data,
\item Processing of children's data,
\item Consent by using, take-it-or-leave-it consent,
\item Transfer to third parties or countries,
\item Policy change (e.g., if the data subject is notified),
\item Licensing data,
\item Advertising.
\end{itemize}
\item[Clarity of expression] \marginnote{Clarity of expression}
Whether the policy is precise and understandable (i.e., transparent).
Categories of this dimension include:
\begin{itemize}
\item Conditional terms: the performance of an action is dependent on a variable trigger.
\begin{remark}
Typical language qualifiers to identify this category are: depending, as necessary, as appropriate, as needed, otherwise reasonably, sometimes, from time to time, \dots
\end{remark}
\begin{example}
``\textit{We also may share your information if we believe, in our sole discretion, that such disclosure is \underline{necessary} \textnormal{\dots}}''
\end{example}
\item Generalization: terms to abstract practices with an unclear context.
\begin{remark}
Typical language qualifiers to identify this category are: generally, mostly, widely, general, commonly, usually, normally, typically, largely, often, primarily, among other things, \dots
\end{remark}
\begin{example}
``\textit{We \underline{typically} or \underline{generally} collect information \dots When you use an Application on a Device, we will collect and use information about you in \underline{generally} similar ways and for similar purposes as when you use the TripAdvisor website.}''
\end{example}
\item Modality: terms that ambiguously refer to the possibility of actions or events.
\begin{remark}
Typical language qualifiers to identify this category are: may, might, could, would, possible, possibly, \dots
Note that these qualifiers have two possible meanings: possibility and permission. This category only deals with possibility.
\end{remark}
\begin{example}
``\textit{We \underline{may} use your personal data to develop new services.}''
\end{example}
\item Non-specific numeric quantifiers: terms that are ambiguous in terms of actual measure.
\begin{remark}
Typical language qualifiers to identify this category are: certain, numerous, some, most, many, various, including (but not limited to), variety, \dots
\end{remark}
\begin{example}
``\textit{\textnormal{\dots}we may collect a \underline{variety} of information, \underline{including} your name, mailing address, phone number, email address, \dots}''
\end{example}
\end{itemize}
\end{descriptionlist}
\end{description}
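A qualifier-based detector for the four clarity categories above can be sketched as follows. The qualifier lists are abridged from the examples in the text; CLAUDETTE itself uses trained classifiers, not plain keyword matching.

```python
import re

# Abridged qualifier lists, one per clarity category from the text.
QUALIFIERS = {
    "conditional": ["depending", "as necessary", "as appropriate",
                    "sometimes", "from time to time"],
    "generalization": ["generally", "typically", "usually", "commonly",
                       "normally", "largely", "among other things"],
    "modality": ["may", "might", "could", "possibly"],
    "numeric": ["certain", "some", "most", "many", "various", "variety",
                "including"],
}

def vague_categories(sentence):
    """Return the clarity categories whose qualifiers occur in the sentence."""
    found = set()
    lowered = sentence.lower()
    for category, terms in QUALIFIERS.items():
        for term in terms:
            if re.search(r"\b" + re.escape(term) + r"\b", lowered):
                found.add(category)
                break
    return found
```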
\section{LLMs and privacy policies}
\begin{remark}
The GDPR requires two competing properties for privacy policies:
\begin{descriptionlist}
\item[Comprehensiveness] The policy should contain all the relevant information.
\item[Comprehensibility] The policy should be easily understandable.
\end{descriptionlist}
\end{remark}
\begin{description}
\item[Comprehensive policy from LLMs]
Formulate privacy policies for comprehensiveness and let LLMs extract the relevant information.
A template for a comprehensive policy could include:
\begin{itemize}
\item Categories of personal data collected,
\item Purpose each category of data is processed for,
\item Legal basis for processing each category,
\item Storage period or deletion criteria,
\item Recipients or categories of recipients the data is shared with, their role, the purpose of sharing, and the legal basis.
\end{itemize}
\end{description}
\begin{description}
\item[Experimental setup]
The following questions were defined to assess a privacy policy:
\begin{enumerate}
\item What data does the company process about me?
\item For what purposes does the company use my email address?
\item Who does the company share my geolocation with?
\item What types of data are processed on the basis of consent, and for what purposes?
\item What data does the company share with Facebook?
\item Does the company share my data with insurers?
\item What categories of data does the company collect about me automatically?
\item How can I contact the company if I want to exercise my rights?
\item How long does the company keep my delivery address?
\end{enumerate}
Three scenarios were considered:
\begin{itemize}
\item Human evaluation of the questions on existing privacy policies,
\item LLMs to answer the questions on ideal mock policies (with human evaluation),
\item LLMs to answer the questions on real policies (with human evaluation).
\end{itemize}
Results show that:
\begin{itemize}
\item LLMs have high performance on the mock policies.
\item LLMs and humans struggle to answer the questions on real privacy policies.
\end{itemize}
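Posing the assessment questions to an LLM can be sketched as follows; the prompt template is a hypothetical illustration (the questions are from the list above), and the model call itself is left abstract.

```python
# A few of the assessment questions defined in the experimental setup.
QUESTIONS = [
    "What data does the company process about me?",
    "For what purposes does the company use my email address?",
    "How long does the company keep my delivery address?",
]

def build_prompt(policy_text, question):
    """Pair the policy text with one assessment question for the LLM."""
    return (
        "You are given a privacy policy. Answer the question using only "
        "the policy text.\n\n"
        f"Policy:\n{policy_text}\n\nQuestion: {question}\nAnswer:"
    )
```

Each answer would then be judged by a human evaluator, as in the three scenarios above.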
\end{description}