\chapter{Instrumental learning}
Instrumental learning is a form of control learning that aims to learn action-outcome associations:
\begin{itemize}
\item When a reinforcer is likely to occur.
\item Which actions lead to those reinforcers.
\end{itemize}
This allows the animal to act in anticipation of a reinforcer.

Instrumental learning includes:
\begin{descriptionlist}
\item[Habitual system] \marginnote{Habitual system}
Learns to repeat previously successful actions.

\item[Goal-directed system] \marginnote{Goal-directed system}
Evaluates actions based on their anticipated consequences.
\end{descriptionlist}
Depending on the outcome, the effect on behavior varies:
\begin{descriptionlist}
\item[Positive reinforcement] \marginnote{Positive reinforcement}
Delivering an appetitive outcome after an action increases the probability of emitting that action.

\item[Positive punishment] \marginnote{Positive punishment}
Delivering an aversive outcome after an action decreases the probability of emitting that action.

\item[Negative reinforcement] \marginnote{Negative reinforcement}
Omitting an aversive outcome after an action increases the probability of emitting that action.

\item[Negative punishment] \marginnote{Negative punishment}
Omitting an appetitive outcome after an action decreases the probability of emitting that action.
\end{descriptionlist}

\begin{table}[H]
\centering
\begin{tabular}{r|cc}
\toprule
& \textbf{Delivery} & \textbf{Omission} \\
\midrule
\textbf{Appetitive} & Positive reinforcement (\texttt{+prob}) & Negative punishment (\texttt{-prob}) \\
\textbf{Aversive} & Positive punishment (\texttt{-prob}) & Negative reinforcement (\texttt{+prob}) \\
\bottomrule
\end{tabular}
\caption{Summary of the possible effects on response probability}
\end{table}
\section{Types of schedule}
There are two types of schedule:
\begin{descriptionlist}
\item[Continuous schedule] \marginnote{Continuous schedule}
The desired action is followed by the outcome every time.
\begin{remark}
More effective for teaching a new association.
\end{remark}

\item[Partial schedule] \marginnote{Partial schedule}
The desired action is not always followed by the outcome.
\begin{remark}
Learning is slower but the response is more resistant to extinction.
\end{remark}

There are four types of partial schedules (a simulation sketch follows the list):
\begin{descriptionlist}
\item[Fixed-ratio]
Outcome available after a specific number of responses.

This results in a high and steady rate of response, with a brief pause after the outcome is delivered.

\item[Variable-ratio]
Outcome available after an unpredictable number of responses.

This results in a high and steady rate of response.

\item[Fixed-interval]
Outcome available after a specific interval of time.

This results in a high rate of response near the end of the interval and a slowdown after the outcome is delivered.

\item[Variable-interval]
Outcome available after an unpredictable interval of time.

This results in a slow and steady rate of response.
\end{descriptionlist}
\end{descriptionlist}
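
The four partial schedules boil down to simple rules for when the outcome becomes available. The following minimal Python sketch makes these rules explicit; all function names, parameters and default values are illustrative assumptions, not taken from the course material.
\begin{verbatim}
import random

# Illustrative sketch of the four partial reinforcement schedules.
# Each predicate answers: "does this response produce the outcome?"

def fixed_ratio(responses_since_outcome, ratio=5):
    # Reinforce after a fixed number of responses.
    return responses_since_outcome >= ratio

def variable_ratio(mean_ratio=5):
    # Reinforce after an unpredictable number of responses
    # (random-ratio approximation: each response pays off with prob 1/mean).
    return random.random() < 1.0 / mean_ratio

def fixed_interval(time_since_outcome, interval=30.0):
    # Reinforce the first response emitted after a fixed interval.
    return time_since_outcome >= interval

def variable_interval(time_since_outcome, sampled_interval):
    # Reinforce the first response after an unpredictable interval,
    # e.g. sampled_interval = random.expovariate(1 / mean_interval).
    return time_since_outcome >= sampled_interval
\end{verbatim}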

\begin{minipage}{0.55\linewidth}
\begin{casestudy}[Aplysia californica]
An \textit{Aplysia californica} withdraws its gill when its siphon is stimulated.
\begin{itemize}
\item Repeated mild stimulations induce habituation of the reflex.
\item Repeated intense stimulations induce sensitization of the reflex.
\end{itemize}
\end{casestudy}
\end{minipage}
\begin{minipage}{0.4\linewidth}
\centering
\includegraphics[width=0.9\linewidth]{./img/gill_habituation.png}
\end{minipage}
\section{Dopamine}
There is evidence that dopamine is involved in learning action-outcome associations.

\begin{description}
\item[Striatal activity on unexpected events] \marginnote{Striatal activity on unexpected events}
When an unexpected event occurs, the activity of the striatum changes:
it increases when the feedback is positive and decreases when the feedback is negative.

\begin{casestudy}[Microelectrodes in substantia nigra]
\phantom{}\\
\begin{minipage}{0.7\linewidth}
The activity of the substantia nigra of patients with Parkinson's disease is measured during a probabilistic instrumental learning task.
The task consists of repeatedly drawing a card from one of two decks, followed by positive or negative feedback depending on the deck.
\end{minipage}
\begin{minipage}{0.3\linewidth}
\centering
\includegraphics[width=0.95\linewidth]{./img/instrumental_dopamine_sn1.png}
\end{minipage}

The increase and decrease in striatal activity can be clearly seen when the feedback is unexpected.
\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{./img/instrumental_dopamine_sn2.png}
\end{figure}
\end{casestudy}

\item[Dopamine effect on behavior] \marginnote{Dopamine effect on behavior}
The level of dopamine modulates learning behavior (see the formulation sketched after this list):
\begin{itemize}
\item Low levels of dopamine impair learning from positive feedback.

This happens because positive prediction errors cannot be signaled.

\item High levels of dopamine impair learning from negative feedback.

This happens because negative prediction errors cannot be signaled.
\end{itemize}
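
In reinforcement-learning terms (a minimal formulation, not necessarily the notation used in the course), the reward prediction error after receiving an outcome $r$ with current expectation $V$ is
\[
    \delta = r - V
    \qquad\qquad
    V \leftarrow V + \alpha\,\delta
\]
where $\alpha$ is a learning rate. Phasic dopamine is thought to track $\delta$: bursts above baseline for $\delta > 0$ and dips below baseline for $\delta < 0$. With little dopamine available, positive $\delta$ cannot be expressed as a burst; with too much dopamine, negative $\delta$ cannot be expressed as a dip.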

\begin{casestudy}[Probabilistic selection task]
This instrumental learning task has two phases:
\begin{descriptionlist}
\item[Learning]
There are three pairs of stimuli (symbols); at each trial, one pair is presented and the participant selects one of its symbols.
Within each pair, one symbol has a higher probability of yielding positive feedback, while the other is more likely to yield negative feedback.
Moreover, the probabilities differ among the three pairs.

\begin{center}
\includegraphics[width=0.55\linewidth]{./img/instrumental_dopamine_selection1.png}
\end{center}

Participants must learn by trial and error which stimulus in each pair is more likely to lead to positive feedback.
Note that learning could be accomplished by:
\begin{itemize}
\item Recognizing the more rewarding stimulus.
\item Recognizing the less rewarding stimulus.
\item Both.
\end{itemize}

\item[Testing]
Aims to assess whether participants learned to select the stimuli associated with positive feedback or to avoid those associated with negative feedback.

The same task as above is repeated, but all combinations of stimuli across the three pairs are possible.
\end{descriptionlist}

Three groups of participants are considered for this experiment:
\begin{enumerate}
\item Those who took cabergoline (a D2 dopamine agonist).
\item Those who took haloperidol (a D2 dopamine antagonist).
\item Those who took a placebo (a drug without effects).
\end{enumerate}
At the low doses used, both drugs are thought to act mainly on presynaptic autoreceptors, so cabergoline effectively reduces and haloperidol effectively increases phasic dopamine release.

\begin{center}
\includegraphics[width=0.55\linewidth]{./img/instrumental_dopamine_selection2.png}
\end{center}

Results show that:
\begin{enumerate}
\item Cabergoline impaired learning from positive feedback.
\item Haloperidol enhanced learning from positive feedback.
\item The placebo group learned from positive and negative feedback equally well.
\end{enumerate}
\end{casestudy}

\begin{casestudy}
It has been observed that:
\begin{itemize}
\item Reward prediction errors are correlated with activity in the left posterior putamen and left ventral striatum.
\item Punishment prediction errors are correlated with activity in the right anterior insula.
\end{itemize}

\begin{center}
\includegraphics[width=0.5\linewidth]{./img/pe_location.png}
\end{center}
\end{casestudy}
\item[Actor-critic model] \marginnote{Actor-critic model}
A model that relates Pavlovian and instrumental learning (see the update rules sketched after this list).
It is composed of:
\begin{itemize}
\item The cortex, responsible for representing the current state.
\item The basal ganglia, which implement two computational models:
\begin{descriptionlist}
\item[Critic] \marginnote{Critic}
Learns stimulus-outcome associations and is active in both Pavlovian and instrumental learning.

It might be implemented in the ventral striatum, the amygdala and the orbitofrontal cortex.

\item[Actor] \marginnote{Actor}
Learns stimulus-action associations and is only active during instrumental learning.

It might be implemented in the dorsal striatum.
\end{descriptionlist}
\end{itemize}
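
A minimal formal sketch of the actor-critic idea (illustrative notation, not necessarily the one used in the course): the critic maintains state values $V(s)$ and computes a temporal-difference error
\[
    \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
\]
which is broadcast (plausibly by dopamine) to update both the critic and the actor:
\[
    V(s_t) \leftarrow V(s_t) + \alpha_c\,\delta_t
    \qquad\qquad
    m(s_t, a_t) \leftarrow m(s_t, a_t) + \alpha_a\,\delta_t
\]
where $m(s, a)$ is the actor's preference for action $a$ in state $s$ (e.g. turned into a policy with a softmax) and $\alpha_c, \alpha_a$ are learning rates.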
\end{description}

\begin{casestudy}[Food and cocaine]
\phantom{}
\begin{itemize}
\item The food-induced dopamine response is modulated by reward expectations, which promote learning until the prediction matches the actual outcome.
\item The cocaine-induced dopamine response causes a continuous increase in the predicted reward, which eventually surpasses all other cues and biases decision-making towards cocaine.
\end{itemize}
\begin{center}
\includegraphics[width=0.8\linewidth]{./img/dopamine_food_cocaine.png}
\end{center}
\end{casestudy}
\section{Historical evolution of learning strategies}

% Instrumental learning can happen in two ways:
% \begin{descriptionlist}
%     \item[Cognitive map] \marginnote{Cognitive map}
%         Actions are taken based on the expected reward.
%
%     \item[Response strategy] \marginnote{Response strategy}
%         Actions are associated with particular stimuli.
% \end{descriptionlist}

\subsection{Generation 0}

Two learning strategies were proposed:
\begin{descriptionlist}
\item[Stimulus-response theory] \marginnote{Stimulus-response theory}
Learning happens by creating stimulus-response associations.

No learning occurs if there is no reward.

\item[Cognitive map / Field theory] \marginnote{Cognitive map / Field theory}
A mental map is created and used to find the best action in a given state based on the expected reward.

\begin{description}
\item[Latent learning] \marginnote{Latent learning}
Learning that is not expressed behaviorally until there is enough motivation.
\end{description}
\end{descriptionlist}
\begin{casestudy}[Maze]
An animal is placed at the start of a maze whose reward is located in the west arm.
After some training trials, the animal is placed at the opposite entrance:
\begin{itemize}
\item If it goes to the west arm, it learned to solve the maze using a cognitive map/place strategy.
\item If it goes to the east arm, it learned to solve the maze using a stimulus-response strategy.
\end{itemize}
\begin{center}
\includegraphics[width=0.55\linewidth]{./img/instrumental_maze.png}
\end{center}

It has been observed that rats start by relying on a cognitive map (i.e. while the environment is still unknown).
After enough training, they switch to a response strategy (i.e. once the environment has proven stable).
\end{casestudy}

\begin{casestudy}[Tolman's maze]
Consider a maze with curtains and doors that prevent a long-distance perspective.
\begin{center}
\includegraphics[width=0.35\linewidth]{./img/tolman_maze.png}
\end{center}

Two groups of hungry rats were trained to solve the maze:
\begin{descriptionlist}
\item[Group 1] No reward for solving the maze.
\item[Group 2] Reward for solving the maze.
\end{descriptionlist}
It has been shown that the second group completes the maze faster.
\begin{center}
\includegraphics[width=0.45\linewidth]{./img/tolman_experiment1.png}
\end{center}

To show latent learning, three groups of hungry rats were considered:
\begin{descriptionlist}
\item[Group 1] No reward for solving the maze.
\item[Group 2] Reward for solving the maze.
\item[Group 3] Reward for solving the maze starting from day 11.
\end{descriptionlist}
It has been shown that rats in the third group start completing the maze faster as soon as they begin receiving the reward.
\begin{center}
\includegraphics[width=0.45\linewidth]{./img/tolman_experiment2.png}
\end{center}
\end{casestudy}
\subsection{Generation 1}

The focus shifted from the spatial domain to more general domains.
Behavior is described in terms of two types of action:
\begin{descriptionlist}
\item[Goal-directed action] \marginnote{Goal-directed action}
Actions made because a desired outcome is expected.
An action is goal-directed if:
\begin{itemize}
\item There is knowledge of the relationship between the action and its consequences (response-outcome).
\item The outcome is motivationally relevant.
\end{itemize}

Goal-directed behavior has the following properties:
\begin{itemize}
\item It involves active deliberation.
\item It has a high computational cost.
\item It is flexible to changes in the environmental contingency (i.e. it stops if the reward no longer occurs).
\end{itemize}

\begin{center}
\includegraphics[width=0.65\linewidth]{./img/goal_directed_behavior.png}
\end{center}

\item[Habitual action] \marginnote{Habitual action}
Actions made automatically just because they were rewarded in the past.
They are not influenced by the current outcome, even if it is undesired.

Habitual behavior has the following properties:
\begin{itemize}
\item It does not require active deliberation.
\item It has a low computational cost.
\item It is inflexible to changes in the environmental contingency.
\end{itemize}

\begin{center}
\includegraphics[width=0.65\linewidth]{./img/habitual_behavior.png}
\end{center}
\end{descriptionlist}
\begin{casestudy}[Goal-directed vs habitual behavior]
The experiment is done in three steps:
\begin{descriptionlist}
\item[Training]
The animal undergoes instrumental learning (e.g. it learns that pressing a lever delivers some food).

\item[Devaluation]
Manipulate the learned behavior by either:
\begin{itemize}
\item Devaluing the reinforcer (e.g. through satiation).
\item Degrading the contingency.
\end{itemize}

\item[Testing]
Repeat the training scenario without reward:
\begin{itemize}
\item If the action associated with the devalued reinforcer is performed less, the behavior is goal-directed.
\item If the frequency of the action stays the same, the behavior is habitual.
\end{itemize}
\end{descriptionlist}

\begin{remark}
With moderate training, the training phase instills goal-directed behavior.
If the animal is overtrained instead, it develops habitual behavior.
The experiment can be run both ways.
\end{remark}

\begin{center}
\includegraphics[width=0.85\linewidth]{./img/goal_directed_vs_habitual.png}
\end{center}
\end{casestudy}
It has been hypothesized that the striatum might be the interface where rewards influence actions, since: \marginnote{Striatum}
\begin{itemize}
\item The basal ganglia are involved in the selection of actions.
\item The substantia nigra pars compacta (SNc) affects the plasticity of the striatum through the release of dopamine.
\end{itemize}

Moreover, different sections of the striatum support different types of behavior:
\begin{descriptionlist}
\item[Dorsomedial striatum] Supports goal-directed behavior.
\item[Dorsolateral striatum] Supports habitual behavior.
\end{descriptionlist}
This also hints that goal-directed and habitual behaviors act simultaneously and competitively (see \hyperref[sec:instrumental_gen3]{Generation 3}).

\subsection{Generation 2}

Studied goal-directed and habitual behavior in humans.

\begin{description}
\item[Functional magnetic resonance imaging (fMRI)] \marginnote{Functional magnetic resonance imaging (fMRI)}
Measures the ratio of oxygenated to deoxygenated hemoglobin in the brain.
It indirectly measures neuronal activity with high spatial resolution (i.e. it shows where things happen, but not precisely when).
\end{description}
\begin{casestudy}[Goal-directed behavior in humans]
Participants are trained to choose between two fractals, one of which leads to a reward with high probability and the other with low probability.
The possible rewards are chocolate, tomato juice and orange juice (the latter used as a control outcome).

\begin{figure}[H]
\centering
\includegraphics[width=0.45\linewidth]{./img/human_goal_directed_experiment.png}
\caption{
Structure of the task. The high-probability choice leads to the primary reward (chocolate or tomato juice) with probability $0.4$,
to the control reward (orange juice) with probability $0.3$ and to nothing with probability $0.3$.
The low-probability choice leads to the control reward with probability $0.3$ and to nothing in the other cases.
The neutral case leads to an empty glass or nothing.
}
\end{figure}

After training, one of the primary rewards is devalued through selective satiation and the other is labeled as the valued outcome.
Then, the training task is repeated.
\begin{figure}[H]
\centering
\includegraphics[width=0.55\linewidth]{./img/human_goal_directed_experiment2.png}
\caption{
Steps of the experiment. In this figure, the devalued reward is the tomato juice.
}
\end{figure}

Behavioral results show that:
\begin{itemize}
\item During training, participants favored the high-probability actions associated with chocolate and tomato juice.
On the other hand, choices in the neutral condition were evenly distributed.
\item The pleasantness rating of the devalued reward dropped after devaluation, while that of the valued reward remained high.
\item During testing, participants reduced their choice of the high-probability action associated with the devalued reward.
\end{itemize}
\begin{figure}[H]
\centering
\includegraphics[width=0.9\linewidth]{./img/human_goal_directed_experiment3.png}
\end{figure}

fMRI scans of the participants were acquired during both training and testing.
Neural results show that the \textbf{medial orbitofrontal cortex} shows a significant modulation of its activity during instrumental action selection,
depending on the value of the associated outcome.
\begin{figure}[H]
\centering
\includegraphics[width=0.4\linewidth]{./img/human_goal_directed_experiment4.png}
\end{figure}
\end{casestudy}
\begin{casestudy}[Habitual behavior in humans]
At each trial, participants are presented with a fractal image and a schematic indicating which button to press.
The button can be pressed an arbitrary number of times and, at each press, the screen shows:
\begin{itemize}
\item A gray circle (no reward).
\item The image of an M\&M's {\tiny ©} or a Frito {\tiny ©} (reward) with probability $0.1$.

Each fractal is associated with only one type of reward.
\end{itemize}
\begin{figure}[H]
\centering
\includegraphics[width=0.6\linewidth]{./img/human_habitual_experiment.png}
\end{figure}

After training, one of the food rewards is devalued through selective satiation.
Then, during testing, the same task with the same stimulus-response-outcome associations is repeated, but without delivering the reward.

Two groups have been considered:
\begin{descriptionlist}
\item[1-day group] with little training.
\item[3-day group] with extensive training.
\end{descriptionlist}

Behavioral results show that:
\begin{itemize}
\item Before devaluation, there were no significant differences between the responses of the two groups, regardless of the type of food.
\item During testing, the 1-day group showed goal-directed behavior while the 3-day group showed habitual behavior.
\end{itemize}
\begin{figure}[H]
\centering
\includegraphics[width=0.55\linewidth]{./img/human_habitual_experiment2.png}
\end{figure}

fMRI scans of the participants were acquired during both training and testing.
Neural results show that, in the 3-day group, the \textbf{dorsolateral striatum} showed significant activity.
\end{casestudy}
\subsection{Generation 3} \label{sec:instrumental_gen3}

Goal-directed and habitual actions are formalized as follows (a formal sketch follows the list):

\begin{descriptionlist}
\item[Model-based] (Goal-directed) \marginnote{Model-based}
Uses a model of the environment to predict the consequences of actions in terms of future states and the rewards expected from those states.

When the environment changes, the agent can update its policy for future states without needing to actually visit them.

\item[Model-free] (Habitual) \marginnote{Model-free}
Selects actions based on stored state-action values learned over many trials.

When the environment changes, the agent has to move into the new states and experience them.

\item[Hybrid model] \marginnote{Hybrid model}
Integrated computational and neural architecture in which
the model-based and model-free systems act simultaneously and competitively.

This is the currently favored model of behavior.
\end{descriptionlist}
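
In reinforcement-learning notation (an illustrative sketch, not necessarily the exact formulation used in the course), the two systems can be contrasted as follows.
The model-free system caches values through a prediction-error update:
\[
    Q_{\mathrm{MF}}(s, a) \leftarrow Q_{\mathrm{MF}}(s, a) + \alpha \left[ r + \gamma \max_{a'} Q_{\mathrm{MF}}(s', a') - Q_{\mathrm{MF}}(s, a) \right]
\]
The model-based system computes values by planning through a learned model of transitions $T(s' \mid s, a)$ and rewards $R$:
\[
    Q_{\mathrm{MB}}(s, a) = \sum_{s'} T(s' \mid s, a) \left[ R(s, a, s') + \gamma \max_{a'} Q_{\mathrm{MB}}(s', a') \right]
\]
A hybrid agent can combine the two with a weight $w \in [0, 1]$:
\[
    Q(s, a) = w\, Q_{\mathrm{MB}}(s, a) + (1 - w)\, Q_{\mathrm{MF}}(s, a)
\]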

\begin{casestudy}[Latent learning in humans]
The experiment consists of a sequential two-choice Markov decision task in which participants navigate a binary decision tree.

Each state shows a fractal image and
participants can choose to move to the left or right branch, each of which leads to one of the two subsequent states with probability $0.7$ or $0.3$.
When a leaf is reached, a monetary reward (0\textcentoldstyle, 10\textcentoldstyle\, or 25\textcentoldstyle) is delivered.
\begin{figure}[H]
\centering
\includegraphics[width=0.4\linewidth]{./img/human_latent_experiment.png}
\end{figure}

\begin{minipage}{0.58\linewidth}
The experiment is divided into two sessions:
\begin{descriptionlist}
\item[First session]
Participants' choices are fixed, but they can learn the transition probabilities.

\item[Before second session]
Participants are shown the association between each fractal and its reward.

\item[Second session]
Participants are free to choose their actions at each state.
\end{descriptionlist}
\end{minipage}
\begin{minipage}{0.4\linewidth}
\centering
\includegraphics[width=0.95\linewidth]{./img/human_latent_experiment2.png}
\end{minipage}\\[1em]

\begin{minipage}{0.6\linewidth}
Behavioral results show that the majority of participants are able to make the optimal choice.
This indicates that their behavior cannot be explained by a model-free learning theory (in which learning only happens when a reward is delivered).
A hybrid model has been proposed to explain the participants' behavior. It includes:
\begin{descriptionlist}
\item[Reward prediction error] Associated with model-free learning.
\item[State prediction error] Associated with model-based learning.
\end{descriptionlist}
\end{minipage}
\begin{minipage}{0.4\linewidth}
\centering
\includegraphics[width=0.7\linewidth]{./img/human_latent_experiment3.png}
\end{minipage}\\[1em]

At the neural level, fMRI shows that:
\begin{itemize}
\item The state prediction error activates the \textbf{intraparietal sulcus} and the \textbf{lateral prefrontal cortex}.
\item The reward prediction error activates the \textbf{ventral striatum}.
\end{itemize}
\begin{figure}[H]
\centering
\includegraphics[width=0.75\linewidth]{./img/human_latent_experiment4.png}
\end{figure}
\end{casestudy}
\begin{casestudy}[Model-free vs model-based in humans]
Consider a Markov decision task that works as follows:
\begin{itemize}
\item In the first stage, participants have to choose between two fractal images,
each leading to one of the two subsequent states with probability $0.7$ (common transition) or $0.3$ (rare transition).
\item In the second stage, participants have to choose between two fractal images,
each of which leads to a monetary reward with its own independent probability.
\item The reward probabilities change stochastically across trials.
\end{itemize}
\begin{figure}[H]
\centering
\includegraphics[width=0.35\linewidth]{./img/model_free_based_theoretical.png}
\end{figure}

The two kinds of agent are expected to behave as follows (a worked sketch follows the list):
\begin{descriptionlist}
\item[Model-free agents]
Ignore the transition structure and prefer to repeat actions that led to a reward in the past.

\item[Model-based agents]
Respect the transition structure and modify their policies depending on the outcome.

They are more likely to repeat an action following a rewarding trial only if that transition is common.
\end{descriptionlist}
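
To make these signatures concrete, consider a first-stage action $a_1$ whose common ($p = 0.7$) transition leads to state $s_A$ and whose rare ($p = 0.3$) transition leads to $s_B$ (an illustrative sketch; the symbols are not taken from the study). A model-based agent evaluates $a_1$ by planning through the transition structure:
\[
    Q_{\mathrm{MB}}(a_1) = 0.7 \max_{a_2} Q_2(s_A, a_2) + 0.3 \max_{a_2} Q_2(s_B, a_2)
\]
so a reward obtained after a rare transition to $s_B$ mainly increases the value of the \emph{other} first-stage action, for which $s_B$ is the common outcome. A model-free agent instead reinforces the chosen action directly, $Q_{\mathrm{MF}}(a_1) \leftarrow Q_{\mathrm{MF}}(a_1) + \alpha\,\delta$, regardless of whether the transition was common or rare; a hybrid agent mixes the two values as sketched in the previous subsection.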

In practice, the results on human participants show that a hybrid model is better suited to explain human behavior.
\begin{figure}[H]
\centering
\includegraphics[width=0.6\linewidth]{./img/human_hybrid_model.png}
\end{figure}

% \begin{figure}[H]
%     \centering
%     \includegraphics[width=0.5\linewidth]{./img/model_free_based_theoretical2.png}
% \end{figure}

Neural results from fMRI also show that activity in the \textbf{striatum} increases for both model-based and model-free prediction errors.
\end{casestudy}