Should I stay or should I go? How the human brain manages the trade-off between exploitation and exploration

Should I stay or should I go? How the human brain manages the trade-off between exploitation and exploration

  • Jonathan D. Cohen
  • Samuel M. McClur
  • Angela J. Yu
  • …often our decisions depend on a higher level choice: whether to exploit well known but possibly suboptimal alternatives or to explore risky but potentially more profitable ones. How adaptive agents choose between exploitation and exploration remains an important and open question that has received relatively limited attention in the behavioural and brain sciences. The choice could depend on a number of factors, including the familiarity of the environment, how quickly the environment is likely to change and the relative value of exploiting known sources of reward versus the cost of reducing uncertainty through exploration.
  • The need to balance exploitation with exploration is confronted at all levels of behaviour and time-scales of decision making from deciding what to do next in the day to planning a career path.
  • The significance of Gittins’ contribution is that it reduced the decision problem to computing and comparing these scalar indices. In practice, computing the Gittins index is not tractable for many problems for which it is known to be optimal. However, for some limited problems, explicit solutions have been found. For instance, the Gittins index has been computed for certain two armed bandit problems (in which the agent chooses between two options with independent probabilities of generating a reward), and compared to the foraging behaviour of birds under comparable circumstances; the birds were found to behave approximately optimally
  • Perhaps, the most important exception to Gittins’ assumptions is that real-world environments are typically non-stationary; i.e. they change with time. To understand how organisms manage the balance between exploration and exploitation in non-stationary environments, investigators have begun to study how organisms adapt their behaviour in response to the experimentally induced changes in reward contingencies. Several studies have now shown that both humans and other animals dynamically update their estimates of rewards associated with specific courses of action, and abandon actions that are deemed to be diminishing in value in search of others that may be more rewarding
  • At the same time, there is also longstanding evidence that humans sometimes exhibit an opposing tendency. When reward diminishes (e.g. following an error in performance), subjects often try harder at what they have been doing rather than less (e.g. Rabbitt 1966; Laming 1979; Gratton et al. 1992).
  • The balance between exploration and exploitation also seems to be sensitive to time horizons. Humans show a greater tendency to explore when there is more time left in a task, presumably because this allows them sufficient time later to enjoy the fruits of those explorations (Carstensen et al. 1999). – is this related to (lack of) stress? Something about cognitive bandwidth?
  • Bandit problems are well suited for studying the tension between exploitation and exploitation since they offer a direct trade-offbetween exploiting a known source of reward (continuing to play one arm of the bandit) and exploring the environment (trying other arms) to acquire information about other sources of reward
  • The investigators found that the time at which birds stopped exploring (operationalized as the point at which they stayed at one feeding post) closely approximated that predicted by the optimal solution. Despite their findings, Krebs et al. (1978) recognized that it was highly unlikely that their birds were carrying out the complex calculations required by the Gittins index. Rather, they suggested that the birds were using simple behavioural heuristics that produces exploration times that qualitatively approximate the optimal solution – this might be good for the modelling section.
  • Nevertheless, to our knowledge, the Daw et al. (2006)study was the first to address formally the question of how subjects weigh exploration against exploitation in a non-stationary, but experimentally controlled environment. It also produced some interesting neurobiological findings. Their subjects performed the n-armed bandit task while being scanned using functional magnetic resonance imaging (fMRI). Among the observations reported was task-related activity in two sets of regions of prefrontal cortex (PFC). One set of regions was in ventromedial PFC and was associated with both the magnitude of reward associated with a choice, and that predicted by their computational model of the task (using the softmax decision rule). This area has been consistently associated with the encoding of reward value across a variety of task domain – biological basis for different behaviors
  • Yu & Dayan (2005) proposed that a critical function of two important neuromodulators—acetylcholine (ACh) and norepinephrine (NE)—may be to signal expected and unexpected sources of uncertainty. While the model they developed for this was not intended to address the trade-off between exploitation and exploration, the distinction between expected and unexpected uncertainty is likely to be an important factor in regulating this trade-off. For example, the detection of unexpected uncertainty can be an important signal of the need to promote exploration.
  • …the distinction between expected and unexpected forms ofuncertainty may be an important element in choosing between exploitation versus exploration. As long as prediction errors can be accounted for in terms of expected uncertainty—that is the amount that we expect a given outcome to vary—then all other things being equal (e.g. ignoring potential non-stationarities in the environment), we should persist in our current behaviour (exploit). However, if errors in prediction begin to exceed the degree expected—i.e. unexpected uncertainty mounts—then we should revise our strategy and consider alternatives (explore).
  • Yu & Dayan (2005) proposed that ACh levels are used to signal expected uncertainty, and NE to signal unexpected uncertainty. They describe a computationally tractable algorithm by which these maybe estimated that approximates the Bayesian optimal computation of those estimates. Furthermore, they proposed how these estimates, reflected by NE and ACh levels, could be used to determine when to revise expectations

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.