Sources of suboptimality in a minimalistic explore-exploit task.
Mingyu Song, Zahy Bnaya, Wei Ji Ma. Published in: Nature Human Behaviour (2019)
People often choose between sticking with an available good option (exploitation) and trying out a new option that is uncertain but potentially more rewarding (exploration) [1,2]. Laboratory studies on explore-exploit decisions often contain real-world complexities such as non-stationary environments, stochasticity under exploitation and unknown reward distributions [3-7]. However, such factors might limit the researcher's ability to understand the essence of people's explore-exploit decisions. For this reason, we introduce a minimalistic task in which the optimal policy is to start off exploring and to switch to exploitation at most once in each sequence of decisions. The behaviour of 49 laboratory and 143 online participants deviated both qualitatively and quantitatively from the optimal policy, even when allowing for bias and decision noise. Instead, people seem to follow a suboptimal rule in which they switch from exploration to exploitation when the highest reward so far exceeds a certain threshold. Moreover, we show that this threshold decreases approximately linearly with the proportion of the sequence that remains, suggesting a temporal ratio law. Finally, we find evidence for 'sequence-level' variability that is shared across all decisions in the same sequence. Our results emphasize the importance of examining sequence-level strategies and their variability when studying sequential decision-making.
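The threshold rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' fitted model: the function names, the parameters `intercept` and `slope`, and the convention that `t` counts decisions already made out of `seq_len` are all assumptions introduced here for illustration. The sign and size of `slope` are left as free parameters rather than taken from the paper.

```python
def remaining_proportion(t: int, seq_len: int) -> float:
    """Proportion of the sequence still to come after t decisions (assumed convention)."""
    if seq_len <= 0 or not 0 <= t <= seq_len:
        raise ValueError("need seq_len > 0 and 0 <= t <= seq_len")
    return (seq_len - t) / seq_len


def threshold(t: int, seq_len: int, intercept: float, slope: float) -> float:
    """Hypothetical threshold that is linear in the proportion of the sequence remaining."""
    return intercept + slope * remaining_proportion(t, seq_len)


def choose(best_so_far: float, t: int, seq_len: int,
           intercept: float, slope: float) -> str:
    """Exploit once the highest reward seen so far exceeds the current threshold."""
    if best_so_far > threshold(t, seq_len, intercept, slope):
        return "exploit"
    return "explore"
```

Because the threshold depends only on the ratio of remaining to total decisions, two sequences of different lengths give the same threshold at the same relative position, which is one way to read the "temporal ratio law" mentioned above. The sketch omits decision noise and the sequence-level variability that the abstract also reports.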