This paper proposes that the distinctively human capacity for cumulative, adaptive, open-ended cultural evolution came about through two temporally distinct cognitive transitions. First, the origin of Homo-specific culture more than two million years ago was made possible by the onset of a finer-grained associative memory that allowed episodes to be encoded in greater detail. This in turn meant more overlap among the distributed representations of these episodes, such that they could more readily evoke one another through self-triggered recall (STR). STR enabled representational redescription, the chaining of thoughts and actions, and the capacity for a stream of thought. Second, full cognitive modernity, which followed the appearance of anatomical modernity after 200,000 BP, was made possible by the onset of contextual focus (CF): the ability to shift between an explicit convergent mode conducive to logic and the refinement of ideas, and an implicit divergent mode conducive to free association, viewing situations from radically new perspectives, concept combination, analogical thinking, and insight. This paved the way for an integrated, creative internal network of understandings, and for behavioral modernity. We discuss feasible neural mechanisms for this two-stage proposal, and outline how STR and CF differ from other proposals. We provide computational evidence for the proposal, obtained with an agent-based model of cultural evolution in which agents invent ideas for actions and imitate the fittest of their neighbors' actions. Mean fitness and diversity of actions across the artificial society increased with STR, and even more so with CF, but CF was effective only if STR was already in place. CF was most effective following a change in task, which supports its hypothesized role in escaping mental fixation. The proposal is discussed in the context of transition theory in the life sciences.
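The invent-and-imitate loop described above can be illustrated with a minimal sketch. This is not the model used in the paper. The ring topology, the bit-vector encoding of actions, the match-counting fitness function, and every name and parameter (`run_society`, `fitness`, `p_invent`, `n_bits`) are assumptions introduced purely for illustration, and the sketch omits STR and CF entirely.

```python
import random


def fitness(action, target):
    """Hypothetical fitness: number of positions where action matches target."""
    return sum(a == t for a, t in zip(action, target))


def run_society(n_agents=20, n_bits=8, n_steps=30, p_invent=0.5, seed=0):
    """Return the mean fitness of the society after each iteration."""
    rng = random.Random(seed)
    target = [1] * n_bits  # assumed optimum; not taken from the paper
    actions = [[rng.randint(0, 1) for _ in range(n_bits)]
               for _ in range(n_agents)]
    history = []
    for _ in range(n_steps):
        new_actions = []
        for i, current in enumerate(actions):
            if rng.random() < p_invent:
                # Invent: flip one randomly chosen component of the current action.
                candidate = current[:]
                j = rng.randrange(n_bits)
                candidate[j] = 1 - candidate[j]
            else:
                # Imitate: copy the fitter of the two ring neighbors.
                left = actions[(i - 1) % n_agents]
                right = actions[(i + 1) % n_agents]
                candidate = max((left, right),
                                key=lambda a: fitness(a, target))[:]
            # Adopt the candidate only if it is strictly fitter than the current action.
            if fitness(candidate, target) > fitness(current, target):
                new_actions.append(candidate)
            else:
                new_actions.append(current)
        actions = new_actions  # synchronous update of the whole society
        history.append(sum(fitness(a, target) for a in actions) / n_agents)
    return history
```

Calling `run_society()` returns one mean-fitness value per iteration. Because agents in this sketch adopt a new action only when it is fitter than their current one, the series cannot decrease. Studying effects such as those reported for STR and CF would require extending the invention step, which this sketch does not attempt.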