Journal of Experimental Social Psychology 96 (2021) 104154

Retrospective and prospective hindsight bias: Replications and extensions of Fischhoff (1975) and Slovic and Fischhoff (1977)☆
Jieying Chen a,*,1, Lok Ching Kwan (Roxane)b,1, Lok Yeung Ma (Loren)b,1, Hiu Yee Choi (HayleyAnne)b,1, Ying Ching Lo (Lita)b,1, Shin Yee Au (Sarah)b,1, Chi Ho Tsang (Toby)b,1, Bo Ley Cheng b, Gilad Feldman b,*
a Department of Business Administration, University of Manitoba, Canada
b Department of Psychology, University of Hong Kong, Hong Kong SAR, China

ARTICLE INFO
Keywords: Hindsight bias; Knew-it-all-along effect; Outcome knowledge; Judgment and decision making; Surprise; Confidence; Pre-registered replication

ABSTRACT
Hindsight bias refers to the tendency to perceive an event outcome as more probable after being informed of that outcome. We conducted very close replications of two classic experiments of hindsight bias and a conceptual replication testing hindsight bias regarding the perceived replicability of hindsight bias. In Study 1 (N = 890), we replicated Experiment 2 in Fischhoff (1975), and found support for hindsight bias in retrospective judgments (dmean = 0.60). In Study 2 (N = 608), we replicated Experiment 1 in Slovic and Fischhoff (1977), and found support for hindsight bias in prospective judgments (dmean = 0.40). In Study 3 (N = 520) we found strong support for hindsight bias regarding perceived likelihood of our replication of hindsight bias (d = 0.43–1.03). We also included extensions examining surprise, confidence, and task difficulty, yet found mixed evidence with weak to no effects. We concluded support for hindsight bias in both retrospective and prospective judgments, and in evaluations of replication findings, and therefore call for establishing measures to address hindsight bias in valuations of replication work and interpreting research outcomes. All materials, data, and code, were shared on: https://osf.io/nrwpv/.

1. Hindsight bias
Hindsight bias refers to the tendency to perceive an event outcome as more probable after being informed of that outcome, resulting in the illusion that the outcome “was known all along” (Fischhoff, 1975; Hawkins & Hastie, 1990; Roese & Vohs, 2012). Examples of hindsight bias include claims that a surprising movie ending was actually predictable, post-election claims that it was obvious who would get elected, students feeling like they knew in advance that an unlikely question was to be on the exam, or financial analysts claiming to have predicted market changes after they happened. Hindsight bias may also affect researchers' interpretations of study findings, leading to an overestimation of their ability to predict the results beforehand and an underestimation of their reliance on the observed outcomes in reconstructing their previous predictions (Fischhoff, 1977).
The earliest empirical investigation that touches upon the idea of hindsight bias that we know of dates back to Forer's (1949) study about students' beliefs about a personality test (see Hoffrage & Pohl, 2003). Students were asked to rate the extent to which the test revealed basic characteristics of their personality, and then recall their ratings after knowing that the feedback received by all students was the same. Although Forer (1949) focused on examining how individuals could be fooled by universal statements about personality (e.g., “At times you are extroverted, affable, sociable, while at other times you are introverted, wary, reserved”), this study uncovered the unexpected finding that feedback may affect memory.
A more formal investigation of hindsight bias came in the mid-1970s, when Fischhoff (1975) published a study that explicitly compared the probability estimates of outcomes before (in foresight) and after (in hindsight) knowing what outcome actually occurred. In this pioneering study, participants were presented with four scenarios and four possible outcomes following each scenario. Then, they were asked to estimate the probabilities of possible outcomes in those scenarios. Some participants were informed of the outcomes of the scenarios, whereas the rest were not. Fischhoff found that participants with outcome knowledge estimated the probability of the informed outcome to be higher than participants who were not given any outcome information, demonstrating hindsight bias. Because this effect held despite the instructions to ignore outcome knowledge, Fischhoff (1975) suggested that individuals were either unaware of their bias, or, if they were aware, they were unable to make judgments in a foresightful state of mind (though Dietvorst and Simonsohn, 2019 suggested an alternative accuracy-based account).

☆ This paper has been recommended for acceptance by Professor Michael Kraus.
* Corresponding author.
E-mail addresses: jieying.chen@umanitoba.ca (J. Chen), rk1128@hku.hk, rk1128@connect.hku.hk (L.C. Kwan), loren14@connect.hku.hk (L.Y. Ma), hychoi@connect.hku.hk (H.Y. Choi), u3527928@connect.hku.hk (Y.C. Lo), u3519865@connect.hku.hk (S.Y. Au), tbtsang@connect.hku.hk, 13tsangtc1@kgv.hkBo (C.H. Tsang), boleystudies@gmail.com (B.L. Cheng), gfeldman@hku.hk (G. Feldman).
1 Contributed equally, joint first authors
https://doi.org/10.1016/j.jesp.2021.104154 Received 22 May 2020; Received in revised form 1 April 2021; Accepted 19 April 2021 0022-1031/© 2021 Elsevier Inc. All rights reserved.

Since the Fischhoff (1975) article was published, hindsight bias has attracted much scholarly attention and led to a sizable body of follow-up research. Several studies investigated whether hindsight bias was “real,” or whether it was induced by demand characteristics. For example, Fischhoff (1977) and Wood (1978) found that hindsight bias still held when outcome knowledge was provided as isolated statements, when outcome knowledge was provided with a delay, and when participants were asked to respond as if they were a general college student who might not have known the outcome. These findings alleviated the concern about demand characteristics.
Later studies also differentiated between two main ways to examine hindsight bias (Pohl, 2007). The design used by Fischhoff (1975) is termed the hypothetical design, as participants in the hindsight condition receive feedback about the actual outcome (or, the correct answer), but are asked to answer as if they did not know the outcome. These “as if” answers are then compared with answers by participants in the foresight condition who receive no feedback. The other design is the memory design, in which participants in the hindsight condition first answer some questions, then are informed of the correct answer, and at the end are asked to recall their initial answers (Fischhoff & Beyth, 1975; Wood, 1978). Their recalled answers are then compared with their initial answers.
The hypothetical design and the memory design share many similarities, yet one distinction between them is noteworthy: hindsight bias detected using the memory design is mostly associated with memory distortion and/or the feeling that the known outcome was to happen inevitably, whereas hindsight bias that occurs in the hypothetical design may entail more complex psychological processes (Roese & Vohs, 2012).
Hindsight bias has had significant impact on a wide array of disciplines going beyond psychology, such as economics, management, health science, and law (e.g., Bukszar & Connolly, 1988; Casper, Benedict, & Perry, 1989; Kaplan & Barach, 2002; Thaler, 2016).
2. Reasons for hindsight bias: Emotions
Multiple factors were suggested as possible causes for hindsight bias (Blank, Musch, & Pohl, 2007; Hawkins & Hastie, 1990; Roese & Vohs, 2012), including 1) cognitive processes such as memory impairment, biased reconstruction, and sense-making, 2) meta-cognitive processes involving experiences such as surprise, confidence, experienced fluency, ease of reasoning, and 3) social-motivational processes to increase controllability and enhance self-image.
Several models have been proposed to explain hindsight bias. The Reconstruction After Feedback with Take the Best (RAFT; Hoffrage, Hertwig, & Gigerenzer, 2000) model suggested that when a direct recall of the initial answer is not possible, individuals try to reconstruct their initial answer by using relevant cues to reevaluate the question. Both the initial evaluation and the reconstructed evaluation are based on a Take the Best heuristic, where the decision is based on the cue that discriminates among choices and has the highest validity. Because feedback transforms the values of elusive cues into discriminating ones and shifts cue values asymmetrically toward the feedback, the reconstructed answer will also be biased toward the feedback, demonstrating hindsight bias. The Selective Activation and Reconstructive Anchoring (SARA; Pohl, Eisenhauer, & Hardt, 2003) model assumes that individuals generate answers, encode feedback, and recall answers based on a probabilistic sampling of associations among external cues and units in the knowledge base. When individuals encode the feedback into their knowledge base, the associations among external cues, feedback, and units that are similar to the feedback are strengthened. This will render units that are more similar to the feedback more likely to be activated in a memory search using those external cues (i.e., selective activation). In addition, after seeing the feedback, individuals may still maintain the feedback in the working memory, or have increased cognitive accessibility to the feedback due to its recent activation. In these cases, feedback may be used as internal retrieval cues, making units similar to the feedback more likely to be retrieved to the working memory (i.e., biased reconstruction). According to SARA, either selective activation or biased reconstruction, or both, can lead to hindsight bias.
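As a rough illustration of the Take the Best step that RAFT relies on, the sketch below chooses between two options using the highest-validity cue that discriminates. The function name, the cue encoding (1, 0, or None for an unknown value), the validities, and the rule that an unknown value makes a cue non-discriminating are all simplifying assumptions made for this example; they are not taken from Hoffrage et al. (2000).

```python
from typing import List, Optional

CueValue = Optional[int]  # 1 = positive, 0 = negative, None = unknown (assumed encoding)


def take_the_best(cues_a: List[CueValue], cues_b: List[CueValue],
                  validities: List[float]) -> Optional[str]:
    """Return "A" or "B" from the most valid discriminating cue, else None."""
    # Inspect cues from highest to lowest validity.
    order = sorted(range(len(validities)), key=lambda i: validities[i], reverse=True)
    for i in order:
        a, b = cues_a[i], cues_b[i]
        # Assumption: a cue with an unknown or equal value does not discriminate.
        if a is None or b is None or a == b:
            continue
        return "A" if a > b else "B"
    return None  # no cue discriminates; a guess would be needed


# Hypothetical cue values before feedback: the most valid cue is unknown for A.
validities = [0.9, 0.6]
before = take_the_best([None, 1], [1, 0], validities)

# Hypothetical update after learning that B was correct: the unknown value is
# filled in consistently with the feedback, so the top cue now discriminates.
after = take_the_best([0, 1], [1, 0], validities)

print(before, after)
```

In this toy case the reconstructed choice moves toward the feedback once the formerly unknown cue value is filled in, which is the kind of shift that RAFT describes.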
In both RAFT and SARA, when encoding feedback, the changes to the knowledge base, cue values, and associations occur automatically. Such knowledge updating is often seen as an adaptive learning process (e.g., Hawkins & Hastie, 1990; Hertwig, Fanselow, & Hoffrage, 2003; Hoffrage et al., 2000; Pohl, Bender, & Lachmann, 2002). However, as Bernstein et al. (2011, p. 389) wrote, “the downside of such automatic knowledge updating is that people tend to forget their original, naive thoughts, views, and predictions.”
Other eminent models about the psychological processes underlying hindsight bias include the causal model theory (Blank & Nestler, 2007), Pezzo's (2003) sense-making model, Roese and Vohs' (2012) three-level model, and Sanna and Schwarz's (2006) metacognitive model.
3. Role of surprise, overconfidence, and task difficulty
Emotions such as surprise and overconfidence have been suggested as factors in cognitive and metacognitive processes leading to hindsight bias (Bernstein, Aßfalg, Kumar, & Ackerman, 2016). Fischhoff and Beyth (1975, p. 12) argued that “the occurrence of an event increases its reconstructed probability and makes it less surprising than it would have been had the original probability been remembered.” They operationalized surprise as “the occurrence of an unlikely event or the nonoccurrence of a likely event” (Fischhoff & Beyth, 1975, p. 12), and found that outcome knowledge reduced surprise (i.e., participants made decreased probability estimates of unlikely events and increased probability estimates of likely events after knowing the outcome). Slovic and Fischhoff (1977, Experiment 3) was the first study that we know of to examine the relationship between subjective surprise feelings and hindsight bias. In this experiment, “hindsight subjects assessed the surprisingness of the reported outcome, and foresight subjects assessed how surprising each of the two possible outcomes would seem were they obtained” (Slovic & Fischhoff, 1977, p. 549). They found direct support for the hypothesis that hindsight participants who had outcome knowledge felt less surprised about the outcome than foresight participants who had no outcome knowledge. Later studies investigating the role of surprise in hindsight bias either measured surprise as a subjective feeling (e.g., Hoch & Loewenstein, 1989; Ofir & Mazursky, 1997) or manipulated surprise using expected outcomes or high cognitive loads (e.g., Mazursky & Ofir, 1990; Müller & Stahlberg, 2006).
In addition, some studies found that when experiencing surprise about a highly unusual outcome, individuals may show a reversed hindsight bias, such that their reconstructed probability estimates of the outcome become lower than their initial probability estimates (Mazursky & Ofir, 1990; Müller & Stahlberg, 2007; Ofir & Mazursky, 1997). The underlying rationale is that hindsight bias often results from a cognitive failure to become aware of the distorted memory and evidence reconstruction, and to recognize how much one has learned from the outcome knowledge prior to the estimation. The feeling of surprise is linked with an awareness that the outcome is different from what they would have expected given their knowledge of the event. Therefore, when experiencing high levels of surprise, individuals are more likely to conclude that they “never would have known it,” estimating the outcome probability to be lower (rather than higher) than the estimates made by individuals without outcome knowledge (Mazursky & Ofir, 1990; Müller & Stahlberg, 2007; Ofir & Mazursky, 1997; Sanna & Schwarz, 2006).
Whereas surprise may help individuals overcome hindsight bias, overconfidence may exacerbate hindsight bias, as it reduces individuals' scrutiny of their own decision-making process and hinders the recognition of the impact of outcome knowledge (Bernstein et al., 2016). Winman, Juslin, and Björkman (1998) found support for a confidence-hindsight mirror effect: tasks that yielded overconfidence led to a hindsight bias, whereas tasks that yielded underconfidence led to a reversed hindsight bias.
The impact of overconfidence and hindsight bias may escalate. For example, physicians may become more overconfident about their judgments of certain physiological indices over time due to accumulated outcome knowledge, which can lead to increasingly stronger hindsight bias (Arkes, 2013). However, studies indicated little to no relationship between physicians' confidence about their judgments of physiological indices and the real accuracy of those judgments (e.g., Dawson et al., 1993; Yang & Thompson, 2010). Thus, without proper caution, the escalation of overconfidence and hindsight bias may lead to undesirable consequences in high-stake decisions.
Other studies investigated the role of task difficulty in hindsight bias (e.g., Harley, Carlsen, & Loftus, 2004), based on the assumption that task difficulty is related to both surprise about the outcome and confidence about the accuracy of one's own judgment (Winman et al., 1998). The arguments are similar to those regarding surprise and confidence.
4. Implications of hindsight bias for science
Hindsight bias holds implications for science, and shows the importance of the ongoing credibility revolution in promoting open science practices (Hom Jr & Van Nuland, 2019; Kerr, 1998; Nosek, Ebersole, DeHaven, & Mellor, 2018; Shrout & Rodgers, 2018; Veldkamp, 2017). First, retrospective hindsight bias suggests that being presented with a study's outcome may lead to overestimating the probability of that outcome. This may result in the skewed perception that this outcome was the expected result and in line with one's own expectations even when it was not the case. Past research has shown that when evaluating research findings, individuals who had outcome knowledge perceived the research findings to be more obvious and inevitable than individuals who had no outcome knowledge (Wong, 1995). The false belief of having known the outcome all along may lead to Hypothesizing After the Results are Known (HARKing; i.e., presenting a post-hoc hypothesis as if it were an a priori hypothesis; Kerr, 1998), which has been identified as a questionable research practice (QRP). HARKing makes exploratory analyses seem as if they were confirmatory, thereby leading to an overconfidence in the reported findings and fewer follow-up confirmatory studies, overall increasing the rate of false-positive findings in the literature (Bosco, Aguinis, Field, Pierce, & Dalton, 2016; Hom Jr & Van Nuland, 2019; John, Loewenstein, & Prelec, 2012; Shrout & Rodgers, 2018). To fend against hindsight bias, researchers have recommended the endorsement of open-science best practices such as preregistration, Registered Reports, and openly sharing all predictions and decisions throughout the entire research lifecycle (Nosek et al., 2018; van't Veer & Giner-Sorolla, 2016).
Second, prospective hindsight bias may result in overestimating the robustness and the generalizability of an initial finding, believing that replications of a study would result in the same findings, and that replications are therefore of no value and a waste of resources. There are currently immense pressures for novelty in science, discouraging researchers from conducting replications (Nosek, Spies, & Motyl, 2012). Then, even if researchers do conduct a replication study, the combination of hindsight bias and confirmation bias (Wagenmakers, Wetzels, Borsboom, van der Maas, & Kievit, 2012) may lead researchers to analyze the data and interpret replication findings in a way that would favor initial findings, or feel pressured to do so by original authors, reviewers, editors, and other gatekeepers in the publication, promotion, and grant systems that perceive original findings as taken for granted or more authoritative. One way of addressing these problems is by encouraging direct close open replications by multiple third-party researchers (Brandt et al., 2014; Nosek et al., 2012; Nosek et al., 2018). Several mass open-science collaboration teams have been formed in the last decade to pursue this direction, such as the Psychological Science Accelerator (Moshontz et al., 2018), Collaborative Replications and Education Project (Wagge et al., 2019), and Many Labs (e.g., Ebersole et al., 2020; Klein et al., 2018).
However, the success of these initiatives depends on slow-to-change publication, granting, and promotion systems that may hinder these efforts. For example, grant authorities may be reluctant to fund, and reviewers and editors may be reluctant to publish, perceiving that this research question has already been addressed and therefore replications hold no contribution. This proposed impact of hindsight bias on the estimation of replication outcome and the evaluation of contribution of replication studies awaits empirical tests. Initial findings regarding journals that offer Registered Reports, in which peer-reviewed pre-registrations are accepted for publication prior to data collection, both demonstrate these issues and show promise in addressing them (Chambers & Tzavella, 2020; Scheel, Schijen, & Lakens, 2021).
5. Current investigation: Two replications, extensions, and a new study
In this research, we conducted a close replication of hindsight bias in retrospective judgment (Study 1), a close replication of hindsight bias in prospective judgment (Study 2), and a study to examine possible hindsight bias regarding replicability of hindsight bias (Study 3).
We aimed to address mixed evidence regarding the magnitude and generalizability of hindsight bias. An early meta-analysis study conducted by Christensen-Szalanski and Willham (1991) on 122 studies on hindsight bias suggested a small effect size of d = 0.35, 95% confidence interval (CI) [0.28, 0.41] (sample-size corrected effect size d = 0.52, 95% CI [0.43, 0.61]). A more recent meta-analysis study based on 252 independent effect sizes revealed a similar sample-size-corrected effect of d = 0.39, 95% CI [0.36, 0.42] (Guilbault, Bryant, Brockway, & Posavac, 2004). In contrast to the two meta-analytical studies, the initial study of hindsight bias by Fischhoff (1975) suggested a much larger effect size (d = 1.13) for the supported contrasts between foresight and hindsight. A replication study of Fischhoff and colleagues' classic hindsight bias studies may help examine replicability of the effect using the same stimuli four and a half decades later, to provide an up-to-date estimate of the effect to help researchers design follow-up studies (Simons, Holcombe, & Spellman, 2014).
We aimed to revisit and examine the replicability of these classic findings, in response to calls for a credibility revolution after what was coined a “replication/reproducibility crisis” in psychology (e.g., Klein et al., 2018; Open, 2015) and science overall (Camerer et al., 2016; Camerer et al., 2018; Gelman & Loken, 2013; Ioannidis, 2005). Datasets and code for the three studies were shared on: https://osf.io/nrwpv/.
5.1. Two pre-registered close replications
We chose Experiment 2 in Fischhoff (1975) as a target for replication for three reasons. First, this article is one of the first rigorous demonstrations of hindsight bias (Fischhoff, 2007; Hoch & Loewenstein, 1989). At the time of writing, the article had 3073 citations according to Google Scholar. Second, the study was conducted in the 1970s and employed simplified statistics and reporting. By revisiting these classic methods and stimuli we aimed to refresh and update the methods and reporting to meet current best practices in psychological science. To our knowledge and based on our communication with the author, this study is the first direct replication of the target experiment.
We chose Slovic and Fischhoff's (1977) Experiment 1 for replication for three key reasons. First, this experiment investigates prospective judgments, in which participants predict the probability of outcomes in future trials. In such judgments, hindsight bias is thought to have occurred if the forecast of the probability in future trials is affected by the outcome knowledge of the initial trial. The article received much attention, with 531 citations according to Google Scholar at the time of writing. Examining prospective judgments is important because hindsight bias may lead to biases in generalized evaluations of research and investigations based on initial, preliminary findings (Slovic & Fischhoff, 1977). By examining both retrospective judgments (in Study 1) and prospective judgments (in Study 2), we aimed to provide a more complete view of how outcome knowledge affects judgments and decision making.
Second, although Davis and Fischhoff (2014) conducted a replication of the target experiment, we thought it worthwhile to conduct a preregistered replication by an independent external research team of no direct relationship with the original authors. As suggested by various replication protocols (e.g., KNAW: Royal Dutch Academy of Arts and Sciences, 2018; Simons et al., 2014), independent replications by researchers from a different team can help reduce biases and increase credibility. Our study also enforced a pre-registration which was not included in Davis and Fischhoff (2014) and was conducted on a larger sample (N = 608 versus N = 173 after filtering the responses from 95 participants who failed the attention checks). Pre-registration is increasingly seen as important in limiting researchers' degrees of freedom and protecting against hindsight fallacy, as it helps reduce the possibility of consciously or unconsciously modifying beliefs about the hypotheses and planned ways of handling the data collection and analysis.
Overall, the two close replications answer calls for more preregistered direct replication studies and open-science transparent reporting to increase the credibility and trustworthiness of published findings (Gelman & Loken, 2013; Munafò et al., 2017; Nosek & Lakens, 2014). Such efforts are particularly important in light of recent findings of lower-than-expected replicability rates of classic findings by mass preregistered replications (Camerer et al., 2018; Klein, Hardwicke, et al., 2018; Open, 2015).
Both replication experiments were pre-registered on the Open Science Framework prior to data collection (Study 1: https://osf.io/5bfjg; Study 2: https://osf.io/75h98).
5.2. Extensions: Surprise and overconfidence
In addition, we added several extensions. Although the role of surprise and (over)confidence in hindsight bias seems widely accepted, our knowledge about their effects is in fact limited. First, the relationship between receiving outcome knowledge and surprise about the outcome needs further clarification. Some studies found that participants with outcome knowledge were less surprised by the outcome compared to those without outcome knowledge (e.g., Slovic & Fischhoff, 1977), whereas other studies found surprise as a moderator of hindsight bias (e.g., Ofir & Mazursky, 1997). Second, there are multiple ways of manipulating and measuring surprise (e.g., high/low probability of the outcome, warning/no warning about a stimulus, congruence/incongruence with outcome expectation) (see Ash, 2009; Nestler & Egloff, 2009; Ofir & Mazursky, 1997; Pezzo, 2003; Slovic & Fischhoff, 1977), yet these are often disjointed. For example, Pezzo (2003) manipulated surprise by outcome feedback that was either congruent or incongruent with participants' expectation, yet found that “regardless of whether outcomes were generally congruent or incongruent, people who found them to be still surprising after 5 minutes of thought showed less hindsight bias” (p. 430). Third, theoretical arguments in past research suggested that surprise and confidence may mediate and/or moderate the relationship between hindsight (vs. foresight) and probability estimates, yet past studies seldom explicitly and systematically tested these mechanisms.
We therefore proposed extensions regarding the roles of surprise and confidence. In Study 1, we tested the mediating and moderating roles of surprise. In Study 2, we tested the mediating and moderating roles of surprise, overconfidence, and task difficulty.
5.3. New study: Hindsight bias over replicability of hindsight bias
The purpose of the third study was to examine hindsight bias regarding the perceived replicability of hindsight bias. In our other replication work, we are often faced with reviewers who argue that our replication findings were not surprising, regardless of whether they were successful or not, and who claim that our replications added nothing new. Study 3 aimed to show the importance and generalizability of hindsight studies to directly address these issues by testing whether, ironically, hindsight bias replications may themselves be subject to hindsight bias.
In this study, we asked participants to contemplate the study design of Fischhoff's (1975) Experiment 2 and to then estimate the probabilities of a successful replication and of a failed replication. If hindsight bias holds, then participants who were informed of the outcome of the replication study would estimate the probability of that outcome to be higher than participants who did not know the outcome and participants who were informed of the opposite outcome.
This study was pre-registered on the Open Science Framework prior to data collection (Study 3: https://osf.io/qyznw).
6. Study 1: Replicating Experiment 2 of Fischhoff (1975)
6.1. Target experiment and hypotheses
6.1.1. Replication: Retrospective hindsight bias
In Experiment 2 of Fischhoff (1975), 172 students from an introductory statistics class in an Israeli university participated in the study (details available in the Supplementary Materials). Participants first read a passage describing an event, and were then asked to estimate the probabilities of four possible outcomes for the event. Participants were randomly assigned to two types of conditions: those in the Before condition did not have any outcome knowledge (i.e., they did not know which of the four outcomes actually occurred), whereas those in the After conditions were given the outcome knowledge but were asked to estimate as if they had not known the outcome. Because for each event, there were four possible outcomes, there were four After conditions, with each condition stating that one of the presented outcomes had actually occurred. Despite being asked to ignore their knowledge of the outcome, participants in the After conditions estimated a higher probability for the outcome that they were told had occurred, demonstrating hindsight bias.
We made the following prediction for the replication study of Experiment 2 of Fischhoff (1975):
H1: Probability estimates (hindsight bias). Compared with participants in the Before condition, participants in the After conditions estimate a higher probability of the outcome that they knew had occurred.
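One common way to quantify the Before versus After contrast in H1 is a standardized mean difference (Cohen's d with a pooled standard deviation). The sketch below is only illustrative: the function name and the probability estimates are hypothetical, and it is not claimed to be the analysis script used in this replication.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def cohens_d(after: Sequence[float], before: Sequence[float]) -> float:
    """Standardized mean difference (after minus before) using the pooled SD."""
    n1, n2 = len(after), len(before)
    # statistics.variance returns the sample variance (n - 1 denominator).
    pooled_var = ((n1 - 1) * variance(after) + (n2 - 1) * variance(before)) / (n1 + n2 - 2)
    return (mean(after) - mean(before)) / sqrt(pooled_var)


# Hypothetical probability estimates (in %) for one outcome; not data from the study.
before_estimates = [20.0, 30.0, 25.0, 35.0, 40.0]
after_estimates = [35.0, 45.0, 40.0, 50.0, 55.0]

d = cohens_d(after_estimates, before_estimates)
print(round(d, 2))
```

With this sign convention, a positive d means that participants with outcome knowledge gave higher estimates for the reported outcome, which is the direction H1 predicts.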
6.1.2. Extension: Surprise
We proposed extension hypotheses regarding the processes leading to hindsight bias. Feelings of surprise signal the difficulty of generating alternatives to the outcome, increase the need to scrutinize the cognitive process, and deepen the extent of sense making after receiving outcome knowledge (Bernstein et al., 2016; Pezzo, 2003; Sanna & Schwarz, 2006). Our literature review suggested that surprise could play one or both of two roles in hindsight bias. The first role is an indicator or an accompanying outcome of hindsight bias. An implicit and untested inference of this line of reasoning is that surprise is an intermediate outcome in the cognitive processes leading to hindsight bias. For example, Slovic and Fischhoff (1977) suggested that hindsight bias occurred when outcome knowledge led individuals to feel less surprised and biased their probability estimates toward the known outcome. The second role is a required condition that shapes the magnitude of hindsight bias, or a moderator of hindsight bias. For example, Sanna and Schwarz (2006) argued that hindsight bias occurs when individuals feel the outcome is unsurprising, and it could reverse when individuals feel the outcome is surprising (i.e., the “I never would have known it” effect or the “backfire effect”; Hawkins & Hastie, 1990; Hoch & Loewenstein, 1989). Some models considered both roles of surprise simultaneously. For example, Pezzo's (2003) sense-making model suggested that a surprising outcome is required to trigger sense-making activities (surprise as a moderator); while the person might experience some initial surprise (surprise as a mediator), successful sense-making activities lead to hindsight bias and reduce end-state surprise feelings (surprise as an accompanying outcome).
We therefore tested three effects of surprise: as an outcome of experimental condition, as a mediator of the effect of experimental condition on probability estimates, and as a moderator of the effect of experimental condition on probability estimates.2 In order to test these effects, we asked participants to report their feelings of surprise about the outcome. We proposed that:
H2: Surprise ratings (extension).
H2a: Compared with participants in the Before condition, participants in the After conditions report lower levels of surprise regarding the outcome that they knew had occurred.
H2b: Surprise mediates the relationship between outcome knowledge and probability estimates (exploratory).
H2c: Surprise moderates the relationship between outcome knowledge and probability estimates, such that hindsight bias is stronger in the low-surprise group than in the high-surprise group (exploratory).
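For readers less familiar with the distinction between H2b and H2c, one conventional regression formulation is sketched below. The notation is assumed for illustration only (X for the Before/After condition, S for rated surprise, Y for the probability estimate, with intercepts i, slope coefficients a, b, c' and β, and error terms ε); it is not necessarily the exact specification used in the analyses.

```latex
% Mediation (H2b): condition affects surprise, which in turn affects the estimate.
\begin{align}
S &= i_1 + a X + \varepsilon_1 \\
Y &= i_2 + c' X + b S + \varepsilon_2 \qquad \text{(indirect effect: } ab\text{)}
\end{align}
% Moderation (H2c): the effect of condition depends on the level of surprise.
\begin{equation}
Y = i_3 + \beta_1 X + \beta_2 S + \beta_3 (X \times S) + \varepsilon_3
\end{equation}
```

Under this formulation, H2b concerns the product ab, whereas H2c concerns the interaction coefficient β3.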
6.2. Method
6.2.1. Power analysis
The planned sample size for the replication study was calculated based on an effect size of d = 1.13, 95% CI [0.44, 1.82] for a single before-after contrast, estimated from the target experiment (see Supplementary Materials for details). We conducted a power analysis using G*Power (Faul, Erdfelder, Buchner, & Lang, 2009). To achieve a statistical power of 95% with an alpha of 0.05 (two-tailed), a sample size of 46 per comparison would be required. Because the study adopted a between-subject design (4 events with 4 possible outcomes each), we approximated a total sample size of 46 × 4 × 4 = 736. In consultation with the original author and the editor, we removed the stimuli and results relating to Events C and D. We therefore updated this analysis post hoc to indicate a total required sample size of 368.
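The arithmetic behind the two totals can be sketched as follows. This is only an illustration: the per-comparison sample size of 46 is taken from the text as given (it is not recomputed here), and the function name is hypothetical.

```python
def total_sample_size(n_per_comparison: int, n_events: int, n_outcomes: int) -> int:
    """Total N when each event-by-outcome comparison needs n_per_comparison participants."""
    return n_per_comparison * n_events * n_outcomes


# Planned design: 4 events with 4 possible outcomes each.
planned = total_sample_size(46, n_events=4, n_outcomes=4)

# Updated design after Events C and D were removed: 2 events remain.
updated = total_sample_size(46, n_events=2, n_outcomes=4)

print(planned, updated)
```

Under these inputs the two calls return the totals reported in the text (736 and 368).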
6.2.2. Participants
A total of 442 American participants were recruited online from Amazon Mechanical Turk through CloudResearch (Litman, Robinson, & Abberbock, 2017) (245 females, 196 males, 1 undisclosed; Mage = 39.78, SDage = 11.46; see Supplementary Materials for details about sample characteristics; descriptives in this section were updated to reflect the exclusion of data collection for Events C and D, explained below).
6.2.3. Procedure and materials
The materials used in this replication study were obtained from the author of the target experiment (see Supplementary Materials). There were four events: Event A, the British-Gurka struggle; Event B, the near-riot in Atlanta; Event C, Mrs. Dewar in therapy; and Event D, George in therapy. We note that, in consultation with the original author and the editor, we removed the descriptions of the stimuli of Events C and D, and the related findings. We jointly and strongly believe that these stimuli should no longer be used in future research.

2 A variable can be both a mediator and a moderator of a relationship (James & Brett, 1984; Judd, Kenny & McClelland, 2001; Karazsia & Berlin, 2018). Such relationships have been tested in previous studies (e.g., Connor-Smith & Compas, 2002; Wei, Mallinckrodt, Russell & Abraham, 2004; Zhou, Wang, Chen & Shi, 2012).
Events A and B were each described in a passage ranging from 185 to 235 words in length, followed by four possible outcomes. For example, Event A described a war between the British and the Gurkas in South Asia in 1814. The four possible outcomes were: (1) British resulted in victory; (2) Gurka resulted in victory; (3) The two sides reached a military stalemate, but were unable to come to a peace settlement; (4) The two sides reached a military stalemate and came to a peace settlement.
This study used a between-subject design. Participants were randomly assigned to one of five experimental conditions: one Before condition and four After conditions (each associated with one informed outcome). Each participant was presented with one of the two events used in the target experiment. That is, participants were exposed to one of the 5 (condition) x 2 (event) possibilities. Participants in the Before condition read the assigned passage alone, whereas participants in the After conditions read the assigned passage followed by a sentence which provided the outcome knowledge (e.g., Outcome: British resulted in victory).
Participants were then asked a comprehension question, “To make sure you read and understood the scenario, please answer the following comprehension question: What was the outcome of the event?”. In order to proceed to the next stage of the experiment, participants in the Before condition had to choose “The case did not indicate the outcome,” whereas participants in the After conditions had to choose the informed outcome.
6.2.3.1. Probability estimates. Participants were asked to provide probability estimates for each of the four possible outcomes of the event. For the Before condition, the question read, “In light of the information appearing in the passage, please estimate the probability of occurrence of each of the four possible outcomes listed below. There are no right or wrong answers, answer based on your intuition. (The probabilities should sum to 100%)”. For the After conditions, in addition to the sentences above, participants also read “Answer as if you do not know the outcome, estimating the case at that time before outcomes were known.”
6.2.3.2. Surprise ratings. Following the probability estimates, participants were asked to rate their levels of surprise (i.e., “How surprised would you be if the outcome was that the (outcome)?”) on a 7-point Likert scale (1 = Not surprised at all, 7 = Very surprised). Participants in the Before condition were asked to rate their surprise levels regarding all four possible outcomes; participants in the After conditions were only asked to rate their surprise levels regarding the informed outcome.
6.2.4. Replication evaluation: Very close replication
Our replication study is a very close replication based on the criteria proposed in LeBel, Berger, Campbell, and Loving (2017) and LeBel, McCarthy, Earp, Elson, and Vanpaemel (2018). According to LeBel and colleagues' taxonomy, a very close replication shares the same independent variable (IV) operationalization, dependent variable (DV) operationalization, IV stimuli, and DV stimuli with the original study; only the procedural details, physical setting, and contextual variables (e.g., linguistic or cultural adaptations) differ from the original study. Similarly, Brandt et al. (2014, p. 218) wrote that “close replications refer to those replications that are based on methods and procedures as close as possible to the original study … ideally the only differences between the two are the inevitable ones (e.g., different participants…).” In Study 1, the IV operationalization, DV operationalization, IV stimuli, and DV stimuli were all the same as those used in the original study, with a few necessary adjustments to improve on the design or to accommodate


Table 1
Study 1: Classification of the replication, based on LeBel et al. (2018).

| Design facet | Replication |
|---|---|
| IV operationalization | Same |
| DV operationalization | Same |
| IV stimuli | Same |
| DV stimuli | Same |
| Procedural details | Similar |
| Physical settings | Different |
| Contextual variables | Similar |
| Replication classification | Very close replication |

Details of deviation:
• Changed the word “Negro” into “African American” in the passage of Event A.
• Added surprise measure after the replication.
• Used a larger sample size: Original study: 172; Replication study: 890.
• Added one comprehension question for each scenario.
• Added funnel questions at the end of the study.
• Changed from offline data collection (participants were students from Hebrew University and the University of the Negev) to online data collection (participants were recruited from CloudResearch).

Note. IV = Independent variable, DV = dependent variable.

contextual requirements. See Table 1 for a summary of classification, necessary adjustments, and theoretical extensions.
6.3. Results
6.3.1. Replication: Probability estimates
We summarized the descriptives of the probability estimates in Table 2. Violin plots of the probability estimates are available in Supplementary Materials. The numbers of interest are the probability estimate of an outcome in the Before condition, and the probability estimate of that same outcome in the After condition in which this outcome was informed to have occurred (numbers marked in bold).
Because there are two events with four outcomes each, we conducted 8 sets of Mann-Whitney U tests. As shown in Table 3, in 7 of the 8 sets of comparison (except Event A-Outcome 2), the mean probability estimates in the After condition were higher than those in the Before condition. The results remained largely the same when we adjusted the p values using the Benjamini and Hochberg (1995) false discovery rate control method.
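The false discovery rate adjustment mentioned above can be sketched in a few lines. This is a minimal illustration of the Benjamini and Hochberg (1995) step-up adjustment, not the authors' analysis script; the function name is hypothetical, and the value 0.001 is an assumed placeholder for the p values that Table 3 reports only as "<0.001" (the exact values are not given in the text).

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the original order."""
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, keeping adjusted values monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Table 3 p values in row order (Event A Outcomes 1-4, Event B Outcomes 1-4).
# ASSUMPTION: 0.001 stands in for the entries reported as "<0.001".
table3_p = [0.001, 0.277, 0.032, 0.009, 0.001, 0.001, 0.001, 0.041]
adjusted = benjamini_hochberg(table3_p)
print([round(p, 3) for p in adjusted])
```

Because the placeholder entries are the four smallest p values, they do not affect the adjusted values of the other four comparisons, which can be compared against the padjusted column of Table 3.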
Historically, the correct outcomes of Events A and B were Outcome 1, yet the mean probability estimates of these two outcomes in the Before condition were not higher than chance (21.40% and 7.46%, respectively). Specifically, the probability estimate for Outcome 1 (British

Table 2
Study 1: Means and standard deviations of probability estimates.

| Event | Experimental condition | Sample size | Outcome informed | Outcome 1 evaluated: Mean | SD | Outcome 2 evaluated: Mean | SD | Outcome 3 evaluated: Mean | SD | Outcome 4 evaluated: Mean | SD |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Event A: British-Gurka struggle | Before | 43 | None | **21.40** | 18.17 | **38.61** | 26.60 | **23.49** | 19.93 | **16.51** | 15.53 |
| | After | 45 | Outcome 1 | **45.51** | 28.59 | 21.18 | 19.45 | 19.69 | 16.25 | 13.62 | 11.46 |
| | After | 42 | Outcome 2 | 26.05 | 20.35 | **43.62** | 23.62 | 18.48 | 18.66 | 11.86 | 9.52 |
| | After | 44 | Outcome 3 | 21.93 | 17.13 | 23.18 | 16.14 | **31.59** | 19.61 | 23.30 | 14.10 |
| | After | 43 | Outcome 4 | 25.49 | 17.84 | 28.40 | 22.72 | 18.72 | 15.97 | **27.40** | 23.98 |
| Event B: Near-riot in Atlanta | Before | 46 | None | **7.46** | 9.25 | **25.91** | 23.88 | **12.91** | 18.43 | **53.72** | 26.66 |
| | After | 46 | Outcome 1 | **25.44** | 23.11 | 22.63 | 17.58 | 22.28 | 21.88 | 29.65 | 18.76 |
| | After | 44 | Outcome 2 | 11.61 | 12.50 | **50.02** | 29.13 | 9.52 | 10.34 | 28.84 | 22.18 |
| | After | 44 | Outcome 3 | 15.23 | 13.64 | 17.50 | 12.60 | **29.77** | 28.53 | 37.50 | 24.53 |
| | After | 45 | Outcome 4 | 9.87 | 12.18 | 12.98 | 12.24 | 11.20 | 16.82 | **65.96** | 27.76 |

Note: The bolded numbers indicate the key sets of comparison of interest (i.e., the Before and After probability estimates of the same outcome). The foresight ratings of all four outcomes came from the same participants in the foresight condition. The hindsight ratings of the four outcomes came from participants in the four hindsight conditions, respectively. Following a discussion with the lead original author and the editor, Events C and D (about therapy) have been removed from reporting due to problematic stimuli in the target article.

Table 3
Study 1: Mann-Whitney U tests of probability estimates difference between before and after conditions.

| After - Before | Mean Difference (Rank) | U | z | p | padjusted | r | ϕ | 95% CI for ϕ: LL | 95% CI for ϕ: UL | d | 95% CI for d: LL | 95% CI for d: UL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Event A Outcome 1 | 23.0 | 462 | 4.24 | <0.001 | <0.001 | 0.45 | 0.76 | 0.65 | 0.84 | 1.00 | 0.53 | 1.46 |
| Event A Outcome 2 | 5.8 | 780 | 1.09 | 0.277 | 0.277 | 0.12 | 0.57 | 0.45 | 0.68 | 0.20 | −0.23 | 0.63 |
| Event A Outcome 3 | 11.5 | 695 | 2.15 | 0.032 | 0.043 | 0.23 | 0.63 | 0.51 | 0.74 | 0.41 | −0.02 | 0.84 |
| Event A Outcome 4 | 14.0 | 624.5 | 2.62 | 0.009 | 0.014 | 0.28 | 0.66 | 0.54 | 0.76 | 0.54 | 0.10 | 0.97 |
| Event B Outcome 1 | 26.7 | 444 | 4.87 | <0.001 | <0.001 | 0.51 | 0.79 | 0.68 | 0.87 | 1.02 | 0.56 | 1.48 |
| Event B Outcome 2 | 24.6 | 459.5 | 4.48 | <0.001 | <0.001 | 0.47 | 0.77 | 0.66 | 0.85 | 0.91 | 0.45 | 1.36 |
| Event B Outcome 3 | 20.9 | 543 | 3.82 | <0.001 | <0.001 | 0.40 | 0.73 | 0.62 | 0.82 | 0.71 | 0.26 | 1.14 |
| Event B Outcome 4 | 11.3 | 778.5 | 2.05 | 0.041 | 0.047 | 0.21 | 0.62 | 0.50 | 0.73 | 0.45 | 0.03 | 0.87 |

Note. We calculated three effect sizes of the Mann-Whitney U tests, which are r (the correlation between being in the hindsight condition and winning in the rank comparison with the other condition, see Fritz, Morris, & Richler, 2012), ϕ (the probability that a score in the hindsight condition was higher than that in the foresight condition, see Fay & Malinovsky, 2018), and Cohen's d (the standard difference in the mean ranking between the hindsight condition and the foresight condition, assuming that the rankings follow a normal distribution, see Cohen, 1988). p values were adjusted using the Benjamini and Hochberg (1995) false discovery rate control method. LL = lower limit, UL = upper limit. Following a discussion with the lead original author and the editor, Events C and D (about therapy) have been removed from reporting due to problematic stimuli in the target article.


Table 4
Study 1 Extension: Means and standard deviations of surprise ratings.

| Event | Experimental condition | Outcome 1 evaluated: n | Mean | SD | Outcome 2 evaluated: n | Mean | SD | Outcome 3 evaluated: n | Mean | SD | Outcome 4 evaluated: n | Mean | SD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Event A: British-Gurka struggle | Before | 43 | 4.35 | 2.14 | 43 | 3.95 | 2.16 | 43 | 3.42 | 1.76 | 43 | 4.53 | 1.84 |
| | After | 45 | 3.20 | 2.00 | 42 | 4.10 | 1.88 | 44 | 3.41 | 1.76 | 43 | 4.60 | 1.55 |
| Event B: Near-riot in Atlanta | Before | 46 | 5.89 | 1.55 | 46 | 2.78 | 1.55 | 46 | 5.46 | 1.57 | 46 | 1.96 | 1.38 |
| | After | 46 | 5.17 | 1.70 | 44 | 2.91 | 1.65 | 44 | 5.36 | 1.94 | 45 | 1.91 | 1.44 |

Note. The foresight ratings of all four outcomes came from the same participants in the foresight condition. The hindsight ratings of the four outcomes came from participants in the four hindsight conditions, respectively. Hindsight participants only rated their surprise over the outcome which they knew had occurred. Following a discussion with the lead original author and the editor, Events C and D (about therapy) have been removed from reporting due to problematic stimuli in the target article.

resulted in victory) in Event A (Before condition) was not significantly different from chance (one-sample t-test: t = −1.30, df = 42, p = .200, d = −0.20). The probability estimate for Outcome 1 (dispersion and no outbreak of violence) in Event B (Before condition) was the lowest among those for all four outcomes, and it was significantly smaller than chance (one-sample t-test: t = −12.87, df = 45, p < .001, d = −1.90). These results suggest that the participants did not have much knowledge about the historical background of these two events, relieving the concern that prior knowledge gained before participating in this study impacted participants' reactions to these two experimental stimuli. Importantly, as Event B is the only event that is linked to American history, the findings address the concern that using an American sample (versus the Israeli sample used in the original study) reduced the task difficulty of this question or impacted the magnitude of hindsight bias.
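The comparison against chance can be approximated from the summary statistics in Table 2. The sketch below is an illustration only, under two assumptions: chance is taken as 100% / 4 = 25% for four possible outcomes, and d is computed as the mean difference divided by the sample standard deviation. The function name is hypothetical, and values recomputed from rounded descriptives may differ from the reported ones in the last decimal.

```python
import math


def one_sample_t(mean, sd, n, mu0):
    """One-sample t statistic, degrees of freedom, and Cohen's d from summary statistics."""
    standard_error = sd / math.sqrt(n)
    t = (mean - mu0) / standard_error
    d = (mean - mu0) / sd
    return t, n - 1, d


# ASSUMPTION: chance level for four possible outcomes is 25%.
chance = 100 / 4

# Event A, Outcome 1, Before condition (mean, SD, and n from Table 2).
t, df, d = one_sample_t(mean=21.40, sd=18.17, n=43, mu0=chance)
print(round(t, 2), df, round(d, 2))
```

For Event A this reproduces the reported t, df, and d; the p value would additionally require the t distribution, which is not part of the Python standard library and is therefore left out.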
Because Mann-Whitney U tests are nonparametric, we calculated three effect sizes: (1) r, the correlation between experimental group membership and whether the rank is higher or lower than the other group (see Fritz et al., 2012), (2) ϕ, the probabilistic index reflecting the likelihood that the score in one group is smaller than or equal to that of the other group, estimated using the receiver operating characteristic curve under the proportional odds assumption (see Fay & Malinovsky, 2018), and (3) Cohen's d, the standard difference between the mean rankings of the two groups, assuming that the rankings in the two groups follow a normal distribution (Cohen, 1988).
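Two of these effect sizes can be illustrated from the statistics in Table 3. The sketch below is not the authors' code. It assumes r = z / √N (the conversion described by Fritz et al., 2012) and, as a reading of the table rather than a documented step, that the point estimate of ϕ equals 1 − U / (n1 × n2) with U as reported. The confidence intervals from Fay and Malinovsky's (2018) method and the rank-based Cohen's d are not reproduced here, and the function name is hypothetical.

```python
import math


def rank_effect_sizes(u, z, n_before, n_after):
    """Return (r, phi) for a Mann-Whitney U comparison from its summary statistics."""
    n_total = n_before + n_after
    # r: standardized test statistic scaled by the square root of the total sample size.
    r = z / math.sqrt(n_total)
    # ASSUMPTION: phi is one minus the proportion of Before-vs-After pairs counted in U.
    phi = 1 - u / (n_before * n_after)
    return r, phi


# Event A, Outcome 1: U and z from Table 3, group sizes from Table 2.
r, phi = rank_effect_sizes(u=462, z=4.24, n_before=43, n_after=45)
print(round(r, 2), round(phi, 2))
```

With these inputs the rounded values agree with the r and ϕ entries for Event A Outcome 1 in Table 3; the same check works for Event B Outcome 1 (U = 444, z = 4.87, 46 participants per group).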
As shown in Table 3, the correlations (r) between being in the hindsight condition and winning in the rank comparison with the other condition were all positive. The sizes of the correlations were mostly medium to large (Cohen, 1988). The confidence intervals of the effect sizes ϕ, reflecting the probability that a score in the hindsight condition was higher than that in the foresight condition, excluded 0.50 in all but one set of comparisons (i.e., Event A-Outcome 2). However, when we calculated the Cohen's ds under the assumption of a normal distribution of the rankings, two comparisons had confidence intervals that overlapped with the null (i.e., Event A-Outcome 2, Event A-Outcome 3). The Cohen's d effects were mostly medium to large.

6.3.2. Robustness checks: Alternative tests and exclusion criteria
To examine the robustness of the findings, we conducted additional analyses on the probability estimates (see Supplementary Materials). Results of Student's independent samples t-tests of probability estimates were largely consistent with the results of the Mann-Whitney U tests. When we analyzed the data with only participants who met a set of preregistered criteria (i.e., understood the English used in the study, were serious in the study, and did not correctly guess the purpose of the study), the results regarding the probability estimates remained mostly the same. We concluded robust support for Hypothesis 1.
6.3.3. Extension: Surprise ratings
We detailed the descriptives of the surprise ratings in Table 4. Violin plots of the surprise ratings are available in Supplementary Materials.
Similar to previous analyses with probability estimates, we conducted 8 sets of Mann-Whitney U tests to compare the differences in surprise ratings between the Before condition and the After conditions. As shown in Table 5, a total of two sets of comparisons were significant, based on p value and the confidence interval of ϕ. Specifically, for Event A Outcome 1 and Event B Outcome 1, surprise ratings in the After condition were significantly lower than those in the Before condition, and the effect sizes were small to medium. The results of the other three sets of comparison (Event C-Outcome 2, Event C-Outcome 4, Event D-Outcome 2) were in the opposite direction of our prediction, with the surprise ratings in the After condition being higher than those in the Before condition (small to medium effect sizes). When we adjusted the p values using the Benjamini and Hochberg (1995) false discovery rate control method, none of the Mann-Whitney U tests remained significant. Results of Student's independent samples t-tests of surprise ratings (see Supplementary Materials) were largely consistent with the results of the Mann-Whitney U tests. Overall, the results provided little to no support for Hypothesis 2(a) regarding surprise ratings.
We found no support for exploratory Hypothesis 2b that surprise acted as a mediator of the relationship between outcome knowledge and probability estimates. We found mixed support for exploratory

Table 5
Study 1: Extension: Mann-Whitney U tests of differences in surprise between Before and After conditions.

| After - Before | Mean Difference (Rank) | U | z | p | padjusted | r | ϕ | 95% CI for ϕ: Lower | 95% CI for ϕ: Upper | d | 95% CI for d: Lower | 95% CI for d: Upper |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Event A Outcome 1 | −13.76 | 665 | −2.56 | 0.011 | 0.044 | −0.27 | 0.34 | 0.24 | 0.46 | −0.56 | −0.99 | −0.12 |
| Event A Outcome 2 | 1.67 | 867.5 | 0.32 | 0.752 | 0.897 | 0.03 | 0.52 | 0.40 | 0.64 | 0.07 | −0.36 | 0.50 |
| Event A Outcome 3 | −0.69 | 931 | −0.13 | 0.897 | 0.897 | −0.01 | 0.49 | 0.38 | 0.61 | −0.01 | −0.43 | 0.42 |
| Event A Outcome 4 | 1.86 | 884.5 | 0.35 | 0.725 | 0.897 | 0.04 | 0.52 | 0.40 | 0.64 | 0.04 | −0.38 | 0.46 |
| Event B Outcome 1 | −13.80 | 740.5 | −2.57 | 0.010 | 0.044 | −0.27 | 0.35 | 0.25 | 0.46 | −0.44 | −0.86 | −0.02 |
| Event B Outcome 2 | 1.40 | 980.5 | 0.26 | 0.795 | 0.897 | 0.03 | 0.52 | 0.40 | 0.63 | 0.08 | −0.34 | 0.49 |
| Event B Outcome 3 | 1.91 | 969 | 0.36 | 0.719 | 0.897 | 0.04 | 0.52 | 0.41 | 0.63 | −0.05 | −0.47 | 0.36 |
| Event B Outcome 4 | −1.03 | 1011.5 | −0.21 | 0.833 | 0.897 | −0.02 | 0.49 | 0.39 | 0.59 | −0.03 | −0.44 | 0.38 |

Note. p values were adjusted using the Benjamini and Hochberg (1995) false discovery rate control method. Following a discussion with the lead original author and the editor, Events C and D (about therapy) have been removed from reporting due to problematic stimuli in the target article.


Table 6
Study 1: Comparison of results of the original study and the replication study.

| Study / comparison | Cohen's d [95% CI] | p-value | Note |
|---|---|---|---|
| Fischhoff (1975) | 1.13 [0.44, 1.82] | <0.001 | |
| Replication | | | |
| Event A Outcome 1 | 1.00 [0.53, 1.46] | <0.001 | Signal – consistent |
| Event A Outcome 2 | 0.20 [−0.23, 0.63] | 0.277 | No signal – inconsistent, smaller |
| Event A Outcome 3 | 0.41 [−0.02, 0.84] | 0.032 | No signal – inconsistent, smaller |
| Event A Outcome 4 | 0.54 [0.10, 0.97] | 0.009 | Signal – inconsistent, smaller |
| Event B Outcome 1 | 1.02 [0.56, 1.48] | <0.001 | Signal – consistent |
| Event B Outcome 2 | 0.91 [0.45, 1.36] | <0.001 | Signal – consistent |
| Event B Outcome 3 | 0.71 [0.26, 1.14] | <0.001 | Signal – consistent |
| Event B Outcome 4 | 0.45 [0.03, 0.87] | 0.041 | Signal – inconsistent, smaller |

Note: Following a discussion with the lead original author and the editor, Events C and D (about therapy) have been removed from reporting due to problematic stimuli in the target article. According to LeBel et al. (2019), there is a signal if the confidence interval of the replication effect size excludes zero, and the replication result is considered consistent with the original study if the confidence interval of the replication effect size includes the effect size of the original study.

Hypothesis 2c that surprise acted as a moderator, such that the relationship between outcome knowledge and probability estimates was stronger when surprise was lower rather than higher. However, in our original analysis when all four events were included, we did not find support for the moderating effect of surprise. While we have decided to remove results related to Events C and D, which is a deliberate deviation from the preregistration, we caution our readers about the conflicting findings of the moderating effect of surprise in Study 1 when different

events were included in the analysis. We provided all related details and analyses in the Supplementary Materials.
6.4. Discussion
We aimed to replicate Experiment 2 of Fischhoff (1975), a classic study of hindsight bias. Following the original study, we hypothesized that participants provided with outcome knowledge would estimate a greater probability for the outcome which they knew had occurred, compared to participants without outcome knowledge. This hypothesis was supported in 7 of the 8 sets of comparisons of probability estimates, and the effect sizes were mostly medium to large. Once participants were informed of the outcome, they perceived the outcome to be more probable, even when they were asked to ignore it, demonstrating hindsight bias. These findings therefore support the idea that participants were either unaware of or unable to resist the influence of outcome knowledge.
6.4.1. Evaluation of replication findings: Mostly successful replication
In Table 6 we compared the results of the target experiment and the replication study using the criteria described in LeBel, Vanpaemel, Cheung, and Campbell (2019). All 8 sets of comparisons of probability estimates were in the same direction as in the original study. The replication effects were medium to large, though slightly smaller than those found in the original study. In 4 of the 8 sets of probability estimates comparisons, the confidence intervals of the effect sizes (Cohen's ds) of the replication study included d = 1.13, which is the effect size estimated from the target experiment. In Fig. 1 we provided a forest plot of the probability estimates contrasts. Overall, we consider this replication of hindsight bias successful.

Fig. 1. Study 1: forest plot for probability estimates.
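The classification rule stated in the note to Table 6 can be written as a short check. This is a sketch of that rule as worded in the note, not code from LeBel et al. (2019) or from the authors. The function name and return format are hypothetical, and the "larger" branch is an assumption added by symmetry (no comparison in Table 6 falls into it).

```python
def classify_replication(ci_low, ci_high, original_d):
    """Apply the signal/consistency rule from the note to Table 6.

    Signal: the replication CI excludes zero.
    Consistent: the replication CI includes the original effect size.
    """
    signal = not (ci_low <= 0 <= ci_high)
    if ci_low <= original_d <= ci_high:
        consistency = "consistent"
    elif ci_high < original_d:
        consistency = "inconsistent, smaller"
    else:
        # ASSUMPTION: mirror case, not observed in Table 6.
        consistency = "inconsistent, larger"
    return signal, consistency


ORIGINAL_D = 1.13  # effect size estimated from the target experiment

# 95% CIs of Cohen's d for three of the comparisons in Table 6.
replication_cis = {
    "Event A Outcome 1": (0.53, 1.46),
    "Event A Outcome 2": (-0.23, 0.63),
    "Event B Outcome 4": (0.03, 0.87),
}

for name, (low, high) in replication_cis.items():
    print(name, classify_replication(low, high, ORIGINAL_D))
```

Running the same function over all eight confidence intervals in Table 6 gives four "consistent" classifications, matching the count reported in the text.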
6.4.2. Extension: Surprise ratings
Beyond the replication, we extended the experiment by investigating an intuitive yet understudied dependent variable, the level of surprise associated with the known outcome. Judging from null hypothesis significance testing (NHST), effect sizes, and confidence intervals, 2 of the 8 sets of surprise ratings comparisons were significant in the predicted direction.
Contrary to our expectations, we found no support for surprise as a mediator in the relationship between outcome knowledge and probability estimates. Additional analyses showed that surprise ratings and probability estimates were indeed negatively correlated, both in the Before condition and in the After conditions (see Supplementary Materials). These results suggest that the negative correlation between surprise ratings and probability estimates may be caused by factors other than hindsight bias. Also, our findings were inconclusive for the exploratory hypothesis that surprise acted as a moderator of the relationship between outcome knowledge and probability estimates.
7. Study 2: Replicating Experiment 1 of Slovic and Fischhoff (1977)
7.1. Target experiment and hypotheses
7.1.1. Replication: Prospective hindsight bias
In Experiment 1 of Slovic and Fischhoff (1977), 184 American participants were recruited via a university newspaper. All participants read four vignettes about scientific research. For each vignette, participants in the foresight condition read that two outcomes were possible in the first trial, whereas participants in the hindsight condition read that the first trial had been conducted and one of the two outcomes had occurred. They were then asked why they thought the outcome(s) might occur, and then predicted the probability that the previously observed outcome would repeat in future research trials. The results suggested a sense of inevitability of the disclosed outcome among hindsight participants: their predicted probabilities that the previously observed outcome would repeat were higher than those of participants in the foresight condition (d = 0.36). Davis and Fischhoff (2014) replicated this experiment and found similar effects (overall effect: 0.27–0.33, d = 0.20 to 0.44): the disclosed outcome of the initial trial was perceived to be more likely to occur in future trials in hindsight than in foresight.
We extended the original design and conducted exploratory analyses regarding the mechanisms underlying hindsight bias, using a different set of materials and decisions (i.e., prospective judgments). In addition to surprise, we asked participants to report their levels of confidence about the accuracy of their own judgments. To better understand if the nature of the task would have an impact on hindsight bias, we also measured participants' overall levels of perceived difficulty of the prediction task.
We followed Experiment 1 in Slovic and Fischhoff (1977) to predict that hindsight bias would be observed in prospective judgments. Individuals often use past information to form judgments about the future (Aarts, Verplanken, & Van Knippenberg, 1998; Ouellette & Wood, 1998). If individuals' beliefs about past events changed due to outcome knowledge, then those changed beliefs may trigger hindsight bias when people use them to make prospective judgments. In addition, knowing the outcome of the initial trial may increase the perceived inevitability of the outcome, which will increase the expectation that the outcome will repeatedly occur in the future. Therefore, we predicted:
H3: Participants in the hindsight condition estimate a greater probability that the outcome will continue to occur in future trials, compared with participants in the foresight condition.

7.2. Extension: Surprise, confidence, and task difficulty
For the extension hypotheses, we first examined the effects of surprise and confidence. By surprise, we refer to individuals' feelings of surprise if a particular outcome would occur in future trials (Slovic & Fischhoff, 1977). By confidence, we refer to individuals' feelings of confidence about the accuracy of their own judgments (Granhag, Strömwall, & Allwood, 2000). We chose to study these two factors because they have been suggested as mechanisms that affect hindsight bias: beliefs about events' objective likelihoods, and subjective beliefs about one's own prediction ability (Roese & Vohs, 2012).
As in Study 1, we hypothesized that surprise ratings are lower among participants in the hindsight condition than those in the foresight condition. We also tested the hypothesis that surprise mediates or moderates the relationship between hindsight condition and probability estimates as in Study 1.
H4: Surprise ratings (extension).
(H4a) Participants in the hindsight conditions report lower levels of surprise regarding the outcome that they knew had initially occurred, compared with participants in the foresight condition.
(H4b) Surprise mediates the relationship between the hindsight condition and probability estimates. (exploratory)
(H4c) Surprise moderates the relationship between hindsight condition and probability estimates, such that hindsight bias is stronger in the low-surprise group than in the high-surprise group. (exploratory)
As with surprise, past research has theorized and examined multiple roles that confidence can play in hindsight bias. For example, overconfidence is often proposed as a consequence of outcome knowledge (Davis & Fischhoff, 2014; Slovic, Lichtenstein, & Fischhoff, 1988). Other studies examined the moderating role of confidence in hindsight bias. For example, Arkes, Wortmann, Saville, and Harkness (1981) found that a procedure to reduce overconfidence by asking for reasons for each possible outcome reduced hindsight bias. Also, Werth and Strack (2003) found that the magnitude of hindsight bias was contingent on the feeling of confidence, which served as a signal of whether the individual would have known the answer or not. They found that participants who experienced higher confidence showed greater hindsight bias than participants who experienced lower confidence.
Therefore, we hypothesized that participants in the hindsight condition will report greater confidence about the accuracy of their estimation than participants in the foresight condition. Furthermore, as with surprise, we examined whether confidence mediates or moderates the relationship between hindsight condition and probability estimates.
H5: Confidence ratings (extension).
(H5a) In prospective judgments, compared with participants in the foresight condition, participants in the hindsight conditions report higher levels of confidence about the accuracy of their judgments.
(H5b) Confidence mediates the relationship between hindsight condition and probability estimates. (exploratory)
(H5c) Confidence moderates the relationship between hindsight condition and probability estimates, such that hindsight bias is stronger in the high-confidence group than in the low-confidence group. (exploratory)
To examine the effect of the characteristics of the task, we also measured the extent to which participants perceived the task to be difficult. We expected that participants in the hindsight condition will report lower levels of task difficulty than participants in the foresight condition. This is because the foresight condition could dilute participants' attention by asking them to consider two outcomes simultaneously, whereas the hindsight condition could cue participants to ignore the outcome that did not occur in the initial trial (Slovic & Fischhoff, 1977). Lower levels of perceived task difficulty, in turn, may contribute to hindsight bias, as the subjective difficulty to generate alternative outcomes can be taken as an indication that those outcomes are implausible (Harley et al., 2004; Roese & Vohs, 2012; Sanna & Schwarz, 2006). We therefore tested the following:
H6: Task difficulty (exploratory extension).


Table 7
Study 2: Questions asked in the virgin rat scenario.

Foresight condition
1. Try and estimate, what are the probabilities of the following outcomes (these probabilities should total 100%)
Virgin rat will exhibit maternal behavior: _______
Virgin rat will NOT exhibit maternal behavior: _______
Total: ________
2. If the virgin rat does exhibit maternal behavior, what is the probability that in a replication of this experiment with 10 additional virgin female rats (these probabilities should total 100%)
a. All will exhibit maternal behavior?: _______
b. Some will exhibit maternal behavior?: _______
c. None will exhibit maternal behavior?: _______
Total: ________
3. If the virgin rat does exhibit maternal behavior, how surprised would you be? 1 = Not surprised at all … 5 = Extremely surprised
4. If the virgin rat does NOT exhibit maternal behavior, what is the probability that in a replication of this experiment with 10 additional virgin female rats (these probabilities should total 100%)
a. All will exhibit maternal behavior?: _______
b. Some will exhibit maternal behavior?: _______
c. None will exhibit maternal behavior?: _______
Total: ________
5. If the virgin rat does NOT exhibit maternal behavior, how surprised would you be? 1 = Not surprised at all … 5 = Extremely surprised
6. How confident are you about the accuracy of your predictions on the probability of the future outcomes of the Virgin Rat experiment? 0 = Extremely not confident … 6 = Extremely confident

Hindsight outcome A condition
Outcome: The initial virgin rat exhibited maternal behavior in the first trial.
1. What is the probability that in a replication of this experiment with 10 additional virgin female rats (these probabilities should total 100%)
a. All will exhibit maternal behavior?: _______
b. Some will exhibit maternal behavior?: _______
c. None will exhibit maternal behavior?: _______
Total: ________
2. Do you think the finding that the virgin rat exhibited maternal behavior is surprising? 1 = Not surprising at all … 5 = Extremely surprising
3. How confident are you about the accuracy of your predictions on the probability of the future outcomes of the Virgin Rat experiment? 0 = Extremely not confident … 6 = Extremely confident

Hindsight outcome B condition
Outcome: The initial virgin rat did NOT exhibit maternal behavior in the first trial.
1. What is the probability that in a replication of this experiment with 10 additional virgin female rats (these probabilities should total 100%)
a. All will exhibit maternal behavior?: _______
b. Some will exhibit maternal behavior?: _______
c. None will exhibit maternal behavior?: _______
Total: ________
2. Do you think the finding that the virgin rat did not exhibit maternal behavior is surprising? 1 = Not surprising at all … 5 = Extremely surprising
3. How confident are you about the accuracy of your predictions on the probability of the future outcomes of the Virgin Rat experiment? 0 = Extremely not confident … 6 = Extremely confident

For all three conditions, after reading all four scenarios
How difficult was it to make estimations of outcomes probabilities? 1 = Extremely easy … 7 = Extremely difficult

Note. Questions italicized in the table are the extension questions; they were not italicized in the Qualtrics survey.


Table 8 Study 2: Classification of the Replication, based on LeBel et al. (2018)

Design facet                  Replication
IV operationalization         Same
DV operationalization         Same
IV stimuli                    Same
DV stimuli                    Similar
Procedural details            Similar
Physical settings             Different
Contextual variables          Different
Replication classification    Very close replication

Details of deviation
• Changed outcome B in the Y-Test scenario from “Places in Area B” to “Places in Area C,” so that outcome A and outcome B were symmetric.
• Removed reasons for why the outcome had occurred.
• Added surprise, confidence, and task difficulty measures.
• Used a larger sample size: Original study: 184 (sample size per group varied from 24 to 37); Replication study: 604 (197 foresight, 204 hindsight outcome A, 203 hindsight outcome B).
• Added one comprehension question for each scenario.
• Added funnel questions at the end of the study.
• Changed from offline data collection (participants were recruited via a student newspaper at the University of Oregon) to online data collection (participants were recruited from CloudResearch).

Note. IV = Independent variable, DV = dependent variable.

(H6a) In prospective judgments, compared with participants in the foresight condition, participants in the hindsight condition report lower levels of task difficulty.
(H6b) Task difficulty mediates the relationship between hindsight condition and probability estimates.
(H6c) Task difficulty moderates the relationship between hindsight condition and probability estimates, such that hindsight bias is stronger among those who perceive the task to be easy than among those who perceive the task to be difficult.
7.3. Method
7.3.1. Power analysis
The planned sample size for the replication study was estimated from the target experiment (see Supplementary Materials for details). We estimated the effect sizes based on p values, because they were the only statistics available from the target experiment. The p values of the pairwise comparisons were reported at three thresholds: 0.001, 0.01, and 0.05. We chose p = .05, which led to d = 0.36, 95% CI [0.00, 0.72]. We conducted a power analysis using G*Power (Faul et al., 2009). To achieve a statistical power of 95% with an alpha of 0.05 (one-tailed), a sample size of at least 168 people would be required for each condition, totaling a sample size of 504 for the three conditions: foresight, hindsight outcome A, and hindsight outcome B. In anticipation of unexpected situations such as careless responses, and to make sure that our study would be over-powered, we planned to recruit about ten more participants per comparison.
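The per-condition sample size can be approximated by hand. The sketch below is an illustration only: it uses the standard normal approximation for a two-group comparison rather than the exact noncentral-t computation a dedicated power tool performs, and the function name `n_per_group` is hypothetical. Under these assumptions the approximation gives the same 168 per condition for d = 0.36, although exact methods can differ by a participant or two for other inputs.

```python
import math
from statistics import NormalDist


def n_per_group(d, alpha, power, two_tailed=False):
    """Approximate n per group for a two-sample comparison (normal approximation)."""
    z = NormalDist()
    tail_alpha = alpha / 2 if two_tailed else alpha
    z_alpha = z.inv_cdf(1 - tail_alpha)   # critical value for the test
    z_beta = z.inv_cdf(power)             # quantile for the desired power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)


# Inputs reported in the text: d = 0.36, alpha = .05 (one-tailed), power = .95
n = n_per_group(d=0.36, alpha=0.05, power=0.95)
total = 3 * n  # foresight, hindsight outcome A, hindsight outcome B
print(n, total)
```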
7.3.2. Participants
A total of 604 American participants were recruited online through CloudResearch (300 females, 302 males, 2 undisclosed, Mage = 38.5, SDage = 12.00, see Supplementary Materials for details about sample characteristics). We did not allow participants who took part in Study 1 to take part in Study 2.
7.3.3. Procedure and materials
The study used a between-subject design. Participants were randomly assigned to one of three conditions. In the foresight condition, participants were not presented with any outcomes of an initial trial. In the hindsight conditions, because there were two possible outcomes for each scientific trial scenario, half of the participants read that outcome A had occurred in the initial trial (hindsight outcome A condition), and the other half read that outcome B had occurred in the initial trial (hindsight outcome B condition). All participants read all four scenarios: virgin rat, hurricane seeding, gosling imprinting, and Y test, shown in a random order.
The descriptions of the four scenarios were adapted from Slovic and Fischhoff's (1977) Experiment 1 on hindsight bias (see Supplementary Materials for full materials). We use the virgin rat scenario to illustrate the materials and the question format:
Virgin Rat. Several researchers intend to perform the following experiment: They will inject blood from a mother rat into a virgin rat immediately after the mother rat has given birth. After the injection, the virgin rat will be placed in a cage with the newly born baby rats, after removal of the actual mother. The possible outcomes were: (a) the virgin rat exhibited maternal behavior, or (b) the virgin rat failed to exhibit maternal behavior.
Following each scenario, participants were required to correctly answer comprehension questions before proceeding to the next stage of the study. For the virgin rat scenario, the comprehension question was, “Which rat will be placed in a cage with the newly born baby?” The correct answer was “Virgin rat with mother rat blood injection.”
Then, participants were asked questions measuring probability estimates (of the initial trial for the foresight condition, and of the future trials for both the foresight and hindsight conditions), followed by our extension questions measuring surprise and confidence. We present the questions for the virgin rat scenario in Table 7.
7.3.3.1. Probability estimates of future trials. Participants were asked to estimate the probability that the outcome would occur in “all,” “some,” and “none” (or “A,” “B,” and “C” for the Y-test scenario) of future trials. The percentages of the three items (“all,” “some,” and “none”) needed to add up to 100%. Participants in the foresight condition were asked to rate the probabilities of two possible outcomes; participants in the hindsight conditions were only asked to rate the outcome which they knew had occurred in the initial trial.
7.3.3.2. Extension: Surprise ratings. Following the probability estimates, participants were asked to rate their levels of surprise regarding the outcome(s) (i.e., “Do you think the (outcome) is surprising?”) on a 5-point Likert scale (1 = not surprising at all, 5 = extremely surprising). Participants in the foresight condition were asked to rate the levels of surprise regarding two possible outcomes; participants in the hindsight conditions were only asked to rate the outcome which they knew had occurred in the initial trial.
7.3.3.3. Confidence ratings. For each scenario, participants were asked to rate their confidence (i.e., “How confident are you about the accuracy of your predictions on the probability of the future outcomes of the (scenario)?”) on a 7-point Likert scale (0 = extremely not confident, 6 = extremely confident).
7.3.3.4. Task difficulty. After reading all four scenarios, participants were required to rate the difficulty of the prediction task (i.e., “How difficult was it to make estimations of outcomes probabilities?”) on a 7-point Likert scale (1 = extremely easy, 7 = extremely difficult).

Table 9 Study 2: Mean Probabilities in Future Trials (in percentage %).

Initial result and kind of replication     Foresight               Hindsight
                                           N     Mean    SD        N     Mean    SD
Virgin rat experiment
 Outcome A: Shows maternal behavior
  a. All show maternal behavior** †        197   29.16   28.09     204   38.42   29.19
  b. Some show maternal behavior           197   34.57   26.04     204   36.58   25.37
  c. None show maternal behavior***        197   36.27   31.44     204   25.00   26.04
 Outcome B: Fails to show maternal behavior
  a. All show maternal behavior            197   17.73   23.68     203   13.89   21.81
  b. Some show maternal behavior           197   28.08   23.90     203   25.90   23.56
  c. None show maternal behavior †         197   54.20   32.83     203   60.21   33.18
Hurricane seeding experiment
 Outcome A: Intensity increases
  a. All increase †                        197   47.74   30.13     204   49.35   28.73
  b. Some increase                         197   33.80   24.37     204   34.99   24.98
  c. None increase                         197   18.45   20.60     204   15.66   18.59
 Outcome B: Intensity weakens
  a. All weaken †                          197   29.59   25.52     203   34.00   26.39
  b. Some weaken**                         197   34.51   23.60     203   41.24   25.47
  c. None weaken***                        197   35.91   30.19     203   24.77   24.50
Gosling imprinting experiment
 Outcome A: Approaches duck
  a. All approach duck* †                  197   39.14   27.63     204   45.26   30.62
  b. Some approach duck                    197   38.50   25.96     204   39.63   27.93
  c. None approach duck***                 197   22.36   24.58     204   15.10   17.73
 Outcome B: Approaches goose
  a. All approach goose** †                197   38.10   30.38     203   46.39   33.13
  b. Some approach goose                   197   38.98   27.09     203   36.42   27.95
  c. None approach goose*                  197   22.92   24.71     203   17.19   21.90
Y-test experiment
 Outcome A: Places dot in Area A
  a. Places in Area A †                    197   59.62   23.92     204   61.96   22.66
  b. Places in Area B                      197   13.90   14.67     204   15.80   17.53
  c. Places in Area C*                     197   26.48   17.98     204   22.24   16.21
 Outcome B: Places dot in Area C
  a. Places in Area A                      197   51.54   24.18     203   47.52   23.36
  b. Places in Area B                      197   14.68   15.04     203   13.76   14.84
  c. Places in Area C* †                   197   33.78   21.56     203   38.73   22.70

Note. Options and numbers marked in bold (marked † here) represent the kind of replication that was reported to have occurred in the initial trial (hindsight) or could possibly occur in the initial trial (foresight). The foresight ratings of both outcome A and outcome B came from the same participants in the foresight condition. The hindsight ratings came from participants in the hindsight outcome A condition or the hindsight outcome B condition, respectively. *p < .05, **p < .01, ***p < .001.
7.3.4. Replication evaluation: Very close replication
Our replication study is a very close replication based on the criteria proposed in LeBel et al. (2017) and LeBel et al. (2018). Our IV operationalization and DV operationalization were the same as those used in the original study. For IV stimuli, we made the necessary adjustment to change outcome B in the Y-Test scenario from “Places in Area B” to “Places in Area C,” so that outcome A and outcome B were symmetric. For DV stimuli, we removed the request for writing down the reasons for why the outcome had occurred, in order to reduce the time required for the experiment in an online setting where participants might have shorter focus than when they were in a physical laboratory. These adjustments were necessary and did not fundamentally change the stimuli used in the replication study. We therefore consider this replication a very close replication of the original study. See Table 8 for a summary of classification, necessary adjustments, and theoretical extensions.

Table 10 Study 2: Independent Samples Student’s T-Tests of Probability Estimates between Foresight and Hindsight (Outcome A/B) Conditions.

Hindsight vs. Foresight                   Mean                                                Cohen’s d 95% CI
                                          Difference  t      df   p       padjusted Cohen’s d Lower   Upper
Virgin rat experiment
 Outcome A: Shows maternal behavior
  a. All show maternal behavior** †       9.26        3.24   399  0.001   0.006     0.32      0.12    0.52
  b. Some show maternal behavior          2.01        0.78   399  0.434   0.521     0.08      -0.12   0.27
  c. None show maternal behavior*** a     -11.27      -3.92  399  <0.001  <0.001    -0.39     -0.59   -0.19
 Outcome B: Fails to show maternal behavior
  a. All show maternal behavior           -3.83       -1.69  398  0.093   0.159     -0.17     -0.37   0.03
  b. Some show maternal behavior          -2.18       -0.92  398  0.359   0.453     -0.09     -0.29   0.10
  c. None show maternal behavior †        6.01        1.82   398  0.069   0.151     0.18      -0.02   0.38
Hurricane seeding experiment
 Outcome A: Intensity increases
  a. All increase †                       1.61        0.55   399  0.584   0.637     0.05      -0.14   0.25
  b. Some increase                        1.18        0.48   399  0.632   0.659     0.05      -0.15   0.24
  c. None increase                        -2.79       -1.43  399  0.155   0.248     -0.14     -0.34   0.05
 Outcome B: Intensity weakens
  a. All weaken †                         4.41        1.70   398  0.090   0.159     0.17      -0.03   0.37
  b. Some weaken**                        6.73        2.74   398  0.006   0.029     0.27      0.08    0.47
  c. None weaken*** a                     -11.14      -4.06  398  <0.001  <0.001    -0.41     -0.61   -0.21
Gosling imprinting experiment
 Outcome A: Approaches duck
  a. All approach duck* †                 6.12        2.10   399  0.036   0.086     0.21      0.01    0.41
  b. Some approach duck                   1.13        0.42   399  0.674   0.674     0.04      -0.15   0.24
  c. None approach duck*** a              -7.26       -3.40  399  0.001   0.006     -0.34     -0.54   -0.14
 Outcome B: Approaches goose
  a. All approach goose** a †             8.29        2.61   398  0.009   0.036     0.26      0.06    0.46
  b. Some approach goose                  -2.56       -0.93  398  0.353   0.453     -0.09     -0.29   0.10
  c. None approach goose*                 -5.73       -2.46  398  0.014   0.042     -0.25     -0.44   -0.05
Y-test experiment
 Outcome A: Places dot in Area A
  a. Places in Area A †                   2.34        1.00   399  0.316   0.446     0.10      -0.10   0.30
  b. Places in Area B a                   1.90        1.18   399  0.240   0.360     0.12      -0.08   0.31
  c. Places in Area C*                    -4.24       -2.48  399  0.013   0.042     -0.25     -0.45   -0.05
 Outcome B: Places dot in Area C
  a. Places in Area A                     -4.02       -1.69  398  0.091   0.159     -0.17     -0.37   0.03
  b. Places in Area B                     -0.93       -0.62  398  0.535   0.611     -0.06     -0.26   0.13
  c. Places in Area C* †                  4.95        2.23   398  0.026   0.069     0.22      0.03    0.42

Note. Bolded options (marked † here) indicate the pairs of comparisons of interest. a Levene’s test was significant. *p < .05, **p < .01, ***p < .001. p values were adjusted using the Benjamini and Hochberg (1995) false discovery rate control method.
7.4. Results
7.4.1. Probability estimates
We summarized the descriptive statistics of probability estimates in Table 9. Violin plots of the probability estimates are available in Supplementary Materials. As there were four scenarios (virgin rat, hurricane seeding, gosling imprinting, Y-test), two possible outcomes (A or B) for the initial trial, and three possible outcomes of future trials (all, some, none for the first three scenarios; A, B, C for the Y-test scenario), we conducted 24 sets of independent samples Student's t-tests.
Eight of these comparisons were of key interest and are bolded in Tables 9 and 10. For the virgin rat, hurricane seeding, and gosling imprinting scenarios, among the three options (i.e., all, some, and none repetition), we were particularly interested in the probability estimates for repetition in all future trials. For the Y-test scenario with only one future trial, we were interested in the probability estimate of the dot being placed in the same area as in the initial trial.
As shown in Table 10, in four of the eight comparisons, the probability estimates in the hindsight condition were higher than those in the foresight condition, demonstrating hindsight bias. In the other four sets of comparison, the differences in the probability estimates between the hindsight condition and the foresight condition were weaker.
Overall, the results provide moderate support for Hypothesis 3. The effects in all eight sets of comparisons were in the direction of participants in the hindsight condition providing higher estimates than those in the foresight condition, although there were variations depending on the scenario and the outcome.
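As an illustration of how each comparison can be checked from the summary statistics alone, the sketch below recomputes a pooled-variance Student's t and Cohen's d for the virgin rat, Outcome A, “all show maternal behavior” comparison, using the means, SDs, and group sizes reported in Table 9. The function name `pooled_t_and_d` is hypothetical, and small differences from the reported values are expected because the inputs are rounded to two decimals.

```python
import math


def pooled_t_and_d(m1, sd1, n1, m2, sd2, n2):
    """Student's t (equal variances assumed) and Cohen's d from summary statistics."""
    df = n1 + n2 - 2
    pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)
    d = (m1 - m2) / pooled_sd
    t = d / math.sqrt(1 / n1 + 1 / n2)
    return t, df, d


# Hindsight outcome A (M = 38.42, SD = 29.19, n = 204) vs. foresight (M = 29.16, SD = 28.09, n = 197)
t, df, d = pooled_t_and_d(38.42, 29.19, 204, 29.16, 28.09, 197)
print(f"t({df}) = {t:.2f}, d = {d:.2f}")
```

The result is close to the t(399) = 3.24 and d = 0.32 reported for this comparison in Table 10.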
7.4.2. Extension: Surprise ratings
We summarized the descriptives of surprise ratings in Table 11, and the violin plots are available in the Supplementary Materials. Similar to previous analyses for probability estimates, we conducted eight sets of independent samples Student's t-tests to compare the surprise ratings in the foresight and hindsight conditions.
As shown in Table 12, three of the eight sets of comparison of surprise ratings were in support of hindsight bias: hurricane seeding-outcome B, gosling imprinting-outcome B, and Y-test-outcome B. Overall, the results provide some support for Hypothesis 4(a) regarding surprise ratings.
7.4.3. Extension: Confidence ratings
As shown in Table 12, only one of the eight sets of comparison supported a difference in the confidence ratings between the foresight condition and the hindsight condition: virgin rat scenario-Outcome B. The results for the virgin rat-Outcome A were contrary to our expectation. All other confidence ratings comparison sets had much weaker effects. We concluded that the results provide no support for Hypothesis 5(a) regarding confidence ratings.
7.4.4. Task difficulty
We conducted an independent samples Student's t-test to examine the difference in the perceived task difficulty. Participants in the hindsight outcome A condition (M = 4.41, SD = 1.61) reported lower levels of task difficulty than participants in the foresight condition (M = 4.98, SD = 1.43), t(399) = −3.79, p < .001, d = −0.38, 95% CI [−0.58, −0.18]. Similarly, participants in the hindsight outcome B condition (M = 4.40, SD = 1.51) reported lower levels of task difficulty than participants in the foresight condition (M = 4.98, SD = 1.43), t(398) = −3.98, p < .001, d = −0.40, 95% CI [−0.60, −0.20]. Overall, we conclude strong support for Hypothesis 6(a) that participants in the hindsight conditions perceived the task to be less difficult than participants in the foresight condition.
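The confidence intervals around Cohen's d can be approximated from d and the two group sizes. The sketch below assumes the common large-sample standard error for d, which may not be the exact method used for the reported intervals; the function name `d_confidence_interval` is hypothetical. It is applied to the hindsight outcome A vs. foresight comparison of task difficulty (d = −0.38, n = 204 and 197).

```python
import math
from statistics import NormalDist


def d_confidence_interval(d, n1, n2, level=0.95):
    """Approximate CI for Cohen's d using the large-sample standard error."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return d - z * se, d + z * se


lo, hi = d_confidence_interval(-0.38, 204, 197)
print(f"95% CI [{lo:.2f}, {hi:.2f}]")
```

Under this approximation the interval agrees, to two decimals, with the reported 95% CI [−0.58, −0.18].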
7.4.5. Robustness checks: Alternative tests and exclusion criteria
To examine the robustness of the findings, we conducted additional analyses (see Supplementary Materials for details). First, we tested Hypotheses 3, 4(a), 5(a), and 6(a) using Mann-Whitney U tests, and the results were highly similar to those obtained using Student's independent samples t-tests. Second, when we analyzed the data with only participants who met a set of pre-registered exclusion criteria (i.e., self-reported English proficiency and seriousness, and guessing study purpose), we found little to no differences.
7.4.6. Mediation and moderation analyses
We tested the mediation and the moderation hypotheses (see Supplementary Materials for details). Surprise partially mediated the relationship between hindsight (vs. foresight) and probability estimates, supporting H4(b), and confidence moderated the relationship between hindsight (vs. foresight) and probability estimates, supporting H5(c). We found no support for the mediating effects of confidence in H5(b) or task difficulty in H6(b), and no support for the moderating effects of surprise in H4(c) or task difficulty in H6(c).

Table 11 Study 2: Means and Standard Deviations of Surprise Ratings and Confidence Ratings.

                     Outcome A                         Outcome B
                     Foresight       Hindsight         Foresight       Hindsight
Scenario             Mean    SD      Mean    SD        Mean    SD      Mean    SD
Surprise
 Virgin rat          3.13    1.40    2.93    1.25      1.75    1.05    1.57    0.95
 Hurricane seeding   2.03    1.14    2.13    1.19      3.01    1.26    2.67    1.16
 Gosling imprinting  2.20    1.21    2.08    1.10      2.16    1.14    1.90    1.13
 Y-test              1.81    1.06    1.66    0.95      2.46    1.17    2.14    1.01
Confidence
 Virgin rat          3.61    1.56    3.17    1.58      3.61    1.56    3.91    1.50
 Hurricane seeding   3.27    1.68    3.39    1.61      3.27    1.68    3.25    1.45
 Gosling imprinting  3.41    1.62    3.49    1.53      3.41    1.62    3.67    1.48
 Y-test              3.52    1.47    3.63    1.47      3.52    1.47    3.34    1.41

Note. Surprise ratings: 1 = not surprising at all, 5 = extremely surprising. Confidence ratings: 0 = extremely not confident, 6 = extremely confident. The foresight ratings of both outcome A and outcome B came from the same participants in the foresight condition. The hindsight ratings came from participants in the hindsight outcome A condition or the hindsight outcome B condition, respectively. Hindsight participants only rated their surprise over the outcome which they knew had occurred in the initial trial.


Table 12 Study 2: Independent samples Student’s t-tests of surprise and confidence ratings between foresight and hindsight conditions.

                                                                  95% CI of d
Hindsight vs. Foresight     t        df    p      padjusted  d      Lower   Upper
Surprise
 Outcome A
  a. Virgin rat             -1.48a   399   .140   .187       -.15   -.35    .05
  b. Hurricane seeding      .88      399   .382   .382       .09    -.11    .29
  c. Gosling imprinting     -.67     399   .320   .366       -.10   -.30    .10
  d. Y-test                 -1.54    399   .124   .187       -.15   -.29    -.01
 Outcome B
  a. Virgin rat             -1.79    398   .074   .148       -.18   -.38    .02
  b. Hurricane seeding**    -2.82    398   .005   .020       -.28   -.48    -.08
  c. Gosling imprinting*    -2.30    398   .022   .059       -.23   -.43    -.03
  d. Y-test**               -2.92a   398   .004   .020       -.29   -.49    -.09
Confidence
 Outcome A
  a. Virgin rat**           -2.79    399   .006   .048       -.28   -.48    -.08
  b. Hurricane seeding      .75      399   .454   .605       .07    -.13    .27
  c. Gosling imprinting     .50      399   .616   .704       .05    -.15    .25
  d. Y-test                 .78      399   .436   .605       .08    -.12    .28
 Outcome B
  a. Virgin rat*            1.98     398   .049   .196       .20    .002    .40
  b. Hurricane seeding      -.14a    398   .885   .885       -.01   -.21    .19
  c. Gosling imprinting     1.70     398   .091   .243       .17    -.03    .37
  d. Y-test                 -1.20    398   .232   .464       -.12   -.32    .08

Note. a Levene’s test was significant. * p < .05, ** p < .01, *** p < .001. p values were adjusted using the Benjamini and Hochberg (1995) false discovery rate control method.
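For readers unfamiliar with the Benjamini and Hochberg (1995) adjustment, the following minimal sketch implements the standard step-up procedure and applies it to the eight unadjusted surprise p values listed in Table 12. Treating those eight tests as one family is an assumption made here for illustration, and the function name `benjamini_hochberg` is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity of adjusted values
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Unadjusted surprise p values: Outcome A (a-d), then Outcome B (a-d)
surprise_p = [0.140, 0.382, 0.320, 0.124, 0.074, 0.005, 0.022, 0.004]
adjusted = [round(p, 3) for p in benjamini_hochberg(surprise_p)]
print(adjusted)
```

Under that assumption, the output matches the adjusted surprise p values shown in the table (.187, .382, .366, .187, .148, .020, .059, .020).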

7.5. Discussion

We aimed to replicate Slovic and Fischhoff's (1977) Experiment 1, a study of hindsight bias in prospective judgments. In line with the findings in the original study, we found support for our predictions in four of the eight sets of comparison. Overall, our findings provide moderate support for hindsight bias in prospective judgments.

7.5.1. Replication: Mostly successful
We compared the results of the target experiment and the replication study based on the criteria described in LeBel et al. (2019). As summarized in Table 13 and Fig. 2, in four of the eight sets of probability estimates comparison, we found signals for successful replication. The effect sizes observed in the replication study were similar to those of the target experiment for one outcome, smaller for two outcomes, and larger for one outcome. Overall, we conclude that this was a mostly successful replication.
8. Study 3: Predictions on the replicability of Fischhoff (1975)
8.1. Design and procedure
In this study, we asked participants to predict the replicability of Experiment 2 of Fischhoff (1975) and expected hindsight bias over the replicability of hindsight bias.
All participants first read a brief introduction to the main findings of Experiment 2 of Fischhoff (1975). To ease participants' understanding, we 1) removed “Experiment 2” and simply used “Fischhoff (1975)” in this introduction, and 2) focused only on the results about probability estimates in Fischhoff (1975). Participants were then randomly assigned to one of three conditions: Foresight, Hindsight Outcome Success, and Hindsight Outcome Fail. Those in the Foresight condition were told that a group of researchers intended to conduct a replication of Fischhoff (1975), and there were two possible outcomes: successful replication or failed replication. In addition, those in the Hindsight Outcome Success condition were told that the outcome of the replication was successful; those in the Hindsight Outcome Fail condition were told that the outcome of the replication was a failed replication. All participants were asked to write down the reasons for a successful replication and the reasons for a failed replication. They then provided probability estimates of successful and failed replications. They also answered questions about surprise, confidence, and task difficulty.
8.2. Hypotheses
Because Study 2 replicated the finding that people tend to use the results of past findings to predict future research outcomes, we expected that:
H7: Participants in the Foresight condition will predict the probability of a successful replication to be higher than chance (50%).
Table 13 Study 2: Comparison of Results in the Original Study and the Replication Study.

                           p-value     Original effect:  p-value       Replication effect:
Scenario                   original    Cohen's d a       replication   Cohen's d [95% CI]    Replication summary
Slovic & Fischhoff, 1977   < 0.05      0.36 [0, 0.72]
Present Study
 Virgin Rat A              < 0.05      0.36              0.001         0.32 [0.12, 0.52]     Signal – consistent
 Virgin Rat B              > 0.05      0                 0.069         0.18 [−0.02, 0.38]    No signal – consistent
 Hurricane Seeding A       < 0.001     0.61              0.584         0.05 [−0.14, 0.25]    No signal – inconsistent
 Hurricane Seeding B       < 0.05      0.36              0.090         0.17 [−0.03, 0.37]    No signal – consistent
 Gosling Imprinting A      < 0.001     0.61              0.036         0.21 [0.01, 0.41]     Signal – inconsistent, smaller
 Gosling Imprinting B      > 0.05      0                 0.009         0.26 [0.06, 0.46]     Signal – inconsistent, larger
 Y-Test A                  < 0.001     0.61              0.316         0.10 [−0.10, 0.30]    No signal – inconsistent
 Y-Test B                  < 0.001     0.61              0.026         0.22 [0.03, 0.42]     Signal – inconsistent, smaller

Note: a. Estimated using largest possible p-values (e.g., 0.001 if p < .001; 0.05 if p < .05; 0.99 if p > .05; see the power analysis in the Supplementary Materials for details).

Fig. 2. Study 2: Forest Plot of the Effect Size of Probability Estimates.

In addition, as suggested by previous research on hindsight bias, outcome knowledge might bias probability estimates toward the known outcome. If participants' probability estimates are influenced by knowledge about the replication outcome, then those who were informed of a successful replication would perceive a successful replication to be more probable than those who did not have outcome knowledge, whereas those who were informed of a failed replication would perceive a successful replication to be less probable than those who did not have outcome knowledge. Such hindsight bias may occur through cognitive processes such as memory impairment, biased reconstruction, sense-making, and meta-cognitive experiences, as well as social-motivational processes to increase perceived controllability and enhance self-image (Blank et al., 2007). For example, information about a successful replication may impact the person's memory by strengthening the association between relevant cues (e.g., the type of study to be replicated and the research question) and the outcome of a successful replication, or by unconsciously overwriting old knowledge with the newly informed knowledge (e.g., Blank & Nestler, 2007; Hoffrage et al., 2000; Pohl et al., 2003).
Hence, presenting evidence regarding hindsight bias will result in participants in the Hindsight Outcome Success condition predicting the highest probability for successful replication, followed by participants in the Foresight condition, and lastly participants in the Hindsight Outcome Fail condition.
Therefore:
H8: Participants in the Hindsight Outcome Success condition estimate the probability of a successful replication to be higher than that estimated by participants in the Hindsight Outcome Fail condition.
H9: Participants in the Hindsight conditions estimate a greater probability for the informed outcome of replication, compared with participants in the Foresight condition.
8.3. Method
8.3.1. Power analysis
The planned sample size for the replication study was calculated based on pretests indicating an effect size of d = 0.4 (see Supplementary Materials for details). A power of 95% with an alpha of 0.05 (two-tailed) required a sample size of 164 people for each condition, totaling a sample size of 492. We collected slightly more responses to address the possibility of unexpected exclusions.

Table 14 Study 3: Mean Estimations of Outcomes of a Replication of Fischhoff (1975) (in percentage %).

                                            Hindsight Outcome Success:   Hindsight Outcome Fail:
                           Foresight        Successful Replication       Failed Replication
                           (n = 154)        (n = 178)                    (n = 188)
                           Mean      SD     Mean      SD                 Mean        SD
Estimated probabilities
 a. Successful replication 65.36 a   18.08  73.07 b   17.46              52.22 c     22.62
 b. Failed replication     34.64 a   18.08  26.93 b   17.46              47.78 c     22.62
Surprise
 a. Successful replication 2.22 a    1.28   2.16 a    1.24               2.42 a      1.26
 b. Failed replication     3.06 a    1.13   3.38 b    1.12               2.89 a,c    1.14
Confidence                 3.99 a    1.29   4.18 a    1.30               3.64 b      1.39
Task difficulty            3.98 a    1.66   3.89 a    1.73               4.19 a      1.58

Note. *p < .05, **p < .01, ***p < .001. Means with different superscripts (a, b, c) were significantly different from each other.

8.3.2. Participants
A total of 520 American participants were recruited online through CloudResearch (228 females, 289 males, 3 undisclosed, Mage = 38.96, SDage = 12.18, see Supplementary Materials for details about sample characteristics).


Table 15 Study 3: Independent Samples Student's T-Tests of Estimations of Outcomes of a Replication of Fischhoff (1975).

                                                        Mean                                          95% CI of Cohen's d
Hindsight vs. Foresight                                 Difference  t        df    p        Cohen's d  Lower    Upper
Estimated probabilities of successful replication
 Hindsight Outcome Success vs. Foresight                7.71        3.95     330   <0.001   0.43       0.21     0.65
 Hindsight Outcome Fail vs. Foresight                   −13.15      −5.84    340   <0.001   −0.64      −0.86    −0.41
 Hindsight Outcome Success vs. Hindsight Outcome Fail   20.85       9.84 a   364   <0.001   1.03       0.80     1.26
Surprise about successful replication
 Hindsight Outcome Success vs. Foresight                −0.06       −0.42    330   0.677    −0.05      −0.27    0.17
 Hindsight Outcome Fail vs. Foresight                   0.20        1.45     340   0.149    0.16       −0.05    0.37
 Hindsight Outcome Success vs. Hindsight Outcome Fail   −0.26       −1.97    364   0.050    −0.21      −0.42    0.00
Surprise about failed replication
 Hindsight Outcome Success vs. Foresight                0.32        2.56     330   0.011    0.28       0.06     0.50
 Hindsight Outcome Fail vs. Foresight                   −0.16       −1.33    340   0.184    −0.14      −0.35    0.07
 Hindsight Outcome Success vs. Hindsight Outcome Fail   0.48        4.07     364   <0.001   0.43       0.22     0.64
Confidence
 Hindsight Outcome Success vs. Foresight                0.19        1.31     330   0.192    0.14       −0.08    0.36
 Hindsight Outcome Fail vs. Foresight                   −0.35       −2.40    340   0.017    −0.26      −0.47    −0.04
 Hindsight Outcome Success vs. Hindsight Outcome Fail   0.54        3.80     364   <0.001   0.40       0.19     0.61
Task difficulty
 Hindsight Outcome Success vs. Foresight                −0.09       −0.50    330   0.620    −0.05      −0.27    0.17
 Hindsight Outcome Fail vs. Foresight                   0.21        1.17     340   0.243    0.13       −0.08    0.34
 Hindsight Outcome Success vs. Hindsight Outcome Fail   −0.30       −1.73    364   0.085    −0.18      −0.39    0.03

Note. a. Levene's test was nonsignificant for all comparisons.

8.3.3. Procedure and materials
The study used a between-subject design. Participants were randomly assigned to one of three conditions. In the Foresight condition, participants did not receive any knowledge about the actual outcome of the replication study. In the hindsight conditions, because there were two possible outcomes for each scientific trial scenario, half of the participants read that the replication was successful (Hindsight Outcome Success condition), and the other half read that replication failed (Hindsight Outcome Fail condition). Following the information, participants were required to correctly answer two comprehension questions before proceeding to the next stage of the study. Participants then responded to two open-ended questions asking the reasons for successful or failed replications.
8.3.4. Probability estimates of replication outcomes
Participants were then asked to provide probability estimates for both Outcome A (the hindsight bias effect will be successfully replicated) and Outcome B (the hindsight bias effect will fail to replicate). In the Foresight condition, the instructions were: “In light of the information appearing in the paragraphs provided, please estimate the probabilities of occurrence of the two possible outcomes in the replication study. There are no right or wrong answers, answer based on your intuition. (The probabilities should sum to 100%).” In the Hindsight conditions, the instructions contained an additional sentence: “Answer as if you do not know the outcome, estimating the probabilities at that time before the replication study was launched.”
8.3.5. Surprise, confidence, and task difficulty ratings: exploratory
We added exploratory measures of surprise, confidence, and task difficulty. Exploratory hypotheses and findings are reported in the supplementary.
Participants were asked to rate their surprise about both Outcome A and Outcome B, confidence about the accuracy of their estimation, and perceived task difficulty. Measures of surprise, confidence, and task difficulty were similar or identical to those used in Study 2.
8.4. Results
We summarized the descriptive statistics of probability estimates, surprise, confidence, and task difficulty in Table 14. Violin plots of these variables are available in Supplementary Materials.
We conducted a one-sample t-test to test H7. We found that the probability estimates for a successful replication (MeanProb = 65.36%, S.D.Prob = 18.08%) were higher than chance (50%), t(153) = 10.55, p < .001, d = 0.85. We concluded support for H7.
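As a quick arithmetic check on the H7 test, the reported statistics can be approximately recovered from the summary values alone. The sketch below is illustrative and is not the authors' analysis code: the function name `one_sample_t_from_summary` is hypothetical, and the sample size of 154 is an assumption inferred from the reported degrees of freedom, t(153).

```python
import math


def one_sample_t_from_summary(mean, sd, n, mu0):
    """Return (t, df, d) for a one-sample t-test computed from summary statistics.

    Hypothetical helper for illustration only.
    """
    d = (mean - mu0) / sd   # Cohen's d for a one-sample design
    t = d * math.sqrt(n)    # same as (mean - mu0) / (sd / sqrt(n))
    return t, n - 1, d


# n = 154 is an assumption inferred from the reported df of 153.
t, df, d = one_sample_t_from_summary(mean=65.36, sd=18.08, n=154, mu0=50.0)
print(f"t({df}) = {t:.2f}, d = {d:.2f}")
```

With these inputs the sketch gives t(153) of about 10.54 and d of about 0.85. The small gap from the reported t = 10.55 is what one would expect when recomputing from a mean and standard deviation that were rounded to two decimals.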
We conducted independent samples t-tests to test H8 and H9. As shown in Table 15, participants who were informed of Outcome Success estimated a successful replication to be more probable than participants who were informed of Outcome Fail, t(364) = 9.84, p < .001, Cohen's d = 1.03, 95% CI [0.80, 1.26]. In addition, participants who were informed of Outcome Success estimated a successful replication to be more probable than participants who did not know the outcome, t(330) = 3.95, p < .001, Cohen's d = 0.43, 95% CI [0.21, 0.65]. In contrast, participants who were informed of Outcome Fail estimated a successful replication to be less probable than participants who did not know the outcome, t(340) = −5.84, p < .001, Cohen's d = −0.64, 95% CI [−0.86, −0.41]. The results therefore provided strong support for H8 and H9.
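The relationship between the reported t statistics and effect sizes for H8 and H9 can be illustrated with the standard conversion d = t × sqrt(1/n1 + 1/n2) for independent samples. This is a minimal sketch under stated assumptions, not the authors' code: the helper `cohens_d_from_t` is hypothetical, and the group sizes are not stated in this section but are inferred from the reported degrees of freedom (df = n1 + n2 − 2 for each comparison, and n = 154 for the Foresight group from t(153)).

```python
import math


def cohens_d_from_t(t, n1, n2):
    """Convert an independent-samples t statistic into Cohen's d (pooled SD).

    Hypothetical helper for illustration only.
    """
    return t * math.sqrt(1.0 / n1 + 1.0 / n2)


# Group sizes are ASSUMPTIONS inferred from the reported degrees of freedom.
n_foresight = 153 + 1                    # from t(153)
n_success = (330 + 2) - n_foresight      # from t(330), Success vs. Foresight
n_fail = (340 + 2) - n_foresight         # from t(340), Fail vs. Foresight

# Consistency checks against the third reported df and the Study 3 sample size.
assert n_success + n_fail - 2 == 364
assert n_foresight + n_success + n_fail == 520

comparisons = {
    "Success vs. Fail": (9.84, n_success, n_fail),
    "Success vs. Foresight": (3.95, n_success, n_foresight),
    "Fail vs. Foresight": (-5.84, n_fail, n_foresight),
}
results = {name: cohens_d_from_t(*args) for name, args in comparisons.items()}

for name, d in results.items():
    print(f"{name}: d = {d:.2f}")
```

Under these inferred group sizes, the recovered effect sizes agree with the reported values (1.03, 0.43, and −0.64) to within about 0.01, with the remaining differences attributable to rounding of the t statistics. The confidence intervals are not reproduced here because the section does not state the method used to compute them.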
8.5. Robustness checks
To examine the robustness of the findings, we conducted additional analyses (see Supplementary Materials for details). When we analyzed the data with only participants who met a set of pre-registered exclusion criteria (i.e., self-reported English proficiency and seriousness, and guessing study purpose), we found little to no differences between the results with the full sample and the results after exclusion.
8.6. Exploratory extensions
We found some support for the mediating and moderating roles of surprise over the alternative outcome in the relationship between the Hindsight Outcome Success condition and probability estimates of Outcome A. However, there was no support for any other hypothesized mediating or moderating effects, so overall we concluded weak to no support for mediation or moderation. Hypotheses, analyses, and results are provided in the supplementary.
8.7. Discussion
We found strong support for hindsight bias regarding the replicability of hindsight bias. First, after being presented with an outcome of Fischhoff's (1975) original study, participants estimated the probability of a successful replication to be higher than chance. Second, participants' probability estimates of a certain outcome were higher when they knew the outcome than when they did not know the outcome.
9. General Discussion
We conducted very close replications of Experiment 2 in Fischhoff (1975) and Experiment 1 in Slovic and Fischhoff (1977), and found support for hindsight bias in both retrospective and prospective judgments. In retrospective judgments (Study 1: replication of Fischhoff, 1975), participants were asked to predict the probability of an outcome in a past event. Compared to participants who had no knowledge about the actual outcome of the event, participants who knew the actual outcome estimated the probability of the actual outcome to be higher, even if they were asked to estimate as if they did not know the actual outcome. In prospective judgments (Study 2: replication of Slovic & Fischhoff, 1977), participants were told that researchers had conducted an initial trial of an experiment, and would conduct either one or multiple trials of the same kind in the future. The participants' job was to predict the outcome of those future trials. Compared to participants who had no knowledge of the actual outcome of the initial trial, participants who knew the actual outcome of the initial trial predicted the probability of the actual outcome in future trials to be higher.
Building on these two replication studies, we added a third study to examine hindsight bias in estimating the replicability of hindsight bias. Our findings suggest that estimates of replication outcomes were heavily influenced by outcome knowledge. Overall, participants predicted a successful replication for Fischhoff (1975). The probability estimates of a successful replication were highest among those who were informed of a successful replication, moderate among those who were not informed of an outcome, and lowest among those who were informed of a failed replication. Our findings suggest that probability estimations regarding research and replication outcomes were affected by hindsight bias.
9.1. Replications: comparison with original findings
In our two replication studies, results were mostly in line with the original findings with some minor deviations. We consider these replications mostly successful despite these deviations for two reasons. First, study materials were designed almost half a century ago, and some participants may have been more knowledgeable about some of these stimuli than participants in the 1970s. For example, in the Y-test scenario of Study 2, a 4-year-old child was asked to determine the relative position of a dot to the letter Y when viewed from the back of the easel, like in a left-right mirror image. Back in the 1970s, people might not necessarily have known the more likely choice of the child. However, today, following wider dissemination of findings in developmental and cognitive psychology, more people may have had the insight that mirror-image confusions are prevalent among children, because the abilities that are required to make the correct choice, such as spatial cognition (Colby, 2009) and theory of mind (Wellman & Liu, 2004), are not well-developed among 4-year-olds (Gregory, Landau, & McCloskey, 2011). In the target experiment, the average probability of outcome A (“places in area A”, showing a lack of spatial cognition and theory of mind) in the foresight condition was 0.29. However, in the replication study, the number was much higher (0.60), possibly indicating a shift of knowledge regarding this phenomenon over the decades. Similarly, in the hurricane seeding scenario in Study 2, the average probability of outcome A (“All increase”) was 0.29 in the target experiment, and 0.48 in our replication study. When participants hold certain knowledge prior to taking part in the study, their probability estimates may be less influenced by the study's manipulation of outcome knowledge (of the initial trial), weakening hindsight bias. Given these changes, we consider our findings an impressive demonstration of the generalizability and relevancy of the effect.
Second, for Study 2, while the target experiment asked the participants to write down why they thought the outcome would happen, we did not include this question in the replication study. When asked to provide explanations of an outcome, a person has to temporarily assume that the outcome is true, and then assess its plausibility. Such cognitive processes can lead the person to perceive the outcome to be more plausible, persuasive, or even inevitable (Koehler, 1991). It is therefore possible that writing down the reasons for the outcome reinforces participants' belief that the outcome is true, which in turn intensifies hindsight bias. In our replication study we had to make adjustments to remove the step of providing explanations, and this may have led the observed effect size to be smaller than it would have been had participants been asked to provide explanations. We note, however, that this explanation does not clarify the weaker effects in Study 1. It could be that the effect size of hindsight bias is larger for retrospective judgments, and smaller for prospective judgments. This possibility awaits further investigation.
9.2. Extensions
We added several extensions. In Study 1, we found no support for the mediating effect of surprise in the relationship between hindsight condition and probability estimates, and inconclusive results for the moderating effect of surprise on the relationship between hindsight condition and probability estimates. In Study 2, we found some support for surprise, but not for confidence, as a mediator of the relationship between hindsight condition and probability estimates. In addition, we found support for confidence, but not for surprise, as a moderator of the relationship between hindsight condition and probability estimates. Hindsight bias was evident when confidence about one's own judgments was high, but it was reversed when confidence was low. In Study 3, we found weak to no support for the mediating role and the moderating role of surprise. Other than that, there was no support for the mediating or the moderating effects of surprise, confidence, and task difficulty.
Given these mixed findings, we are hesitant to offer any conclusions regarding surprise and confidence. Past findings regarding the effect of surprise were not unequivocal. Although many articles argued that hindsight bias could be caused by a lack of scrutiny and consideration of alternatives associated with a lack of surprise feelings (Sanna & Schwarz, 2006; Slovic & Fischhoff, 1977), other research noted that a certain level of surprise is required for hindsight bias to occur. After all, if the person already had the knowledge (and thus would not feel surprised), then his/her estimation of the probability should not be affected by the outcome knowledge provided by the researcher (Pezzo, 2003). In testing the robustness of hindsight bias, some research found that hindsight bias persisted even when the materials and outcome knowledge were difficult or unexpected by the participants (e.g., Ash, 2009; Fischhoff, 1977; Hoch & Loewenstein, 1989; Roese & Olson, 1996; Wood, 1978), suggesting that surprise did not necessarily hinder hindsight bias. Furthermore, Schkade and Kilbourne (1991) found that hindsight bias was larger when outcomes were inconsistent with expectations than when they were consistent. The authors reasoned that this could be because the process of assimilating the outcome knowledge into what was already known was immediate and at least partially automatic. Thus, the more the outcome knowledge differed from prior knowledge and the more surprising it was, the larger the hindsight bias; the more similar the outcome knowledge was to prior knowledge, the less likely that a cognitive reconstruction leading to hindsight bias would occur. More research is needed to clarify these varying theoretical arguments and mixed findings about the role of surprise in hindsight bias.
Previous studies have linked hindsight bias to confidence, yet there are studies that failed to detect such associations. Ross (2012) found that the effect of outcome knowledge on probability estimates and its effect on confidence are disconnected. In addition, Schatz (2019) failed to find support for the relationship between receiving outcome knowledge and confidence across ten studies. These and our findings suggest more research is needed to understand the role of confidence in hindsight bias, yet it is possible that these links have been overestimated.
In addition, studies in the literature tend to consider surprise and confidence as two sides of the same coin, based on an assumption that feelings of surprise may reduce a person's confidence about a judgment. However, we found no indication for such an association. Future studies may aim to differentiate and contrast surprise and confidence in hindsight bias.
We found no support for the mediating effect or moderating effect of subjective task difficulty in the relationship between hindsight condition and probability estimates. Although participants in the hindsight condition perceived the task to be easier, this decreased perceived difficulty did not seem to predict probability estimates. Task difficulty was negatively associated with confidence about one's own judgments, and weakly positively associated with surprise of the outcome. Similar to surprise, the literature also showed discrepancies in whether hindsight bias is larger in more difficult or less difficult tasks (see for example Arkes et al., 1981; Harley et al., 2004). More research is needed to address these discrepancies and clarify the role of task difficulty in hindsight bias.
9.3. Take-aways for Science: Endorsement of Open Science practices
In the introduction we discussed direct and important implications of hindsight bias for science. Beyond our successful replications of classic hindsight bias studies, we also successfully demonstrated the application of hindsight bias regarding our very own replication of hindsight bias.
We were asked by the editor and reviewers to discuss our views on possible ways to address hindsight bias in the scientific process. First, there is the issue of raising awareness of hindsight bias pitfalls. To be able to overcome this bias, there needs to be some awareness that the problem exists, and some scholars in the open-science community have been trying to raise awareness of the impact of cognitive biases and to study these systematically using meta-research (e.g., Bishop, 2019, 2020a, 2020b). Second, pre-registrations, if done appropriately, seem like a promising safeguard against researchers fooling themselves, because they involve a public commitment regarding hypotheses, design, procedures, and data analysis plans (Nosek et al., 2018; Shrout & Rodgers, 2018; van't Veer & Giner-Sorolla, 2016). These may at the very least address the issues of unintended memory reconstruction and HARKing, since researchers can easily go back to their pre-registrations and examine their findings against their prior plans. These may also partly serve to assure others of the researchers' open and transparent research process, and demonstrate researchers' public commitment to overcoming their own biases.
Third, Registered Reports publication format (Chambers & Tzavella, 2020; Simons et al., 2014) and results-blind review (Button, Bal, Clark, & Shipley, 2016) can reduce hindsight bias in the publication review process by addressing outcome driven interpretations and the pressures on authors to adhere to a certain outcome. Determining whether to accept or reject a replication study prior to data collection also helps address outcome bias (Baron & Hershey, 1988; Savani & King, 2015), where a failed replication (i.e., a bad outcome) leads to perceiving the study or the replicators as lower quality compared to a successful replication (i.e., a good outcome). Endorsement of Replication Registered Reports as an integral part of the scientific process, with directions like the Pottery Barn rule (if you publish it, you commit to publishing replications of it; Edlund, Cuccolo, Irgens, Wagge, & Zlokovich, 2020; Srivastava, 2012) and a commitment to publishing all well-executed replications (e.g., Chambers, 2018) may help overcome inherent biases against replications as being more predictable and of lower value (Zwaan, Etz, Lucas, & Donnellan, 2018).
Lastly, and most important, systematically documenting and openly sharing everything about the research life-cycle, from initial idea and research question, through process, design, and decisions, to materials, data, and code, with public commitment and openness toward third party open peer review, can greatly reduce human biases introduced in the scientific process and encourage collaboration and sharing. This is the essence of open science.

9.4. Limitations and future research
In all three studies, we used the hypothetical design to test hindsight bias (“answer as if you did not know the outcome”). However, this design makes it difficult to examine psychological processes underlying hindsight bias. We therefore encourage future studies to 1) replicate further studies about hindsight bias which had a stronger focus on the underlying psychological processes, and 2) extend our findings in Study 3 using other designs, such as memory recall (Pohl, 2007), and multinomial processing trees (Bernstein et al., 2011; Groß & Bayen, 2015; Hell, Gigerenzer, Gauggel, Mall, & Müller, 1988).
We conducted all studies using an American sample, and future studies may aim to extend our efforts to also examine samples from other diverse cultures.
We discussed possible implications of hindsight bias for science, yet these were inferred rather than directly tested. We believe that this is a promising and much needed area of research. Future research may aim to directly examine whether and to what extent hindsight bias influences researchers' decisions to embark on replications and reviewers' and editors' decisions to publish a replication study. If such a bias is found, it would be imperative to further examine the impact of our above suggested solutions and other potential remedies to overcome this bias.
This replication presented us with a special challenge regarding some of the events included in the original stimuli of Fischhoff (1975). Events C and D used in the original were from a classic clinical psychology book by Ellis from the 1960s. The original author reflected on the use of these stimuli and noted that the scenarios described patients "in terms that fit now-antiquated mores and theories" (Fischhoff, 2007, p. 11; also see interview in Klein, Hegarty, & Fischhoff, 2017). In correspondence with the original author and the editor, we felt it necessary to include a warning note that these stimuli should no longer be used in follow-up research. We removed the reporting of these materials and analyses of these events from the manuscript and the supplementary.
10. Conclusion
We conducted two close replication studies and one novel study to investigate hindsight bias. In Study 1, we found support for hindsight bias as in Experiment 2 of Fischhoff (1975). Participants estimated the probability of an outcome to be higher when they knew that the outcome actually occurred. In Study 2, we found some support for hindsight bias as in Experiment 1 of Slovic and Fischhoff (1977). When informed of the outcome of an initial trial, participants were more likely to predict this same outcome to repeatedly occur in future trials. In Study 3, we found support for hindsight bias over the replicability of hindsight bias. We found mixed evidence, with weak to no support, for the mediating and moderating roles of surprise, confidence, and task difficulty. We conclude that, almost five decades after the original studies were published, there is consistent evidence for hindsight bias.
Financial disclosure/funding
This research was supported by the European Association for Social Psychology seedcorn grant.
Authorship declaration
Gilad led the reported replication effort with the team listed below. Gilad supervised each step of the project, conducted the pre-registration, and ran data collection. Jieying followed up on initial work by the other coauthors to verify analyses and conclusions, added advanced tables and plots, designed, ran, and analyzed the third study, and completed the manuscript submission draft. Jieying and Gilad jointly finalized the manuscript for submission.
Lok Ching (Roxane) Kwan, Lok Yeung (Loren) Ma, Hiu Yee (HayleyAnne) Choi, Ying Ching (Lita) Lo, Shin Yee (Sarah) Au, and Chi Ho (Toby) Tsang conducted the two replication studies as part of university coursework. They conducted an initial analysis of the paper, designed the replication, initiated the extensions, wrote the pre-registrations, conducted initial data analyses, and wrote initial replication reports.
Bo Ley Cheng guided and assisted the replication effort.

Contributor roles taxonomy

The table below employs CRediT (Contributor Roles Taxonomy) to identify the contributions and roles played by the contributors in the current replication effort. Please refer to the URL (https://www.casrai.org/credit.html) for details and definitions of each of the roles listed below.

Role                                            Jieying Chen
Conceptualization                               X
Pre-registrations                               X
Data curation
Formal analysis                                 X
Funding acquisition
Investigation                                   X
Methodology                                     X
Pre-registration peer review / verification     X
Data analysis peer review / verification        X
Project administration
Resources
Software                                        X
Supervision
Validation                                      X
Visualization                                   X
Writing-original draft                          X
Writing-review and editing                      X

Gilad Feldman
X X X X X X X X

Lok Ching Kwan, Lok Yeung (Loren) Ma, Hiu Yee (HayleyAnne) Choi, Ying Ching (Lita) Lo, Shin Yee (Sarah) Au, and Chi Ho (Toby) Tsang

Bo Ley Cheng

X X X X X X X X X X X X X X X X X

Declaration of Competing Interest
The author(s) declared no potential conflicts of interests with respect to the authorship and/or publication of this article.
Appendix A. Supplementary data
Supplementary data to this article can be found online at https://doi.org/10.1016/j.jesp.2021.104154.
References
Aarts, H., Verplanken, B., & Van Knippenberg, A. (1998). Predicting behavior from actions in the past: Repeated decision making or a matter of habit? Journal of Applied Social Psychology, 28, 1355–1374.
Arkes, H. R. (2013). The consequences of the hindsight bias in medical decision making. Current Directions in Psychological Science, 22, 356–360.
Arkes, H. R., Wortmann, R. L., Saville, P. D., & Harkness, A. R. (1981). Hindsight bias among physicians weighing the likelihood of diagnoses. Journal of Applied Psychology, 66, 252–254.
Ash, I. K. (2009). Surprise, memory, and retrospective judgment making: Testing cognitive reconstruction theories of the hindsight bias effect. Journal of Experimental Psychology: Learning, Memory, and Cognition, 35, 916–933.
Baron, J., & Hershey, J. C. (1988). Outcome bias in decision evaluation. Journal of Personality and Social Psychology, 54, 569–579.
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B, 57, 289–300.

Bernstein, D., Aßfalg, A., Kumar, R., & Ackerman, R. (2016). Looking backward and forward on hindsight bias. Handbook of Metamemory (pp. 289–304). Oxford, UK: Oxford University Press.
Bernstein, D. M., Erdfelder, E., Meltzoff, A. N., Peria, W., & Loftus, G. R. (2011). Hindsight bias from 3 to 95 years of age. Journal of Experimental Psychology: Learning, Memory, and Cognition, 37, 378–391.
Bishop, D. (2019). Fixing the replication crisis: The need to understand human psychology. APS Observer, 32(10).
Bishop, D. (2020a). How scientists can stop fooling themselves over statistics. Nature, 584(7819), 9.
Bishop, D. (2020b). The psychology of experimental psychologists: Overcoming cognitive constraints to improve research: The 47th Sir Frederic Bartlett Lecture. Quarterly Journal of Experimental Psychology, 73(1), 1–19.
Blank, H., Musch, J., & Pohl, R. F. (2007). Hindsight bias: On being wise after the event. Social Cognition, 25, 1–9.
Blank, H., & Nestler, S. (2007). Cognitive process models of hindsight bias. Social Cognition, 25, 132–146.
Bosco, F. A., Aguinis, H., Field, J. G., Pierce, C. A., & Dalton, D. R. (2016). HARKing’s threat to organizational research: Evidence from primary and meta-analytic sources. Personnel Psychology, 69, 709–750.
Brandt, M. J., IJzerman, H., Dijksterhuis, A., Farach, F. J., Geller, J., Giner-Sorolla, R., … Van’t Veer, A. (2014). The replication recipe: What makes for a convincing replication? Journal of Experimental Social Psychology, 50, 217–224.
Bukszar, E., & Connolly, T. (1988). Hindsight bias and strategic choice: Some problems in learning from experience. Academy of Management Journal, 31, 628–641.
Button, K. S., Bal, L., Clark, A., & Shipley, T. (2016). Preventing the ends from justifying the means: Withholding results to address publication bias in peer-review. BMC Psychology, 4(1), 1–7.
Camerer, C. F., Dreber, A., Forsell, E., Ho, T.-H., Huber, J., Johannesson, M., … Wu, H. (2016). Evaluating replicability of laboratory experiments in economics. Science, 351(6280), 1433–1436.
Camerer, C. F., Dreber, A., Holzmeister, F., Ho, T. H., Huber, J., Johannesson, M., … Altmejd, A. (2018). Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour, 2, 637–644.
Casper, J. D., Benedict, K., & Perry, J. L. (1989). Juror decision making, attitudes, and the hindsight bias. Law and Human Behavior, 13, 291–310.
Chambers, C. D. (2018). Reproducibility meets accountability: Introducing the replications initiative at Royal Society Open Science. In Royal Society Open Science. Retrieved from https://royalsociety.org/blog/2018/10/reproducibility-meets-accountability/.
Chambers, C. D., & Tzavella, L. (2020). Registered Reports: Past, Present and Future. https://doi.org/10.31222/osf.io/43298.
Christensen-Szalanski, J. J., & Willham, C. F. (1991). The hindsight bias: A meta-analysis. Organizational Behavior and Human Decision Processes, 48, 147–168.
Cohen, J. E. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
Colby, C. L. (2009). Spatial Cognition. Encyclopedia of Neuroscience, 165–171.
Davis, A. L., & Fischhoff, B. (2014). Communicating uncertain experimental evidence. Journal of Experimental Psychology: Learning, Memory, and Cognition, 40, 261–274.
Dawson, N. V., Connors, A. F., Jr., Speroff, T., Kemka, A., Shaw, P., & Arkes, H. R. (1993). Hemodynamic assessment in the critically ill: Is physician confidence warranted? Medical Decision Making, 13, 258–266.
Ebersole, C. R., Mathur, M. B., Baranski, E., Bart-Plange, D. J., Buttrick, N. R., Chartier, C. R., … Szecsi, P. (2020). Many labs 5: Testing pre-data-collection peer review as an intervention to increase replicability. Advances in Methods and Practices in Psychological Science, 3(3), 309–331.
Edlund, J., Cuccolo, K., Irgens, M. S., Wagge, J. R., & Zlokovich, M. S. (2020). Saving Science Through Replication Studies. https://doi.org/10.31234/osf.io/efypc.
Faul, F., Erdfelder, E., Buchner, A., & Lang, A. G. (2009). Statistical power analyses using G*Power 3.1: Tests for correlation and regression analyses. Behavior Research Methods, 41, 1149–1160.
Fay, M. P., & Malinovsky, Y. (2018). Confidence intervals of the Mann-Whitney parameter that are compatible with the Wilcoxon-Mann-Whitney test. Statistics in Medicine, 37, 3991–4006.
Fischhoff, B. (1975). Hindsight ≠ foresight: The effect of outcome knowledge on judgment under uncertainty. Journal of Experimental Psychology: Human Perception and Performance, 104, 288–299.
Fischhoff, B. (1977). Perceived informativeness of facts. Journal of Experimental Psychology: Human Perception and Performance, 3, 349–358.
Fischhoff, B. (2007). An early history of hindsight research. Social Cognition, 25(1), 10–13.
Fischhoff, B., & Beyth, R. (1975). I knew it would happen: Remembered probabilities of once-future things. Organizational Behavior and Human Performance, 13, 1–16.
Forer, B. R. (1949). The fallacy of personal validation: A classroom demonstration of gullibility. Journal of Abnormal and Social Psychology, 44, 118–123.
Fritz, C. O., Morris, P. E., & Richler, J. J. (2012). Effect size estimates: Current use, calculations, and interpretation. Journal of Experimental Psychology: General, 141, 2–18.
Gelman, A., & Loken, E. (2013). The garden of forking paths: Why multiple comparisons can be a problem even when there is no “fishing expectation” or “p-hacking” and the research hypothesis was posited ahead of time. Retrieved from https://osf.io/n3axs.
Granhag, P. A., Strömwall, L. A., & Allwood, C. M. (2000). Effects of reiteration, hindsight bias, and memory on realism in eyewitness confidence. Applied Cognitive Psychology, 14, 397–420.
Gregory, E., Landau, B., & McCloskey, M. (2011). Representation of object orientation in children: Evidence from mirror-image confusions. Visual Cognition, 19, 1035–1062.

Groß, J., & Bayen, U. J. (2015). Adult age differences in hindsight bias: The role of recall ability. Psychology and Aging, 30, 253–258.
Guilbault, R. L., Bryant, F. B., Brockway, J. H., & Posavac, E. J. (2004). A meta-analysis of research on hindsight bias. Basic and Applied Social Psychology, 26, 103–117.
Harley, E. M., Carlsen, K. A., & Loftus, G. R. (2004). The “saw-it-all-along” effect: Demonstrations of visual hindsight bias. Journal of Experimental Psychology: Learning, Memory, and Cognition, 30, 960–968.
Hawkins, S. A., & Hastie, R. (1990). Hindsight: Biased judgments of past events after the outcomes are known. Psychological Bulletin, 107, 311–327.
Hell, W., Gigerenzer, G., Gauggel, S., Mall, M., & Müller, M. (1988). Hindsight bias: An interaction of automatic and motivational factors? Memory & Cognition, 16, 533–538.
Hertwig, R., Fanselow, C., & Hoffrage, U. (2003). Hindsight bias: How knowledge and heuristics affect our reconstruction of the past. Memory, 11, 357–377.
Hoch, S. J., & Loewenstein, G. F. (1989). Outcome feedback: Hindsight and information. Journal of Experimental Psychology: Learning, Memory, and Cognition, 15, 605–619.
Hoffrage, U., Hertwig, R., & Gigerenzer, G. (2000). Hindsight bias: A by-product of knowledge updating? Journal of Experimental Psychology: Learning, Memory, and Cognition, 26, 566–581.
Hoffrage, U., & Pohl, R. (2003). Research on hindsight bias: A rich past, a productive present, and a challenging future. Memory, 11, 329–335.
Hom, H. L., Jr., & Van Nuland, A. L. (2019). Evaluating scientific research: Belief, hindsight bias, ethics, and research evaluation. Applied Cognitive Psychology, 33, 675–681.
Ioannidis, J. P. (2005). Why most published research findings are false. PLoS Medicine, 2(8), Article e124.
John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23, 524–532.
Kaplan, H., & Barach, P. (2002). Incident reporting: Science or protoscience? Ten years later. BMJ Quality & Safety, 11, 144–145.
Kerr, N. L. (1998). HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review, 2, 196–217.
Klein, O., Hardwicke, T. E., Aust, F., Breuer, J., Danielsson, H., Hofelich Mohr, A., … Frank, M. C. (2018). A practical guide for transparency in psychological science. Collabra: Psychology, 4, 1–15.
Klein, R. A., Vianello, M., Hasselman, F., Adams, B. G., Adams, R. B., Jr., Alper, S., … Sowden, W. (2018). Many labs 2: Investigating variation in replicability across samples and settings. Advances in Methods and Practices in Psychological Science, 1(4), 443–490.
KNAW: Royal Dutch Academy of Arts and Sciences. (2018). Replication studies: Improving reproducibility in the empirical sciences. Amsterdam, Netherlands Retrieved from https://knaw.nl/en/news/publications/replication-studies.
Koehler, D. J. (1991). Explanation, imagination, and confidence in judgment. Psychological Bulletin, 110(3), 499–519.
LeBel, E. P., Berger, D., Campbell, L., & Loving, T. J. (2017). Falsifiability is not optional. Journal of Personality and Social Psychology, 11, 254–261.
LeBel, E. P., McCarthy, R. J., Earp, B. D., Elson, M., & Vanpaemel, W. (2018). A unified framework to quantify the credibility of scientific findings. Advances in Methods and Practices in Psychological Science, 1, 389–402.
LeBel, E. P., Vanpaemel, W., Cheung, I., & Campbell, L. (2019). A brief guide to evaluate replications. Meta-Psychology, 3. MP.2018.843.
Litman, L., Robinson, J., & Abberbock, T. (2017). TurkPrime.com: A versatile crowdsourcing data acquisition platform for the behavioral sciences. Behavior Research Methods, 49(2), 433–442.
Mazursky, D., & Ofir, C. (1990). “I could never have expected it to happen”: The reversal of the hindsight bias. Organizational Behavior and Human Decision Processes, 46, 20–33.
Moshontz, H., Campbell, L., Ebersole, C. R., IJzerman, H., Urry, H. L., Forscher, P. S., … Chartier, C. R. (2018). The psychological science accelerator: Advancing psychology through a distributed collaborative network. Advances in Methods and Practices in Psychological Science, 1(4), 501–515.
Müller, P. A., & Stahlberg, D. (2006). Surprise as information: Metacognitive influences on hindsight bias. Unpublished manuscript. Germany: University of Mannheim.
Müller, P. A., & Stahlberg, D. (2007). The role of surprise in hindsight bias: A metacognitive model of reduced and reversed hindsight bias. Social Cognition, 25, 165–184.
Munafò, M. R., Nosek, B. A., Bishop, D. V., Button, K. S., Chambers, C. D., Du Sert, N. P., … Ioannidis, J. P. (2017). A manifesto for reproducible science. Nature Human Behaviour, 1, 1–9.
Nestler, S., & Egloff, B. (2009). Increased or reversed? The effect of surprise on hindsight bias depends on the hindsight component. Journal of Experimental Psychology: Learning, Memory, and Cognition, 35, 1539–1544.
Nosek, B. A., Ebersole, C. R., DeHaven, A. C., & Mellor, D. T. (2018). The preregistration revolution. Proceedings of the National Academy of Sciences, 115, 2600–2606.
Nosek, B. A., & Lakens, D. (2014). A method to increase the credibility of published results. Social Psychology, 45, 137–141.

Nosek, B. A., Spies, J. R., & Motyl, M. (2012). Scientific utopia: II. Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science, 7, 615–631.
Ofir, C., & Mazursky, D. (1997). Does a surprising outcome reinforce or reverse the hindsight bias? Organizational Behavior and Human Decision Processes, 69, 51–57.
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.
Ouellette, J. A., & Wood, W. (1998). Habit and intention in everyday life: The multiple processes by which past behavior predicts future behavior. Psychological Bulletin, 124, 54–74.
Pezzo, M. (2003). Surprise, defence, or making sense: What removes hindsight bias? Memory, 11, 421–441.
Pohl, R., Eisenhauer, M., & Hardt, O. (2003). SARA: A cognitive process model to simulate the anchoring effect and hindsight bias. Memory, 11, 337–356.
Pohl, R. F. (2007). Ways to assess hindsight bias. Social Cognition, 25, 14–31.
Pohl, R. F., Bender, M., & Lachmann, G. (2002). Hindsight bias around the world. Experimental Psychology, 49, 270–282.
Roese, N. J., & Olson, J. M. (1996). Counterfactuals, causal attributions, and the hindsight bias: A conceptual integration. Journal of Experimental Social Psychology, 32(3), 197–227.
Roese, N. J., & Vohs, K. D. (2012). Hindsight bias. Perspectives on Psychological Science, 7, 411–426.
Ross, M. (2012). The hindsight bias: Judgment task differentiation. Doctoral dissertation. Old Dominion University.
Sanna, L. J., & Schwarz, N. (2006). Metacognitive experiences and human judgment: The case of hindsight bias and its debiasing. Current Directions in Psychological Science, 15, 172–176.
Savani, K., & King, D. (2015). Perceiving outcomes as determined by external forces: The role of event construal in attenuating the outcome bias. Organizational Behavior and Human Decision Processes, 130, 136–146.
Schatz, D. A. (2019). Boundaries of the hindsight bias. Doctoral dissertation. Berkeley: University of California.
Scheel, A. M., Schijen, M., & Lakens, D. (2021). An excess of positive results: Comparing the standard psychology literature with registered reports. In Advances in Methods and Practices in Psychological Science.
Schkade, D. A., & Kilbourne, L. M. (1991). Expectation-outcome consistency and hindsight bias. Organizational Behavior and Human Decision Processes, 49, 105–123.
Shrout, P. E., & Rodgers, J. L. (2018). Psychology, science, and knowledge construction: Broadening perspectives from the replication crisis. Annual Review of Psychology, 69, 487–510.
Simons, D. J., Holcombe, A. O., & Spellman, B. A. (2014). An introduction to registered replication reports at Perspectives on Psychological Science. Perspectives on Psychological Science, 9, 552–555.
Slovic, P., & Fischhoff, B. (1977). On the psychology of experimental surprises. Journal of Experimental Psychology: Human Perception and Performance, 3, 544–551.
Slovic, P., Lichtenstein, S., & Fischhoff, B. (1988). Decision-making. In R. C. Atkinson, et al. (Eds.), Learning and cognition: Vol. 2. Stevens' handbook of experimental psychology (pp. 673–738). New York, NY: Wiley.
Srivastava, S. (2012). A Pottery Barn rule for scientific journals. Retrieved from: https://hardsci.wordpress.com/2012/09/27/a-pottery-barn-rule-for-scientific-journals/.
van’t Veer, A. E., & Giner-Sorolla, R. (2016). Pre-registration in social psychology—A discussion and suggested template. Journal of Experimental Social Psychology, 67, 2–12.
Thaler, R. H. (2016). Behavioral economics: Past, present, and future. American Economic Review, 106, 1577–1600.
Veldkamp, C. (2017). The human fallibility of scientists: Dealing with error and bias in academic research. Doctoral dissertation. Tilburg University.
Wagenmakers, E. J., Wetzels, R., Borsboom, D., van der Maas, H. L., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7, 632–638.
Wagge, J. R., Brandt, M. J., Lazarevic, L. B., Legate, N., Christopherson, C., Wiggins, B., & Grahe, J. E. (2019). Publishing research with undergraduate students via replication work: The collaborative replications and education project. Frontiers in Psychology, 10, 247.
Wellman, H. M., & Liu, D. (2004). Scaling of theory-of-mind tasks. Child Development, 75, 523–541.
Werth, L., & Strack, F. (2003). An inferential approach to the knew-it-all-along phenomenon. Memory, 11(4–5), 411–419.
Winman, A., Juslin, P., & Björkman, M. (1998). The confidence–hindsight mirror effect in judgment: An accuracy-assessment model for the knew-it-all-along phenomenon. Journal of Experimental Psychology: Learning, Memory, and Cognition, 24(2), 415.
Wong, L. Y. S. (1995). Research on teaching: Process-product research findings and the feelings of obviousness. Journal of Educational Psychology, 87(3), 504.
Wood, G. (1978). The knew-it-all-along effect. Journal of Experimental Psychology: Human Perception and Performance, 4, 345–353.
Yang, H., & Thompson, C. (2010). Nurses’ risk assessment judgements: A confidence calibration study. Journal of Advanced Nursing, 66, 2751–2760.
Zwaan, R. A., Etz, A., Lucas, R. E., & Donnellan, M. B. (2018). Making replication mainstream. Behavioral and Brain Sciences, 41.

Jieying Chen is an assistant professor at the Department of Business Administration, University of Manitoba. Her research focuses on judgment and decision-making, cross-cultural interactions, strategic human resource management, and mindfulness.
Lok Ching (Roxane) Kwan, Lok Yeung (Loren) Ma, Hiu Yee (HayleyAnne) Choi, Ying Ching (Lita) Lo, Shin Yee (Sarah) Au, and Chi Ho (Toby) Tsang were students at the University of Hong Kong during the academic year 2018–9.

Bo Ley Cheng was the teaching assistant at the University of Hong Kong psychology department during the academic year 2018–9.
Gilad Feldman is an assistant professor with the University of Hong Kong psychology department. His research focuses on judgment and decision-making.


