Dasha Pruss (University of Pittsburgh)
A flurry of failed experimental replications in the 21st century has led to the declaration of a "replication crisis" in a number of experimental fields, including psychology and medicine. Recent articles (e.g., Hutson, 2018) have proclaimed a similar crisis in the computational sciences: researchers have had widespread difficulty reproducing key computational results, such as reported levels of predictive accuracy of machine learning algorithms. At first glance, importing the experimental concept of a replication crisis to explain what is happening in the computational sciences might seem attractive: in both fields, questionable research practices have led to the publication of results that cannot be reproduced. With the help of careful conceptual analysis, however, it becomes clear that this analogy between the experimental and computational sciences is at best strained, and at worst meaningless.
Scientific writing on experimental replication is awash with conceptual confusion; to assess the concept of replication in the computational sciences, I appeal to Machery's re-sampling account of experimental replication (Machery, Ms). On the re-sampling account, an experiment replicates an earlier experiment if and only if the new experiment consists of a sequence of events of the same type as the original experiment, while re-sampling some of its experimental components, with the aim of establishing the reliability (as opposed to the validity) of an experimental result. The components that stay unchanged between the two experiments are fixed components; the components that get re-sampled are random components.

The difficulty of applying the concept of experimental replication to the crisis in the computational sciences stems from two important epistemic differences between the computational and experimental sciences. The first is that the distinction between random and fixed factors is not as clear or consistent in the computational sciences as it is in the experimental sciences. The second is that, unlike in the experimental sciences, computational components often cannot be modified independently of one another; as a result, establishing the reliability of a computational result is often intimately connected to establishing its validity.

In light of this, I argue that there are two defensible ways to conceive of replicability in the computational sciences: weak replicability (reproducing an earlier result using identical code and data but different input or system factors), which concerns issues already captured by the concept of repeatability, and strong replicability (reproducing an earlier result using different code or data), which concerns issues already captured by the concept of robustness. Because neither notion of replicability captures anything new about the challenges the computational sciences face, I argue that we should resist the fad of seeing a replication crisis around every corner and should do away with the concept of replication in the computational sciences altogether. Instead, philosophers and computer scientists alike should focus exclusively on issues of repeatability and robustness.
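To make the weak/strong distinction concrete, here is a minimal sketch of my own, not drawn from the paper, assuming scikit-learn and treating the seed of a train/test split as an "input or system factor": weak replicability re-runs identical code and data while re-sampling that factor, whereas strong replicability reproduces the result with different code, here a different model family.

```python
# Hypothetical illustration (not from the abstract): the weak/strong
# replicability distinction applied to a reported accuracy figure.
# Assumes scikit-learn; the dataset and models are stand-ins.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

def reported_accuracy(model, seed):
    """Train/test split and test accuracy for a given model and seed."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    return model.fit(X_tr, y_tr).score(X_te, y_te)

# The original "result": a single published accuracy figure.
original = reported_accuracy(LogisticRegression(max_iter=5000), seed=0)

# Weak replicability (repeatability): identical code and data, but a
# re-sampled system factor -- here, the random seed of the data split.
weak = [reported_accuracy(LogisticRegression(max_iter=5000), seed=s)
        for s in range(1, 6)]

# Strong replicability (robustness): different code for the same
# analysis -- here, a different model family on the same task.
strong = reported_accuracy(DecisionTreeClassifier(random_state=0), seed=0)

print(f"original: {original:.3f}")
print(f"weak (re-sampled seeds): {[f'{a:.3f}' for a in weak]}")
print(f"strong (different code): {strong:.3f}")
```

On this picture, variation across the weak runs is a matter of repeatability and agreement with the strong run a matter of robustness; nothing in either exercise calls for a distinct concept of replication.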