Chapter 1 Transparent Statistics Guiding Principles

Version: 1.0

Contributed to the writing: Pierre Dragicevic, Chat Wacharamanotham, Matthew Kay

Gave feedback:

Endorsed:

1.1 Introduction

Human-computer interaction (HCI) is a large, multidisciplinary field drawing on a variety of approaches for analyzing quantitative data. However, many of our existing practices have drawn increasing criticism, such as our overreliance on mechanical statistical testing procedures, our lack of replications and meta-analyses, and our unwillingness to share data and study materials. These issues have been discussed within HCI (Wilson et al. 2011; Kaptein and Robertson 2012; Dragicevic 2016; Kay, Nelson, and Hekler 2016; Cockburn, Gutwin, and Dix 2018) and (to a much larger extent) in other fields (Cohen 1994; Gigerenzer 2004; Ioannidis 2005; Simmons, Nelson, and Simonsohn 2011; Giner-Sorolla 2012; Cumming 2014; Nosek et al. 2017). Poor statistical practice and the lack of transparency in statistical reporting hamper the progress of knowledge and undermine the scientific credibility of affected disciplines, as witnessed by the growing number of press articles reporting a “crisis of confidence” in the most visible of these disciplines (Earp and Trafimow 2015).

The purpose of the transparent statistics guidelines is not to admonish an entire field of researchers for their existing practices nor to urge them to adopt a specific set of methods. There is no universal inference procedure that can act as a substitute for good statistical reasoning (Gigerenzer and Marewski 2015). The multifaceted nature of HCI also means we need to embrace a variety of practices. A fixed set of DOs and DON’Ts would be both too brittle to change over time and too restrictive in the face of the various ways of generating knowledge in our field. Instead, we propose to advance a general vision of transparent statistics that HCI researchers can draw inspiration from, and that is largely method-agnostic. We refer to transparent statistics simply as “a philosophy of statistical reporting whose purpose is to advance scientific knowledge rather than to persuade”. Regardless of the methods used, we aim to provide guidance that makes the communication of those methods more transparent, that makes reproduction and replication of work easier, and that makes evaluation of work (e.g., by peer reviewers) easier and more fair.

To that end, a “transparent statistics” initiative was started in 2016, whose purpose is to discuss ways of promoting transparent statistics at CHI and suggest a series of incremental changes within the community (“Transparent Statistics Website” 2017). These include more specific author and reviewer guidelines, exemplars for authors, and “badges” (“Open Science Badges” 2017). The goal of the initiative is to address questions such as: what can an author do to improve the transparency of their communication? What can a reviewer do to encourage and reward transparency? What changes to the review process might encourage transparency and incentivize researchers? In this way we hope to avoid the time-honored tradition of admonishing researchers for doing statistics poorly, and instead encourage them—and guide them—to do better. The goal of this first chapter is to lay out the high-level principles on which other chapters will be based. Like other chapters, this chapter is not meant to be fixed in stone, but is meant to be constantly evolving and iteratively refined by the CHI community.

1.2 Guiding Principles

Again, transparent statistics is “a philosophy of statistical reporting whose purpose is to advance scientific knowledge rather than to persuade”. This idea is not new. For example, the following quote from Ronald Fisher captures the essence of transparent statistics:

“we have the duty of […] communicating our conclusions in intelligible form, in recognition of the right of other free minds to utilize them in making their own decisions.” (Fisher 1955).

More recent writings have emphasized the importance of contributing useful and accurate knowledge over telling compelling and convincing stories (Giner-Sorolla 2012; Dragicevic 2016). Based on these visions, we propose a set of nine guiding principles for writing transparent statistical reports: 1) faithfulness, 2) robustness, 3) resilience, 4) process transparency, 5) clarity, 6) simplicity, 7) non-contingency, 8) precision and economy, and 9) material availability.

1. Faithfulness

At the most basic level, a transparent statistical report should strive to be faithful to the data and the phenomena studied. This means that it should strive to capture and convey the “truth” as accurately as possible, especially concerning the uncertainty within the data. Major sources of uncertainty need to be carefully assessed and propagated to the presentations and interpretations of the results, all the way up to the final conclusions. Conclusions should be nuanced and stress the uncertainty in the data and in the process.

  • Example: It is evident that any major error in an analysis will result in findings that are likely not faithful to the data and the phenomena studied. This includes effect estimates that are very different from the true effect, but also measures of uncertainty that fail to capture the true uncertainty.

  • Example: Exaggerating findings by presenting uncertain results as certain is unfaithful to the data.

  • Example: A study report that analyzes all its data carefully but fails to acknowledge important issues with data validity (e.g., non-random condition assignment) is faithful to the data but unfaithful to the phenomena studied. The same goes for over-generalizing findings.

2. Robustness

In order to minimize the likelihood of inaccurate (unfaithful) results, data analysis and reporting strategies that are robust to departures from statistical assumptions – or that make few assumptions – should ideally be preferred.

Given that statistical assumptions are never met perfectly, the question should not be “are the assumptions met?” but instead “what are the likely consequences of such and such departure?”. Thus, it is hugely beneficial for researchers to know how their methods behave depending on the nature and degree of possible departures, so that they can explain it in their report when necessary. When uncertain, methods that are known for their robustness are safer choices.

  • Example: ANOVA is robust to departures from the normality assumption, and can in some cases give accurate results with unusual distributions and very small sample sizes (Norman 2010).
  • Example: Bootstrapping makes no assumption about data distribution and is robust to departures from its own statistical assumptions, even though these assumptions are implausible (Kirby and Gerlanc 2013).
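As an illustration of the bootstrapping example above, here is a minimal sketch of a percentile bootstrap confidence interval for a mean, using only the Python standard library. The percentile variant is chosen here for brevity and is an assumption, not necessarily the variant used by the cited work. The function name `bootstrap_ci`, its parameters, and the completion times are all hypothetical.

```python
import random
import statistics

def bootstrap_ci(sample, stat=statistics.mean, n_resamples=5000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for a statistic of one sample."""
    rng = random.Random(seed)  # fixed seed so the report is reproducible
    n = len(sample)
    # Resample with replacement, recompute the statistic, and sort the estimates.
    estimates = sorted(stat(rng.choices(sample, k=n)) for _ in range(n_resamples))
    alpha = (1 - level) / 2
    low = estimates[int(alpha * n_resamples)]
    high = estimates[int((1 - alpha) * n_resamples) - 1]
    return low, high

# Hypothetical task completion times in seconds (made up, right-skewed).
times = [4.1, 4.8, 5.0, 5.2, 5.9, 6.3, 6.4, 7.7, 9.8, 14.2]
low, high = bootstrap_ci(times)
print(f"mean = {statistics.mean(times):.1f} s, 95% CI [{low:.1f}, {high:.1f}]")
```

With a sample this small, a percentile interval may be too narrow, so a report using this approach would ideally say which bootstrap variant was used and why.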

3. Resilience

Data analysis and reporting strategies should be resilient to statistical noise, i.e., they should yield similar outcomes across hypothetical replications of the same study. Researchers should ask themselves how their statistical report would change if they took another random sample from the same population, and should try to make claims that are as robust as possible to these changes.

In practice, the principle of resilience implies that researchers should avoid presenting statistical noise as signal, either by overfitting, or by overinterpreting patterns in results. It also implies that study reports should be smooth functions of the data. This means that data analysis and presentation strategies should be chosen so that similar experimental datasets yield similar results, interpretations and conclusions (Dixon 2003; Dragicevic 2016; “Statistical Dances: Why No Statistical Analysis Is Reliable and What to Do About It” 2017). The principle of resilience is important and is directly relevant to the issue of study replicability.

  • Example: Presenting a bar chart of means without error bars and commenting on the emerging patterns is akin to overfitting and is thus not resilient.
  • Example: Computing and reporting 95% interval estimates is resilient, but drawing binary conclusions based on whether they contain zero is not, because two very similar datasets may yield seemingly very different scientific conclusions (Cumming 2013).
  • Example: For the same reasons, computing Bayes factors and interpreting them strictly based on conventional thresholds violates the principle of resilience (“Dance of the Bayes Factors” 2016).
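The contrast between resilient estimates and non-resilient binary conclusions can be explored with a small simulation. The sketch below repeatedly draws a new sample from the same hypothetical population and computes a 95% t-based interval each time. The sample size, true effect, and seed are arbitrary assumptions chosen for illustration.

```python
import random
import statistics

N = 20             # participants per simulated study (assumed)
T_CRIT = 2.093     # two-sided 95% critical value of Student's t for N - 1 = 19 df
TRUE_EFFECT = 0.5  # assumed true mean difference, in standard-deviation units

def simulate_study(rng):
    """Simulate one study and return (mean, ci_low, ci_high)."""
    diffs = [rng.gauss(TRUE_EFFECT, 1.0) for _ in range(N)]
    mean = statistics.mean(diffs)
    sem = statistics.stdev(diffs) / N ** 0.5
    return mean, mean - T_CRIT * sem, mean + T_CRIT * sem

rng = random.Random(1)
studies = [simulate_study(rng) for _ in range(1000)]

# Dichotomous reading: is the lower bound of the interval above zero?
excludes_zero = sum(low > 0 for _, low, _ in studies) / len(studies)
# Estimation reading: does the interval contain the true effect?
covers_truth = sum(low <= TRUE_EFFECT <= high for _, low, high in studies) / len(studies)

print(f"intervals excluding zero:     {excludes_zero:.0%}")
print(f"intervals covering the truth: {covers_truth:.0%}")
```

Under these assumptions, the proportion of intervals excluding zero is far from both 0% and 100%, so a binary verdict would flip between replications of the same study, whereas most intervals still contain the true effect.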

4. Process Transparency

A core aspect of transparent statistics is that data analysis and reporting strategies need to be explained rather than implied. The decisions made during the analysis and report writing should be communicated as explicitly as possible, as the results of an analysis cannot be fairly assessed and understood if many decisions remain concealed (Giner-Sorolla 2012; Gelman and Loken 2013).

At the most basic level, researchers should ideally state which portions of their data analysis were planned before the data was seen, and which portions were not. Analyses that are fully planned can be referred to as prespecified, while analyses that are largely unplanned should be referred to as exploratory (Cumming 2014, 10). Both types of analysis are valid, although the former can support stronger claims than the latter.

Process transparency also implies faithfully reporting the research goals, the research questions, and (optionally) the researcher’s expectations prior to seeing the data (Kerr 1998; Gelman and Loken 2013; Cockburn, Gutwin, and Dix 2018). Results from analyses need to be reported whether or not they meet the researcher’s initial expectations. When only some results are reported, the rationale for selecting them needs to be explained. Finally, sharing data and analysis scripts greatly benefits process transparency.

  • Example: Hypothesizing after the results are known (or “HARKing”) strongly goes against process transparency (Kerr 1998). Researchers who do not have clear expectations should state research questions instead of hypotheses (Cumming 2013).
  • Example: Cherry-picking “convenient” results (e.g., results that best support the hypotheses), or trying multiple alternative analyses and reporting only those that are convenient, clearly violates process transparency (Simmons, Nelson, and Simonsohn 2011).
  • Example: Even when a researcher has no preference for a given hypothesis and no intention to p-hack, cherry-picking results to give the impression of a coherent story also goes against process transparency (Giner-Sorolla 2012; Gelman and Loken 2013).
  • Example: Provided an analysis is presented as exploratory, trying multiple analyses and reporting the most interesting and informative results by taking a neutral stance is perfectly acceptable and does not go against process transparency (Tukey 1977). Transparency can however be increased by explaining what has been tried.

5. Clarity

Study reports should be as easy to understand as possible because, as explained by Ronald Fisher (quoted above), readers and reviewers cannot judge an analysis without understanding it. There are two facets of clarity: ease of processing and accessibility.

First, study reports should be easy to process, even when they target experts. When results can be communicated more effectively with visual representations than with numerals, visual representations should be preferred (Loftus 1993; Gelman, Pasarica, and Dodhia 2002). Although a study report should communicate as much relevant information as possible, information overload must be avoided by reporting non-essential information in appendices or in supplemental material.

Second, study reports should ideally be accessible to most members of the HCI community, instead of being comprehensible to only a handful of specialists. The more accessible an analysis is, the more “free minds” can judge it. Thus a study report should be more an exercise in pedagogy than an exercise in rhetoric. The goal of a statistical report is not to signal expertise, but to explain.

  • Example: Using statistics for defensive purposes by generating p-cluttered reports rather than informative plots violates the principle of clarity.
  • Example: Excessive numbers of significant digits are difficult to process and thus go against the principle of clarity (Ehrenberg 1977), in addition to giving a misleading impression of precision (Taylor 1997).

6. Simplicity

When choosing between two data analysis procedures, the simpler procedure should ideally be preferred even if it is slightly inferior in other respects. A focus on simplicity follows from the principles of clarity and ease of processing, and it makes both researcher mistakes and reader misinterpretations less likely to occur. In other words, the KISS principle (Keep It Simple, Stupid) is as relevant in transparent statistics as in any other domain.

7. Non-contingency

When possible and outside exploratory analyses, data analysis and reporting strategies should avoid decisions that are contingent on data, e.g., “if the data turns out like this, compute this, or report that”. This principle follows from the principles of process transparency, clarity, and simplicity, because data-contingent procedures are hard to explain and easy to leave unexplained (Gelman and Loken 2013). It is also a corollary of the principle of resilience because any dichotomous decision decreases a procedure’s resilience to statistical noise.

Carefully planning an analysis is a good way to make sure that the principle of non-contingency is met (Cumming 2014), especially if all the analysis code has been written ahead of time based on pilot data (Dragicevic 2016). Pre-registering an analysis further increases transparency by allowing anyone to verify that the plan has been followed (Nosek et al. 2017; Cockburn, Gutwin, and Dix 2018). In exploratory analyses and in complex modeling problems, which are often data-contingent by nature, the principle of non-contingency should be applied as far as is practical.

  • Example: Using a test of normality to decide whether to use parametric or non-parametric methods violates the principle of non-contingency, in addition to not being very useful (Stewart-Oaten 1995; Wierdsma 2013). If the test of normality is not mentioned in the report, it additionally violates the principle of process transparency.
  • Example: Selective reporting of data (i.e., cherry-picking) clearly violates the non-contingency principle, and generally also the principles of faithfulness and of process transparency. It is only acceptable if the analysis is clearly presented as exploratory, and if the goal of the selection is to learn from the data rather than to support a convenient hypothesis or story.

8. Precision and economy

Data quality (Gelman 2017) and high statistical power (Cohen 1994), which in the estimation world translates to high statistical precision (Cumming 2013; Kruschke and Liddell 2017), are important goals to pursue. This is because even if full transparency is achieved, a study report where nothing conclusive can be said would be a waste of readers’ time, and may prompt them to seek nonexistent patterns. Precision depends on experiment design, but also on the choice of analysis methods – thus analysis methods that yield high precision should be preferred. However, researchers should strive to avoid false precision, e.g., reporting numerical results without information about their uncertainty and/or with far more significant digits than justified by their uncertainty (Taylor 1997).
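One way to guard against false precision is to let the uncertainty decide how many digits of an estimate are reported. The sketch below assumes a common convention: round the uncertainty to one significant digit and round the estimate to the same decimal position. The helper `round_to_uncertainty` is hypothetical, and other conventions (e.g., two significant digits) are equally defensible.

```python
import math

def round_to_uncertainty(estimate, uncertainty, sig_digits=1):
    """Round an estimate to the decimal position justified by its uncertainty."""
    if uncertainty <= 0:
        raise ValueError("uncertainty must be positive")
    # Decimal position of the leading digit of the uncertainty.
    exponent = math.floor(math.log10(uncertainty))
    decimals = sig_digits - 1 - exponent
    return round(estimate, decimals), round(uncertainty, decimals)

# Made-up numbers: a mean difference and the half-width of its interval.
print(round_to_uncertainty(12.34567, 0.4321))  # (12.3, 0.4)
print(round_to_uncertainty(1234.5, 87.0))      # (1230.0, 90.0)
```

Reporting "12.3 ± 0.4" rather than "12.34567 ± 0.4321" conveys the same information without suggesting a precision that the data cannot support.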

Analysis and reporting strategies that waste statistical power and precision (e.g., by dichotomizing continuous variables) should also ideally be avoided (Dragicevic 2016). Though the economy principle is not directly related to transparency, it is generally advisable not to waste data. It is a sensible goal for researchers to try to learn as much as possible from a study, provided that the principles of faithfulness and process transparency are carefully kept in mind. For similar reasons, while it is essential that researchers do not read too much into their data and do not fall for confirmation bias, exploratory analyses are often very informative and should thus be encouraged. The best study reports combine prespecified with exploratory analyses, while clearly distinguishing between the two.
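The cost of dichotomizing a continuous variable can be illustrated with a short simulation. The sketch below generates a made-up continuous predictor `x` and an outcome `y` that depends linearly on it, then compares the correlation obtained with the original predictor to the one obtained after a median split. All names, the effect size, and the seed are assumptions for illustration only.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(2)
n = 5000
x = [rng.gauss(0, 1) for _ in range(n)]        # continuous predictor
y = [0.5 * xi + rng.gauss(0, 1) for xi in x]   # outcome with a linear effect of x

# Median split: recode the predictor as "low" (0) versus "high" (1).
median = sorted(x)[n // 2]
x_split = [1.0 if xi > median else 0.0 for xi in x]

r_continuous = pearson_r(x, y)
r_split = pearson_r(x_split, y)
print(f"r with continuous predictor:   {r_continuous:.2f}")
print(f"r with median-split predictor: {r_split:.2f}")
```

In this setup the median split shrinks the observed correlation, which means that a larger sample would be needed to estimate the same relationship with the same precision.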

9. Material availability

Sharing as much study material as possible is a core part of transparent statistics, as it greatly facilitates peer scrutiny and replication. Being able to run the experimental software and examine what participants saw (the techniques, tasks, instructions, and questions asked) is essential in order for other researchers to understand the details of a study. In addition, sharing the source code of the experimental software greatly facilitates replication. Similarly, experimental data (all data files and if possible analysis scripts) is necessary for conducting re-analyses and meta-analyses. Although uploading supplementary material makes sense during the reviewing phase, to be really useful all material should be freely shared online upon paper acceptance, ideally on a website that can guarantee long-term accessibility.

References

Wilson, Max L, Wendy Mackay, Ed Chi, Michael Bernstein, Dan Russell, and Harold Thimbleby. 2011. “RepliCHI - CHI Should Be Replicating and Validating Results More: Discuss.” In CHI’11 Extended Abstracts on Human Factors in Computing Systems, 463–66. ACM. https://hal.inria.fr/file/index/docid/1000423/filename/RepliCHI-panel-2011.pdf.

Kaptein, Maurits, and Judy Robertson. 2012. “Rethinking Statistical Analysis Methods for CHI.” In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 1105–14. ACM. http://judyrobertson.typepad.com/files/chi2012_submission_final.pdf.

Dragicevic, Pierre. 2016. “Fair Statistical Communication in HCI.” In Modern Statistical Methods for HCI, 291–330. Springer. https://hal.inria.fr/hal-01377894/document.

Kay, Matthew, Gregory L Nelson, and Eric B Hekler. 2016. “Researcher-Centered Design of Statistics: Why Bayesian Statistics Better Fit the Culture and Incentives of HCI.” In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, 4521–32. ACM. http://www.mjskay.com/papers/chi_2016_bayes.pdf.

Cockburn, Andy, Carl Gutwin, and Alan Dix. 2018. “HARK No More: On the Preregistration of CHI Experiments.” ACM.

Cohen, Jacob. 1994. “The Earth Is Round (p < .05).” American Psychologist 49 (12). American Psychological Association: 997. http://ist-socrates.berkeley.edu/~maccoun/PP279_Cohen1.pdf.

Gigerenzer, Gerd. 2004. “Mindless Statistics.” The Journal of Socio-Economics 33 (5). Elsevier: 587–606. http://pubman.mpdl.mpg.de/pubman/item/escidoc:2101336/component/escidoc:2101335/GG_Mindless_2004.pdf.

Ioannidis, John PA. 2005. “Why Most Published Research Findings Are False.” PLoS Medicine 2 (8). Public Library of Science: e124. http://robotics.cs.tamu.edu/RSS2015NegativeResults/pmed.0020124.pdf.

Simmons, Joseph P, Leif D Nelson, and Uri Simonsohn. 2011. “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant.” Psychological Science 22 (11). Sage Publications: 1359–66. http://opim.wharton.upenn.edu/DPlab/papers/publishedPapers/Simmons_2011_False-Positive%20Psychology.pdf.

Giner-Sorolla, Roger. 2012. “Science or Art? How Aesthetic Standards Grease the Way Through the Publication Bottleneck but Undermine Science.” Perspectives on Psychological Science 7 (6). Sage Publications: 562–71. http://journals.sagepub.com/doi/full/10.1177/1745691612457576.

Cumming, Geoff. 2014. “The New Statistics: Why and How.” Psychological Science 25 (1): 7–29. doi:10.1177/0956797613504966.

Nosek, Brian A, Charles R Ebersole, Alexander DeHaven, and David Mellor. 2017. “The Preregistration Revolution.” Open Science Framework. https://osf.io/2dxu5/download?format=pdf.

Earp, Brian D, and David Trafimow. 2015. “Replication, Falsification, and the Crisis of Confidence in Social Psychology.” Frontiers in Psychology 6. Frontiers Media SA. https://www.frontiersin.org/articles/10.3389/fpsyg.2015.00621/full.

Gigerenzer, Gerd, and Julian N Marewski. 2015. “Surrogate Science: The Idol of a Universal Method for Scientific Inference.” Journal of Management 41 (2). Sage Publications: 421–40. http://www.dcscience.net/Gigerenzer-Journal-of-Management-2015.pdf.

“Transparent Statistics Website.” 2017. http://transparentstatistics.org/.

Fisher, Ronald. 1955. “Statistical Methods and Scientific Induction.” Journal of the Royal Statistical Society. Series B (Methodological). JSTOR, 69–78. http://www.ssnpstudents.com/wp/wp-content/uploads/2015/02/Fisher-1955.pdf.

Norman, Geoff. 2010. “Likert Scales, Levels of Measurement and the ‘Laws’ of Statistics.” Advances in Health Sciences Education 15 (5). Springer: 625–32. https://pdfs.semanticscholar.org/6dc0/0756ab722370b815df1223f4044dd63841a8.pdf.

Kirby, Kris N, and Daniel Gerlanc. 2013. “BootES: An R Package for Bootstrap Confidence Intervals on Effect Sizes.” Behavior Research Methods 45 (4). Springer: 905–27. http://web.williams.edu/Psychology/Faculty/Kirby/bootes-kirby-gerlanc-in-press.pdf.

Dixon, Peter. 2003. “The P-Value Fallacy and How to Avoid It.” Canadian Journal of Experimental Psychology/Revue Canadienne de Psychologie Experimentale 57 (3). Canadian Psychological Association: 189. https://www.ncbi.nlm.nih.gov/pubmed/14596477.

“Statistical Dances: Why No Statistical Analysis Is Reliable and What to Do About It.” 2017. https://tinyurl.com/gricad-dance.

Cumming, Geoff. 2013. Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. Routledge.

Gelman, Andrew, and Eric Loken. 2013. “The Garden of Forking Paths: Why Multiple Comparisons Can Be a Problem, Even When There Is No ‘Fishing Expedition’ or ‘P-Hacking’ and the Research Hypothesis Was Posited Ahead of Time.” Department of Statistics, Columbia University. http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf.

Kerr, Norbert L. 1998. “HARKing: Hypothesizing After the Results Are Known.” Personality and Social Psychology Review 2 (3). Sage Publications: 196–217. http://www.socialrelationslab.com/uploads/1/8/9/6/18966149/harkingkerr1998.pdf.

Tukey, John W. 1977. “Exploratory Data Analysis.” Reading, Mass.

Loftus, Geoffrey R. 1993. “A Picture Is Worth a Thousand p Values: On the Irrelevance of Hypothesis Testing in the Microcomputer Age.” Behavior Research Methods, Instruments, & Computers 25 (2). Springer: 250–56. https://faculty.washington.edu/gloftus/Research/Publications/Manuscript.pdf/Loftus%20p-values%201993.pdf.

Gelman, Andrew, Cristian Pasarica, and Rahul Dodhia. 2002. “Let’s Practice What We Preach: Turning Tables into Graphs.” The American Statistician 56 (2). Taylor & Francis: 121–30. https://pdfs.semanticscholar.org/202c/fec06a87fc96d3d56b6ad2ba4237b3fde141.pdf.

Ehrenberg, ASC. 1977. “Rudiments of Numeracy.” Journal of the Royal Statistical Society. Series A (General). JSTOR, 277–97. http://www1.maths.leeds.ac.uk/~sta6ajb/math1910/p4.pdf.

Taylor, John. 1997. Introduction to Error Analysis, the Study of Uncertainties in Physical Measurements. University Science Books.

Stewart-Oaten, Allan. 1995. “Rules and Judgments in Statistics: Three Examples.” Ecology 76 (6). Wiley Online Library: 2001–9. http://onlinelibrary.wiley.com/doi/10.2307/1940736/full.

Wierdsma, A. 2013. “What Is Wrong with Tests of Normality?” http://tinyurl.com/normality-wrong.

Gelman, Andrew. 2017. “Ethics and Statistics: Honesty and Transparency Are Not Enough.” Chance 30 (1). Taylor & Francis: 37–39. http://www.stat.columbia.edu/~gelman/research/published/ChanceEthics14.pdf.

Kruschke, John K, and Torrin M Liddell. 2017. “The Bayesian New Statistics: Hypothesis Testing, Estimation, Meta-Analysis, and Power Analysis from a Bayesian Perspective.” Psychonomic Bulletin & Review. Springer, 1–29. https://osf.io/ksfyr/download?format=pdf.