Computational experiments are the dominant paradigm for understanding and comparing machine learning algorithms. Typically, multiple learning algorithms (the treatments) are compared across multiple datasets, each providing training and validation subsets, using various predictive performance metrics (the response variables).
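As a minimal sketch of such a design (assuming scikit-learn and pandas; the particular algorithms, datasets and the choice of MCC as the response metric are purely illustrative), every treatment is evaluated on every dataset, giving one response value per cell:

```python
# Minimal sketch of a repeated-measures layout: every algorithm (treatment)
# is evaluated on every dataset, yielding one response value per cell.
import pandas as pd
from sklearn.datasets import load_breast_cancer, load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

treatments = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "random_forest": RandomForestClassifier(random_state=0),
}
datasets = {
    "breast_cancer": load_breast_cancer(return_X_y=True),
    "digits": load_digits(return_X_y=True),
}

rows = []
for data_name, (X, y) in datasets.items():
    for algo_name, model in treatments.items():
        # The response variable: mean cross-validated score (here MCC).
        scores = cross_val_score(model, X, y, cv=5, scoring="matthews_corrcoef")
        rows.append({"dataset": data_name, "algorithm": algo_name,
                     "mcc": scores.mean()})

results = pd.DataFrame(rows)  # one row per (dataset, algorithm) cell
print(results.pivot(index="dataset", columns="algorithm", values="mcc"))
```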
Such experimental designs are referred to as repeated-measures designs. In this way we build knowledge through sense-making of many results. But how sure can we be that our experimental results are reliable? I address this question by examining the domain of software defect prediction. A re-analysis of these experiments found that ~40% contained inconsistent results and/or basic statistical errors. Elsewhere I show that inappropriate response metrics can change not only the magnitude of results but also the direction of effects in ~25% of cases.
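To see how the choice of response metric can reverse a conclusion, consider a constructed confusion-matrix example (illustrative only, not drawn from the re-analysis): on an imbalanced defect dataset, F1 can favour one model while MCC favours the other.

```python
# Constructed illustration: on an imbalanced dataset the choice of response
# metric can reverse which model "wins".
from sklearn.metrics import f1_score, matthews_corrcoef

# 100 modules, 20 actually defective (1), 80 clean (0).
y_true = [1] * 20 + [0] * 80

# Model A flags every module as defective; Model B flags only 3 true defects.
y_pred_a = [1] * 100
y_pred_b = [1] * 3 + [0] * 17 + [0] * 80

for name, y_pred in [("A (predict all defective)", y_pred_a),
                     ("B (very conservative)", y_pred_b)]:
    print(name,
          "F1 =", round(f1_score(y_true, y_pred), 2),
          "MCC =", round(matthews_corrcoef(y_true, y_pred), 2))
# F1 ranks A above B (0.33 vs 0.26), while MCC ranks B above A (0.35 vs 0.0):
# the direction of the effect depends on the metric.
```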
We all make errors, and our computational experiments can be considerably complex, so I recommend that we (i) use open science to expose studies to scrutiny, (ii) avoid dichotomous inference methods, and (iii) use meta-analysis with caution!
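One way to avoid a dichotomous significant/not-significant verdict is to report an effect estimate together with its uncertainty. The sketch below (with made-up per-dataset MCC values, assuming NumPy) bootstraps a confidence interval for a paired difference between two algorithms:

```python
# Sketch of non-dichotomous reporting: instead of a significant / not
# significant verdict, report an effect size with an uncertainty interval.
# The paired per-dataset MCC values below are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
mcc_algo_a = np.array([0.41, 0.35, 0.52, 0.47, 0.38, 0.44])
mcc_algo_b = np.array([0.39, 0.36, 0.45, 0.46, 0.33, 0.40])
diff = mcc_algo_a - mcc_algo_b  # paired differences, one per dataset

# Bootstrap the mean difference to obtain a 95% interval.
boot = [rng.choice(diff, size=diff.size, replace=True).mean()
        for _ in range(10_000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"mean MCC difference = {diff.mean():.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```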
Meet up at Blekinge Institute of Technology, Venue J1360 or join via Zoom https://bth.zoom.us/j/63747257887