Designing a hypothesis test to determine the best of two machine learning algorithms with only a small data set available is not a simple task. Many popular tests su#er from low power (5x2 cv [2]), or high Type I error (Weka's 10x10 cross validation [11]). Furthermore, many tests show a low level of replicability, so that tests performed by different scientists with the same pair of algorithms, the same data sets and the same hypothesis test still may present di#erent results.