From a statistical viewpoint, without further knowledge of the impact of differences in quality attributes on safety and efficacy, and without taking into account any risk mitigation by a proper manufacturing control strategy, the average false acceptance and rejection rates represent estimates of the false positive and false negative decision rates for the similarity of quality attributes between two products. Both error rates are important and should be as low as possible; however, a small false acceptance rate is even more desirable, because false acceptance may affect the risk posed to the patient, whereas false rejection primarily affects the risk to the manufacturer. The tool is therefore well suited to comparing different statistical tests for their applicability in similarity assessments. Any specific application to a similarity exercise additionally requires consideration of potential multiplicity effects, as typically many quality attributes are compared in parallel (Bretz et al. 2010). The tool also assumes normally distributed data and process variability without special-cause variation, meaning that the analytical variability is negligible and the sample data do not shift over time. Non-normally distributed data and special-cause variation require additional considerations with regard to sampling distributions and data evaluation.
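As a minimal illustration of how such average error rates can be estimated by simulation under the normality assumption, the following sketch runs a Monte Carlo study of a range-based MinMax-style rule (all test values within the observed reference range). This is a hypothetical sketch, not the article's actual tool; the function `minmax_error_rates` and all parameter values (sample sizes, a 3-sigma shift for the non-similar scenario) are assumptions chosen for illustration.

```python
# Hypothetical Monte Carlo sketch: estimating average false rejection and
# false acceptance rates of a MinMax-style similarity rule under normality.
import random

def minmax_similar(ref, test):
    """MinMax-style rule: declare similarity if every test value lies
    within the observed reference range."""
    return min(ref) <= min(test) and max(test) <= max(ref)

def minmax_error_rates(n_ref=10, n_test=6, shift=3.0, n_sim=2000, seed=1):
    """False rejection: an identical test population is rejected.
    False acceptance: a population shifted by `shift` sigma is accepted.
    All parameters are illustrative assumptions."""
    rng = random.Random(seed)
    false_rej = false_acc = 0
    for _ in range(n_sim):
        ref = [rng.gauss(0.0, 1.0) for _ in range(n_ref)]
        same = [rng.gauss(0.0, 1.0) for _ in range(n_test)]
        shifted = [rng.gauss(shift, 1.0) for _ in range(n_test)]
        if not minmax_similar(ref, same):
            false_rej += 1
        if minmax_similar(ref, shifted):
            false_acc += 1
    return false_rej / n_sim, false_acc / n_sim
```

With these illustrative settings the simulated false rejection rate is much larger than the false acceptance rate, consistent with the conservative character of range-based containment rules.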

The results provided in this article reveal that MinMax is a conservative approach with a low false acceptance rate but a high false rejection rate. Equivalence testing also has a high false rejection rate and, with increasing sample size, a considerable false acceptance rate. The 3Sigma approach provides a more practical compromise between the two error rates, which further improves with larger sample size. Tolerance interval testing is only usable if the sample size is sufficiently large.

A frequent practical question in the evaluation of similarity is how many test samples are needed for robust decision making. The tool clearly shows that very small sample sizes can considerably increase the false acceptance rates of the range-based tests. The tool makes it possible to define acceptable sample sizes based on desired operating characteristics and/or to investigate alternative strategies to control the false acceptance rate.
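The sample-size effect for range-based tests can be sketched with a small simulation. The following is a hypothetical illustration using a 3Sigma-style rule (all test values within the reference mean ± 3 standard deviations); the function `false_acceptance` and all parameter values, including a 2-sigma shift for the non-similar population, are assumptions for illustration, not the article's implementation.

```python
# Hypothetical sketch: effect of the test sample size on the false
# acceptance rate of a range-based 3Sigma-style similarity rule.
import random
from statistics import fmean, stdev

def three_sigma_similar(ref, test):
    """Accept if every test value lies within mean(ref) +/- 3*sd(ref)."""
    m, s = fmean(ref), stdev(ref)
    return all(m - 3 * s <= x <= m + 3 * s for x in test)

def false_acceptance(n_test, n_ref=10, shift=2.0, n_sim=2000, seed=3):
    """Rate at which a test population shifted by `shift` sigma is
    still accepted. All parameters are illustrative assumptions."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        ref = [rng.gauss(0.0, 1.0) for _ in range(n_ref)]
        test = [rng.gauss(shift, 1.0) for _ in range(n_test)]
        if three_sigma_similar(ref, test):
            hits += 1
    return hits / n_sim
```

With fewer test samples there are fewer chances for any value to fall outside the acceptance range, so a truly shifted population is accepted more often; the simulated false acceptance rate is markedly higher for a single test sample than for eight.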

For the equivalence test, on the other hand, an increasing sample size leads to greater precision in estimating the mean difference. In combination with the lack of alignment between the EQT and the equivalence hypothesis (test population contained within a reference population), this leads to an undesired increase of the false acceptance rate with increasing sample size.
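This mechanism can be sketched with a simplified, known-variance z-based TOST on the mean difference (an illustrative assumption, not the article's implementation; the function names, the margin of 1.5 reference standard deviations, and the narrow test population placed near that margin are all chosen for illustration). Because the test population's mean difference lies inside the margin even though the narrow test distribution is not contained in the reference distribution, a more precise mean estimate makes acceptance more, not less, likely.

```python
# Hypothetical sketch: a known-variance z-based TOST on the mean difference,
# showing the acceptance rate for a narrow test population near the margin
# growing with sample size.
import random
from statistics import NormalDist, fmean

Z = NormalDist().inv_cdf(0.95)  # one-sided 95% quantile (alpha = 0.05)

def tost_accepts(diff_hat, se, margin):
    """Accept equivalence if the 90% CI for the mean difference lies
    entirely within (-margin, +margin)."""
    return -margin < diff_hat - Z * se and diff_hat + Z * se < margin

def acceptance_rate(n, n_sim=2000, seed=7, margin=1.5,
                    mu_test=1.4, sd_ref=1.0, sd_test=0.1):
    """Acceptance rate for a narrow test population (sd 0.1) whose mean
    sits just inside an illustrative margin of 1.5 reference sigmas."""
    rng = random.Random(seed)
    # Simplifying assumption: variances treated as known.
    se = ((sd_ref ** 2 + sd_test ** 2) / n) ** 0.5
    hits = 0
    for _ in range(n_sim):
        ref_mean = fmean(rng.gauss(0.0, sd_ref) for _ in range(n))
        test_mean = fmean(rng.gauss(mu_test, sd_test) for _ in range(n))
        if tost_accepts(test_mean - ref_mean, se, margin):
            hits += 1
    return hits / n_sim
```

In this setting the simulated acceptance rate rises substantially from n = 5 to n = 100, illustrating the undesired behavior described above.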

While the examples illustrate the impact of sample size, the tool can also be used to assess the impact of other statistical testing parameters on the false acceptance and rejection rates. Finally, alternative hypotheses for statistical equivalence of the quality attributes can easily be assessed. For example, the equivalence area can be defined differently to allow a small difference in means when σ is the same, while restricting the uncomfortable but also highly unlikely situation in which a very narrowly distributed test population is located in the far tail of the reference distribution. Such a hypothesis could define equivalence of the quality attribute as the central 95% of the test population lying within the central 99% of the reference population (see Additional file 1: Figure S2). For the operating characteristics of such an alternative hypothesis, see Additional file 1: Figure S3 (MinMax, 3Sigma, equivalence testing of means) and Fig. 4 (TI).
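At the population level, such a containment hypothesis reduces to checking whether one central interval lies inside another. The following sketch assumes known normal population parameters and illustrative function names (`central_interval`, `contained`); it is not the article's implementation, only a statement of the hypothesis in code.

```python
# Hypothetical sketch: population-level check of the alternative hypothesis
# "central 95% of the test population lies within the central 99% of the
# reference population", assuming known normal parameters.
from statistics import NormalDist

def central_interval(mu, sigma, coverage):
    """Central interval of a normal population with the given coverage."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2)
    return mu - z * sigma, mu + z * sigma

def contained(mu_t, sd_t, mu_r, sd_r, cov_t=0.95, cov_r=0.99):
    """True if the central cov_t interval of the test population lies
    within the central cov_r interval of the reference population."""
    lo_t, hi_t = central_interval(mu_t, sd_t, cov_t)
    lo_r, hi_r = central_interval(mu_r, sd_r, cov_r)
    return lo_r <= lo_t and hi_t <= hi_r
```

Under this hypothesis a small mean difference with equal σ is still equivalent, whereas a very narrow test distribution placed in the far tail of the reference distribution is not, which is exactly the restriction described above.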