The additional value of ONEST (Observers Needed to Evaluate Subjective Tests) in assessing reproducibility of oestrogen receptor, progesterone receptor, and Ki67 classification in breast cancer.
Bálint Cserni, Rita Bori, Erika Csörgő, Orsolya Oláh-Németh, Tamás Pancsa, Anita Sejben, István Sejben, András Vörös, Tamás Zombori, Tibor Nyári, Gábor Cserni. Published in: Virchows Archiv: an international journal of pathology (2021)
The reproducibility of assessing potential biomarkers is crucial for their implementation. ONEST (Observers Needed to Evaluate Subjective Tests) has recently been introduced as a new, additive evaluation method for assessing reliability that demonstrates how the number of observers affects interobserver agreement. Oestrogen receptor (ER), progesterone receptor (PR), and Ki67 proliferation marker immunohistochemical stainings were assessed on 50 core needle biopsy and 50 excision samples from breast cancers by 9 pathologists according to daily practice. ER and PR statuses based on the percentages of stained nuclei were the most consistently assessed parameters (intraclass correlation coefficient, ICC: 0.918-0.996), whereas Ki67 categorized with 5 different theoretical or St Gallen Consensus Conference-proposed cut-off values demonstrated moderate to good reproducibility (ICC: 0.625-0.760). ONEST highlighted that consistent tests such as ER and PR assessment needed only 2 or 3 observers for optimal evaluation of reproducibility, and the width between the plots of the best and worst overall percent agreement values for 100 randomly selected permutations of observers was narrow. In contrast, for the less consistently evaluated Ki67 categorizations, ONEST suggested that at least 5 observers are required for a more trustworthy assessment of reliability, and the bandwidth between the best and worst plots was wider (up to a 34% difference between two observers). ONEST adds value to traditional calculations of interobserver agreement not only by highlighting the number of observers needed to evaluate reproducibility reliably, but also by showing the rate of agreement as the number of observers increases and the disagreement between the better and worse ratings.
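The ONEST procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `opa_curve` and `onest` and the toy `ratings` data are hypothetical, and the sketch assumes that overall percent agreement (OPA) for a set of observers is the proportion of cases in which all of those observers assign the same category. For each random permutation of observers, OPA is computed for the first 2, 3, ... observers, and the best and worst values across permutations give the band whose width is discussed in the abstract.

```python
import random


def opa_curve(ratings, order):
    """OPA for the first k observers in `order`, for k = 2..len(order).

    ratings: list of cases; each case is a list of category labels,
             one per observer (same observer order in every case).
    """
    n_cases = len(ratings)
    curve = []
    for k in range(2, len(order) + 1):
        chosen = order[:k]
        # A case counts as agreement only if all chosen observers match.
        agree = sum(1 for case in ratings
                    if len({case[o] for o in chosen}) == 1)
        curve.append(agree / n_cases)
    return curve


def onest(ratings, n_permutations=100, seed=0):
    """Best and worst OPA per observer count over random permutations."""
    rng = random.Random(seed)
    n_observers = len(ratings[0])
    curves = []
    for _ in range(n_permutations):
        order = list(range(n_observers))
        rng.shuffle(order)
        curves.append(opa_curve(ratings, order))
    # Column-wise max/min across all permutation curves.
    best = [max(col) for col in zip(*curves)]
    worst = [min(col) for col in zip(*curves)]
    return best, worst


# Toy data (invented for illustration): 4 cases rated by 3 observers.
ratings = [
    ["pos", "pos", "pos"],
    ["pos", "pos", "neg"],
    ["neg", "neg", "neg"],
    ["neg", "pos", "neg"],
]
best, worst = onest(ratings)
for k, (b, w) in enumerate(zip(best, worst), start=2):
    print(f"{k} observers: best OPA {b:.2f}, worst OPA {w:.2f}")
```

Because adding an observer can only remove cases from the set of full agreements, each curve is non-increasing; the observer count at which the curves flatten is what ONEST reads as the number of observers needed, and the gap between `best` and `worst` corresponds to the bandwidth between the plots.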