新規登録 | ログイン | FAQ      [?] 
CiteULike is a free online bibliography manager. Register and you can start organising your references online.
Recent | Unread | Search | Authors | Tags | Export

Improved variance estimation of classification performance via reduction of bias caused by small sample size.

by: Ulrika Wickenberg-Bolin, Hanna Goransson, Marten Fryknas, Mats G Gustafsson, Anders Isaksson
BMC Bioinformatics, Vol. 7, No. 1. (13 March 2006)


View FullText article


X Reviews [Write a review of this article]

There are no reviews of this article

X Find related articles from these CiteULike users

X Find related articles with these CiteULike tags

X Abstract

ABSTRACT: BACKGROUND: Supervised learning for classification of cancer uses a set of design examples to learn how to discriminate between tumors. In practice it is crucial to confirm that the classifier is robust with good generalization performance to new examples, or at least that it has a performance that is better than random guessing. A suggested alternative is to obtain a confidence interval of the error rate using repeated design and test sets selected from available examples. However, it is known that even in the ideal situation of repeated designs and tests with completely novel samples in each cycle, a small test set size leads to a large bias in the estimate of the true variance between design sets. Therefore different methods for small sample performance estimation such as a recently proposed procedure called Repeated Random Sampling (RSS) is also expected to result in heavily biased estimates, which are translated into biased confidence intervals. We explored such biases and developed a refined algorithm called Repeated Independent Design and Test (RIDT). RESULTS: Our simulations show that repeated designs and tests based on resampling in a fixed bag of samples yield a biased variance estimate. We also demonstrate that it is possible to obtain an improved variance estimate by means of a procedure that explicitly models how this bias depends on the number of samples used for testing. For the special case of repeated designs and tests using new samples for each design and test, we present an exact analytical expression for how the expected value of the bias decreases with the size of the test set. CONCLUSIONS: We show that via modeling and subsequent reduction of the small sample bias, it is possible to obtain an improved estimate of the variance of classifier performance between design sets. However, even though the method provides unbiased performance estimates, the uncertainty of the variance estimate is large in the simulations performed, which indicates that the method in its present form cannot be directly applied to small data sets.


X BibTeX record

X RIS record



RIS BibTeX
CiteULike organises scholarly (or academic) papers or literature and provides bibliographic (which means it makes bibliographies) for universities and higher education establishments. It helps undergraduates and postgraduates. People studying for PhDs or in postdoctoral (postdoc) positions. The service is similar in scope to EndNote or RefWorks or any other reference manager like BibTeX, but it is a social bookmarking service for scientists and humanities researchers.