Biological sequence classification utilizing positive and unlabeled data.
Bioinformatics (14 March 2008)
Abstract

MOTIVATION: In the genomics setting, an increasingly common data configuration consists of a small set of sequences possessing a targeted property (positive instances) amongst a large set of sequences for which class membership is unknown (unlabeled instances). Traditional two-class classification methods do not effectively handle such data.

RESULTS: Here, we develop a novel method, Likely Positive-Iterative Classification (LP-IC), for this problem and contrast its performance with the few existing methods, most of which were devised and utilized in the text classification context. LP-IC employs an iterative classification scheme and introduces a class dispersion measure, adopted from unsupervised clustering approaches, to monitor the model selection process. Using two case studies (prediction of HLA binding, and alternative splicing conservation between human and mouse), we show that LP-IC provides superior performance to existing methodologies in terms of: (i) combined accuracy and precision in positive identification from the unlabeled set; and (ii) predictive performance of the resultant classifiers on independent test data.

CONTACT: mark@biostat.ucsf.edu
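The abstract names two ingredients: an iterative classification scheme over positive and unlabeled data, and a class dispersion measure used to monitor model selection. The sketch below illustrates that general idea only and is not the published LP-IC algorithm, whose details the abstract does not give. Everything in it is an assumption made for illustration: the nearest-centroid classifier, the pooled within-class mean squared distance used as the dispersion measure, the rule of keeping the iteration with the lowest dispersion, the toy 2-D feature vectors, and the hypothetical names `centroid`, `dispersion`, and `iterative_pu`.

```python
from math import dist  # Euclidean distance, Python 3.8+


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def dispersion(groups):
    """Assumed dispersion measure: pooled mean squared distance of each
    point to the centroid of its own group (empty groups are skipped)."""
    total, count = 0.0, 0
    for pts in groups:
        if not pts:
            continue
        c = centroid(pts)
        total += sum(dist(p, c) ** 2 for p in pts)
        count += len(pts)
    return total / count


def iterative_pu(positives, unlabeled, max_iter=20):
    """Illustrative iterative positive/unlabeled scheme (not LP-IC itself).

    Each round fits a nearest-centroid classifier treating the remaining
    unlabeled points as provisional negatives, then moves the unlabeled
    points it calls positive into a 'likely positive' set. The state with
    the lowest dispersion across rounds is returned.
    """
    likely_pos, remaining = [], list(unlabeled)
    # Record the starting state so there is always something to return.
    history = [(dispersion([positives, remaining]), [], list(remaining))]
    for _ in range(max_iter):
        if not remaining:
            break
        pos_c = centroid(positives + likely_pos)
        neg_c = centroid(remaining)
        moved, kept = [], []
        for p in remaining:
            (moved if dist(p, pos_c) < dist(p, neg_c) else kept).append(p)
        if not moved:
            break  # converged: no unlabeled point changes class
        likely_pos, remaining = likely_pos + moved, kept
        history.append(
            (dispersion([positives + likely_pos, remaining]),
             list(likely_pos), list(remaining))
        )
    # Model selection by the assumed criterion: minimum dispersion.
    _, best_pos, best_neg = min(history, key=lambda h: h[0])
    return best_pos, best_neg


# Toy data: known positives near the origin; the unlabeled set mixes two
# hidden positives with four points from a distant cluster.
positives = [(0.0, 0.0), (0.2, 0.1), (0.1, -0.1)]
unlabeled = [(0.1, 0.2), (5.0, 5.0), (-0.1, 0.1),
             (5.2, 4.9), (4.8, 5.1), (5.1, 5.2)]
likely_positive, likely_negative = iterative_pu(positives, unlabeled)
print(likely_positive)
print(likely_negative)
```

On real sequence data the tuples would be replaced by whatever feature encoding the application uses, and the classifier and dispersion measure by those of the actual method; the sketch only shows how a dispersion score can be tracked across iterations and used to pick a stopping point.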