Linear discriminant analysis of character sequences using occurrences of words

Subhajit Dutta, Probal Chaudhuri, Anil Ghosh

Research output: Contribution to journal, Article, peer-review

2 Scopus citations

Abstract

Classification of character sequences, where the characters come from a finite set, arises in disciplines such as molecular biology and computer science. For discriminant analysis of such character sequences, the Bayes classifier based on Markov models turns out to have class boundaries defined by linear functions of occurrences of words in the sequences. It is shown that for such classifiers based on Markov models with unknown orders, if the orders are estimated from the data using cross-validation, the resulting classifier has Bayes risk consistency under suitable conditions. Even when Markov models are not valid for the data, we develop methods for constructing classifiers based on linear functions of occurrences of words, where the word length is chosen by cross-validation. Such linear classifiers are constructed using ideas of support vector machines, regression depth, and distance weighted discrimination. We show that classifiers with linear class boundaries have certain optimal properties in terms of their asymptotic misclassification probabilities. The performance of these classifiers is demonstrated in various simulated and benchmark data sets.
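
The claim that a Markov-model Bayes classifier has a class boundary linear in word occurrences can be checked numerically. The sketch below is illustrative only and is not the authors' implementation. It assumes a toy DNA alphabet, first-order Markov chains, equal class priors, add-one smoothing, and likelihoods conditioned on the first character of each sequence (so the initial-state term is ignored). All function names and the toy sequences are hypothetical.

```python
import math
from collections import Counter
from itertools import product

def word_counts(seq, k):
    """Count overlapping words of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def fit_markov(seqs, alphabet, order, alpha=1.0):
    """Estimate log P(next char | previous `order` chars) with add-alpha smoothing."""
    counts = Counter()
    for s in seqs:
        counts.update(word_counts(s, order + 1))
    logp = {}
    for ctx in map("".join, product(alphabet, repeat=order)):
        total = sum(counts[ctx + c] for c in alphabet) + alpha * len(alphabet)
        for c in alphabet:
            logp[ctx + c] = math.log((counts[ctx + c] + alpha) / total)
    return logp

def log_lik(seq, logp, order):
    """Log-likelihood of seq, conditioning on its first `order` characters."""
    return sum(logp[seq[i - order:i + 1]] for i in range(order, len(seq)))

def linear_score(seq, logp_a, logp_b, order):
    """The same log-likelihood ratio, written as a linear function of word counts."""
    n = word_counts(seq, order + 1)
    return sum(cnt * (logp_a[w] - logp_b[w]) for w, cnt in n.items())

# Toy training data (made up for illustration).
alphabet = "ACGT"
order = 1
class_a = ["ACACACACGT", "ACACACAC"]
class_b = ["GGTTGGTTGG", "TTGGTTGGAC"]
logp_a = fit_markov(class_a, alphabet, order)
logp_b = fit_markov(class_b, alphabet, order)

query = "ACACACGT"
direct = log_lik(query, logp_a, order) - log_lik(query, logp_b, order)
linear = linear_score(query, logp_a, logp_b, order)

# With equal priors, the Bayes rule assigns class "a" when the score is positive.
label = "a" if linear > 0 else "b"
print(direct, linear, label)
```

The two scores agree because each position in the sequence contributes the log transition probability of the word ending there, so summing over positions is the same as summing over distinct words weighted by their counts. The weights `logp_a[w] - logp_b[w]` play the role of the coefficients of the linear boundary; in the setting described in the abstract, `order` would be chosen by cross-validation rather than fixed.
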
Original language: English (US)
Pages (from-to): 493-514
Number of pages: 22
Journal: Statistica Sinica
Volume: 24
Issue number: 1
State: Published - Feb 25 2013

ASJC Scopus subject areas

  • Statistics and Probability
  • Statistics, Probability and Uncertainty
