
Description of MuLDAS methodology

MuLDAS is a web tool that maps the query and reference sequences together to a high-dimensional space using multidimensional scaling and predicts the subtype using linear discriminant analysis. The features of MuLDAS include:

  1. Use of several hundred reference sequences, allowing statistical inference
  2. For each query, linear discriminant models are separately built and validated by leave-one-out cross-validation of the reference sequences.
  3. The maximum a posteriori classification of the subtype and the associated posterior probability are given.
  4. Stepwise analyses: prediction among the non-recombinant M group subtypes, followed by a 'nested' analysis among the best subtype and its associated recombinant subtypes
  5. Outlierness analyses indicate whether the query is outside or inside the subtype cluster

Flow-chart of MuLDAS

Details of the method (HIV-1 as an example)

  1. For each of the HIV-1 genes, the corresponding multiple sequence alignment (MSA) downloaded from Los Alamos National Laboratory (LANL) is used to create an HMM profile. Each MSA contains 400 to 1,000 complete gene sequences; these are distinct from the 100 or so manually picked subtype reference sequences that LANL publishes and updates regularly.
  2. The query nucleotide sequence in FASTA format is aligned to the RefSeq genome sequence (NC_001802) using BLASTN to identify the genes it covers.
  3. For each hit found above, the query is aligned to the corresponding MSA using the HMM profile. This preserves the LANL MSAs, which have undergone manual editing, and speeds up the whole process by avoiding the lengthy de novo multiple alignment of several hundred sequences.
  4. The MSA that includes the query is then trimmed of any indels and used to calculate the pairwise distance matrix with the distmat program of the EMBOSS package. We offer all the multiple-substitution correction methods available in distmat; the default is Jukes-Cantor.
  5. The query and reference sequences are represented as points in a high-dimensional space using the multidimensional scaling (MDS) procedure built into the R statistical package (vide infra for MDS dimensionality).
  6. The decision boundaries that separate the reference subtypes are modeled by linear discriminant analysis, available from the MASS library of the R package (vide infra for LDA concepts). These models are validated using leave-one-out cross-validation of the reference sequences. Given the models, the query sequence is assigned to the subtype with the maximum a posteriori (MAP) probability.
  7. Steps 5 and 6 are first performed using the non-recombinant major reference sequences (A-K and CRF01_AE); this is called the 'major' analysis (we do not distinguish sub-subtypes). The subtype with the MAP probability is reported as the best prediction, and all subtypes with a probability greater than 0.01 are identified. The circulating recombinant forms (CRFs) that originated from these subtypes are looked up from LANL. A subset of reference sequences belonging to these CRFs, together with those of the original major subtypes, is collected, and prediction steps 5 and 6 are repeated. This is called the 'nested' analysis.
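
The Jukes-Cantor correction named in step 4 can be illustrated with a minimal sketch. It assumes sequences taken from the same alignment and skips any column in which either sequence has a gap. The function names are hypothetical, and the code does not attempt to reproduce distmat's exact gap handling or output scaling.

```python
import math

def p_distance(a, b):
    """Fraction of mismatched sites, skipping columns with a gap in either sequence."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    compared = mismatched = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # assumption: gapped columns are ignored
        compared += 1
        if x != y:
            mismatched += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return mismatched / compared

def jukes_cantor(a, b):
    """Jukes-Cantor distance d = -(3/4) * ln(1 - 4p/3) for nucleotide sequences."""
    p = p_distance(a, b)
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def distance_matrix(seqs):
    """Symmetric matrix of pairwise Jukes-Cantor distances."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = jukes_cantor(seqs[i], seqs[j])
    return d
```

The corrected distance is always at least as large as the raw mismatch fraction, because it accounts for multiple substitutions at the same site.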

MDS dimensionality

The number of dimensions in which MDS should be performed is an important issue. MDS reports the eigenvalues in descending order, and one would like to capture as much of the variance in the distance matrix as possible. We tested this by monitoring the LOOCV misclassification rate of the reference sequences. Although the error rate varies with the gene concerned, the consensus is that no improvement is seen beyond 10 dimensions with the HIV-1 non-recombinant subtypes. A similar test with the HIV-1 CRF references indicated that 5 dimensions are sufficient. When the eigenvalues become too small, we are capturing mere noise and the subsequent singular value decomposition step of LDA becomes unstable. In practice, we keep only the dimensions whose eigenvalues are at least 1% of the maximum.
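
The dimension selection can be sketched with classical (Torgerson) MDS in NumPy. This is an illustration, not the R code used by the server. The function name and the parameters `max_dim` and `min_ratio` are hypothetical, and the 1% cutoff is read here as relative to the largest eigenvalue, which is an assumption.

```python
import numpy as np

def classical_mds(dist, max_dim=10, min_ratio=0.01):
    """Classical MDS that keeps at most max_dim dimensions and drops
    those whose eigenvalue is below min_ratio times the largest one."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Double-center the squared distances to obtain the Gram matrix.
    j = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * j @ (d ** 2) @ j
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1]  # descending eigenvalues
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    if eigvals[0] <= 0:
        raise ValueError("distance matrix carries no positive variance")
    # Eigenvalues are sorted, so those passing the cutoff form a prefix.
    k = min(max_dim, int(np.sum(eigvals >= min_ratio * eigvals[0])))
    coords = eigvecs[:, :k] * np.sqrt(eigvals[:k])
    return coords, eigvals[:k]
```

With `max_dim=10` for the non-recombinant subtypes or `max_dim=5` for the CRF references, the returned coordinates would serve as the input to the LDA step.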

LDA concepts (HIV-1 as an example)

Once multidimensional scaling maps the sequences into a high-dimensional space (a two-dimensional projection here), LDA looks for the boundaries (white lines) that separate the subtypes (white symbols). The boundaries are lines in two dimensions and planes in three dimensions; in general, the boundary has k-1 dimensions for k-dimensional data. In practice, LDA fits a Gaussian probability distribution to the points of each subtype and assigns the maximum a posteriori classification to the query. If the query clusters well inside one of the subtypes, its subtype assignment is clear and the posterior probability is close to 1. If the query is located between clusters, the probabilities can be split as shown below. The query, shown as '+', is found outside A and CRF01_AE and not far from K.
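
The posterior calculation can be sketched as follows, assuming the textbook form of LDA: one Gaussian per subtype with a shared (pooled) covariance matrix, and priors proportional to the number of references in each subtype. The function name is hypothetical, and the sketch is not the MASS implementation.

```python
import numpy as np

def lda_posterior(points, labels, query):
    """Posterior probability of each class for one query point under
    Gaussian class densities with a pooled covariance matrix."""
    x = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    q = np.asarray(query, dtype=float)
    classes = sorted(set(labels.tolist()))
    n, dim = x.shape
    means = {}
    pooled = np.zeros((dim, dim))
    for c in classes:
        xc = x[labels == c]
        means[c] = xc.mean(axis=0)
        centered = xc - means[c]
        pooled += centered.T @ centered
    pooled /= (n - len(classes))       # pooled within-class covariance
    inv = np.linalg.pinv(pooled)       # pseudo-inverse guards against singularity
    scores = []
    for c in classes:
        diff = q - means[c]
        prior = np.mean(labels == c)   # assumption: priors follow class sizes
        scores.append(np.log(prior) - 0.5 * diff @ inv @ diff)
    scores = np.array(scores)
    post = np.exp(scores - scores.max())  # subtract the max for numerical stability
    post /= post.sum()
    return dict(zip(classes, post.tolist()))
```

A query deep inside one cluster receives a posterior close to 1 for that subtype, whereas a query midway between two equally sized clusters receives a posterior close to 0.5 for each.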

Outlierness analysis

Sometimes the query is found outside a nearby cluster. LDA may still assign it to the nearest subtype with very high probability if the second nearest subtype is far away. It is therefore informative for users to know whether the query clusters well with the assigned subtype. We define outlierness as the distance of the query from the nearest subtype center relative to the radius of the subtype cluster in the direction of the query. In practice, let v = q - c, where q is the position of the query and c is the center of the subtype. Then o = v · v / max(v · (r - c)), where r is the position of a reference sequence and the maximum is taken over the references of that subtype. If o is greater than 1, the query is outside the cluster.
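
The outlierness formula can be written out as a short sketch. It assumes that the subtype center c is the centroid of the reference points and that the maximum runs over the references of the assigned subtype. Both are assumptions, and the function name is hypothetical.

```python
import numpy as np

def outlierness(query, references):
    """o = (v . v) / max_r(v . (r - c)), with v = q - c."""
    q = np.asarray(query, dtype=float)
    refs = np.asarray(references, dtype=float)
    c = refs.mean(axis=0)            # assumption: center = centroid of the references
    v = q - c
    vv = float(v @ v)
    if vv == 0.0:
        return 0.0                   # the query sits exactly on the center
    extent = float(np.max((refs - c) @ v))  # farthest reference along v, scaled by |v|
    if extent <= 0.0:
        return float("inf")          # no reference extends toward the query
    return vv / extent

square = [[0, 0], [2, 0], [0, 2], [2, 2]]
print(outlierness([1.5, 1.0], square))  # 0.5, inside the cluster
print(outlierness([4.0, 1.0], square))  # 3.0, outside the cluster
```

Because both the numerator and the denominator scale with the length of v, the ratio compares the query's distance from the center with the extent of the cluster along the same direction.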

Copyrighted by Bio-Data Mining Lab, Department of Bioinformatics and Life Sciences, Soongsil University, Seoul, Korea