Description of MuLDAS methodology
MuLDAS is a web tool that maps the query and reference sequences together into a high-dimensional space using multidimensional scaling and predicts the subtype using linear discriminant analysis. The features of MuLDAS include:
 Use of several hundred reference sequences, allowing statistical inference.
 For each query, linear discriminant models are built separately and validated by leave-one-out cross-validation of the reference sequences.
 The maximum a posteriori classification of the subtype and the associated posterior probability are reported.
 Stepwise analyses: prediction among the non-recombinant group M subtypes, followed by a 'nested' analysis among the best-scoring subtype and its associated recombinant forms.
 Outlierness analysis indicates whether the query lies inside or outside the subtype cluster.

Flowchart of MuLDAS

Details of the method (HIV-1 as an example)
 1. For each of the HIV-1 genes, the corresponding multiple sequence alignment (MSA) downloaded from Los Alamos National Laboratory (LANL) is used to create an HMM profile. Each MSA contains 400 to 1,000 complete gene sequences, which are distinct from the 100 or so manually picked subtype reference sequences published and regularly updated by LANL.
 2. The query nucleotide sequence in FASTA format is aligned to the RefSeq genome sequence (NC_001802) using BLASTN to identify the genes it covers.
 3. For each hit found above, the query is aligned to the corresponding MSA using the HMM profile. This preserves the LANL MSAs, which have undergone manual editing, and expedites the whole process by avoiding the lengthy de novo multiple alignment of several hundred sequences.
 4. The MSA that includes the query is then trimmed of any indels and used to calculate the pairwise distance matrix with the distmat program of the EMBOSS package. We offer all the multiple-substitution correction algorithms available in distmat; the default is Jukes-Cantor.
 5. The query and reference sequences are represented as points in a high-dimensional space using the multidimensional scaling (MDS) procedure built into the R statistical package (vide infra for MDS dimensionality).
 6. The decision boundaries that separate the reference subtypes are modeled by linear discriminant analysis (LDA), available from the MASS library of R (vide infra for LDA concepts). These models are validated using leave-one-out cross-validation of the reference sequences. Given the models, the query sequence is assigned to the subtype with the maximum a posteriori (MAP) probability.
 7. Steps 5 and 6 are performed first using the non-recombinant major reference sequences (A-K and CRF01_AE); this is called the 'major' analysis (we do not distinguish sub-subtypes). The subtype with the MAP probability is reported as the best prediction, and all subtypes with a posterior probability greater than 0.01 are identified. The circulating recombinant forms (CRFs) that originated from these subtypes are looked up from LANL. A subset of reference sequences belonging to these CRFs, together with those of the original major subtypes, is collected, and prediction steps 5 and 6 are repeated. This is called the 'nested' analysis.
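
The distance calculation described above can be illustrated with a small sketch. It is not a reimplementation of distmat: the function names are hypothetical, "trimming indels" is assumed to mean dropping every alignment column in which any sequence has a gap, and the distance is the textbook Jukes-Cantor formula d = -3/4 ln(1 - 4p/3) in substitutions per site, which may be scaled differently from distmat's output.

```python
import math

GAP_CHARS = set("-.")  # assumed gap symbols in the alignment


def strip_gap_columns(aligned):
    """Drop every column in which at least one sequence has a gap."""
    keep = [i for i in range(len(aligned[0]))
            if all(seq[i] not in GAP_CHARS for seq in aligned)]
    return ["".join(seq[i] for i in keep) for seq in aligned]


def jukes_cantor(a, b):
    """Jukes-Cantor distance between two gap-free aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned, equal-length and non-empty")
    # p is the observed fraction of mismatching sites.
    p = sum(x != y for x, y in zip(a, b)) / len(a)
    if p >= 0.75:
        return float("inf")  # the correction is undefined at saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(aligned):
    """Pairwise Jukes-Cantor distances after removing gapped columns."""
    trimmed = strip_gap_columns(aligned)
    n = len(trimmed)
    return [[jukes_cantor(trimmed[i], trimmed[j]) for j in range(n)]
            for i in range(n)]


# Toy alignment (made-up sequences, not LANL data).
toy = ["ACGTACGT", "ACGTACGA", "AC-TACGT"]
matrix = distance_matrix(toy)
```

The resulting symmetric matrix is the kind of input that the MDS step then converts into coordinates.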

MDS dimensionality
The number of dimensions in which MDS is performed is an important choice. MDS reports the eigenvalues in descending order, and one would like to capture as much of the variance in the distance matrix as possible. We tested this by monitoring the leave-one-out cross-validation (LOOCV) misclassification rate of the reference sequences. Although the error rate varies with the gene concerned, the consensus is that no improvement is seen beyond 10 dimensions for the HIV-1 non-recombinant subtypes. A similar test with the HIV-1 CRF references indicated that 5 dimensions are sufficient. When the eigenvalues become too small, the corresponding dimensions capture mere noise and the subsequent singular value decomposition step of LDA becomes unstable. In practice, we keep only the dimensions whose eigenvalues are at least 1% of the largest eigenvalue.
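
The eigenvalue cutoff can be sketched as below. This is an illustration under stated assumptions rather than the MuLDAS code: the function name `select_dimensions`, the optional `max_dims` cap and the example eigenvalues are all hypothetical, and the eigenvalues are assumed to have been obtained from MDS beforehand.

```python
def select_dimensions(eigenvalues, rel_cutoff=0.01, max_dims=None):
    """Return how many leading MDS dimensions to keep.

    A dimension is kept when its eigenvalue is at least rel_cutoff times
    the largest eigenvalue. max_dims is an optional hard cap (assumption).
    """
    ordered = sorted(eigenvalues, reverse=True)
    if not ordered or ordered[0] <= 0:
        return 0
    threshold = rel_cutoff * ordered[0]
    kept = sum(1 for ev in ordered if ev >= threshold)
    if max_dims is not None:
        kept = min(kept, max_dims)
    return kept


# Made-up eigenvalues in descending order; the threshold is 0.5 here.
example = [50.0, 20.0, 5.0, 0.6, 0.4, 0.01, -0.2]
n_dims = select_dimensions(example)            # keeps 50, 20, 5 and 0.6
n_dims_capped = select_dimensions(example, max_dims=3)
```

Small or negative eigenvalues, which classical MDS can produce for non-Euclidean distance matrices, fall below the threshold and are dropped.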

LDA concepts (HIV-1 as an example)
Once multidimensional scaling has mapped the sequences into a high-dimensional space (a two-dimensional projection is shown here), LDA looks for the boundaries (white lines) that separate the subtypes (white symbols). The boundaries are lines in two dimensions and planes in three dimensions; in general, the boundary is (k-1)-dimensional for k-dimensional data. In practice, LDA fits a Gaussian probability distribution function to the points of each subtype and assigns the maximum a posteriori classification to the query. If the query clusters well inside one of the subtypes, its subtype assignment is clear and the posterior probability is close to 1. If the query is located between clusters, the probabilities can be split, as shown below. The query, shown as '+', is found outside A and CRF01_AE and not far away from K.
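
To make the posterior calculation concrete, here is a minimal two-dimensional sketch of LDA with a pooled covariance matrix. It is not the MASS implementation used by MuLDAS: the function names, the class labels "A" and "B" and the coordinates are hypothetical, and the priors are assumed to be the class frequencies among the references.

```python
import math


def fit_lda_2d(points_by_class):
    """Estimate class means, priors and the pooled 2x2 covariance."""
    n_total = sum(len(pts) for pts in points_by_class.values())
    means, priors = {}, {}
    sxx = sxy = syy = 0.0
    for label, pts in points_by_class.items():
        mx = sum(p[0] for p in pts) / len(pts)
        my = sum(p[1] for p in pts) / len(pts)
        means[label] = (mx, my)
        priors[label] = len(pts) / n_total
        for x, y in pts:
            sxx += (x - mx) ** 2
            sxy += (x - mx) * (y - my)
            syy += (y - my) ** 2
    dof = n_total - len(points_by_class)
    return means, priors, (sxx / dof, sxy / dof, syy / dof)


def posteriors(query, means, priors, cov):
    """Posterior probability of each class for a 2D query point."""
    a, b, d = cov                      # covariance [[a, b], [b, d]]
    det = a * d - b * b
    inv = (d / det, -b / det, a / det)
    scores = {}
    for label, (mx, my) in means.items():
        wx = inv[0] * mx + inv[1] * my  # first component of Sigma^-1 * mu
        wy = inv[1] * mx + inv[2] * my  # second component of Sigma^-1 * mu
        # Linear discriminant score for this class.
        scores[label] = (query[0] * wx + query[1] * wy
                         - 0.5 * (mx * wx + my * wy)
                         + math.log(priors[label]))
    top = max(scores.values())          # subtract the max for numerical stability
    weights = {k: math.exp(s - top) for k, s in scores.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


# Made-up reference coordinates for two hypothetical clusters.
refs = {
    "A": [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)],
    "B": [(4.0, 0.0), (5.0, 0.0), (4.0, 1.0), (5.0, 1.0)],
}
means, priors, cov = fit_lda_2d(refs)
inside_a = posteriors((0.5, 0.5), means, priors, cov)   # well inside cluster A
between = posteriors((2.5, 0.5), means, priors, cov)    # midway between clusters
```

In this toy setup a query at the center of cluster A receives a posterior close to 1 for A, whereas a query midway between the two clusters receives an even split, mirroring the two situations described above.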

Outlierness analysis
Sometimes the query may be found outside a nearby cluster. LDA may still assign it to the nearest subtype with very high probability if the second nearest subtype is far away. It is therefore informative for users to know whether or not the query clusters well with the assigned subtype. We define outlierness as the distance of the query from the nearest subtype center relative to the radius of the subtype cluster. In practice, let v = q - c, where q is the position of the query and c is the center of the subtype. Then o = (v · v) / max(v · (r - c)), where r is the position of a reference sequence of that subtype and the maximum is taken over all such references. If o is greater than 1, the query is outside the cluster.
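
The outlierness formula can be written out as a short sketch. The function names and coordinates are hypothetical, and the subtype center c is assumed here to be the mean of the reference positions, which the text above does not state explicitly.

```python
def dot(a, b):
    """Dot product of two equal-length coordinate sequences."""
    return sum(x * y for x, y in zip(a, b))


def outlierness(query, references):
    """o = (v . v) / max_r(v . (r - c)) with v = query - center."""
    dims = len(query)
    # Assumption: the cluster center is the mean of the references.
    center = [sum(r[i] for r in references) / len(references)
              for i in range(dims)]
    v = [q - c for q, c in zip(query, center)]
    vv = dot(v, v)
    if vv == 0:
        return 0.0  # the query sits exactly on the center
    # Largest extent of the cluster in the direction of the query.
    extent = max(dot(v, [ri - ci for ri, ci in zip(r, center)])
                 for r in references)
    if extent <= 0:
        return float("inf")  # no reference lies in the query's direction
    return vv / extent


# Made-up cluster: four references on the corners of a square.
cluster = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
o_inside = outlierness((1.5, 1.0), cluster)   # 0.5, inside the cluster
o_outside = outlierness((4.0, 1.0), cluster)  # 3.0, outside the cluster
```

Because the denominator is the farthest projection of any reference onto the query direction, o compares the query's distance from the center with the cluster's reach along that same direction.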

