Friday, March 21, 2014

Nice visualisation for single dimension data with labels

You may know Jari Oksanen et al's vegan (plant community ecology) package (http://cran.r-project.org/web/packages/vegan/vegan.pdf). It contains a neat function, 'linestack', for plotting labelled one-dimensional data so that the labels fit. I have found it a handy visualisation technique, e.g. for plotting correlations between plant taxa and an environmental variable (see real example plot below).



vectors <- data.frame(names=as.matrix(colors())[runif(40, min=1, max=650)], r = runif(40, min=-1, max=1))

vectors

                  names           r
1                orange -0.36946900
2                grey67 -0.15152158
3                  red2 -0.80901845
4        mediumseagreen  0.11225298
5            lightpink3  0.15080658
6                ivory1  0.11756504
7        cornflowerblue  0.57406034
8               oldlace -0.71916037
9        lightslategray -0.76570037
10               grey92  0.04404581
11       paleturquoise1  0.94855012
12               gray33  0.20623527
13                snow2 -0.31711523
14         navajowhite1 -0.53142940
15               gray71 -0.80816250
16               gray62 -0.13817008
17           steelblue3  0.57297053
18           lightcyan4  0.29911661
19 lightgoldenrodyellow -0.53221485
20               coral2 -0.66000856
21       paleturquoise2  0.83141393
22           palegreen4 -0.74607829
23                 pink  0.98511951
24               gray44  0.17423459
25              tomato1 -0.04841200
26          floralwhite  0.31414141
27               gray20 -0.02050105
28           lightcyan3  0.92788320
29               grey69 -0.84439991
30           burlywood3 -0.60867127
31                snow3  0.45804930
32               gray13  0.18198486
33               wheat2  0.71317735
34               purple -0.98669286
35              dimgrey -0.63993588
36               grey91  0.91495308
37        darkslateblue -0.54674852
38               gray14  0.59225932
39           indianred1  0.36252024
40            cadetblue  0.96656531



#plot - this code creates 2 side by side plots: the linestack on the left, and some descriptive arrows on the right

par(mfrow=c(1,2), mar=c(3,0,1,0))

linestack(vectors$r, labels=vectors$names, axis=TRUE, air=1.8, cex=0.7, hoff=7, font=6, at=-0.3)

plot(0.5, 0.5, ylim=c(-1,1),xlim=c(0,1), cex=0, xaxt="n", xlab="", bty="n", yaxt="n", ylab="")

arrows(0.1, 0.2, x1=0.1, y1=0.8, lwd=2, lty=1, angle=10)
arrows(0.1, -0.2, x1=0.1, y1=-0.8, lwd=2, angle=10)

text(0.05, -0.4, "negative", srt=90, cex=1.5)
text(0.05, 0.4, "positive", srt=90, cex=1.5)

Dummy example of 'linestack' in Vegan produced by above code.





Tuesday, March 18, 2014

Neighbour-joining trees based on factor input data

Here is a simple way to generate a neighbour-joining tree from factor data in R. This might be useful for, e.g.,  a cladistic analysis, or even analysis of indels, length mutations…? In this example, the factors are unordered and pairwise dissimilarities are calculated based on the proportion of shared character states, i.e. if 20% of the factors have shared states between a pair of species, their dissimilarity is 0.8; 1 indicates none shared. You can also specify ordered factors. The tree might be an end in itself, or a starting point for further analysis/optimisation. Note that while you can specify that the 'daisy' function should use 'Gower' dissimilarity (which handles factors), it is not needed in this case as it is automatically used if some of the input data are not numeric.



require(cluster)

DATA <- data.frame(f1=c(1,1,2,1),f2=c(3,1,2,2), f3=c(4,3,2,2), f4=c(5,5,4,1), f5=c(2,5,5,5), row.names=c("species1", "species2", "species3", "species4")) #dummy data

DATA
         f1 f2 f3 f4 f5
species1  1  3  4  5  2
species2  1  1  3  5  5
species3  2  2  2  4  5
species4  1  2  2  1  5


FUNCx <- function(x) as.factor(x)

DATA2 <- as.data.frame(apply(DATA, 2, FUNCx)) #convert from numeric to factor

DAISY <- daisy(DATA2) #generate dissimilarity matrix

DAISY
Dissimilarities :
         species1 species2 species3
species2      0.6                  
species3      1.0      0.8         
species4      0.8      0.6      0.4

Metric :  mixed ;  Types = N, N, N, N, N 
Number of objects : 4


require(ape)

plot(nj(DAISY)) #plot tree
Neighbour-joining tree based on dummy factor data using the function 'daisy' with Gower dissimilarity




Tuesday, March 4, 2014

old school permutation test for phylogenetic signal in a Maximum Parsimony tree

Some old-school phylogenetics in R with Maximum Parsimony- still useful on occasion. The idea of MP is to find the shortest possible tree, given the data (or the consensus of equally short trees). The MP (most parsimonious) tree is considered the most likely. But how do you test whether an MP tree is statistically significant? It is usual to run bootstrap re-sampling of the data and see how many times particular branches appear, but what about testing for phylogenetic signal across the tree?

Below is one approach that tests whether the tree is any shorter than if the relationships were random. The test uses a (non-parametric) permutation, where character states at each locus are randomly permuted among the taxa. We can conclude the tree has strong phylogenetic signal if it is significantly shorter than trees generated from permuted data. The test is known as PTP or 'permutation tail probability'. You should check literature on pros and cons and remember that this method is relatively crude and will only really single out a pretty poor data set.


First generate a MP tree. The approach is to start with a NJ tree which is then optimised with MP.
Some dummy data:

Copy and paste the following into a text editor, save as plain text, then change the file extenstion to .fasta (I'm sure there is a way to paste directly into R in required format, but this works for an example)


>Species1
AAAAGAATAGAATCTATAATTCCCAATA-AATAT-TG-GAT-ATTA-----------------T-AG-AA-CATAACG-A
TTT-AT-ATAGCG-ATATAGAATTTCG-ATTTCTTTATC-AC-AATT-CTAAATTCGAATAC-AATTCGAATATTATTCT
ATTC-GATAATAAA-----TCTTAAATATTTTT-AGATAGTCAAATTTT------CTTTTTTATTTTTGATTTTGAATTC
AT--ATGACATTTGAAATTCTT---TTTGTTACACTTCTATAT------TTTCTATCCTATATATTT-TTA-ATTCTATA
TTTCT-TATTTCTGTTTCGAATTCGATTGATTGGTTATAAGGAATCAGGAC-ATTCCTTGGGCTTTTCA-TTCCGGAAAA
GGGGGAAGTTAAGAAAAAAAA----------TAATCGACCCTTCAA-GTATCCCCAATTCCATGGGAAAAG-TGGCGGAA
AG-AGAAAGATATAT-ATGGGG-TATATATCAATCCATATTGAATTGCGGATACAGAAATGATAAAATCCAATTTGATTG
G-ATCAAATAAGGGTTTCTATAGGTAAGAAAGAAAAG-ACTTTTTTCGAGATAGTAAT------CGCTATCTAATAAATT
CAAGGGTTCCAGTATAAATGAAAGAA---------GCAATCCTGAGCCAAATCCT-GTTTTCTCAAAACAAAGGT-----
TT-AAAAAACGAAAAAAAAAA-----GGATAGGTGCAGAGACTCAATGGAAGCTGTTCTAACAAATGGAGTTGACTGCGT
TGGTAGAGGAACCTTTCCATGGAAACTTCAGAAAGGA---TGAAGGATAAACGTATCTATTGAATACTATATCA-----A
ATGATTAATGATGGTCCGAATCTGTATTTGTATTTTTTAATATAAAAAATGGAAGAATTGGTGTGAATTGATTCCACATT
GAAGAAAAAATCGAATATTCATTCAGCAAATGATTCACTCCAGAAGTCCGATAGATCTTTTAAAGAACTGATTAATCGGA
CGAGAATAAAGATAGAGTCCCATTCTACATGTCAATACCGGCAACAATGAAATTTATAGTAAGAGGAAAATCCGTCGACT
TTAAAAATCGTG-------CCCCCTTTTTCGTTATCGGTTCGAAATTCCTTTATCTTTCTGATTCTTTGACAAA------
CGT-ATTTGGGCGCAAA----GGACTTTCTCTTATCACATGGGATATAGAATACACATCCAAATTAAGC-AAGGAATCCC
TTTTTGAAGGATTCACAA-TCAATAGCATTAC-TCATACT----------GAAA--CTTATAAAGTCGTCTTTTT-GAAG
ATCCACGAAATTAGAGGACTTGGAGAAAACTTTGCAATTCCCCTTG-CCCCTTTAATTGACATAGACTCCAGTCATCTAA
TAAAATGAGGATGGGATGCTACATTGGGAATGGTCGGGATA-----GGGGGGCCCCCCG
>Species2
AAAAAA-TAGAATCTATAATTCCCAATA-AATAT-TG-GAT-ATTAATAAATATTGGATATTAT-AG-AA-CATAACG-A
TTT-AT-ATAGCG-ATATAGAATTCCG-ATTTCTTTATC-AC-AATT-CTAAATTCGAATAC-AATTCGAATATTATTCT
ATTC-GATAGTCAA-----TCTTAACTATTTTT-AGATAGTCAAATTTTTTTT---------ATTTAAGATTTTGAATTC
AC--ATGACATTTGAAATTCTT---TTTGTTACACTTCTATATCTATATTTTCTATCCTATATATTTCTTATATT-TCTA
ATTCTATATTTC--TTTCGAATTAGATTGATTGGTTATAACTAATCAG-AC-ATTCCTCAG-CTTT-CA-TTC-GTAAAA
GGGG-AAGTTAAGAAAAAGAAA---------GAATCGACCCTTCAA-GTATCCCAAATTGCATGGGAAAAA-TGGTAGGA
AG-AGAAAGATATAT-ATGGGG-TATATATCAATCTATATTGAATTGCCGATACAGAAATGATAAAATCCAATTGGATTG
G-ATCAAATCCGGGTTTCTATAGGTAAGAAAGAAAAT-ACTTTTTTCGAGATAGTAAT------CGCTATCTAATAAATT
CAAGGGTTCCAGTATAAATGAAAGAA---------GTAATCCTGAGCCAAATCCC-GTTTTCTCAAAACAAAGGTAAAGG
TTCAAAAAACGAAAAAAAAAA-----GGATAGGTGCAGAGACTCAATGGAAGCTGTTCTAACAAATGGAGTTGACTGCCC
TGGTAGAGGCATCTTTGCATGGAAACTTCCGAAAGGA---TGAAGGATAAATGTATCTATTCAATACTATATCATT---A
ATGATTAATGATGGTCCGAATCTGTATTTGTATTTTTTAATATGAAAAATGGAAGAATTGGTGTGAATTGATTCCACATT
GAAGAAAAAGTCGAATATTCAGTCAGCAAATGATTCACTCCATA-GTCCAATAGATCTTTTAAAGAACTGATTAATCGGA
CGAGAATAAAGATAGAGTCCCATTCTACATGTCAATACCGGCAACAATGAAATTTATAGTAAGAGGAAAATCCGTCGACT
TTAAAAATCGTG-------CCCCTTTTTTCATTAGCGATTCCAAATTCCTTTATCTTTCTGATTCTTTGACAAA------
CGT-ATTTGGGCGTAAA----TGCCTTTCTCTTATCACATGTGATATAGAATACACATCCAAATTATGC-AAGGAATCCC
TTTTTGAATGATTCACAA-TCAAT-------------------------------------------------TT-GAAG
ATCCACGAAATTAGAGGACTTGGAGAAAACTTTGTAATTCCCCTTG-TCCCTTTAATTGACATAGACCCCAGTCATCTAA
TAAAATGAGGACGGGATACTACATTGGGAATGGTCGGGATA-----GGGGGGCCCC-CG
>Species3
AAAAAA-TAGAATCTATAATTCCCAATA-AATAT-TG-GAT-ATTA-----------------T-AG-AG-CATAACG-A
TTT-AT-ATAGCG-ATATAGAATTCCG-ATTTCTTTATC-AC-AATT-CTAAATTCGAATAC-AATTCGAATATTATTCT
ATTC-GATAGTCAA-----TCTTAAATATTTTT-AGATAGTCAA-TTTTTTTT--CTTTTT-------GATTTTGAATTC
AC--ATGACATTTGAAATTCTT---TTTGTTACACTTCTATAT------TTTCTATCCTATATATTT---ATATT-TCTA
ATTCTATATTTC--TTTGGAATTCGATTGATTGGTTATAACTAATCAG-AC-ATTCCTCGG-CTTT-CA-TTC-GTAAAA
GGGG-AAGTTAAGAAAAAAAAAAA-------GAATCGACCCTTCAA-GTACCCCCAATTGCATGGGAAAAA-TGGTAGGA
AG-AGAAAGATATATTATGGAG-TATATATCAATCTATATTGAATTACCGATACAGAAATGATAAAATTCAATTTGATTG
G-ATCAAATATGGGTTTCTATAAGTAAGAAAA----TGACTTTTTTCGAGATAGTAAT------CGCTATCTAATAAATT
CAAGGGTTCCGGTATAAATGAAAGAA---------GTAATCCTGAGCCAAATCCC-GTTTTCTCAAAACAAAGGTAAAGG
TTCAAAAAACGAAAAAA---------GGATAGGTGCAGAGACTCAATGGAAGCTGTTCTAACAAATGGAGTTGACTGCGT
TGGTAGAGGAATCTTTGCATGGAAACTTCCGAAAGGA---TAAAGG-------TATCTATTGAATACTATATCA-----A
ATGATTAATGATGGTCCGAATCTGTATTTGTATTTTTTAATATGAAAAATGGAAGAATTGGTGTTAATTGATTCCACATT
GAAGAAAAAATCGAA-----------CAAATGATTCACTCCATA-GTCCGATAGATCTTTTAAAGAACTGATTAATCGGA
CGAGAATAAAGAGAGAGTCCCATTCTACATGTCAATACCGGCAACAATGAAATTTATAGTAAGAGGAAAATCCGTCGACT
TTAAAAATCGTG-------CCCCTTTTTTCATTAGCGGTTCCAAATTTCTTTATCTTTCTGATTCTTTGACAAACA----
--T-ATTTGGGCATAAA----TGACTTTCTCTTATCACATGTGATATAGAATACACATCCAAATTACGC-AAGGAATCCC
TTTTTGAATGATTCACAA-TCAATAGCATTAC-TCATACT----------GAAAAACTTTGAAAGGCGCCTTTTT-GAAG
ATCCACGAAATTAGAGGACTTGGAGAAAACTTTGTAATTCCCCTTG-TCCCTTTAATTGACATAGACCCCAGTCATCTAA
TAAAATGAGGATGGGATGCTACATTTGGAATGGTCGGGATA-----CCCCCGCCCCCCG
>Species4
AAAAAA-TAGAATCTATAATTCCCAATA-AATAT-TG-GAT-ATTA-----------------T-AG-AG-CATAACG-A
TTT-AT-ATAGCG-ATATAGAATTCCG-ATTTCTTTATC-AC-AATT-CTAAATTTGAATAC-AACTTGAATATTATTCT
ATTC-GATAGTCAA-----TCTTAAATATTTTT-AGATAGTCCA-TTTTTTT---CTTTTT-------GATTTTGAATTC
AC--ATGACATTTGAAATTCTT---TTTGTTACACTTCTATAT------TTTCTATCCTATATATTT----------CTA
ATTCTATATTTC--TTTGGAATTCAATTGATTGGTTATAACTAATCAG-AC-ATTCCTCGG-TTTT-CA-TTC-GTAAAA
GGGG-AAGTTAAGAAAAAAAAAAAAAAAAAA---TCGACCCTTCAA-GTATCCCTAATTACATG-GAAAAA-TGGTATGA
AG-AGAAAGATATAT-ATGGGG-TATATATCAATCTATATTGAATTGCCGATACAGAAATGATAAAATCCAATTTGATTG
G-ATCAAATATGGGTTTCTATAAGTAAGAAAA----T-ATTTTTTTCGAGATAGTAAT------CGCTATCTAATAAATT
CAAGGGTTCCGGTATAAATGAAAGAA---------GTAATCCTGAGCCAAATCCCCGTTTTCTCAAAACAAAGGTAAAGG
TTCAAAAAACGAAAAAAAA-------GGATAGGTGCAGAGACTCAATGGAAGCTGTTCTAACAAATGGAGTTGACTGCGT
TGGTAGAGGAATCTTTGCATGGAAACTTCCGAAAGGA---TGAAGGATAAACGTATCTATTGAATACTATATCA-----A
ATGATTAATGATGGTCCGAATCTCTATTTGTATTTTTTAATATGAAAAATGGAAGAATTGGTGTGAATTGATTCCACATT
GAAGAAAAAATCGAA-----------CAAATGATTCACTCCATA-GTCCGATAGATCTTTTAAAGAACTGATTAATCGGA
CGAGAATAAAGATAGAGTCCCATTCTACATGTCAATACCGGCAACAATGAAATTTATAGTAAGAGGAAAATCCGTCGACT
TTAAAAATCGTG-------CCCTTTTTTTCATTAGCGGTTCCAAATTCCTTTATCTTTCGGATTCTTTGACAAA------
CGT-ATTTGGGCGTAAA----TGACTTTCTCTTATCACATGTGATATAGAATACACATCCAAATTACGC-AAGGAATCCC
TTTTTGAATGATTCACAA-TCAATAGCATTAC-TCATACT----------GAAAAACTTGGAA-GGCGCCTTTTT-GAAG
ATCCACGAAATTAGAGGACTTGGAGAAAACTTTGTAATTCCCCTTG-TCCCTTTAATTGACATAGAGCCCAGTCATCTAA
TAAAATGAGGATGGGATGCTACATTGGGAATGGTCGGGATA-----CCCCCCCCCCGCC
>Species5
AAAAAA-TAGAATCTATAATTCCCAATA-AATAT-TG-GAT-ATTA-----------------T-AG-AA-CATAACG-A
TTT-AT-ATAGCG-ATATAGAATTCCG-ATTTCTTTATC-AC-AATT-ATAAATTCGAATAC-AATTCAAATATTATTCT
ATTC-GATAGTCAA-----TCTTAAATATTATT-AGATAGTCAA-TTTTTTTT--CTTTTT-------GATTTTGAATTC
AC--ATGACATTTGAAATTCTT---TTTGTTACACTTCTATAT------TTTCTATCCTATATATTT----------CTA
ATTCTATATTTC--TTTGGAATTCGATTGATTGGTTATAACTAATCAG-AC-ATTCCTCGG-CTTT-CA-TTC-GTAAAA
GGGG-AAGTTAAGAAAAAAAAAA--------GAATCGACCCTTCAA-GTATCCCCAATTACATG-GAAAAA-TGGTATGA
AG-AGAAAGATATAT-ATGGGG-TATATATCAATCTATATTGAATTGCCGATACAGAAATGATAAAATCCAATTTGATTG
G-ATCAAATATGGGTTTCTATAAGTAAGAAAA----T-ACTTTTTTCGAGATAGTAAT------CGCTATCTAATAAATT
CAAGGGTTCCGGTATAAATGAAAGAA---------GTAATCCTGAGCCAAATCCC-GTTTTCTCAAAACAAAGGTAAAGG
TTCAAAAAACGAAAAAAAAA------GGATAGGTGCAGAGACTCAATGGAAGCTGTTCTAACAAATGGAGTTGACTGCGT
TGGTAGAGGAATCTTTGCATGGAAACTTCCGAAAGGA---TGAAGGATAACCGTATCTATTGAATACTATATCA-----A
ATGATTAATGATGGTCCGAATCTGTATTTGTATTTTTTAATATGAAAAATGGAAGAATTGGTGTGAATTGATTCCACATT
GAAGAAAAAATCGAA-----------CAAATGATTCACTCCATA-GTCCGATAGATCTTTTAAAGAACTGATTAATCGGA
CGAGAATAAAGATAGAGTCCCATTCTACATGTCAATACCGGCAACAATGAAATTTATAGTAAGAGGAAAATCCGTCGACT
TTAAAAATCGTG-------CCCCTTTTTTCATTAGCGGTTCCAAATTCCTTTATCTTTCGGATTCTTTGACAAA------
CGT-ATTTGGGCGTAAA----TGACTTTCTCTTATCACATGTGATATAGAATACACATCCAAATTACGC-AAGGAATCCC
TTTTTGAATGATTCACAA-TCAATAGCATTAC-TCATACT----------GAAAAACTTGGAA-GGCGCCTTTTT-GAAG
ATCCTCGAAATTAGAGGACTTGGAGAAAACTTTGTAATTCCCCTTG-TCCCTTTAATTGACATAGACCCCAGTCATCTAA
TAAAATGAGGGTGGGATGCTACATTGGGAATGGTCGGGATA-----CCCCCCCCCCGCC
>Species6
AAAAAA-TAGAATCTATAATTCCAAATA-AATAT-TG-GAT-ATTA-----------------T-AG-AG-CATAACG-A
TTT-AT-ATAGCG-ATATAGAATTCCG-ATTTCTTTATC-AC-AATT-CTAAATTCGAATAC-AATTCGAATATTATTCT
ATTC-GATAATCAA-----TCTTAAATATTTTT-AGATAGTCAA-TTTTTTTT--CTTTTT-------GATTTTGAATTC
AC--ATGACATTTGAAATTCTT---TTTGTTACACTTCTATAT------TTTCTATCCTATATATTT----------CTA
ATTCTATATTTC--TTTGGAATTCGATTGATTGGTTATAACTAATCAG-AC-ATTCCTCGG-CTTT-CA-TTC-GTAAAA
GGGG-AAGTTAAGAAAAAAAAAAA-------TAATCGACCCTTCAA-CTATACCAAATTACATG-GAAAAA-TGGTATGA
AG-AGAAAGATATAT-ATGGGG-TATATATCAATCTATATTGAATTGCCGATACAGAAATGATAAAATCCAATTTGATTG
G-ATCAAATATGGGTTTCTATAAGTAAGAAAA----T-ACTTTTTTCGAGATAGTAAT------CGCTATCTAATAAATT
CAAGGGTTCCGGTATAAATGAAAGAA---------GTAATCCTGAGCCAAATCCC-GTTTTCTCAAAACAAAGGTAAAGG
TTCAAAAAACGAAAAAAAAA------GGATAGGTGCAGAGACTCAATGGAAGCTGTTCTAACAAATGGAGTTCACTGCGT
TGGTAGAGTAATCTTTGCATGGAAACTTCCGAAAGGA---TGAAGGATAAACGTATCTATTGAATACTATACTATATCAA
ATGATTATTGATGGTCCGAATCTGTATTTGTATTTTTTACTATGAAAAATGGAAGAATTGGTGTGAATTGATTCCACATT
GAAGAAAAAATCGAA-----------CAAATGATTCACTCCATA-GTCCGATAGATCTTTTAAAGAACTGATTAATCGGA
CGAGAATAAAGATAGAGTCCCATTCTACATGTCAATACCGGCAACAATGAAATTTATAGTAAGAGGAAAATCCGTCGACT
TTAAAAATCGTG-------CCACTTTTT-CATTAGCGGTTCCAAATTCCTTTATCTTTCTGATTCTTTGACAAAGACAAA
CGT-ATTTGGGCGTAAA----TGACTATATCTTATCACATGTGATATAGAATACACATCCAAATTACGC-AAGGAATCCC
TTTTTGAATGATTCACAA-TCAATAGCATTAC-TCATACT----------GAAAAACTTGGAAAGGCGCCTTTTT-GAAG
ATCCACGAAATTAGAGGACTTGGAGAAAACTTTGTAATTCCCCTTG-TCCTTTTAATTGACATAGACCCCAGTCATCTAA
TAAAATGAGGATGGGATGCTACATTGGGAATGGTCGGGATA-----CCCCCCGCGGCCC
>Species7
AAAAAA-TAGAATCTATAATTCCCAATA-AATAT-TG-GAT-ATTA-----------------T-AG-AA-CATAACG-A
TTT-AT-ATAGCG-ATATAGAATTCCG-ATTTCTTTATC-AC-AATT-CTAAATTCGAATAC-AATTCGAATATTATTCT
ATTC-GATAGTCAA-----TCTTAAATATTTTT-AGATAGTCAA-TTTTTTTT--CTTTTT-------GATTTTGAATTC
AC--ATGACATTTGAAATTCTT---TTTGTTACACTTCTATAC------TTTCTATCCTATATATTT----------CTA
ATTCTATATTTC--TTTTGAATTCGATTGATTGGTTATAACTAATCAG-AC-ATTCCTCGG-CTTT-CA-TTC-GTAAAA
GGGG-AAGTTAAGAAAAAAAAA---------GAATCGACCCTTCAA-GTATCCCCAATTATATGATAAAAA-TTGTATGA
AG-AGAAAGATATAT-ATGGGG-TATATATCAATCTATATTGAATTGCCGATACAGAAATGATAAAATCCAATTTGATTG
G-ATCAAATATGGGTTTCTATAAGTAATAAAA----T-ACTTTTTTCGAGATAGTAAT------CGCTATCTAATAAATT
CAAGGGTTCCGGTATAAATGAAAGAA---------GTAATCCTGAGCCAAATCCC-GTTTTCTCAAAACAAAGGTAAAGG
TTCAAAAAACGAAAAAAAAAA-----GGATAGGTGCAGAGACTCAATGGAAGCTGTTCTAACAAATGGAGTTGACTGCGT
TGGTAGAGGAATCTTTGCATGGAAACTTCCGAAAGGA---TGAAGGATAAACGTATCTATTGAATACTATATCA-----A
ATGATTAATGATGGTCCGAATCTGTATTTGTATTTTTTAATATGAAAAATGGAAGAATTGGTGTGAATTGATTCCACATT
GAAGAAAAAATCGAA-----------CAAATGATTCACTCCATA-GTCCGATAGATCTTTTAAAGAACTGATTAATCGGA
CGAGAATAAAGATAGAGTCCCATTCTACATGTCAATACCGGCAACAATGAAATTTATAGTAAGAGGAAAATCCGTCGACT
TTAAAAATC----------CCCCTTTTT-CATTAGCGGTTCCAAATTCCTTTATCTTTCTGATTCTTTGACAAAGACAAA
CGT-ATTTGGGCGTAAA----TGACTTTCTCTTATCACATGTGA-ATAGAATACACATCCAAATTACGC-AAGGAATCCC
TTTTTGAGTGATTCACAA-TCAATAGCATTAC-TCATACT----------GAAAAACTTGGAAAGGCGCCTTTTT-GAAG
ATCCACGAGATTAGAGGACTTGGAGAAAACTTTGTAATTCCCCTTG-TCCCTTTAATTGACATAGACCCCAGTCATCTAA
TAAAATGAGCATGGGATGCTACATTGGGAATGGTCGGGATA-----CCCCCCCGGGCCG
>Species8
AAAAA--TAGAATCTATAATTCCCAATA-AATAT-TG-GAT-ATTA-----------------T-AG-AG-CATAACG-A
TTT-AT-ATAGCG-ATATAGAATTCCG-ATTTCTTTATC-AC-AATT-CTAAATTCGAATAC-AATTCGAATATTATTCT
ATTC-GATAGTCAA-----TCTTAAATATTTTT-AGATAGTCAA-TTTTTTTTT-CTTTTT-------GATTTTGAATTC
AC--ATGACATTTGAAATTCTT---TTTGTTACACTTCTATAT------TTTCTATCCTATATATTT----------CTA
ATTCTATATTTC--TTTGGAATTCGATTGATTGGTTATAACTAATCAG-AC-ATTCCTCGG-CTTT-CA-TTC-GTAAAA
GGGG-AAGTTAAGAAAAAAAAAA--------GAATCGACCCTTCAA-GTATCCCCAATTACATG-GAAAA--KGGTATGA
AR-AGAAAGATATAT-ATGGGG-TATATATCAATCGATATTGAATTGCCGATACAGAAATGATAAAATCCAATTTGATKG
G-ATCAAATATGGGTTTCTATAAGTAAGAAAA----T-ACTTTTTTCGAGATAGTAAT------CGCTATCTAATAAATT
CAAGGGTTCCGGTATAAATGAAAGAA---------GTAATCCTGAGCCAAATCCC-GTTTTCTCAAAACAAAGGTAAAGG
TTCAAAAAACGAAAAAAAAA------GGATAGGTGCAGAGACTCAATGGAAGCTGTTCTAACAAATGGAGTTGACTGCAT
TGGTAGAGGAATCTTTGCATGGAAACTTCCGAAAGGA---TGAAGGATAAACGTATCTATTAAATACTATATCA-----A
ATGATTAATGATGGTCCGAATCTGTATTTGTATTTTTTAATATGCAAAATGGAAGAATTAGTGTGAATTGATTCCACATT
GAAGAAAAAATTGAA-----------CAAATGATTCACTCCATA-GTCCGATAGATCTTTTAAAGAACTGATTAATCGGA
CGAGAATAAAGATAGAGTCCCATTCTACATGTCAATACCGGCAACAATGCAATTTTTAGTAAGAGGAAAATCCGTCGACT
TTAAAAATCGTG-------CCCCTTTTTTCATTAGCGGTTCCAAATTCCTTTATCTTTCGGATTCTTTGACAAAC-----
-GT-ATTTGGGCGTAAA----TGACTTTCTCTTATCACATGTGATATAGAATACACATCCAAATTACGC-AAGGAATCCC
TTTTTGAATGATTCAAAA-TCAATAGCATTAC-TCAGACT----------GAAAAACTTTGAA-GGCGCCTTTTT-GAAG
ATCCACGAAATTAGAGGACTTGGAGAAAACTTTGTAATTCCCCTTG-TCCCTTTAATTGACATAGACCCCCGTAATCTAA
TAAAATTAGGATGGGCTGCTACATTGGGAATGGTCGGGATA-----CCCCCCCCCCGGC
>Species9
AAAAAA-TAGAATCTATAATTCCCAATA-AATAT-TG-GAT-ATTA-----------------T-AG-AA-CATAACG-A
TTT-AT-ATAGCG-ATATAGAATTCCG-ATTTCTTTATC-AC-AATT-CTAAATTCGAATAC-AATTCGAATATTATTCT
ATTC-GATAGTCAA-----TCTTAAATATTTTT-AGATAGTCAA-TTTTTTT---CTTTTT-------GATTTTGAATTC
AC--ATGACATTTGAAATTCTT---TTTGTTACACTTCTATAC------TTTCTATCCTATATATTT----------CTA
ATTCTATATTTC--TTTTGAATTCGATTGATTGGTTATAACTAATCAG-AC-ATTCCTCGG-CTTT-CA-TTC-GTAAAA
GGGG-AAGTTAAGAAAAAAAAA---------GAATCGACCCTTCAA-GTATCCCCAATTATATGATAAAAA-TTGTATGA
AG-AGAAAGATATAT-ATGGGG-TATATATCAATCTATATTGAATTGCCGATACAGAAATGATAAAATCCAATTTGATTG
G-ATCAAATATGGGTTTCTATAAGTAATAAAA----T-ACTTTTTTCGAGATAGTAAT------CGCTATCTAATAAATT
CAAGGGTTCCGGTATAAATGAAAGAA---------GTAATCCTGAGCCAAATCCC-GTTTTCTCAAAACAAAGGTAAAGG
TTCAAAAAACGAAAAAAA-------GGGATAGGTGCAGAGACTCAATGGAAGCTGTTCTAACAAATGGAGTTGACTGCGT
TGGTAGAGGAATCTTTGCATGGAAACTTCCGAAAGGA---TGAAGGATAAACGTATCTATTGAATACTATATCA-----A
ATGATTAATGATAGTCCGAATCTGTATTTGTATTTTTTAATATGAAAAATGGAAGAATTGGTGTGAATTGATTCCACATT
GAAGAAAAAATCGAA-----------CAAATGATTCACTCCATA-GTCCGATAGATCTTTTAAAGAACTGATTAATCGGA
CGAGAATAAAGATAGAGTCCCATTCTACATGTCAATACCGGCAACAATGAAATTTATAGTAAGAGGAAAATCCGTCGACT
TTAAAAATCGTG-------CCCCTTTTT-CATTAGCGGTTCCAAATTCCTTTATCTTTCTGATTCTTTGACAAAGACAAA
CGT-ATTTGGGCGTAAA----TGACTTTCTCTTATCACATGTGA-ATAGAATACACATCCAAATTACGC-AAGGAATCCC
TTTTTGAGTGATTCACAA-TCAATAGCATTAC-TCATACT----------GAAAAACTTGGAAAGGCGCCTTTTT-GAAG
ATCCACGAGATTAGAGGACTTGGAGAAAACTTTGTAATTCCCCTTG-TCCCTTTAATTGACATAGACCCCAGTCATCTAA
TAAAATGAGCATGGGATGCTACATTGGGAATGGTCGGGATA-----CCCCCCCGGGCCG
>Species10
AAAAAA-TAGAATCTATAATTCCCAATA-AATAT-TG-GAT-ATTA-----------------T-AG-AA-CATAACG-A
TTT-AT-ATAGCG-ATATAGAATTCCG-ATTTCTTTATC-AC-AATT-CTAAATTCGAATAC-AATTCGAATATTATTCT
ATTC-GATAGTCAA-----TCTTAAATATTTTT-AGATAGTCAA-TT---------------------------------
-------------GAAATTCTT---TTTGTTACACTTCTATAT------TTTCTATCCTATATATTT----------CTA
ATTCTATATTTC--TTTGGAATTCGATTGATTGGTTATAACTAATCAG-ACAATTCCTCGG-CTTT-CA-TTC-GTAAAA
GGGG-AAGTTAAGAAAAAAAAA------------TCGACCCTTCAA-GTATCCCCAATTACATGAGAAAAA-TTGTATGA
AG-AGAAAGATATAT-ATGGGGGTATATATCAATCTATATTGAATTGCCGATACAGAAATGATAAAATCCA-TTTGAATG
GGATCAAATATGGGGTTCTATAAGTAATAAAA----T-CCTTTTTTCGAG------------------------------
-----------------------------------GTAATCCTGAGCCAAATCCC-GTTTTTTCAAAACAAAGGTAAAGG
TTCAAAAAACAAAAAAAA--------GGATAGGTGCAGAGACTCAATGGAAGCTGTTCTAACAAATGGAGTTGACTGCGT
TGGTAGAGGAATCCTTGCATGGAAACTTCCGAAAGGA---TGAAGGATAAACGTATCTATTGAATACTATATCA-----A
ATGATTAATGATGGTCCGAATCTGTATTTGTATTTTTTAATATGAAAAATGGAAGAATTGGTGTGAATTGATTCCACATT
GAAGAAAAAATCGAA-----------CAAATGATTCACTCCATA-GTCCGATAGATCTTTTAAAGAACTGATTAATCGGA
CGAGAATAAAGATAGAGTCCCATTCTACATGTCAATACCGGCAACAATGAAATTTATAGTAAGAGGAAAATCCGTCGACT
TTAAAAATCGTG-------CCACTTTTT-CATTAGCGGTTCCAAATTCCTTTATCTTTCTGATTCTTTGACAAAGACAAA
CGT-ATTTGGGCGTAAA----TGACTTTATCTTATCACATGTGATATAGAATACACATCCAAATTACGC-AAGGAATCCC
TTTTTGAATGATTCACAA-TCAATAGCATTAC-TCATACT----------GAAAAACTTGGAAAGGCGCCTTTTT-GAAG
ATCCACGAAATTAGAGGACTTGGAGAAAACTTTGTAATTCCCCTTG-TCCTTTTAATTGACATAGACCCCAGTCATCTAA
TAAAATGAGGATGGGATGCTACATTGGGAATGGTCGGGATA-----CCC-CCCCGGCCG




###
###R code

require(picante)
require(phangorn)

DNA <- read.dna("DNA.fasta", format="fasta")

DNA2 <- as.phyDat(DNA) #for optim.parsimony function

DIST_DNA <- dist.dna(DNA, model="TN93") #Nei's genetic distance
tree_DNA <- nj(DIST_DNA)

x <- as.vector(DIST_DNA); y <- as.vector(as.dist(cophenetic(tree_DNA)))
plot(x, y, xlab="pairwise genetic distances", ylab="pairwise distances on NJ tree", pch=20, cex=2); abline(lm(y~x), col="red", lwd=2); text(0.04, 0.11, round(cor(x,y)^2, 3)) #visualise how well raw distances are represented in the tree

Visual comparison of genetic distances versus distance on tree


parsim_DNA <- optim.parsimony(tree_DNA, DNA2, trace=3) #optimise the nj tree with MP

#optional
#parsim_DNA_root <- root(parsim_DNA, outgroup=1) #root tree if you like

#plot(parsim_DNA_root)


Tree to be tested


#BOOT1 <- boot.phylo(parsim_DNA_root, DNA, FUN = function(xx) nj(dist.dna(xx)), trees=TRUE, B=1000) #1000 bootstrap reps

#CONSENSUS <- consensus(BOOT1$trees, p = 0.5, check.labels = FALSE)


##############
#FUNCTION seqPerm:
#non-parametric permutation test for shorter tree length with MP optimisation versus character states randomly permuted among taxa
#

#the tree being testing is unrooted
#the DNA data should be in DNAbin format and the tree to be tested in phylo format

sample_func <- function(x
{
sample(x, length(x))
}


PFUNCTION <- function(x, y, z)
{
   if(x <= y) z <- z+1
   return(z)
}


seqPerm <- function(x, y, nperm) #x DNA data matrix, y tree to test, nperm required number of permutations
{

if(class(x)!="DNAbin")
{return(print("DNA sequence data must be in DNAbin format"))}

if(class(y)!="phylo")
{return(print("Tree to be tested must be class 'phylo'"))}

nGT <- 1

data2 <- as.phyDat(x)

PARS1 <- parsimony(y, data2)

for(i in 1:nperm)
{
DNA_perm <- as.matrix(apply(x, 2, sample_func))
row.names(DNA_perm) <- row.names(x
class(DNA_perm) <- "DNAbin"
DNA_perm2 <- as.phyDat(DNA_perm)

perm_tree <- nj(dist.dna(DNA_perm, model="TN93")) 

delete <- optim.parsimony(perm_tree, DNA_perm2, trace=0)


PARS2 <- parsimony(delete, DNA_perm2)

nGT <- PFUNCTION(PARS2, PARS1, nGT)
}


P <- nGT/(nperm+1)

cat('Prob (',nperm,'permutations) =',P,'\n','\n')
return(P.perm=P )
}



seqPerm(DNA, parsim_DNA, 1000)
#computation can be a little slow with many permutations


Example of a tree based on permuted data