BAMBE

Bayesian Analysis in Molecular Biology and Evolution

Version 2.03 beta, January 2001

Summarizing output files

The program summarize acts on a tree topology file generated by BAMBE. It counts the appearances of each tree topology, automatically identifies clades, displays how often each tree topology appears, and shows transitions between subtree topologies within clades.

We define a named clade with these criteria:

A named clade must have at least two members.
The members of a named clade must appear as a monophyletic group in at least a proportion threshold of all sampled tree topologies. The threshold must be greater than one half.
There may not be more than max_top different subtree topologies among all sampled trees where the named clade is monophyletic.
A named clade cannot be a proper subset of another named clade.

These rules for naming clades give the user some flexibility and make reading summaries of the posterior for large trees substantially easier.
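
As an illustration only, the first three criteria can be checked for one candidate set of taxa with the following Python sketch. It is not the code used by summarize. The nested-tuple tree representation, the function names, and the treatment of trees as rooted are assumptions made for the example, and the fourth criterion (no named clade inside another) is not checked.

```python
from collections import Counter


def leaves(tree):
    """Return the frozenset of taxa below a node (an int leaf or a tuple of children)."""
    if isinstance(tree, int):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def canonical(tree):
    """Order the children of every node so that equal topologies compare equal."""
    if isinstance(tree, int):
        return tree
    children = (canonical(child) for child in tree)
    return tuple(sorted(children, key=lambda t: min(leaves(t))))


def find_subtree(tree, members):
    """Return the subtree whose leaf set is exactly `members`, or None."""
    if leaves(tree) == members:
        return tree
    if isinstance(tree, int):
        return None
    for child in tree:
        found = find_subtree(child, members)
        if found is not None:
            return found
    return None


def qualifies(trees, members, threshold=0.9, max_top=10):
    """Check criteria 1 to 3 for one candidate clade (hypothetical helper)."""
    members = frozenset(members)
    if len(members) < 2:                      # criterion 1: at least two members
        return False
    subtrees = Counter()
    for tree in trees:
        sub = find_subtree(tree, members)
        if sub is not None:                   # monophyletic in this sampled tree
            subtrees[canonical(sub)] += 1
    share = sum(subtrees.values()) / len(trees)
    # criterion 2: monophyletic often enough; criterion 3: few subtree topologies
    return share >= threshold and len(subtrees) <= max_top
```

For example, with the three sampled trees ((1,2),3), ((2,1),3), and ((1,3),2), the set {1,2} is monophyletic in two thirds of the sample, so it passes with a threshold of 0.6 and fails with the default of 0.9.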

The options for summarize are:

Summarize Options
Option Description
-n Number of lines to skip from each input file. 0 is the default.
-p Threshold for named clade definition (greater than one half). 0.9 is the default.
-c Maximum number of subtree topologies within a named clade (not more than 10). 10 is the default.

The most general way to call summarize is:

% summarize [-n skip] [-p threshold] [-c max_tops] <file 1> <file 2> ...

The square brackets indicate optional arguments. If the symbol `-' is used in place of a file name, the program reads that input from standard input.

Examples

% summarize -n 200 -p .8 -c 8 run1.top run2.top run3.top > runs.sum

will ignore the first 200 lines of each input file and summarize the concatenation of the remaining lines, with named clades requiring a threshold of 80% and no more than eight subtree topologies observed in the combined sample.

% head -20000 run1.top | tail -10000 | summarize - > run1.sum

will run summarize on lines 10,001 through 20,000 of run1.top.
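
The line selection in the two examples above can be mimicked in Python as a rough sketch. The function names are hypothetical and the code only illustrates which lines reach the summary. It does not reproduce anything else that summarize does.

```python
from itertools import chain, islice


def remaining_lines(files, skip=0):
    """Drop the first `skip` lines of each input, then concatenate what is left.

    `files` is any sequence of line iterables, for example open file objects.
    This mirrors the effect described for the -n option.
    """
    return chain.from_iterable(islice(f, skip, None) for f in files)


def line_window(lines, first, last):
    """Yield lines numbered `first` through `last` (1-based, inclusive).

    line_window(f, 10001, 20000) selects the same lines as
    `head -20000 | tail -10000` when the input has at least 20,000 lines.
    """
    return islice(lines, first - 1, last)
```

Both helpers are lazy, so a large topology file is never held in memory all at once.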

The output of summarize contains several components. We describe them using an example summary of a sample of 200,000 tree topologies describing the evolutionary relationships among fourteen taxa.

The first section of the summary output shows the classification of taxa into named clades and how often each named clade and subtree topology appears. Taxa that do not appear in a named clade are listed separately.

******************** Named clades ********************

200000    A  {1,2}
  200000  A1 (1,2)

200000    B  {4,5}
  200000  B1 (4,5)

200000    C  {6,7}
  200000  C1 (6,7)

170693    D  {8,9,10,11,12}
   99386  D1 (((8,9),10),(11,12))
   55889  D2 (((8,9),11),(10,12))
    5303  D3 (((8,9),(10,12)),11)
    3752  D4 ((8,9),(10,(11,12)))
    3236  D5 ((((8,9),10),12),11)
    1184  D6 ((8,9),((10,12),11))
    1019  D7 (((8,9),(11,12)),10)
     360  D8 ((((8,9),10),11),12)
     329  D9 ((((8,9),11),12),10)
     235  D10 ((((8,9),11),10),12)

              3
             13
             14

The next section of summary output gives a complete list of the observed tree topologies, sorted by decreasing count. The actual file contained 174 different tree topologies, of which we show the first 10 and the last 3. The first column is the raw count. The second column is the posterior probability of the tree topology. The third column is the cumulative posterior probability. Notice that the first ten tree topologies account for nearly 90% of the posterior probability. Because the trees are written with named-clade labels such as D1, you must refer back to the named clades for a complete description. Notice that most of the uncertainty in the top ten trees involves the taxa in named clade D.

******************** Tree topologies ********************

Count  Prob.  Cum.  Tree topology
93239  0.466  0.466 (A1,(3,(B1,((C1,(D1,13)),14))))
52018  0.260  0.726 (A1,(3,(B1,((C1,(D2,13)),14))))
 6268  0.031  0.758 (A1,(3,(B1,((C1,((((8,9),10),13),(11,12))),14))))
 5981  0.030  0.788 (A1,(3,(B1,((C1,(((8,9),10),((11,12),13))),14))))
 5004  0.025  0.813 (A1,(3,(B1,((C1,(D3,13)),14))))
 4084  0.020  0.833 (A1,(3,((B1,14),(C1,(D1,13)))))
 3546  0.018  0.851 (A1,(3,(B1,((C1,(D4,13)),14))))
 3098  0.015  0.866 (A1,(3,(B1,((C1,(D5,13)),14))))
 2726  0.014  0.880 (A1,(3,(B1,((C1,((((8,9),11),13),(10,12))),14))))
 2715  0.014  0.893 (A1,(3,((B1,14),(C1,(D2,13)))))

                          .
                          .
                          .

    1  0.000  1.000 (A1,(3,((B1,14),(C1,((8,9),((10,(11,12)),13))))))
    1  0.000  1.000 ((A1,(B1,((C1,(((8,9),10),((11,12),13))),14))),3)
    1  0.000  1.000 (A1,(3,(B1,((C1,(((((8,9),13),10),11),12)),14))))
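
The Count, Prob., and Cum. columns follow directly from the sampled topologies. A minimal sketch, assuming each sampled tree is available as a string and using a hypothetical function name:

```python
from collections import Counter
from itertools import accumulate


def topology_table(topologies):
    """Return rows (count, prob, cumulative prob, topology), most frequent first."""
    counts = Counter(topologies)
    total = sum(counts.values())
    ranked = counts.most_common()             # sorted by decreasing count
    # Accumulate integer counts, then divide, to avoid floating-point drift.
    running = accumulate(count for _, count in ranked)
    return [
        (count, count / total, cum / total, topology)
        for (topology, count), cum in zip(ranked, running)
    ]
```

For the first row above, 93239/200000 rounds to 0.466, and adding the second row gives 145257/200000, which rounds to 0.726.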

The next section of summary output is similar to the bootstrap proportions given by other methods. For every clade in the most probable tree topology, named or not, the posterior probability is given. Taxa 8, 9, 10, 11, 12, and 13 appear together 99.1% of the time. The program did not name them as a clade because the number of distinct subtree topologies among the sampled trees exceeded ten.

***** Posterior probabilities of clades in most probable tree topology *****

     Count  Prob. Tree topology
    200000  1.000 {1,2,3,4,5,6,7,8,9,10,11,12,13,14}
    200000  1.000 {1,2}
    198620  0.993 {3,4,5,6,7,8,9,10,11,12,13,14}
    200000  1.000 {4,5,6,7,8,9,10,11,12,13,14}
    200000  1.000 {4,5}
    189343  0.947 {6,7,8,9,10,11,12,13,14}
    200000  1.000 {6,7,8,9,10,11,12,13}
    200000  1.000 {6,7}
    198142  0.991 {8,9,10,11,12,13}
    170693  0.853 {8,9,10,11,12}
    115978  0.580 {8,9,10}
    200000  1.000 {8,9}
    121991  0.610 {11,12}
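
The proportions in this table can be thought of as follows: list the clades of the most probable tree, then count the sampled trees in which each one is monophyletic. The sketch below assumes rooted trees stored as nested tuples of integer taxa; the representation and the function names are assumptions for illustration, not part of BAMBE.

```python
def clades(tree):
    """Return (taxa below this node, set of all clades with two or more taxa)."""
    if isinstance(tree, int):
        return frozenset([tree]), set()
    below = frozenset()
    found = set()
    for child in tree:
        child_taxa, child_clades = clades(child)
        below |= child_taxa
        found |= child_clades
    found.add(below)                          # the clade rooted at this node
    return below, found


def clade_support(trees, reference):
    """Map each clade of `reference` to the share of `trees` that contain it."""
    _, wanted = clades(reference)
    per_tree = [clades(tree)[1] for tree in trees]
    return {
        clade: sum(clade in present for present in per_tree) / len(trees)
        for clade in wanted
    }
```

In the table above, for instance, {8,9,10,11,12,13} appears in 198142 of 200000 sampled trees, which gives 0.991.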

For each named clade we summarize the transitions between subtree topologies. These tables can be useful for examining mixing efficiency. Ideally, transitions would occur as frequently as one would expect from independent samples from the posterior, but in practice this is almost never achieved. It is important that there be a sufficient number of transitions between the various likely subtree topologies. It is interesting in this example that there are never any direct transitions between subtrees D1 and D2. Presumably, a different sampler that allowed such transitions directly would mix considerably better.

******************** Clade transition matrices ********************

     |   D1    D2    D3    D4    D5    D6    D7    D8    D9    D10    -  
-----+-----------------------------------------------------------------
 D1  |98485     0     1   151    97     0    50    10     0     0   591 
 D2  |    0 55411   191     0     0    57     0     0    17     9   204 
 D3  |    0   202  4994     0    47    34     0     0     0     0    26 
 D4  |  158     0     1  3514     0    10    36     0     0     0    33 
 D5  |   91     0    49     0  3068     0     0    15     0     0    13 
 D6  |    0    54    40    10     0  1060     0     0     0     0    20 
 D7  |   49     0     0    38     0     0   929     0     3     0     0 
 D8  |    9     0     0     0    15     0     0   329     0     5     2 
 D9  |    0    12     0     1     0     0     4     0   305     7     0 
 D10 |    0    13     0     0     0     0     0     4     4   214     0 
  -  |  593   197    27    38     9    23     0     2     0     0 28418
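
A table like this one amounts to counting pairs of consecutive samples. The sketch below assumes the sample has already been reduced to one label per sampled tree, with "-" standing for trees in which the named clade is not monophyletic. That reading of "-" is our assumption, although it agrees with the numbers: the "-" row sums to 29307, which is 200000 minus 170693. The function name is hypothetical.

```python
from collections import Counter


def transition_counts(states):
    """Count (from, to) pairs over consecutive entries of `states`.

    `states` is a list of labels such as "D1", "D2", or "-", one per sampled
    tree, in sampling order. Pairs that never occur have a count of zero.
    """
    return Counter(zip(states, states[1:]))
```

A sample of N trees yields N - 1 transitions, and large diagonal counts relative to off-diagonal counts indicate that the sampler stays in the same subtree topology for long stretches.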

The last portion of the summary output shows "clade trees", in which the subtree topology differences within named clades are ignored. We see here that, ignoring uncertainty in the tree topology of named clade D, the best clade tree appears nearly 80% of the time. The remaining uncertainty is mostly in the location of taxon 13. Lumping taxon 13 with clade D, we see that there is very little uncertainty in the tree.

******************** Clade tree topologies ********************

Count  Prob.  Cum.  Tree topology
159821  0.799  0.799 (A,(3,(B,((C,(D,13)),14))))
 7429  0.037  0.836 (A,(3,((B,14),(C,(D,13)))))
 6268  0.031  0.868 (A,(3,(B,((C,((((8,9),10),13),(11,12))),14))))
 5981  0.030  0.897 (A,(3,(B,((C,(((8,9),10),((11,12),13))),14))))
 2726  0.014  0.911 (A,(3,(B,((C,((((8,9),11),13),(10,12))),14))))
 2185  0.011  0.922 (A,(3,(B,((C,((((8,9),13),11),(10,12))),14))))
 2030  0.010  0.932 (A,(3,(B,((C,((((8,9),13),10),(11,12))),14))))
 1677  0.008  0.941 (A,(3,((B,(C,(D,13))),14)))
 1524  0.008  0.948 (A,(3,(B,((C,(((8,9),13),((10,12),11))),14))))
 1406  0.007  0.955 (A,(3,(B,((C,(((8,9),13),(10,(11,12)))),14))))

                               .
                               .
                               .

    1  0.000  1.000 (A,(3,((B,14),(C,((8,9),((10,(11,12)),13))))))
    1  0.000  1.000 (A,(3,((B,14),(C,((8,9),(((10,12),11),13))))))
    1  0.000  1.000 (A,(3,((B,(C,((((8,9),13),11),(10,12)))),14)))
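
One way to picture how clade trees relate to the tree topology table is to drop the subtree number from each named-clade label and add up the counts of topologies that become identical. This is a sketch under the assumption that labels are a single capital letter followed by digits, as in this example; the function name is hypothetical and summarize itself may work differently.

```python
import re
from collections import Counter


def clade_tree_counts(topology_counts):
    """Merge counts of topologies that differ only inside named clades.

    `topology_counts` maps a topology string such as
    "(A1,(3,(B1,((C1,(D2,13)),14))))" to its raw count.
    """
    merged = Counter()
    for topology, count in topology_counts.items():
        # "D10" -> "D"; plain taxon numbers such as "13" are left untouched.
        merged[re.sub(r"\b([A-Z])\d+\b", r"\1", topology)] += count
    return merged
```

Applied to the five rows of the tree topology table that differ only in the subtree of clade D (D1 through D5 with counts 93239, 52018, 5004, 3546, and 3098), this gives 156905 for (A,(3,(B,((C,(D,13)),14)))). The rest of the 159821 reported above comes from topologies not shown in the excerpt.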

This page was most recently updated on January 19, 2001.

bambe@mathcs.duq.edu