Evoinformatics

Summary

Fitchi is a python script that produces haplotype genealogy graphs from alignment files in Nexus format, along with summary statistics.

Haplotype genealogy graphs

With population genetic sequence data, haplotype genealogy graphs are often a better way to depict variation among populations than standard phylogenetic trees. In a haplotype genealogy graph, nodes usually represent unique sequences (haplotypes), and their size reflects the number of records in the input alignment that share exactly this sequence. Popular programs such as TCS draw relationships among these sequences as haplotype networks that include reticulations. These can represent either ambiguous node connections or a true biological signal of conflicting tree topologies due to recombination within the data set. If recombination is considered unlikely within the data set (e.g. for mitochondrial data or short nuclear fragments), it may thus be helpful to visualize sequence data without reticulations, which is what Fitchi does. For more information on cases where networks with reticulation or genealogies without reticulation may be more appropriate, I refer to Salzburger et al. (2011) and Mardulyn (2012).
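The relation between node size and sequence records can be sketched in a few lines. This is only an illustration under simple assumptions, not Fitchi's own code: the function name haplotype_counts is made up for this example, sequences are compared by exact match after conversion to upper case, and ambiguity codes are not treated specially.

```python
from collections import Counter

def haplotype_counts(sequences):
    """Map each unique sequence (haplotype) to the number of records sharing it."""
    return Counter(seq.upper() for seq in sequences)

# Five records, three unique haplotypes.
records = ["ACGT", "ACGT", "ACGA", "acgt", "TCGA"]
counts = haplotype_counts(records)
for haplotype, n in counts.most_common():
    print(haplotype, n)
```

In a graph built from these records, the node for "ACGT" would be the largest one, as it represents three records.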

In Fitchi's haplotype genealogies, edges between nodes indicate their connections in a user-supplied bifurcating phylogenetic tree, and edge lengths correspond to the minimum number of mutations by which these sequences are separated. In Fitchi, ancestral sequences are reconstructed using the algorithm presented by Walter M. Fitch in his 1970 paper "Distinguishing homologous from analogous proteins", hence the name of this script. Details on the transformation of a bifurcating phylogenetic tree into a haplotype genealogy are given in Salzburger et al. (2011).
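To illustrate the idea behind the Fitch algorithm, here is a minimal sketch of its bottom-up pass for a single alignment site. It is not taken from Fitchi: the nested-tuple tree representation and the function name fitch_pass are assumptions made for this example, and the top-down pass that picks one state per internal node is left out.

```python
def fitch_pass(tree, states):
    """Bottom-up Fitch pass for one alignment site.

    tree   -- a leaf name (str) or a 2-tuple of subtrees (bifurcating)
    states -- dict mapping each leaf name to its nucleotide at this site
    Returns (state set at this node, minimum number of changes below it).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_pass(tree[0], states)
    right_set, right_cost = fitch_pass(tree[1], states)
    shared = left_set & right_set
    if shared:
        # The children agree on at least one state: no extra change needed.
        return shared, left_cost + right_cost
    # The children disagree: keep the union and count one change.
    return left_set | right_set, left_cost + right_cost + 1

tree = (("a", "b"), ("c", "d"))
site = {"a": "A", "b": "A", "c": "G", "d": "T"}
root_set, changes = fitch_pass(tree, site)
print(sorted(root_set), changes)
```

Summing the number of changes over all sites gives the kind of minimum mutation count that edge lengths in the graph are based on.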

The positioning of nodes in a haplotype genealogy graph can be tricky, especially with larger node numbers. In a perfectly laid-out graph, the edges connecting to a node would be evenly spaced, no edges would cross each other, and no two nodes would overlap. Fortunately, graph theoreticians have run into this problem before us and developed algorithms that don't always work perfectly, but usually do a good enough job. Fitchi employs one of the most popular of these algorithms, called "neato", and accesses it through the graphviz package.
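If you'd like to get a feeling for what neato does, the following sketch writes a small graph in graphviz's DOT language using only the standard library. It does not reproduce how Fitchi talks to graphviz (Fitchi uses pygraphviz); the function name to_dot and the example graph are made up, and the only graphviz feature relied on is the "len" edge attribute, which neato uses as the preferred edge length.

```python
def to_dot(nodes, edges):
    """Build DOT text for an undirected graph.

    nodes -- dict: node name -> number of sequences it represents
    edges -- list of (node_a, node_b, mutations) tuples
    """
    lines = ["graph haplotypes {", "  node [shape=circle];"]
    for name, size in nodes.items():
        lines.append(f'  "{name}" [label="{size}"];')
    for a, b, mutations in edges:
        # neato treats "len" as the preferred length of this edge.
        lines.append(f'  "{a}" -- "{b}" [len={mutations}];')
    lines.append("}")
    return "\n".join(lines)

dot_text = to_dot({"h1": 5, "h2": 2, "h3": 1}, [("h1", "h2", 1), ("h1", "h3", 3)])
print(dot_text)
```

Saving the printed text as graph.dot and running "neato -Tsvg graph.dot -o graph.svg" should then produce a laid-out SVG file.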

Requirements for Fitchi

Unfortunately, Fitchi is unlikely to run on your machine right out of the box and may require a few additional installations. The instructions below should work well on Macintosh and Linux systems. It might also be possible to get all requirements to run on Windows, but this has not been tested.

python3 and pip3

First of all, Fitchi only runs with python3, not with earlier versions of python. See whether you already have this version by typing

python3 -V

in a terminal window. If you see something like "Python 3.X.X" you should be fine. If not, please download and install python3 from the Python website. If python3 is correctly installed, it is likely that the python module manager called pip has come along with it. We will need it later, so make sure that you have it:

pip3 -V

You should see something like "pip 19.3.1 from /usr/local/lib/python3.7/site-packages (python 3.7)" (make sure the python version number given in parentheses is 3.X). If you don't, note that pip for python3 might also be named just plain "pip". Try this command instead of "pip3". If this still doesn't work for you, this blog post could help.

graphviz

Next, the above-mentioned graphviz package should be installed on your machine, and it needs some additional tools in order to be accessible from python scripts (like Fitchi). If you're on OS X and you have both the highly recommended package manager Homebrew and Apple's command line tools, then the graphviz installation should be as easy as

brew install graphviz
brew install pkg-config

If you're using Linux, these instructions given on the graphviz website and Stack Overflow should help you to install graphviz. If the installation worked, typing

neato -V

should result in something like "neato - graphviz version 2.38.0 (20140413.2041)".

python modules

Finally, Fitchi requires that the following four python modules are installed: pygraphviz, biopython, and both scipy and numpy. You may use

pip3 list

to check whether any of these are installed already, and install the remaining ones with

sudo pip3 install pygraphviz
sudo pip3 install biopython
sudo pip3 install scipy
sudo pip3 install numpy

If you see an "ImportError" related to pygraphviz, you could try

sudo pip3 install pygraphviz --install-option="--include-path=/usr/include/graphviz" --install-option="--library-path=/usr/lib/graphviz"

(make sure to specify the right installation path of graphviz, which may not be /usr/include/graphviz) instead of

sudo pip3 install pygraphviz
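As an alternative to reading through the output of "pip3 list", a short python check can report which of the four modules cannot be found. This is a sketch and not part of Fitchi; the function name missing_modules is made up, and the only detail to be aware of is that biopython is imported under the name "Bio".

```python
import importlib.util

# Package name -> import name (biopython is imported as "Bio").
REQUIRED = {
    "pygraphviz": "pygraphviz",
    "biopython": "Bio",
    "scipy": "scipy",
    "numpy": "numpy",
}

def missing_modules(required=REQUIRED):
    """Return the package names whose import name cannot be found."""
    return [
        package
        for package, import_name in required.items()
        if importlib.util.find_spec(import_name) is None
    ]

missing = missing_modules()
print("missing:", ", ".join(missing) if missing else "none")
```

Note that this only tests whether the modules can be located; a pygraphviz installation that was built against the wrong graphviz path may still fail when it is actually imported.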

Fitchi input

Fitchi reads files in Nexus format that include both a sequence alignment and a bifurcating phylogenetic tree of these sequences. It is left to the user how this phylogenetic tree is obtained, but obvious candidate programs for this would include PAUP* or IQ-TREE, depending on the number and length of sequence records. The sequence alignment may include missing data, coded with IUPAC ambiguity codes. If this is the case, Fitchi uses the Fitch algorithm to infer the sequence compatible with the ambiguity code that has the smallest number of nucleotide changes compared to internal nodes. Thus, nucleotides with completely unknown state, coded as 'N', '?', or '-', will never be counted as mismatches. Note that for F_st calculations (following Weir & Cockerham 1984), sequences are expected to be diploid and phased, with each pair of two consecutive sequences assumed to be from the same individual. If sequences are haploid (e.g. mitochondrial), this can be specified with option --haploid (see below). An example file can be found on the Fitchi GitHub repository.
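The reason why unknown nucleotides never count as mismatches becomes clear when each IUPAC code is seen as a set of possible nucleotides. The sketch below is a simplified pairwise illustration, not Fitchi's implementation (which resolves ambiguities against internal nodes of the tree); the names IUPAC, states, and mismatches are made up for this example.

```python
# Each IUPAC code stands for a set of possible nucleotides.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    # Completely unknown states are compatible with every nucleotide.
    "N": "ACGT", "?": "ACGT", "-": "ACGT",
}

def states(code):
    """Return the set of nucleotides compatible with an IUPAC code."""
    return set(IUPAC[code.upper()])

def mismatches(seq_a, seq_b):
    """Count sites whose sets of possible nucleotides do not overlap."""
    return sum(1 for a, b in zip(seq_a, seq_b) if not (states(a) & states(b)))

print(mismatches("ACGTN", "ACRTA"))
print(mismatches("AC-T", "GCTY"))
```

In the first comparison, "R" is compatible with "G" and "N" with "A", so no site is counted. In the second, only the first site (A versus G) is a mismatch.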

Running Fitchi

The good news is that once you've got the above packages and modules installed, and you've got an input file ready in Nexus format, running Fitchi is absolutely easy. Just place the python script fitchi.py somewhere on your machine, and type

python3 fitchi.py input.nex output.html

This should give you an output file in HTML format, named output.html. This output file contains the haplotype genealogy as an embedded SVG formatted graph, plus summary statistics for nucleotide diversity and overall differentiation. However, the haplotype genealogy will appear in plain grey, as no population identifiers were specified, and sequences could not be assigned to populations.

For a more informative graph, specify population identifiers that are also included in the ids of the sequence records in the alignment. For instance, populations in the example input file are simply named "pop1", "pop2", etc., and are recognized by Fitchi if the script is started with the "-p" option:

python3 fitchi.py example.nex example.html -p pop1 pop2 pop3 pop4 pop5 pop6 pop7 pop8

Sequence records that cannot be assigned to any of the specified population identifiers are always shown in light grey. The same applies to the sequence records of the last populations when more than 13 population identifiers are specified, because there is no good color scheme with more colors that are easy to tell apart.
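The assignment of records to populations can be pictured as follows. This is a sketch under assumptions: it uses plain substring matching and takes the first identifier that matches, which may differ from what Fitchi does internally, and the names assign_population, is_colored, and MAX_COLORED are made up for this example.

```python
MAX_COLORED = 13  # populations beyond this number are drawn in light grey

def assign_population(record_id, populations):
    """Return the first population identifier contained in record_id, or None."""
    for pop in populations:
        if pop in record_id:
            return pop
    return None

def is_colored(pop, populations):
    """True if this population would receive its own color."""
    return pop is not None and populations.index(pop) < MAX_COLORED

pops = ["pop1", "pop2", "pop3"]
print(assign_population("ind7_pop3_a", pops))
print(assign_population("ind9_unknown", pops))
```

With substring matching like this, an identifier such as "pop1" would also match a record id containing "pop10", so identifiers that are not prefixes of each other are the safer choice.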

Fitchi can also be piped, which can be useful when doing sliding window analyses, e.g. for genomic datasets:

cat input.nex | python3 fitchi.py > output.html

Advanced

The following additional options can be specified on the command line:

-f INTEGER
Using "-f" followed by an integer specifies the first position in the alignment that should be used for analysis. Effectively, this cuts off all sites before the specified position, but keeps the specified position itself.

-t INTEGER
Just like the "-f" option trims away the first part of the alignment, "-t" cuts off its tail. For example, "-f 3 -t 10" would cut off the first two positions, and everything from position 11 to the end of the alignment. These two options affect both the haplotype genealogy graph and all statistics calculated from the alignment.
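In python terms, the two options correspond to a slice with 1-based, inclusive positions. The function trim below is a made-up helper that mirrors the description of "-f" and "-t"; it is not Fitchi's code.

```python
def trim(sequence, first=None, last=None):
    """Keep alignment positions first..last (1-based, inclusive)."""
    start = 0 if first is None else first - 1
    return sequence[start:last]

# Equivalent of "-f 3 -t 10": drop positions 1-2 and everything from 11 on.
print(trim("ABCDEFGHIJKLMN", first=3, last=10))
```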

-e INTEGER
With larger alignments, haplotype genealogy graphs can soon become too cluttered to be meaningful. In this case, it might help to specify a minimum edge length for display in the graph. This means that all edges shorter than this minimum length (in Fitch distances) will not be shown, and nodes that are linked by these short edges will be collapsed into one. As a result, nodes may not represent unique single sequences anymore, but instead a collection of closely related sequences. This option only affects the haplotype genealogy graph, not the alignment statistics.
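The collapsing of nodes across short edges can be sketched with a small union-find structure. This is one possible way to implement the behavior described above and is not taken from Fitchi; the function name collapse_short_edges and the data layout are assumptions.

```python
def collapse_short_edges(edges, min_length):
    """Merge nodes joined by edges shorter than min_length.

    edges -- list of (node_a, node_b, length) tuples
    Returns (groups, kept): groups maps each node to the representative of
    its merged group, kept lists the remaining edges between representatives.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b, length in edges:
        root_a, root_b = find(a), find(b)
        if length < min_length and root_a != root_b:
            parent[root_b] = root_a

    groups = {node: find(node) for node in list(parent)}
    kept = [(find(a), find(b), length) for a, b, length in edges if length >= min_length]
    return groups, kept

edges = [("h1", "h2", 1), ("h2", "h3", 4), ("h3", "h4", 1)]
groups, kept = collapse_short_edges(edges, min_length=2)
print(groups)
print(kept)
```

Here, h1 and h2 end up in one group and h3 and h4 in another, and only the edge of length 4 between the two groups remains.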

-n INTEGER
In a similar fashion, a minimum node size for display can be specified with the "-n" option. The effect of this option depends on the degree of each node. Nodes of degree 1 (terminal nodes) that are below the minimum size will disappear together with their connecting edge. For nodes of degree 2, the two connecting edges will be merged into a single edge and the node will be removed. Nodes of larger degrees (with more than two edges) will still be included in the graph, but will appear as empty nodes with size 0. Like "-e", "-n" affects only the graph, not the alignment statistics.

-x
Another option to reduce graph complexity is to ignore all transitions and only use transversions to calculate edge lengths, which can be chosen with "-x". Like "-e" and "-n", this only affects the graph, not the alignment statistics.
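As a reminder of the distinction, transitions exchange a purine for a purine (A, G) or a pyrimidine for a pyrimidine (C, T), while transversions switch between the two classes. The sketch below counts only transversions between two sequences; the function names are made up for this example, and unlike Fitchi it compares two sequences directly rather than along the tree.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transversion(a, b):
    """True if a and b belong to different classes (purine vs. pyrimidine)."""
    a, b = a.upper(), b.upper()
    return (a in PURINES and b in PYRIMIDINES) or (a in PYRIMIDINES and b in PURINES)

def transversion_distance(seq_a, seq_b):
    """Count the sites at which two sequences differ by a transversion."""
    return sum(is_transversion(a, b) for a, b in zip(seq_a, seq_b))

# A/G and C/T are transitions, G/C is a transversion, T/T is no change.
print(transversion_distance("ACGT", "GTCT"))
```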

--haploid
With the default setting, the calculation of pairwise F_st values assumes that the alignment contains diploid phased sequences, with each pair of consecutive sequences coming from the same individual. If sequences are haploid and each individual contributes only a single sequence, this can be specified with "--haploid". This option only affects the F_st calculation, not the other statistics or the haplotype genealogy graph.

-m FLOAT
For purely aesthetic reasons, you might want to increase or decrease the size of all nodes in relation to the edge lengths connecting them. This can be done by specifying a scale factor for all radii with the "-m" option (for example "-m 2.5"). You could also try "-m auto", which tells Fitchi to try to find an ideal size automatically so that no two nodes overlap.
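How Fitchi determines the scale factor for "-m auto" is not described here. One simple criterion, assuming that node positions are already fixed, would be the largest factor at which the closest pair of nodes just touches. The function below is a hypothetical sketch of that criterion only and requires Python 3.8 or later for math.dist.

```python
import math
from itertools import combinations

def max_scale_without_overlap(positions, radii):
    """Largest factor by which all radii can be multiplied before two nodes touch.

    positions -- dict: node -> (x, y)
    radii     -- dict: node -> radius (positive), at scale factor 1
    Needs at least two nodes.
    """
    return min(
        math.dist(positions[a], positions[b]) / (radii[a] + radii[b])
        for a, b in combinations(positions, 2)
    )

positions = {"a": (0, 0), "b": (3, 4), "c": (10, 0)}
radii = {"a": 1, "b": 1, "c": 2}
print(max_scale_without_overlap(positions, radii))
```

In this example, nodes "a" and "b" are 5 units apart with a summed radius of 2, so they limit the scale factor to 2.5.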

-s INTEGER
This allows specification of a random number seed. Wherever multiple solutions of the Fitch algorithm are equally good, Fitchi decides among these solutions at random. Multiple runs of Fitchi with the same dataset (without a random number seed) may therefore lead to different results. To make results reproducible, a random number seed can be specified: use of the same random number seed will always produce the same solution of the Fitch algorithm.
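The effect of a seed on tie-breaking can be demonstrated with python's random module. This is a generic sketch and not Fitchi's code; resolve_ties is a made-up function that picks one nucleotide from each set of equally good candidates.

```python
import random

def resolve_ties(candidate_sets, seed=None):
    """Pick one state per site from sets of equally parsimonious candidates."""
    rng = random.Random(seed)
    # Sorting makes the choice independent of the iteration order of sets.
    return [rng.choice(sorted(states)) for states in candidate_sets]

ties = [{"A", "G"}, {"C"}, {"A", "C", "T"}]
first_run = resolve_ties(ties, seed=42)
second_run = resolve_ties(ties, seed=42)
print(first_run == second_run)
```

Without a seed, two calls may return different choices for the first and third site, while the second site has only one candidate and never changes.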

Fitchi output

Fitchi writes HTML files with embedded SVG code. These can be read by most browsers, including recent versions of Firefox and Safari. HTML files are great for displaying the graphs and associated statistics. However, if you're running a large number of analyses with Fitchi, or you'd like to prepare a haplotype genealogy for publication, you might want to extract particular bits of information from these HTML files. That's why fitchi_extract.py comes along with fitchi.py. Using the "-e" option of fitchi_extract.py, you can choose which information to extract from the HTML. For example,

python3 fitchi_extract.py example.html example.svg -e svg

returns only the SVG part of the HTML and writes it to file example.svg, after minor changes are made to the SVG code to include a figure legend. If you need a black and white figure, you can specify "-e svg_bw", and all colors will be converted to black and white before extraction of the SVG code. Similarly, "-e svg_simple" removes the semi-transparent gradients that cause the glossy look of nodes, and "-e svg_simple_bw" removes this and also converts to black and white.

If you're not interested in the haplotype genealogy at all, but only in a particular alignment statistic, you could specify for example "-e prop_var" for the overall proportion of variable sites in the alignment, "-e tot_var" for the total number of variable sites, or "-e fst" for the first pairwise F_st value. Type

python3 fitchi_extract.py -h

to see a full list of available options.
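To make the two variable-site statistics concrete, here is a minimal sketch of how the number and proportion of variable sites could be computed from an alignment. It rests on an assumption that may not match Fitchi's exact definition: a site counts as variable only if at least two different unambiguous nucleotides (A, C, G, T) are observed. The variable names tot_var and prop_var simply mirror the extraction keywords.

```python
def variable_sites(alignment):
    """Count columns with more than one distinct unambiguous nucleotide."""
    total = 0
    for column in zip(*alignment):
        # Ambiguity codes and gaps are ignored in this sketch.
        observed = {base.upper() for base in column} & set("ACGT")
        if len(observed) > 1:
            total += 1
    return total

alignment = ["ACGTA", "ACGTT", "ATGNA"]
tot_var = variable_sites(alignment)
prop_var = tot_var / len(alignment[0])
print(tot_var, prop_var)
```

In this example, the second and fifth columns are variable, while the "N" in the fourth column is ignored.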

Since fitchi_extract.py can be piped just like fitchi.py, you could do the following to obtain a statistic (here the F_st between pop3 and pop5) directly without writing an HTML file:

cat example.nex | python3 fitchi.py -p pop3 pop5 | python3 fitchi_extract.py -e fst

Credits

Credits are due to Ethan Schoonover, whose color scheme Solarized substantially contributes to the good look of Fitchi's haplotype genealogy graphs. Further thanks go to the developers of the networkx and pygraphviz python modules. While Fitchi version >1.1 no longer imports the networkx module, it implements the Graph class developed for networkx.

How to cite Fitchi

Matschiner M (2015) Fitchi: Haplotype genealogy graphs based on the Fitch algorithm. Bioinformatics, 32:1250-1252.

Download Fitchi
