Integration of external software¶
EggLib provides wrappers of external programs in order to extend its functionalities besides its own population genetics and sequence management tools (although providing a wide array of external tools is not one of the primary aims of EggLib).
The available tools are presented through an integrated interface: once configured, they can be used as any regular EggLib function, using objects from EggLib class as argument and/or return values.
Currently, the wrappers module provides:
Multiple sequence alignment using Clustal Omega and MUSCLE.
Maximum-likelihood phylogeny using PhyML.
Maximum-likelihood fitting and testing of coding sequence evolution using
CodeMLfrom the PAML package.
Distance-based phylogeny using
neighborfrom the PHYLIP package.
Basic Local Alignment using
makeblastdbfrom the BLAST+ standalone package.
The requested programs must be installed separately.
Configuration of application paths¶
By default, all applications are disabled. Any attempt to use any wrappers
function will cause a
RuntimeError as in the example below:
>>> import egglib >>> cnt = egglib.io.from_fasta('sequences3.fas', cls=egglib.Container, alphabet=egglib.alphabets.DNA) >>> aln = egglib.wrappers.clustal(cnt) [traceback omitted] RuntimeError: Clustal Omega program not available -- please configure path
There is a paths object in the wrappers module which is responsible for holding the command executing the external applications. The applications are identified by these keys:
CodeML (PAML package)
We can see that the default value is
>>> print(egglib.wrappers.paths['clustal']) None
This object can be configured using the command-line tools
as described in Configuring external applications. The approach for editing paths from Python
scripts is described below.
Auto-detection of applications¶
autodetect() tries a set of
pre-defined commands and, if any works, set it as the command to run the
>>> egglib.wrappers.paths.autodetect(verbose=True) > codeml: codeml --> ok > phyml: phyml --> fail ("No such file or directory") > clustal: clustalo --> fail ("No such file or directory") > muscle: muscle --> ok
This method has a verbose option which provide you with information (success or failure, and message error in case of failure). In case of a failure, the corresponding application is disabled but this is not treated as an error.
In this example, the default commands
muscle succeeded, so the
two applications are now available. But the command for Clustal Omega was
not successful, either because the program is not installed, or not installed under
this name. As a result, it is still not available:
>>> print(egglib.wrappers.paths['clustal']) None
Of course, this result is dependent on system configuration and properties.
Setting application paths manually¶
It is possible to specify a command string (including the full path of the program executable) for any application. Assume that we have compiled Clustal Omega, but did not install it in the binary search path (the executable remains in the user’s home directory). We can set the path as follows:
>>> egglib.wrappers.paths['clustal'] = '/home/user/data/software/clustal-omega-1.2.1/src/clustalo' >>> print(egglib.wrappers.paths['clustal']) /home/user/data/software/clustal-omega-1.2.1/src/clustalo
If the path is incorrect, or if the specified command does not run as expected for this software, an error will be caused.
Saving and loading paths¶
Once paths have been set satisfactorily, it is possible to save them in a system file in order to have them available for the next session. When EggLib is imported again, the saved paths will be loaded and the properly configured application will be immediately usable. In case a program has been removed, it will cause an error when attempting to run it. The command to save paths is:
If the paths has been modified during a session, it is still possible to reset the paths from the saved configuration, discarding any changes that have been applied since EggLib has been imported, by running:
Using applications in scripts¶
This manual and the reference manual explain how to use the functions calling the external applications and they are aimed at users who are already experienced with these programs. It can be necessary to refer their own documentation to understand the meaning of available options and returned data.
Note that, for all wrappers, there is a
verbose option to control
whether or not the program output should be displayed on screen or deleted.
PhyML is a software for maximum-likelihood phylogenetic reconstruction
using amino acid or nucleotide sequences. It is available through the
wrappers.phyml() (reference included in the documentation).
The usage is fairly straightforward: it takes an
as argument and returns a
tuple with two items (the phylogenetic
tree as a
Tree object and a dictionary of statistics, respectively):
Besides the input sequence alignment, this function takes an array of option.
Only one is required (
model, the model name). See the reference manual
for the function for details. Most of the options are passed to the software.
The phylogenetic tree is returned as the first item of the return value.
It is an object of the EggLib class
Tree, which supports a wide
a range of editing operation (management of internal nodes, subtree extraction,
search for phylogenetic clades). The second item is a dictionary of statistics
including maximum-likelihood estimates of model parameters. See the function
manual for details.
CodeML is one of the program included in the PAML software package, a widely use
collection of tools for the analysis of the evolution of biological sequences.
wrappers.codeml() function of EggLib focuses on the tools describing
the evolution of coding sequences.
To be used, this tools requires an alignment of coding sequences (in particular with an alignment length which is a multiple of 3), and a phylogenetic tree whose leaves names match the sequence names. Let us assume that this is what we have with the alignment imported in the previous example. We also have built the phylogenetic with PhyML already. Let us check that the number of samples matches and that the alignment length is a multiple of 3:
>>> print(aln.ns, tree.num_leaves) 17 17 >>> print(aln.ls) 270
Note that it is also necessary to ensure that there are no stop codons in the
alignment and that the list of names match exactly. In addition, the tree must
not be rooted (the base must be a trifurcation, which is the standard for
trees reconstructed by maximum likelihood). This being said, to
run PhyML, we only need an
Align object, a
and the name of one of the models:
>>> aln = egglib.tools.to_codons(aln) >>> results = egglib.wrappers.codeml(aln, tree, 'M1a') >>> print(results['lnL']) -2081.576434 >>> print(results['omega']) [0.0905, 1.0] >>> print(results['freq']) [0.84535, 0.15465] >>> print(results['site_w']['postw']) [0.091, 0.091, 0.091, 1.0, 0.092, 0.094, 0.091, 0.091, 0.092, 0.107, 0.158, 0.092, 0.112, 0.091, 0.311, 0.091, 0.091, 0.091, 0.091, 0.091, 0.093, 0.136, 0.753, 0.092, 0.603, 0.939, 0.091, 0.091, 0.38, 0.093, 0.091, 0.1, 0.093, 0.097, 0.091, 0.092, 1.0, 0.091, 0.091, 0.091, 0.997, 0.091, 0.093, 0.091, 0.322, 0.092, 0.091, 0.091, 0.091, 0.092, 0.092, 0.091, 0.092, 0.093, 0.091, 0.091, 0.092, 0.091, 0.091, 0.091, 0.129, 0.092, 0.127, 0.091, 0.091, 0.091, 0.092, 0.987, 0.091, 0.168, 0.092, 0.213, 0.689, 0.989, 0.399, 0.997, 0.992, 0.093, 0.134, 0.985, 0.092, 0.099, 0.925, 0.091, 0.104, 0.102, 0.091, 0.091, 0.092, 0.091]
Please refer to the manual of the
wrappers.codeml() function for
a detailed list of options (including the list of available models), and
the list of returned data. The return value is a
a large number of entries exposing most of the contents of CodeML output.
In this example we displayed the log-likelihood of the model (
non-synonymous to synonymous rate ratio, or \(\omega\) for the two rate
omega), the corresponding category frequencies (
and the list of posterior \(\omega\) per-site estimates for the
amino acid sites (since the DNA alignment length is 270, there are 90
amino acid sites).
In the specific case of branch- and clade-models (with or without site
variability), CodeML expects that the phylogenetic tree contains internal
node labels to identify branches (
x is an integer) and
$x) that should have a different \(\omega\) rate. If these
labels are present in a Newick-formatted tree file, they will be imported
properly by the constructor of
Tree. Otherwise, it is also
possible to add them dynamically using the methods of
find_clade() and the property
Clustal Omega and Muscle¶
These are two multiple sequence alignment programs both supporting nucleotide
and amino acid sequences. In both cases, alignment of sequences provided as
Container object to a resulting
Align object is fairly
>>> aln1 = egglib.wrappers.clustal(cnt) >>> aln2 = egglib.wrappers.muscle(cnt)
These examples leave all other options to default values. Actually, both
wrappers provide a lot of options, thereby exposing the flexibility of these
two software tools. The options differ significantly between them, reflecting
their different paradigms. Please refer to the manual of the two functions
wrappers.muscle() (and of course the
user’s guide of the two programs) for a detailed list of the available
There are separate wrappers for
MUSCLE version 3 and version 5,
have different sets of options because the versions of the program
MUSCLE 5 is the current, improved version, but the
wrapper for version 3 is maintained in hope it might be useful. Only
one can be configured, based on the version of the
provided. The generic-named function
muscle() is an alias for
the available version.