External application tools

This module contains functions that can run external programs within the EggLib framework (taking as arguments and/or returning EggLib objects). To use these functions, the underlying programs must be available in the system. This is controlled by the application paths object as explained in Configuring external applications.
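
As a quick sanity check before configuring application paths, one can look for the external programs on the system search path. The sketch below uses only the Python standard library; the executable names in CANDIDATES are assumptions (they vary between installations), and EggLib itself relies on its configured application paths rather than on this lookup.

```python
import shutil

# Executable names are assumptions for illustration; actual names vary
# between installations and EggLib uses its own configured paths.
CANDIDATES = {
    "PhyML": "phyml",
    "CodeML": "codeml",
    "Clustal Omega": "clustalo",
    "MUSCLE": "muscle",
    "BLAST": "blastn",
}


def find_programs(candidates=CANDIDATES):
    """Map each program label to its resolved path, or None if not on PATH."""
    return {label: shutil.which(exe) for label, exe in candidates.items()}


if __name__ == "__main__":
    for label, path in find_programs().items():
        print(f"{label:14s} {path or 'not found'}")
```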

egglib.wrappers.phyml(align, model[, ...])

Reconstruct maximum-likelihood phylogeny using PhyML.

egglib.wrappers.codeml(align, tree, model[, ...])

Fit nucleotide substitution models using PAML.

egglib.wrappers.nj(aln[, model, kappa, ...])

Neighbour-joining (or UPGMA) tree using PHYLIP.

egglib.wrappers.clustal(source[, ref, full, ...])

Multiple sequence alignment using Clustal Omega.

egglib.wrappers.muscle(...)

Perform multiple alignment using Muscle.

egglib.wrappers.makeblastdb(source[, ...])

Create a BLAST database.

egglib.wrappers.megablast(query[, db, ...])

megablast similarity search.

egglib.wrappers.dc_megablast(query[, db, ...])

Discontinuous megablast similarity search.

egglib.wrappers.blastn(query[, db, subject, ...])

blastn similarity search.

egglib.wrappers.blastn_short(query[, db, ...])

blastn for short sequences.

egglib.wrappers.blastp(query[, db, subject, ...])

blastp similarity search. This is designed for using a protein query

egglib.wrappers.blastp_short(query[, db, ...])

blastp similarity search for short sequences.

egglib.wrappers.blastp_fast(query[, db, ...])

Quick blastp similarity search.

egglib.wrappers.blastx(query[, db, subject, ...])

blastx similarity search.

egglib.wrappers.blastx_fast(query[, db, ...])

Quick blastx similarity search.

egglib.wrappers.tblastn(query[, db, ...])

tblastn similarity search.

egglib.wrappers.tblastn_fast(query[, db, ...])

Quick tblastn similarity search.

egglib.wrappers.tblastx(*args, **kwargs)

tblastx similarity search.

egglib.wrappers.BlastHit()

Results for a given hit of a BLAST run.

egglib.wrappers.BlastHsp()

Description of an Hsp of a BLAST run.

egglib.wrappers.BlastOutput()

Full results of a BLAST run.

egglib.wrappers.BlastQueryHits()

Results for a given query of a BLAST run.

egglib.wrappers.phyml(align, model, labels=False, rates=1, boot=0, start_tree='nj', fixed_topology=False, fixed_brlens=False, freq=None, TiTv=4.0, pinv=0.0, alpha=None, use_median=False, free_rates=False, seed=None, verbose=False)

Reconstruct maximum-likelihood phylogeny using PhyML.

PhyML is a program performing maximum-likelihood phylogeny estimation using nucleotide or amino acid sequence alignments.

Reference

Guindon S., J.-F. Dufayard, V. Lefort, M. Anisimova, W. Hordijk, and O. Gascuel. 2010. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol. 59: 307-321.

Parameters:
  • align – input sequence alignment as an Align instance.

  • model – substitution model to use (see list below).

  • labels – boolean indicating whether the group labels should be included in the names of sequences (they will be included as the following string: @lbl1,lbl2,lbl3…, that is: @ followed by all labels separated by commas).

  • rates – number of discrete categories of evolutionary rate. If different from 1, fits a gamma distribution of rates.

  • boot – number of bootstrap repetitions. Values of -1, -2 and -4 activate one of the test-based branch support evaluation methods that provide faster alternatives to bootstrap repetitions (-1: aLRT statistics, -2: Chi2-based parametric tests, -4: Shimodaira and Hasegawa-like statistics). A value of 0 provides no branch support at all.

  • start_tree – starting topology used by the program. Possible values are the string nj (neighbour-joining tree), the string pars (maximum-parsimony tree), and a Tree instance containing a user-provided topology. In the latter case, the names of leaves of the tree must match the names of the input alignment (without group labels), implying that names cannot be repeated.

  • fixed_topology – boolean indicating whether the topology provided in the Tree instance passed as start_tree argument should NOT be improved.

  • fixed_brlens – boolean indicating whether the branch lengths provided in the Tree instance passed as start_tree argument should NOT be improved. All branch lengths must be specified. Automatically sets fixed_topology to True.

  • freq – nucleotide or amino acid frequencies. Possible values are the string o (observed, frequencies measured from the data), the string m (estimated by maximum likelihood for nucleotides, or retrieved from the substitution model for amino acids), or a four-item tuple providing the relative frequencies of A, C, G and T respectively (only for nucleotides). By default, use o for nucleotides and m for amino acids.

  • TiTv – transition/transversion ratio. If None, estimated by maximum likelihood. Otherwise, must be a strictly positive value. Ignored if data are not nucleotides or if the model does not support it. For the TN93 model, there must be a pair of ratios, one for purines and one for pyrimidines (in that order). However, a single value can be supplied (it will be applied to both rates).

  • pinv – proportion of invariable sites. If None, estimated by maximum likelihood. Otherwise, must be in the range [0, 1].

  • alpha – gamma shape parameter. If None, estimated by maximum likelihood. Otherwise, must be a strictly positive value. Ignored if rates is 1 or if free_rates is True.

  • use_median – boolean indicating whether the median (instead of the mean) should be used to report values for rate from the discretized gamma distribution. Ignored if rates is 1 or if free_rates is True.

  • free_rates – boolean indicating whether a mixture model should be used for substitution rate categories instead of the discretized gamma. In this case all rates and their frequencies will be estimated. Requires that rates is larger than 1.

  • seed – pseudo-random number generator seed. Must be a strictly positive integer, preferably large.

  • verbose – boolean indicating whether standard output of PhyML should be displayed.
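
Several of the constraints listed above can be checked before PhyML is launched. The helper below is a hypothetical pre-flight check written for illustration (it is not part of EggLib); it encodes only the constraints stated in the parameter descriptions, and the wrapper itself may check more or behave differently.

```python
def check_phyml_options(rates=1, alpha=None, free_rates=False,
                        pinv=0.0, TiTv=4.0, seed=None):
    """Return a list of problems found in a set of phyml() options.

    Hypothetical helper: it mirrors the documented constraints only.
    """
    problems = []
    if free_rates and rates <= 1:
        problems.append("free_rates requires rates > 1")
    if alpha is not None and alpha <= 0:
        problems.append("alpha must be strictly positive")
    if pinv is not None and not 0 <= pinv <= 1:
        problems.append("pinv must be in the range [0, 1]")
    if TiTv is not None:
        # TN93 accepts a pair of ratios; other models take a single value.
        ratios = TiTv if isinstance(TiTv, (tuple, list)) else (TiTv,)
        if any(ratio <= 0 for ratio in ratios):
            problems.append("TiTv must be strictly positive")
    if seed is not None and (not isinstance(seed, int) or seed <= 0):
        problems.append("seed must be a strictly positive integer")
    return problems


# Intended use (requires EggLib, PhyML, and an Align instance named align):
#     options = dict(rates=4, alpha=None, pinv=None)
#     if not check_phyml_options(**options):
#         tree, stats = egglib.wrappers.phyml(align, model='HKY85', **options)
```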

Returns:

A (tree, stats) tuple where tree is a Tree instance and stats is a dict containing the following statistics or estimated parameters:

  • lk – log-likelihood of the returned tree and model.

  • pars – parsimony score of the returned tree.

  • size – length of the returned tree.

  • rates – only available if model is GTR or custom, only for nucleotide sequences: relative substitution rates, as a list providing values in the following order:

    1. A \(\leftrightarrow\) C,

    2. A \(\leftrightarrow\) G

    3. A \(\leftrightarrow\) T

    4. C \(\leftrightarrow\) G

    5. C \(\leftrightarrow\) T

    6. G \(\leftrightarrow\) T

  • alpha – gamma shape parameter (only if the number of rate categories is larger than 1 and if free_rates was False).

  • cats – list of (rate, proportion) tuples for each discrete rate category (only if free_rates was True, implying that the number of rates was larger than 1).

  • freq – list of the relative base frequencies, in the following order: A, C, G, and T (only for nucleotide sequences).

  • ti/tv – transition/transversion ratio (available for the models K80, HKY85, F84, and TN93). For the TN93 model, the resulting value is a pair of transition/transversion ratios, one for purines and one for pyrimidines (in that order).

  • pinv – proportion of invariable sites (only if the corresponding option was not set to 0).
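
Because several entries of stats are only available for some models or options, code that consumes the dictionary should not assume every key is set. The sketch below is a hypothetical formatter; whether an unavailable entry is missing or set to None is an assumption, so both cases are handled.

```python
def summarize_phyml_stats(stats):
    """Format the entries of a phyml() stats dict that are available."""
    lines = [
        f"log-likelihood: {stats['lk']:.2f}",
        f"tree length: {stats['size']:.4f}",
    ]
    if stats.get("alpha") is not None:
        lines.append(f"gamma shape: {stats['alpha']:.3f}")
    # cats is a list of (rate, proportion) tuples when free_rates was True.
    for rate, proportion in stats.get("cats") or []:
        lines.append(f"rate category: rate={rate:.3f} proportion={proportion:.3f}")
    titv = stats.get("ti/tv")
    if isinstance(titv, (tuple, list)):
        # TN93 reports a pair: purines first, then pyrimidines.
        lines.append(f"ti/tv (purines, pyrimidines): {titv[0]:.2f}, {titv[1]:.2f}")
    elif titv is not None:
        lines.append(f"ti/tv: {titv:.2f}")
    if stats.get("pinv") is not None:
        lines.append(f"invariable sites: {stats['pinv']:.3f}")
    return lines
```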

The choice of the model defines the type of data that are expected. The available models are:

  • Nucleotides:

    Code    Full name                       Rates   Base frequencies
    JC69    Jukes and Cantor 1969           one     equal
    K80     Kimura 1980                     two     equal
    F81     Felsenstein 1981                one     unequal
    HKY85   Hasegawa, Kishino & Yano 1985   two     unequal
    F84     Felsenstein 1984                two     unequal
    TN93    Tamura and Nei 1993             three   unequal
    GTR     general time reversible         six     unequal

    In addition, custom nucleotide substitution models can be specified. In that case, model must be a six-character string of numeric characters specifying which of the six (reversible) substitution rates are allowed to vary. The one-rate model is specified by the string 000000, the two-rate model (separate transition and transversion rates) is specified by 010010, and the GTR model is specified by 012345. The substitution rates are specified in the following order:

    1. A \(\leftrightarrow\) C,

    2. A \(\leftrightarrow\) G

    3. A \(\leftrightarrow\) T

    4. C \(\leftrightarrow\) G

    5. C \(\leftrightarrow\) T

    6. G \(\leftrightarrow\) T

  • Amino acids:

    Code       Authors
    LG         Le & Gascuel (Mol. Biol. Evol. 2008)
    WAG        Whelan & Goldman (Mol. Biol. Evol. 2001)
    JTT        Jones, Taylor & Thornton (CABIOS 1992)
    MtREV      Adachi & Hasegawa (in Computer Science Monographs 1996)
    Dayhoff    Dayhoff et al. (in Atlas of Protein Sequence and Structure 1978)
    DCMut      Kosiol & Goldman (Mol. Biol. Evol. 2004)
    RtREV      Dimmic et al. (J. Mol. Evol. 2002)
    CpREV      Adachi et al. (J. Mol. Evol. 2000)
    VT         Muller & Vingron (J. Comput. Biol. 2000)
    Blosum62   Henikoff & Henikoff (PNAS 1992)
    MtMam      Cao et al. (J. Mol. Evol. 1998)
    MtArt      Abascal, Posada & Zardoya (Mol. Biol. Evol. 2007)
    HIVw       Nickle et al. (PLoS One 2007)
    HIVb       ibid.
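
The custom nucleotide model strings described above can be read as an assignment of each of the six substitution types to a rate class. The following sketch (a hypothetical helper, not part of EggLib) groups substitution types by class code, which makes it easy to verify that a string such as 010010 really separates transitions from transversions.

```python
# Substitution types in the documented order.
PAIRS = ("A<->C", "A<->G", "A<->T", "C<->G", "C<->T", "G<->T")


def rate_classes(model):
    """Group the six substitution types by the class code given in model."""
    if len(model) != 6 or any(char not in "0123456789" for char in model):
        raise ValueError("model must be a six-character string of digits")
    classes = {}
    for pair, code in zip(PAIRS, model):
        classes.setdefault(code, []).append(pair)
    return classes
```

With this helper, rate_classes("010010") places A<->G and C<->T (the two transitions) in class 1 and the four transversions in class 0.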

Changed in version 3.0.0: No more default value for the model option. Added custom model for nucleotides. Changed SH pseudo-bootstrap option flag from -3 to -4. quiet replaced by verbose. Several additional options are added. The syntax for providing a user tree is modified. The second item in the returned tuple is a dictionary of statistics.

egglib.wrappers.codeml(align, tree, model, code=1, ncat=None, codon_freq=2, verbose=False, get_files=False, kappa=2.0, fix_kappa=False, omega=0.4)

Fit nucleotide substitution models using PAML.

This function uses only the CodeML program of the PAML package.

Parameters:
  • align – an Align containing a coding sequence alignment. The number of sequences must be at least 3, and the length of the alignment is required to be a multiple of 3 (unless codons are provided). There must be no stop codons (even final stop codons) and there must not be any duplicated sequence name. The alphabet may be DNA or codon.

  • tree – a Tree providing the phylogenetic relationships between samples. The names of the sequences in the Align and in the Tree are required to match. If tree is None, a star topology is used (usage no longer recommended and not supported by recent versions of PAML). If the tree contains branch lengths or node labels, they are ignored, except for PAML node tags (#x and $x where x is an integer), which are both allowed as node labels. To label a terminal branch of the tree, add the label at the end of the sample name (with an optional separating white space). The tree must not be rooted (a bifurcation at the base causes an error).

  • model

    name of the model to fit. The list of model names appears below:

    • M0 – one-ratio model (1 parameter).

    • free – all branches have a different ratio (1 parameter per branch).

    • nW – several sets of branches. Requires labelling of branches of the tree (1 parameter per set of branches).

    • M1a – nearly-neutral model (2 parameters).

    • M2a – positive selection model (4 parameters).

    • M3 – discrete model. Requires setting ncat (2 * ncat - 1 parameters).

    • M4 – frequencies model. Requires setting ncat (ncat - 1 parameters).

    • M7 – beta-distribution model. Requires setting ncat (2 parameters).

    • M8a – beta + single ratio, additional ratio fixed to 1. Requires setting ncat (3 parameters).

    • M8 – beta + single ratio. Requires setting ncat (4 parameters).

    • A0 – null branch-site model. Requires labelling of branches of the tree with two different labels (3 parameters).

    • A – branch-site model with positive selection. Requires labelling of branches of the tree with two different labels (4 parameters).

    • C0 – null model for model C (M2a_rel). Does not require branch labelling (4 parameters).

    • C – branch-site model. Requires labelling of branches (5 parameters).

    • D – discrete branch-site model. Requires labelling of branches and requires setting ncat to either 2 or 3 (4 or 6 parameters, respectively).

    The number of parameters given for each model concerns the dN/dS ratios only. Refer to PAML documentation or the following references for more details and recommendations: Bielawski, J.P. & Z. Yang. 2004. A maximum likelihood method for detecting functional divergence at individual codon sites, with application to gene family evolution. J. Mol. Evol. 59:121-132; Yang, Z., R. Nielsen, N. Goldman & A.M.K. Pedersen. 2000. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155:431-449; Yang, Z. & R. Nielsen. 2002. Codon-substitution models for detecting molecular adaptation at individual sites along specific lineages. Mol. Biol. Evol. 19:908-917; Zhang, J., R. Nielsen & Z. Yang. 2005. Evaluation of an improved branch-site likelihood method for detecting positive selection at the molecular level. Mol. Biol. Evol. 22:2472-2479.

  • code – genetic code identifier (see here). Required to be an integer among the valid values. The default value is the standard genetic code. Only codes 1-11 are available.

  • ncat – number of dN/dS categories. Only a subset of models require the number of categories to be specified. See models.

  • codon_freq

    an integer specifying the model for codon frequencies. Must be one of:

    • 0 – equal frequencies.

    • 1 – codon frequencies from base frequencies (3 degrees of freedom).

    • 2 – codon frequencies from base frequencies at three codon positions (9 degrees of freedom).

    • 3 – independent codon frequencies (60 degrees of freedom).

  • verbose – boolean indicating whether standard output of CodeML should be displayed.

  • get_files – boolean indicating whether the raw content of CodeML output files should be included in the returned data.

  • kappa – starting value for the transition/transversion rate ratio.

  • fix_kappa – tell if the transition/transversion rate ratio should be fixed to its starting value (otherwise, it is estimated as a free parameter).

  • omega – starting value for the dN/dS ratio (strictly positive value).
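
The list of models above states, for each model, how many dN/dS parameters it has and whether ncat is required. The sketch below collects that information in a small lookup function. The function name and the n_branches and n_sets arguments are hypothetical (they stand for the number of branches and the number of labelled branch sets in the tree); the counts themselves are copied from the list.

```python
# Models for which the list above says ncat must be set.
NEEDS_NCAT = {"M3", "M4", "M7", "M8a", "M8", "D"}

# Models whose number of dN/dS parameters does not depend on other values.
FIXED_COUNTS = {"M0": 1, "M1a": 2, "M2a": 4, "M7": 2, "M8a": 3, "M8": 4,
                "A0": 3, "A": 4, "C0": 4, "C": 5}


def omega_param_count(model, ncat=None, n_branches=None, n_sets=None):
    """Return the number of dN/dS parameters documented for a codeml model."""
    if model in NEEDS_NCAT and ncat is None:
        raise ValueError(f"model {model} requires ncat")
    if model in FIXED_COUNTS:
        return FIXED_COUNTS[model]
    if model == "M3":
        return 2 * ncat - 1
    if model == "M4":
        return ncat - 1
    if model == "D":
        if ncat not in (2, 3):
            raise ValueError("model D requires ncat to be 2 or 3")
        return {2: 4, 3: 6}[ncat]
    if model == "free":
        if n_branches is None:
            raise ValueError("n_branches is needed for the free model")
        return n_branches  # one ratio per branch
    if model == "nW":
        if n_sets is None:
            raise ValueError("n_sets is needed for the nW model")
        return n_sets  # one ratio per set of branches
    raise ValueError(f"unknown model: {model}")
```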

Deprecated since version 3.3.1: The star topology option is still supported but raises a UserWarning since it can cause an error with recent versions of PAML.

Returns:

A dict holding results. The keys defined in the returned dictionary are:

  • model – model name.

  • lk – log-likelihood.

  • np – number of parameters of the model.

  • kappa – fixed or estimated value of the transition/transversion rate ratio.

  • beta – if model is M7, M8a, or M8, a tuple with the p and q parameters of the beta distribution of neutral dN/dS ratios; otherwise, None.

  • K – number of dN/dS ratio categories. Equal to 0 for the free model, to the number of branch categories for the nW model, and to the number of site categories otherwise. This value is not necessarily equal to the ncat argument because M8a and M8 models add a category, and because it has a different meaning for model nW.

  • num_tags – number of branch categories detected from the imported tree (irrespective of the model that has been fitted). If the star topology has been used (tree=None), this value is 1.

  • omega – estimated dN/dS ratio or ratios. The structure of the value depends on the model:

    • M0 model – a single value.

    • free model – None (ratios are available as node labels in the tree available as tree_ratios).

    • nW model – a list of dN/dS ratios for all branch categories (they are listed in the order corresponding to branch labels).

    • Discrete models (M1a, M2a, M3, M4, C0, M7, M8a, and M8) – a list of K dN/dS ratios. The frequency of each dN/dS category is available as freq.

    • A0 and A models – a tuple of two lists of 4 items each, containing respectively the background and foreground dN/dS ratios. The frequency of each dN/dS category is available as freq.

    • C and D models – a tuple of num_tags lists (one list for each set of branches, as defined by branch labels found in the provided tree), each of them containing K dN/dS ratios. The frequency of each dN/dS category is available as freq.

  • freq – the frequency of dN/dS ratio categories. If defined, it is a list of K values. This entry is None for models M0, free, and nW.

  • length – total length of tree after estimating branch lengths with the specified model.

  • tree – the tree with fitted branch lengths, as a Tree instance. Branch lengths are expressed in terms of the model of codon evolution.

  • length_dS – total length of tree in terms of synonymous substitutions. Only available with M0, free, and nW models.

  • length_dN – total length of tree in terms of non-synonymous substitutions. Only available with M0, free, and nW models.

  • tree_dS – a Tree instance with branch lengths expressed in terms of synonymous substitutions. Only available with free and nW models.

  • tree_dN – a Tree instance with branch lengths expressed in terms of non-synonymous substitutions. Only available with free and nW models.

  • tree_ratios – a Tree instance with the dN/dS ratios included as branch labels. Only available with free and nW models.

  • site_w – a dict containing posterior predictions of site dN/dS ratios. Not available for models M0, free, and nW (in those cases, the value is None). The dict contains the following keys:

    • method – one of the strings NEB and BEB.

    • aminoacid – the list of reference amino acids for all amino acid sites of the alignment (they are taken from the first sequence in the original alignment).

    • proba – the list of posterior probabilities of the dN/dS categories for all amino acid sites of the alignment. For each site, a tuple of K (the number of dN/dS categories) is provided.

    • best – the index of the best category for each site.

    • postw – list of the posterior dN/dS estimate for all sites (None if not available).

    • postwsd – list of the standard deviation of the dN/dS estimate for all sites (always available if postw is available and the method is BEB, None otherwise).

    • P(w>1) – probability that the dN/dS ratio is greater than 1 for all sites (None if not available).

  • main_output – raw content of the main CodeML output file. This key is not present if the option get_files is not set to True.

  • rst_output – raw content of the rst detailed CodeML output file. This key is not present if the option get_files is not set to True.
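
The lk and np entries of two result dictionaries are what is needed to compare nested models (for example A0 against A) with a likelihood ratio test. The sketch below is not part of EggLib: it applies the standard chi-square approximation, and to stay free of dependencies it only handles 1 or an even number of degrees of freedom. The function names are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable (df = 1 or even df)."""
    if x <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df > 0 and df % 2 == 0:
        # Closed form for even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        half = x / 2.0
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    raise ValueError("this sketch only handles df = 1 or an even df")


def likelihood_ratio_test(null, alt):
    """Compare two codeml() result dicts; return (statistic, df, p-value)."""
    statistic = 2.0 * (alt["lk"] - null["lk"])
    df = alt["np"] - null["np"]
    if df <= 0:
        raise ValueError("the alternative model must have more parameters")
    return statistic, df, chi2_sf(statistic, df)
```

A typical call would pass the dictionaries returned by two codeml() runs on the same alignment and tree, with the simpler model as null.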

Changed in version 3.0.0: Turned into a single function, interface changes (more models, more options, more results).

Changed in version 3.3.1: Raise a warning if tree is set to None (star topology).

egglib.wrappers.nj(aln, model=None, kappa=None, upgma=False, outgroup=None, randomize=0, verbose=False)

Neighbour-joining (or UPGMA) tree using PHYLIP.

The programs of PHYLIP used are dnadist (or protdist) and neighbor.

Parameters:
  • aln – an Align instance containing source sequences. The alphabet must be DNA or protein, matching the model argument. Note: outgroup is ignored.

  • model

    one of the models among the list below:

    For DNA sequences:
    • JC69: Jukes & Cantor’s 1969 one-parameter model.

    • K80: Kimura’s 1980 two-parameter model.

    • F84: like K80 with unequal base frequencies (default).

    • LD: LogDet (log-determinant of nucleotide occurrence matrix).

    For protein sequences:
    • PAM: Dayhoff PAM matrix.

    • JTT: Jones-Taylor-Thornton model (default).

    • PMB: probability matrix from blocks.

  • kappa – transition/transversion ratio (default is 2.0).

  • upgma – whether to use the UPGMA method rather than the neighbour-joining method.

  • outgroup – name of the sample to use as outgroup for printing the tree (the root is based at the parent node of this sample; default is the first sample).

  • randomize – whether to randomize samples.

  • verbose – whether to display console output of the PHYLIP programs.

Returns:

A Tree instance containing the tree.
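
The model names and defaults listed for nj() depend on the type of data. The sketch below shows one way to pick and validate a model name before calling the wrapper. It is a hypothetical helper, and the "DNA" and "protein" labels are illustrative strings rather than EggLib alphabet objects.

```python
# Model names and defaults as listed in the parameter description.
NJ_MODELS = {
    "DNA": {"JC69", "K80", "F84", "LD"},
    "protein": {"PAM", "JTT", "PMB"},
}
NJ_DEFAULTS = {"DNA": "F84", "protein": "JTT"}


def resolve_nj_model(kind, model=None):
    """Return a model name valid for the given kind of data."""
    if kind not in NJ_MODELS:
        raise ValueError("kind must be 'DNA' or 'protein'")
    if model is None:
        return NJ_DEFAULTS[kind]
    if model not in NJ_MODELS[kind]:
        raise ValueError(f"model {model} is not available for {kind} data")
    return model


# Intended use (requires EggLib, PHYLIP, and an Align instance named aln):
#     tree = egglib.wrappers.nj(aln, model=resolve_nj_model("DNA", "K80"))
```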

New in version 3.0.0.

egglib.wrappers.clustal(source, ref=None, full=False, full_iter=False, cluster_size=100, use_kimura=True, num_iter=1, threads=1, keep_order=False, verbose=False)

Multiple sequence alignment using Clustal Omega.

Parameters:
  • source

    a Container or Align containing the sequences to align. If a Container is provided, sequences are assumed to be unaligned, and, if an Align is provided, sequences are assumed to be aligned. The list below explains what is done based on the type of source and whether a value is provided for ref:

    • If source is a Container and ref is None, the sequences in source are aligned.

    • If source is an Align and ref is None, a hidden Markov model is built from the alignment, then the alignment is reset and sequences are realigned.

    • If source is an Align and an alignment is provided as ref, the two alignments are preserved (their columns are left unchanged), and they are aligned with respect to each other.

    • If source is a Container and an alignment is provided as ref, a hidden Markov model is built from ref, then source is aligned using it, and finally the resulting alignment is aligned with respect to ref as described for the previous case.

    source must contain at least two sequences unless it is an Align and a value is provided for ref (in that case, it must contain at least one sequence).

  • ref – an Align instance providing an external alignment. See above for more details. ref must contain at least one sequence. Sequences must be aligned.

  • full – use full distance matrix to determine guide tree (the default is using the faster and less memory-intensive mBed approximation).

  • full_iter – use full distance matrix to determine guide tree during iterations.

  • cluster_size – size of clusters (as a number of sequences) used in the mBed algorithm.

  • use_kimura – use Kimura correction for estimating whole-alignment distance (only available if a protein alignment has been provided as source).

  • num_iter – number of iterations allowing to improve the quality of the alignment. Must be a number \(\geq 1\) or a pair of numbers \(\geq 1\). If the value is a pair of numbers, they specify the number of guide tree iterations and hidden Markov model iterations, respectively. If a single value is provided, iterations couple guide tree and hidden Markov model.

  • threads – number of threads for parallelization (available for parts of the program).

  • keep_order – return the sequences in the same order as they were loaded.

  • verbose – display Clustal Omega’s console output.
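
The behaviour of clustal() depends on the type of source and on whether ref is provided, as described for the source parameter. The sketch below restates that decision table as code. It is a hypothetical helper that only returns a description; the flags source_is_align and ref_given stand in for the actual objects.

```python
def clustal_mode(source_is_align, ref_given, n_source):
    """Describe what clustal() is documented to do for a combination of inputs."""
    # A single sequence is enough only when an Align is aligned against ref.
    minimum = 1 if (source_is_align and ref_given) else 2
    if n_source < minimum:
        raise ValueError(f"source must contain at least {minimum} sequence(s)")
    if not ref_given:
        if source_is_align:
            return "build an HMM from source, reset the alignment, realign"
        return "align the sequences in source"
    if source_is_align:
        return "align source and ref to each other, columns preserved"
    return "build an HMM from ref, align source with it, align the result to ref"
```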

Returns:

An Align instance containing aligned sequences.

Changed in version 3.0.0: Ported to Clustal Omega and added support for more options.

Changed in version 3.1.0: Support protein sequences.

egglib.wrappers.muscle(...)

Perform multiple alignment using Muscle.

Depending on the version of MUSCLE detected at configuration, the call will be forwarded to either wrappers.muscle3() or wrappers.muscle5().

Changed in version 3.2.0: Dynamically use the new muscle5() method if MUSCLE version 5 is present.

egglib.wrappers.muscle5(source, super5=False, perm='none', perturb=0, consiters=2, refineiters=100, threads=None, verbose=False)

Perform multiple alignment using Muscle version 5.

Parameters:
  • source – a Container instance containing sequences to align. An Align is supported but will be treated as if it were a Container. The alphabet must be DNA or protein.

  • super5 – use the Super5 algorithm (recommended for datasets of more than a few hundred sequences).

  • perm – guide tree permutation mode. Available values are none, abc, acb, and bca. More information here.

  • perturb – if different from 0, the value is a random number generator seed used to perform hidden Markov model perturbations.

  • consiters – number of consistency iterations.

  • refineiters – number of refinement iterations.

  • threads – number of threads. By default, let MUSCLE pick the value.

  • verbose – show MUSCLE’s output in stdout.

Note

The ensemble fasta features of muscle5 are currently not available through this wrapper.

Returns:

An Align containing aligned sequences.

New in version 3.2.0.

egglib.wrappers.muscle3(source, ref=None, verbose=False, **kwargs)

Perform multiple alignment using Muscle.

This wrapper is designed to run version 3 of MUSCLE. To use MUSCLE version 5, configure the application path using the egglib-config apps command.

MUSCLE’s default options tend to produce high-quality alignments but may take a long time to run on large data sets. Muscle’s author recommends using the option maxiters=2 for large data sets, and, for fast alignment (in particular of closely related sequences): maxiters=1 diags=True aa_profile='sv' distance1='kbit20_3' (for amino acid sequences) and maxiters=1 diags=True (for nucleotide sequences).

Parameters:
  • source – a Container or Align containing sequences to align. If an Align is provided, sequences are assumed to be already aligned and alignment will be refined (using the -refine option of Muscle), unless an alignment is also provided as ref. In the latter case, the two alignments are preserved (their columns are left unchanged), and they are aligned with respect to each other.

  • ref – an Align instance providing an alignment that should be aligned with respect to the alignment provided as source. If ref is provided, both source and ref are required to be Align instances.

  • verbose – display Muscle’s console output.

  • kwargs

    other keyword arguments are passed to Muscle. The available options are listed below:

    Option            Value
    anchors           a boolean
    brenner           a boolean
    cluster           a boolean
    diags             a boolean
    diags1            a boolean
    diags2            a boolean
    dimer             a boolean
    teamgaps4         a boolean
    SUEFF             a float
    aa_profile        one of: le, sp, sv
    anchorspacing     an integer
    center            a float
    cluster1          one of: upgma, upgmb, neighborjoining
    diagbreak         an integer
    diaglength        an integer
    diagmargin        an integer
    distance1         one of: kmer6_6, kmer20_3, kmer20_4, kbit20_3, kmer4_6
    distance2         one of: pctidkimura, pctidlog
    gapopen           a float
    hydro             an integer
    hydrofactor       a float
    maxiters          an integer
    maxtrees          an integer
    minbestcolscore   a float
    minsmoothscore    a float
    nt_profile        one of: spn
    objscore          one of: sp, ps, dp, xp, spf, spm
    refinewindow      an integer
    root1             one of: pseudo, midlongestspan, minavgleafdist
    seqtype           one of: protein, dna, auto
    smoothscoreceil   a float
    weight1           one of: none, henikoff, henikoffpb, gsc, clustalw, threeway
    weight2           one of: none, henikoff, henikoffpb, gsc, clustalw, threeway

    For a description of options, see the Muscle manual. Most of Muscle’s options are available. Note that this function takes no flag options: Muscle’s flag options are passed as boolean keyword arguments (except the options relative to the amino acid or nucleotide profile score, which are passed as strings as aa_profile and nt_profile, respectively). The order of options is preserved.
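
Since options are passed as keyword arguments, a set of options can be prepared as a dictionary and checked against the expected value types before the call. The sketch below covers only a subset of the options listed above and is a hypothetical helper, not EggLib’s own validation. FAST_PROTEIN and FAST_DNA hold the fast-alignment settings quoted in the description of this function.

```python
# Subset of the documented options, grouped by expected value type.
BOOLEAN_OPTIONS = {"anchors", "brenner", "cluster", "diags", "diags1",
                   "diags2", "dimer", "teamgaps4"}
INTEGER_OPTIONS = {"maxiters", "maxtrees", "refinewindow"}
CHOICE_OPTIONS = {
    "aa_profile": {"le", "sp", "sv"},
    "nt_profile": {"spn"},
    "distance1": {"kmer6_6", "kmer20_3", "kmer20_4", "kbit20_3", "kmer4_6"},
    "seqtype": {"protein", "dna", "auto"},
}

# Fast settings recommended for closely related sequences.
FAST_PROTEIN = dict(maxiters=1, diags=True, aa_profile="sv", distance1="kbit20_3")
FAST_DNA = dict(maxiters=1, diags=True)


def check_muscle3_options(**kwargs):
    """Return a list of problems found in keyword options for muscle3()."""
    problems = []
    for name, value in kwargs.items():
        if name in BOOLEAN_OPTIONS:
            if not isinstance(value, bool):
                problems.append(f"{name}: expected a boolean")
        elif name in INTEGER_OPTIONS:
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, int):
                problems.append(f"{name}: expected an integer")
        elif name in CHOICE_OPTIONS:
            if value not in CHOICE_OPTIONS[name]:
                problems.append(f"{name}: expected one of {sorted(CHOICE_OPTIONS[name])}")
        else:
            problems.append(f"{name}: not covered by this sketch")
    return problems


# Intended use (requires EggLib, MUSCLE 3, and a Container named source):
#     aln = egglib.wrappers.muscle3(source, **FAST_DNA)
```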

Returns:

An Align containing aligned sequences.

Changed in version 3.0.0: Added support for most options.

Changed in version 3.2.0: Renamed as muscle3(). Available as muscle() if MUSCLE version 3 is available.

egglib.wrappers.makeblastdb(source, dbtype=None, out=None, input_type='fasta', verbose=False, title=None, parse_seqids=False, hash_index=False, mask_data=None, mask_id=None, mask_desc=None, blastdb_version=5, max_file_sz='1GB', taxid=None, taxid_map=None)

Create a BLAST database.

Parameters:
  • source – name of an input file of the appropriate format. If not fasta, the format must be specified using the input_type option. Alternatively, source can be a Container or Align instance. If so, its alphabet must be DNA or protein and the dbtype argument, if specified, must match. Note that passing a Container or Align instance must be avoided for large databases.

  • dbtype – database type: "nucl" or "prot" are acceptable. Can be omitted if a Container or an Align is provided as source.

  • out – database name. Must be specified if a Container or an Align is provided as source, or if the input_type is “blastdb”, otherwise the input file name is used as database name.

  • input_type – format of input file. Must be "fasta" if a Container or an Align is provided as source. Otherwise must describe the format of source: "fasta", "asn1_bin", "asn1_txt", or "blastdb".

  • verbose – display makeblastdb output (by default, it is returned by the function). Errors are always displayed.

  • title – database title. A default title is inserted in case a Container or Align instance is passed as source.

  • parse_seqids – parse seqid from sequence names (considered if input_type is fasta, including if a Container or an Align is provided as source; argument ignored otherwise: seqid is always imported).

  • hash_index – create index of sequence hash values.

  • mask_data – list of input files containing masking data.

  • mask_id – list of strings to uniquely identify the masking algorithm, one for each mask file (requires mask_data).

  • mask_desc – list of free form strings to describe the masking algorithm details, one for each mask file (requires mask_id).

  • blastdb_version – version of BLAST database to be created (4 or 5).

  • max_file_sz – maximum file size for BLAST database files.

  • taxid – taxonomy ID to assign to all sequences as an integer (incompatible with taxid_map).

  • taxid_map – text file mapping sequence IDs to taxonomy IDs (requires parse_seqids, incompatible with taxid).
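
Several makeblastdb() arguments depend on each other (for example, mask_id requires mask_data, and taxid is incompatible with taxid_map). The sketch below gathers the dependencies stated in the parameter descriptions into one hypothetical checking function. The source_is_object flag is an illustrative stand-in meaning that source is a Container or Align rather than a file name.

```python
INPUT_TYPES = ("fasta", "asn1_bin", "asn1_txt", "blastdb")


def check_makeblastdb_options(source_is_object=False, out=None,
                              input_type="fasta", parse_seqids=False,
                              mask_data=None, mask_id=None, mask_desc=None,
                              taxid=None, taxid_map=None):
    """Return a list of violated constraints among makeblastdb() arguments."""
    problems = []
    if input_type not in INPUT_TYPES:
        problems.append("unknown input_type")
    if source_is_object and input_type != "fasta":
        problems.append("input_type must be 'fasta' for a Container or Align")
    if out is None and (source_is_object or input_type == "blastdb"):
        problems.append("out must be specified")
    if mask_id is not None:
        if mask_data is None:
            problems.append("mask_id requires mask_data")
        elif len(mask_id) != len(mask_data):
            problems.append("mask_id needs one entry per mask file")
    if mask_desc is not None:
        if mask_id is None:
            problems.append("mask_desc requires mask_id")
        elif mask_data is not None and len(mask_desc) != len(mask_data):
            problems.append("mask_desc needs one entry per mask file")
    if taxid is not None and taxid_map is not None:
        problems.append("taxid and taxid_map are incompatible")
    if taxid_map is not None and not parse_seqids:
        problems.append("taxid_map requires parse_seqids")
    return problems
```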

Returns:

Standard output of the program (None if verbose was True).

Please refer to the manual of BLAST tools for more details.

egglib.wrappers.megablast(query, db=None, subject=None, query_loc=None, subject_loc=None, evalue=10, parse_deflines=False, num_threads=1, word_size=28, gapopen=5, gapextend=2, reward=1, penalty=-2, strand='both', no_dust=False, no_soft_masking=False, lcase_masking=False, perc_identity=0, no_greedy=False)

megablast similarity search. This is designed for strongly similar sequences using a nucleotide query on a nucleotide database.

Parameters:
  • query – input sequence, as a str, SequenceView, SampleView, Container or Align object. If an EggLib object, the alphabet must be DNA.

  • db – name of a nucleotide database (such as one created with makeblastdb()). Incompatible with subject.

  • subject – can be used alternatively to db. Subject sequence to search, as a str or a SequenceView object.

  • query_loc – location on the query sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software. Not supported if query is a Container.

  • subject_loc – location on the target sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software.

  • evalue – expect value (E) for saving hits.

  • num_threads – number of CPUs to use in blast search. Can be different from 1 only if subject is not used.

  • word_size – length of initial exact match.

  • gapopen – cost to open gaps.

  • gapextend – cost to extend gaps.

  • reward – reward for nucleotide match.

  • penalty

    penalty for nucleotide mismatch. The available reward/penalty combinations are listed below, with the gapopen/gapextend combinations available for each reward/penalty combination. The default gapopen/gapextend combination is indicated by an asterisk. For megablast, except for (1,-1), the default for gapopen/gapextend is left up to the blastn program:

    (1,-2) --  *(5,2)  (2,2)  (1,2)  (0,2)  (3,1)  (2,1)  (1,1)
    (1,-3) --  *(5,2)  (2,2)  (1,2)  (0,2)  (2,1)  (1,1)
    (1,-4) --  *(5,2)  (1,2)  (0,2)  (2,1)  (1,1)
    (2,-3) --   (4,4)  (2,4)  (0,4)  (3,3)  (6,2) *(5,2)  (4,2)  (2,2)
    (4,-5) -- *(12,8)  (6,5)  (5,5)  (4,5)  (3,5)
    (1,-1) --  *(5,2)  (3,2)  (2,2)  (1,2)  (0,2)  (4,1)  (3,1)  (2,1)
    

    Defaults:

    • megablast: (1,-2)

    • dc-megablast: (2,-3)

    • blastn: (2,-3)

    • blastn-short: (1,-3)

  • strand – query strand to use: "both", "minus", or "plus".

  • no_dust – prevent DUST filtering (by default, use BLAST’s default filtering mode). False by default, except for blastn-short.

  • no_soft_masking – do not apply filtering locations as soft masks (i.e., only for finding initial matches). False by default, except for blastn-short.

  • lcase_masking – use lower case filtering in query and subject sequences (not supported when EggLib objects are used because the DNA alphabet is case-insensitive and all bases will be passed to the blast program as upper case).

  • perc_identity – percent identity cutoff.

Returns:

A BlastOutput instance.
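
The reward/penalty table above can be turned into a small lookup for checking a combination before running a search. This is a minimal sketch, not part of EggLib: the names GAP_COSTS and check_scoring are hypothetical, and as a simplification the pair marked with an asterisk is always returned as the default (whereas for megablast the default is in most cases left to the blastn program).

```python
# Transcription of the reward/penalty table above (hypothetical helper,
# not part of EggLib). The first pair of each list is the one marked with
# an asterisk in the table, i.e. the default (gapopen, gapextend).
GAP_COSTS = {
    (1, -2): [(5, 2), (2, 2), (1, 2), (0, 2), (3, 1), (2, 1), (1, 1)],
    (1, -3): [(5, 2), (2, 2), (1, 2), (0, 2), (2, 1), (1, 1)],
    (1, -4): [(5, 2), (1, 2), (0, 2), (2, 1), (1, 1)],
    (2, -3): [(5, 2), (4, 4), (2, 4), (0, 4), (3, 3), (6, 2), (4, 2), (2, 2)],
    (4, -5): [(12, 8), (6, 5), (5, 5), (4, 5), (3, 5)],
    (1, -1): [(5, 2), (3, 2), (2, 2), (1, 2), (0, 2), (4, 1), (3, 1), (2, 1)],
}


def check_scoring(reward, penalty, gapopen=None, gapextend=None):
    """Return the (gapopen, gapextend) pair to use, or raise ValueError."""
    if (reward, penalty) not in GAP_COSTS:
        raise ValueError(f"unsupported reward/penalty: ({reward}, {penalty})")
    allowed = GAP_COSTS[(reward, penalty)]
    if gapopen is None and gapextend is None:
        return allowed[0]  # pair marked with an asterisk in the table
    if (gapopen, gapextend) not in allowed:
        raise ValueError(
            f"gap costs ({gapopen}, {gapextend}) not available "
            f"for reward/penalty ({reward}, {penalty})"
        )
    return (gapopen, gapextend)
```

Checking the combination up front gives a clearer error message than letting the external program reject it.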

egglib.wrappers.dc_megablast(query, db=None, subject=None, query_loc=None, subject_loc=None, evalue=10, parse_deflines=False, num_threads=1, word_size=11, gapopen=None, gapextend=None, reward=2, penalty=-3, strand='both', no_dust=False, no_soft_masking=False, lcase_masking=False, perc_identity=0, template_type='coding', template_length=18)[source]

Discontiguous megablast similarity search. This is designed for similar sequences (less similar than those targeted by megablast()) using a nucleotide query on a nucleotide database.

Parameters:
  • query – input sequence, as a str, SequenceView, SampleView, Container or Align object. If an EggLib object, the alphabet must be DNA.

  • db – name of a nucleotide database (such as one created with makeblastdb()). Incompatible with subject.

  • subject – can be used alternatively to db. Subject sequence to search, as a str or a SequenceView object.

  • query_loc – location on the query sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software. Not supported if query is a Container.

  • subject_loc – location on the target sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software.

  • evalue – expect value (E) for saving hits.

  • num_threads – number of CPUs to use in the BLAST search. Can differ from 1 only if subject is not used.

  • word_size – length of initial exact match.

Parameters:
  • gapopen – cost to open gaps.

  • gapextend – cost to extend gaps.

  • reward – reward for nucleotide match.

  • penalty – penalty for nucleotide mismatch. The available reward/penalty combinations are listed below, with the gapopen/gapextend combinations available for each reward/penalty combination. The default gapopen/gapextend combination is indicated by an asterisk. For megablast, except for (1,-1), the default for gapopen/gapextend is left up to the blastn program:

    (1,-2) --  *(5,2)  (2,2)  (1,2)  (0,2)  (3,1)  (2,1)  (1,1)
    (1,-3) --  *(5,2)  (2,2)  (1,2)  (0,2)  (2,1)  (1,1)
    (1,-4) --  *(5,2)  (1,2)  (0,2)  (2,1)  (1,1)
    (2,-3) --   (4,4)  (2,4)  (0,4)  (3,3)  (6,2) *(5,2)  (4,2)  (2,2)
    (4,-5) -- *(12,8)  (6,5)  (5,5)  (4,5)  (3,5)
    (1,-1) --  *(5,2)  (3,2)  (2,2)  (1,2)  (0,2)  (4,1)  (3,1)  (2,1)
    

    Defaults:

    • megablast: (1,-2)

    • dc-megablast: (2,-3)

    • blastn: (2,-3)

    • blastn-short: (1,-3)

  • strand – query strand to use: "both", "minus", or "plus".

  • no_dust – prevent DUST filtering (by default, use BLAST’s default filtering mode). False by default, except for blastn-short.

  • no_soft_masking – do not apply filtering locations as soft masks (i.e., only for finding initial matches). False by default, except for blastn-short.

  • lcase_masking – use lower case filtering in query and subject sequences (not supported when EggLib objects are used because the DNA alphabet is case-insensitive and all bases will be passed to the blast program as upper case).

  • perc_identity – percent identity cutoff.

Parameters:
  • template_type – template type for dc-megablast. Possible values are "coding" (default), "optimal", and "coding_and_optimal".

  • template_length – template length for dc-megablast. Possible values are 16, 18 (the default), and 21.
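
The constraints stated above (db and subject are mutually exclusive, num_threads must stay at 1 when subject is used, and the template options take a fixed set of values) can be checked before the call. The following is a sketch only: check_dc_megablast_args is a hypothetical helper, not an EggLib function, and it does not reproduce any validation EggLib may perform internally.

```python
# Hypothetical pre-flight check for the documented dc_megablast constraints.
TEMPLATE_TYPES = ("coding", "optimal", "coding_and_optimal")
TEMPLATE_LENGTHS = (16, 18, 21)


def check_dc_megablast_args(db=None, subject=None, num_threads=1,
                            template_type="coding", template_length=18):
    """Raise ValueError if the arguments contradict the documentation."""
    if db is not None and subject is not None:
        raise ValueError("db and subject are incompatible")
    if subject is not None and num_threads != 1:
        raise ValueError("num_threads must be 1 when subject is used")
    if template_type not in TEMPLATE_TYPES:
        raise ValueError(f"invalid template_type: {template_type!r}")
    if template_length not in TEMPLATE_LENGTHS:
        raise ValueError(f"invalid template_length: {template_length!r}")
    return True
```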

egglib.wrappers.blastn(query, db=None, subject=None, query_loc=None, subject_loc=None, evalue=10, parse_deflines=False, num_threads=1, word_size=11, gapopen=None, gapextend=None, reward=2, penalty=-3, strand='both', no_dust=False, no_soft_masking=False, lcase_masking=False, perc_identity=0)[source]

blastn similarity search. This is designed for distant sequences using a nucleotide query on a nucleotide database.

Parameters:
  • query – input sequence, as a str, SequenceView, SampleView, Container or Align object. If an EggLib object, the alphabet must be DNA.

  • db – name of a nucleotide database (such as one created with makeblastdb()). Incompatible with subject.

  • subject – can be used alternatively to db. Subject sequence to search, as a str or a SequenceView object.

  • query_loc – location on the query sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software. Not supported if query is a Container.

  • subject_loc – location on the target sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software.

  • evalue – expect value (E) for saving hits.

  • num_threads – number of CPUs to use in the BLAST search. Can differ from 1 only if subject is not used.

  • word_size – length of initial exact match.

Parameters:
  • gapopen – cost to open gaps.

  • gapextend – cost to extend gaps.

  • reward – reward for nucleotide match.

  • penalty – penalty for nucleotide mismatch. The available reward/penalty combinations are listed below, with the gapopen/gapextend combinations available for each reward/penalty combination. The default gapopen/gapextend combination is indicated by an asterisk. For megablast, except for (1,-1), the default for gapopen/gapextend is left up to the blastn program:

    (1,-2) --  *(5,2)  (2,2)  (1,2)  (0,2)  (3,1)  (2,1)  (1,1)
    (1,-3) --  *(5,2)  (2,2)  (1,2)  (0,2)  (2,1)  (1,1)
    (1,-4) --  *(5,2)  (1,2)  (0,2)  (2,1)  (1,1)
    (2,-3) --   (4,4)  (2,4)  (0,4)  (3,3)  (6,2) *(5,2)  (4,2)  (2,2)
    (4,-5) -- *(12,8)  (6,5)  (5,5)  (4,5)  (3,5)
    (1,-1) --  *(5,2)  (3,2)  (2,2)  (1,2)  (0,2)  (4,1)  (3,1)  (2,1)
    

    Defaults:

    • megablast: (1,-2)

    • dc-megablast: (2,-3)

    • blastn: (2,-3)

    • blastn-short: (1,-3)

  • strand – query strand to use: "both", "minus", or "plus".

  • no_dust – prevent DUST filtering (by default, use BLAST’s default filtering mode). False by default, except for blastn-short.

  • no_soft_masking – do not apply filtering locations as soft masks (i.e., only for finding initial matches). False by default, except for blastn-short.

  • lcase_masking – use lower case filtering in query and subject sequences (not supported when EggLib objects are used because the DNA alphabet is case-insensitive and all bases will be passed to the blast program as upper case).

  • perc_identity – percent identity cutoff.

Returns:

A BlastOutput instance.
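
The query_loc and subject_loc tuples exclude the stop position, whereas the BLAST command line expects a 1-based, inclusive "start-stop" range. The sketch below shows one way such a conversion could work. It assumes that start is 0-based (as in Python slices), which the text above does not state explicitly; loc_to_blast is a hypothetical name and this is not necessarily how EggLib performs the conversion.

```python
def loc_to_blast(loc):
    """Convert a half-open (start, stop) tuple to a "start-stop" string.

    Assumption: start is 0-based and stop is excluded, as in seq[start:stop].
    The result is 1-based with both bounds included.
    """
    start, stop = loc
    if not 0 <= start < stop:
        raise ValueError(f"invalid location: {loc!r}")
    # Shifting start by one converts to 1-based coordinates; stop is
    # unchanged because an excluded 0-based stop equals an included
    # 1-based stop.
    return f"{start + 1}-{stop}"


print(loc_to_blast((0, 100)))  # first 100 positions of the sequence
```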

egglib.wrappers.blastn_short(query, db=None, subject=None, query_loc=None, subject_loc=None, evalue=1000, parse_deflines=False, num_threads=1, word_size=7, gapopen=None, gapextend=None, reward=1, penalty=-3, strand='both', no_dust=True, no_soft_masking=True, lcase_masking=False, perc_identity=0)[source]

blastn for short sequences. This is optimised for query sequences up to 50 bp long. It automatically sets evalue=1000, word_size=7, no_dust=True, no_soft_masking=True, reward=1, penalty=-3, gapopen=5 and gapextend=2.

Parameters:
  • query – input sequence, as a str, SequenceView, SampleView, Container or Align object. If an EggLib object, the alphabet must be DNA.

  • db – name of a nucleotide database (such as one created with makeblastdb()). Incompatible with subject.

  • subject – can be used alternatively to db. Subject sequence to search, as a str or a SequenceView object.

  • query_loc – location on the query sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software. Not supported if query is a Container.

  • subject_loc – location on the target sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software.

  • evalue – expect value (E) for saving hits.

  • num_threads – number of CPUs to use in the BLAST search. Can differ from 1 only if subject is not used.

  • word_size – length of initial exact match.

Parameters:
  • gapopen – cost to open gaps.

  • gapextend – cost to extend gaps.

  • reward – reward for nucleotide match.

  • penalty – penalty for nucleotide mismatch. The available reward/penalty combinations are listed below, with the gapopen/gapextend combinations available for each reward/penalty combination. The default gapopen/gapextend combination is indicated by an asterisk. For megablast, except for (1,-1), the default for gapopen/gapextend is left up to the blastn program:

    (1,-2) --  *(5,2)  (2,2)  (1,2)  (0,2)  (3,1)  (2,1)  (1,1)
    (1,-3) --  *(5,2)  (2,2)  (1,2)  (0,2)  (2,1)  (1,1)
    (1,-4) --  *(5,2)  (1,2)  (0,2)  (2,1)  (1,1)
    (2,-3) --   (4,4)  (2,4)  (0,4)  (3,3)  (6,2) *(5,2)  (4,2)  (2,2)
    (4,-5) -- *(12,8)  (6,5)  (5,5)  (4,5)  (3,5)
    (1,-1) --  *(5,2)  (3,2)  (2,2)  (1,2)  (0,2)  (4,1)  (3,1)  (2,1)
    

    Defaults:

    • megablast: (1,-2)

    • dc-megablast: (2,-3)

    • blastn: (2,-3)

    • blastn-short: (1,-3)

  • strand – query strand to use: "both", "minus", or "plus".

  • no_dust – prevent DUST filtering (by default, use BLAST’s default filtering mode). False by default, except for blastn-short.

  • no_soft_masking – do not apply filtering locations as soft masks (i.e., only for finding initial matches). False by default, except for blastn-short.

  • lcase_masking – use lower case filtering in query and subject sequences (not supported when EggLib objects are used because the DNA alphabet is case-insensitive and all bases will be passed to the blast program as upper case).

  • perc_identity – percent identity cutoff.

Returns:

A BlastOutput instance.
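
The four nucleotide search functions share the same parameters and differ mainly in their default values. The sketch below collects the defaults shown in the signatures of megablast(), dc_megablast(), blastn() and blastn_short() and reports which ones differ between two functions. The names NUCLEOTIDE_DEFAULTS and differing_defaults are hypothetical, and the values are copied from the signatures rather than read from EggLib at run time.

```python
# Default values copied from the function signatures above.
NUCLEOTIDE_DEFAULTS = {
    "megablast": dict(evalue=10, word_size=28, gapopen=5, gapextend=2,
                      reward=1, penalty=-2,
                      no_dust=False, no_soft_masking=False),
    "dc_megablast": dict(evalue=10, word_size=11, gapopen=None, gapextend=None,
                         reward=2, penalty=-3,
                         no_dust=False, no_soft_masking=False),
    "blastn": dict(evalue=10, word_size=11, gapopen=None, gapextend=None,
                   reward=2, penalty=-3,
                   no_dust=False, no_soft_masking=False),
    "blastn_short": dict(evalue=1000, word_size=7, gapopen=None, gapextend=None,
                         reward=1, penalty=-3,
                         no_dust=True, no_soft_masking=True),
}


def differing_defaults(first, second):
    """Map each differing parameter to its (first, second) default values."""
    a, b = NUCLEOTIDE_DEFAULTS[first], NUCLEOTIDE_DEFAULTS[second]
    return {key: (a[key], b[key]) for key in a if a[key] != b[key]}


for name, values in differing_defaults("blastn", "blastn_short").items():
    print(name, values)
```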

egglib.wrappers.blastp(query, db=None, subject=None, query_loc=None, subject_loc=None, evalue=10, parse_deflines=False, num_threads=1, word_size=None, gapopen=None, gapextend=None, matrix='BLOSUM62', threshold=11, comp_based_stats=2, seg=0, soft_masking=False, lcase_masking=False, window_size=40, use_sw_tback=False)[source]

blastp similarity search. This is designed for using a protein query on a protein database.

Parameters:
  • query – input sequence, as a str, SequenceView, SampleView, Container or Align object. If an EggLib object, the alphabet must be DNA.

  • db – name of a protein database (such as one created with makeblastdb()). Incompatible with subject.

  • subject – can be used alternatively to db. Subject sequence to search, as a str or a SequenceView object.

  • query_loc – location on the query sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software. Not supported if query is a Container.

  • subject_loc – location on the target sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software.

  • evalue – expect value (E) for saving hits.

  • num_threads – number of CPUs to use in the BLAST search. Can differ from 1 only if subject is not used.

  • word_size – length of initial exact match.

Parameters:
  • gapopen – cost to open a gap. None: use default

  • gapextend – cost to extend a gap. None: use default

  • matrix – scoring matrix name. Available values are: PAM-30, PAM-70, BLOSUM-80, and BLOSUM-62.

  • threshold – minimum word score such that the word is added to the BLAST lookup table (>0).

  • seg – filter query sequence with SEG as an integer (0 to disable, 1 to enable, or alternatively a tuple with the three parameters window, locut, and hicut).

  • window_size – multiple hits window size (use 0 to specify 1-hit algorithm).

Parameters:
  • parse_deflines – parse query and subject bar delimited sequence identifiers.

  • comp_based_stats – composition-based statistics, as an integer code: 0 (no composition-based statistics), 1 (composition-based statistics as in NAR 29:2994-3005, 2001), 2 (composition-based score adjustment as in Bioinformatics 21:902-911, 2005, conditioned on sequence properties), or 3 (composition-based score adjustment as in Bioinformatics 21:902-911, 2005, unconditionally).

  • soft_masking – apply filtering locations as soft masks.

  • lcase_masking – use lower case filtering in query and subject sequences.

  • use_sw_tback – compute locally optimal Smith-Waterman alignments.

Returns:

A BlastOutput instance.
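
The seg parameter accepts either an integer switch or a (window, locut, hicut) tuple. The BLAST+ command-line -seg option takes "yes", "no", or the three values separated by spaces, so a conversion along the following lines is plausible. This is an illustrative sketch: seg_to_option is a hypothetical name and the actual conversion inside EggLib may differ.

```python
def seg_to_option(seg):
    """Convert a seg argument to the string expected by the -seg option."""
    if isinstance(seg, tuple):
        window, locut, hicut = seg  # raises ValueError unless 3 items
        return f"{window} {locut} {hicut}"
    if seg in (0, 1):
        return "yes" if seg else "no"
    raise ValueError(f"invalid seg value: {seg!r}")


print(seg_to_option(0))
print(seg_to_option((12, 2.2, 2.5)))
```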

egglib.wrappers.blastp_short(query, db=None, subject=None, query_loc=None, subject_loc=None, evalue=10, parse_deflines=False, num_threads=1, word_size=None, gapopen=None, gapextend=None, matrix='PAM30', threshold=16, comp_based_stats=0, seg=0, lcase_masking=False, window_size=15, use_sw_tback=False)[source]

blastp similarity search for short sequences.

Parameters:
  • query – input sequence, as a str, SequenceView, SampleView, Container or Align object. If an EggLib object, the alphabet must be DNA.

  • db – name of a protein database (such as one created with makeblastdb()). Incompatible with subject.

  • subject – can be used alternatively to db. Subject sequence to search, as a str or a SequenceView object.

  • query_loc – location on the query sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software. Not supported if query is a Container.

  • subject_loc – location on the target sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software.

  • evalue – expect value (E) for saving hits.

  • num_threads – number of CPUs to use in the BLAST search. Can differ from 1 only if subject is not used.

  • word_size – length of initial exact match.

Parameters:
  • gapopen – cost to open a gap. None: use default

  • gapextend – cost to extend a gap. None: use default

  • matrix – scoring matrix name. Available values are: PAM-30, PAM-70, BLOSUM-80, and BLOSUM-62.

  • threshold – minimum word score such that the word is added to the BLAST lookup table (>0).

  • seg – filter query sequence with SEG as an integer (0 to disable, 1 to enable, or alternatively a tuple with the three parameters window, locut, and hicut).

  • window_size – multiple hits window size (use 0 to specify 1-hit algorithm).

Parameters:
  • parse_deflines – parse query and subject bar delimited sequence identifiers.

  • comp_based_stats – composition-based statistics, as an integer code: 0 (no composition-based statistics), 1 (composition-based statistics as in NAR 29:2994-3005, 2001), 2 (composition-based score adjustment as in Bioinformatics 21:902-911, 2005, conditioned on sequence properties), or 3 (composition-based score adjustment as in Bioinformatics 21:902-911, 2005, unconditionally).

  • lcase_masking – use lower case filtering in query and subject sequences.

  • use_sw_tback – compute locally optimal Smith-Waterman alignments.

Returns:

A BlastOutput instance.

egglib.wrappers.blastp_fast(query, db=None, subject=None, query_loc=None, subject_loc=None, evalue=10, parse_deflines=False, num_threads=1, word_size=None, threshold=21, comp_based_stats=2, seg=0, lcase_masking=False, window_size=40, use_sw_tback=False)[source]

Quick blastp similarity search.

Parameters:
  • query – input sequence, as a str, SequenceView, SampleView, Container or Align object. If an EggLib object, the alphabet must be DNA.

  • db – name of a protein database (such as one created with makeblastdb()). Incompatible with subject.

  • subject – can be used alternatively to db. Subject sequence to search, as a str or a SequenceView object.

  • query_loc – location on the query sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software. Not supported if query is a Container.

  • subject_loc – location on the target sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software.

  • evalue – expect value (E) for saving hits.

  • num_threads – number of CPUs to use in the BLAST search. Can differ from 1 only if subject is not used.

  • word_size – length of initial exact match.

Parameters:
  • threshold – minimum word score such that the word is added to the BLAST lookup table (>0).

  • seg – filter query sequence with SEG as an integer (0 to disable, 1 to enable, or alternatively a tuple with the three parameters window, locut, and hicut).

  • window_size – multiple hits window size (use 0 to specify 1-hit algorithm).

  • parse_deflines – parse query and subject bar delimited sequence identifiers.

  • comp_based_stats – composition-based statistics, as an integer code: 0 (no composition-based statistics), 1 (composition-based statistics as in NAR 29:2994-3005, 2001), 2 (composition-based score adjustment as in Bioinformatics 21:902-911, 2005, conditioned on sequence properties), or 3 (composition-based score adjustment as in Bioinformatics 21:902-911, 2005, unconditionally).

  • lcase_masking – use lower case filtering in query and subject sequences.

  • use_sw_tback – compute locally optimal Smith-Waterman alignments.

Returns:

A BlastOutput instance.
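
The integer codes accepted by comp_based_stats are easy to mix up. One option is to give them names on the caller's side, as in the sketch below. The CompBasedStats enumeration and its member names are hypothetical (EggLib only documents the integer codes); since IntEnum members are integers, they can in principle be passed wherever the plain code is expected.

```python
from enum import IntEnum


class CompBasedStats(IntEnum):
    """Readable names for the documented comp_based_stats codes."""
    NONE = 0                      # no composition-based statistics
    STATISTICS = 1                # NAR 29:2994-3005, 2001
    CONDITIONAL_ADJUSTMENT = 2    # Bioinformatics 21:902-911, 2005
    UNCONDITIONAL_ADJUSTMENT = 3  # same reference, applied unconditionally


def describe(code):
    """Return a readable label; raises ValueError for unknown codes."""
    return CompBasedStats(code).name.replace("_", " ").lower()


print(describe(2))
```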

egglib.wrappers.blastx(query, db=None, subject=None, query_loc=None, subject_loc=None, evalue=10, num_threads=1, word_size=None, gapopen=None, gapextend=None, matrix='BLOSUM62', threshold=12, seg=(12, 2.2, 2.5), soft_masking=False, lcase_masking=False, window_size=40, strand='both', query_genetic_code=1, max_intron_length=0, comp_based_stats=2)[source]

blastx similarity search. Designed for using a translated nucleotide query on a protein database.

Parameters:
  • query – input sequence, as a str, SequenceView, SampleView, Container or Align object. If an EggLib object, the alphabet must be DNA.

  • db – name of a protein database (such as one created with makeblastdb()). Incompatible with subject.

  • subject – can be used alternatively to db. Subject sequence to search, as a str or a SequenceView object.

  • query_loc – location on the query sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software. Not supported if query is a Container.

  • subject_loc – location on the target sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software.

  • evalue – expect value (E) for saving hits.

  • num_threads – number of CPUs to use in the BLAST search. Can differ from 1 only if subject is not used.

  • word_size – length of initial exact match.

Parameters:
  • gapopen – cost to open a gap. None: use default

  • gapextend – cost to extend a gap. None: use default

  • matrix – scoring matrix name. Available values are: PAM-30, PAM-70, BLOSUM-80, and BLOSUM-62.

  • threshold – minimum word score such that the word is added to the BLAST lookup table (>0).

  • seg – filter query sequence with SEG as an integer (0 to disable, 1 to enable, or alternatively a tuple with the three parameters window, locut, and hicut).

  • window_size – multiple hits window size (use 0 to specify 1-hit algorithm).

Parameters:
  • soft_masking – apply filtering locations as soft masks.

  • lcase_masking – use lower case filtering in query and subject sequences.

  • query_genetic_code – genetic code to translate query. Allowed values are: 1-6, 9-16, 21-25.

  • max_intron_length – length of the largest intron allowed in a translated nucleotide sequence when linking multiple distinct alignments (a negative value disables linking).

  • strand – query strand(s) to search against database/subject. Choice of both, minus, or plus.

  • comp_based_stats – composition-based statistics, as an integer code: 0 (no composition-based statistics), 1 (composition-based statistics as in NAR 29:2994-3005, 2001), 2 (composition-based score adjustment as in Bioinformatics 21:902-911, 2005, conditioned on sequence properties), or 3 (composition-based score adjustment as in Bioinformatics 21:902-911, 2005, unconditionally).

Returns:

A BlastOutput instance.
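
The allowed genetic code identifiers form three separate ranges (1-6, 9-16, 21-25), which is easy to get wrong with a simple bounds check. A minimal validation sketch follows; ALLOWED_GENETIC_CODES and check_genetic_code are hypothetical names and not part of EggLib.

```python
# Identifiers listed as allowed for query_genetic_code: 1-6, 9-16, 21-25.
ALLOWED_GENETIC_CODES = frozenset(
    [*range(1, 7), *range(9, 17), *range(21, 26)]
)


def check_genetic_code(code):
    """Return code unchanged if it is allowed, else raise ValueError."""
    if code not in ALLOWED_GENETIC_CODES:
        raise ValueError(f"unsupported genetic code: {code!r}")
    return code


print(check_genetic_code(11))
```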

egglib.wrappers.blastx_fast(query, db=None, subject=None, query_loc=None, subject_loc=None, evalue=10, num_threads=1, word_size=None, gapopen=None, gapextend=None, matrix='BLOSUM62', threshold=21, seg=(12, 2.2, 2.5), soft_masking=False, lcase_masking=False, window_size=40, strand='both', query_genetic_code=1, max_intron_length=0, comp_based_stats=2)[source]

Quick blastx similarity search. Designed for using a translated nucleotide query on a protein database and optimised for faster execution.

Parameters:
  • query – input sequence, as a str, SequenceView, SampleView, Container or Align object. If an EggLib object, the alphabet must be DNA.

  • db – name of a protein database (such as one created with makeblastdb()). Incompatible with subject.

  • subject – can be used alternatively to db. Subject sequence to search, as a str or a SequenceView object.

  • query_loc – location on the query sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software. Not supported if query is a Container.

  • subject_loc – location on the target sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software.

  • evalue – expect value (E) for saving hits.

  • num_threads – number of CPUs to use in the BLAST search. Can differ from 1 only if subject is not used.

  • word_size – length of initial exact match.

Parameters:
  • gapopen – cost to open a gap. None: use default

  • gapextend – cost to extend a gap. None: use default

  • matrix – scoring matrix name. Available values are: PAM-30, PAM-70, BLOSUM-80, and BLOSUM-62.

  • threshold – minimum word score such that the word is added to the BLAST lookup table (>0).

  • seg – filter query sequence with SEG as an integer (0 to disable, 1 to enable, or alternatively a tuple with the three parameters window, locut, and hicut).

  • window_size – multiple hits window size (use 0 to specify 1-hit algorithm).

Parameters:
  • soft_masking – apply filtering locations as soft masks.

  • lcase_masking – use lower case filtering in query and subject sequences.

  • query_genetic_code – genetic code to translate query. Allowed values are: 1-6, 9-16, 21-25.

  • max_intron_length – length of the largest intron allowed in a translated nucleotide sequence when linking multiple distinct alignments (a negative value disables linking).

  • strand – query strand(s) to search against database/subject. Choice of both, minus, or plus.

  • comp_based_stats – composition-based statistics, as an integer code: 0 (no composition-based statistics), 1 (composition-based statistics as in NAR 29:2994-3005, 2001), 2 (composition-based score adjustment as in Bioinformatics 21:902-911, 2005, conditioned on sequence properties), or 3 (composition-based score adjustment as in Bioinformatics 21:902-911, 2005, unconditionally).

Returns:

A BlastOutput instance.

egglib.wrappers.tblastn(query, db=None, subject=None, query_loc=None, subject_loc=None, evalue=10, num_threads=1, word_size=None, gapopen=None, gapextend=None, matrix='BLOSUM62', threshold=13, seg=(12, 2.2, 2.5), soft_masking=False, window_size=40, db_genetic_code=1, max_intron_length=0, comp_based_stats=2)[source]

tblastn similarity search. Designed for using a protein query on a translated nucleotide database.

Parameters:
  • query – input sequence, as a str, SequenceView, SampleView, Container or Align object. If an EggLib object, the alphabet must be DNA.

  • db – name of a nucleotide database (such as one created with makeblastdb()). Incompatible with subject.

  • subject – can be used alternatively to db. Subject sequence to search, as a str or a SequenceView object.

  • query_loc – location on the query sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software. Not supported if query is a Container.

  • subject_loc – location on the target sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software.

  • evalue – expect value (E) for saving hits.

  • num_threads – number of CPUs to use in the BLAST search. Can differ from 1 only if subject is not used.

  • word_size – length of initial exact match.

Parameters:
  • gapopen – cost to open a gap. None: use default

  • gapextend – cost to extend a gap. None: use default

  • matrix – scoring matrix name. Available values are: PAM-30, PAM-70, BLOSUM-80, and BLOSUM-62.

  • threshold – minimum word score such that the word is added to the BLAST lookup table (>0).

  • seg – filter query sequence with SEG as an integer (0 to disable, 1 to enable, or alternatively a tuple with the three parameters window, locut, and hicut).

  • window_size – multiple hits window size (use 0 to specify 1-hit algorithm).

Parameters:
  • soft_masking – apply filtering locations as soft masks.

  • db_genetic_code – genetic code to translate subject sequences. Allowed values are: 1-6, 9-16, 21-25.

  • max_intron_length – length of the largest intron allowed in a translated nucleotide sequence when linking multiple distinct alignments (a negative value disables linking).

  • comp_based_stats – composition-based statistics, as an integer code: 0 (no composition-based statistics), 1 (composition-based statistics as in NAR 29:2994-3005, 2001), 2 (composition-based score adjustment as in Bioinformatics 21:902-911, 2005, conditioned on sequence properties), or 3 (composition-based score adjustment as in Bioinformatics 21:902-911, 2005, unconditionally).

Returns:

A BlastOutput instance.

egglib.wrappers.tblastn_fast(query, db=None, subject=None, query_loc=None, subject_loc=None, evalue=10, num_threads=1, word_size=None, gapopen=None, gapextend=None, matrix='BLOSUM62', threshold=21, seg=(12, 2.2, 2.5), soft_masking=False, window_size=40, db_genetic_code=1, max_intron_length=0, comp_based_stats=2)[source]

Quick tblastn similarity search. Designed for using a protein query on a translated nucleotide database and optimised for fast execution.

Parameters:
  • query – input sequence, as a str, SequenceView, SampleView, Container or Align object. If an EggLib object, the alphabet must be DNA.

  • db – name of a nucleotide database (such as one created with makeblastdb()). Incompatible with subject.

  • subject – can be used alternatively to db. Subject sequence to search, as a str or a SequenceView object.

  • query_loc – location on the query sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software. Not supported if query is a Container.

  • subject_loc – location on the target sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software.

  • evalue – expect value (E) for saving hits.

  • num_threads – number of CPUs to use in the BLAST search. Can differ from 1 only if subject is not used.

  • word_size – length of initial exact match.

Parameters:
  • gapopen – cost to open a gap. None: use default

  • gapextend – cost to extend a gap. None: use default

  • matrix – scoring matrix name. Available values are: PAM-30, PAM-70, BLOSUM-80, and BLOSUM-62.

  • threshold – minimum word score such that the word is added to the BLAST lookup table (>0).

  • seg – filter query sequence with SEG as an integer (0 to disable, 1 to enable, or alternatively a tuple with the three parameters window, locut, and hicut).

  • window_size – multiple hits window size (use 0 to specify 1-hit algorithm).

Parameters:
  • soft_masking – apply filtering locations as soft masks.

  • db_genetic_code – genetic code to translate subject sequences. Allowed values are: 1-6, 9-16, 21-25.

  • max_intron_length – length of the largest intron allowed in a translated nucleotide sequence when linking multiple distinct alignments (a negative value disables linking).

  • comp_based_stats – composition-based statistics, as an integer code: 0 (no composition-based statistics), 1 (composition-based statistics as in NAR 29:2994-3005, 2001), 2 (composition-based score adjustment as in Bioinformatics 21:902-911, 2005, conditioned on sequence properties), or 3 (composition-based score adjustment as in Bioinformatics 21:902-911, 2005, unconditionally).

Returns:

A BlastOutput instance.
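
Judging from the signatures above, the "fast" variants differ from their standard counterparts mainly by a higher default threshold, so that fewer words are added to the BLAST lookup table. The sketch below tabulates the documented defaults; THRESHOLD_DEFAULTS and fast_threshold_increase are hypothetical names, and the values are copied from the signatures rather than queried from EggLib.

```python
# Default threshold values copied from the function signatures above.
THRESHOLD_DEFAULTS = {
    "blastp": 11, "blastp_short": 16, "blastp_fast": 21,
    "blastx": 12, "blastx_fast": 21,
    "tblastn": 13, "tblastn_fast": 21,
}


def fast_threshold_increase(name):
    """Difference in default threshold between name_fast and name."""
    return THRESHOLD_DEFAULTS[name + "_fast"] - THRESHOLD_DEFAULTS[name]


for name in ("blastp", "blastx", "tblastn"):
    print(name, fast_threshold_increase(name))
```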

egglib.wrappers.tblastx(*args, **kwargs)

tblastx similarity search. Designed for using a translated nucleotide query on a translated nucleotide database.

Parameters:
  • query – input sequence, as a str, SequenceView, SampleView, Container or Align object. If an EggLib object, the alphabet must be DNA.

  • db – name of a nucleotide database (such as one created with makeblastdb()). Incompatible with subject.

  • subject – can be used alternatively to db. Subject sequence to search, as a str or a SequenceView object.

  • query_loc – location on the query sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software. Not supported if query is a Container.

  • subject_loc – location on the target sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software.

  • evalue – expect value (E) for saving hits.

  • num_threads – number of CPUs to use in the BLAST search. Can differ from 1 only if subject is not used.

  • word_size – length of initial exact match.

Parameters:
  • matrix – scoring matrix name. Available values are: PAM-30, PAM-70, BLOSUM-80, and BLOSUM-62.

  • seg – filter query sequence with SEG as an integer (0 to disable, 1 to enable, or alternatively a tuple with the three parameters window, locut, and hicut).

  • window_size – multiple hits window size (use 0 to specify 1-hit algorithm).

  • soft_masking – apply filtering locations as soft masks.

  • lcase_masking – use lower case filtering in query and subject sequences.

  • query_genetic_code – genetic code to translate query. Allowed values are: 1-6, 9-16, 21-25.

  • db_genetic_code – genetic code to translate subject sequences. Allowed values are: 1-6, 9-16, 21-25.

  • strand – query strand(s) to search against database/subject. Choice of both, minus, or plus.

Returns:

A BlastOutput instance.

class egglib.wrappers.BlastOutput[source]

Full results of a BLAST run.

Attributes

db

Name of the database used.

num_hits

Total number of hits for all queries.

num_hsp

Total number of Hsp's for all hits of all entries.

num_queries

Number of queries used in the BLAST search.

params

Search parameters.

program

Name of the program used.

query_ID

Identifier of the query.

query_def

Description of the query.

query_len

Length of the query.

reference

Bibliographic reference.

version

Version of the program.

Methods

get_query(i)

Hits for a given query.

iter_hits()

Iterator over all hits of all queries.

iter_hsp()

Iterator over all Hsp's of all hits of all queries.

iter_queries()

Iterator over queries.

property db

Name of the database used.

get_query(i)[source]

Hits for a given query. An instance of BlastQueryHits is returned. blast_output.get_query(i) is also available as blast_output[i].

iter_hits()[source]

Iterator over all hits of all queries. Iterates over BlastHit instances.

iter_hsp()[source]

Iterator over all Hsp’s of all hits of all queries.

iter_queries()[source]

Iterator over queries. Yields a BlastQueryHits instance for each query. The loop for query_hit in blast_output.iter_queries() can also be written as for query_hit in blast_output.

property num_hits

Total number of hits for all queries.

property num_hsp

Total number of Hsp’s for all hits of all queries.

property num_queries

Number of queries used in the BLAST search. blast_output.num_queries is also available as len(blast_output).

property params

Search parameters.

"expect": E-value, "reward": nucleotide match reward, "penalty": nucleotide mismatch penalty, "gapopen": cost for opening a gap, "gapextend": cost for extending a gap, "filter": filter string.

property program

Name of the program used.

property query_ID

Identifier of the query.

property query_def

Description of the query.

property query_len

Length of the query.

property reference

Bibliographic reference.

property version

Version of the program.
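
As a sketch of how the container behaviour described above fits together, the hypothetical helper below walks a BlastOutput-like object with plain iteration. It relies only on the documented equivalences (iterating over the object yields the per-query results, and len() gives the number of queries). The name summarize is not part of EggLib, and the stand-in objects only mimic the documented properties so that the snippet runs without BLAST installed.

```python
from types import SimpleNamespace


def summarize(blast_output):
    """Return one (query_def, num_hits, num_hsp) tuple per query.

    Hypothetical helper: it uses only the documented iteration
    protocol and the query_def, num_hits and num_hsp properties.
    """
    return [(q.query_def, q.num_hits, q.num_hsp) for q in blast_output]


# Stand-in for a real BlastOutput (assumption: a plain list of objects
# exposing the documented per-query properties).
fake_output = [
    SimpleNamespace(query_def="seq1", num_hits=2, num_hsp=3),
    SimpleNamespace(query_def="seq2", num_hits=0, num_hsp=0),
]

summary = summarize(fake_output)
print(len(fake_output), summary)
```

With a real run, the value returned by one of the BLAST wrapper functions would be passed in place of fake_output.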

class egglib.wrappers.BlastQueryHits[source]

Results for a given query of a BLAST run.

Attributes

H

Karlin-Altschul entropy parameter.

K

Karlin-Altschul kappa parameter.

L

Karlin-Altschul lambda parameter.

db_len

Number of letters in the database.

db_num

Number of sequences in the database.

eff_space

Effective space of the search.

hsp_len

Length adjustment.

num

Index of the query in the BLAST run.

num_hits

Number of hits for this query.

num_hsp

Total number of Hsp's for all hits.

query_ID

Identifier of the query.

query_def

Description of the query.

query_len

Length of the query.

Methods

get_hit(i)

Get a given hit, as a BlastHit instance.

iter_hits()

Iterator to the BlastHit instances of all hits.

iter_hsp()

Iterator over all Hsp's of all hits, as BlastHsp instances.

property H

Karlin-Altschul entropy parameter.

property K

Karlin-Altschul kappa parameter.

property L

Karlin-Altschul lambda parameter.

property db_len

Number of letters in the database.

property db_num

Number of sequences in the database.

property eff_space

Effective space of the search.

get_hit(i)[source]

Get a given hit, as a BlastHit instance. query_hits.get_hit(i) is also available as query_hits[i].

property hsp_len

Length adjustment.

iter_hits()[source]

Iterator to the BlastHit instances of all hits. for hit in query_hits.iter_hits() is also available as for hit in query_hits.

iter_hsp()[source]

Iterator over all Hsp’s of all hits, as BlastHsp instances.

property num

Index of the query in the BLAST run.

property num_hits

Number of hits for this query. query_hits.num_hits is also available as len(query_hits).

property num_hsp

Total number of Hsp’s for all hits.

property query_ID

Identifier of the query.

property query_def

Description of the query.

property query_len

Length of the query.
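
The statistical properties above can be related to the scores of individual Hsp's. In Karlin-Altschul statistics, the expected number of chance alignments reaching a bit score S' in a search space of size N is N · 2^(−S'). The sketch below applies that formula, assuming that eff_space holds the effective search space N and that the bit score is the value reported as BlastHsp.bit_score. The function name is hypothetical, and the result may differ slightly from the reported evalue because of rounding or score adjustments made by BLAST.

```python
def expected_evalue(bit_score, eff_space):
    """E-value implied by a bit score in a given effective search space.

    Karlin-Altschul relation: E = N * 2 ** (-S'), where N is the
    effective search space and S' the bit score.
    """
    return eff_space * 2.0 ** (-bit_score)


# Plain numbers are used here; with real results one would pass
# hsp.bit_score and query_hits.eff_space instead.
print(expected_evalue(10, 1024))
```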

class egglib.wrappers.BlastHit[source]

Results for a given hit of a BLAST run.

Attributes

accession

Identifier of the subject.

descr

Description of the subject.

id

Identifier of the subject.

len

Length of subject.

num

Index of the hit for the corresponding query.

num_hsp

Number of Hsp's in this hit.

Methods

get_hsp(i)

Get a given Hsp, as a BlastHsp instance.

iter_hsp()

Iterator to the BlastHsp instances for all Hsp's.

property accession

Identifier of the subject.

property descr

Description of the subject.

get_hsp(i)[source]

Get a given Hsp, as a BlastHsp instance. hit.get_hsp(i) is also available as hit[i].

property id

Identifier of the subject.

iter_hsp()[source]

Iterator to the BlastHsp instances for all Hsp’s. for hsp in hit.iter_hsp() is also available as for hsp in hit.

property len

Length of subject.

property num

Index of the hit for the corresponding query.

property num_hsp

Number of Hsp’s in this hit. hit.num_hsp is also available as len(hit).
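
Because iterating over a hit yields its Hsp's, standard Python tools apply directly. The following sketch picks the Hsp with the smallest E-value; best_hsp is a hypothetical helper, not an EggLib function, and the stand-in list mimics only the iteration behaviour and the evalue property documented for BlastHit and BlastHsp.

```python
from types import SimpleNamespace


def best_hsp(hit):
    """Return the Hsp with the smallest E-value, or None if there is none."""
    return min(hit, key=lambda hsp: hsp.evalue, default=None)


# Stand-in for a BlastHit (assumption: any iterable of objects with
# an evalue property behaves the same way for this helper).
fake_hit = [
    SimpleNamespace(num=0, evalue=1e-5, bit_score=50.1),
    SimpleNamespace(num=1, evalue=3e-20, bit_score=98.7),
]

best = best_hsp(fake_hit)
print(best.num, best.evalue)
```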

class egglib.wrappers.BlastHsp[source]

Description of an Hsp of a BLAST run.

Start and stop positions are always interpreted as range parameters (use frame to determine if the complement should be used):

>>> hit_sequence = seq[query_start:query_stop]
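
A slightly fuller sketch of the same idea follows. It slices a nucleotide query string with the start and stop positions and reverse-complements the slice when the frame is negative. Treating a negative query_frame as "reverse strand" is an assumption made for this illustration, as are the helper name and the stand-in objects; only query_start, query_stop and query_frame come from the class documented here.

```python
from types import SimpleNamespace

# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def hsp_query_region(seq, hsp):
    """Return the part of the query string covered by an Hsp.

    Assumption: a negative query_frame means the Hsp lies on the
    reverse strand, so the slice is reverse-complemented.
    """
    region = seq[hsp.query_start:hsp.query_stop]
    if hsp.query_frame < 0:
        region = region.translate(_COMPLEMENT)[::-1]
    return region


# Stand-ins exposing only the three properties used above.
forward = SimpleNamespace(query_start=2, query_stop=6, query_frame=1)
reverse = SimpleNamespace(query_start=2, query_stop=6, query_frame=-1)

forward_region = hsp_query_region("TTAACGGG", forward)
reverse_region = hsp_query_region("TTAACGGG", reverse)
print(forward_region, reverse_region)
```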

Attributes

align_len

Length of the alignment.

bit_score

Bit score of the Hsp.

evalue

Expectation value of the Hsp.

gaps

Number of gap positions.

hit_frame

Frame of the hit.

hit_start

Start position on the subject.

hit_stop

Stop position on the subject.

hseq

Aligned subject sequence.

identity

Number of identical positions.

midline

Alignment midline.

num

Index of the Hsp in the corresponding hit.

positive

Number of positions with positive score.

qseq

Aligned query sequence.

query_frame

Frame of the query.

query_start

Start position on the query.

query_stop

Stop position on the query.

property align_len

Length of the alignment.

property bit_score

Bit score of the Hsp.

property evalue

Expectation value of the Hsp.

property gaps

Number of gap positions.

property hit_frame

Frame of the hit.

property hit_start

Start position on the subject.

property hit_stop

Stop position on the subject.

property hseq

Aligned subject sequence.

property identity

Number of identical positions.

property midline

Alignment midline.

property num

Index of the Hsp in the corresponding hit.

property positive

Number of positions with positive score.

property qseq

Aligned query sequence.

property query_frame

Frame of the query.

property query_start

Start position on the query.

property query_stop

Stop position on the query.
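
The numeric properties above combine naturally into common filters. The sketch below derives a percent identity from identity and align_len and keeps only the Hsp's that pass two thresholds. Both function names and the default threshold values are illustrative choices rather than EggLib features, and the stand-in objects expose only the properties that are used.

```python
from types import SimpleNamespace


def percent_identity(hsp):
    """Identical positions as a percentage of the alignment length."""
    return 100.0 * hsp.identity / hsp.align_len


def filter_hsps(hsps, max_evalue=1e-10, min_identity=90.0):
    """Keep Hsp's with a small enough E-value and a high enough identity.

    The default thresholds are arbitrary values for illustration.
    """
    return [
        hsp for hsp in hsps
        if hsp.evalue <= max_evalue and percent_identity(hsp) >= min_identity
    ]


# Stand-ins for BlastHsp instances (assumption: same property names).
fake_hsps = [
    SimpleNamespace(num=0, identity=95, align_len=100, evalue=1e-30),
    SimpleNamespace(num=1, identity=80, align_len=100, evalue=1e-30),
    SimpleNamespace(num=2, identity=99, align_len=100, evalue=1e-3),
]

kept = filter_hsps(fake_hsps)
print([hsp.num for hsp in kept])
```

With real results, the iterable passed to filter_hsps could be the one returned by iter_hsp() on a BlastOutput, BlastQueryHits or BlastHit instance.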

Configuring paths

Application paths can be set using the following syntax. A ValueError is raised if the automatic test fails. The change is valid for the current session only unless save() is used:

egglib.wrappers.paths[app] = path
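
Since the assignment raises a ValueError when the automatic test fails, it can be convenient to wrap it. The sketch below is one way to do so; try_set_path is a hypothetical helper, and FakePaths is only a stand-in that imitates the documented behaviour (it rejects an empty command) so that the snippet runs without EggLib. The application names and the path are placeholders.

```python
def try_set_path(paths, app, path):
    """Assign paths[app] = path and report whether it was accepted.

    Returns True on success and False if the assignment raises
    ValueError (documented behaviour when the automatic test fails).
    """
    try:
        paths[app] = path
    except ValueError as err:
        print(f"could not configure {app}: {err}")
        return False
    return True


class FakePaths(dict):
    """Stand-in for egglib.wrappers.paths (assumption, for illustration)."""

    def __setitem__(self, app, path):
        if not path:
            raise ValueError("empty command")
        super().__setitem__(app, path)


paths = FakePaths()
ok = try_set_path(paths, "muscle", "/usr/local/bin/muscle")
bad = try_set_path(paths, "phyml", "")
print(ok, bad, dict(paths))
```

In a real session, egglib.wrappers.paths would be passed as the first argument.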

Application paths are accessed as follows:

egglib.wrappers.paths[app]

egglib.wrappers.paths.autodetect(verbose=False)

Auto-configure application paths based on default command names.

Parameters:

verbose – if True, print progress information.

The function returns a tuple (npassed, nfailed, failed_info) with:

  • npassed the number of applications which passed.

  • nfailed the number of applications which failed.

  • failed_info a dict containing, for each failing application, the command which was used and the error message.
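
The returned tuple can be unpacked and turned into a short report, as sketched below. The helper name is hypothetical, and the code assumes that each value of failed_info is a (command, message) pair; the description above does not state the exact layout, so this should be checked against an actual call. The stand-in object replaces egglib.wrappers.paths so that the snippet runs on its own.

```python
from types import SimpleNamespace


def report_autodetect(paths):
    """Run autodetect() and return a list of human-readable lines.

    Assumption: failed_info maps each application name to a
    (command, error message) pair.
    """
    npassed, nfailed, failed_info = paths.autodetect(verbose=False)
    lines = [f"{npassed} passed, {nfailed} failed"]
    for app, (command, message) in sorted(failed_info.items()):
        lines.append(f"{app}: `{command}` -> {message}")
    return lines


# Stand-in for egglib.wrappers.paths with a canned autodetect() result.
fake_paths = SimpleNamespace(
    autodetect=lambda verbose=False: (
        1, 1, {"phyml": ("phyml", "command not found")}
    )
)

report = report_autodetect(fake_paths)
print("\n".join(report))
```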

egglib.wrappers.paths.load()

Load values of application paths from the configuration file located within the package. All values currently set are discarded.

egglib.wrappers.paths.save()

Save current values of application paths in the configuration file located within the package. This action may require administrator rights. All values currently set will be reloaded at next import of the package.