Siblings Distributed Computations

The genome sequence comparisons are computed in a distributed manner. A centralized scheduler (siblings.scheduler.AllAllJobScheduler) distributes small chunks of alignment work to workers (siblings.aligner.AlignWorker) running anywhere in the cloud. This infrastructure is based on the ZeroMQ messaging library.
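How a scheduler might cut an all-against-all comparison into small jobs can be sketched as follows. This is only an illustration using the standard library: the function names, the job dictionary layout and the chunk size are assumptions, not the actual AllAllJobScheduler code, and the ZeroMQ transport is left out entirely.

```python
from itertools import islice, product


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def make_jobs(entries1, entries2, chunk_size):
    """Split all entry pairs of two genomes into numbered jobs (hypothetical layout)."""
    pairs = product(entries1, entries2)  # every entry of genome 1 against every entry of genome 2
    return [
        {"job_id": job_id, "pairs": chunk}
        for job_id, chunk in enumerate(chunked(pairs, chunk_size))
    ]


# Three entries against two entries give six pairs, split into jobs of at most four.
jobs = make_jobs([1, 2, 3], [1, 2], chunk_size=4)
for job in jobs:
    print(job["job_id"], job["pairs"])
```

Each job is small enough to be sent to a single worker, and a worker that fails only costs the pairs of that one job.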

Distributed workers

class siblings.aligner.AlignWorker(cfg, verbose=False)[source]

AlignWorker contains the server-side code for computing alignments of siblings.

This class communicates through a broker (MajorDomoBroker) with the Siblings database. The database initializes chunks of all-against-all comparisons which this worker computes (in parallel).

compute_alignments(data)[source]

compute the significant alignments of a job Request.

Parameters:data – a validated request object containing the sequences to be aligned
Return dict:a reply-ready dictionary containing a numpy array of type siblings.tablesformat.Matches holding the data of the significant matches.
run_device()[source]

main loop, started by the MajorDomoWorker.

class siblings.aligner.Aligner(threshold)[source]

A class to compute optimal pairwise alignments and to quickly test for homology.

This class provides methods to quickly test whether two amino acid sequences are homologous given a certain alignment threshold. Further, the class provides a method to compute the local and global optimal pairwise alignment using Smith-Waterman dynamic programming including distance optimisation, i.e. the scoring matrix is chosen to match the evolutionary distance that yields the best alignment score.

fast_homology_test(seq1, seq2)[source]

quickly test whether two sequences are homologous

The test uses pyopa’s implementation of the Smith-Waterman algorithm, which uses SSE instructions on a scoring matrix of shorts. See TODO:ref for details.

Parameters:
  • seq1 – the first sequence to be aligned
  • seq2 – the second sequence to be aligned
Return bool:

whether or not the sequences are homologous at the threshold level
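The idea of a threshold-based homology test can be illustrated with a textbook Smith-Waterman score in pure Python. This sketch is not pyopa's SSE implementation: it assumes a simple match/mismatch scheme and a linear gap penalty instead of a real amino acid scoring matrix, and the function names and default scores are hypothetical.

```python
def smith_waterman_score(seq1, seq2, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score using a linear gap penalty."""
    previous = [0] * (len(seq2) + 1)  # scores of the previous matrix row
    best = 0
    for a in seq1:
        current = [0]  # first column is always zero for local alignments
        for j, b in enumerate(seq2, start=1):
            diagonal = previous[j - 1] + (match if a == b else mismatch)
            up = previous[j] + gap
            left = current[j - 1] + gap
            cell = max(0, diagonal, up, left)  # never drop below zero
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


def looks_homologous(seq1, seq2, threshold):
    """Report whether the local alignment score reaches the threshold."""
    return smith_waterman_score(seq1, seq2) >= threshold


print(looks_homologous("MKVLA", "MKVLA", threshold=10))
print(looks_homologous("MKVLA", "GGGGG", threshold=10))
```

Only the score is needed for the test, so the sketch keeps two matrix rows in memory and never reconstructs the alignment itself.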

log_evalue(opa, len1=None, len2=None)[source]

compute ln(evalue) for the given alignment

optimal_pairwise_alignment(seq1, seq2, modif=’Local’)[source]

computes the optimal pairwise alignment between two sequences

This method computes the global or local optimal pairwise alignment between two sequences. The algorithm searches for the evolutionary distance that maximizes the alignment score. Obtaining such an optimal alignment requires several Smith-Waterman alignments, which makes this a costly operation.

Parameters:
  • seq1 – the first sequence to be aligned
  • seq2 – the second sequence to be aligned.
  • modif – (default Local) either Local or Global to specify which type of alignment should be computed.
Returns:

an OPA instance

percent_identity(opa)[source]

return percent identity of the given alignment
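One common way to define percent identity is the share of alignment columns in which both sequences carry the same residue. The sketch below uses that definition on two aligned strings; whether Aligner.percent_identity normalises by the alignment length or by something else is not stated here, and the gap character "-" and the function name are assumptions.

```python
def identity_percentage(aligned_seq1, aligned_seq2, gap_char="-"):
    """Percentage of alignment columns with identical, non-gap residues."""
    if len(aligned_seq1) != len(aligned_seq2):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_seq1:
        return 0.0
    identical = sum(
        1
        for a, b in zip(aligned_seq1, aligned_seq2)
        if a == b and a != gap_char
    )
    return 100.0 * identical / len(aligned_seq1)


print(identity_percentage("ACDE", "ACDF"))
print(identity_percentage("AC-E", "ACDE"))
```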

class siblings.aligner.Match(EntryNr1, EntryNr2, Score, PamDistance, Start1, End1, Start2, End2, PamVariance, LogEValue, PIdent, Global_Score, Global_PamDistance, Global_PamVariance, Global_PIdent)
End1

Alias for field number 5

End2

Alias for field number 7

EntryNr1

Alias for field number 0

EntryNr2

Alias for field number 1

Global_PIdent

Alias for field number 14

Global_PamDistance

Alias for field number 12

Global_PamVariance

Alias for field number 13

Global_Score

Alias for field number 11

LogEValue

Alias for field number 9

PIdent

Alias for field number 10

PamDistance

Alias for field number 3

PamVariance

Alias for field number 8

Score

Alias for field number 2

Start1

Alias for field number 4

Start2

Alias for field number 6
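The "Alias for field number N" entries are what Sphinx prints for the fields of a named tuple: each name is shorthand for a position in the tuple. The sketch below rebuilds a tuple type with the same field list using collections.namedtuple. It is a stand-in for illustration, not an import of siblings.aligner.Match, and the record is filled with its own field indices instead of real match data.

```python
from collections import namedtuple

# Same field order as in the class signature above.
Match = namedtuple(
    "Match",
    [
        "EntryNr1", "EntryNr2", "Score", "PamDistance",
        "Start1", "End1", "Start2", "End2",
        "PamVariance", "LogEValue", "PIdent",
        "Global_Score", "Global_PamDistance", "Global_PamVariance", "Global_PIdent",
    ],
)

# Fill every field with its own index so the aliasing is easy to see.
record = Match._make(range(len(Match._fields)))

print(record.End1, record[5])            # access by name and by position agree
print(Match._fields.index("LogEValue"))  # position of a field looked up by name
```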

class siblings.aligner.OPA(score, pam, pamvar, env, start1, end1, start2, end2, aligned_seq1, aligned_seq2, typ)
aligned_seq1

Alias for field number 8

aligned_seq2

Alias for field number 9

end1

Alias for field number 5

end2

Alias for field number 7

env

Alias for field number 3

pam

Alias for field number 1

pamvar

Alias for field number 2

score

Alias for field number 0

start1

Alias for field number 4

start2

Alias for field number 6

typ

Alias for field number 10

siblings.aligner.count_iter_items(iterable)[source]

Consume an iterable without reading it into memory; return the number of items.
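A minimal way to count items without building a list is to sum over a generator expression. This is a sketch of the technique under a hypothetical name; the actual count_iter_items implementation may differ.

```python
def count_items(iterable):
    """Count the items of an iterable while holding only one item at a time."""
    return sum(1 for _ in iterable)


numbers = (n * n for n in range(1000))  # a generator, never stored as a list
print(count_items(numbers))
print(count_items(numbers))  # the generator is exhausted after the first pass
```

Note that counting consumes the iterable, so a generator cannot be reused afterwards.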

siblings.aligner.json_numpy_obj_hook(dct)[source]

Decodes a previously encoded numpy ndarray with proper shape and dtype

Parameters:dct (dict) – encoded numpy array
Returns:decoded numpy array or original input
Return type:dict or numpy.array
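A JSON object hook of this kind is usually paired with an encoder that stores the raw bytes, dtype and shape of an array. The sketch below shows one widely used recipe with numpy and the standard json module. The key names ("__ndarray__", "dtype", "shape") and both helper names are assumptions, not the Siblings wire format, and the sketch only handles simple dtypes, not structured ones such as a matches table.

```python
import base64
import json

import numpy as np


class NumpyEncoder(json.JSONEncoder):
    """Encode numpy arrays as a dict of base64 data, dtype and shape."""

    def default(self, obj):
        if isinstance(obj, np.ndarray):
            data = np.ascontiguousarray(obj)
            return {
                "__ndarray__": base64.b64encode(data.tobytes()).decode("ascii"),
                "dtype": str(data.dtype),
                "shape": data.shape,
            }
        return super().default(obj)


def numpy_obj_hook(dct):
    """Rebuild an encoded array; return any other dict unchanged."""
    if isinstance(dct, dict) and "__ndarray__" in dct:
        raw = base64.b64decode(dct["__ndarray__"])
        # frombuffer yields a read-only view, so copy to get a writable array
        return np.frombuffer(raw, dtype=dct["dtype"]).reshape(dct["shape"]).copy()
    return dct


scores = np.arange(6, dtype="int32").reshape(2, 3)
text = json.dumps({"scores": scores}, cls=NumpyEncoder)
decoded = json.loads(text, object_hook=numpy_obj_hook)
print(decoded["scores"].shape, decoded["scores"].dtype)
```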

Scheduler for Distributing Computations

class siblings.scheduler.AllAllJobScheduler(writer=None, cfg=None, verbose=False)[source]

Scheduler of pending AllAll jobs

This class is responsible for scheduling allall-jobs for new genomes, sending the relevant jobs to the workers and storing the results in the hdf5 database.

class siblings.scheduler.Writer(db_path, cfg)[source]

The main database writer object.

This class is responsible for creating the hdf5 database and writing data to it.

add_genome(data)[source]

add a new genome to the database.

This method creates a new genome entry in the database and makes sure that the genome will eventually be processed.

Parameters:

data (dict) – genome data

Raises:
  • ImportDataError – if genome exists already in the database
  • DBConsistencyError – internal unrecoverable problem with database
mark_job_as_error(jobId)[source]

increase error count on job.

If the error count becomes too high, the job will no longer be sent out for computation.
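The error-count rule can be sketched with a small in-memory tracker. The real Writer keeps this state in the hdf5 database; the class name, the dictionary storage and the limit of three errors below are assumptions made for illustration only.

```python
MAX_ERRORS = 3  # hypothetical limit; the real value is not documented here


class JobErrorTracker:
    """Count failures per job and decide whether a job may be sent again."""

    def __init__(self, max_errors=MAX_ERRORS):
        self.max_errors = max_errors
        self.error_counts = {}

    def mark_job_as_error(self, job_id):
        """Increase the error count of a job and return the new count."""
        self.error_counts[job_id] = self.error_counts.get(job_id, 0) + 1
        return self.error_counts[job_id]

    def is_schedulable(self, job_id):
        """A job stays schedulable while its error count is below the limit."""
        return self.error_counts.get(job_id, 0) < self.max_errors


tracker = JobErrorTracker()
for _ in range(3):
    tracker.mark_job_as_error(42)
print(tracker.is_schedulable(42))  # job 42 failed three times
print(tracker.is_schedulable(7))   # job 7 never failed
```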

store_reply(reply)[source]

store reply of align workers in pytable.

Replies from alignment workers are processed in this method. The JobQueue and PairStatus tables are updated accordingly. Replies which do not conform to the expected format are ignored and an ImportDataError is raised.

The method also updates the statistics on the users who provide computation.

Siblings Database

The database module siblings.db provides methods to read data from a local copy of the hdf5 database files. We use PyTables to work with hdf5 files from Python. Unless you work with your own copy of the hdf5 database, you will most likely not have to worry about these internals.

class siblings.db.GenomeOrderSwapper(cols)[source]

This class handles the swapping of column ids, column position and QueryFilters introduced by the ordering of the matches columns.

swap_columns_in_matches(data)[source]

function to swap the columns in the matches array

class siblings.db.Reader(filename, mode=’r’, logger=None)[source]

This is the main database class. It handles queries to the hdf5 siblings files

columnnames

Returns all the column names of the Matches table

genomesid

returns the NCBI Taxonomy ID for all available genomes.

This is the internal Siblings genomes ID. See get_genomes() for a method to access summary information on the genomes.

get_genome_sequences(genome, cDNA=False, entry_nrs=None, start=0, stop=None, **kwargs)[source]

Get the protein sequences of a query genome.

The method returns a list of dicts with the protein sequence and, if selected with the boolean flag cDNA, the coding DNA sequence for a set of selected entries.

Parameters:
  • genome (int) – the NCBI taxonomy id of the query genome.
  • entry_nrs – the query entry numbers. By default, all entry numbers are returned.
  • cDNA (bool) – boolean flag whether or not the corresponding cDNA sequences should be returned.
Returns:

a list of dicts with EntryNr, ID, Sequence and cDNA elements for the requested entry numbers.

get_genomes(columns=None, start=None, stop=None)[source]

Returns summary information on the available genomes.

This method returns summary information in the form of a numpy.ndarray. The returned columns are

  • NCBITaxonId, the internal (stable) ID for the genome as assigned by NCBI.
  • TotEntries, the number of proteins in the genome
  • SciName, the scientific name of the genome
  • DBRelease, the source and release information (e.g. source of genome, assembly build and release date or version).
  • UniProtSpeciesCode, a (usually) 5-letter code for the species, managed and assigned by the UniProt consortium.
Returns:numpy.ndarray
get_homologs_of_gene(genome, entry, filter_obj=<siblings.db.QueryFilter object>, returnid=None, **kw)[source]

Returns all homologs for a given query gene in all other species.

Parameters:
  • genome – the NCBITaxonomy entry of the query gene’s genome.
  • entry – the entry number of the query gene.
  • filter_obj – a filtering object (siblings.db.QueryFilter) to select subsets of all possible matches, i.e. to select with a more stringent parameter set.
  • returnid – a ReturnIdType object to convert internal entry numbers to some supported xref ids.
get_matches_between_genomes(genome1, genome2, filter_obj=None, returnid=None, start=0, stop=None, **kw)[source]

This method returns the matches between a genome pair that fulfil the filtering criteria. The returned ids are simply the EntryNr in the respective genomes.

static query_filter(filter_str)[source]

Converts a query string into a QueryFilter object

This is a convenience method which creates a QueryFilter object to work with.

Returns:a QueryFilter object
returnid_factory(primary, fallback)[source]

creates the necessary objects to retrieve matches with the desired type of Ids

class siblings.db.ReturnIdType(tab, primary, fallback=None)[source]

the minimum number of matches requested for conversion above which it becomes necessary to prefetch all xrefs of that species

siblings.db.reverse_operator(op)[source]

adjusted operators for reversed lhs and rhs

siblings.db.swap_columns(data, swapCols)[source]

function to swap a set of column pairs in the data array
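Swapping column pairs in a numpy structured array can be sketched as below. The format of the swap list (pairs of field names) and the field names of the example array are assumptions chosen for illustration; the real swap_columns signature only tells us that it takes the data and a description of the columns to swap.

```python
import numpy as np


def swap_column_pairs(data, swap_cols):
    """Return a copy of a structured array with each named column pair exchanged."""
    swapped = data.copy()
    for first, second in swap_cols:
        # Read from the untouched original so the two assignments cannot interfere.
        swapped[first] = data[second]
        swapped[second] = data[first]
    return swapped


matches = np.array(
    [(1, 7, 10, 20), (2, 9, 30, 40)],
    dtype=[("EntryNr1", "i4"), ("EntryNr2", "i4"), ("Start1", "i4"), ("Start2", "i4")],
)
swapped = swap_column_pairs(matches, [("EntryNr1", "EntryNr2"), ("Start1", "Start2")])
print(swapped)
```

Returning a copy keeps the original array intact, which is convenient when the same matches are needed in both genome orders.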