Siblings Documentation

The SIBLINGs project (SIB Large INtercomparison of Genomes) aims to provide the bioinformatics community with inter- and intra- genome homology searches, i.e. “all-against-all” gene comparisons from complete genomes. Such datasets of homologous gene relations can be further processed for identifying gene families, e.g. for phylogenetic studies, as well as for identifying paralogs and orthologs.

SIBLINGs is still experimental. Feel free to contact us to provide feedback or to get involved in the project.

The SIBLINGs project is supported by the Swiss Institute of Bioinformatics (SIB) through federal funding for OMA and OrthoDB. The collaboration aims to unify the “common denominator” of data processing for our orthology databases, which requires substantial effort for data preparation and CPU processing time.

SIBLINGs consists of three parts: the SIBLINGs backend database, a RESTful web service providing endpoints to query the data and upload new genomes, and a broker-based infrastructure to compute alignments in a distributed way. The following sections document these parts briefly.

Siblings REST API documentation

The Siblings REST API provides endpoints to load protein alignment summaries, and the sequences they were computed from, for pairs of genomes in the Siblings database.

This API is currently in beta status and therefore not yet completely stable.

The available endpoints are listed below.

GET /api/genomes/?

List all the available genomes.

This endpoint returns a list with summary information on the available genomes.

Example request:

GET /api/genomes HTTP/1.1
Host: siblings.ch
Accept: application/json

Example response:

HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8

[
  {
    "NCBITaxonId": 9606,
    "UniProtSpeciesCode": "HUMAN",
    "TotEntries": 31540,
    "SciName": "Homo sapiens",
    "DBRelease": "Ensembl 70; GRCh37; 12-DEC-2012"
  },
  {
    "NCBITaxonId": 9598,
    "UniProtSpeciesCode": "PANTR",
    "TotEntries": 30522,
    "SciName": "Pan troglodytes",
    "DBRelease": "Ensembl 75"
  }
]
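
Because the response is a plain JSON array, a client can index it by species code in a few lines. The following is a minimal sketch, assuming the field names shown in the example response; the helper name `index_genomes` and the sample body are hypothetical, and no network request is made.

```python
import json


def index_genomes(payload):
    """Map UniProtSpeciesCode -> genome summary for a /api/genomes response body."""
    genomes = json.loads(payload)
    return {g["UniProtSpeciesCode"]: g for g in genomes}


# Sample body shaped like the example response (values are illustrative).
sample = json.dumps([
    {"NCBITaxonId": 9606, "UniProtSpeciesCode": "HUMAN", "TotEntries": 31540,
     "SciName": "Homo sapiens", "DBRelease": "Ensembl 70; GRCh37; 12-DEC-2012"},
])

by_code = index_genomes(sample)
print(by_code["HUMAN"]["NCBITaxonId"])
```
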

GET /api/genomes/(?P<taxid>\d+)

Get the protein sequences of a query genome.

The method returns a list with the protein sequence and – if selected with the boolean flag cdna – the coding DNA sequence for a set of selected entries.

Parameters:
  • taxid (int) – the NCBI taxonomy id of the query genome.
Return:

a list of dicts with EntryNr, ID, Sequence and cDNA elements for the requested entry numbers.

Status Codes:

Example request:

GET /api/genomes/9606 HTTP/1.1
Host: siblings.ch
Accept: application/json

POST /api/genomes/new/?

Upload a new genome to be included.

Using this method, users can contribute new genomes to the dataset. Uploaded genomes are automatically added to the queue for all-against-all computations and eventually become available for download through this service. The computations require a significant amount of time, usually several days or weeks, depending on the number of pending genomes, their sizes and the number of available compute nodes.

The uploaded genome must be a JSON-encoded string containing the following attributes:

Request JSON Object:
 
  • NCBITaxonId (int) – the NCBI taxonomy id of the new genome
  • SciName (str) – the scientific name of the genome (including strain information)
  • Proteins – an array of objects, each composed of a unique ID (the canonical ID of the protein), the Sequence as an amino acid sequence, and optionally a cDNA (the coding DNA of the protein) and an XRefs array with Source and ID.

Example request:

POST /api/genomes/new HTTP/1.1
Host: siblings.ch
Accept: application/json
Content-type: application/json

{"NCBITaxonId": 9606, "Proteins":
  [{"ID": "A0A024RAP8",
    "cDNA": "ATGGGGTGGATTCGTGGTC...CTGTGTAA",
    "Sequence": "MGWIRGRRSRHSWE...CMQRTV",
    "XRefs": [
      {"Source": "Ensembl Gene", "ID": "ENSG00000255819"},
      {"Source": "Ensembl Protein", "ID": "ENSP00000480609"},
      {"Source": "EntrezGene", "ID": "22914"}]
  }]}
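
The request body can be assembled programmatically before it is posted. The sketch below is only an illustration: the helper name `build_genome_payload` is hypothetical, the protein ID and sequence are toy values, and treating ID and Sequence as the only required protein attributes is an assumption based on the attribute list above.

```python
import json


def build_genome_payload(taxon_id, sci_name, proteins):
    """Return a JSON string for POST /api/genomes/new (hypothetical helper)."""
    for prot in proteins:
        # Assumption: ID and Sequence are mandatory, cDNA and XRefs are optional.
        missing = {"ID", "Sequence"} - prot.keys()
        if missing:
            raise ValueError(f"protein is missing {sorted(missing)}")
    return json.dumps({
        "NCBITaxonId": int(taxon_id),
        "SciName": sci_name,
        "Proteins": proteins,
    })


body = build_genome_payload(
    9606,
    "Homo sapiens",
    [{"ID": "P1", "Sequence": "MGWIRGRRSRHSWE",  # toy protein, not real data
      "XRefs": [{"Source": "EntrezGene", "ID": "22914"}]}],
)
print(body)
```

Validating on the client side avoids uploading a large genome only to have it rejected; which checks the server actually performs is not documented here.
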

GET /api/matches/(?P<taxid1>\d+)/(?P<taxid2>\d+)/?

Return homologous matches between a genome pair

Summary information about the homologs between the genome pair is returned in the form of a matrix, where each row corresponds to a relation between two proteins. The columns of the matrix hold the properties of the alignment, i.e.

  • the two homologous proteins (EntryNr1 and EntryNr2)
  • the alignment Score computed using Smith-Waterman on a local alignment
  • a maximum likelihood estimate of the evolutionary distance, PamDistance, in PAM units (percent accepted mutations). In addition, PamVariance is an ML estimate of the variance of PamDistance.
  • the ranges of the two proteins that span the local alignment (Start1, End1, Start2, End2). Positions are relative to the start of the protein sequences and are 1-based, e.g. if the alignment includes the first amino acid, the range starts at 1.
  • the LogEValue, the E-value of the alignment transformed with the ln() function to preserve precision.
  • the percent identity of the alignment in PIdent.
  • Global_{Score, PamDistance, PamVariance, PIdent} store the same properties based on the global alignment.
Parameters:
  • taxid1 – Taxonomy Id of first genome
  • taxid2 – Taxonomy Id of second genome
Response JSON Object:
 
  • colnames – column names / properties of the alignments.
  • data – the data matrix. Each row corresponds to one significant alignment.
Status Codes:

Example request:

GET /api/matches/9606/9606 HTTP/1.1
Host: siblings.ch
Accept: application/json
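
Since the response separates column names from the data matrix, it is often convenient to zip them back together on the client. A minimal sketch, assuming a decoded response with the documented `colnames` and `data` keys; the helper name `matches_to_records` and the sample values are hypothetical.

```python
def matches_to_records(response):
    """Turn {"colnames": [...], "data": [[...], ...]} into a list of dicts."""
    cols = response["colnames"]
    return [dict(zip(cols, row)) for row in response["data"]]


# Illustrative response with a subset of the documented columns.
response = {
    "colnames": ["EntryNr1", "EntryNr2", "Score", "PamDistance"],
    "data": [[1, 7, 2150.5, 12.3], [2, 9, 310.0, 88.0]],
}

records = matches_to_records(response)
close = [r for r in records if r["PamDistance"] < 15]
print(close)
```
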

POST /api/matches/(?P<taxid1>\d+)/(?P<taxid2>\d+)/?

Return homologous matches between a genome pair with filtering options.

Same as the GET method, but allows additional filtering by passing a filter string attribute in the request body. Several conditions can be combined with the “&” operator (logical and) and can operate on any attribute of the returned matches table.

Parameters:
  • taxid1 – Taxonomy Id of first genome
  • taxid2 – Taxonomy Id of second genome
Response JSON Object:
 
  • colnames – column names / properties of the alignments.
  • data – the data matrix. Each row corresponds to one significant alignment.
Status Codes:

Example request: return matches that have a score >2000 and an evolutionary distance of less than 15 PAM:

POST /api/matches/9606/10090 HTTP/1.1
Host: siblings.ch
Accept: application/json
Content-type: application/json

{"filter": "Score>2000 & PamDistance<15"}
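
The same request can be prepared from Python with the standard library. This is a sketch under stated assumptions: the helper name `build_filter_request` is hypothetical, and the `http://siblings.ch` base URL (in particular the scheme) is inferred from the Host header in the example. The request object is only built here, not sent.

```python
import json
import urllib.request


def build_filter_request(taxid1, taxid2, conditions, base="http://siblings.ch"):
    """Prepare (but do not send) a filtered matches request."""
    # Conditions are joined with "&", the logical-and operator of the filter string.
    filter_str = " & ".join(conditions)
    body = json.dumps({"filter": filter_str}).encode("utf-8")
    return urllib.request.Request(
        f"{base}/api/matches/{int(taxid1)}/{int(taxid2)}",
        data=body,
        headers={"Accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


req = build_filter_request(9606, 10090, ["Score>2000", "PamDistance<15"])
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing `req` to `urllib.request.urlopen` would perform the actual call.
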

GET /(?:api/)?timings/?

Get summary timing statistics of all users contributing computations.

The method returns a dictionary with the number of contributed all-against-all units, their average duration and the total contributed CPU time in seconds per user (and host).

If an HTML page is requested (using the Accept header), this information is rendered as an HTML page including graphics; otherwise a JSON-encoded object is returned.

Status Codes:

Siblings Distributed Computations

The genome sequence comparisons are computed in a distributed manner. A centralized scheduler (siblings.scheduler.AllAllJobScheduler) distributes small chunks of alignment work to workers (siblings.aligner.AlignWorker) running anywhere in the cloud. This infrastructure is based on the ZeroMQ messaging library.
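
To illustrate what "small chunks of alignment work" can look like, the sketch below partitions the all-against-all pair space of two genomes into fixed-size jobs. It does not use ZeroMQ and is not the scheduler's actual partitioning scheme, which is not documented here; the function name `allall_chunks`, the chunk size and the 1-based entry numbers are assumptions for illustration.

```python
from itertools import islice, product


def allall_chunks(n_entries1, n_entries2, chunk_size):
    """Yield lists of (EntryNr1, EntryNr2) pairs, at most chunk_size pairs each."""
    pairs = product(range(1, n_entries1 + 1), range(1, n_entries2 + 1))
    while True:
        chunk = list(islice(pairs, chunk_size))
        if not chunk:
            return
        yield chunk


# 3 x 4 = 12 comparisons, handed out in jobs of up to 5 pairs.
chunks = list(allall_chunks(3, 4, chunk_size=5))
print(len(chunks), [len(c) for c in chunks])
```

Each chunk is independent of the others, which is what allows workers to process them in any order and on any host.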

Distributed workers

class siblings.aligner.AlignWorker(cfg, verbose=False)

AlignWorker contains the server-side code for computing alignments of siblings.

This class communicates through a broker (MajorDomoBroker) with the Siblings database. The database initializes chunks of all-against-all comparisons which this worker computes (in parallel).

compute_alignments(data)

Compute the significant alignments of a job request.

Parameters: data – a validated request object containing the sequences to be aligned
Return dict: a reply-ready dictionary containing a numpy array of type siblings.tablesformat.Matches holding the significant matches data.

run_device()

main loop, started by the MajorDomoWorker.

class siblings.aligner.Aligner(threshold)

A class to compute optimal pairwise alignments and to quickly test for homology.

This class provides methods to quickly test whether two amino acid sequences are homologous given a certain alignment threshold. Further, the class provides a method to compute the local and global optimal pairwise alignment using Smith-Waterman dynamic programming including distance optimisation, i.e. the scoring matrix is adjusted to the evolutionary distance and optimized for it.

fast_homology_test(seq1, seq2)

Quickly test whether two sequences are homologous.

The test uses pyopa’s implementation of the Smith-Waterman algorithm, which uses SSE instructions on a scoring matrix of shorts. See TODO:ref for details.

Parameters:
  • seq1 – the first sequence to be aligned
  • seq2 – the second sequence to be aligned
Return bool:

whether or not the sequences are homologous at the threshold level

log_evalue(opa, len1=None, len2=None)

Compute ln(evalue) for the given alignment.

optimal_pairwise_alignment(seq1, seq2, modif='Local')

Computes the optimal pairwise alignment between two sequences.

This method computes the global or local optimal pairwise alignment between two sequences. The algorithm optimizes for the evolutionary distance that maximizes the alignment score. Obtaining such an optimal alignment requires several Smith-Waterman alignments and is hence a costly operation.

Parameters:
  • seq1 – the first sequence to be aligned
  • seq2 – the second sequence to be aligned.
  • modif – (default Local) either Local or Global to specify which type of alignment should be computed.
Returns:

an OPA instance

percent_identity(opa)

return percent identity of the given alignment
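
As an illustration of the quantity involved, percent identity can be computed directly from two aligned strings such as the aligned_seq1 and aligned_seq2 fields of an OPA. This is a sketch, not the library's implementation: the gap character `-`, the choice of the full alignment length as denominator, and the function name `percent_identity_sketch` are all assumptions.

```python
def percent_identity_sketch(aligned_seq1, aligned_seq2, gap="-"):
    """Percentage of alignment columns holding the same residue in both rows."""
    if len(aligned_seq1) != len(aligned_seq2):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_seq1:
        return 0.0
    # A column counts as identical only if both rows agree and it is not a gap.
    same = sum(a == b and a != gap for a, b in zip(aligned_seq1, aligned_seq2))
    return 100.0 * same / len(aligned_seq1)


pid = percent_identity_sketch("MGW-IRG", "MGWAIKG")
print(round(pid, 2))
```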

class siblings.aligner.Match(EntryNr1, EntryNr2, Score, PamDistance, Start1, End1, Start2, End2, PamVariance, LogEValue, PIdent, Global_Score, Global_PamDistance, Global_PamVariance, Global_PIdent)
End1

Alias for field number 5

End2

Alias for field number 7

EntryNr1

Alias for field number 0

EntryNr2

Alias for field number 1

Global_PIdent

Alias for field number 14

Global_PamDistance

Alias for field number 12

Global_PamVariance

Alias for field number 13

Global_Score

Alias for field number 11

LogEValue

Alias for field number 9

PIdent

Alias for field number 10

PamDistance

Alias for field number 3

PamVariance

Alias for field number 8

Score

Alias for field number 2

Start1

Alias for field number 4

Start2

Alias for field number 6
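
The "Alias for field number" entries suggest that Match is a plain `collections.namedtuple`, so every field can be read by name or by position. A minimal sketch under that assumption, with the field order copied from the class signature and dummy values in place of real alignment data:

```python
from collections import namedtuple

# Field order copied from the class signature above.
Match = namedtuple("Match", [
    "EntryNr1", "EntryNr2", "Score", "PamDistance",
    "Start1", "End1", "Start2", "End2",
    "PamVariance", "LogEValue", "PIdent",
    "Global_Score", "Global_PamDistance", "Global_PamVariance", "Global_PIdent",
])

m = Match(*range(15))  # dummy values: each field holds its own index
print(m.End1, m[5], m.Global_PIdent)
```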

class siblings.aligner.OPA(score, pam, pamvar, env, start1, end1, start2, end2, aligned_seq1, aligned_seq2, typ)
aligned_seq1

Alias for field number 8

aligned_seq2

Alias for field number 9

end1

Alias for field number 5

end2

Alias for field number 7

env

Alias for field number 3

pam

Alias for field number 1

pamvar

Alias for field number 2

score

Alias for field number 0

start1

Alias for field number 4

start2

Alias for field number 6

typ

Alias for field number 10

siblings.aligner.json_numpy_obj_hook(dct)

Decodes a previously encoded numpy ndarray with proper shape and dtype.

Parameters: dct (dict) – encoded numpy array
Returns: decoded numpy array or original input
Return type: dict or numpy.array

Scheduler for Distributing Computations

class siblings.scheduler.AllAllJobScheduler(writer=None, cfg=None, verbose=False)

Scheduler of pending AllAll jobs.

This class is responsible for scheduling all-against-all jobs for new genomes, sending the relevant jobs to the workers and storing the results in the hdf5 database.

class siblings.scheduler.Writer(db_path, cfg)

The main database writer object.

This class is responsible for creating the hdf5 database and writing data to it.

add_genome(data)

Add a new genome to the database.

This method creates a new genome entry in the database and makes sure that the genome will eventually be processed.

Parameters:

data (dict) – genome data

Raises:
  • ImportDataError – if the genome already exists in the database
  • DBConsistencyError – internal unrecoverable problem with the database

mark_job_as_error(jobId)

Increase the error count of a job.

If the error count becomes too high, the job will no longer be sent out for computation.
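
The retry policy described here can be pictured as a per-job counter with a cut-off. The sketch below is purely illustrative: the names `mark_error` and `is_schedulable`, the in-memory dictionary and the limit `MAX_ERRORS = 3` are hypothetical, and the real threshold and storage are not documented here.

```python
MAX_ERRORS = 3  # hypothetical limit; the real threshold is not documented

error_counts = {}


def mark_error(job_id):
    """Increase the error count of a job and return the new count."""
    error_counts[job_id] = error_counts.get(job_id, 0) + 1
    return error_counts[job_id]


def is_schedulable(job_id):
    """A job keeps being sent out until it has failed MAX_ERRORS times."""
    return error_counts.get(job_id, 0) < MAX_ERRORS


mark_error("job-a")
print(is_schedulable("job-a"))
```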

store_reply(reply)

Store the reply of an align worker in PyTables.

Replies from alignment workers are processed in this method. The JobQueue and PairStatus tables are updated accordingly. Replies which do not conform to the expected format are ignored and an ImportDataError is raised.

The method also updates the statistics of the users contributing computations.

Siblings Database

The database module siblings.db provides methods to read data from a local copy of the hdf5 database files. We use PyTables to work with hdf5 files from Python. Unless you work with your own copy of the hdf5 database, you most likely will not have to worry about these internals.

class siblings.db.GenomeOrderSwapper(cols)

This class handles the swapping of column ids, column position and QueryFilters introduced by the ordering of the matches columns.

swap_columns_in_matches(data)

function to swap the columns in the matches array

class siblings.db.Reader(filename, mode='r', logger=None)

This is the main database class. It handles queries to the hdf5 siblings files.

columnnames

Returns all the column names of the Matches table

genomesid

returns the NCBI Taxonomy ID for all available genomes.

This is the internal Siblings genomes ID. See get_genomes() for a method to access summary information on the genomes.

get_genome_sequences(genome, cDNA=False, entry_nrs=None)

Get the protein sequences of a query genome.

The method returns a list of dicts with the protein sequence and – if selected with the boolean flag cDNA – the coding DNA sequence for a set of selected entries.

Parameters:
  • genome (int) – the NCBI taxonomy id of the query genome.
  • entry_nrs – the query entry numbers. By default, all entry numbers are returned.
  • cDNA (bool) – boolean flag whether or not the corresponding cDNA sequences should be returned.
Returns:

a list of dicts with EntryNr, ID, Sequence and cDNA elements for the requested entry numbers.

get_genomes(columns=None)

Returns summary information on the available genomes.

This method returns summary information in the form of a numpy.ndarray. The returned columns are

  • NCBITaxonId, the internal (stable) ID for the genome as assigned by NCBI.
  • TotEntries, the number of proteins in the genome
  • SciName, the scientific name of the genome
  • DBRelease, the source and release information (e.g. source of genome, assembly build and release date or version).
  • UniProtSpeciesCode, a (usually) 5-letter code for the species, managed and assigned by the UniProt consortium.
Returns: numpy.ndarray

get_homologs_of_gene(genome, entry, filter_obj=<siblings.db.QueryFilter object at 0x360b410>, returnid=None, **kw)

Returns all homologs for a given query gene in all other species.

Parameters:
  • genome – the NCBITaxonomy entry of the query gene’s genome.
  • entry – the entry number of the query gene.
  • filter_obj – a filtering object (siblings.db.QueryFilter) to select subsets of all possible matches, i.e. to select with a more stringent parameter set.
  • returnid – a ReturnIdType object to convert internal entry numbers to some supported xref ids.

get_matches_between_genomes(genome1, genome2, filter_obj=None, returnid=None, **kw)

This method returns the matches between a genome pair that fulfil the filtering criteria. The returned IDs are simply the EntryNr values in the respective genomes.

static query_filter(filter_str)

Converts a query string into a QueryFilter object

This is a convenience method which creates a QueryFilter object to work with.

Returns: a QueryFilter object
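
To make the idea of turning a filter string into structured conditions concrete, here is a small stand-alone parser. It is a sketch and not the QueryFilter implementation: the supported operators, the restriction to numeric values, and the names `parse_filter` and `row_passes` are assumptions for illustration.

```python
import operator
import re

OPS = {"<=": operator.le, ">=": operator.ge, "==": operator.eq,
       "<": operator.lt, ">": operator.gt}
COND = re.compile(r"^\s*(\w+)\s*(<=|>=|==|<|>)\s*(-?\d+(?:\.\d+)?)\s*$")


def parse_filter(filter_str):
    """Split 'A>1 & B<2' into (column, operator, value) triples."""
    conditions = []
    for part in filter_str.split("&"):
        m = COND.match(part)
        if m is None:
            raise ValueError(f"cannot parse condition: {part!r}")
        col, op, val = m.groups()
        conditions.append((col, op, float(val)))
    return conditions


def row_passes(row, conditions):
    """Check one match (a dict of column -> value) against all conditions."""
    return all(OPS[op](row[col], val) for col, op, val in conditions)


conds = parse_filter("Score>2000 & PamDistance<15")
print(conds)
print(row_passes({"Score": 2500.0, "PamDistance": 9.1}, conds))
```
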

returnid_factory(primary, fallback)

creates the necessary objects to retrieve matches with the desired type of Ids

class siblings.db.ReturnIdType(tab, primary, fallback=None)

The minimum number of matches to convert at which it becomes necessary to prefetch all xrefs of that species.

siblings.db.reverse_operator(op)

adjusted operators for reversed lhs and rhs
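
When the two sides of a comparison are exchanged, the operator has to be mirrored so that the condition keeps its meaning (`a < b` becomes `b > a`). A minimal sketch of such a mapping; the set of operators covered and the name `reverse_operator_sketch` are assumptions, not the module's actual code.

```python
REVERSED = {"<": ">", "<=": ">=", ">": "<", ">=": "<=", "==": "==", "!=": "!="}


def reverse_operator_sketch(op):
    """Return the operator that keeps 'a op b' true when rewritten as 'b op2 a'."""
    return REVERSED[op]


print(reverse_operator_sketch("<="))
```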

siblings.db.swap_columns(data, swapCols)

function to swap a set of column pairs in the data array
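
Swapping column pairs is what makes a matches table stored for genome pair (A, B) readable as (B, A). The sketch below shows the idea on plain lists of rows instead of the numpy data array the real function works on; the name `swap_columns_sketch` and the column layout in the example are assumptions.

```python
def swap_columns_sketch(rows, swap_cols):
    """Return new rows in which each (i, j) pair of column positions is exchanged."""
    swapped = []
    for row in rows:
        row = list(row)  # copy so the input is left untouched
        for i, j in swap_cols:
            row[i], row[j] = row[j], row[i]
        swapped.append(row)
    return swapped


# Columns: EntryNr1, EntryNr2, Score, Start1, End1, Start2, End2
rows = [[1, 7, 2150.5, 3, 120, 10, 131]]
result = swap_columns_sketch(rows, [(0, 1), (3, 5), (4, 6)])
print(result)
```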
