Siblings Documentation

The SIBLINGs project (SIB Large INtercomparison of Genomes) aims to provide the bioinformatics community with inter- and intra-genome homology searches, i.e. “all-against-all” gene comparisons from complete genomes. Such datasets of homologous gene relations can be processed further to identify gene families, e.g. for phylogenetic studies, as well as to identify paralogs and orthologs.

SIBLINGs is still experimental. Feel free to contact us to provide feedback or to get involved in the project.

The SIBLINGs project is supported by the Swiss Institute of Bioinformatics (SIB) through federal funding for OMA and OrthoDB. This collaboration aims to unify the “common denominator” of data processing for our orthology databases, which requires substantial effort for data preparation and substantial CPU processing time.

SIBLINGs consists of three parts: the SIBLINGs backend database, a RESTful web service providing endpoints to query the data and upload new genomes, and a broker-based infrastructure to compute alignments in a distributed way. The following sections briefly document these individual parts.

Siblings REST API documentation

The Siblings REST API provides endpoints to load protein alignment summaries, and the genomic sequences they are based on, for pairs of genomes from the siblings database.

This API is currently in beta status and therefore not yet completely stable.

You can use the Accept request header to select the format of the reply. With Accept: application/json in the request header, the reply is returned in JSON format, whereas with Accept: text/csv the result is returned as a tab-separated table.

The result format can also be selected for all endpoints with a format=fmt query parameter or, for POST requests, by adding a “format”: “fmt” parameter to the JSON-formatted body content, where fmt can be one of html, json and csv. Some endpoints also provide special formats such as fasta or blast.
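As a minimal sketch of the two selection mechanisms described above, the following Python helper builds a GET request that selects the reply format either through the Accept header or through the format query parameter. The function name build_request, the media-type table and the https scheme are illustrative assumptions; only the host name siblings.ch appears in the examples on this page.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed base URL: the examples on this page only give the host name.
BASE_URL = "https://siblings.ch"

# Media types named in the text above; other formats use the query parameter.
MEDIA_TYPES = {"json": "application/json", "csv": "text/csv"}


def build_request(path, fmt="json", use_header=True, **params):
    """Build (but do not send) a GET request with the reply format selected."""
    query = dict(params)
    headers = {}
    if use_header and fmt in MEDIA_TYPES:
        headers["Accept"] = MEDIA_TYPES[fmt]
    else:
        # Fall back to the format=fmt query parameter.
        query["format"] = fmt
    url = BASE_URL + path
    if query:
        url += "?" + urlencode(query)
    return Request(url, headers=headers)


req = build_request("/api/genomes", fmt="json")
print(req.full_url, req.get_header("Accept"))
```

The returned object can be passed to urllib.request.urlopen to perform the actual request.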

All endpoints that return lists will honor the offset and limit parameters to select only a subset of the results.
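A minimal sketch of how a client might walk through a long list with offset and limit. It assumes that offset counts items starting from 0 and that a page shorter than limit marks the end of the list; neither detail is spelled out here. The names iter_pages and fetch_page are hypothetical, and fetch_page stands in for whatever code performs the actual request.

```python
def iter_pages(fetch_page, page_size=100):
    """Yield all items of a list endpoint, fetching one page at a time.

    fetch_page(offset, limit) is a caller-supplied function that performs
    the request and returns the decoded list for that window.
    """
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        yield from page
        if len(page) < page_size:
            break  # a short (or empty) page means the list is exhausted
        offset += page_size


# Stand-in for a real request: slices a local list like a server might.
fake_items = list(range(250))
items = list(iter_pages(lambda off, lim: fake_items[off:off + lim]))
print(len(items))
```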

The following sections list the endpoints that we provide:

GET /api/version/?

Retrieve version information of siblings database and the REST API.

Example request:

GET /api/version HTTP/1.1
Host: siblings.ch
Accept: application/json

Example response:

HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8

{
    "api_version": "1.0",
    "dataset": "QfO Reference proteomes",
    "release": "2018-04"
}

GET /api/genomes/?

List all the available genomes.

This endpoint returns a list with summary information on the available genomes.

Example request:

GET /api/genomes HTTP/1.1
Host: siblings.ch
Accept: application/json

Example response:

HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8

[
  {
    "NCBITaxonId": 9606,
    "UniProtSpeciesCode": "HUMAN",
    "TotEntries": 31540,
    "SciName": "Homo sapiens",
    "DBRelease": "Ensembl 70; GRCh37; 12-DEC-2012"
  },
  {
    "NCBITaxonId": 9598,
    "UniProtSpeciesCode": "PANTR",
    "TotEntries": 30522,
    "SciName": "Pan troglodytes",
    "DBRelease": "Ensembl 75"
  }
]

GET /api/genomes/(?P<taxid>\d+)

Get the protein sequences of a query genome.

The method returns a list with the protein sequence and, if selected with the boolean cdna query parameter, the coding DNA sequence for a set of selected entries. If fasta format is requested and cdna is turned on, only the fasta-formatted cDNA sequences are returned; otherwise only the protein sequences are returned.

Parameters:
  • taxid (int) – the NCBI taxonomy id of the query genome.
Query Parameters:
 
  • cdna (bool) – whether or not to include the cDNA sequence in the output. Defaults to yes unless fasta format is requested.
  • offset (int) – offset number. default is 0
  • limit (int) – limit number. default is unlimited
  • format – response format, one of json, csv, html or fasta. if not specified, Accept is used.
Request Headers:
 
  • Accept – the response content type, i.e. the response format depends on Accept header, unless format is specified via the query parameter.
Response Headers:
 
  • Content-Type – this depends on Accept header of request or the format query parameter.
Return:

a list of dicts with EntryNr, ID, Sequence and cDNA elements for the requested entries.

Status Codes:

Example request:

GET /api/genomes/9606?limit=2 HTTP/1.1
Host: siblings.ch
Accept: application/json

Example response:

HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8

[
  {
    "EntryNr": 1,
    "ID": "A0A024R161",
    "cDNA": "ATGGGGGCGCCGCTGCTCTCTCCCGGCTGGGGAGCCGGGGCTGCCGGCCGGCGCTGGTGGATGCTGCTGGCGCCCCTGCTGCCGGCGCTGCTGCTGGTGCGGCCCGCGGGGGCCCTGGTGGAGGGGCTCTACTGCGGCACGCGGGACTGCTACGAGGTGCTGGGCGTGAGCCGCTCGGCGGGCAAGGCGGAGATCGCGCGGGCCTACCGCCAGCTGGCCCGGCGCTACCACCCTGACCGCTACCGGCCCCAGCCCGGAGACGAGGGCCCCGGGCGGACGCCGCAGAGCGCCGAGGAGGCTTTCCTGCTGGTGGCAACCGCCTACGAGACACTCAAGGTCTCTCAGGCAGCTGCAGAGCTTCAACAGTACTGTATGCAGAATGCCTGCAAGGATGCCCTGCTGGTGGGTGTTCCAGCTGGAAGTAACCCCTTCCGGGAGCCTAGATCCTGTGCTTTACTCTGA",
    "Sequence": "MGAPLLSPGWGAGAAGRRWWMLLAPLLPALLLVRPAGALVEGLYCGTRDCYEVLGVSRSAGKAEIARAYRQLARRYHPDRYRPQPGDEGPGRTPQSAEEAFLLVATAYETLKVSQAAAELQQYCMQNACKDALLVGVPAGSNPFREPRSCALL"
  },
  {
    "EntryNr": 2,
    "ID": "A0A024R1R8",
    "cDNA": "ATGTCCAGCCACGAAGGTGGCAAGAAGAAGGCACTGAAACAGCCCAAGAAGCAGGCCAAGGAGATGGACGAGGAAGAGAAGGCTTTCAAGCAGAAACAAAAAGAGGAGCAGAAGAAACTCGAGGTGCTAAAAGCGAAGGTCGTGGGGAAGGGGCCTCTGGCCACAGGTGGAATTAAGAAATCTGGCAAAAAATAA",
    "Sequence": "MSSHEGGKKKALKQPKKQAKEMDEEEKAFKQKQKEEQKKLEVLKAKVVGKGPLATGGIKKSGKK"
  }
]
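To illustrate how the JSON reply relates to the fasta format, here is a small sketch that turns decoded entries into FASTA text on the client side. The function entries_to_fasta is a hypothetical helper, and using only the ID as the header line is an assumption; the headers produced by the server's own fasta format may differ.

```python
def entries_to_fasta(entries, use_cdna=False, width=60):
    """Format decoded genome entries as FASTA text, wrapped at `width`."""
    key = "cDNA" if use_cdna else "Sequence"
    records = []
    for entry in entries:
        seq = entry[key]
        lines = [seq[i:i + width] for i in range(0, len(seq), width)]
        records.append(">" + entry["ID"] + "\n" + "\n".join(lines))
    return "\n".join(records) + "\n"


# Second entry of the example response above (cDNA omitted for brevity).
entries = [
    {
        "EntryNr": 2,
        "ID": "A0A024R1R8",
        "Sequence": "MSSHEGGKKKALKQPKKQAKEMDEEEKAFKQKQKEEQKKLEVLKAKVVGKGPLATGGIKKSGKK",
    }
]
print(entries_to_fasta(entries), end="")
```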

GET /api/matches/(?P<taxid1>\d+)/(?P<taxid2>\d+)/?

Return homologous matches between a genome pair

Summary information about the homologs between the genome pair is returned in the form of a matrix, where each row corresponds to a relation between two proteins. The columns of the matrix hold the properties of the alignment, i.e.

  • the two homologous proteins (EntryNr1 and EntryNr2)
  • the alignment Score of the local alignment, computed using the Smith-Waterman algorithm
  • a maximum likelihood estimate of the evolutionary distance, PamDistance, in PAM units (percent accepted mutations). In addition, PamVariance is an ML estimate of the variance of PamDistance.
  • the ranges of the two proteins that span the local alignment (Start1, End1, Start2, End2). Positions are relative to the start of the protein sequences and are 1-based, e.g. if the alignment includes the first amino acid, the range will start at 1.
  • the LogEValue, a transformed E-Value of the alignment using the ln() function to save precision.
  • the percent identity of the alignment in PIdent.
  • Global_{Score, PamDistance, PamVariance, PIdent} store the same properties based on the global alignment.

For the blast and csv format, the result corresponds to a tabular blast output, i.e. blast output format 7. However, the length, mismatch and gapopen columns will be set to 0 as this information is currently not stored in SIBLINGs. The bit-score column contains a normalised score of the local alignment.
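The tabular reply can be read with a few lines of Python. This sketch assumes the standard twelve-column BLAST tabular layout with tab-separated fields and comment lines starting with “#”; the exact header lines and number formatting produced by SIBLINGs are not specified here. The function name parse_blast_tabular is hypothetical, and the sample row is hand-written for illustration rather than actual server output.

```python
# Column order of the standard BLAST tabular output.
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def parse_blast_tabular(text):
    """Parse BLAST tabular text into a list of dicts (values kept as str)."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and format-7 comment lines
        fields = line.split("\t")
        rows.append(dict(zip(BLAST_COLUMNS, fields)))
    return rows


# Hand-written placeholder input, not real SIBLINGs output.
sample = "# Fields: ...\nquery1\tsubject1\t50.0\t0\t0\t0\t1\t10\t1\t10\t1e-5\t20.0\n"
for row in parse_blast_tabular(sample):
    print(row["qseqid"], row["sseqid"], row["bitscore"])
```

Values are kept as strings so that no assumption is made about how the server formats numbers.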

Parameters:
  • taxid1 – Taxonomy Id of first genome
  • taxid2 – Taxonomy Id of second genome
Query Parameters:
 
  • offset – offset number. default is 0
  • limit – limit number. default is unlimited
  • idtype – source or numeric. defaults to source.
  • format – return format, one of json, csv, html or blast. csv is same as blast. if not specified, Accept is used.
Request Headers:
 
  • Accept – the response content type depends on Accept header, unless format query parameter is specified.
Response JSON Object:
 
  • colnames – column names / properties of the alignments.
  • data – the data matrix. Each row corresponds to one significant alignment.
Response Headers:
 
  • Content-Type – this depends on Accept header of request or the format query parameter.
Status Codes:

Example request:

GET /api/matches/9606/9606?limit=2 HTTP/1.1
Host: siblings.ch
Accept: application/json

Example response:

HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8

{
  "colnames": [
    "Protein1",
    "Protein2",
    "Score",
    "PamDistance",
    "Start1",
    "End1",
    "Start2",
    "End2",
    "PamVariance",
    "LogEValue",
    "PIdent",
    "Global_Score",
    "Global_PamDistance",
    "Global_PamVariance",
    "Global_PIdent"
  ],
  "data": [
    [
      "A0A087WZY1",
      "A0A1B0GTR4",
      151.03562927246094,
      126.0,
      148,
      250,
      2,
      99,
      348.0545959472656,
      -14.15829086303711,
      0.4095238149166107,
      -93.56526947021484,
      130.0,
      326.6018371582031,
      0.1760299652814865
    ],
    [
      "A0A087X0B3",
      "A0A1B0GTQ1",
      130.96060180664062,
      161.0,
      37,
      152,
      54,
      169,
      469.6202087402344,
      -9.248920440673828,
      0.2905983030796051,
      -191.30209350585938,
      189.0,
      488.39263916015625,
      0.12181302905082703
    ]
  ]
}
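As a sketch of how a client might work with the colnames/data layout, the following pairs each row with the column names and recovers the E-value from LogEValue by exponentiation, since LogEValue is documented as the ln() of the E-value. The helper names matches_to_records and evalue are hypothetical, and the sample reply is trimmed to three columns of the first row above.

```python
import math


def matches_to_records(reply):
    """Turn a decoded matches reply into a list of dicts keyed by column name."""
    colnames = reply["colnames"]
    return [dict(zip(colnames, row)) for row in reply["data"]]


def evalue(record):
    """Recover the E-value from LogEValue (natural log of the E-value)."""
    return math.exp(record["LogEValue"])


# Trimmed version of the example response above.
reply = {
    "colnames": ["Protein1", "Protein2", "LogEValue"],
    "data": [["A0A087WZY1", "A0A1B0GTR4", -14.15829086303711]],
}
for rec in matches_to_records(reply):
    print(rec["Protein1"], rec["Protein2"], evalue(rec))
```

For strongly negative values, math.exp underflows to 0.0 in double precision, which illustrates why the logarithm is stored instead of the E-value itself.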

POST /api/matches/(?P<taxid1>\d+)/(?P<taxid2>\d+)/?

Return homologous matches between a genome pair with filtering options.

Same as the GET endpoint, but allows additional filtering by passing a filter string attribute. Several conditions can be combined with the “&” operator (logical and), and conditions can refer to any attribute of the returned matches table. The example request retrieves only high-scoring matches (Score > 2000) that are separated by an estimated evolutionary distance of less than 15 PAM (PamDistance < 15).

Query Parameters:
 
  • offset – offset number. default is 0
  • limit – limit number. default is unlimited
  • idtype – source or numeric. defaults to source.
  • format – response format, one of json, csv, html or blast. csv is same as blast. if not specified, Accept is used.
Request JSON Object:
 
  • filter (str) – filter condition.
  • offset (int) – offset number. Body data variant has precedence over query parameter.
  • limit (int) – limit number. Body data variant has precedence over query parameter.
  • idtype (str) – source or numeric. Body data variant has precedence over query parameter.
  • format (str) – response format, one of json, csv, html or blast. Body data variant has precedence over query parameter.
Request Headers:
 
  • Accept – the response content type, i.e. the response format depends on Accept header, unless format is specified via json body object or query parameter.
Response JSON Object:
 
  • colnames – column names / properties of the alignments.
  • data – the data matrix. Each row corresponds to one significant alignment.
Response Headers:
 
  • Content-Type – this depends on Accept header of request or the format query parameter.
Status Codes:
  • 200 OK – success
  • 400 Bad Request – the query or filter condition cannot be parsed.
  • 404 Not Found – at least one taxonomic id of the query genomes was not found in the dataset.

Example request:

Retrieve the first 2 homologous matches between Homo sapiens and Mus musculus that have a score >2000 and an evolutionary distance of less than 15 PAM:

POST /api/matches/9606/10090 HTTP/1.1
Host: siblings.ch
Accept: application/json
Content-Type: application/json

{"filter": "Score>2000 & PamDistance<15", "limit": 2}
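The request above could be assembled in Python roughly as follows. This is a sketch only: build_filter and build_matches_post are hypothetical helper names, and the https scheme is assumed. The request is built but not sent.

```python
import json
from urllib.request import Request


def build_filter(*conditions):
    """Join individual conditions with the logical-and operator '&'."""
    return " & ".join(conditions)


def build_matches_post(taxid1, taxid2, conditions, **options):
    """Build (but do not send) a filtered matches request."""
    body = {"filter": build_filter(*conditions)}
    body.update(options)  # e.g. limit, offset, idtype, format
    return Request(
        "https://siblings.ch/api/matches/{}/{}".format(taxid1, taxid2),
        data=json.dumps(body).encode("utf-8"),
        headers={"Accept": "application/json",
                 "Content-Type": "application/json"},
        method="POST",
    )


req = build_matches_post(9606, 10090, ["Score>2000", "PamDistance<15"], limit=2)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen would send it; a filter that cannot be parsed is reported with status 400, as listed under the status codes.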

Example response:

HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8

{
  "colnames": [
    "Protein1",
    "Protein2",
    "Score",
    "PamDistance",
    "Start1",
    "End1",
    "Start2",
    "End2",
    "PamVariance",
    "LogEValue",
    "PIdent",
    "Global_Score",
    "Global_PamDistance",
    "Global_PamVariance",
    "Global_PIdent"
  ],
  "data": [
    [
      "Q16875",
      "A7UAK5",
      6065.7998046875,
      1.8997074365615845,
      0,
      504,
      0,
      533,
      0.3770505487918854,
      -1368.8863525390625,
      0.9288389682769775,
      5947.71533203125,
      2.2156643867492676,
      0.4352022111415863,
      0.8701754212379456
    ],
    [
      "Q5QP82",
      "A2AKB9",
      5497.88330078125,
      11.26827621459961,
      0,
      558,
      0,
      565,
      2.270573616027832,
      -1240.88671875,
      0.8838028311729431,
      5497.88330078125,
      11.26827621459961,
      2.270573616027832,
      0.8838028311729431
    ]
  ]
}

GET /(?:api/)?timings/?

Get summary timing statistics of all users contributing computations.

The method returns a dictionary with the number of contributed all-against-all units, their average duration and the total contributed CPU time in seconds per user (and host).

If an HTML page is requested (using the Accept header), this information is rendered as an HTML page including graphics; otherwise a JSON-encoded object is returned.

Status Codes:
