Loading genetic intervals: genes, transcripts and exons into the database

Genes, transcripts and exons should be loaded and updated at regular intervals. Depending on the type of sequencing data analysed with chanjo2, loading transcripts and exons might not be required. For instance, gene coordinates should be enough for whole genome sequencing (WGS) experiments, while transcript and exon data are necessary to return statistics for transcript- and exon-based experiments.

Genes, transcripts and exons are retrieved from the Ensembl Biomart using the Schug library and loaded into the database in three distinct tables.

Genes should be loaded into the database before transcript and exon intervals. Depending on the hardware in use and the internet connection speed, loading these intervals might take some time. For this reason, requests sent to these endpoints are asynchronous, so that they don't time out while the information is being processed.

Loading/updating database genes

Genes for a given genome build can be loaded by sending a POST request to the /intervals/load/genes/<genome-build> endpoint:

curl -X 'POST' \
  'http://localhost:8000/intervals/load/genes/GRCh38' \
  -H 'accept: application/json' \
  -d ''

Please note that loading genes into the database will erase any transcripts and exons with the same genome build that are already present in the database. This ensures that transcript and exon intervals stay up to date with the latest definitions of the genes loaded into the database.

Loading/updating transcripts

Transcripts can be loaded or updated by sending a POST request to the /intervals/load/transcripts/<genome-build> endpoint:

curl -X 'POST' \
  'http://localhost:8000/intervals/load/transcripts/GRCh38' \
  -H 'accept: application/json' \
  -d ''

Loading/updating exons

As with the previous endpoints, exons are loaded by sending a POST request to the /intervals/load/exons/<genome-build> endpoint:

curl -X 'POST' \
  'http://localhost:8000/intervals/load/exons/GRCh38' \
  -H 'accept: application/json' \
  -d ''
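
The loading order described above (genes first, then transcripts and exons) can be sketched as a small shell helper. This is only an illustration: the function names load_url and load_all_intervals and the CHANJO2_URL variable are hypothetical, and the sketch assumes that each POST request returns only once its load has finished, which is not stated explicitly here.

```shell
# Base URL of the chanjo2 instance (assumed; same host as the examples above).
CHANJO2_URL="${CHANJO2_URL:-http://localhost:8000}"

# Print the load endpoint URL for an interval type ($1) and a genome build ($2).
load_url() {
  printf '%s/intervals/load/%s/%s\n' "$CHANJO2_URL" "$1" "$2"
}

# Load genes first, then transcripts and exons, for the genome build in $1.
# Stops at the first request that fails.
load_all_intervals() {
  for interval_type in genes transcripts exons; do
    echo "Loading ${interval_type} for $1..." >&2
    # --fail makes curl exit with a non-zero status on HTTP errors.
    curl --fail -X POST "$(load_url "$interval_type" "$1")" \
      -H 'accept: application/json' -d '' || return 1
  done
}

# Usage (requires a running chanjo2 instance):
# load_all_intervals GRCh38
```

Because loading genes erases the transcripts and exons of the same build, running the three steps together keeps the three tables consistent with each other.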

Genes, transcripts and exons queries

Once the database is populated with genomic intervals data, it is possible to run queries to retrieve its content.

Genomic intervals can be queried using gene definitions. Genes can be provided as a query parameter in any of the following formats:

  • Ensembl gene IDs (use parameter ensembl_ids)
  • HGNC IDs (use parameter hgnc_ids)
  • HGNC symbols (use parameter hgnc_symbols)

Genome build is always a required parameter in these queries.

Examples:

  • Send a POST request to retrieve information on a list of genes using HGNC symbols:
{
  "build": "GRCh37",
  "hgnc_symbols": ["LAMA1","LAMA2"]
}
  • Retrieve the transcripts available for one or more genes described by Ensembl IDs:
curl -X 'POST' \
  'http://localhost:8000/intervals/transcripts' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "build": "GRCh37",
  "ensembl_gene_ids": [
    "ENSG00000101680", "ENSG00000196569"
  ]
}'
  • Retrieve all exons for genes with HGNC IDs: 6481 and 6482:
curl -X 'POST' \
  'http://localhost:8000/intervals/exons' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "build": "GRCh37",
  "hgnc_ids": [
    6481, 6482
  ]
}'
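
The first example in the list shows only the request body. A minimal sketch of sending such a body follows. It assumes that the genes query endpoint is /intervals/genes, by analogy with the transcripts and exons endpoints; that path is not confirmed here, and the helper names query_url and query_intervals are hypothetical.

```shell
# Base URL of the chanjo2 instance (assumed; same host as the examples above).
CHANJO2_URL="${CHANJO2_URL:-http://localhost:8000}"

# Print the query URL for an interval type ($1): genes, transcripts or exons.
# NOTE: the "genes" path is an assumption made by analogy with the other two.
query_url() {
  printf '%s/intervals/%s\n' "$CHANJO2_URL" "$1"
}

# POST the JSON body in $2 to the query endpoint of the interval type in $1.
query_intervals() {
  curl -X POST "$(query_url "$1")" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d "$2"
}

# Usage (requires a running chanjo2 instance):
# query_intervals genes '{"build": "GRCh37", "hgnc_symbols": ["LAMA1","LAMA2"]}'
```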

When none of the ensembl_ids, hgnc_ids or hgnc_symbols parameters is provided, these endpoints return up to 100 genes, transcripts or exons by default. To increase the number of returned entries, specify a custom value for the query limit parameter.
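
As an illustration of the limit parameter, the sketch below builds a request body that asks for more entries than the default. The JSON field name "limit" is an assumption based on the wording above, and limit_query_body is a hypothetical helper; check the API schema of your chanjo2 instance before relying on it.

```shell
# Print a JSON request body for a genome build ($1) with a custom limit ($2).
# NOTE: the field name "limit" is an assumption based on the text above.
limit_query_body() {
  printf '{"build": "%s", "limit": %d}\n' "$1" "$2"
}

# Usage (requires a running chanjo2 instance):
# curl -X POST 'http://localhost:8000/intervals/exons' \
#   -H 'accept: application/json' \
#   -H 'Content-Type: application/json' \
#   -d "$(limit_query_body GRCh37 500)"
```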