Loading datasets and variants

Adding data to the database

Dataset and variant data can be loaded into the database using specific the specific command line. To visualize command line options, from the terminal you can user the following command: beacon --help.

The default procedure to add variants to the beacon is always the following:

Create a dataset to link your variants to.
Load variants from a VCF file for one or more samples, specifying which dataset these variants belong to.

How to add:

Demo data
Gene data
Creating an authorized user for using the APIs
A new dataset (custom data)
Variants (custom data) using the command line
Variants (custom data) using the REST API

Demo data

Demo data consisting in a test dataset with public access and a set of variants (SNVs and structural variants of different type) is available under the cgbeacon2/resources/demo folder. You don't need to load this data manually since the following command will take care of everything:

beacon add demo

Adding/updating gene data

In order to accept add variants requests containing the genes option, the database should be pre-populated with gene data. VCF files can in fact be filtered by genes only if gene information containing chromosome, start and stop coordinates are already available when the variants load command is executed.

To load genes into database or to update the database gene collection, run the following command:

beacon update genes

Options:
  -build [GRCh37|GRCh38]  Genome assembly (default:GRCh37)

Creating an authorized user for using the APIs

An API user is required whenever variants are by sending a request to the Beacon API. One or more API users can be created using the command:

beacon add user


Options:
  --uid TEXT    User ID  [required]
  --name TEXT  User name  [required]
  --token TEXT  If not specified, the token will be created automatically
  --desc TEXT  User description
  --url TEXT   User url
  --help      Show this message and exit.

If no token is specified, a random user token will be created in the database for this user.

Adding a new dataset

A new dataset can be created with the following command:

beacon add dataset --did <dataset_id> --name <"A dataset name"> --build <GRCh37|GRCh38> --authlevel <public|registered|controlled>

The above parameters (ds-id, name, build, authlevel) are mandatory. If user doesn't specify any genome build then the default build used is GRCh37. One dataset can be associated to variants called using only one genome build. authlevel parameter will be used in queries to return results according to the request authentication level.

Public datasets can be interrogated by any beacon and any user in general and should not be used to store sensitive data such as individual phenotypes.
Bona fide researchers logged in via the Elixir AAI will be able to access data store in registered datasets.
Controlled access datasets might be used to store sensitive information and will be accessed only by users that have a signed agreement and their access approved by a Data Access Committee (DAC).

More info about the Elixir AAI authentication is available here

Other optional parameters that can be provided to improve the dataset description are the following.

  --desc TEXT                      dataset description
  --version FLOAT                  dataset version, i.e. 1.0
  --url TEXT                       external url
  --cc TEXT                        consent code key. i.e. HMB
  --update

The --update flag will allow to modify the information for a dataset that is already existing in the database.

Adding variant data using the command line

Variant data can be loaded to the database using the following command:

beacon add variants

Options:
  --ds TEXT      dataset ID  [required]
  --vcf PATH     [required]
  --sample TEXT  one or more samples to save variants for  [required]
  --panel PATH   one or more bed files containing genomic intervals

ds (dataset id) and vcf (path to the VCF file containing the variants) are mandatory parameters. One or more samples included in the VCF file must also be specified. To specify multiple samples use the -sample parameter multiple times (example -sample sampleA -sample sampleB ..).

VCF files might as well be filtered by genomic intervals prior to variant uploading. To upload variants filtered by multiple panels use the options -panel panelA -panel panelB, providing the path to a bed file containing the genomic intervals of interest.

Additional variants for the same sample(s) and the same dataset might be added any time by running the same beacon add variants specifying another VCF file. Whenever the variant is already found for the same sample and the same dataset it will not be saved twice.

Adding variant data using the REST API

Variant data can be alternatively loaded to the Beacon by sending a request to the /apiv1.0/add endpoint. This Endpoint is accepting json data from POST requests. If the request parameters are correct it will return a response with code 200 (success) and message "Saving variants to Beacon", whole it will start the actual thread that will save variants to database.

Sending an add request to the API

Apart from the header, an add request should contain the following parameters: - dataset_id (mandatory): string dentifier for a dataset - vcf_path (mandatory): path to variants VCF file - assemblyId (mandatory) : Genome build used in variant calling ("GRCh37", "GRCh38") - samples (mandatory): list of samples to extract variants from in VCF file - genes (optional): an object containing two keys: - ids: list of genes ids to be used to filter VCF file (only variants included in these genes will be saved to database). - id_type*: either "HGNC" or "Ensembl", to specify which type of ID format ids refers to. All genes in the list must be of the same type (for example all Ensembl IDs).

HTML Requests to add variants should contain an auth_token header which corresponds to the token of a pre-existing API user. API users are created exclusively using the command line. Follow these instructions to create a new API user.

Once the user is created in the database, make sure the request to the add API contains the following header parameters: - Content-Type: application/json - X-Auth-Token: auth_token

Example of a valid POST request to the add endpoint:

curl -X POST \
  -H 'Content-Type: application/json' \
  -H 'X-Auth-Token: auth_token' \
  -d '{"dataset_id": "test_public",
  "vcf_path": "path/to/cgbeacon2/cgbeacon2/resources/demo/test_trio.vcf.gz",
  "samples" : ["ADM1059A1", "ADM1059A2"],
  "genes" : {"ids": [17284, 29669, 11592], "id_type":"HGNC"},
  "assemblyId": "GRCh37"}' http://localhost:5000/apiv1.0/add

In order for the genes option to work, it is necessary to load genes data into the database via the command line*. Instructions on how to load genes info into the database are available here