NCBI
seqq can import Records from databases maintained by the National Center for Biotechnology Information (NCBI). This tutorial describes how to set up the Integration.
Background
The National Center for Biotechnology Information (NCBI) maintains a list of BLAST databases. seqq downloads the Records in these databases from the BLAST FTP Site.
Entries
Every Record in the NCBI-maintained BLAST databases maps to an entry in another database, such as GenBank, RefSeq, or PDB.
For example, the Nucleotide database (nt) has a Record containing the TCP1-beta, AXL2, and REV7 genes with a GenBank entry:
LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      Saccharomyces cerevisiae (baker's yeast)
...
Source: Sample GenBank Record
Both the Accession and GenInfo Identifiers are globally unique IDs that can map BLAST hits back to Records.
The two systems of identifiers run in parallel to each other. That is, when any change is made to a sequence, it both receives a new GI number, and the version part of its accession number is incremented by 1.
Source: GenBank Record Identifiers
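To make the two identifiers concrete, the sketch below pulls the accession, its version, and the GenInfo Identifier out of the VERSION line of the sample record above. The helper name parse_version_line is hypothetical and not part of seqq; it only assumes the "ACCESSION.VERSION GI:NUMBER" layout shown in the sample.

```python
import re

# Hypothetical helper, not part of seqq. It assumes a VERSION line laid out
# like the sample record above: "VERSION U49845.1 GI:1293613".
VERSION_LINE = re.compile(
    r"^VERSION\s+(?P<accession>[A-Za-z0-9_]+)\.(?P<version>\d+)\s+GI:(?P<gi>\d+)$"
)


def parse_version_line(line: str) -> dict:
    """Split a GenBank VERSION line into accession, version, and GI number."""
    match = VERSION_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a VERSION line with a GI number: {line!r}")
    return {
        "accession": match.group("accession"),
        "version": int(match.group("version")),
        "gi": match.group("gi"),
    }


parsed = parse_version_line("VERSION U49845.1 GI:1293613")
print(parsed["accession"], parsed["version"], parsed["gi"])
```

The GI number is kept as a string because it is used as an identifier, not as a quantity.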
Manifests
The NCBI Integration requires that you provide the URL of the BLAST manifest for the database you are importing Records from.
Every NCBI database has a manifest that describes the size and files of the BLAST database. For example, the nt (Nucleotide) database has a manifest that points at the more than 100 files that make up the BLAST database. Smaller databases, like the 16S ribosomal RNA database, have a manifest that points at a single BLAST database file. These are all hosted and available at the BLAST FTP site.
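As a rough illustration of what "points at the files" means, the sketch below reads a manifest and counts the files it lists. The manifest shown is made up for the example: the field names ("dbname", "files") and the example.org URL are assumptions, so check them against the manifest you actually download.

```python
import json

# Illustrative manifest only. The field names and every value here are
# assumptions for this sketch, not a copy of a real NCBI manifest.
SAMPLE_MANIFEST = """
{
  "dbname": "example_db",
  "files": ["ftp://example.org/blast/db/example_db.tar.gz"]
}
"""


def summarize_manifest(text: str) -> dict:
    """Return the database name and the files a manifest points at."""
    manifest = json.loads(text)
    files = manifest.get("files", [])
    return {
        "database": manifest.get("dbname"),
        "file_count": len(files),
        "files": files,
    }


summary = summarize_manifest(SAMPLE_MANIFEST)
print(f"{summary['database']}: {summary['file_count']} file(s)")
```

A large database such as nt would report many files here, while a small one would report a single file.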
Imported Record Names
For every Record in an NCBI database, seqq creates a new Record with a Name that ends in the GenInfo Identifier of the Record. The Name can be prefixed using the recordIdPrefix setting on the Integration.
For example, take the Integration below:
{
  "collection": "collections/my-collection",
  "recordIdPrefix": "my-nucleotides",
  "ncbi": {
    "manifest": "https://ftp.ncbi.nlm.nih.gov/blast/db/nt-nucl-metadata.json"
  }
}
It would create a new Record with the Name collections/my-collection/records/my-nucleotides/1293613 for the SCU49845 record:
Collection name | Record ID prefix | GenInfo ID |
---|---|---|
collections/my-collection | records/my-nucleotides | 1293613 |
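The naming rule can be sketched as a small function. This only mirrors the example Name given above; record_name is a hypothetical helper, not seqq's implementation, and it does not cover an Integration without a recordIdPrefix.

```python
def record_name(collection: str, record_id_prefix: str, gi: str) -> str:
    """Build a Record Name the way the example above describes.

    Hypothetical helper for illustration: collection name, then "records",
    then the Integration's recordIdPrefix, then the GenInfo Identifier.
    """
    return f"{collection}/records/{record_id_prefix}/{gi}"


name = record_name("collections/my-collection", "my-nucleotides", "1293613")
print(name)
```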
Create the Integration
To create an NCBI Integration you need to:
- get the URL of the FTP manifest for the NCBI database
- call seqq's Create Integration endpoint with ncbi.manifest set to that URL
Find the URL of the manifest for the BLAST database to import. The database files and their manifests are listed at the BLAST FTP Site under /blast/db: https://ftp.ncbi.nlm.nih.gov/blast/db/.
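Both manifest URLs used in this tutorial follow the same pattern: the database name followed by -nucl-metadata.json. The sketch below builds a URL from that pattern. The pattern is inferred from those two examples only, and nucleotide_manifest_url is a hypothetical helper, so confirm the file exists in the directory listing before using the result.

```python
BLAST_DB_BASE = "https://ftp.ncbi.nlm.nih.gov/blast/db/"


def nucleotide_manifest_url(database: str) -> str:
    """Guess the manifest URL of a nucleotide BLAST database.

    Assumption: the "<database>-nucl-metadata.json" naming seen in this
    tutorial's two examples. Verify the file in the /blast/db listing.
    """
    return f"{BLAST_DB_BASE}{database}-nucl-metadata.json"


print(nucleotide_manifest_url("16S_ribosomal_RNA"))
```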
Call the Create Integration endpoint of seqq with the manifest field set to the URL of the manifest file on the BLAST FTP Site. The example below shows how to create an Integration that imports Records from the 16S Ribosomal RNA database:
curl -X POST -s "api.seqq.io/v1alpha/collections/my-collection/integrations?integrationId=my-integration" \
  --data-raw '
  {
    "recordIdPrefix": "my-nucleotides",
    "ncbi": {
      "manifest": "https://ftp.ncbi.nlm.nih.gov/blast/db/16S_ribosomal_RNA-nucl-metadata.json"
    }
  }'
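The same request can be sketched in Python with the standard library. This is a minimal sketch, not an official client: the https scheme and the Content-Type header are assumptions (the curl example specifies neither), and build_create_integration_request is a hypothetical helper. The sketch only builds the request; sending it and any authentication the API may require are left out.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the API is served over HTTPS; the curl example gives no scheme.
API_BASE = "https://api.seqq.io/v1alpha"


def build_create_integration_request(
    collection_id: str, integration_id: str, record_id_prefix: str, manifest: str
) -> urllib.request.Request:
    """Build (but do not send) the Create Integration request shown above."""
    query = urllib.parse.urlencode({"integrationId": integration_id})
    url = f"{API_BASE}/collections/{collection_id}/integrations?{query}"
    body = {"recordIdPrefix": record_id_prefix, "ncbi": {"manifest": manifest}}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        # Assumption: the endpoint accepts a JSON content type.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_create_integration_request(
    "my-collection",
    "my-integration",
    "my-nucleotides",
    "https://ftp.ncbi.nlm.nih.gov/blast/db/16S_ribosomal_RNA-nucl-metadata.json",
)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would perform the call.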