Entrez Direct
EDirect (Entrez Direct) is a command-line toolkit for retrieving sequences and other data from the NCBI databases based on given query terms. The package consists of several commands:
- Navigation functions support exploration within the Entrez databases:
  - esearch performs a new Entrez search using terms in indexed fields.
  - elink looks up neighbors (within a database) or links (between databases).
  - efilter filters or restricts the results of a previous query.
- Records can be retrieved in specified formats or as document summaries:
  - efetch downloads records or reports in a designated format.
- Desired fields from XML results can be extracted without writing a program:
  - xtract converts EDirect XML output into a table of data values.
- Several additional functions are also provided:
  - einfo obtains information on indexed fields in an Entrez database.
  - epost uploads unique identifiers (UIDs) or sequence accession numbers.
  - nquire sends a URL request to a web page or CGI service.
License
Free to use for all users. Public Domain notice.
Available
Puhti: 13.4
Usage
The EDirect commands listed above are activated by loading the biokit module (module load biokit).
After that you can, for example, use esearch and efetch to retrieve protein or nucleotide sequence entries whose annotation matches the given search terms. In search terms, you can use the wildcard character * to match any string. The search is case-insensitive: "Mus" and "mus" produce the same matches. You can also restrict your search to certain fields of the database (Keywords, Author, Organism, Accession, Gene name, Protein name, Sequence length, etc.). For sequence length, define a range with the syntax from:to, for example 120:125.
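The field restrictions described above can be combined into a single query string. The following is a minimal sketch, not a recommended query: it assumes the standard Entrez field tags [ORGN] (Organism) and [SLEN] (Sequence Length), and the search term "kinase*" and the choice of the protein database are hypothetical, chosen only to illustrate a wildcard. The guard around esearch just keeps the sketch runnable on machines where EDirect is not on the PATH.

```shell
# Unfielded wildcard term, an organism restriction, and a sequence-length range.
# "kinase*" is a hypothetical example term; [ORGN] and [SLEN] are Entrez field tags.
query='kinase* AND Mus musculus [ORGN] AND 120:125 [SLEN]'

# Run only the search step, so that the hit count can be inspected first.
if command -v esearch >/dev/null 2>&1; then
  esearch -db protein -query "$query"
else
  echo "esearch not found; the query would be: $query"
fi
```

Keeping the query in a single-quoted variable prevents the shell from expanding the * character before it reaches esearch.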
Normally, it is wise to first run just the esearch command to get an idea of how many hits are found.
For example, a search against the nucleotide database may report that 267791 hits were found:
<ENTREZ_DIRECT>
<Db>nucleotide</Db>
<WebEnv>NCID_1_7176041_130.14.18.48_9001_1567161450_1478919739_0MetA0_S_MegaStore</WebEnv>
<QueryKey>1</QueryKey>
<Count>267791</Count>
<Step>1</Step>
</ENTREZ_DIRECT>
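The hit count can also be checked in a script before any data is fetched. The sketch below works on an abridged copy of the output above and uses sed, so that it runs without EDirect installed; with EDirect available, piping the esearch output to xtract -pattern ENTREZ_DIRECT -element Count should give the same number. The threshold max_hits is a hypothetical value chosen for illustration.

```shell
# Abridged copy of the esearch output shown above (WebEnv line left out)
result='<ENTREZ_DIRECT>
  <Db>nucleotide</Db>
  <QueryKey>1</QueryKey>
  <Count>267791</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>'

# Print only the digits found between <Count> and </Count>
count=$(printf '%s\n' "$result" | sed -n 's:.*<Count>\([0-9]*\)</Count>.*:\1:p')

# max_hits is a hypothetical limit, not an EDirect setting
max_hits=1000
if [ "$count" -gt "$max_hits" ]; then
  echo "$count hits: consider refining the query"
else
  echo "$count hits: OK to fetch"
fi
```

For the sample above, the check reports 267791 hits and suggests refining the query.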
In this case it is reasonable to refine the search before the search definition is piped to the efetch command for the actual data retrieval. One search can include several search terms combined with logical operators (AND, OR, NOT). The matching sequences can be saved in several formats, for example FASTA or GenBank. The command below retrieves just one entry, the Lyngbya majuscula barbamide biosynthesis gene cluster, which contains a gene named barC.
esearch -db nucleotide -query "barc [GENE] AND Lyngbya majuscula [ORGN]" | efetch -format gb > barc_Lm.gb
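The same search can be saved in FASTA format by changing the -format argument of efetch. This is a minimal sketch under the assumption that EDirect is on the PATH; the output file name barc_Lm.fasta is a hypothetical choice, and the guard only keeps the example runnable where esearch is missing.

```shell
# Same query as in the GenBank example above
query='barc [GENE] AND Lyngbya majuscula [ORGN]'
outfile='barc_Lm.fasta'   # hypothetical output file name

if command -v esearch >/dev/null 2>&1; then
  # Search, then fetch the matching records as FASTA
  esearch -db nucleotide -query "$query" | efetch -format fasta > "$outfile"
else
  echo "esearch not found; nothing was fetched"
fi
```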