Bioinformatics Scribbles (생물정보학 끄적끄적)

[Linux] Accessing NCBI Databases with the Entrez Web API and Entrez Direct

Hazel Y. 2023. 12. 21. 18:35

Let's use the Entrez web API and Entrez Direct to fetch the data for accession number AF086833.2 (https://www.ncbi.nlm.nih.gov/nuccore/AF086833.2).

 


 

 

1. Entrez Web API

$ curl -s 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=AF086833.2&db=nuccore&rettype=fasta' | head
>AF086833.2 Ebola virus - Mayinga, Zaire, 1976, complete genome
CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAA
TTTTCCTCTCATTGAAATTTATATCGGAATTTAAATTGAAATTGTTACTGTAATCACACCTGGTTTGTTT
CAGAGCCACATCACAAAGATAGAGAACAACCTAGGTCTCCGAAGGGAGCAAGGGCATCAGTGTGCTCAGT
TGAAAATCCCTTGTCAACACCTAGGTCTTATCACATCACAAGTTCCACCTCAGACTCTGCAGGGTGATCC
AACAACCTTAATAGAAACATTATTGTTAAAGGACAGCATTAGTTCACAGTCAAACAAGCAAGATTGAGAA
TTAACCTTGGTTTTGAACTTGAACACTTAGGGGATTGAAGATTCAACAACCCTAAAGCTTGGGGTAAAAC
ATTGGAAATAGTTAAAAGACAAATTGCTCGGAATCACAAAATTCCGAGTATGGATTCTCGTCCTCAGAAA
ATCTGGATGGCGCCGAGTCTCACTGAATCTGACATGGATTACCACAAGATCTTGACAGCAGGTCTGTCCG
TTCAACAGGGGATTGTTCGGCAAAGAGTCATCCCAGTGTATCAAGTAAACAATCTTGAAGAAATTTGCCA

 

→ Building request URLs by hand like this is not well suited to exploratory analysis.
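To make the URL structure above easier to see, here is a minimal sketch that assembles the same request from its three parameters. The function name `build_efetch_url` is a hypothetical helper invented for illustration and is not part of E-utilities; only the base URL and the `id`, `db`, and `rettype` parameters come from the command shown above.

```shell
# Hypothetical helper (not part of E-utilities): assemble an efetch URL
# from a database name, an accession, and a return type.
build_efetch_url() {
    local db="$1" id="$2" rettype="$3"
    local base='https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'
    printf '%s?id=%s&db=%s&rettype=%s\n' "$base" "$id" "$db" "$rettype"
}

# Build the URL used in the curl example and print it.
url=$(build_efetch_url nuccore AF086833.2 fasta)
echo "$url"

# To actually download the record (needs network access):
#   curl -s "$url" | head
```

Even with a helper like this, every new query means editing parameters by hand, which is presumably why the command-line tools in the next section are more comfortable for exploration.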

 

 

2. Entrez Direct

 

(1) GenBank format

$ efetch -db nuccore -format gb -id AF086833 | head
LOCUS       AF086833               18959 bp    cRNA    linear   VRL 13-FEB-2012
DEFINITION  Ebola virus - Mayinga, Zaire, 1976, complete genome.
ACCESSION   AF086833
VERSION     AF086833.2
KEYWORDS    .
SOURCE      Ebola virus - Mayinga, Zaire, 1976 (EBOV-May)
  ORGANISM  Ebola virus - Mayinga, Zaire, 1976
            Viruses; Riboviria; Orthornavirae; Negarnaviricota;
            Haploviricotina; Monjiviricetes; Mononegavirales; Filoviridae;
            Orthoebolavirus; Orthoebolavirus zairense.

 

(2) FASTA format

$ efetch -db nuccore -format fasta -id AF086833 > AF086833.fa

$ cat AF086833.fa | head -2
>AF086833.2 Ebola virus - Mayinga, Zaire, 1976, complete genome
CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAA

 

(3) Extracting only a specific range of the sequence

- e.g., from the first base through the third base

$ efetch -db nuccore -format fasta -id AF086833 -seq_start 1 -seq_stop 3
>AF086833.2:1-3 Ebola virus - Mayinga, Zaire, 1976, complete genome
CGG
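If the FASTA file has already been downloaded, roughly the same slice can be taken locally without another request. The sketch below is an assumption-laden illustration: `fasta_slice` is a hypothetical helper, it only handles a single-record FASTA file, and it uses 1-based inclusive coordinates to mirror `-seq_start` and `-seq_stop`.

```shell
# Hypothetical helper: print bases START..STOP (1-based, inclusive) of a
# single-record FASTA file, ignoring the header line and line wrapping.
fasta_slice() {
    local file="$1" start="$2" stop="$3"
    grep -v '^>' "$file" | tr -d '\n' | cut -c "${start}-${stop}"
}

# Demo on the first two sequence lines of AF086833.2 shown earlier.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
>AF086833.2 Ebola virus - Mayinga, Zaire, 1976, complete genome
CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAA
TTTTCCTCTCATTGAAATTTATATCGGAATTTAAATTGAAATTGTTACTGTAATCACACCTGGTTTGTTT
EOF
fasta_slice "$tmp" 1 3    # prints CGG
rm -f "$tmp"
```

Unlike efetch, this does not rewrite the header with the `:1-3` range, so it is only a quick check rather than a replacement.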

 

(4-1) forward strand sequence

$ efetch -db nuccore -format fasta -id AF086833 -seq_start 1 -seq_stop 5 -strand 1
>AF086833.2:1-5 Ebola virus - Mayinga, Zaire, 1976, complete genome
CGGAC

 

→ 1-5 (forward)

 

(4-2) reverse strand sequence

$ efetch -db nuccore -format fasta -id AF086833 -seq_start 1 -seq_stop 5 -strand 2
>AF086833.2:c5-1 Ebola virus - Mayinga, Zaire, 1976, complete genome
GTCCG

 

→ c5-1 (reverse)
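The reverse-strand result is the reverse complement of the forward-strand result, which can be checked locally. This is a minimal sketch: `revcomp` is a hypothetical helper, it assumes the `rev` utility (from util-linux) is installed, and it only handles the four unambiguous bases.

```shell
# Hypothetical helper: reverse-complement a DNA string.
# Only A, C, G, T (upper or lower case) are translated; other characters
# such as IUPAC ambiguity codes pass through unchanged.
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

revcomp CGGAC   # prints GTCCG
```

Feeding it the forward output `CGGAC` gives `GTCCG`, matching what efetch returned with `-strand 2`.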

 

 

※ What if you know a BioProject accession number instead of a GenBank accession number?

 

(1) nucleotide data

$ esearch -db nucleotide -query PRJNA257197
<ENTREZ_DIRECT>
  <Db>nucleotide</Db>
  <WebEnv>MCID_6583f228c48f554d4438cc56</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>249</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>
# The XML above describes a search environment that can be piped into other Entrez Direct programs

$ esearch -db nucleotide -query PRJNA257197 | efetch -format fasta > genomes.fa

$ cat genomes.fa | head
>KR105345.1 Zaire ebolavirus isolate Ebola virus/H.sapiens-wt/SLE/2014/Makona-G6089.1, partial genome
ATAATTTTCCTCTCATTGAAATTTATATCGGAATTTAAATTGAAATTGTTACTGTAATCATACCTGGTTT
GTTTCAGAGCCATATCACCAAGATAGAGAACAACCTAGGTCTCCGGAGGGGGCAAGGGCATCAGTGTGCT
CAGTTGAAAATCCCTTGTCAACATCTAGGCCTTATCACATCACAAGTTCCGCCTTAAACTCTGCAGGGTG
ATCCAACAACCTTAATAGCAACATTATTGTTAAAGGACAGCATTAGTTCACAGTCAAACAAGCAAGATTG
AGAATTAACTTTGATTTTGAACCTGAACACCCAGAGGACTGGAGACTCAACAACCCTAAAGCCTGGGGTA
AAACATTAGAAATAGTTTAAAGACAAATTGCTCGGAATCACAAAATTCCGAGTATGGATTCTCGTCCTCA
GAAAGTCTGGATGACGCCGAGTCTCACTGAATCTGACATGGATTACCACAAGATCTTGACAGCAGGTCTG
TCCGTTCAACAGGGGATTGTTCGGCAAAGAGTCATCCCAGTGTATCAAGTAAACAATCTTGAGGAAATTT
GCCAACTTATCATACAGGCCTTTGAAGCTGGTGTTGATTTTCAAGAGAGTGCGGACAGTTTCCTTCTCAT
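After a bulk download like this, one possible sanity check is to compare the `<Count>` reported by esearch with the number of FASTA headers in the downloaded file. The two functions below are hypothetical helpers written for illustration, and the XML in the demo is a trimmed copy of the esearch output shown above.

```shell
# Hypothetical helpers for a post-download sanity check.

# Read esearch XML on stdin and print the number inside <Count>.
esearch_count() {
    sed -n 's:.*<Count>\([0-9][0-9]*\)</Count>.*:\1:p'
}

# Count FASTA records in a file by counting header lines.
fasta_count() {
    grep -c '^>' "$1"
}

# Demo on a trimmed copy of the esearch output shown above.
xml='<ENTREZ_DIRECT>
  <Db>nucleotide</Db>
  <Count>249</Count>
</ENTREZ_DIRECT>'
printf '%s\n' "$xml" | esearch_count   # prints 249
```

In a real session the comparison would be between `esearch -db nucleotide -query PRJNA257197 | esearch_count` and `fasta_count genomes.fa`. The numbers may differ if the database changes between the two commands.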

 

(2) protein (amino acid) data

$ esearch -db protein -query PRJNA257197
<ENTREZ_DIRECT>
  <Db>protein</Db>
  <WebEnv>MCID_6583f248648a963ee776d164</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>2240</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>

$ esearch -db protein -query PRJNA257197 | efetch -format fasta > proteins.fa

$ cat proteins.fa | head
>AKC37233.1 polymerase [Zaire ebolavirus]
MATQHTQYPDARLSSPIVLDQCDLVTRACGLYSSYSLNPQLRNCKLPKHIYRLKYDVTVTKFLSDVPVAT
LPIDFIVPILLKALSGNGFCPVEPRCQQFLDEIIKYTMQDALFLKYYLKNVGAQEDCVDDHFQEKILSSI
QGNEFLHQMFFWYDLAILTRRGRLNRGNSRSTWFVHDDLIDILGYGDYVFWKIPISLLPLNTQGIPHAAM
DWYQTSVFKEAVQGHTHIVSVSTADVLIMCKDLITCRFNTTLISKIAEVEDPVCSDYPNFKIVSMLYQSG
DYLLSILGSDGYKIIKFLEPLCLAKIQLCSKYTERKGRFLTQMHLAVNHTLEEITEIRALKPSQAHKIRE
FHRTLIRLEMTPQQLCELFSIQKHWGHPVLHSETAIQKVKKHATVLKALRPIVIFETYCVFKYSIAKHYF
DSQGSWYSVTSDRNLTPGLNSYIKRNQFPPLPMIKELLWEFYHLDHPPLFSTKIISDLSIFIKDRATAVE
RTCWDAVFEPNVLGYNPPHKFSTKRVPEQFLEQENFSIENVLSYAQKLEYLLPQYRNFSFSLKEKELNVG
RTFGKLPYPTRNVQTLCEALLADGLAKAFPSNMMVVTEREQKESLLHQASWHHTSDDFGEHATVRGSSFV

 

 

+ Extracting run information with a BioProject accession number (two methods)

 

(a) runinfo

$ esearch -db sra -query PRJNA257197 | efetch -format runinfo > runinfo.csv

$ cat runinfo.csv | cut -d , -f 1,2,16 | head -3
Run,ReleaseDate,LibraryLayout
SRR1972917,2015-04-14 13:59:24,PAIRED
SRR1972918,2015-04-14 13:58:26,PAIRED
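Selecting fields by position (`-f 1,2,16`) depends on the column order of the runinfo file. As an alternative, the sketch below selects columns by header name with awk. `csv_cols` is a hypothetical helper, and it assumes that no field contains a quoted comma, since it splits naively on `,`. The demo input is a toy sample trimmed to the three columns shown above.

```shell
# Hypothetical helper: print the named columns of a CSV read from stdin.
# Assumes no field contains a quoted comma (naive split on ',').
csv_cols() {
    awk -F, -v want="$1" '
        BEGIN { n = split(want, names, ","); OFS = "," }
        NR == 1 {
            # Map each header name to its column position.
            for (i = 1; i <= NF; i++) pos[$i] = i
            for (j = 1; j <= n; j++) {
                if (!(names[j] in pos)) {
                    print "missing column: " names[j] > "/dev/stderr"
                    exit 1
                }
            }
        }
        {
            # Emit the requested columns in the requested order.
            line = ""
            for (j = 1; j <= n; j++) {
                line = line (j > 1 ? OFS : "") $(pos[names[j]])
            }
            print line
        }'
}

# Toy sample using only the three columns shown above.
printf '%s\n' \
    'Run,ReleaseDate,LibraryLayout' \
    'SRR1972917,2015-04-14 13:59:24,PAIRED' |
    csv_cols Run,LibraryLayout
```

With the real file, the equivalent call would be `csv_cols Run,ReleaseDate,LibraryLayout < runinfo.csv | head -3`.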

 

(b) summary

$ esearch -db sra -query PRJNA257197 | efetch -format summary > summary.xml

$ cat summary.xml | xtract -pattern RUN_SET -element @accession | head
SRR1972976
SRR1972975
SRR1972974
SRR1972973
SRR1972972
SRR1972971
SRR1972970
SRR1972969
SRR1972968
SRR1972967

 

 

+ Extracting taxonomic classification information with Taxonomy IDs

$ efetch -db taxonomy -id 9606,7227,10090 -format xml > output.xml

$ cat output.xml | xtract -pattern Taxon -first TaxId ScientificName GenbankCommonName Division
9606    Homo sapiens    human   Primates
7227    Drosophila melanogaster fruit fly       Invertebrates
10090   Mus musculus    house mouse     Rodents

Reference

The Biostar Handbook: 2nd Edition - István Albert