Let's fetch the record for accession number AF086833.2 (https://www.ncbi.nlm.nih.gov/nuccore/AF086833.2) using the Entrez web API and Entrez Direct.
1. Entrez Web API
$ curl -s 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=AF086833.2&db=nuccore&rettype=fasta' | head
>AF086833.2 Ebola virus - Mayinga, Zaire, 1976, complete genome
CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAA
TTTTCCTCTCATTGAAATTTATATCGGAATTTAAATTGAAATTGTTACTGTAATCACACCTGGTTTGTTT
CAGAGCCACATCACAAAGATAGAGAACAACCTAGGTCTCCGAAGGGAGCAAGGGCATCAGTGTGCTCAGT
TGAAAATCCCTTGTCAACACCTAGGTCTTATCACATCACAAGTTCCACCTCAGACTCTGCAGGGTGATCC
AACAACCTTAATAGAAACATTATTGTTAAAGGACAGCATTAGTTCACAGTCAAACAAGCAAGATTGAGAA
TTAACCTTGGTTTTGAACTTGAACACTTAGGGGATTGAAGATTCAACAACCCTAAAGCTTGGGGTAAAAC
ATTGGAAATAGTTAAAAGACAAATTGCTCGGAATCACAAAATTCCGAGTATGGATTCTCGTCCTCAGAAA
ATCTGGATGGCGCCGAGTCTCACTGAATCTGACATGGATTACCACAAGATCTTGACAGCAGGTCTGTCCG
TTCAACAGGGGATTGTTCGGCAAAGAGTCATCCCAGTGTATCAAGTAAACAATCTTGAAGAAATTTGCCA
→ Building request URLs by hand like this is not particularly well suited to exploratory analysis.
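The request above is just the efetch.fcgi endpoint plus the db, id and rettype query parameters, so a small helper can take some of the tedium out of repeated requests. The following is only a minimal sketch: the function name efetch_url is hypothetical (it is not part of any NCBI tool), it only builds the URL, and it assumes plain accession-like arguments that need no URL-encoding.

```shell
#!/bin/sh
# Base endpoint, taken from the curl command above.
EUTILS_BASE='https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'

# efetch_url DB ID RETTYPE  (hypothetical helper, not part of any NCBI tool)
# Prints the EFetch URL for one record. No URL-encoding is done, so this
# assumes arguments without spaces, '&' or other special characters.
efetch_url() {
    printf '%s?db=%s&id=%s&rettype=%s\n' "$EUTILS_BASE" "$1" "$2" "$3"
}

# Only prints the URL; nothing is downloaded here.
efetch_url nuccore AF086833.2 fasta

# To actually download, the URL could be handed to curl, for example:
#   curl -s "$(efetch_url nuccore AF086833.2 fasta)" | head
```

Even with a helper like this, every format and range option still has to be spelled out as a query parameter, which is where Entrez Direct is more convenient.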
2. Entrez Direct
(1) GenBank format
$ efetch -db nuccore -format gb -id AF086833 | head
LOCUS AF086833 18959 bp cRNA linear VRL 13-FEB-2012
DEFINITION Ebola virus - Mayinga, Zaire, 1976, complete genome.
ACCESSION AF086833
VERSION AF086833.2
KEYWORDS .
SOURCE Ebola virus - Mayinga, Zaire, 1976 (EBOV-May)
ORGANISM Ebola virus - Mayinga, Zaire, 1976
Viruses; Riboviria; Orthornavirae; Negarnaviricota;
Haploviricotina; Monjiviricetes; Mononegavirales; Filoviridae;
Orthoebolavirus; Orthoebolavirus zairense.
(2) FASTA format
$ efetch -db nuccore -format fasta -id AF086833 > AF086833.fa
$ cat AF086833.fa | head -2
>AF086833.2 Ebola virus - Mayinga, Zaire, 1976, complete genome
CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAA
(3) Extract only a specific range of the sequence
- e.g., from the first base through the third base
$ efetch -db nuccore -format fasta -id AF086833 -seq_start 1 -seq_stop 3
>AF086833.2:1-3 Ebola virus - Mayinga, Zaire, 1976, complete genome
CGG
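Once the FASTA file has been saved locally, the same slice can also be taken without another request. This is a rough sketch under a few assumptions: the file holds a single record, fasta_subseq is a hypothetical helper name, and sample.fa is a two-line stand-in (header plus the first sequence line shown earlier) rather than the full download.

```shell
#!/bin/sh
# fasta_subseq FILE START STOP  (hypothetical helper)
# Prints bases START..STOP (1-based, inclusive) of a single-record FASTA file.
fasta_subseq() {
    # Skip the header line, join the wrapped sequence lines into one string,
    # then cut out the requested range.
    awk -v s="$2" -v e="$3" '
        !/^>/ { seq = seq $0 }
        END   { print substr(seq, s, e - s + 1) }
    ' "$1"
}

# Stand-in for AF086833.fa so the sketch runs without network access.
cat > sample.fa <<'EOF'
>AF086833.2 Ebola virus - Mayinga, Zaire, 1976, complete genome
CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAA
EOF

fasta_subseq sample.fa 1 3
```

Unlike efetch with -seq_start and -seq_stop, this prints only the bases and does not rewrite the header to AF086833.2:1-3.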
(4-1) forward strand sequence
$ efetch -db nuccore -format fasta -id AF086833 -seq_start 1 -seq_stop 5 -strand 1
>AF086833.2:1-5 Ebola virus - Mayinga, Zaire, 1976, complete genome
CGGAC
→ 1-5 (forward)
(4-2) reverse strand sequence
$ efetch -db nuccore -format fasta -id AF086833 -seq_start 1 -seq_stop 5 -strand 2
>AF086833.2:c5-1 Ebola virus - Mayinga, Zaire, 1976, complete genome
GTCCG
→ c5-1 (reverse)
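With -strand 2 the result is the reverse complement: the bases are read in the opposite direction and each one is replaced by its complementary base. As a sanity check, the GTCCG above can be reproduced locally from the forward-strand CGGAC. This is a minimal sketch; revcomp is a hypothetical helper name, and it assumes bare sequence lines (no FASTA header, no line wrapping) containing only A, C, G and T.

```shell
#!/bin/sh
# revcomp (hypothetical helper): reads bare sequence lines on stdin and
# prints the reverse complement of each line.
# Assumes only A/C/G/T (upper or lower case); other symbols pass through.
revcomp() {
    # rev reverses each line; tr then swaps every base for its complement.
    rev | tr 'ACGTacgt' 'TGCAtgca'
}

# Forward-strand bases 1-5 from (4-1).
printf 'CGGAC\n' | revcomp
```

This prints GTCCG, matching the c5-1 output from efetch.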
※ What if you know the BioProject accession number instead of the GenBank accession number?
(1) nucleotide data
$ esearch -db nucleotide -query PRJNA257197
<ENTREZ_DIRECT>
<Db>nucleotide</Db>
<WebEnv>MCID_6583f228c48f554d4438cc56</WebEnv>
<QueryKey>1</QueryKey>
<Count>249</Count>
<Step>1</Step>
</ENTREZ_DIRECT>
# The WebEnv and QueryKey above describe a search environment that can be piped into other Entrez Direct programs
$ esearch -db nucleotide -query PRJNA257197 | efetch -format fasta > genomes.fa
$ cat genomes.fa | head
>KR105345.1 Zaire ebolavirus isolate Ebola virus/H.sapiens-wt/SLE/2014/Makona-G6089.1, partial genome
ATAATTTTCCTCTCATTGAAATTTATATCGGAATTTAAATTGAAATTGTTACTGTAATCATACCTGGTTT
GTTTCAGAGCCATATCACCAAGATAGAGAACAACCTAGGTCTCCGGAGGGGGCAAGGGCATCAGTGTGCT
CAGTTGAAAATCCCTTGTCAACATCTAGGCCTTATCACATCACAAGTTCCGCCTTAAACTCTGCAGGGTG
ATCCAACAACCTTAATAGCAACATTATTGTTAAAGGACAGCATTAGTTCACAGTCAAACAAGCAAGATTG
AGAATTAACTTTGATTTTGAACCTGAACACCCAGAGGACTGGAGACTCAACAACCCTAAAGCCTGGGGTA
AAACATTAGAAATAGTTTAAAGACAAATTGCTCGGAATCACAAAATTCCGAGTATGGATTCTCGTCCTCA
GAAAGTCTGGATGACGCCGAGTCTCACTGAATCTGACATGGATTACCACAAGATCTTGACAGCAGGTCTG
TCCGTTCAACAGGGGATTGTTCGGCAAAGAGTCATCCCAGTGTATCAAGTAAACAATCTTGAGGAAATTT
GCCAACTTATCATACAGGCCTTTGAAGCTGGTGTTGATTTTCAAGAGAGTGCGGACAGTTTCCTTCTCAT
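A quick way to check that a bulk download like this is complete is to count the records in the file, since every FASTA record begins with a '>' header line. The sketch below uses a hypothetical helper name (count_fasta_records) and a tiny made-up demo.fa so that it runs without network access.

```shell
#!/bin/sh
# count_fasta_records FILE  (hypothetical helper)
# Every FASTA record starts with a '>' header line, so counting those
# lines counts the records.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Tiny made-up stand-in file; the sequences are not real data.
cat > demo.fa <<'EOF'
>seq1 example record
ACGT
>seq2 example record
GGCC
TTAA
EOF

count_fasta_records demo.fa
```

Run against genomes.fa, the number would be expected to agree with the Count reported by esearch (249 in the output above), assuming the download finished and the database has not changed in between.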
(2) protein (amino acid) data
$ esearch -db protein -query PRJNA257197
<ENTREZ_DIRECT>
<Db>protein</Db>
<WebEnv>MCID_6583f248648a963ee776d164</WebEnv>
<QueryKey>1</QueryKey>
<Count>2240</Count>
<Step>1</Step>
</ENTREZ_DIRECT>
$ esearch -db protein -query PRJNA257197 | efetch -format fasta > proteins.fa
$ cat proteins.fa | head
>AKC37233.1 polymerase [Zaire ebolavirus]
MATQHTQYPDARLSSPIVLDQCDLVTRACGLYSSYSLNPQLRNCKLPKHIYRLKYDVTVTKFLSDVPVAT
LPIDFIVPILLKALSGNGFCPVEPRCQQFLDEIIKYTMQDALFLKYYLKNVGAQEDCVDDHFQEKILSSI
QGNEFLHQMFFWYDLAILTRRGRLNRGNSRSTWFVHDDLIDILGYGDYVFWKIPISLLPLNTQGIPHAAM
DWYQTSVFKEAVQGHTHIVSVSTADVLIMCKDLITCRFNTTLISKIAEVEDPVCSDYPNFKIVSMLYQSG
DYLLSILGSDGYKIIKFLEPLCLAKIQLCSKYTERKGRFLTQMHLAVNHTLEEITEIRALKPSQAHKIRE
FHRTLIRLEMTPQQLCELFSIQKHWGHPVLHSETAIQKVKKHATVLKALRPIVIFETYCVFKYSIAKHYF
DSQGSWYSVTSDRNLTPGLNSYIKRNQFPPLPMIKELLWEFYHLDHPPLFSTKIISDLSIFIKDRATAVE
RTCWDAVFEPNVLGYNPPHKFSTKRVPEQFLEQENFSIENVLSYAQKLEYLLPQYRNFSFSLKEKELNVG
RTFGKLPYPTRNVQTLCEALLADGLAKAFPSNMMVVTEREQKESLLHQASWHHTSDDFGEHATVRGSSFV
+ Extract run information with the BioProject accession number (two methods)
(a) runinfo
$ esearch -db sra -query PRJNA257197 | efetch -format runinfo > runinfo.csv
$ cat runinfo.csv | cut -d , -f 1,2,16 | head -3
Run,ReleaseDate,LibraryLayout
SRR1972917,2015-04-14 13:59:24,PAIRED
SRR1972918,2015-04-14 13:58:26,PAIRED
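Selecting fields by position (1, 2, 16) works, but it depends on the column order of runinfo.csv staying the same. One alternative is to look the columns up by header name. The following is a sketch only: csv_columns is a hypothetical helper, it assumes no field contains a quoted comma, and runinfo_sample.csv is a cut-down stand-in whose Placeholder column is invented to show that position does not matter.

```shell
#!/bin/sh
# csv_columns FILE NAME[,NAME...]  (hypothetical helper)
# Prints the named columns of a comma-separated file, header included.
# Assumes no field contains a quoted comma; a plain split on ',' would
# break on such fields.
csv_columns() {
    awk -F, -v want="$2" '
        BEGIN { n = split(want, names, ",") }
        NR == 1 {
            # Map each header name to its column position.
            for (i = 1; i <= NF; i++) pos[$i] = i
            # Stop early if a requested column is not in the header.
            for (j = 1; j <= n; j++)
                if (!(names[j] in pos)) {
                    print "no such column: " names[j] > "/dev/stderr"
                    exit 1
                }
        }
        NF == 0 { next }   # skip blank lines
        {
            line = ""
            for (j = 1; j <= n; j++)
                line = line (j > 1 ? "," : "") $(pos[names[j]])
            print line
        }
    ' "$1"
}

# Cut-down stand-in for runinfo.csv; "Placeholder" is not a real column.
cat > runinfo_sample.csv <<'EOF'
Run,ReleaseDate,Placeholder,LibraryLayout
SRR1972917,2015-04-14 13:59:24,x,PAIRED
SRR1972918,2015-04-14 13:58:26,x,PAIRED
EOF

csv_columns runinfo_sample.csv Run,LibraryLayout
```

On the real file, asking for Run,ReleaseDate,LibraryLayout should select the same three columns as the cut command above, provided those header names are present.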
(b) summary
$ esearch -db sra -query PRJNA257197 | efetch -format summary > summary.xml
$ cat summary.xml | xtract -pattern RUN_SET -element @accession | head
SRR1972976
SRR1972975
SRR1972974
SRR1972973
SRR1972972
SRR1972971
SRR1972970
SRR1972969
SRR1972968
SRR1972967
+ Extract taxonomy (classification) information with Taxonomy IDs
$ efetch -db taxonomy -id 9606,7227,10090 -format xml > output.xml
$ cat output.xml | xtract -pattern Taxon -first TaxId ScientificName GenbankCommonName Division
9606 Homo sapiens human Primates
7227 Drosophila melanogaster fruit fly Invertebrates
10090 Mus musculus house mouse Rodents
Reference
The Biostar Handbook: 2nd Edition - István Albert