1. bio fetch
- Given an accession number, automatically downloads the matching data from the appropriate database in a suitable format
(1) from GenBank
(a) e.g., accession number NC_045512
$ bio fetch NC_045512 | head
LOCUS NC_045512 29903 bp ss-RNA linear VRL 18-JUL-2020
DEFINITION Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome.
ACCESSION NC_045512
VERSION NC_045512.2
DBLINK BioProject: PRJNA485481
KEYWORDS RefSeq.
SOURCE Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)
ORGANISM Severe acute respiratory syndrome coronavirus 2
Viruses; Riboviria; Orthornavirae; Pisuviricota; Pisoniviricetes;
(b) The default format is GenBank, but the record can also be fetched as FASTA or GFF
$ bio fetch NC_045512 --format fasta | head -3
>NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAA
CGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAAC
$ bio fetch NC_045512 --format gff | head -6
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
##sequence-region NC_045512.2 1 29903
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=2697049
NC_045512.2 RefSeq region 1 29903 . + . ID=NC_045512.2:1..29903;Dbxref=taxon:2697049;collection-date=Dec-2019;country=China;gb-acronym=SARS-CoV-2;gbkey=Src;genome=genomic;isolate=Wuhan-Hu-1;mol_type=genomic RNA;nat-host=Homo sapiens;old-name=Wuhan seafood market pneumonia virus
(2) sequencing run data from the Sequence Read Archive (SRA)
$ bio fetch SRR1972976
# Downloading 100 reads for SRR1972976
# Saving to SRR1972976_1.fastq.gz
# Saving to SRR1972976_2.fastq.gz
(3) from Ensembl
(a) gene data
$ bio fetch ENSG00000157764 | head
>ENSG00000157764.14 chromosome:GRCh38:7:140719327:140924929:-1
CTTCCCCCAATCCCCTCAGGCTCGGCTGCGCCCGGGGCCGCGGGCCGGTACCTGAGGTGG
CCCAGGCGCCCTCCGCCCGCGGCGCCGCCCGGGCCGCTCCTCCCCGCGCCCCCCGCGCCC
CCCGCTCCTCCGCCTCCGCCTCCGCCTCCGCCTCCCCCAGCTCTCCGCCTCCCTTCCCCC
TCCCCGCCCGACAGCGGCCGCTCGGGCCCCGGCTCTCGGTTATAAGATGGCGGCGCTGAG
CGGTGGCGGTGGTGGCGGCGCGGAGCCGGGCCAGGCTCTGTTCAACGGGGACATGGAGCC
CGAGGCCGGCGCCGGCGCCGGCGCCGCGGCCTCTTCGGCTGCGGACCCTGCCATTCCGGA
GGAGGTGAGTGCTGGCGCCACCCTGCCGCCCTCCCGACTCCGGGCTCGGCGGCTGGCTGG
TGTTTATTTTGGAAAGAGGCGGCGGTGGGGGCTTGATGCCCTCAGCCACCTTCTCGGGCC
AGCTCCGCGGGCTGGGAGGTGGGCATCGCCCCCGTGTCCCTCTCCGTCATGCAGCGCCTT
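(b) transcript data (as cDNA/CDS/protein)
- Without --type, the genomic sequence spanning the transcript is returned (note the chromosome coordinates in the FASTA header)
- cDNA: DNA obtained by reverse-transcribing mRNA; here, the spliced transcript sequence including the untranslated regions (UTRs)
- CDS (coding sequence): the part of the mRNA that is actually translated into protein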
$ bio fetch ENST00000288602 | head
>ENST00000288602.11 chromosome:GRCh38:7:140734486:140924732:-1
CCGCTCGGGCCCCGGCTCTCGGTTATAAGATGGCGGCGCTGAGCGGTGGCGGTGGTGGCG
GCGCGGAGCCGGGCCAGGCTCTGTTCAACGGGGACATGGAGCCCGAGGCCGGCGCCGGCG
CCGGCGCCGCGGCCTCTTCGGCTGCGGACCCTGCCATTCCGGAGGAGGTGAGTGCTGGCG
CCACCCTGCCGCCCTCCCGACTCCGGGCTCGGCGGCTGGCTGGTGTTTATTTTGGAAAGA
GGCGGCGGTGGGGGCTTGATGCCCTCAGCCACCTTCTCGGGCCAGCTCCGCGGGCTGGGA
GGTGGGCATCGCCCCCGTGTCCCTCTCCGTCATGCAGCGCCTTCCTACGTAAACACACAC
AATGGCCCGGGGGGTTTCCCTGGCCCCCACCCCAGATGTGGGGATTGGGGCAGCGGTGGT
TGAGCGGGAGGCTATCAATAGGGGGCGAAACTCAGGGTTGGTCCGAGAAGGTCACGATTG
GCTGAAGTATCCAGCTCTGCATCTCTGTGGGGTGGGGGCGGCGGCGGCCTCGACGTGGAG
$ bio fetch ENST00000288602 --type cdna | head
>ENST00000288602.11
CCGCTCGGGCCCCGGCTCTCGGTTATAAGATGGCGGCGCTGAGCGGTGGCGGTGGTGGCG
GCGCGGAGCCGGGCCAGGCTCTGTTCAACGGGGACATGGAGCCCGAGGCCGGCGCCGGCG
CCGGCGCCGCGGCCTCTTCGGCTGCGGACCCTGCCATTCCGGAGGAGGTGTGGAATATCA
AACAAATGATTAAGTTGACACAGGAACATATAGAGGCCCTATTGGACAAATTTGGTGGGG
AGCATAATCCACCATCAATATATCTGGAGGCCTATGAAGAATACACCAGCAAGCTAGATG
CACTCCAACAAAGAGAACAACAGTTATTGGAATCTCTGGGGAACGGAACTGATTTTTCTG
TTTCTAGCTCTGCATCAATGGATACCGTTACATCTTCTTCCTCTTCTAGCCTTTCAGTGC
TACCTTCATCTCTTTCAGTTTTTCAAAATCCCACAGATGTGGCACGGAGCAACCCCAAGT
CACCACAAAAACCTATCGTTAGAGTCTTCCTGCCCAACAAACAGAGGACAGTGGTACCTG
$ bio fetch ENST00000288602 --type cds | head
>ENST00000288602.11
ATGGCGGCGCTGAGCGGTGGCGGTGGTGGCGGCGCGGAGCCGGGCCAGGCTCTGTTCAAC
GGGGACATGGAGCCCGAGGCCGGCGCCGGCGCCGGCGCCGCGGCCTCTTCGGCTGCGGAC
CCTGCCATTCCGGAGGAGGTGTGGAATATCAAACAAATGATTAAGTTGACACAGGAACAT
ATAGAGGCCCTATTGGACAAATTTGGTGGGGAGCATAATCCACCATCAATATATCTGGAG
GCCTATGAAGAATACACCAGCAAGCTAGATGCACTCCAACAAAGAGAACAACAGTTATTG
GAATCTCTGGGGAACGGAACTGATTTTTCTGTTTCTAGCTCTGCATCAATGGATACCGTT
ACATCTTCTTCCTCTTCTAGCCTTTCAGTGCTACCTTCATCTCTTTCAGTTTTTCAAAAT
CCCACAGATGTGGCACGGAGCAACCCCAAGTCACCACAAAAACCTATCGTTAGAGTCTTC
CTGCCCAACAAACAGAGGACAGTGGTACCTGCAAGGTGTGGAGTTACAGTCCGAGACAGT
$ bio fetch ENST00000288602 --type protein | head
>ENSP00000288602.7
MAALSGGGGGGAEPGQALFNGDMEPEAGAGAGAAASSAADPAIPEEVWNIKQMIKLTQEH
IEALLDKFGGEHNPPSIYLEAYEEYTSKLDALQQREQQLLESLGNGTDFSVSSSASMDTV
TSSSSSSLSVLPSSLSVFQNPTDVARSNPKSPQKPIVRVFLPNKQRTVVPARCGVTVRDS
LKKALMMRGLIPECCAVYRIQDGEKKPIGWDTDISWLTGEELHVEVLENVPLTTHNFVRK
TFFTLAFCDFCRKLLFQGFRCQTCGYKFHQRCSTEVPLMCVNYDQLDLLFVSKFFEHHPI
PQEEASLAETALTSGSSPSAPASDSIGPQILTSPSPSKSIPIPQPFRPADEDHRNQFGQR
DRSSSAPNVHINTIEPVNIDDLIRDQGFRGDGAPLNQLMRCLRKYQSRTPSPLLHSVPSE
IVFDFEPGPVFRGSTTGLSATPPASLPGSLTNVKALQKSPGPQRERKSSSSSEDRNRMKT
LGRRDSSDDWEIPDGQITVGQRIGSGSFGTVYKGKWHGDVAVKMLNVTAPTPQQLQAFKN
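The cDNA and CDS views above are related: the CDS is a substring of the cDNA, and whatever precedes it is the 5' UTR. As a small sanity check, the sketch below locates the start of the CDS inside the cDNA, using only the first line of the cDNA output and the first 31 bases of the CDS output shown above, pasted in as literals so that it runs offline. The function name utr5_length is made up for this illustration.

```shell
#!/bin/sh
# First line of the cDNA output and first 31 bases of the CDS output
# shown above (ENST00000288602).
cdna="CCGCTCGGGCCCCGGCTCTCGGTTATAAGATGGCGGCGCTGAGCGGTGGCGGTGGTGGCG"
cds="ATGGCGGCGCTGAGCGGTGGCGGTGGTGGCG"

# utr5_length (hypothetical helper): print how many bases of the cDNA ($1)
# precede the first occurrence of the CDS prefix ($2).
utr5_length() {
    case "$1" in
        *"$2"*) ;;   # CDS prefix found, carry on
        *) echo "CDS not found in cDNA" >&2; return 1 ;;
    esac
    prefix=${1%%"$2"*}   # drop everything from the CDS start onward
    echo "${#prefix}"
}

utr5_length "$cdna" "$cds"
```

With the literals above this prints 29, meaning the start codon ATG begins at base 30 of the cDNA.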
2. bio search
- Looks up an accession number (or a similar identifier) in the most appropriate database
(1) e.g., accession number AF086833 (results can also be printed as CSV or tab-delimited text)
$ bio search AF086833
[
{
"accessionversion": "AF086833.2",
"assemblyacc": "",
"assemblygi": "",
"biomol": "cRNA",
"biosample": "",
"caption": "AF086833",
"completeness": "complete",
"createdate": "1999/02/10",
"extra": "gi|10141003|gb|AF086833.2|",
"flags": "",
"geneticcode": "1",
"genome": "",
"gi": 10141003,
"moltype": "rna",
"organism": "Ebola virus - Mayinga, Zaire, 1976",
"projectid": "0",
"segsetsize": "",
"slen": 18959,
"sourcedb": "insd",
"strain": "Mayinga",
"strand": "",
"subname": "Mayinga|EBOV-May",
"subtype": "strain|gb_acronym",
"taxid": 128952,
"tech": "",
"term": "10141003",
"title": "Ebola virus - Mayinga, Zaire, 1976, complete genome",
"topology": "linear",
"uid": "10141003",
"updatedate": "2012/02/13"
}
]
$ bio search AF086833 --csv
AF086833.2,,,cRNA,,AF086833,complete,1999/02/10,gi|10141003|gb|AF086833.2|,,1,,10141003,rna,"Ebola virus - Mayinga, Zaire, 1976",0,,18959,insd,Mayinga,,Mayinga|EBOV-May,strain|gb_acronym,128952,,10141003,"Ebola virus - Mayinga, Zaire, 1976, complete genome",linear,10141003,2012/02/13
$ bio search AF086833 --tab
AF086833.2 cRNA AF086833 complete 1999/02/10 gi|10141003|gb|AF086833.2| 1 10141003 rna Ebola virus - Mayinga, Zaire, 1976 0 18959 insd Mayinga Mayinga|EBOV-May strain|gb_acronym 128952 10141003 Ebola virus - Mayinga, Zaire, 1976, complete genome linear 10141003 2012/02/13
(2) e.g., SRR number SRR14575325
$ bio search SRR14575325
[
{
"run_accession": "SRR14575325",
"sample_accession": "SAMN19241174",
"sample_alias": "GSM5320434",
"sample_description": "TG3_1",
"first_public": "2021-05-20",
"country": "",
"scientific_name": "Homo sapiens",
"fastq_bytes": "612817621",
"base_count": "971964750",
"read_count": "19439295",
"library_name": "",
"library_strategy": "miRNA-Seq",
"library_source": "TRANSCRIPTOMIC",
"library_layout": "SINGLE",
"instrument_platform": "ILLUMINA",
"instrument_model": "Illumina HiSeq 2000",
"study_title": "Identification of 5'isomiR in HCC patients.",
"fastq_url": [
"https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR145/025/SRR14575325/SRR14575325.fastq.gz"
],
"info": "613 MB files; 19.4 million reads; 972.0 million sequenced bases"
}
]
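The info field above appears to be a human-readable summary of the numeric fields in the same record. The sketch below redoes that arithmetic with awk on the values shown above, assuming decimal units (1 MB = 1,000,000 bytes); that is a guess, although it does reproduce the figures in info. The helper name summarize is hypothetical.

```shell
#!/bin/sh
# Values copied from the SRR14575325 record shown above.
fastq_bytes=612817621
read_count=19439295
base_count=971964750

# summarize (hypothetical helper): rebuild an info-style line from
# fastq_bytes ($1), read_count ($2) and base_count ($3).
summarize() {
    awk -v b="$1" -v r="$2" -v s="$3" 'BEGIN {
        printf "%.0f MB files; %.1f million reads; %.1f million sequenced bases\n",
               b / 1000000, r / 1000000, s / 1000000
    }'
}

summarize "$fastq_bytes" "$read_count" "$base_count"
```

The printed line matches the info string of the record, which suggests that info is derived from these three fields.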
(3) searching through the MyGene (https://mygene.info/) interface
- used when the search term does not match a recognized accession format such as an SRA number
$ bio search symbol:HBB --limit 1
[
{
"name": "hemoglobin subunit beta",
"refseq": {
"genomic": [
"NC_000011.10",
"NC_060935.1",
"NG_000007.3",
"NG_059281.1"
],
"protein": "NP_000509.1",
"rna": "NM_000518.5",
"translation": {
"protein": "NP_000509.1",
"rna": "NM_000518.5"
}
},
"symbol": "HBB",
"taxid": 9606,
"taxname": "Homo sapiens"
}
]
# showing 1 out of 38 results.
※ bio search prints its results as JSON, so the values of individual keys can be extracted with a tool such as jq.
$ bio search PRJNA257197 | head -25
[
{
"run_accession": "SRR1553421",
"sample_accession": "SAMN02951955",
"sample_alias": "EM104",
"sample_description": "Zaire ebolavirus genome sequencing from 2014 outbreak in Sierra Leone",
"first_public": "2015-06-05",
"country": "Sierra Leone",
"scientific_name": "Zaire ebolavirus",
"fastq_bytes": "19517324;20042056",
"base_count": "57393048",
"read_count": "284124",
"library_name": "EM104_r1.ADXX",
"library_strategy": "RNA-Seq",
"library_source": "TRANSCRIPTOMIC",
"library_layout": "PAIRED",
"instrument_platform": "ILLUMINA",
"instrument_model": "Illumina HiSeq 2500",
"study_title": "Zaire ebolavirus Genome sequencing",
"fastq_url": [
"https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR155/001/SRR1553421/SRR1553421_1.fastq.gz",
"https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR155/001/SRR1553421/SRR1553421_2.fastq.gz"
],
"info": "20 MB, 20 MB files; 0.3 million reads; 57.4 million sequenced bases"
},
$ bio search PRJNA257197 > data.json
$ cat data.json | jq -r '.[].run_accession' | head -5
SRR1553426
SRR1553428
SRR1553432
SRR1553441
SRR1553442
$ cat data.json | jq -r '.[]|[.run_accession,.read_count]|@tsv' | head -5
SRR1553426 2446829
SRR1553428 732254
SRR1553432 482062
SRR1553441 531144
SRR1553442 1060564
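→ jq -r '.[].run_accession': .[] iterates over every element of the top-level array, and .run_accession extracts the value stored under that key in each element (-r prints raw strings without JSON quotes)
→ jq -r '.[]|[.run_accession,.read_count]|@tsv': for each element, collects the run_accession and read_count values into an array and prints it as one tab-separated row
(tsv: tab-separated values)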
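Because @tsv emits plain tab-separated rows, the output can be handed straight to standard text tools. As an illustration, the sketch below keeps only the runs with more than one million reads. The five rows printed above are inlined so that it runs without network access; the threshold and the helper names rows and filter_runs are arbitrary choices made for this example.

```shell
#!/bin/sh
# The five (run_accession, read_count) rows printed above, as TSV.
rows() {
    printf '%s\t%s\n' \
        SRR1553426 2446829 \
        SRR1553428 732254 \
        SRR1553432 482062 \
        SRR1553441 531144 \
        SRR1553442 1060564
}

# filter_runs (hypothetical helper): print the accessions whose read count
# (column 2) exceeds the threshold passed as the first argument.
filter_runs() {
    awk -F '\t' -v min="$1" '$2 + 0 > min + 0 { print $1 }'
}

rows | filter_runs 1000000
```

In a real session the rows function would be replaced by the jq command itself, for example cat data.json | jq -r '.[]|[.run_accession,.read_count]|@tsv' | filter_runs 1000000.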
$ bio search symbol:HBB --species human --fields ensembl
[
{
"ensembl": {
"gene": "ENSG00000244734",
"protein": [
"ENSP00000333994",
"ENSP00000369671",
"ENSP00000488004",
"ENSP00000494175",
"ENSP00000496200"
],
"transcript": [
"ENST00000335295",
"ENST00000380315",
"ENST00000475226",
"ENST00000485743",
"ENST00000633227",
"ENST00000647020"
],
"translation": [
{
"protein": "ENSP00000333994",
"rna": "ENST00000335295"
},
{
"protein": "ENSP00000494175",
"rna": "ENST00000647020"
},
{
"protein": "ENSP00000488004",
"rna": "ENST00000633227"
},
{
"protein": "ENSP00000496200",
"rna": "ENST00000485743"
},
{
"protein": "ENSP00000369671",
"rna": "ENST00000380315"
}
],
"type_of_gene": "protein_coding"
},
"name": "hemoglobin subunit beta",
"symbol": "HBB",
"taxid": 9606,
"taxname": "Homo sapiens"
}
]
$ bio search symbol:HBB --species human --fields ensembl > DATA.json
$ cat DATA.json | jq -r '.[].ensembl.translation[]|[.protein,.rna]|@tsv'
ENSP00000333994 ENST00000335295
ENSP00000494175 ENST00000647020
ENSP00000488004 ENST00000633227
ENSP00000496200 ENST00000485743
ENSP00000369671 ENST00000380315
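The transcript IDs in the second column have the same form as the ID passed to bio fetch earlier, so search results can drive downloads. The sketch below only prints the commands it would run (a dry run) for the pairs listed above. The output file naming is an arbitrary choice, and pairs and fetch_commands are hypothetical helpers.

```shell
#!/bin/sh
# The (protein, transcript) pairs printed above, as TSV.
pairs() {
    printf '%s\t%s\n' \
        ENSP00000333994 ENST00000335295 \
        ENSP00000494175 ENST00000647020 \
        ENSP00000488004 ENST00000633227 \
        ENSP00000496200 ENST00000485743 \
        ENSP00000369671 ENST00000380315
}

# fetch_commands (hypothetical helper): turn each TSV row into a
# "bio fetch" command line; nothing is executed here.
fetch_commands() {
    while IFS="$(printf '\t')" read -r protein transcript; do
        echo "bio fetch $transcript --type protein > $protein.fa"
    done
}

pairs | fetch_commands
```

Piping the printed lines into sh would actually run them, which requires the bio package and network access.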
* For more on the bio package, see https://www.bioinfo.help/.
Reference
The Biostar Handbook: 2nd Edition - István Albert