Bioinformatics Scribbles

[Linux] bio fetch and bio search

Hazel Y. 2023. 12. 21. 22:10

 

1. bio fetch

- Given an accession number, automatically downloads the corresponding data from the correct source in an appropriate format

 

 

(1) from GenBank

 

(a) e.g., accession number NC_045512

$ bio fetch NC_045512 | head
LOCUS       NC_045512              29903 bp ss-RNA     linear   VRL 18-JUL-2020
DEFINITION  Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome.
ACCESSION   NC_045512
VERSION     NC_045512.2
DBLINK      BioProject: PRJNA485481
KEYWORDS    RefSeq.
SOURCE      Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)
  ORGANISM  Severe acute respiratory syndrome coronavirus 2
            Viruses; Riboviria; Orthornavirae; Pisuviricota; Pisoniviricetes;

 

(b) The default format is GenBank, but the record can also be fetched in FASTA or GFF format

$ bio fetch NC_045512 --format fasta | head -3
>NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAA
CGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAAC

$ bio fetch NC_045512 --format gff | head -6
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
##sequence-region NC_045512.2 1 29903
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=2697049
NC_045512.2     RefSeq  region  1       29903   .       +       .       ID=NC_045512.2:1..29903;Dbxref=taxon:2697049;collection-date=Dec-2019;country=China;gb-acronym=SARS-CoV-2;gbkey=Src;genome=genomic;isolate=Wuhan-Hu-1;mol_type=genomic RNA;nat-host=Homo sapiens;old-name=Wuhan seafood market pneumonia virus
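
To make the GFF layout concrete, here is a minimal Python sketch (not part of bio) that splits a feature line like the one above into the nine tab-separated GFF3 columns and turns the attribute column into a dictionary. The helper name parse_gff_line is hypothetical, the attribute string is shortened to three of the pairs shown above, and URL-escaped attribute values are not decoded.

```python
# Column names follow the GFF3 layout: nine tab-separated columns.
GFF_COLUMNS = ["seqid", "source", "type", "start", "end",
               "score", "strand", "phase", "attributes"]


def parse_gff_line(line):
    """Parse one GFF3 feature line into a dict (hypothetical helper)."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(GFF_COLUMNS):
        raise ValueError(f"expected 9 columns, got {len(values)}")
    record = dict(zip(GFF_COLUMNS, values))
    # GFF3 coordinates are 1-based and inclusive.
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Attributes are ';'-separated key=value pairs.
    record["attributes"] = dict(
        pair.split("=", 1) for pair in record["attributes"].split(";") if pair
    )
    return record


# Feature line from the output above, with the attributes shortened.
line = ("NC_045512.2\tRefSeq\tregion\t1\t29903\t.\t+\t.\t"
        "ID=NC_045512.2:1..29903;Dbxref=taxon:2697049;isolate=Wuhan-Hu-1")

feature = parse_gff_line(line)
length = feature["end"] - feature["start"] + 1
print(feature["seqid"], feature["type"], length)
print(feature["attributes"]["isolate"])
```

Because both coordinates are inclusive, the region length is end - start + 1, which agrees with the 29903 bp reported in the GenBank LOCUS line.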

 

 

(2) sequencing run data from the Sequence Read Archive (SRA)

$ bio fetch SRR1972976
# Downloading 100 reads for SRR1972976
# Saving to SRR1972976_1.fastq.gz
# Saving to SRR1972976_2.fastq.gz

 

 

(3) from Ensembl

 

(a) gene data

$ bio fetch ENSG00000157764 | head
>ENSG00000157764.14 chromosome:GRCh38:7:140719327:140924929:-1
CTTCCCCCAATCCCCTCAGGCTCGGCTGCGCCCGGGGCCGCGGGCCGGTACCTGAGGTGG
CCCAGGCGCCCTCCGCCCGCGGCGCCGCCCGGGCCGCTCCTCCCCGCGCCCCCCGCGCCC
CCCGCTCCTCCGCCTCCGCCTCCGCCTCCGCCTCCCCCAGCTCTCCGCCTCCCTTCCCCC
TCCCCGCCCGACAGCGGCCGCTCGGGCCCCGGCTCTCGGTTATAAGATGGCGGCGCTGAG
CGGTGGCGGTGGTGGCGGCGCGGAGCCGGGCCAGGCTCTGTTCAACGGGGACATGGAGCC
CGAGGCCGGCGCCGGCGCCGGCGCCGCGGCCTCTTCGGCTGCGGACCCTGCCATTCCGGA
GGAGGTGAGTGCTGGCGCCACCCTGCCGCCCTCCCGACTCCGGGCTCGGCGGCTGGCTGG
TGTTTATTTTGGAAAGAGGCGGCGGTGGGGGCTTGATGCCCTCAGCCACCTTCTCGGGCC
AGCTCCGCGGGCTGGGAGGTGGGCATCGCCCCCGTGTCCCTCTCCGTCATGCAGCGCCTT

 

(b) transcript data (as cDNA/CDS/protein)

- cDNA: DNA obtained by reverse transcription of mRNA

- CDS (coding sequence): the part of the mRNA that is actually translated into protein

$ bio fetch ENST00000288602 | head
>ENST00000288602.11 chromosome:GRCh38:7:140734486:140924732:-1
CCGCTCGGGCCCCGGCTCTCGGTTATAAGATGGCGGCGCTGAGCGGTGGCGGTGGTGGCG
GCGCGGAGCCGGGCCAGGCTCTGTTCAACGGGGACATGGAGCCCGAGGCCGGCGCCGGCG
CCGGCGCCGCGGCCTCTTCGGCTGCGGACCCTGCCATTCCGGAGGAGGTGAGTGCTGGCG
CCACCCTGCCGCCCTCCCGACTCCGGGCTCGGCGGCTGGCTGGTGTTTATTTTGGAAAGA
GGCGGCGGTGGGGGCTTGATGCCCTCAGCCACCTTCTCGGGCCAGCTCCGCGGGCTGGGA
GGTGGGCATCGCCCCCGTGTCCCTCTCCGTCATGCAGCGCCTTCCTACGTAAACACACAC
AATGGCCCGGGGGGTTTCCCTGGCCCCCACCCCAGATGTGGGGATTGGGGCAGCGGTGGT
TGAGCGGGAGGCTATCAATAGGGGGCGAAACTCAGGGTTGGTCCGAGAAGGTCACGATTG
GCTGAAGTATCCAGCTCTGCATCTCTGTGGGGTGGGGGCGGCGGCGGCCTCGACGTGGAG

$ bio fetch ENST00000288602 --type cdna | head
>ENST00000288602.11
CCGCTCGGGCCCCGGCTCTCGGTTATAAGATGGCGGCGCTGAGCGGTGGCGGTGGTGGCG
GCGCGGAGCCGGGCCAGGCTCTGTTCAACGGGGACATGGAGCCCGAGGCCGGCGCCGGCG
CCGGCGCCGCGGCCTCTTCGGCTGCGGACCCTGCCATTCCGGAGGAGGTGTGGAATATCA
AACAAATGATTAAGTTGACACAGGAACATATAGAGGCCCTATTGGACAAATTTGGTGGGG
AGCATAATCCACCATCAATATATCTGGAGGCCTATGAAGAATACACCAGCAAGCTAGATG
CACTCCAACAAAGAGAACAACAGTTATTGGAATCTCTGGGGAACGGAACTGATTTTTCTG
TTTCTAGCTCTGCATCAATGGATACCGTTACATCTTCTTCCTCTTCTAGCCTTTCAGTGC
TACCTTCATCTCTTTCAGTTTTTCAAAATCCCACAGATGTGGCACGGAGCAACCCCAAGT
CACCACAAAAACCTATCGTTAGAGTCTTCCTGCCCAACAAACAGAGGACAGTGGTACCTG

$ bio fetch ENST00000288602 --type cds | head
>ENST00000288602.11
ATGGCGGCGCTGAGCGGTGGCGGTGGTGGCGGCGCGGAGCCGGGCCAGGCTCTGTTCAAC
GGGGACATGGAGCCCGAGGCCGGCGCCGGCGCCGGCGCCGCGGCCTCTTCGGCTGCGGAC
CCTGCCATTCCGGAGGAGGTGTGGAATATCAAACAAATGATTAAGTTGACACAGGAACAT
ATAGAGGCCCTATTGGACAAATTTGGTGGGGAGCATAATCCACCATCAATATATCTGGAG
GCCTATGAAGAATACACCAGCAAGCTAGATGCACTCCAACAAAGAGAACAACAGTTATTG
GAATCTCTGGGGAACGGAACTGATTTTTCTGTTTCTAGCTCTGCATCAATGGATACCGTT
ACATCTTCTTCCTCTTCTAGCCTTTCAGTGCTACCTTCATCTCTTTCAGTTTTTCAAAAT
CCCACAGATGTGGCACGGAGCAACCCCAAGTCACCACAAAAACCTATCGTTAGAGTCTTC
CTGCCCAACAAACAGAGGACAGTGGTACCTGCAAGGTGTGGAGTTACAGTCCGAGACAGT

$ bio fetch ENST00000288602 --type protein | head
>ENSP00000288602.7
MAALSGGGGGGAEPGQALFNGDMEPEAGAGAGAAASSAADPAIPEEVWNIKQMIKLTQEH
IEALLDKFGGEHNPPSIYLEAYEEYTSKLDALQQREQQLLESLGNGTDFSVSSSASMDTV
TSSSSSSLSVLPSSLSVFQNPTDVARSNPKSPQKPIVRVFLPNKQRTVVPARCGVTVRDS
LKKALMMRGLIPECCAVYRIQDGEKKPIGWDTDISWLTGEELHVEVLENVPLTTHNFVRK
TFFTLAFCDFCRKLLFQGFRCQTCGYKFHQRCSTEVPLMCVNYDQLDLLFVSKFFEHHPI
PQEEASLAETALTSGSSPSAPASDSIGPQILTSPSPSKSIPIPQPFRPADEDHRNQFGQR
DRSSSAPNVHINTIEPVNIDDLIRDQGFRGDGAPLNQLMRCLRKYQSRTPSPLLHSVPSE
IVFDFEPGPVFRGSTTGLSATPPASLPGSLTNVKALQKSPGPQRERKSSSSSEDRNRMKT
LGRRDSSDDWEIPDGQITVGQRIGSGSFGTVYKGKWHGDVAVKMLNVTAPTPQQLQAFKN
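
The relationship between the three --type outputs can be checked by hand. The Python sketch below (an illustration, not something bio provides) uses only the first output line of each command above: it looks for the start of the CDS inside the start of the cDNA, then translates the first 60 nt of the CDS with the standard genetic code and compares the result with the start of the protein. The function name translate and the variable names are my own.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a coding sequence codon by codon, ignoring a trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# First sequence line of each output above (60 characters each).
cdna_head = "CCGCTCGGGCCCCGGCTCTCGGTTATAAGATGGCGGCGCTGAGCGGTGGCGGTGGTGGCG"
cds_head = "ATGGCGGCGCTGAGCGGTGGCGGTGGTGGCGGCGCGGAGCCGGGCCAGGCTCTGTTCAAC"
protein_head = "MAALSGGGGGGAEPGQALFNGDMEPEAGAGAGAAASSAADPAIPEEVWNIKQMIKLTQEH"

# Where does the start of the CDS sit inside the start of the cDNA?
cds_offset = cdna_head.find(cds_head[:30])
print("CDS starts at cDNA offset (0-based):", cds_offset)

# Translating the first 60 nt of the CDS should give the first 20 residues.
peptide = translate(cds_head)
print(peptide)
print(peptide == protein_head[:20])
```

In this example the CDS begins 29 nt into the cDNA, so those 29 nt are 5' untranslated sequence, and the translated peptide MAALSGGGGGGAEPGQALFN matches the first 20 residues of the protein output.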

 

 

2. bio search

- Looks up a record by accession number (or a similar identifier) in the most appropriate database

 

 

(1) e.g., accession number AF086833 (output is also available as CSV or tab-delimited text)

$ bio search AF086833
[
    {
        "accessionversion": "AF086833.2",
        "assemblyacc": "",
        "assemblygi": "",
        "biomol": "cRNA",
        "biosample": "",
        "caption": "AF086833",
        "completeness": "complete",
        "createdate": "1999/02/10",
        "extra": "gi|10141003|gb|AF086833.2|",
        "flags": "",
        "geneticcode": "1",
        "genome": "",
        "gi": 10141003,
        "moltype": "rna",
        "organism": "Ebola virus - Mayinga, Zaire, 1976",
        "projectid": "0",
        "segsetsize": "",
        "slen": 18959,
        "sourcedb": "insd",
        "strain": "Mayinga",
        "strand": "",
        "subname": "Mayinga|EBOV-May",
        "subtype": "strain|gb_acronym",
        "taxid": 128952,
        "tech": "",
        "term": "10141003",
        "title": "Ebola virus - Mayinga, Zaire, 1976, complete genome",
        "topology": "linear",
        "uid": "10141003",
        "updatedate": "2012/02/13"
    }
]

$ bio search AF086833 --csv
AF086833.2,,,cRNA,,AF086833,complete,1999/02/10,gi|10141003|gb|AF086833.2|,,1,,10141003,rna,"Ebola virus - Mayinga, Zaire, 1976",0,,18959,insd,Mayinga,,Mayinga|EBOV-May,strain|gb_acronym,128952,,10141003,"Ebola virus - Mayinga, Zaire, 1976, complete genome",linear,10141003,2012/02/13

$ bio search AF086833 --tab
AF086833.2                      cRNA            AF086833        complete   1999/02/10       gi|10141003|gb|AF086833.2|              1               10141003    rna     Ebola virus - Mayinga, Zaire, 1976      0               18959       insd    Mayinga         Mayinga|EBOV-May        strain|gb_acronym  128952           10141003        Ebola virus - Mayinga, Zaire, 1976, complete genome linear  10141003        2012/02/13
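
One thing to watch in the CSV output: the organism and title fields contain commas and are therefore quoted, so splitting the line on "," gives the wrong number of columns. The Python sketch below (independent of bio) parses the line above with the standard csv module. Pairing the columns with the JSON key names is an assumption based on the two outputs appearing in the same order.

```python
import csv
import io

# The line printed by `bio search AF086833 --csv` above.
csv_line = (
    'AF086833.2,,,cRNA,,AF086833,complete,1999/02/10,'
    'gi|10141003|gb|AF086833.2|,,1,,10141003,rna,'
    '"Ebola virus - Mayinga, Zaire, 1976",0,,18959,insd,Mayinga,,'
    'Mayinga|EBOV-May,strain|gb_acronym,128952,,10141003,'
    '"Ebola virus - Mayinga, Zaire, 1976, complete genome",'
    'linear,10141003,2012/02/13'
)

# Assumption: the columns follow the key order of the JSON output above.
keys = [
    "accessionversion", "assemblyacc", "assemblygi", "biomol", "biosample",
    "caption", "completeness", "createdate", "extra", "flags",
    "geneticcode", "genome", "gi", "moltype", "organism",
    "projectid", "segsetsize", "slen", "sourcedb", "strain",
    "strand", "subname", "subtype", "taxid", "tech",
    "term", "title", "topology", "uid", "updatedate",
]

naive = csv_line.split(",")                       # breaks the quoted fields apart
row = next(csv.reader(io.StringIO(csv_line)))     # respects the quotes
record = dict(zip(keys, row))

print(len(naive), len(row))
print(record["organism"], record["slen"])
```

The naive split yields 35 pieces while csv.reader yields the expected 30 fields. For the tab-delimited output, splitting on tabs is enough because the field values contain no tabs.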

 

 

(2) e.g., SRA run accession SRR14575325

$ bio search SRR14575325
[
    {
        "run_accession": "SRR14575325",
        "sample_accession": "SAMN19241174",
        "sample_alias": "GSM5320434",
        "sample_description": "TG3_1",
        "first_public": "2021-05-20",
        "country": "",
        "scientific_name": "Homo sapiens",
        "fastq_bytes": "612817621",
        "base_count": "971964750",
        "read_count": "19439295",
        "library_name": "",
        "library_strategy": "miRNA-Seq",
        "library_source": "TRANSCRIPTOMIC",
        "library_layout": "SINGLE",
        "instrument_platform": "ILLUMINA",
        "instrument_model": "Illumina HiSeq 2000",
        "study_title": "Identification of 5'isomiR in HCC patients.",
        "fastq_url": [
            "https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR145/025/SRR14575325/SRR14575325.fastq.gz"
        ],
        "info": "613 MB files; 19.4 million reads; 972.0 million sequenced bases"
    }
]
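
The numeric fields in this record are JSON strings, so they need converting before any arithmetic. As a small worked example, the Python sketch below (not part of bio) takes a trimmed copy of the record above and derives the read count in millions, the total file size in MB, and the average number of bases per read. The helper name summarize_run is hypothetical.

```python
import json

# Trimmed copy of the record above; the numeric fields arrive as strings.
record = json.loads("""
{
    "run_accession": "SRR14575325",
    "fastq_bytes": "612817621",
    "base_count": "971964750",
    "read_count": "19439295",
    "library_layout": "SINGLE"
}
""")


def summarize_run(rec):
    """Derive a few numbers from a run record (hypothetical helper)."""
    reads = int(rec["read_count"])
    bases = int(rec["base_count"])
    # Paired-end runs list one size per file, separated by ';'.
    sizes = [int(size) for size in rec["fastq_bytes"].split(";")]
    return {
        "run": rec["run_accession"],
        "million_reads": round(reads / 1e6, 1),
        "total_mb": round(sum(sizes) / 1e6),
        "bases_per_read": bases / reads,
    }


summary = summarize_run(record)
print(summary)
```

For this run the numbers come out as 19.4 million reads, 613 MB and exactly 50 bases per read, which is consistent with the "info" line in the output.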

 

 

(3) Searching through the MyGene (https://mygene.info/) interface

- Used when the search terms do not match the SRA accession number format

 

 

$ bio search symbol:HBB --limit 1
[
    {
        "name": "hemoglobin subunit beta",
        "refseq": {
            "genomic": [
                "NC_000011.10",
                "NC_060935.1",
                "NG_000007.3",
                "NG_059281.1"
            ],
            "protein": "NP_000509.1",
            "rna": "NM_000518.5",
            "translation": {
                "protein": "NP_000509.1",
                "rna": "NM_000518.5"
            }
        },
        "symbol": "HBB",
        "taxid": 9606,
        "taxname": "Homo sapiens"
    }
]
#  showing 1 out of 38 results.

 

 

※ bio search returns its results as JSON, so the values of individual keys can be extracted from the output (here with jq).

$ bio search PRJNA257197 | head -25
[
    {
        "run_accession": "SRR1553421",
        "sample_accession": "SAMN02951955",
        "sample_alias": "EM104",
        "sample_description": "Zaire ebolavirus genome sequencing from 2014 outbreak in Sierra Leone",
        "first_public": "2015-06-05",
        "country": "Sierra Leone",
        "scientific_name": "Zaire ebolavirus",
        "fastq_bytes": "19517324;20042056",
        "base_count": "57393048",
        "read_count": "284124",
        "library_name": "EM104_r1.ADXX",
        "library_strategy": "RNA-Seq",
        "library_source": "TRANSCRIPTOMIC",
        "library_layout": "PAIRED",
        "instrument_platform": "ILLUMINA",
        "instrument_model": "Illumina HiSeq 2500",
        "study_title": "Zaire ebolavirus Genome sequencing",
        "fastq_url": [
            "https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR155/001/SRR1553421/SRR1553421_1.fastq.gz",
            "https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR155/001/SRR1553421/SRR1553421_2.fastq.gz"
        ],
        "info": "20 MB, 20 MB files; 0.3 million reads; 57.4 million sequenced bases"
    },

$ bio search PRJNA257197 > data.json

$ cat data.json | jq -r '.[].run_accession' | head -5
SRR1553426
SRR1553428
SRR1553432
SRR1553441
SRR1553442

$ cat data.json | jq -r '.[]|[.run_accession,.read_count]|@tsv' | head -5
SRR1553426      2446829
SRR1553428      732254
SRR1553432      482062
SRR1553441      531144
SRR1553442      1060564

 

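→ jq -r '.[].run_accession': for each element of the top-level array (.[]), extracts the value stored under the run_accession key; -r prints raw strings rather than JSON-quoted ones

→ jq -r '.[]|[.run_accession,.read_count]|@tsv': for each element, collects the run_accession and read_count values into an array and prints it as one tab-separated line

(tsv: tab-separated values)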
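
For readers who prefer a script to jq, the same two extractions can be sketched in Python with the standard json module. To keep it self-contained, the example embeds two records trimmed to the relevant keys (values taken from the jq output above) instead of reading data.json. With a real file, json.load on the opened file would replace json.loads.

```python
import json

# Two records trimmed to the keys used here; values come from the jq output above.
data = json.loads("""
[
    {"run_accession": "SRR1553426", "read_count": "2446829"},
    {"run_accession": "SRR1553428", "read_count": "732254"}
]
""")

# Equivalent of: jq -r '.[].run_accession'
accessions = [rec["run_accession"] for rec in data]
print("\n".join(accessions))

# Equivalent of: jq -r '.[]|[.run_accession,.read_count]|@tsv'
tsv_lines = ["\t".join([rec["run_accession"], rec["read_count"]]) for rec in data]
print("\n".join(tsv_lines))
```

The list comprehension plays the role of .[] (iterate over the top-level array), and the "\t".join call plays the role of @tsv.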

 

$ bio search symbol:HBB --species human --fields ensembl
[
    {
        "ensembl": {
            "gene": "ENSG00000244734",
            "protein": [
                "ENSP00000333994",
                "ENSP00000369671",
                "ENSP00000488004",
                "ENSP00000494175",
                "ENSP00000496200"
            ],
            "transcript": [
                "ENST00000335295",
                "ENST00000380315",
                "ENST00000475226",
                "ENST00000485743",
                "ENST00000633227",
                "ENST00000647020"
            ],
            "translation": [
                {
                    "protein": "ENSP00000333994",
                    "rna": "ENST00000335295"
                },
                {
                    "protein": "ENSP00000494175",
                    "rna": "ENST00000647020"
                },
                {
                    "protein": "ENSP00000488004",
                    "rna": "ENST00000633227"
                },
                {
                    "protein": "ENSP00000496200",
                    "rna": "ENST00000485743"
                },
                {
                    "protein": "ENSP00000369671",
                    "rna": "ENST00000380315"
                }
            ],
            "type_of_gene": "protein_coding"
        },
        "name": "hemoglobin subunit beta",
        "symbol": "HBB",
        "taxid": 9606,
        "taxname": "Homo sapiens"
    }
]

$ bio search symbol:HBB --species human --fields ensembl > DATA.json

$ cat DATA.json | jq -r '.[].ensembl.translation[]|[.protein,.rna]|@tsv'
ENSP00000333994 ENST00000335295
ENSP00000494175 ENST00000647020
ENSP00000488004 ENST00000633227
ENSP00000496200 ENST00000485743
ENSP00000369671 ENST00000380315
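
The nested filter above walks three levels: each hit, its ensembl object, and each entry of the translation list. A Python sketch of the same traversal is shown below, using a trimmed copy of the output above with two of the five translation entries. It also builds a transcript-to-protein lookup, which is my own addition for illustration.

```python
import json

# Trimmed copy of the output above (two of the five translation entries).
data = json.loads("""
[
    {
        "ensembl": {
            "gene": "ENSG00000244734",
            "translation": [
                {"protein": "ENSP00000333994", "rna": "ENST00000335295"},
                {"protein": "ENSP00000494175", "rna": "ENST00000647020"}
            ]
        },
        "symbol": "HBB"
    }
]
""")

# Equivalent of: jq -r '.[].ensembl.translation[]|[.protein,.rna]|@tsv'
pairs = [
    (item["protein"], item["rna"])
    for hit in data
    for item in hit["ensembl"]["translation"]
]
for protein, rna in pairs:
    print(f"{protein}\t{rna}")

# Lookup from transcript ID to protein ID (illustrative addition).
protein_for = {rna: protein for protein, rna in pairs}
print(protein_for["ENST00000335295"])
```

A lookup like this is handy when a list of transcript IDs has to be matched to the corresponding protein IDs.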

 

 

* For more on the bio package, see https://www.bioinfo.help/.

 


Reference

The Biostar Handbook: 2nd Edition - István Albert