Bioinformatics Scribbles

[Linux] Downloading Complete Genomic Data

Hazel Y. 2023. 12. 21. 16:12

Genomic data is available from many sources, but three platforms are the most widely used: UCSC, Ensembl, and NCBI.

 

 

1. UCSC

- The data file used in the previous post, "Human Genome Data Example" (livelyhheesun.tistory.com), was downloaded from UCSC, so refer to that post. The sequence data used there is at https://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/, which hosts data for hg38 (GRCh38, Genome Reference Consortium Human Build 38).

 

 

2. Ensembl (https://ftp.ensembl.org/pub/)

 


- The data is organized by release. As of this writing the newest directory is release 111, but its files have not been fully published yet, so I will download one of the data files from the release 110 directory (released in July 2023) instead.

$ wget http://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.22.fa.gz
--2023-12-20 21:33:10--  http://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.22.fa.gz
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.169
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.169|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11389810 (11M) [application/x-gzip]
Saving to: 'Homo_sapiens.GRCh38.dna.chromosome.22.fa.gz'

Homo_sapiens.GRCh3 100%[================>]  10.86M   224KB/s    in 55s

2023-12-20 21:34:06 (204 KB/s) - 'Homo_sapiens.GRCh38.dna.chromosome.22.fa.gz' saved [11389810/11389810]
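
The URL above follows a regular pattern, so other chromosomes or releases can be fetched by changing two values. Below is a minimal sketch of that idea; the helper name `ensembl_chr_url` is my own and not part of any Ensembl tool, and it assumes the directory layout stays the same across releases.

```shell
# Build the Ensembl FTP URL for one human chromosome FASTA file.
# $1 = release number, $2 = chromosome name (hypothetical helper).
ensembl_chr_url() {
    printf 'http://ftp.ensembl.org/pub/release-%s/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.%s.fa.gz\n' "$1" "$2"
}

# Print the URLs for chromosomes 21 and 22 of release 110.
for chrom in 21 22; do
    ensembl_chr_url 110 "$chrom"
done

# To download instead of printing, pass a URL to wget:
#   wget "$(ensembl_chr_url 110 22)"
# After downloading, `gzip -t <file>` checks that the archive is intact.
```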

 

 

+ You can also download the data with PyEnsembl (https://pypi.org/project/pyensembl/), a Python interface to Ensembl reference genome metadata.

 

 

3. NCBI (https://ftp.ncbi.nlm.nih.gov/genomes/)

 


 

(1) Download assembly_summary_refseq.txt (a summary file of all RefSeq genome assemblies at NCBI)

$ wget https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt
--2023-12-20 22:01:48--  https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.11, 130.14.250.12, 2607:f220:41e:250::12, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 159175477 (152M) [text/plain]
Saving to: 'assembly_summary_refseq.txt'

assembly_summary_r 100%[================>] 151.80M   559KB/s    in 10m 0s

2023-12-20 22:11:52 (259 KB/s) - 'assembly_summary_refseq.txt' saved [159175477/159175477]
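
Before filtering the file, it helps to know which column holds what. The sketch below numbers the column names from the header line. It assumes the header is the first line that starts with `#` and contains `assembly_accession`; `list_columns` is a helper name I made up.

```shell
# Print each column name of an assembly summary file with its 1-based index.
# Assumption: the header is the first '#' line containing 'assembly_accession'.
list_columns() {
    grep -m1 '^#.*assembly_accession' "$1" | tr '\t' '\n' | awk '{ print NR ": " $0 }'
}

# Only run against the real file if it has been downloaded.
if [ -f assembly_summary_refseq.txt ]; then
    list_columns assembly_summary_refseq.txt
fi
```

The index printed next to each name is the number to pass to `cut -f`.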

 

(2-1) e.g., Bacillus cereus

$ cat assembly_summary_refseq.txt | grep cereus | wc -l
1212

→ 1,212 Bacillus cereus genomes
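
Note that `grep cereus` counts every row containing that string anywhere, including in fields such as the strain or submitter. A stricter count can test only the organism name. This is a sketch under the assumption that the organism name is column 8 of the tab-separated file, as in the row shown in (2-2); `count_organism` is a hypothetical helper.

```shell
# Count rows whose organism name (column 8) starts with the given species name.
# $1 = summary file, $2 = species name (hypothetical helper).
count_organism() {
    awk -F'\t' -v name="$2" 'index($8, name) == 1 { n++ } END { print n + 0 }' "$1"
}

# Only run against the real file if it has been downloaded.
if [ -f assembly_summary_refseq.txt ]; then
    count_organism assembly_summary_refseq.txt "Bacillus cereus"
fi
```

A prefix match is used rather than an exact match so that names with the strain appended are still counted. The result may differ from the `grep` count.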

 

(2-2) strain ATCC 11778

$ cat assembly_summary_refseq.txt | grep cereus | grep 11778
GCF_031316975.1 PRJNA224116     SAMN12236512    VKPK00000000.1  na      13961396    Bacillus cereus strain=ATCC 11778       na      latest  Scaffold   Major    Full    2023/09/08      ASM3131697v1    American Type Culture Collection (ATCC)     GCA_031316975.1 identical       https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/031/316/975/GCF_031316975.1_ASM3131697v1       na      na na       haploid bacteria        5233529 5233344 35.500000       0       584584      NCBI RefSeq     NCBI Prokaryotic Genome Annotation Pipeline (PGAP) 2023/09/29       5653    5231    128     na

 

(2-3) Directory location of the data for this strain

$ cat assembly_summary_refseq.txt | grep cereus | grep 11778 | cut -f 20
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/031/316/975/GCF_031316975.1_ASM3131697v1
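
Chaining `grep cereus | grep 11778` works here, but `11778` could also match an accession or another numeric field. The sketch below matches on specific columns instead. It assumes columns 8, 9, and 20 hold the organism name, a `strain=...` value, and the FTP path, as in the row above; `ftp_path_for_strain` is a hypothetical helper.

```shell
# Print the FTP directory (column 20) for an exact organism and strain.
# $1 = summary file, $2 = organism name, $3 = strain (hypothetical helper).
ftp_path_for_strain() {
    awk -F'\t' -v org="$2" -v strain="$3" \
        '$8 == org && $9 == ("strain=" strain) { print $20 }' "$1"
}

# Only run against the real file if it has been downloaded.
if [ -f assembly_summary_refseq.txt ]; then
    ftp_path_for_strain assembly_summary_refseq.txt "Bacillus cereus" "ATCC 11778"
fi
```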

 

(2-4) Download the GFF file

$ wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/031/316/975/GCF_031316975.1_ASM3131697v1/GCF_031316975.1_ASM3131697v1_genomic.gff.gz
--2023-12-21 14:58:07--  https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/031/316/975/GCF_031316975.1_ASM3131697v1/GCF_031316975.1_ASM3131697v1_genomic.gff.gz
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.13, 130.14.250.7, 2607:f220:41e:250::13, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.13|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 505011 (493K) [application/x-gzip]
Saving to: 'GCF_031316975.1_ASM3131697v1_genomic.gff.gz'

GCF_031316975.1_AS 100%[================>] 493.17K   497KB/s    in 1.0s

2023-12-21 14:58:12 (497 KB/s) - 'GCF_031316975.1_ASM3131697v1_genomic.gff.gz' saved [505011/505011]
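
In the URL above, the file name is the directory name followed by `_genomic.gff.gz`. The sketch below derives the GFF URL from the directory path found in (2-3), so it does not have to be typed by hand. `gff_url` is a helper name of my own.

```shell
# Derive the GFF download URL from an assembly directory URL.
# $1 = directory URL as printed by `cut -f 20` (hypothetical helper).
gff_url() {
    dir="${1%/}"    # drop a trailing slash, if any
    printf '%s/%s_genomic.gff.gz\n' "$dir" "${dir##*/}"
}

gff_url "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/031/316/975/GCF_031316975.1_ASM3131697v1"

# To download directly:
#   wget "$(gff_url "<directory URL>")"
```

For the directory shown, this prints the same URL that was passed to wget in this step.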

 

 

+ Downloading NCBI data with rsync (this creates a new 'genome' folder and saves the downloaded files there)

$ rsync --copy-links --recursive --times --verbose rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/031/316/975/GCF_031316975.1_ASM3131697v1 genome/

Warning Notice!

You are accessing a U.S. Government information system which includes this
computer, network, and all attached devices. This system is for
Government-authorized use only. Unauthorized use of this system may result in
disciplinary action and civil and criminal penalties. System users have no
expectation of privacy regarding any communications or data processed by this
system. At any time, the government may monitor, record, or seize any
communication or data transiting or stored on this information system.

-------------------------------------------------------------------------------

Welcome to the NCBI rsync server.


receiving incremental file list
created directory genome
GCF_031316975.1_ASM3131697v1/
GCF_031316975.1_ASM3131697v1/GCF_031316975.1_ASM3131697v1_ani_contam_ranges.txt
GCF_031316975.1_ASM3131697v1/GCF_031316975.1_ASM3131697v1_ani_report.txt
GCF_031316975.1_ASM3131697v1/GCF_031316975.1_ASM3131697v1_assembly_report.txt
GCF_031316975.1_ASM3131697v1/GCF_031316975.1_ASM3131697v1_assembly_stats.txt
GCF_031316975.1_ASM3131697v1/GCF_031316975.1_ASM3131697v1_cds_from_genomic.fna.gz
GCF_031316975.1_ASM3131697v1/GCF_031316975.1_ASM3131697v1_feature_count.txt.gz
GCF_031316975.1_ASM3131697v1/GCF_031316975.1_ASM3131697v1_feature_table.txt.gz
GCF_031316975.1_ASM3131697v1/GCF_031316975.1_ASM3131697v1_genomic.fna.gz
GCF_031316975.1_ASM3131697v1/GCF_031316975.1_ASM3131697v1_genomic.gbff.gz
GCF_031316975.1_ASM3131697v1/GCF_031316975.1_ASM3131697v1_genomic.gff.gz
GCF_031316975.1_ASM3131697v1/GCF_031316975.1_ASM3131697v1_genomic.gtf.gz
GCF_031316975.1_ASM3131697v1/GCF_031316975.1_ASM3131697v1_genomic_gaps.txt.gz
GCF_031316975.1_ASM3131697v1/GCF_031316975.1_ASM3131697v1_protein.faa.gz
GCF_031316975.1_ASM3131697v1/GCF_031316975.1_ASM3131697v1_protein.gpff.gz
GCF_031316975.1_ASM3131697v1/GCF_031316975.1_ASM3131697v1_rna_from_genomic.fna.gz
GCF_031316975.1_ASM3131697v1/GCF_031316975.1_ASM3131697v1_translated_cds.faa.gz
GCF_031316975.1_ASM3131697v1/GCF_031316975.1_ASM3131697v1_wgsmaster.gbff.gz
GCF_031316975.1_ASM3131697v1/README.txt
GCF_031316975.1_ASM3131697v1/annotation_hashes.txt
GCF_031316975.1_ASM3131697v1/assembly_status.txt
GCF_031316975.1_ASM3131697v1/md5checksums.txt

sent 431 bytes  received 13,712,536 bytes  2,109,687.23 bytes/sec
total size is 13,707,539  speedup is 1.00

 

* Unlike the previous method, which downloads the desired GFF file individually, rsync downloads every file in the strain's directory at once.
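
The directory fetched by rsync includes md5checksums.txt, which can be used to check that the files arrived intact. This is a minimal sketch assuming GNU coreutils `md5sum` (the `--ignore-missing` option needs version 8.25 or later) and the usual `<hash>  ./<file>` line format; `verify_assembly_dir` is a hypothetical helper.

```shell
# Verify the files in a downloaded assembly directory against md5checksums.txt.
# $1 = directory containing md5checksums.txt (hypothetical helper).
verify_assembly_dir() {
    # Run in a subshell so the caller's working directory is unchanged.
    # --quiet prints only failures; --ignore-missing skips files that are
    # listed in the checksum file but were not downloaded.
    ( cd "$1" && md5sum --check --quiet --ignore-missing md5checksums.txt )
}

# Only run if the rsync download above has been done.
if [ -d genome/GCF_031316975.1_ASM3131697v1 ]; then
    verify_assembly_dir genome/GCF_031316975.1_ASM3131697v1 && echo "all checksums match"
fi
```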

 

 

+ Another option is NCBI Genome Downloader (https://github.com/kblin/ncbi-genome-download), a set of scripts to download genomes from the NCBI FTP servers.

 

 

+ Refgenie (https://refgenie.databio.org/en/latest/) is another option. It is a reference genome manager that handles storage, access, and transfer of reference genome resources, and it provides command-line and Python interfaces for downloading pre-built reference genome "assets".


Reference

The Biostar Handbook: 2nd Edition - István Albert