Several sources of genomic data exist; three of the most widely used platforms are UCSC, Ensembl, and NCBI.
1. UCSC
- The data file used in the previous post was downloaded from UCSC, so see that post for details.
2. Ensembl (https://ftp.ensembl.org/pub/)
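- Data are organized by release. At the time of writing the most recent release is release 111, but that directory does not yet contain properly registered files, so we will instead download one of the data files from the release 110 directory, released in November 2023.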
$ wget http://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.22.fa.gz
--2023-12-20 21:33:10-- http://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.22.fa.gz
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.169
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.169|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11389810 (11M) [application/x-gzip]
Saving to: 'Homo_sapiens.GRCh38.dna.chromosome.22.fa.gz'
Homo_sapiens.GRCh3 100%[================>] 10.86M 224KB/s in 55s
2023-12-20 21:34:06 (204 KB/s) - 'Homo_sapiens.GRCh38.dna.chromosome.22.fa.gz' saved [11389810/11389810]
+ The file can also be downloaded with PyEnsembl (https://pypi.org/project/pyensembl/).
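Once a gzipped FASTA file like the one above has been downloaded, it can be worth checking that the archive is intact and seeing roughly what it contains. The following is a minimal sketch, not part of the original walkthrough: `toy.fa.gz` is an invented stand-in so the commands run without the download, and the same pipeline should apply to Homo_sapiens.GRCh38.dna.chromosome.22.fa.gz.

```shell
# Toy stand-in for the downloaded FASTA archive (assumption: invented content)
printf '>22 toy record\nACGTNN\nACGT\n' | gzip > toy.fa.gz

# gzip -t tests archive integrity and exits non-zero on a corrupt file
gzip -t toy.fa.gz && echo "gzip OK"

# Count FASTA records: every record starts with a '>' header line
n_records=$(gzip -dc toy.fa.gz | grep -c '^>')

# Count bases: drop header lines, strip newlines, count what is left
n_bases=$(gzip -dc toy.fa.gz | grep -v '^>' | tr -d '\n' | wc -c | tr -d ' ')

echo "records=$n_records bases=$n_bases"
```

Streaming with `gzip -dc` avoids writing the uncompressed file to disk, which matters more for whole-genome files than for a single chromosome.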
3. NCBI (https://ftp.ncbi.nlm.nih.gov/genomes/)
(1) Download assembly_summary_refseq.txt (a summary of all RefSeq genomes on NCBI)
$ wget https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt
--2023-12-20 22:01:48-- https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.11, 130.14.250.12, 2607:f220:41e:250::12, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 159175477 (152M) [text/plain]
Saving to: 'assembly_summary_refseq.txt'
assembly_summary_r 100%[================>] 151.80M 559KB/s in 10m 0s
2023-12-20 22:11:52 (259 KB/s) - 'assembly_summary_refseq.txt' saved [159175477/159175477]
(2-1) e.g., Bacillus cereus
$ cat assembly_summary_refseq.txt | grep cereus | wc -l
1212
→ 1,212 Bacillus cereus genomes
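Note that `grep cereus` counts every line containing the string anywhere, not only lines whose organism is Bacillus cereus. A stricter count can match on the organism name column alone. The sketch below assumes that column 8 of assembly_summary_refseq.txt is the organism name (which is consistent with the row shown in the next step); `toy_summary.txt` and its rows are invented for illustration.

```shell
# Toy stand-in for assembly_summary_refseq.txt (assumption: invented rows,
# 9 tab-separated columns; column 8 = organism name, column 9 = strain)
printf 'GCF_A\tp\ts\tw\tna\t0\t0\tBacillus cereus\tstrain=X1\n'  > toy_summary.txt
printf 'GCF_B\tp\ts\tw\tna\t0\t0\tBacillus cereus\tstrain=X2\n' >> toy_summary.txt
printf 'GCF_C\tp\ts\tw\tna\t0\t0\tEscherichia coli\tstrain=cereus01\n' >> toy_summary.txt

# Loose count: any line that contains "cereus" in any column
n_grep=$(grep -c cereus toy_summary.txt)

# Strict count: only rows whose organism name starts with "Bacillus cereus"
n_awk=$(awk -F'\t' '$8 ~ /^Bacillus cereus/ {n++} END {print n+0}' toy_summary.txt)

echo "grep=$n_grep awk=$n_awk"
```

On the toy data the two counts differ because the third row mentions "cereus" only in its strain name; whether they differ on the real file depends on its contents.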
(2-2) strain ATCC 11778
$ cat assembly_summary_refseq.txt | grep cereus | grep 11778
GCF_031316975.1 PRJNA224116 SAMN12236512 VKPK00000000.1 na 1396 1396 Bacillus cereus strain=ATCC 11778 na latest Scaffold Major Full 2023/09/08 ASM3131697v1 American Type Culture Collection (ATCC) GCA_031316975.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/031/316/975/GCF_031316975.1_ASM3131697v1 na na na haploid bacteria 5233529 5233344 35.500000 0 584 584 NCBI RefSeq NCBI Prokaryotic Genome Annotation Pipeline (PGAP) 2023/09/29 5653 5231 128 na
(2-3) Directory location of the data for this strain
$ cat assembly_summary_refseq.txt | grep cereus | grep 11778 | cut -f 20
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/031/316/975/GCF_031316975.1_ASM3131697v1
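The chained `grep` calls can also be written as a single `awk` filter that matches exact column values and prints column 20. This is a sketch under assumptions: the column positions (8 = organism name, 9 = strain, 20 = directory URL) are inferred from the row and the `cut -f 20` call above, and `row`, `toy_summary20.txt`, the `GCF_TOY*` accessions, and the example.org URLs are all hypothetical.

```shell
# Helper that prints one invented 20-column row.
# Usage: row ACCESSION ORGANISM STRAIN FTP_PATH
row() {
  printf '%s\tp\ts\tw\tna\t0\t0\t%s\t%s\tna\tlatest\tScaffold\tMajor\tFull\tdate\tasm\tsubmitter\tgca\tidentical\t%s\n' \
    "$1" "$2" "$3" "$4"
}

# Toy stand-in for assembly_summary_refseq.txt (assumption: invented rows)
{
  row GCF_TOY1 'Bacillus cereus' 'strain=ATCC 11778' 'https://example.org/GCF_TOY1'
  row GCF_TOY2 'Bacillus cereus' 'strain=ATCC 14579' 'https://example.org/GCF_TOY2'
} > toy_summary20.txt

# Select the row by exact organism and strain, then print the directory URL
ftp_path=$(awk -F'\t' '$8 == "Bacillus cereus" && $9 == "strain=ATCC 11778" {print $20}' toy_summary20.txt)
echo "$ftp_path"
```

Exact comparisons on named columns avoid accidental matches, for example an unrelated row that happens to contain "11778" in an accession or date.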
(2-4) Download the GFF file
$ wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/031/316/975/GCF_031316975.1_ASM3131697v1/GCF_031316975.1_ASM3131697v1_genomic.gff.gz
--2023-12-21 14:58:07-- https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/031/316/975/GCF_031316975.1_ASM3131697v1/GCF_031316975.1_ASM3131697v1_genomic.gff.gz
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.13, 130.14.250.7, 2607:f220:41e:250::13, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.13|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 505011 (493K) [application/x-gzip]
Saving to: 'GCF_031316975.1_ASM3131697v1_genomic.gff.gz'
GCF_031316975.1_AS 100%[================>] 493.17K 497KB/s in 1.0s
2023-12-21 14:58:12 (497 KB/s) - 'GCF_031316975.1_ASM3131697v1_genomic.gff.gz' saved [505011/505011]
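The URL used in the `wget` call above follows a visible pattern: the strain directory, then the directory's own name, then a file suffix. A small sketch of building it from the directory path, so the URL does not have to be typed by hand (the variable names are illustrative, and the snippet only constructs the URL without downloading anything):

```shell
# Directory path as printed by the cut -f 20 step
ftp_path="https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/031/316/975/GCF_031316975.1_ASM3131697v1"

# The file prefix is the last path component of the directory
prefix=$(basename "$ftp_path")

# Append the suffix of the wanted file type (here the GFF annotation)
gff_url="${ftp_path}/${prefix}_genomic.gff.gz"
echo "$gff_url"
```

Passing `"$gff_url"` to `wget` should then fetch the same file as the command above.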
+ Downloading NCBI data with rsync (a new 'genome' directory is created and the downloaded data files are saved there)
$ rsync --copy-links --recursive --times --verbose rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/031/316/975/GCF_031316975.1_ASM3131697v1 genome/
Warning Notice!
You are accessing a U.S. Government information system which includes this
computer, network, and all attached devices. This system is for
Government-authorized use only. Unauthorized use of this system may result in
disciplinary action and civil and criminal penalties. System users have no
expectation of privacy regarding any communications or data processed by this
system. At any time, the government may monitor, record, or seize any
communication or data transiting or stored on this information system.
-------------------------------------------------------------------------------
Welcome to the NCBI rsync server.
receiving incremental file list
created directory genome
GCF_031316975.1_ASM3131697v1/
GCF_031316975.1_ASM3131697v1/GCF_031316975.1_ASM3131697v1_ani_contam_ranges.txt
GCF_031316975.1_ASM3131697v1/GCF_031316975.1_ASM3131697v1_ani_report.txt
GCF_031316975.1_ASM3131697v1/GCF_031316975.1_ASM3131697v1_assembly_report.txt
GCF_031316975.1_ASM3131697v1/GCF_031316975.1_ASM3131697v1_assembly_stats.txt
GCF_031316975.1_ASM3131697v1/GCF_031316975.1_ASM3131697v1_cds_from_genomic.fna.gz
GCF_031316975.1_ASM3131697v1/GCF_031316975.1_ASM3131697v1_feature_count.txt.gz
GCF_031316975.1_ASM3131697v1/GCF_031316975.1_ASM3131697v1_feature_table.txt.gz
GCF_031316975.1_ASM3131697v1/GCF_031316975.1_ASM3131697v1_genomic.fna.gz
GCF_031316975.1_ASM3131697v1/GCF_031316975.1_ASM3131697v1_genomic.gbff.gz
GCF_031316975.1_ASM3131697v1/GCF_031316975.1_ASM3131697v1_genomic.gff.gz
GCF_031316975.1_ASM3131697v1/GCF_031316975.1_ASM3131697v1_genomic.gtf.gz
GCF_031316975.1_ASM3131697v1/GCF_031316975.1_ASM3131697v1_genomic_gaps.txt.gz
GCF_031316975.1_ASM3131697v1/GCF_031316975.1_ASM3131697v1_protein.faa.gz
GCF_031316975.1_ASM3131697v1/GCF_031316975.1_ASM3131697v1_protein.gpff.gz
GCF_031316975.1_ASM3131697v1/GCF_031316975.1_ASM3131697v1_rna_from_genomic.fna.gz
GCF_031316975.1_ASM3131697v1/GCF_031316975.1_ASM3131697v1_translated_cds.faa.gz
GCF_031316975.1_ASM3131697v1/GCF_031316975.1_ASM3131697v1_wgsmaster.gbff.gz
GCF_031316975.1_ASM3131697v1/README.txt
GCF_031316975.1_ASM3131697v1/annotation_hashes.txt
GCF_031316975.1_ASM3131697v1/assembly_status.txt
GCF_031316975.1_ASM3131697v1/md5checksums.txt
sent 431 bytes received 13,712,536 bytes 2,109,687.23 bytes/sec
total size is 13,707,539 speedup is 1.00
* Unlike the previous approach, which downloads the desired GFF file individually, rsync downloads every file in the strain's directory at once.
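The rsync listing includes an md5checksums.txt file, which can be used to confirm that the files arrived intact. The sketch below is an assumption-laden illustration rather than a tested recipe for the real directory: it builds a toy directory and a checksum list in the "checksum, two spaces, ./filename" layout that `md5sum` itself produces (and which md5checksums.txt appears to follow), then verifies it with GNU `md5sum -c`.

```shell
# Toy stand-in for the rsync'ed strain directory (assumption: invented content)
mkdir -p toy_genome
printf 'toy annotation\n' > toy_genome/toy_genomic.gff

# Build a checksum list inside the directory, with paths relative to it
(cd toy_genome && md5sum ./toy_genomic.gff > md5checksums.txt)

# Verify every listed file; md5sum -c reports OK or FAILED per file
check_output=$(cd toy_genome && md5sum -c md5checksums.txt)
echo "$check_output"
```

For the real download the equivalent would be to run `md5sum -c md5checksums.txt` from inside genome/GCF_031316975.1_ASM3131697v1.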
+ Another option is NCBI Genome Downloader (https://github.com/kblin/ncbi-genome-download).
+ For Refgenie usage, see https://refgenie.databio.org/en/latest/.
Reference
The Biostar Handbook: 2nd Edition - István Albert