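The Sequence Read Archive (SRA), formerly called the Short Read Archive, is the NCBI service for storing sequence data and its associated metadata. An alternative platform is the European Nucleotide Archive (ENA).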
1. Downloading FASTQ files with an SRA run accession number
- e.g., SRR1553610 (paired-end read)
$ fastq-dump --split-files SRR1553610
Read 219837 spots for SRR1553610
Written 219837 spots for SRR1553610
→ The reads are saved as two files (SRR1553610_1.fastq and SRR1553610_2.fastq).
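A quick sanity check after a paired-end download is that both files hold the same number of records. The sketch below is only an illustration: it writes two tiny made-up FASTQ files (hypothetical stand-ins for SRR1553610_1.fastq and SRR1553610_2.fastq) and counts records under the assumption that every record spans exactly four lines.

```python
import os
import tempfile


def count_fastq_records(path):
    """Count records, assuming each record spans exactly four lines."""
    with open(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: line count {n_lines} is not a multiple of 4")
    return n_lines // 4


# Made-up miniature stand-ins for the two files written by --split-files.
demo_dir = tempfile.mkdtemp()
read1 = os.path.join(demo_dir, "demo_1.fastq")
read2 = os.path.join(demo_dir, "demo_2.fastq")
with open(read1, "w") as fh:
    fh.write("@r1/1\nACGT\n+\nIIII\n@r2/1\nGGCA\n+\nIIII\n")
with open(read2, "w") as fh:
    fh.write("@r1/2\nTTGA\n+\nIIII\n@r2/2\nCCAT\n+\nIIII\n")

n1 = count_fastq_records(read1)
n2 = count_fastq_records(read2)
print(n1, n2, "paired OK" if n1 == n2 else "MISMATCH")
```

Pointing `count_fastq_records` at the real downloaded files should work the same way, as long as they use the four-line layout.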
2. When the SRA run accession number is not known
- Look up the SRA run accession numbers from the project number
- e.g., PRJNA257197
(1) runinfo.csv
$ esearch -db sra -query PRJNA257197
<ENTREZ_DIRECT>
<Db>sra</Db>
<WebEnv>MCID_658e6e672321f87bb12f52e9</WebEnv>
<QueryKey>1</QueryKey>
<Count>891</Count>
<Step>1</Step>
</ENTREZ_DIRECT>
→ This project contains 891 sequencing runs.
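If the count is needed inside a script rather than read by eye, the small XML block that esearch prints can be parsed directly. This is a minimal sketch using Python's standard library; the XML string is simply the esearch output shown above, pasted in as text.

```python
import xml.etree.ElementTree as ET

# Copied from the esearch output shown above.
esearch_output = """<ENTREZ_DIRECT>
  <Db>sra</Db>
  <WebEnv>MCID_658e6e672321f87bb12f52e9</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>891</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>"""

root = ET.fromstring(esearch_output)
run_count = int(root.findtext("Count"))
print(f"{root.findtext('Db')}: {run_count} records")
```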
$ esearch -db sra -query PRJNA257197 | efetch -format runinfo > runinfo.csv
$ cat runinfo.csv | csvcut -c 1 | grep SRR | head -3
SRR1972917
SRR1972918
SRR1972919
$ cat runinfo.csv | csvcut -c 1 | grep SRR | head -3 > runids.txt
→ Only the first 3 SRA run accession numbers are extracted and saved to a file named runids.txt.
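The same extraction can be sketched in Python for cases where csvcut is not installed. The CSV text below is a hypothetical two-column excerpt (a real runinfo.csv has many more columns), and `first_run_ids` is an illustrative helper name, not part of any tool. It takes the first column, keeps values starting with SRR (roughly what `grep SRR` does, which also drops the header row), and stops after three.

```python
import csv
import io
from itertools import islice

# Hypothetical excerpt; a real runinfo.csv has many more columns.
runinfo_text = """Run,LibraryLayout
SRR1972917,PAIRED
SRR1972918,PAIRED
SRR1972919,PAIRED
SRR1972920,PAIRED
"""


def first_run_ids(text, limit=3):
    reader = csv.reader(io.StringIO(text))
    # Take column 1 (like csvcut -c 1), keep SRR accessions (like grep SRR).
    ids = (row[0] for row in reader if row and row[0].startswith("SRR"))
    return list(islice(ids, limit))


run_ids = first_run_ids(runinfo_text)
print("\n".join(run_ids))
```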
$ cat runids.txt | parallel fastq-dump -X 10000 --split-files {}
Read 10000 spots for SRR1972919
Written 10000 spots for SRR1972919
Read 10000 spots for SRR1972917
Written 10000 spots for SRR1972917
Read 10000 spots for SRR1972918
Written 10000 spots for SRR1972918
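→ The FASTQ files for the three SRA runs listed in runids.txt (SRR1972917, SRR1972918, SRR1972919) are downloaded. Because parallel runs the downloads concurrently, the log lines can appear in a different order from the input.
(-X 10000: extract only up to the 10,000th spot of each run)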
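For readers who prefer to drive the downloads from Python, the following is a rough sketch of the same idea: one fastq-dump command per run ID, executed several at a time. The function names (`build_command`, `download_all`, `dry_run`) are made up for this example. To keep it safe to run anywhere, the example call uses a dry-run runner that only returns the command line instead of executing it.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(run_id, max_spots=10000):
    """Same flags as the shell example: -X caps the spots, --split-files splits mates."""
    return ["fastq-dump", "-X", str(max_spots), "--split-files", run_id]


def download_all(run_ids, runner=subprocess.run, workers=3):
    """Apply `runner` to one command per run ID, several at a time."""
    commands = [build_command(run_id) for run_id in run_ids]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order even if jobs finish out of order.
        return list(pool.map(runner, commands))


def dry_run(command):
    """Return the command line as text instead of executing it."""
    return " ".join(command)


planned = download_all(["SRR1972917", "SRR1972918", "SRR1972919"], runner=dry_run)
print("\n".join(planned))
```

Leaving out `runner=dry_run` would hand each command to `subprocess.run`, which assumes fastq-dump is installed and on the PATH.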
(2) document summary (docsum.xml)
$ esearch -db sra -query PRJNA257197 | efetch -format docsum > docsum.xml
$ cat docsum.xml | more
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE DocumentSummarySet>
<DocumentSummarySet status="OK">
<DbBuild>Build231228-1743m.1</DbBuild>
<DocumentSummary>
<Id>1442232</Id>
<ExpXml>
<Summary>
<Title>Zaire ebolavirus genome sequencing from 2014 outbreak in Sierra Leone: Sample W220.0</Title>
<Platform instrument_model="Illumina HiSeq 2500">ILLUMINA</Platform>
<Statistics total_runs="1" total_spots="8345287" total_bases="1685747974" total_size="1045828929" load_done="true" cluster_name="public"/>
</Summary>
<Submitter acc="SRA178666" center_name="BI" contact_name=" " lab_name=""/>
<Experiment acc="SRX994253" ver="1" status="public" name="Zaire ebolavirus genome sequencing from 2014 outbreak in Sierra Leone: Sample W220.0"/>
<Study acc="SRP045416" name="Zaire ebolavirus Genome sequencing"/>
<Organism taxid="186538" ScientificName="Zaire ebolavirus"/>
<Sample acc="SRS908478" name=""/>
<Instrument ILLUMINA="Illumina HiSeq 2500"/>
<Library_descriptor>
<LIBRARY_NAME>W220.0.l1</LIBRARY_NAME>
<LIBRARY_STRATEGY>RNA-Seq</LIBRARY_STRATEGY>
<LIBRARY_SOURCE>TRANSCRIPTOMIC</LIBRARY_SOURCE>
<LIBRARY_SELECTION>cDNA</LIBRARY_SELECTION>
<LIBRARY_LAYOUT>
<PAIRED/>
</LIBRARY_LAYOUT>
</Library_descriptor>
<Bioproject>PRJNA257197</Bioproject>
<Biosample>SAMN03253746</Biosample>
</ExpXml>
<Runs>
<Run acc="SRR1972976" total_spots="8345287" total_bases="1685747974" load_done="true" is_public="true" cluster_name="public" static_data_available="true"/>
</Runs>
<ExtLinks/>
<CreateDate>2015/04/14</CreateDate>
<UpdateDate>2015/04/14</UpdateDate>
</DocumentSummary>
...
$ cat docsum.xml | xtract -pattern DocumentSummary -element Bioproject,Biosample,Run@acc | head
PRJNA257197 SAMN03253746 SRR1972976
PRJNA257197 SAMN03253745 SRR1972975
PRJNA257197 SAMN03253744 SRR1972974
PRJNA257197 SAMN03254300 SRR1972973
PRJNA257197 SAMN03254299 SRR1972972
PRJNA257197 SAMN03254298 SRR1972971
PRJNA257197 SAMN03254296 SRR1972970
PRJNA257197 SAMN03254295 SRR1972969
PRJNA257197 SAMN03254294 SRR1972968
PRJNA257197 SAMN03254291 SRR1972967
→ This is considerably more complicated than using the runinfo.csv file.
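To make it clearer what the xtract call pulls out, here is a rough Python equivalent built on the standard library. The XML string is a heavily trimmed copy of the first record shown above, with most fields omitted. Unlike xtract, this sketch prints one row per run, which is a design choice made for the example rather than a reproduction of xtract's behavior.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the first record shown above; most fields are omitted.
docsum_text = """<DocumentSummarySet status="OK">
<DocumentSummary>
  <Id>1442232</Id>
  <ExpXml>
    <Bioproject>PRJNA257197</Bioproject>
    <Biosample>SAMN03253746</Biosample>
  </ExpXml>
  <Runs>
    <Run acc="SRR1972976" total_spots="8345287"/>
  </Runs>
</DocumentSummary>
</DocumentSummarySet>"""

rows = []
for summary in ET.fromstring(docsum_text).iter("DocumentSummary"):
    bioproject = summary.findtext(".//Bioproject")
    biosample = summary.findtext(".//Biosample")
    # One output row per run, since an experiment may list several runs.
    for run in summary.iter("Run"):
        rows.append((bioproject, biosample, run.get("acc")))

for row in rows:
    print("\t".join(row))
```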
▶ The docsum.xml file can, however, be made a little easier to read.
$ alias pretty="python -c 'import sys;import xml.dom.minidom;s=sys.stdin.read();print(xml.dom.minidom.parseString(s).toprettyxml());'"
$ cat docsum.xml | pretty | more
<?xml version="1.0" ?>
<!DOCTYPE DocumentSummarySet>
<DocumentSummarySet status="OK">
<DbBuild>Build231228-1743m.1</DbBuild>
<DocumentSummary>
<Id>1442232</Id>
<ExpXml>
<Summary>
<Title>Zaire ebolavirus genome sequencing from 2014 outbreak in Sierra Leone: Sample W220.0</Title>
<Platform instrument_model="Illumina HiSeq 2500">ILLUMINA</Platform>
<Statistics total_runs="1" total_spots="8345287" total_bases="1685747974" total_size="1045828929" load_done="true" cluster_name="public"/>
</Summary>
<Submitter acc="SRA178666" center_name="BI" contact_name=" " lab_name=""/>
<Experiment acc="SRX994253" ver="1" status="public" name="Zaire ebolavirus genome sequencing from 2014 outbreak in Sierra Leone: Sample W220.0"/>
<Study acc="SRP045416" name="Zaire ebolavirus Genome sequencing"/>
<Organism taxid="186538" ScientificName="Zaire ebolavirus"/>
<Sample acc="SRS908478" name=""/>
<Instrument ILLUMINA="Illumina HiSeq 2500"/>
<Library_descriptor>
<LIBRARY_NAME>W220.0.l1</LIBRARY_NAME>
<LIBRARY_STRATEGY>RNA-Seq</LIBRARY_STRATEGY>
<LIBRARY_SOURCE>TRANSCRIPTOMIC</LIBRARY_SOURCE>
<LIBRARY_SELECTION>cDNA</LIBRARY_SELECTION>
<LIBRARY_LAYOUT>
<PAIRED/>
</LIBRARY_LAYOUT>
</Library_descriptor>
<Bioproject>PRJNA257197</Bioproject>
<Biosample>SAMN03253746</Biosample>
</ExpXml>
<Runs>
<Run acc="SRR1972976" total_spots="8345287" total_bases="1685747974" load_done="true" is_public="true" cluster_name="public" static_data_available="true"/>
</Runs>
<ExtLinks/>
<CreateDate>2015/04/14</CreateDate>
<UpdateDate>2015/04/14</UpdateDate>
</DocumentSummary>
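The one-line alias can also be written out as a short script, which makes the indentation step easier to see. This is a minimal sketch: the compact XML string is a made-up, shortened record shaped like the docsum output, and the two-space `indent` is an arbitrary choice.

```python
import xml.dom.minidom

# Compact sample shaped like one docsum record (most fields omitted).
compact = ('<DocumentSummarySet status="OK"><DocumentSummary><Id>1442232</Id>'
           '<Runs><Run acc="SRR1972976"/></Runs></DocumentSummary></DocumentSummarySet>')

# toprettyxml() puts each element on its own line, indented by nesting depth.
pretty_text = xml.dom.minidom.parseString(compact).toprettyxml(indent="  ")
print(pretty_text)
```

One caveat: when the input already contains line breaks between tags, minidom tends to keep them as text nodes, so the output may pick up extra blank lines.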
Reference
The Biostar Handbook: 2nd Edition - István Albert