Bioinformatics Scribbles

[Linux] Downloading Data from the Sequence Read Archive

Hazel Y. 2023. 12. 29. 17:42

 

The Sequence Read Archive (SRA, formerly the Short Read Archive) is NCBI's repository for sequence data and its associated metadata. The European Nucleotide Archive (ENA) is an alternative platform.

 

Home - SRA - NCBI: www.ncbi.nlm.nih.gov
ENA Browser: www.ebi.ac.uk


1. Downloading FASTQ files with an SRA run accession number

- e.g., SRR1553610 (paired-end read)

$ fastq-dump --split-files SRR1553610
Read 219837 spots for SRR1553610
Written 219837 spots for SRR1553610

 

→ The reads are saved as two files (SRR1553610_1.fastq, SRR1553610_2.fastq), one per mate.
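A quick sanity check after a paired-end download is to confirm that both files hold the same number of records. The sketch below is only an illustration: the function names are hypothetical, and it assumes uncompressed FASTQ with exactly four lines per record (no wrapped sequence lines).

```python
def count_fastq_records(path):
    """Count records in an uncompressed FASTQ file.

    Assumption: every record is exactly four lines
    (header, sequence, '+', qualities).
    """
    with open(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def mates_match(path_1, path_2):
    """Return True when both mate files contain the same number of records."""
    return count_fastq_records(path_1) == count_fastq_records(path_2)
```

For the run above, `mates_match("SRR1553610_1.fastq", "SRR1553610_2.fastq")` would be the call to make once both files exist; with the 219837 spots reported by fastq-dump, each file should hold 219837 records.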

 

2. When you don't know the SRA run accession number

- Look up the SRA run accession numbers from the project number (BioProject accession)

- e.g., PRJNA257197

 

(1) runinfo.csv

$ esearch -db sra -query PRJNA257197
<ENTREZ_DIRECT>
  <Db>sra</Db>
  <WebEnv>MCID_658e6e672321f87bb12f52e9</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>891</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>

→ This project contains 891 sequencing runs (the value of <Count>).
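If the count is needed in a script rather than read by eye, the esearch output can be parsed as ordinary XML. This is a minimal sketch that uses the output shown above as an embedded string; `record_count` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# The esearch output from above, pasted in as a string.
ESEARCH_OUTPUT = """<ENTREZ_DIRECT>
  <Db>sra</Db>
  <WebEnv>MCID_658e6e672321f87bb12f52e9</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>891</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>"""


def record_count(esearch_xml):
    """Return the integer inside <Count> of an esearch result."""
    root = ET.fromstring(esearch_xml)
    return int(root.findtext("Count"))


print(record_count(ESEARCH_OUTPUT))
```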

$ esearch -db sra -query PRJNA257197 | efetch -format runinfo > runinfo.csv

$ cat runinfo.csv | csvcut -c 1 | grep SRR | head -3
SRR1972917
SRR1972918
SRR1972919

$ cat runinfo.csv | csvcut -c 1 | grep SRR | head -3 > runids.txt

→ Extract only the first three SRA run accession numbers and save them to a file named runids.txt.
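The same extraction can be sketched with Python's csv module, for cases where csvkit is not installed. The sample below is a heavily trimmed stand-in for runinfo.csv: only the `Run` and `BioProject` columns are kept, and the rows reuse accessions that appear in this post. `first_run_ids` is a hypothetical helper; it keeps values that start with "SRR", which is slightly stricter than `grep SRR`.

```python
import csv
import io

# Trimmed stand-in for runinfo.csv (the real file has many more columns).
SAMPLE_RUNINFO = """Run,BioProject
SRR1972917,PRJNA257197
SRR1972918,PRJNA257197
SRR1972919,PRJNA257197
SRR1972967,PRJNA257197
"""


def first_run_ids(csv_text, limit=3):
    """Return up to `limit` run accessions from the Run column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    ids = []
    for row in reader:
        run = (row.get("Run") or "").strip()
        # Skip anything that is not a run accession (e.g. stray header lines).
        if run.startswith("SRR"):
            ids.append(run)
        if len(ids) == limit:
            break
    return ids


print(first_run_ids(SAMPLE_RUNINFO))
```

To use it on the real file, read it with `open("runinfo.csv").read()` and write the returned IDs one per line to runids.txt.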

$ cat runids.txt | parallel fastq-dump -X 10000 --split-files {}
Read 10000 spots for SRR1972919
Written 10000 spots for SRR1972919
Read 10000 spots for SRR1972917
Written 10000 spots for SRR1972917
Read 10000 spots for SRR1972918
Written 10000 spots for SRR1972918

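→ The FASTQ files for the three SRA runs listed in runids.txt (SRR1972917, SRR1972918, SRR1972919) are downloaded. The runs finish in a different order than they were listed because they are processed in parallel.

(-X 10000: extract only the first 10,000 spots of each run)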
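Where GNU parallel is not available, a similar fan-out can be sketched in Python. This is an illustration rather than a replacement for the command above: the function names and the `runner` parameter are hypothetical, and the only fastq-dump options used are the ones already shown in this post (`-X` and `--split-files`).

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def fastq_dump_command(run_id, max_spots=None):
    """Build the fastq-dump argument list for one run accession."""
    cmd = ["fastq-dump"]
    if max_spots is not None:
        cmd += ["-X", str(max_spots)]
    cmd += ["--split-files", run_id]
    return cmd


def read_run_ids(path):
    """Read one accession per line, ignoring blank lines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def download_all(run_ids, max_spots=None, workers=3, runner=subprocess.run):
    """Run one fastq-dump process per accession, `workers` at a time.

    `runner` defaults to subprocess.run; passing another callable
    (e.g. a stub) makes the function usable without fastq-dump installed.
    """
    commands = [fastq_dump_command(run_id, max_spots) for run_id in run_ids]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every job and re-raises the first failure, if any.
        return list(pool.map(lambda cmd: runner(cmd, check=True), commands))
```

Calling `download_all(read_run_ids("runids.txt"), max_spots=10000)` would correspond to the parallel command above, assuming fastq-dump is on the PATH.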

 

(2) document summary (docsum.xml)

$ esearch -db sra -query PRJNA257197 | efetch -format docsum > docsum.xml

$ cat docsum.xml | more
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE DocumentSummarySet>
<DocumentSummarySet status="OK">
  <DbBuild>Build231228-1743m.1</DbBuild>
  <DocumentSummary>
    <Id>1442232</Id>
    <ExpXml>
      <Summary>
        <Title>Zaire ebolavirus genome sequencing from 2014 outbreak in Sierra Leone: Sample W220.0</Title>
        <Platform instrument_model="Illumina HiSeq 2500">ILLUMINA</Platform>
        <Statistics total_runs="1" total_spots="8345287" total_bases="1685747974" total_size="1045828929" load_done="true" cluster_name="public"/>
      </Summary>
      <Submitter acc="SRA178666" center_name="BI" contact_name=" " lab_name=""/>
      <Experiment acc="SRX994253" ver="1" status="public" name="Zaire ebolavirus genome sequencing from 2014 outbreak in Sierra Leone: Sample W220.0"/>
      <Study acc="SRP045416" name="Zaire ebolavirus Genome sequencing"/>
      <Organism taxid="186538" ScientificName="Zaire ebolavirus"/>
      <Sample acc="SRS908478" name=""/>
      <Instrument ILLUMINA="Illumina HiSeq 2500"/>
      <Library_descriptor>
        <LIBRARY_NAME>W220.0.l1</LIBRARY_NAME>
        <LIBRARY_STRATEGY>RNA-Seq</LIBRARY_STRATEGY>
        <LIBRARY_SOURCE>TRANSCRIPTOMIC</LIBRARY_SOURCE>
        <LIBRARY_SELECTION>cDNA</LIBRARY_SELECTION>
        <LIBRARY_LAYOUT>
          <PAIRED/>
        </LIBRARY_LAYOUT>
      </Library_descriptor>
      <Bioproject>PRJNA257197</Bioproject>
      <Biosample>SAMN03253746</Biosample>
    </ExpXml>
    <Runs>
      <Run acc="SRR1972976" total_spots="8345287" total_bases="1685747974" load_done="true" is_public="true" cluster_name="public" static_data_available="true"/>
    </Runs>
    <ExtLinks/>
    <CreateDate>2015/04/14</CreateDate>
    <UpdateDate>2015/04/14</UpdateDate>
  </DocumentSummary>
...

$ cat docsum.xml | xtract -pattern DocumentSummary -element Bioproject,Biosample,Run@acc | head
PRJNA257197     SAMN03253746    SRR1972976
PRJNA257197     SAMN03253745    SRR1972975
PRJNA257197     SAMN03253744    SRR1972974
PRJNA257197     SAMN03254300    SRR1972973
PRJNA257197     SAMN03254299    SRR1972972
PRJNA257197     SAMN03254298    SRR1972971
PRJNA257197     SAMN03254296    SRR1972970
PRJNA257197     SAMN03254295    SRR1972969
PRJNA257197     SAMN03254294    SRR1972968
PRJNA257197     SAMN03254291    SRR1972967

→ This is considerably more complicated than working with the runinfo.csv file.

▶ That said, docsum.xml can be made a little easier to read.
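For comparison, the same three fields can be pulled out with Python's standard XML parser. The sketch below assumes the docsum layout shown above (Bioproject and Biosample under ExpXml, run accessions as `acc` attributes under Runs) and uses a trimmed two-record sample built from values in this post. `run_table` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed sample in the layout shown above; most elements are omitted.
SAMPLE_DOCSUM = """<DocumentSummarySet status="OK">
  <DocumentSummary>
    <ExpXml>
      <Bioproject>PRJNA257197</Bioproject>
      <Biosample>SAMN03253746</Biosample>
    </ExpXml>
    <Runs>
      <Run acc="SRR1972976"/>
    </Runs>
  </DocumentSummary>
  <DocumentSummary>
    <ExpXml>
      <Bioproject>PRJNA257197</Bioproject>
      <Biosample>SAMN03253745</Biosample>
    </ExpXml>
    <Runs>
      <Run acc="SRR1972975"/>
    </Runs>
  </DocumentSummary>
</DocumentSummarySet>"""


def run_table(xml_text):
    """Return (bioproject, biosample, run_accession) tuples, one per run."""
    root = ET.fromstring(xml_text)
    rows = []
    for doc in root.iter("DocumentSummary"):
        bioproject = doc.findtext("ExpXml/Bioproject", default="")
        biosample = doc.findtext("ExpXml/Biosample", default="")
        for run in doc.findall("Runs/Run"):
            rows.append((bioproject, biosample, run.get("acc")))
    return rows


for row in run_table(SAMPLE_DOCSUM):
    print("\t".join(row))
```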

$ alias pretty="python -c 'import sys;import xml.dom.minidom;s=sys.stdin.read();print(xml.dom.minidom.parseString(s).toprettyxml());'"

$ cat docsum.xml | pretty | more
<?xml version="1.0" ?>
<!DOCTYPE DocumentSummarySet>
<DocumentSummarySet status="OK">


        <DbBuild>Build231228-1743m.1</DbBuild>


        <DocumentSummary>


                <Id>1442232</Id>


                <ExpXml>


                        <Summary>


                                <Title>Zaire ebolavirus genome sequencing from 2014 outbreak in Sierra Leone: Sample W220.0</Title>


                                <Platform instrument_model="Illumina HiSeq 2500">ILLUMINA</Platform>


                                <Statistics total_runs="1" total_spots="8345287" total_bases="1685747974" total_size="1045828929" load_done="true" cluster_name="public"/>


                        </Summary>


                        <Submitter acc="SRA178666" center_name="BI" contact_name=" " lab_name=""/>


                        <Experiment acc="SRX994253" ver="1" status="public" name="Zaire ebolavirus genome sequencing from 2014 outbreak in Sierra Leone: Sample W220.0"/>


                        <Study acc="SRP045416" name="Zaire ebolavirus Genome sequencing"/>


                        <Organism taxid="186538" ScientificName="Zaire ebolavirus"/>


                        <Sample acc="SRS908478" name=""/>


                        <Instrument ILLUMINA="Illumina HiSeq 2500"/>


                        <Library_descriptor>


                                <LIBRARY_NAME>W220.0.l1</LIBRARY_NAME>


                                <LIBRARY_STRATEGY>RNA-Seq</LIBRARY_STRATEGY>


                                <LIBRARY_SOURCE>TRANSCRIPTOMIC</LIBRARY_SOURCE>


                                <LIBRARY_SELECTION>cDNA</LIBRARY_SELECTION>


                                <LIBRARY_LAYOUT>


                                        <PAIRED/>


                                </LIBRARY_LAYOUT>


                        </Library_descriptor>


                        <Bioproject>PRJNA257197</Bioproject>


                        <Biosample>SAMN03253746</Biosample>


                </ExpXml>


                <Runs>


                        <Run acc="SRR1972976" total_spots="8345287" total_bases="1685747974" load_done="true" is_public="true" cluster_name="public" static_data_available="true"/>


                </Runs>


                <ExtLinks/>


                <CreateDate>2015/04/14</CreateDate>


                <UpdateDate>2015/04/14</UpdateDate>


        </DocumentSummary>
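The blank lines in this output come from the whitespace already present between the tags of docsum.xml, which minidom keeps as text and prints on lines of their own. A possible variant, sketched below under that assumption, drops the whitespace-only lines after pretty-printing and uses a two-space indent. `pretty_xml` is a hypothetical name, and the tiny input is a stand-in for the real file.

```python
import xml.dom.minidom


def pretty_xml(xml_text, indent="  "):
    """Pretty-print XML and drop the whitespace-only lines minidom emits."""
    dom = xml.dom.minidom.parseString(xml_text)
    lines = dom.toprettyxml(indent=indent).splitlines()
    return "\n".join(line for line in lines if line.strip())


# Small already-indented sample in the style of docsum.xml.
SAMPLE = """<DocumentSummary>
  <Id>1442232</Id>
  <ExtLinks/>
</DocumentSummary>"""

print(pretty_xml(SAMPLE))
```

Saved as a script that reads `sys.stdin.read()` instead of `SAMPLE`, it could be used in a pipe the same way as the alias.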

Reference

The Biostar Handbook: 2nd Edition - István Albert