생물정보학 끄적끄적

[Linux] Merging Paired-End Reads

Hazel Y. 2023. 12. 31. 16:07

 

Paired-end sequencing을 할 때, DNA 조각이 두 read들의 길이의 합보다 짧을 경우, read들이 겹치는 구간이 발생한다. 이때 이 겹치는 구간을 이용하여 그 두 read들을 합치는 과정을 merging이라고 한다. 이번 포스팅에서는 FLASH (Fast Length Adjustment of SHort reads) 라는 툴을 사용하여 read들을 merging하는 방법을 연습해 보겠다.

출처: https://ccb.jhu.edu/software/FLASH/


1. FLASH 설치

(참고로 내 environment에는 이미 설치가 되어 있다.)

$ micromamba install flash -y
bioconda/noarch                                      5.1MB @   3.9MB/s  1.3s
bioconda/linux-64                                    5.3MB @   3.8MB/s  1.4s
conda-forge/noarch                                  13.1MB @   8.0MB/s  1.6s
conda-forge/linux-64                                31.5MB @  11.7MB/s  2.7s

Pinned packages:
  - python 3.8.*


Transaction

  Prefix: /home/biouser/micromamba/envs/bioinfo

  All requested packages already installed


Transaction starting

Transaction finished

To activate this environment, use:

    micromamba activate bioinfo

Or to execute a single command in this environment, use:

    micromamba run -n bioinfo mycommand

 

2. 연습에 활용할 데이터 파일 (1% error test data) 다운로드

- https://ccb.jhu.edu/software/FLASH/에 접속하면 1% error test data 뿐만 아니라 2%, 3%, 5% error test data도 다운받을 수 있다.

 

FLASh

Accuracy FLASH merges reads from paired-end sequencing runs with very high accuracy. FLASH accuracy on one million 100bp long synthetic pairs generated from fragments with a mean length of 180bp, normaly distributed with a standard deviation of 20bp:    

ccb.jhu.edu

$ wget https://ccb.jhu.edu/software/FLASH/error1.tar.gz
--2023-12-31 10:56:45--  https://ccb.jhu.edu/software/FLASH/error1.tar.gz
Resolving ccb.jhu.edu (ccb.jhu.edu)... 128.220.233.141
Connecting to ccb.jhu.edu (ccb.jhu.edu)|128.220.233.141|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 91499886 (87M) [application/x-gzip]
Saving to: 'error1.tar.gz'

error1.tar.gz      100%[================>]  87.26M   454KB/s    in 3m 59s

2023-12-31 11:00:46 (374 KB/s) - 'error1.tar.gz' saved [91499886/91499886]

$ tar zxvf error1.tar.gz
frag_1.fastq
frag_2.fastq

 

3. FLASH로 merge

$ flash frag_1.fastq frag_2.fastq
[FLASH] Starting FLASH v1.2.11
[FLASH] Fast Length Adjustment of SHort reads
[FLASH]
[FLASH] Input files:
[FLASH]     frag_1.fastq
[FLASH]     frag_2.fastq
[FLASH]
[FLASH] Output files:
[FLASH]     ./out.extendedFrags.fastq
[FLASH]     ./out.notCombined_1.fastq
[FLASH]     ./out.notCombined_2.fastq
[FLASH]     ./out.hist
[FLASH]     ./out.histogram
[FLASH]
[FLASH] Parameters:
[FLASH]     Min overlap:           10
[FLASH]     Max overlap:           65
[FLASH]     Max mismatch density:  0.250000
[FLASH]     Allow "outie" pairs:   false
[FLASH]     Cap mismatch quals:    false
[FLASH]     Combiner threads:      12
[FLASH]     Input format:          FASTQ, phred_offset=33
[FLASH]     Output format:         FASTQ, phred_offset=33
[FLASH]
[FLASH] Starting reader and writer threads
[FLASH] Starting 12 combiner threads
[FLASH] Processed 25000 read pairs
[FLASH] Processed 50000 read pairs
[FLASH] Processed 75000 read pairs
[FLASH] Processed 100000 read pairs
[FLASH] Processed 125000 read pairs
[FLASH] Processed 150000 read pairs
[FLASH] Processed 175000 read pairs
[FLASH] Processed 200000 read pairs
[FLASH] Processed 225000 read pairs
[FLASH] Processed 250000 read pairs
[FLASH] Processed 275000 read pairs
[FLASH] Processed 300000 read pairs
[FLASH] Processed 325000 read pairs
[FLASH] Processed 350000 read pairs
[FLASH] Processed 375000 read pairs
[FLASH] Processed 400000 read pairs
[FLASH] Processed 425000 read pairs
[FLASH] Processed 450000 read pairs
[FLASH] Processed 475000 read pairs
[FLASH] Processed 500000 read pairs
[FLASH] Processed 525000 read pairs
[FLASH] Processed 550000 read pairs
[FLASH] Processed 575000 read pairs
[FLASH] Processed 600000 read pairs
[FLASH] Processed 625000 read pairs
[FLASH] Processed 650000 read pairs
[FLASH] Processed 675000 read pairs
[FLASH] Processed 700000 read pairs
[FLASH] Processed 725000 read pairs
[FLASH] Processed 750000 read pairs
[FLASH] Processed 775000 read pairs
[FLASH] Processed 800000 read pairs
[FLASH] Processed 825000 read pairs
[FLASH] Processed 850000 read pairs
[FLASH] Processed 875000 read pairs
[FLASH] Processed 900000 read pairs
[FLASH] Processed 925000 read pairs
[FLASH] Processed 950000 read pairs
[FLASH] Processed 975000 read pairs
[FLASH] Processed 1000000 read pairs
[FLASH]
[FLASH] Read combination statistics:
[FLASH]     Total pairs:      1000000
[FLASH]     Combined pairs:   701191
[FLASH]     Uncombined pairs: 298809
[FLASH]     Percent combined: 70.12%
[FLASH]
[FLASH] Writing histogram files.
[FLASH]
[FLASH] FLASH v1.2.11 complete!
[FLASH] 1.144 seconds elapsed

→ 1% error test data는 해당 데이터의 70.12%가 merge되었다.

 

+ 다른 error rate의 데이터들의 merging 퍼센트

Error rate no error 1% 2% 3% 5%
Percent merged 70.30% 70.12% 68.78% 65.05% 48.14%

 

+ bbmerge.sh를 사용한 방법도 있다.

$ ~/bbmap/bbmerge.sh in1=frag_1.fastq in2=frag_2.fastq out=bb_merged.fq
java -ea -Xmx1000m -Xms1000m -Djava.library.path=/home/biouser/bbmap/jni/ -cp /home/biouser/bbmap/current/ jgi.BBMerge in1=frag_1.fastq in2=frag_2.fastq out=bb_merged.fq
Executing jgi.BBMerge [in1=frag_1.fastq, in2=frag_2.fastq, out=bb_merged.fq]
Version 39.06

Writing mergable reads merged.
Started output threads.
Total time: 8.275 seconds.

Pairs:                  1000000
Joined:                 249458          24.946%
Ambiguous:              0               0.000%
No Solution:            750541          75.054%
Too Short:              1               0.000%

Avg Insert:             164.2
Standard Deviation:     15.5
Mode:                   169

Insert range:           45 - 190
90th percentile:        185
75th percentile:        176
50th percentile:        165
25th percentile:        154
10th percentile:        143

Reference

The Biostar Handbook: 2nd Edition - István Albert