Paired-end sequencing을 할 때, DNA 조각이 두 read들의 길이의 합보다 짧을 경우, read들이 겹치는 구간이 발생한다. 이때 이 겹치는 구간을 이용하여 그 두 read들을 합치는 과정을 merging이라고 한다. 이번 포스팅에서는 FLASH (Fast Length Adjustment of SHort reads) 라는 툴을 사용하여 read들을 merging하는 방법을 연습해 보겠다.
1. FLASH 설치
(참고로 내 environment에는 이미 설치가 되어 있다.)
$ micromamba install flash -y
bioconda/noarch 5.1MB @ 3.9MB/s 1.3s
bioconda/linux-64 5.3MB @ 3.8MB/s 1.4s
conda-forge/noarch 13.1MB @ 8.0MB/s 1.6s
conda-forge/linux-64 31.5MB @ 11.7MB/s 2.7s
Pinned packages:
- python 3.8.*
Transaction
Prefix: /home/biouser/micromamba/envs/bioinfo
All requested packages already installed
Transaction starting
Transaction finished
To activate this environment, use:
micromamba activate bioinfo
Or to execute a single command in this environment, use:
micromamba run -n bioinfo mycommand
2. 연습에 활용할 데이터 파일 (1% error test data) 다운로드
- https://ccb.jhu.edu/software/FLASH/에 접속하면 1% error test data 뿐만 아니라 2%, 3%, 5% error test data도 다운받을 수 있다.
$ wget https://ccb.jhu.edu/software/FLASH/error1.tar.gz
--2023-12-31 10:56:45-- https://ccb.jhu.edu/software/FLASH/error1.tar.gz
Resolving ccb.jhu.edu (ccb.jhu.edu)... 128.220.233.141
Connecting to ccb.jhu.edu (ccb.jhu.edu)|128.220.233.141|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 91499886 (87M) [application/x-gzip]
Saving to: 'error1.tar.gz'
error1.tar.gz 100%[================>] 87.26M 454KB/s in 3m 59s
2023-12-31 11:00:46 (374 KB/s) - 'error1.tar.gz' saved [91499886/91499886]
$ tar zxvf error1.tar.gz
frag_1.fastq
frag_2.fastq
3. FLASH로 merge
$ flash frag_1.fastq frag_2.fastq
[FLASH] Starting FLASH v1.2.11
[FLASH] Fast Length Adjustment of SHort reads
[FLASH]
[FLASH] Input files:
[FLASH] frag_1.fastq
[FLASH] frag_2.fastq
[FLASH]
[FLASH] Output files:
[FLASH] ./out.extendedFrags.fastq
[FLASH] ./out.notCombined_1.fastq
[FLASH] ./out.notCombined_2.fastq
[FLASH] ./out.hist
[FLASH] ./out.histogram
[FLASH]
[FLASH] Parameters:
[FLASH] Min overlap: 10
[FLASH] Max overlap: 65
[FLASH] Max mismatch density: 0.250000
[FLASH] Allow "outie" pairs: false
[FLASH] Cap mismatch quals: false
[FLASH] Combiner threads: 12
[FLASH] Input format: FASTQ, phred_offset=33
[FLASH] Output format: FASTQ, phred_offset=33
[FLASH]
[FLASH] Starting reader and writer threads
[FLASH] Starting 12 combiner threads
[FLASH] Processed 25000 read pairs
[FLASH] Processed 50000 read pairs
[FLASH] Processed 75000 read pairs
[FLASH] Processed 100000 read pairs
[FLASH] Processed 125000 read pairs
[FLASH] Processed 150000 read pairs
[FLASH] Processed 175000 read pairs
[FLASH] Processed 200000 read pairs
[FLASH] Processed 225000 read pairs
[FLASH] Processed 250000 read pairs
[FLASH] Processed 275000 read pairs
[FLASH] Processed 300000 read pairs
[FLASH] Processed 325000 read pairs
[FLASH] Processed 350000 read pairs
[FLASH] Processed 375000 read pairs
[FLASH] Processed 400000 read pairs
[FLASH] Processed 425000 read pairs
[FLASH] Processed 450000 read pairs
[FLASH] Processed 475000 read pairs
[FLASH] Processed 500000 read pairs
[FLASH] Processed 525000 read pairs
[FLASH] Processed 550000 read pairs
[FLASH] Processed 575000 read pairs
[FLASH] Processed 600000 read pairs
[FLASH] Processed 625000 read pairs
[FLASH] Processed 650000 read pairs
[FLASH] Processed 675000 read pairs
[FLASH] Processed 700000 read pairs
[FLASH] Processed 725000 read pairs
[FLASH] Processed 750000 read pairs
[FLASH] Processed 775000 read pairs
[FLASH] Processed 800000 read pairs
[FLASH] Processed 825000 read pairs
[FLASH] Processed 850000 read pairs
[FLASH] Processed 875000 read pairs
[FLASH] Processed 900000 read pairs
[FLASH] Processed 925000 read pairs
[FLASH] Processed 950000 read pairs
[FLASH] Processed 975000 read pairs
[FLASH] Processed 1000000 read pairs
[FLASH]
[FLASH] Read combination statistics:
[FLASH] Total pairs: 1000000
[FLASH] Combined pairs: 701191
[FLASH] Uncombined pairs: 298809
[FLASH] Percent combined: 70.12%
[FLASH]
[FLASH] Writing histogram files.
[FLASH]
[FLASH] FLASH v1.2.11 complete!
[FLASH] 1.144 seconds elapsed
→ 1% error test data는 해당 데이터의 70.12%가 merge되었다.
+ 다른 error rate의 데이터들의 merging 퍼센트
Error rate | no error | 1% | 2% | 3% | 5% |
Percent merged | 70.30% | 70.12% | 68.78% | 65.05% | 48.14% |
+ bbmerge.sh를 사용한 방법도 있다.
$ ~/bbmap/bbmerge.sh in1=frag_1.fastq in2=frag_2.fastq out=bb_merged.fq
java -ea -Xmx1000m -Xms1000m -Djava.library.path=/home/biouser/bbmap/jni/ -cp /home/biouser/bbmap/current/ jgi.BBMerge in1=frag_1.fastq in2=frag_2.fastq out=bb_merged.fq
Executing jgi.BBMerge [in1=frag_1.fastq, in2=frag_2.fastq, out=bb_merged.fq]
Version 39.06
Writing mergable reads merged.
Started output threads.
Total time: 8.275 seconds.
Pairs: 1000000
Joined: 249458 24.946%
Ambiguous: 0 0.000%
No Solution: 750541 75.054%
Too Short: 1 0.000%
Avg Insert: 164.2
Standard Deviation: 15.5
Mode: 169
Insert range: 45 - 190
90th percentile: 185
75th percentile: 176
50th percentile: 165
25th percentile: 154
10th percentile: 143
Reference
The Biostar Handbook: 2nd Edition - István Albert
'생물정보학 끄적끄적' 카테고리의 다른 글
DNA Sequence에 Regular Expressions 적용하기 (63) | 2024.01.01 |
---|---|
[Linux] Sequence Data Quality Control - Trimming Adapters (61) | 2023.12.30 |
[Linux] Short Read Archive 데이터 다운받기 (63) | 2023.12.29 |
GO Enrichment 분석 연습 (78) | 2023.12.28 |
[Linux] Ontology (3) - bio explain (86) | 2023.12.25 |