[Linux] Sequence Data Quality Control

생물정보학 끄적끄적

[Linux] Sequence Data Quality Control - Trimming Adapters

Hazel Y. 2023. 12. 30. 18:52

DNA sequencing을 할 때 sequencing platform에 DNA fragment들이 붙을 수 있도록 adapter라는 인공적으로 만들어진, 그래서 알려진 염기 서열을 가진, 짧은 DNA sequence를 사용한다. 하지만 각 DNA fragment에 대한 sequencing이 끝나고, 전체 genome에 대한 sequence data가 생성될 때, 다시 말해 genome assembly 전에 adapter들의 sequence는 제거되어야 한다. 따라서 이번 포스팅에서는 sequence data로부터 adapter들을 trim하는 방법에 대해 알아보려 한다.

1. trimming이 필요한 데이터 파일 다운로드

- e.g., SRR519926

Illumina whole genome shotgun sequencing of genomic DNA paired-end li... - SRA - NCBI

show Abstracthide Abstract To test methods for assembly of very short reads, five microbial genomes were selected which have a finished reference sequence available. Escherichia coli MG1655 was one of these five microbial genomes. At least two libraries (s

www.ncbi.nlm.nih.gov

$ fastq-dump -X 10000 --split-files SRR519926
Read 10000 spots for SRR519926
Written 10000 spots for SRR519926

→ 두 개의 파일 (SRR519926_1.fastq, SRR519926_2.fastq) 이 저장됨. (paired-end)

2. fastQC report 생성 (trimming 전)

$ fastqc SRR519926_1.fastq
null
Started analysis of SRR519926_1.fastq
Approx 5% complete for SRR519926_1.fastq
Approx 20% complete for SRR519926_1.fastq
Approx 30% complete for SRR519926_1.fastq
Approx 35% complete for SRR519926_1.fastq
Approx 50% complete for SRR519926_1.fastq
Approx 60% complete for SRR519926_1.fastq
Approx 70% complete for SRR519926_1.fastq
Approx 80% complete for SRR519926_1.fastq
Approx 90% complete for SRR519926_1.fastq
Approx 100% complete for SRR519926_1.fastq
Analysis complete for SRR519926_1.fastq

→ 상당히 낮은 sequence quality

3. quality control with fastp

- single-end mode에서는 adapter를 자동으로 인식하고 trim한다. 그러나 이는 paired-end mode에서는 제대로 작동하지 않을 수 있다. 따라서 single-end mode로 adapter의 sequence를 파악한 다음, 이를 paired-end mode에 직접 제공함으로써 adapter를 trim할 수 있다.

(1) single-end mode

$ fastp -i SRR519926_1.fastq -o SRR519926_1.trimmed.fq
Detecting adapter sequence for read1...
>Illumina TruSeq Adapter Read 1
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA

Read1 before filtering:
total reads: 10000
total bases: 2510000
Q20 bases: 1705380(67.9434%)
Q30 bases: 1436342(57.2248%)

Read1 after filtering:
total reads: 8123
total bases: 1925252
Q20 bases: 1480733(76.9111%)
Q30 bases: 1264611(65.6855%)

Filtering result:
reads passed filter: 8123
reads failed due to low quality: 1854
reads failed due to too many N: 23
reads failed due to too short: 0
reads with adapter trimmed: 1266
bases trimmed due to adapters: 115763

Duplication rate (may be overestimated since this is SE data): 0%

JSON report: fastp.json
HTML report: fastp.html

fastp -i SRR519926_1.fastq -o SRR519926_1.trimmed.fq
fastp v0.23.4, time used: 1 seconds

→ adapter (Illumina) sequence: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA

(2) paired-end mode

$ echo ">illumina" > adapter.fa
$ echo "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" >> adapter.fa

$ fastp --adapter_fasta adapter.fa -i SRR519926_1.fastq -I SRR519926_2.fastq \
> -o SRR519926_1.trim.fq -O SRR519926_2.trim.fq
Read1 before filtering:
total reads: 10000
total bases: 2510000
Q20 bases: 1705380(67.9434%)
Q30 bases: 1436342(57.2248%)

Read2 before filtering:
total reads: 10000
total bases: 2510000
Q20 bases: 991181(39.4893%)
Q30 bases: 778662(31.0224%)

Read1 after filtering:
total reads: 1895
total bases: 371085
Q20 bases: 327188(88.1706%)
Q30 bases: 293141(78.9956%)

Read2 after filtering:
total reads: 1895
total bases: 373054
Q20 bases: 267542(71.7167%)
Q30 bases: 225653(60.488%)

Filtering result:
reads passed filter: 3790
reads failed due to low quality: 16206
reads failed due to too many N: 4
reads failed due to too short: 0
reads with adapter trimmed: 1976
bases trimmed due to adapters: 213858

Duplication rate: 0%

Insert size peak (evaluated by paired-end reads): 92

JSON report: fastp.json
HTML report: fastp.html

fastp --adapter_fasta adapter.fa -i SRR519926_1.fastq -I SRR519926_2.fastq -o SRR519926_1.trim.fq -O SRR519926_2.trim.fq
fastp v0.23.4, time used: 0 seconds

(3) fastQC report 생성 (trimming 후)

$ fastqc SRR519926_1.trim.fq
null
Started analysis of SRR519926_1.trim.fq
Approx 50% complete for SRR519926_1.trim.fq
Analysis complete for SRR519926_1.trim.fq

→ adapter들만 trim했기 때문에 여전히 quality가 좋은 건 아니지만 확실히 trimming 전에 비해서는 quality가 향상되었음을 확인할 수 있다.

+ 특히, adapter content 그래프들을 비교해 보면 adapter들이 성공적으로 trim되었음을 알 수 있다.

Reference

The Biostar Handbook: 2nd Edition - István Albert

저작자표시 비영리 변경금지 (새창열림)

'생물정보학 끄적끄적' 카테고리의 다른 글

DNA Sequence에 Regular Expressions 적용하기 (63)	2024.01.01
[Linux] Merging Paired-End Reads (62)	2023.12.31
[Linux] Short Read Archive 데이터 다운받기 (63)	2023.12.29
GO Enrichment 분석 연습 (78)	2023.12.28
[Linux] Ontology (3) - bio explain (86)	2023.12.25

현재글[Linux] Sequence Data Quality Control - Trimming Adapters

- 인프제가 생각을 쏟아내는 곳 - INFJ, a professional overthinker 공스타: @gongstabyhazel

매일의생각, 네덜란드, 일상, 독서기록, marketing analytics, bioinformatics, 석사, Python, 리눅스, Netherlands, 생물정보학, 파이썬, 생각, Linux, 감성 분석, 일기, sentiment analysis, 유학생, 책추천, Topic Modeling,

Today :
Yesterday :

Hazel's Life Journey