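An ontology is a controlled collection of words and terms that are linked to one another semantically and contextually. Two ontologies are mainly used in bioinformatics: the sequence ontology and the gene ontology. This post covers only the sequence ontology; the gene ontology is covered separately in the next post.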
1. Sequence ontology (SO)
- Terms that describe sequence features and the information associated with them
- At http://www.sequenceontology.org/browser/obob.cgi you can look up the details of each term and its relationships to other concepts.
[ Exploring the SO data ]
(1) Download the SO data file from the GitHub repository https://github.com/The-Sequence-Ontology/SO-Ontologies
$ URL=https://raw.githubusercontent.com/The-Sequence-Ontology/SO-Ontologies/master/Ontology_Files/so-simple.obo
$ wget $URL
--2023-12-23 14:32:06-- https://raw.githubusercontent.com/The-Sequence-Ontology/SO-Ontologies/master/Ontology_Files/so-simple.obo
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1019055 (995K) [text/plain]
Saving to: 'so-simple.obo'
so-simple.obo 100%[================>] 995.17K 5.03MB/s in 0.2s
2023-12-23 14:32:06 (5.03 MB/s) - 'so-simple.obo' saved [1019055/1019055]
(2) Number of SO terms
$ cat so-simple.obo | grep 'Term' | wc -l
2625
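→ 2,625 SO terms (2,625 terms related to sequence features). Note that grep 'Term' matches every line containing the string Term, not only the [Term] stanza headers, so treat this figure as an approximate count.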
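A stricter count can be sketched as follows. This is only an illustration: sample.obo and its EX: terms are made up so that the commands run without a download, and the assumption is that each term starts with a line that is exactly [Term] and that obsolete terms carry an is_obsolete: true tag, as in the OBO format.

```shell
# Hypothetical miniature OBO file (not real SO content), used only so the
# commands below can run without downloading so-simple.obo.
cat > sample.obo <<'EOF'
format-version: 1.2

[Term]
id: EX:0000001
name: example_region
def: "A made-up definition." []

[Term]
id: EX:0000002
name: example_old_feature
def: "Free text can mention the word Term as well." []
is_obsolete: true

[Typedef]
id: part_of
name: part_of
EOF

# Loose count: every line containing the string "Term" anywhere.
loose=$(grep -c 'Term' sample.obo)

# Strict count: -F treats the pattern as a fixed string and -x requires the
# whole line to match, so only the stanza header "[Term]" is counted.
strict=$(grep -c -x -F '[Term]' sample.obo)

# Obsolete terms are flagged with an "is_obsolete: true" line.
obsolete=$(grep -c -x -F 'is_obsolete: true' sample.obo)

echo "loose=$loose strict=$strict obsolete=$obsolete current=$((strict - obsolete))"
# prints: loose=3 strict=2 obsolete=1 current=1
```

On the sample, the loose count is inflated by the def line that happens to contain the word Term. The same commands can be pointed at so-simple.obo; the result for the real file is not shown here because it was not measured in this post.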
(3) Finding a specific term
- e.g., 'name: gene'
$ cat so-simple.obo | grep 'name: gene$'
name: gene
$ cat so-simple.obo | grep 'name: gene$' -B 1 -A 7
id: SO:0000704
name: gene
def: "A region (or regions) that includes all of the sequence elements necessary to encode a functional transcript. A gene may include regulatory regions, transcribed regions and/or other functional sequence regions." [SO:immuno_workshop]
comment: This term is mapped to MGED. Do not obsolete without consulting MGED ontology. A gene may be considered as a unit of inheritance.
subset: SOFA
synonym: "INSDC_feature:gene" EXACT []
xref: http://en.wikipedia.org/wiki/Gene "wiki"
is_a: SO:0001411 ! biological_region
relationship: member_of SO:0005855 ! gene_group
※ grep -B option: also prints the lines before each matching line (-B 1: one preceding line)
※ grep -A option: also prints the lines after each matching line (-A 7: seven following lines)
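The -B 1 -A 7 values above fit the gene entry only because it happens to have that many lines. As an alternative sketch, assuming stanzas are separated by blank lines as in OBO files, awk's paragraph mode can print the whole stanza regardless of its length. The file sample_gene.obo is a hypothetical, abbreviated excerpt built from the output shown above.

```shell
# Abbreviated excerpt (hypothetical file name) based on the gene entry above.
cat > sample_gene.obo <<'EOF'
[Term]
id: SO:0000704
name: gene
subset: SOFA
is_a: SO:0001411 ! biological_region

[Term]
id: SO:0005855
name: gene_group
EOF

# RS='' switches awk to paragraph mode: each blank-line-separated stanza is
# one record. FS='\n' makes every line of the stanza a separate field.
# The stanza is printed only if one of its lines equals "name: gene" exactly.
stanza=$(awk -v RS='' -v FS='\n' -v want='name: gene' '
  { for (i = 1; i <= NF; i++) if ($i == want) { print; next } }
' sample_gene.obo)

printf '%s\n' "$stanza"
```

Because the comparison is an exact string match on the whole line, name: gene_group is not picked up, which plays the same role as the trailing $ in the grep pattern.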
(4) Finding multiple entries with a keyword
- e.g., 'PCR'
$ cat so-simple.obo | grep 'PCR' -B 2 -A 3
[Term]
id: SO:0000006
name: PCR_product
def: "A region amplified by a PCR reaction." [SO:ke]
comment: This term is mapped to MGED. This term is now located in OBI, with the following ID OBI_0000406.
subset: SOFA
synonym: "amplicon" RELATED []
synonym: "PCR product" EXACT []
xref: http://en.wikipedia.org/wiki/RAPD "wiki"
is_a: SO:0000695 ! reagent
--
id: SO:0000345
name: EST
def: "A tag produced from a single sequencing read from a cDNA clone or PCR product; typically a few hundred base pairs long." [SO:ke]
comment: This term is mapped to MGED. Do not obsolete without consulting MGED ontology.
subset: SOFA
synonym: "expressed sequence tag" EXACT []
--
id: SO:0000758
name: double_stranded_cDNA
def: "DNA synthesized from RNA by reverse transcriptase that has been copied by PCR to make it double stranded." []
synonym: "double strand cDNA" RELATED []
synonym: "double stranded cDNA" EXACT []
synonym: "double-strand cDNA" RELATED []
--
id: SO:0001481
name: RAPD
def: "RAPD is a 'PCR product' where a sequence variant is identified through the use of PCR with random primers." [ZFIN:mh]
synonym: "Random Amplification Polymorphic DNA" EXACT []
is_a: SO:0000006 ! PCR_product
created_by: kareneilbeck
creation_date: 2009-09-09T05:26:10Z
--
id: SO:0001830
name: AFLP_fragment
def: "A PCR product obtained by applying the AFLP technique, based on a restriction enzyme digestion of genomic DNA and an amplification of the resulting fragments." [GMOD:ea]
comment: Requested by Bayer Cropscience June, 2011.
synonym: "AFLP" EXACT []
synonym: "AFLP fragment" EXACT []
synonym: "AFLP-PCR" EXACT []
synonym: "amplified fragment length polymorphism" EXACT []
synonym: "amplified fragment length polymorphism PCR" EXACT []
xref: http://en.wikipedia.org/wiki/Amplified_fragment_length_polymorphism "wiki"
is_a: SO:0000006 ! PCR_product
created_by: kareneilbeck
creation_date: 2011-07-14T12:12:35Z
--
id: SO:0002139
name: unconfirmed_transcript
def: "This is used for non-spliced EST clusters that have polyA features. This category has been specifically created for the ENCODE project to highlight regions that could indicate the presence of protein coding genes that require experimental validation, either by 5' RACE or RT-PCR to extend the transcripts, or by confirming expression of the putatively-encoded peptide with specific antibodies." [GENCODE:http\://www.gencodegenes.org/gencode_biotypes.html]
synonym: "TEC" EXACT []
synonym: "to_be_experimentally_confirmed_transcript" EXACT []
is_a: SO:0002138 ! predicted_transcript
In the terminal, the matched keyword is highlighted in red in this output, provided grep's color output is enabled.
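With fixed -B and -A values some stanzas are cut off (several entries above start at the id line and lose their [Term] header). A possible refinement, sketched under the same blank-line-separated-stanza assumption, is to report one line per matching stanza and to list the terms whose is_a line points at PCR_product. The file sample_pcr.obo is a hypothetical, abbreviated excerpt of entries shown in this post.

```shell
# Abbreviated excerpt (hypothetical file name) of stanzas shown in this post.
cat > sample_pcr.obo <<'EOF'
[Term]
id: SO:0000006
name: PCR_product
def: "A region amplified by a PCR reaction." [SO:ke]
is_a: SO:0000695 ! reagent

[Term]
id: SO:0000345
name: EST
def: "A tag produced from a single sequencing read from a cDNA clone or PCR product; typically a few hundred base pairs long." [SO:ke]

[Term]
id: SO:0001481
name: RAPD
is_a: SO:0000006 ! PCR_product

[Term]
id: SO:0000704
name: gene
is_a: SO:0001411 ! biological_region
EOF

# One "<id><TAB><name>" line for every stanza that mentions the keyword.
hits=$(awk -v RS='' -v FS='\n' -v kw='PCR' '
  index($0, kw) {
    id = ""; name = ""
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^id: /)   id = substr($i, 5)    # text after "id: "
      if ($i ~ /^name: /) name = substr($i, 7)  # text after "name: "
    }
    print id "\t" name
  }
' sample_pcr.obo)
printf '%s\n' "$hits"

# Names of direct children of PCR_product (is_a: SO:0000006).
children=$(awk -v RS='' -v FS='\n' '
  {
    name = ""; child = 0
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^name: /) name = substr($i, 7)
      if (index($i, "is_a: SO:0000006 ") == 1) child = 1
    }
    if (child) print name
  }
' sample_pcr.obo)
printf '%s\n' "$children"
```

On this sample the first command lists PCR_product, EST and RAPD, and the second lists only RAPD. Run against the full so-simple.obo, the second command should also pick up entries such as AFLP_fragment, whose is_a line in the output above points at SO:0000006.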
Reference
The Biostar Handbook: 2nd Edition - István Albert