이전 포스팅에 이어서 이번엔 gene ontology에 대해 소개하려 한다.
2-1. Gene ontology (GO)
- gene functions와 관련된 정보에 해당하는 용어들
- 세 개의 sub-ontologies
(1) cellular component: gene function의 결과가 나타나는 곳에 대해 (e.g., Golgi cisterna membrane)
(2) molecular function: gene function의 생화학적 activity에 대해 (e.g., galactoside 2-alpha-L-fucosyltransferase activity)
(3) biological process: gene function의 목적에 대해 (e.g., carbohydrate metabolic process)
- https://www.geneontology.org/, https://www.ebi.ac.uk/QuickGO/, https://amigo.geneontology.org/amigo에서 검색 가능
2-2. GO data investigate해보기
(1) GO data와 association file 다운로드
$ curl -OL https://purl.obolibrary.org/obo/go.obo
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 100 316 0 316 0 0 440 0 --:--:-- --:--:-- --:--:-- 441
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0 32.7M 0 320k 0 0 180k 0 0:03:05 0:00:01 0:03:04 32 35 32.7M 35 11.6M 0 0 4290k 0 0:00:07 0:00:02 0:00:05 598 94 32.7M 94 30.8M 0 0 8384k 0 0:00:03 0:00:03 --:--:-- 10.100 32.7M 100 32.7M 0 0 8681k 0 0:00:03 0:00:03 --:--:-- 10.6M
$ curl -OL https://geneontology.org/gene-associations/goa_human.gaf.gz
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 100 244 100 244 0 0 139 0 0:00:01 0:00:01 --:--:-- 139
0 11.1M 0 52836 0 0 22631 0 0:08:36 0:00:02 0:08:34 226 13 11.1M 13 1573k 0 0 491k 0 0:00:23 0:00:03 0:00:20 174100 11.1M 100 11.1M 0 0 3056k 0 0:00:03 0:00:03 --:--:-- 8110k
(2) association file 압축 해제 & comments 제거
- .obo와 .gaf format의 파일에서는 # 대신 !가 comments를 나타냄.
$ gunzip -c goa_human.gaf.gz | grep -v '^!' > goa_human.gaf
(3) association mapping 수
$ cat goa_human.gaf | wc -l
635268
→ 635,268 associations/mappings
(4) unique gene 수
$ cat goa_human.gaf | cut -f 3 | sort-uniq-count | head
1 A0A0G2JLM5
1 A0A0G2JN68
1 A0A0G2JNG1
5 A0A0G2JNH3
1 A0A0G2JP03
4 A0A0G2JPB7
1 A0A0G2JQ47
8 A0A1B0GTQ1
1 A0A1B0GUZ9
4 A0A1W2PRG0
$ cat goa_human.gaf | cut -f 3 | sort-uniq-count | wc -l
19606
→ 19,606 unique genes
(5) 가장 많이 annotate된 상위 10개 단백질/유전자 알아보기
$ cat goa_human.gaf | cut -f 2 | sort-uniq-count-rank > prot_counts.txt
$ cat prot_counts.txt | head
1098 P42858
979 P04637
920 P00533
857 P62993
690 Q08379
668 P05067
664 P35222
636 P0CG48
609 Q12933
581 P62979
$ cat goa_human.gaf | cut -f 3 | sort-uniq-count-rank > gene_counts.txt
$ cat gene_counts.txt | head
1098 HTT
979 TP53
920 EGFR
857 GRB2
690 GOLGA2
668 APP
664 CTNNB1
636 UBC
609 TRAF2
581 RPS27A
(6) gene-GO terms annotations에 대한 descriptive statistics
$ cat gene_counts.txt | datamash mean 1 min 1 max 1 sstdev 1
32.401713761094 1 1098 45.499398474444
→ 표준편차 45.5: 꽤 variable한 annotation 수
(7) 연도별 annotation 추이
$ cat goa_human.gaf \
| cut -f 14 \
| awk '{ print substr($1, 1, 4) }' \
| sort \
| uniq -c \
| sort -k 2 -nr \
| awk '{ print $2 "\t" $1 }'
2023 301702
2022 21707
2021 16921
2020 27474
2019 15684
2018 18034
2017 27376
2016 20288
2015 16414
2014 21606
2013 57104
2012 12902
2011 18941
2010 11277
2009 11218
2008 7894
2007 4319
2006 6928
2005 3959
2004 2936
2003 7930
2002 952
2001 1697
2000 5
- sort -k 2 -nr: 두 번째 필드를 기준으로 (k -2) 내림차순 (-nr) 정렬
2-3. goatools
- GO terms의 시각화를 위한 패키지
(1) 설치
$ micromamba install pygraphviz
$ pip install goatools pydot statsmodels fisher
(2) GO terms가 정의된 파일 다운로드
$ wget -nc https://geneontology.org/ontology/go-basic.obo
--2023-12-24 21:01:03-- https://geneontology.org/ontology/go-basic.obo
Resolving geneontology.org (geneontology.org)... 34.233.67.155
Connecting to geneontology.org (geneontology.org)|34.233.67.155|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://purl.obolibrary.org/obo/go/go-basic.obo [following]
--2023-12-24 21:01:04-- http://purl.obolibrary.org/obo/go/go-basic.obo
Resolving purl.obolibrary.org (purl.obolibrary.org)... 104.18.37.59, 172.64.150.197, 2606:4700:4400::ac40:96c5, ...
Connecting to purl.obolibrary.org (purl.obolibrary.org)|104.18.37.59|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://current.geneontology.org/ontology/go-basic.obo [following]
--2023-12-24 21:01:05-- http://current.geneontology.org/ontology/go-basic.obo
Resolving current.geneontology.org (current.geneontology.org)... 13.224.14.105, 13.224.14.114, 13.224.14.57, ...
Connecting to current.geneontology.org (current.geneontology.org)|13.224.14.105|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31136976 (30M) [text/obo]
Saving to: 'go-basic.obo'
go-basic.obo 100%[================>] 29.69M 8.79MB/s in 3.4s
2023-12-24 21:01:09 (8.79 MB/s) - 'go-basic.obo' saved [31136976/31136976]
(3) 시각화
- 다양한 레벨과 형태로 시각화 가능
- e.g., Golgi cisterna (GO:0031985)
(a)
$ plot_go_term.py --term GO:0031985 go-basic.obo
go-basic.obo: fmt(1.2) rel(2023-11-15) 46,228 Terms
GO:0031985 level-04 depth-04 Golgi cisterna [cellular_component]
all parents: {'GO:0098791', 'GO:0005575', 'GO:0031984', 'GO:0110165'}
all children: {'GO:0005797', 'GO:0000137', 'GO:0000138'}
lineage info for terms ['GO:0031985'] written to GO_lineage.pdf
(b)
$ go_plot.py --obo go-basic.obo -o output_1.png GO:0031985
go-basic.obo: fmt(1.2) rel(2023-11-15) 46,228 Terms
GoSubDag: 1 sources in 5 GOs rcnt(True). 0 alt GO IDs
GoSubDag: namedtuple fields: NS level depth GO alt GO_name dcnt D1 id
GoSubDag: relationships: set()
GO:0005575 # CC 4060 L00 D00 cellular_component
#f1fbfd GO:0110165 # CC 1886 L01 D01 A cellular anatomical entity
GO:0031984 # CC 14 L02 D02 A organelle subcompartment
GO:0098791 # CC 6 L03 D03 A Golgi apparatus subcompartment
#ffffe4 GO:0031985 # CC 3 L04 D04 A Golgi cisterna
1 usr 5 GOs WROTE: output_1.png
(c)
$ go_plot.py --obo go-basic.obo -r -o output_2.png GO:0031985
go-basic.obo: fmt(1.2) rel(2023-11-15) 46,228 Terms; optional_attrs(relationship)
GoSubDag: 1 sources in 14 GOs rcnt(True). 0 alt GO IDs
GoSubDag: namedtuple fields: NS level depth reldepth GO alt GO_name dcnt D1 childcnt REL REL_short rel id
GoSubDag: relationships: {'negatively_regulates', 'regulates', 'part_of', 'positively_regulates'}
GO:0005575 # CC 4060 3 L00 D00 R00 .... .... cellular_component
#f1fbfd GO:0110165 # CC 3439 429 L01 D01 R01 A .... .... cellular anatomical entity
GO:0031984 # CC 84 3 L02 D02 R04 A P... .... organelle subcompartment
GO:0043226 # CC 1861 5 L02 D02 R02 A .... p... organelle
GO:0005622 # CC 2074 7 L02 D02 R02 A .... p... intracellular anatomical structure
GO:0005737 # CC 1150 5 L02 D02 R03 A P... p... cytoplasm
GO:0012505 # CC 338 0 L02 D02 R02 A .... p... endomembrane system
GO:0098791 # CC 15 3 L03 D03 R06 A P... .... Golgi apparatus subcompartment
GO:0043229 # CC 1719 2 L03 D03 R03 A P... p... intracellular organelle
GO:0043227 # CC 1355 5 L03 D03 R03 A .... p... membrane-bounded organelle
GO:0043231 # CC 1204 34 L04 D04 R04 A .... .... intracellular membrane-bounded organelle
GO:0005795 # CC 12 0 L04 D04 R07 A .... p... Golgi stack
#ffffe4 GO:0031985 # CC 10 3 L04 D04 R08 A P... p... Golgi cisterna
GO:0005794 # CC 38 1 L05 D05 R05 A P... p... Golgi apparatus
1 usr 14 GOs WROTE: output_2.png
※ goatools는 GO가 distribute될 때 사용되는 기본 데이터 format (.gaf) 을 가지고 operate할 수 없다. 따라서 population/association/study files를 만드는 별도의 전처리 과정이 필요하다.
(1) human gene annotations (association) 파일 (.gaf) 다운로드
$ wget -nc https://geneontology.org/gene-associations/goa_human.gaf.gz
--2023-12-24 21:21:32-- https://geneontology.org/gene-associations/goa_human.gaf.gz
Resolving geneontology.org (geneontology.org)... 34.233.67.155
Connecting to geneontology.org (geneontology.org)|34.233.67.155|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://current.geneontology.org/annotations/goa_human.gaf.gz [following]
--2023-12-24 21:21:33-- http://current.geneontology.org/annotations/goa_human.gaf.gz
Resolving current.geneontology.org (current.geneontology.org)... 13.224.14.114, 13.224.14.105, 13.224.14.57, ...
Connecting to current.geneontology.org (current.geneontology.org)|13.224.14.114|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11696775 (11M) [application/gzip]
Saving to: 'goa_human.gaf.gz'
goa_human.gaf.gz 100%[================>] 11.15M 6.71MB/s in 1.7s
2023-12-24 21:21:35 (6.71 MB/s) - 'goa_human.gaf.gz' saved [11696775/11696775]
$ gunzip goa_human.gaf.gz
$ cat goa_human.gaf | head
!gaf-version: 2.2
!
!generated-by: GOC
!
!date-generated: 2023-11-16T17:16
!
!Header from source association file:
!=================================
!
!generated-by: GOC
(2) !를 없앤 채 데이터 파일 살펴보기
$ cat goa_human.gaf | grep -v ! | head
UniProtKB A0A024RBG1 NUDT4B enables GO:0003723 GO_REF:0000043 IEA UniProtKB-KW:KW-0694 F Diphosphoinositol polyphosphate phosphohydrolase NUDT4B NUDT4B protein taxon:9606 20230911 UniProt
UniProtKB A0A024RBG1 NUDT4B enables GO:0046872 GO_REF:0000043 IEA UniProtKB-KW:KW-0479 F Diphosphoinositol polyphosphate phosphohydrolase NUDT4B NUDT4B protein taxon:9606 20230911 UniProt
UniProtKB A0A024RBG1 NUDT4B located_in GO:0005829 GO_REF:0000052 IDA C Diphosphoinositol polyphosphate phosphohydrolase NUDT4B NUDT4B protein taxon:9606 20230619 HPA
UniProtKB A0A075B6H7 IGKV3-7 involved_in GO:0002250 GO_REF:0000043 IEA UniProtKB-KW:KW-1064 P Probable non-functional immunoglobulin kappa variable 3-7 IGKV3-7 protein taxon:9606 20230911 UniProt
UniProtKB A0A075B6H7 IGKV3-7 located_in GO:0005886 GO_REF:0000044 IEA UniProtKB-SubCell:SL-0039 C Probable non-functional immunoglobulin kappa variable 3-7 IGKV3-7 protein taxon:9606 20230911 UniProt
UniProtKB A0A075B6H7 IGKV3-7 part_of GO:0019814 GO_REF:0000043 IEA UniProtKB-KW:KW-1280 C Probable non-functional immunoglobulin kappa variable 3-7 IGKV3-7 protein taxon:9606 20230911 UniProt
UniProtKB A0A075B6H8 IGKV1D-42 involved_in GO:0002250 GO_REF:0000043 IEA UniProtKB-KW:KW-1064 P Probable non-functional immunoglobulin kappa variable 1D-42 IGKV1D-42 protein taxon:9606 20230911 UniProt
UniProtKB A0A075B6H8 IGKV1D-42 located_in GO:0005886 GO_REF:0000044 IEA UniProtKB-SubCell:SL-0039 C Probable non-functional immunoglobulin kappa variable 1D-42 IGKV1D-42 protein taxon:9606 20230911 UniProt
UniProtKB A0A075B6H8 IGKV1D-42 part_of GO:0019814 GO_REF:0000043 IEA UniProtKB-KW:KW-1280 C Probable non-functional immunoglobulin kappa variable 1D-42 IGKV1D-42 protein taxon:9606 20230911 UniProt
UniProtKB A0A075B6H9 IGLV4-69 involved_in GO:0002250 GO_REF:0000043 IEA UniProtKB-KW:KW-1064 P Immunoglobulin lambda variable 4-69 IGLV4-69 protein taxon:9606 20230911 UniProt
(3) ID와 GO terms만 추출 후 .txt 파일로 저장
$ cat goa_human.gaf | grep -v ! | cut -f 3,5 > pairs.txt
$ cat pairs.txt | head
NUDT4B GO:0003723
NUDT4B GO:0046872
NUDT4B GO:0005829
IGKV3-7 GO:0002250
IGKV3-7 GO:0005886
IGKV3-7 GO:0019814
IGKV1D-42 GO:0002250
IGKV1D-42 GO:0005886
IGKV1D-42 GO:0019814
IGLV4-69 GO:0002250
(4) 연구에 더 적합한 association file 생성
- e.g., 가장 많이 annotate된 상위 5개의 유전자들의 공통적인 functions (MF) 를 알아보는 연구를 진행해보려 한다.
$ cat pairs.txt | awk '{if(a[$1])a[$1]=a[$1]";"$2; else a[$1]=$2;} END {for (i in a) print i, a[i];}' > association
$ cat association | head
CAPN6 GO:0005515;GO:0008017;GO:0001578;GO:0051493;GO:0005876;GO:0048471;GO:0004198;GO:0006508;GO:0005737
CAPN7 GO:0004175;GO:0004198;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0090541;GO:0010634;GO:0097264;GO:0005634;GO:0005813;GO:0005829;GO:0070062;GO:0070062;GO:0006508;GO:0004197
DACH1 GO:0000977;GO:0000978;GO:0001227;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0000122;GO:0001967;GO:0007585;GO:0010944;GO:0030336;GO:0033262;GO:0044342;GO:0045892;GO:0046545;GO:0048147;GO:0060244;GO:2000279;GO:0005654;GO:0005794;GO:0005829;GO:0016607;GO:0005667;GO:0000981;GO:0005634;GO:0000978;GO:0006357
HSP90AB1 GO:0008180;GO:0003723;GO:0003723;GO:0003725;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005524;GO:0005524;GO:0016887;GO:0019887;GO:0019900;GO:0019900;GO:0019901;GO:0023026;GO:0030235;GO:0030911;GO:0031072;GO:0031625;GO:0042277;GO:0042802;GO:0042802;GO:0042802;GO:0042803;GO:0042826;GO:0043008;GO:0043008;GO:0045296;GO:0046983;GO:0048156;GO:0070182;GO:0070182;GO:0097718;GO:0097718;GO:0140662;GO:1990226;GO:0001890;GO:0006986;GO:0007004;GO:0019062;GO:0030511;GO:0031396;GO:0032435;GO:0032516;GO:0032880;GO:0043066;GO:0045429;GO:0045597;GO:0051131;GO:0051248;GO:0051726;GO:0051973;GO:0071353;GO:0097435;GO:1901389;GO:1901799;GO:1902949;GO:1904031;GO:1905323;GO:2000010;GO:0005576;GO:0005576;GO:0005576;GO:0005634;GO:0005654;GO:0005654;GO:0005654;GO:0005654;GO:0005654;GO:0005654;GO:0005654;GO:0005654;GO:0005654;GO:0005654;GO:0005654;GO:0005654;GO:0005654;GO:0005654;GO:0005737;GO:0005737;GO:0005737;GO:0005737;GO:0005737;GO:0005739;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0009986;GO:0016020;GO:0034774;GO:0042470;GO:0043025;GO:0044294;GO:0044295;GO:0048471;GO:0070062;GO:0070062;GO:0070062;GO:0120293;GO:1904813;GO:0032991;GO:0032991;GO:0032991;GO:0034751;GO:1990565;GO:0006457;GO:0048471;GO:0005886;GO:0005829;GO:0034605;GO:0032991;GO:0051082;GO:0050821;GO:0097718;GO:0005524
DACH2 GO:0046545;GO:0000978;GO:0005667;GO:0000981;GO:0005634;GO:0006357
CAPN8 GO:0005509;GO:0005794;GO:0006508;GO:0005737;GO:0004198
IFNA14 GO:0005126;GO:0005515;GO:0051607;GO:0060337;GO:0098586;GO:0005576;GO:0005576;GO:0005576;GO:0005576;GO:0005576;GO:0005576;GO:0005576;GO:0005576;GO:0005576;GO:0005576;GO:0005576;GO:0005576;GO:0005576;GO:0005576;GO:0019221;GO:0005125;GO:0033141;GO:0002286;GO:0002323;GO:0006959;GO:0030183;GO:0005615;GO:0005132;GO:0043330;GO:0042100;GO:0002250
AVL9 GO:0016477;GO:0016020;GO:0055037;GO:0005737
NEB GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0008307;GO:0007517;GO:0007525;GO:0030832;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0015629;GO:0030018;GO:0070062;GO:0051015;GO:0071691;GO:0030018
CAPN9 GO:0005509;GO:0007586;GO:0005575;GO:0005737;GO:0006508;GO:0004198
- i: key (e.g., CAPN6)
- a[i]: array of the key's associated values (e.g., GO:0005515;GO:0008017;GO:0001578;GO:0051493;...)
- $1: .txt 파일 각 행의 첫 번째 열의 값 (혹은 텍스트) (e.g., 첫 번째 행에서 $1은 NUDT4B)
- $2: .txt 파일 각 행의 두 번째 열의 값 (혹은 텍스트) (e.g., 첫 번째 행에서 $2은 GO:0003723)
- 만약 a[$1] (array of the associated values of $1) 가 이미 존재한다면, 그 array에 세미콜론을 구분자로 하여 $2를 append한다. 하지만 만약 a[$1]가 아직 존재하지 않는다면, 해당 array에 $2를 부여함으로써 a[$1]을 생성한다.
(5) population (background) file 생성
$ cat pairs.txt | cut -f 1 > population
$ cat population | head
NUDT4B
NUDT4B
NUDT4B
IGKV3-7
IGKV3-7
IGKV3-7
IGKV1D-42
IGKV1D-42
IGKV1D-42
IGLV4-69
(6) study file 생성
$ cat pairs.txt | cut -f 1 | sort | uniq -c | sort -rn | tr -s ' ' | cut -f 3 -d ' ' | head -5 > study
$ cat study
HTT
TP53
EGFR
GRB2
GOLGA2
- tr -s ' ': 반복되는 space 공백 제거
- cut -d ' ': space 공백을 구분자로 사용
(7) enrichment 찾기
- 통계학적인 방법들을 통해 study file 내의 유전자들과 significant하게 연관되어 있는 GO terms 발견하기
- MF sub-ontology에 해당하는 GO terms만 추출
$ find_enrichment.py study population association --obo go-basic.obo > results.txt
$ cat results.txt | grep GO: | grep MF | cut -f 1 > terms.txt
$ cat terms.txt | head
GO:0019900
GO:0019903
GO:0019901
GO:0019902
GO:0042802
GO:0019899
GO:0002039
GO:0030971
GO:0005091
GO:1990782
$ cat terms.txt | wc -l
84
(8) 시각화
$ go_plot.py --obo go-basic.obo --go_file terms.txt
go-basic.obo: fmt(1.2) rel(2023-11-15) 46,228 Terms
GoSubDag: 84 sources in 125 GOs rcnt(True). 0 alt GO IDs
GoSubDag: namedtuple fields: NS level depth GO alt GO_name dcnt D1 id
GoSubDag: relationships: set()
GO:0003674 # MF 11234 L00 D00 molecular_function
#f1fbfd GO:0005488 # MF 1821 L01 D01 B binding
#ffffe4 GO:0098772 # MF 253 L01 D01 D molecular function regulator activity
#f1fbfd GO:0003824 # MF 7588 L01 D01 A catalytic activity
#f1fbfd GO:0060089 # MF 399 L01 D01 C molecular transducer activity
#f1fbfd GO:0060090 # MF 117 L01 D01 F molecular adaptor activity
#f1fbfd GO:0140657 # MF 210 L01 D01 E ATP-dependent activity
GO:0043167 # MF 187 L02 D02 B ion binding
GO:0005515 # MF 919 L02 D02 B protein binding
GO:0044877 # MF 119 L02 D02 B protein-containing complex binding
GO:0097159 # MF 483 L02 D02 B organic cyclic compound binding
#ffffe4 GO:0140677 # MF 102 L02 D02 D molecular function activator activity
#ffffe4 GO:0042562 # MF 32 L02 D02 B hormone binding
GO:0140096 # MF 602 L02 D02 A catalytic activity, acting on a protein
GO:0016740 # MF 2527 L02 D02 A transferase activity
GO:0038023 # MF 394 L02 D02 C signaling receptor activity
GO:0030674 # MF 47 L02 D02 F protein-macromolecule adaptor activity
#ffffe4 GO:0030234 # MF 164 L02 D02 D enzyme regulator activity
GO:0030545 # MF 50 L02 D02 D signaling receptor regulator activity
GO:0043169 # MF 50 L03 D03 B cation binding
#ffffe4 GO:0019899 # MF 70 L03 D03 B enzyme binding
#ffffe4 GO:0005102 # MF 417 L03 D03 B signaling receptor binding
#ffffe4 GO:0050839 # MF 17 L03 D03 B cell adhesion molecule binding
#ffffe4 GO:0001098 # MF 13 L03 D03 B basal transcription machinery binding
GO:0008134 # MF 38 L03 D03 B transcription factor binding
GO:0003676 # MF 310 L03 D03 B nucleic acid binding
#ffffe4 GO:0097371 # MF 0 L03 D03 B MDM2/MDM4 family protein binding
#ffffe4 GO:0044325 # MF 0 L03 D03 B transmembrane transporter binding
#ffffe4 GO:0140272 # MF 2 L03 D03 B exogenous protein binding
#ffffe4 GO:0061676 # MF 0 L03 D03 B importin-alpha family protein binding
#ffffe4 GO:0019904 # MF 79 L03 D03 B protein domain specific binding
#ffffe4 GO:0008092 # MF 53 L03 D03 B cytoskeletal protein binding
#ffffe4 GO:0003682 # MF 10 L03 D03 B chromatin binding
GO:0016772 # MF 604 L03 D03 A transferase activity, transferring phosphorus-containing groups
GO:0004888 # MF 331 L03 D03 C transmembrane signaling receptor activity
#ffffe4 GO:0031072 # MF 3 L03 D03 B heat shock protein binding
#ffffe4 GO:0035591 # MF 5 L03 D03 F signaling adaptor activity
#ffffe4 GO:0043621 # MF 1 L03 D03 B protein self-association
#ffffe4 GO:0002039 # MF 0 L03 D03 B p53 binding
#ffffe4 GO:0051219 # MF 5 L03 D03 B phosphoprotein binding
#ffffe4 GO:0030235 # MF 1 L03 D03 D nitric-oxide synthase regulator activity
#ffffe4 GO:0051087 # MF 1 L03 D03 B protein-folding chaperone binding
#ffffe4 GO:0005522 # MF 0 L03 D03 B profilin binding
GO:0019207 # MF 32 L03 D03 D kinase regulator activity
GO:0008047 # MF 56 L03 D03 D enzyme activator activity
#ffffe4 GO:0019838 # MF 26 L03 D03 B growth factor binding
#ffffe4 GO:0045505 # MF 0 L03 D03 B dynein intermediate chain binding
#ffffe4 GO:0043560 # MF 0 L03 D03 B insulin receptor substrate binding
#ffffe4 GO:0042802 # MF 1 L03 D03 B identical protein binding
#ffffe4 GO:0000149 # MF 3 L03 D03 B SNARE binding
GO:0030546 # MF 37 L03 D03 D signaling receptor activator activity
GO:0060589 # MF 8 L03 D03 D nucleoside-triphosphatase regulator activity
#ffffe4 GO:0071889 # MF 0 L03 D03 B 14-3-3 protein binding
#ffffe4 GO:0035033 # MF 1 L03 D03 D histone deacetylase regulator activity
GO:0046872 # MF 26 L04 D04 B metal ion binding
#ffffe4 GO:0051117 # MF 0 L04 D04 B ATPase binding
#ffffe4 GO:0005178 # MF 1 L03 D04 B integrin binding
#ffffe4 GO:0001099 # MF 11 L04 D04 B basal RNA polymerase II transcription machinery binding
#ffffe4 GO:0140296 # MF 13 L04 D04 B general transcription initiation factor binding
GO:0003677 # MF 131 L04 D04 B DNA binding
#ffffe4 GO:0002020 # MF 2 L04 D04 B protease binding
#ffffe4 GO:0019902 # MF 5 L04 D04 B phosphatase binding
#ffffe4 GO:0017124 # MF 0 L04 D04 B SH3 domain binding
#ffffe4 GO:0019900 # MF 17 L04 D04 B kinase binding
#ffffe4 GO:0015631 # MF 7 L04 D04 B tubulin binding
#ffffe4 GO:0001618 # MF 0 L04 D04 B virus receptor activity
GO:0016301 # MF 357 L04 D04 A kinase activity
GO:0016773 # MF 322 L04 D04 A phosphotransferase activity, alcohol group as acceptor
#ffffe4 GO:1990841 # MF 0 L04 D04 B promoter-specific chromatin binding
GO:0005126 # MF 91 L04 D04 B cytokine receptor binding
#ffffe4 GO:0097718 # MF 0 L04 D04 B disordered domain specific binding
#ffffe4 GO:0045309 # MF 4 L04 D04 B protein phosphorylated amino acid binding
#ffffe4 GO:0030159 # MF 2 L04 D04 F signaling receptor complex adaptor activity
#ffffe4 GO:0019209 # MF 9 L04 D04 D kinase activator activity
GO:0019887 # MF 23 L04 D04 D protein kinase regulator activity
GO:0003723 # MF 163 L04 D04 B RNA binding
#ffffe4 GO:0140666 # MF 4 L04 D04 B annealing activity
#ffffe4 GO:0070851 # MF 29 L04 D04 B growth factor receptor binding
#ffffe4 GO:0046875 # MF 0 L04 D04 B ephrin receptor binding
#ffffe4 GO:0042826 # MF 0 L04 D04 B histone deacetylase binding
#ffffe4 GO:0019905 # MF 2 L04 D04 B syntaxin binding
#ffffe4 GO:0048408 # MF 0 L03 D04 B epidermal growth factor binding
GO:0001067 # MF 35 L04 D04 B transcription regulatory region nucleic acid binding
GO:0030695 # MF 7 L04 D04 D GTPase regulator activity
#ffffe4 GO:0044389 # MF 1 L04 D04 B ubiquitin-like protein ligase binding
#ffffe4 GO:0034452 # MF 0 L04 D04 B dynactin binding
#ffffe4 GO:0045296 # MF 1 L04 D04 B cadherin binding
GO:0046914 # MF 14 L05 D05 B transition metal ion binding
#ffffe4 GO:0001091 # MF 6 L05 D05 B RNA polymerase II general transcription initiation factor binding
#ffffe4 GO:0003697 # MF 8 L05 D05 B single-stranded DNA binding
#ffffe4 GO:0019901 # MF 15 L05 D05 B protein kinase binding
#ffffe4 GO:0048487 # MF 0 L05 D05 B beta-tubulin binding
GO:0004672 # MF 116 L03 D05 A protein kinase activity
#ffffe4 GO:0005165 # MF 7 L05 D05 B neurotrophin receptor binding
#ffffe4 GO:0019903 # MF 4 L05 D05 B protein phosphatase binding
#ffffe4 GO:0030295 # MF 7 L05 D05 D protein kinase activator activity
GO:0003729 # MF 74 L05 D05 B mRNA binding
#ffffe4 GO:0001784 # MF 0 L05 D05 B phosphotyrosine residue binding
#ffffe4 GO:0005154 # MF 0 L05 D05 B epidermal growth factor receptor binding
GO:0043565 # MF 61 L05 D05 B sequence-specific DNA binding
GO:0003690 # MF 78 L05 D05 B double-stranded DNA binding
#ffffe4 GO:0005091 # MF 0 L03 D05 DF guanyl-nucleotide exchange factor adaptor activity
#ffffe4 GO:0031625 # MF 0 L05 D05 B ubiquitin protein ligase binding
#ffffe4 GO:0005507 # MF 2 L06 D06 B copper ion binding
#ffffe4 GO:0001094 # MF 0 L03 D06 B TFIID-class transcription factor complex binding
#ffffe4 GO:0019199 # MF 30 L04 D06 AC transmembrane receptor protein kinase activity
#ffffe4 GO:0004713 # MF 22 L04 D06 A protein tyrosine kinase activity
#ffffe4 GO:0005167 # MF 3 L06 D06 B neurotrophin TRK receptor binding
GO:0004674 # MF 59 L04 D06 A protein serine/threonine kinase activity
#ffffe4 GO:1990782 # MF 2 L06 D06 B protein tyrosine kinase binding
#ffffe4 GO:0003730 # MF 1 L06 D06 B mRNA 3'-UTR binding
#ffffe4 GO:0051721 # MF 0 L06 D06 B protein phosphatase 2A binding
#ffffe4 GO:0030296 # MF 2 L06 D06 D protein tyrosine kinase activator activity
#ffffe4 GO:1990814 # MF 1 L05 D06 B DNA/DNA annealing activity
GO:1990837 # MF 49 L06 D06 B sequence-specific double-stranded DNA binding
#ffffe4 GO:0004714 # MF 18 L05 D07 AC transmembrane receptor protein tyrosine kinase activity
#ffffe4 GO:0004709 # MF 1 L05 D07 A MAP kinase kinase kinase activity
#ffffe4 GO:0030971 # MF 1 L04 D07 B receptor tyrosine kinase binding
#ffffe4 GO:0005168 # MF 0 L07 D07 B neurotrophin TRKA receptor binding
#ffffe4 GO:0030297 # MF 0 L04 D07 D transmembrane receptor protein tyrosine kinase activator activity
#ffffe4 GO:0036310 # MF 0 L02 D07 BE ATP-dependent DNA/DNA annealing activity
GO:0000976 # MF 32 L05 D07 B transcription cis-regulatory region binding
#ffffe4 GO:0005006 # MF 0 L06 D08 AC epidermal growth factor receptor activity
#ffffe4 GO:0005068 # MF 0 L05 D08 BF transmembrane receptor protein tyrosine kinase adaptor activity
#ffffe4 GO:0001046 # MF 2 L06 D08 B core promoter sequence-specific DNA binding
84 usr 125 GOs WROTE: go_plot.png
Reference
The Biostar Handbook: 2nd Edition - István Albert
'생물정보학 끄적끄적' 카테고리의 다른 글
GO Enrichment 분석 연습 (78) | 2023.12.28 |
---|---|
[Linux] Ontology (3) - bio explain (86) | 2023.12.25 |
[Linux] Ontology (1) - Sequence Ontology (80) | 2023.12.23 |
[Linux] bio fetch와 bio search (5) | 2023.12.21 |
[Linux] Entrez Web API와 Entrez Direct로 NCBI 데이터베이스 접속하기 (3) | 2023.12.21 |