생물정보학 끄적끄적

[Linux] Ontology (2) - Gene Ontology (with goatools)

Hazel Y. 2023. 12. 25. 15:26

 

이전 포스팅에 이어서 이번엔 gene ontology에 대해 소개하려 한다.

 

Ontology (1) - Sequence Ontology

Ontology는 의미적으로, 문맥적으로 연결되어 있는 단어 및 용어들 간의 집합 같은 것이다. 생물정보학에서 주로 쓰이는 ontology에는 sequence ontology와 gene ontology, 이렇게 두 가지가 있다. 이번 포스팅

livelyhheesun.tistory.com

 

 

2-1. Gene ontology (GO)

- gene functions와 관련된 정보에 해당하는 용어들

- 세 개의 sub-ontologies

   (1) cellular component: gene function의 결과가 나타나는 곳에 대해 (e.g., Golgi cisterna membrane)

   (2) molecular function: gene function의 생화학적 activity에 대해 (e.g., galactoside 2-alpha-L-fucosyltransferase activity)

   (3) biological process: gene function의 목적에 대해 (e.g., carbohydrate metabolic process)

- https://www.geneontology.org/, https://www.ebi.ac.uk/QuickGO/, https://amigo.geneontology.org/amigo에서 검색 가능

 

Gene Ontology Resource

The Gene Ontology (GO) project is a major bioinformatics initiative to develop a computational representation of our evolving knowledge of how genes encode biological functions at the molecular, cellular and tissue system levels.

geneontology.org

 

QuickGO

 

www.ebi.ac.uk

 

AmiGO 2: Welcome

Many more tools are available from the software list, such as alternate searching modes, Visualize, non-JavaScript pages. Go »

amigo.geneontology.org

 

 

2-2. GO data investigate해보기

 

(1) GO data와 association file 다운로드

$ curl -OL https://purl.obolibrary.org/obo/go.obo
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--    100   316    0   316    0     0    440      0 --:--:-- --:--:-- --:--:--   441
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--      0 32.7M    0  320k    0     0   180k      0  0:03:05  0:00:01  0:03:04  32 35 32.7M   35 11.6M    0     0  4290k      0  0:00:07  0:00:02  0:00:05 598 94 32.7M   94 30.8M    0     0  8384k      0  0:00:03  0:00:03 --:--:-- 10.100 32.7M  100 32.7M    0     0  8681k      0  0:00:03  0:00:03 --:--:-- 10.6M
  
 $ curl -OL https://geneontology.org/gene-associations/goa_human.gaf.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--      0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--      0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--    100   244  100   244    0     0    139      0  0:00:01  0:00:01 --:--:--   139
  0 11.1M    0 52836    0     0  22631      0  0:08:36  0:00:02  0:08:34 226 13 11.1M   13 1573k    0     0   491k      0  0:00:23  0:00:03  0:00:20 174100 11.1M  100 11.1M    0     0  3056k      0  0:00:03  0:00:03 --:--:-- 8110k

 

(2) association file 압축 해제 & comments 제거

- .obo와 .gaf format의 파일에서는 # 대신 !가 comments를 나타냄.

$ gunzip -c goa_human.gaf.gz | grep -v '^!' > goa_human.gaf

 

(3) association mapping 수

$ cat goa_human.gaf | wc -l
635268

 

 

→ 635,268 associations/mappings

 

(4) unique gene 수

$ cat goa_human.gaf | cut -f 3 | sort-uniq-count | head
1       A0A0G2JLM5
1       A0A0G2JN68
1       A0A0G2JNG1
5       A0A0G2JNH3
1       A0A0G2JP03
4       A0A0G2JPB7
1       A0A0G2JQ47
8       A0A1B0GTQ1
1       A0A1B0GUZ9
4       A0A1W2PRG0

$ cat goa_human.gaf | cut -f 3 | sort-uniq-count | wc -l
19606

 

→ 19,606 unique genes

 

(5) 가장 많이 annotate된 상위 10개 단백질/유전자 알아보기

$ cat goa_human.gaf | cut -f 2 | sort-uniq-count-rank > prot_counts.txt

$ cat prot_counts.txt | head
1098    P42858
979     P04637
920     P00533
857     P62993
690     Q08379
668     P05067
664     P35222
636     P0CG48
609     Q12933
581     P62979

$ cat goa_human.gaf | cut -f 3 | sort-uniq-count-rank > gene_counts.txt

$ cat gene_counts.txt | head
1098    HTT
979     TP53
920     EGFR
857     GRB2
690     GOLGA2
668     APP
664     CTNNB1
636     UBC
609     TRAF2
581     RPS27A

 

(6) gene-GO terms annotations에 대한 descriptive statistics

$ cat gene_counts.txt | datamash mean 1 min 1 max 1 sstdev 1
32.401713761094 1       1098    45.499398474444

 

→ 표준편차 45.5: 꽤 variable한 annotation 수

 

(7) 연도별 annotation 추이

$ cat goa_human.gaf \
| cut -f 14 \
| awk '{ print substr($1, 1, 4) }' \
| sort \
| uniq -c \
| sort -k 2 -nr \
| awk '{ print $2 "\t" $1 }'
2023    301702
2022    21707
2021    16921
2020    27474
2019    15684
2018    18034
2017    27376
2016    20288
2015    16414
2014    21606
2013    57104
2012    12902
2011    18941
2010    11277
2009    11218
2008    7894
2007    4319
2006    6928
2005    3959
2004    2936
2003    7930
2002    952
2001    1697
2000    5

 

- sort -k 2 -nr: 두 번째 필드를 기준으로 (k -2) 내림차순 (-nr) 정렬

 

 

2-3. goatools

- GO terms의 시각화를 위한 패키지

 

(1) 설치

$ micromamba install pygraphviz
$ pip install goatools pydot statsmodels fisher

 

(2) GO terms가 정의된 파일 다운로드

$ wget -nc https://geneontology.org/ontology/go-basic.obo
--2023-12-24 21:01:03--  https://geneontology.org/ontology/go-basic.obo
Resolving geneontology.org (geneontology.org)... 34.233.67.155
Connecting to geneontology.org (geneontology.org)|34.233.67.155|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://purl.obolibrary.org/obo/go/go-basic.obo [following]
--2023-12-24 21:01:04--  http://purl.obolibrary.org/obo/go/go-basic.obo
Resolving purl.obolibrary.org (purl.obolibrary.org)... 104.18.37.59, 172.64.150.197, 2606:4700:4400::ac40:96c5, ...
Connecting to purl.obolibrary.org (purl.obolibrary.org)|104.18.37.59|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://current.geneontology.org/ontology/go-basic.obo [following]
--2023-12-24 21:01:05--  http://current.geneontology.org/ontology/go-basic.obo
Resolving current.geneontology.org (current.geneontology.org)... 13.224.14.105, 13.224.14.114, 13.224.14.57, ...
Connecting to current.geneontology.org (current.geneontology.org)|13.224.14.105|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31136976 (30M) [text/obo]
Saving to: 'go-basic.obo'

go-basic.obo       100%[================>]  29.69M  8.79MB/s    in 3.4s

2023-12-24 21:01:09 (8.79 MB/s) - 'go-basic.obo' saved [31136976/31136976]

 

(3) 시각화

- 다양한 레벨과 형태로 시각화 가능

- e.g., Golgi cisterna (GO:0031985)

 

(a)

$ plot_go_term.py --term GO:0031985 go-basic.obo
go-basic.obo: fmt(1.2) rel(2023-11-15) 46,228 Terms
GO:0031985      level-04        depth-04        Golgi cisterna [cellular_component]
all parents: {'GO:0098791', 'GO:0005575', 'GO:0031984', 'GO:0110165'}
all children: {'GO:0005797', 'GO:0000137', 'GO:0000138'}
lineage info for terms ['GO:0031985'] written to GO_lineage.pdf

 

(b)

$ go_plot.py --obo go-basic.obo -o output_1.png GO:0031985
go-basic.obo: fmt(1.2) rel(2023-11-15) 46,228 Terms
 GoSubDag:   1 sources in   5 GOs rcnt(True). 0 alt GO IDs
 GoSubDag: namedtuple fields: NS level depth GO alt GO_name dcnt D1 id
 GoSubDag: relationships: set()
        GO:0005575  # CC  4060 L00 D00       cellular_component
#f1fbfd GO:0110165  # CC  1886 L01 D01 A     cellular anatomical entity
        GO:0031984  # CC    14 L02 D02 A     organelle subcompartment
        GO:0098791  # CC     6 L03 D03 A     Golgi apparatus subcompartment
#ffffe4 GO:0031985  # CC     3 L04 D04 A     Golgi cisterna
    1 usr   5 GOs  WROTE: output_1.png

 

(c)

$ go_plot.py --obo go-basic.obo -r -o output_2.png GO:0031985
go-basic.obo: fmt(1.2) rel(2023-11-15) 46,228 Terms; optional_attrs(relationship)
 GoSubDag:   1 sources in  14 GOs rcnt(True). 0 alt GO IDs
 GoSubDag: namedtuple fields: NS level depth reldepth GO alt GO_name dcnt D1 childcnt REL REL_short rel id
 GoSubDag: relationships: {'negatively_regulates', 'regulates', 'part_of', 'positively_regulates'}
        GO:0005575  # CC  4060   3 L00 D00 R00       .... .... cellular_component
#f1fbfd GO:0110165  # CC  3439 429 L01 D01 R01 A     .... .... cellular anatomical entity
        GO:0031984  # CC    84   3 L02 D02 R04 A     P... .... organelle subcompartment
        GO:0043226  # CC  1861   5 L02 D02 R02 A     .... p... organelle
        GO:0005622  # CC  2074   7 L02 D02 R02 A     .... p... intracellular anatomical structure
        GO:0005737  # CC  1150   5 L02 D02 R03 A     P... p... cytoplasm
        GO:0012505  # CC   338   0 L02 D02 R02 A     .... p... endomembrane system
        GO:0098791  # CC    15   3 L03 D03 R06 A     P... .... Golgi apparatus subcompartment
        GO:0043229  # CC  1719   2 L03 D03 R03 A     P... p... intracellular organelle
        GO:0043227  # CC  1355   5 L03 D03 R03 A     .... p... membrane-bounded organelle
        GO:0043231  # CC  1204  34 L04 D04 R04 A     .... .... intracellular membrane-bounded organelle
        GO:0005795  # CC    12   0 L04 D04 R07 A     .... p... Golgi stack
#ffffe4 GO:0031985  # CC    10   3 L04 D04 R08 A     P... p... Golgi cisterna
        GO:0005794  # CC    38   1 L05 D05 R05 A     P... p... Golgi apparatus
    1 usr  14 GOs  WROTE: output_2.png

 

 

※ goatools는 GO가 distribute될 때 사용되는 기본 데이터 format (.gaf) 을 가지고 operate할 수 없다. 따라서 population/association/study files를 만드는 별도의 전처리 과정이 필요하다.

 

(1) human gene annotations (association) 파일 (.gaf) 다운로드

$ wget -nc https://geneontology.org/gene-associations/goa_human.gaf.gz
--2023-12-24 21:21:32--  https://geneontology.org/gene-associations/goa_human.gaf.gz
Resolving geneontology.org (geneontology.org)... 34.233.67.155
Connecting to geneontology.org (geneontology.org)|34.233.67.155|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://current.geneontology.org/annotations/goa_human.gaf.gz [following]
--2023-12-24 21:21:33--  http://current.geneontology.org/annotations/goa_human.gaf.gz
Resolving current.geneontology.org (current.geneontology.org)... 13.224.14.114, 13.224.14.105, 13.224.14.57, ...
Connecting to current.geneontology.org (current.geneontology.org)|13.224.14.114|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11696775 (11M) [application/gzip]
Saving to: 'goa_human.gaf.gz'

goa_human.gaf.gz   100%[================>]  11.15M  6.71MB/s    in 1.7s

2023-12-24 21:21:35 (6.71 MB/s) - 'goa_human.gaf.gz' saved [11696775/11696775]

$ gunzip goa_human.gaf.gz

$ cat goa_human.gaf | head
!gaf-version: 2.2
!
!generated-by: GOC
!
!date-generated: 2023-11-16T17:16
!
!Header from source association file:
!=================================
!
!generated-by: GOC

 

(2) !를 없앤 채 데이터 파일 살펴보기

$ cat goa_human.gaf | grep -v ! | head
UniProtKB       A0A024RBG1      NUDT4B  enables GO:0003723      GO_REF:0000043      IEA     UniProtKB-KW:KW-0694    F       Diphosphoinositol polyphosphate phosphohydrolase NUDT4B     NUDT4B  protein taxon:9606      20230911   UniProt
UniProtKB       A0A024RBG1      NUDT4B  enables GO:0046872      GO_REF:0000043      IEA     UniProtKB-KW:KW-0479    F       Diphosphoinositol polyphosphate phosphohydrolase NUDT4B     NUDT4B  protein taxon:9606      20230911   UniProt
UniProtKB       A0A024RBG1      NUDT4B  located_in      GO:0005829      GO_REF:0000052      IDA             C       Diphosphoinositol polyphosphate phosphohydrolase NUDT4B     NUDT4B  protein taxon:9606      20230619        HPA
UniProtKB       A0A075B6H7      IGKV3-7 involved_in     GO:0002250      GO_REF:0000043      IEA     UniProtKB-KW:KW-1064    P       Probable non-functional immunoglobulin kappa variable 3-7   IGKV3-7 protein taxon:9606      20230911    UniProt
UniProtKB       A0A075B6H7      IGKV3-7 located_in      GO:0005886      GO_REF:0000044      IEA     UniProtKB-SubCell:SL-0039       C       Probable non-functional immunoglobulin kappa variable 3-7   IGKV3-7 protein taxon:9606 20230911 UniProt
UniProtKB       A0A075B6H7      IGKV3-7 part_of GO:0019814      GO_REF:0000043      IEA     UniProtKB-KW:KW-1280    C       Probable non-functional immunoglobulin kappa variable 3-7   IGKV3-7 protein taxon:9606      20230911   UniProt
UniProtKB       A0A075B6H8      IGKV1D-42       involved_in     GO:0002250 GO_REF:0000043   IEA     UniProtKB-KW:KW-1064    P       Probable non-functional immunoglobulin kappa variable 1D-42 IGKV1D-42       protein taxon:9606 20230911 UniProt
UniProtKB       A0A075B6H8      IGKV1D-42       located_in      GO:0005886 GO_REF:0000044   IEA     UniProtKB-SubCell:SL-0039       C       Probable non-functional immunoglobulin kappa variable 1D-42 IGKV1D-42       protein taxon:9606  20230911        UniProt
UniProtKB       A0A075B6H8      IGKV1D-42       part_of GO:0019814      GO_REF:0000043      IEA     UniProtKB-KW:KW-1280    C       Probable non-functional immunoglobulin kappa variable 1D-42 IGKV1D-42       protein taxon:9606 20230911 UniProt
UniProtKB       A0A075B6H9      IGLV4-69        involved_in     GO:0002250 GO_REF:0000043   IEA     UniProtKB-KW:KW-1064    P       Immunoglobulin lambda variable 4-69 IGLV4-69        protein taxon:9606      20230911        UniProt

 

(3) ID와 GO terms만 추출 후 .txt 파일로 저장

$ cat goa_human.gaf | grep -v ! | cut -f 3,5 > pairs.txt

$ cat pairs.txt | head
NUDT4B  GO:0003723
NUDT4B  GO:0046872
NUDT4B  GO:0005829
IGKV3-7 GO:0002250
IGKV3-7 GO:0005886
IGKV3-7 GO:0019814
IGKV1D-42       GO:0002250
IGKV1D-42       GO:0005886
IGKV1D-42       GO:0019814
IGLV4-69        GO:0002250

 

(4) 연구에 더 적합한 association file 생성

- e.g., 가장 많이 annotate된 상위 5개의 유전자들의 공통적인 functions (MF) 를 알아보는 연구를 진행해보려 한다.

$ cat pairs.txt | awk '{if(a[$1])a[$1]=a[$1]";"$2; else a[$1]=$2;} END {for (i in a) print i, a[i];}' > association

$ cat association | head
CAPN6 GO:0005515;GO:0008017;GO:0001578;GO:0051493;GO:0005876;GO:0048471;GO:0004198;GO:0006508;GO:0005737
CAPN7 GO:0004175;GO:0004198;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0090541;GO:0010634;GO:0097264;GO:0005634;GO:0005813;GO:0005829;GO:0070062;GO:0070062;GO:0006508;GO:0004197
DACH1 GO:0000977;GO:0000978;GO:0001227;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0000122;GO:0001967;GO:0007585;GO:0010944;GO:0030336;GO:0033262;GO:0044342;GO:0045892;GO:0046545;GO:0048147;GO:0060244;GO:2000279;GO:0005654;GO:0005794;GO:0005829;GO:0016607;GO:0005667;GO:0000981;GO:0005634;GO:0000978;GO:0006357
HSP90AB1 GO:0008180;GO:0003723;GO:0003723;GO:0003725;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005524;GO:0005524;GO:0016887;GO:0019887;GO:0019900;GO:0019900;GO:0019901;GO:0023026;GO:0030235;GO:0030911;GO:0031072;GO:0031625;GO:0042277;GO:0042802;GO:0042802;GO:0042802;GO:0042803;GO:0042826;GO:0043008;GO:0043008;GO:0045296;GO:0046983;GO:0048156;GO:0070182;GO:0070182;GO:0097718;GO:0097718;GO:0140662;GO:1990226;GO:0001890;GO:0006986;GO:0007004;GO:0019062;GO:0030511;GO:0031396;GO:0032435;GO:0032516;GO:0032880;GO:0043066;GO:0045429;GO:0045597;GO:0051131;GO:0051248;GO:0051726;GO:0051973;GO:0071353;GO:0097435;GO:1901389;GO:1901799;GO:1902949;GO:1904031;GO:1905323;GO:2000010;GO:0005576;GO:0005576;GO:0005576;GO:0005634;GO:0005654;GO:0005654;GO:0005654;GO:0005654;GO:0005654;GO:0005654;GO:0005654;GO:0005654;GO:0005654;GO:0005654;GO:0005654;GO:0005654;GO:0005654;GO:0005654;GO:0005737;GO:0005737;GO:0005737;GO:0005737;GO:0005737;GO:0005739;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0009986;GO:0016020;GO:0034774;GO:0042470;GO:0043025;GO:0044294;GO:0044295;GO:0048471;GO:0070062;GO:0070062;GO:0070062;GO:0120293;GO:1904813;GO:0032991;GO:0032991;GO:0032991;GO:0034751;GO:1990565;GO:0006457;GO:0048471;GO:0005886;GO:0005829;GO:0034605;GO:0032991;GO:0051082;GO:0050821;GO:0097718;GO:0005524
DACH2 GO:0046545;GO:0000978;GO:0005667;GO:0000981;GO:0005634;GO:0006357
CAPN8 GO:0005509;GO:0005794;GO:0006508;GO:0005737;GO:0004198
IFNA14 GO:0005126;GO:0005515;GO:0051607;GO:0060337;GO:0098586;GO:0005576;GO:0005576;GO:0005576;GO:0005576;GO:0005576;GO:0005576;GO:0005576;GO:0005576;GO:0005576;GO:0005576;GO:0005576;GO:0005576;GO:0005576;GO:0005576;GO:0019221;GO:0005125;GO:0033141;GO:0002286;GO:0002323;GO:0006959;GO:0030183;GO:0005615;GO:0005132;GO:0043330;GO:0042100;GO:0002250
AVL9 GO:0016477;GO:0016020;GO:0055037;GO:0005737
NEB GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0005515;GO:0008307;GO:0007517;GO:0007525;GO:0030832;GO:0005829;GO:0005829;GO:0005829;GO:0005829;GO:0015629;GO:0030018;GO:0070062;GO:0051015;GO:0071691;GO:0030018
CAPN9 GO:0005509;GO:0007586;GO:0005575;GO:0005737;GO:0006508;GO:0004198

 

- i: key (e.g., CAPN6)

- a[i]: array of the key's associated values (e.g., GO:0005515;GO:0008017;GO:0001578;GO:0051493;...)

- $1: .txt 파일 각 행의 첫 번째 열의 값 (혹은 텍스트) (e.g., 첫 번째 행에서 $1은 NUDT4B)

- $2: .txt 파일 각 행의 두 번째 열의 값 (혹은 텍스트) (e.g., 첫 번째 행에서 $2은 GO:0003723)

- 만약 a[$1] (array of the associated values of $1) 가 이미 존재한다면, 그 array에 세미콜론을 구분자로 하여 $2를 append한다. 하지만 만약 a[$1]가 아직 존재하지 않는다면, 해당 array에 $2를 부여함으로써 a[$1]을 생성한다.

 

(5) population (background) file 생성

$ cat pairs.txt | cut -f 1 > population

$ cat population | head
NUDT4B
NUDT4B
NUDT4B
IGKV3-7
IGKV3-7
IGKV3-7
IGKV1D-42
IGKV1D-42
IGKV1D-42
IGLV4-69

 

(6) study file 생성

$ cat pairs.txt | cut -f 1 | sort | uniq -c | sort -rn | tr -s ' ' | cut -f 3 -d ' ' | head -5 > study

$ cat study
HTT
TP53
EGFR
GRB2
GOLGA2

 

- tr -s ' ': 반복되는 space 공백 제거

- cut -d ' ': space 공백을 구분자로 사용

 

(7) enrichment 찾기

- 통계학적인 방법들을 통해 study file 내의 유전자들과 significant하게 연관되어 있는 GO terms 발견하기

- MF sub-ontology에 해당하는 GO terms만 추출

$ find_enrichment.py study population association --obo go-basic.obo > results.txt

$ cat results.txt | grep GO: | grep MF | cut -f 1 > terms.txt

$ cat terms.txt | head
GO:0019900
GO:0019903
GO:0019901
GO:0019902
GO:0042802
GO:0019899
GO:0002039
GO:0030971
GO:0005091
GO:1990782

$ cat terms.txt | wc -l
84

 

(8) 시각화

$ go_plot.py --obo go-basic.obo --go_file terms.txt
go-basic.obo: fmt(1.2) rel(2023-11-15) 46,228 Terms
 GoSubDag:  84 sources in 125 GOs rcnt(True). 0 alt GO IDs
 GoSubDag: namedtuple fields: NS level depth GO alt GO_name dcnt D1 id
 GoSubDag: relationships: set()
        GO:0003674  # MF 11234 L00 D00       molecular_function
#f1fbfd GO:0005488  # MF  1821 L01 D01 B     binding
#ffffe4 GO:0098772  # MF   253 L01 D01 D     molecular function regulator activity
#f1fbfd GO:0003824  # MF  7588 L01 D01 A     catalytic activity
#f1fbfd GO:0060089  # MF   399 L01 D01 C     molecular transducer activity
#f1fbfd GO:0060090  # MF   117 L01 D01 F     molecular adaptor activity
#f1fbfd GO:0140657  # MF   210 L01 D01 E     ATP-dependent activity
        GO:0043167  # MF   187 L02 D02 B     ion binding
        GO:0005515  # MF   919 L02 D02 B     protein binding
        GO:0044877  # MF   119 L02 D02 B     protein-containing complex binding
        GO:0097159  # MF   483 L02 D02 B     organic cyclic compound binding
#ffffe4 GO:0140677  # MF   102 L02 D02 D     molecular function activator activity
#ffffe4 GO:0042562  # MF    32 L02 D02 B     hormone binding
        GO:0140096  # MF   602 L02 D02 A     catalytic activity, acting on a protein
        GO:0016740  # MF  2527 L02 D02 A     transferase activity
        GO:0038023  # MF   394 L02 D02 C     signaling receptor activity
        GO:0030674  # MF    47 L02 D02 F     protein-macromolecule adaptor activity
#ffffe4 GO:0030234  # MF   164 L02 D02 D     enzyme regulator activity
        GO:0030545  # MF    50 L02 D02 D     signaling receptor regulator activity
        GO:0043169  # MF    50 L03 D03 B     cation binding
#ffffe4 GO:0019899  # MF    70 L03 D03 B     enzyme binding
#ffffe4 GO:0005102  # MF   417 L03 D03 B     signaling receptor binding
#ffffe4 GO:0050839  # MF    17 L03 D03 B     cell adhesion molecule binding
#ffffe4 GO:0001098  # MF    13 L03 D03 B     basal transcription machinery binding
        GO:0008134  # MF    38 L03 D03 B     transcription factor binding
        GO:0003676  # MF   310 L03 D03 B     nucleic acid binding
#ffffe4 GO:0097371  # MF     0 L03 D03 B     MDM2/MDM4 family protein binding
#ffffe4 GO:0044325  # MF     0 L03 D03 B     transmembrane transporter binding
#ffffe4 GO:0140272  # MF     2 L03 D03 B     exogenous protein binding
#ffffe4 GO:0061676  # MF     0 L03 D03 B     importin-alpha family protein binding
#ffffe4 GO:0019904  # MF    79 L03 D03 B     protein domain specific binding
#ffffe4 GO:0008092  # MF    53 L03 D03 B     cytoskeletal protein binding
#ffffe4 GO:0003682  # MF    10 L03 D03 B     chromatin binding
        GO:0016772  # MF   604 L03 D03 A     transferase activity, transferring phosphorus-containing groups
        GO:0004888  # MF   331 L03 D03 C     transmembrane signaling receptor activity
#ffffe4 GO:0031072  # MF     3 L03 D03 B     heat shock protein binding
#ffffe4 GO:0035591  # MF     5 L03 D03 F     signaling adaptor activity
#ffffe4 GO:0043621  # MF     1 L03 D03 B     protein self-association
#ffffe4 GO:0002039  # MF     0 L03 D03 B     p53 binding
#ffffe4 GO:0051219  # MF     5 L03 D03 B     phosphoprotein binding
#ffffe4 GO:0030235  # MF     1 L03 D03 D     nitric-oxide synthase regulator activity
#ffffe4 GO:0051087  # MF     1 L03 D03 B     protein-folding chaperone binding
#ffffe4 GO:0005522  # MF     0 L03 D03 B     profilin binding
        GO:0019207  # MF    32 L03 D03 D     kinase regulator activity
        GO:0008047  # MF    56 L03 D03 D     enzyme activator activity
#ffffe4 GO:0019838  # MF    26 L03 D03 B     growth factor binding
#ffffe4 GO:0045505  # MF     0 L03 D03 B     dynein intermediate chain binding
#ffffe4 GO:0043560  # MF     0 L03 D03 B     insulin receptor substrate binding
#ffffe4 GO:0042802  # MF     1 L03 D03 B     identical protein binding
#ffffe4 GO:0000149  # MF     3 L03 D03 B     SNARE binding
        GO:0030546  # MF    37 L03 D03 D     signaling receptor activator activity
        GO:0060589  # MF     8 L03 D03 D     nucleoside-triphosphatase regulator activity
#ffffe4 GO:0071889  # MF     0 L03 D03 B     14-3-3 protein binding
#ffffe4 GO:0035033  # MF     1 L03 D03 D     histone deacetylase regulator activity
        GO:0046872  # MF    26 L04 D04 B     metal ion binding
#ffffe4 GO:0051117  # MF     0 L04 D04 B     ATPase binding
#ffffe4 GO:0005178  # MF     1 L03 D04 B     integrin binding
#ffffe4 GO:0001099  # MF    11 L04 D04 B     basal RNA polymerase II transcription machinery binding
#ffffe4 GO:0140296  # MF    13 L04 D04 B     general transcription initiation factor binding
        GO:0003677  # MF   131 L04 D04 B     DNA binding
#ffffe4 GO:0002020  # MF     2 L04 D04 B     protease binding
#ffffe4 GO:0019902  # MF     5 L04 D04 B     phosphatase binding
#ffffe4 GO:0017124  # MF     0 L04 D04 B     SH3 domain binding
#ffffe4 GO:0019900  # MF    17 L04 D04 B     kinase binding
#ffffe4 GO:0015631  # MF     7 L04 D04 B     tubulin binding
#ffffe4 GO:0001618  # MF     0 L04 D04 B     virus receptor activity
        GO:0016301  # MF   357 L04 D04 A     kinase activity
        GO:0016773  # MF   322 L04 D04 A     phosphotransferase activity, alcohol group as acceptor
#ffffe4 GO:1990841  # MF     0 L04 D04 B     promoter-specific chromatin binding
        GO:0005126  # MF    91 L04 D04 B     cytokine receptor binding
#ffffe4 GO:0097718  # MF     0 L04 D04 B     disordered domain specific binding
#ffffe4 GO:0045309  # MF     4 L04 D04 B     protein phosphorylated amino acid binding
#ffffe4 GO:0030159  # MF     2 L04 D04 F     signaling receptor complex adaptor activity
#ffffe4 GO:0019209  # MF     9 L04 D04 D     kinase activator activity
        GO:0019887  # MF    23 L04 D04 D     protein kinase regulator activity
        GO:0003723  # MF   163 L04 D04 B     RNA binding
#ffffe4 GO:0140666  # MF     4 L04 D04 B     annealing activity
#ffffe4 GO:0070851  # MF    29 L04 D04 B     growth factor receptor binding
#ffffe4 GO:0046875  # MF     0 L04 D04 B     ephrin receptor binding
#ffffe4 GO:0042826  # MF     0 L04 D04 B     histone deacetylase binding
#ffffe4 GO:0019905  # MF     2 L04 D04 B     syntaxin binding
#ffffe4 GO:0048408  # MF     0 L03 D04 B     epidermal growth factor binding
        GO:0001067  # MF    35 L04 D04 B     transcription regulatory region nucleic acid binding
        GO:0030695  # MF     7 L04 D04 D     GTPase regulator activity
#ffffe4 GO:0044389  # MF     1 L04 D04 B     ubiquitin-like protein ligase binding
#ffffe4 GO:0034452  # MF     0 L04 D04 B     dynactin binding
#ffffe4 GO:0045296  # MF     1 L04 D04 B     cadherin binding
        GO:0046914  # MF    14 L05 D05 B     transition metal ion binding
#ffffe4 GO:0001091  # MF     6 L05 D05 B     RNA polymerase II general transcription initiation factor binding
#ffffe4 GO:0003697  # MF     8 L05 D05 B     single-stranded DNA binding
#ffffe4 GO:0019901  # MF    15 L05 D05 B     protein kinase binding
#ffffe4 GO:0048487  # MF     0 L05 D05 B     beta-tubulin binding
        GO:0004672  # MF   116 L03 D05 A     protein kinase activity
#ffffe4 GO:0005165  # MF     7 L05 D05 B     neurotrophin receptor binding
#ffffe4 GO:0019903  # MF     4 L05 D05 B     protein phosphatase binding
#ffffe4 GO:0030295  # MF     7 L05 D05 D     protein kinase activator activity
        GO:0003729  # MF    74 L05 D05 B     mRNA binding
#ffffe4 GO:0001784  # MF     0 L05 D05 B     phosphotyrosine residue binding
#ffffe4 GO:0005154  # MF     0 L05 D05 B     epidermal growth factor receptor binding
        GO:0043565  # MF    61 L05 D05 B     sequence-specific DNA binding
        GO:0003690  # MF    78 L05 D05 B     double-stranded DNA binding
#ffffe4 GO:0005091  # MF     0 L03 D05 DF    guanyl-nucleotide exchange factor adaptor activity
#ffffe4 GO:0031625  # MF     0 L05 D05 B     ubiquitin protein ligase binding
#ffffe4 GO:0005507  # MF     2 L06 D06 B     copper ion binding
#ffffe4 GO:0001094  # MF     0 L03 D06 B     TFIID-class transcription factor complex binding
#ffffe4 GO:0019199  # MF    30 L04 D06 AC    transmembrane receptor protein kinase activity
#ffffe4 GO:0004713  # MF    22 L04 D06 A     protein tyrosine kinase activity
#ffffe4 GO:0005167  # MF     3 L06 D06 B     neurotrophin TRK receptor binding
        GO:0004674  # MF    59 L04 D06 A     protein serine/threonine kinase activity
#ffffe4 GO:1990782  # MF     2 L06 D06 B     protein tyrosine kinase binding
#ffffe4 GO:0003730  # MF     1 L06 D06 B     mRNA 3'-UTR binding
#ffffe4 GO:0051721  # MF     0 L06 D06 B     protein phosphatase 2A binding
#ffffe4 GO:0030296  # MF     2 L06 D06 D     protein tyrosine kinase activator activity
#ffffe4 GO:1990814  # MF     1 L05 D06 B     DNA/DNA annealing activity
        GO:1990837  # MF    49 L06 D06 B     sequence-specific double-stranded DNA binding
#ffffe4 GO:0004714  # MF    18 L05 D07 AC    transmembrane receptor protein tyrosine kinase activity
#ffffe4 GO:0004709  # MF     1 L05 D07 A     MAP kinase kinase kinase activity
#ffffe4 GO:0030971  # MF     1 L04 D07 B     receptor tyrosine kinase binding
#ffffe4 GO:0005168  # MF     0 L07 D07 B     neurotrophin TRKA receptor binding
#ffffe4 GO:0030297  # MF     0 L04 D07 D     transmembrane receptor protein tyrosine kinase activator activity
#ffffe4 GO:0036310  # MF     0 L02 D07 BE    ATP-dependent DNA/DNA annealing activity
        GO:0000976  # MF    32 L05 D07 B     transcription cis-regulatory region binding
#ffffe4 GO:0005006  # MF     0 L06 D08 AC    epidermal growth factor receptor activity
#ffffe4 GO:0005068  # MF     0 L05 D08 BF    transmembrane receptor protein tyrosine kinase adaptor activity
#ffffe4 GO:0001046  # MF     2 L06 D08 B     core promoter sequence-specific DNA binding
   84 usr 125 GOs  WROTE: go_plot.png

너무 큰 이미지라 일부만 캡쳐


Reference

The Biostar Handbook: 2nd Edition - István Albert