Semantic Network Analysis (16.03.2022.)

Hazel Y. 2022. 4. 15. 21:35

The following are the steps and code used for the semantic network analysis.

1. Import necessary packages.

import pandas as pd

from nltk.corpus import stopwords

import numpy as np

import matplotlib.pyplot as plt

import re

import networkx as nx

import operator

2. Import datasets.

suc = pd.read_csv('suc.csv')
un = pd.read_csv('un.csv')

3. Define a function for the preprocessing.

def data_text_cleaning(data):
 
    # Replace every character that is not an English letter with a space
    only_english = re.sub('[^a-zA-Z]', ' ', data)
 
    # Convert to lowercase and split into tokens
    no_capitals = only_english.lower().split()
 
    # Remove stop words (requires the NLTK stop word corpus: nltk.download('stopwords'))
    stops = set(stopwords.words('english'))
    # Domain-specific stop words: the product type and brand names
    added_stops = {'mascara', 'mascaras', 'tarte', 'ilia', 'benefit', 'cosmetics', 'sephora', 'collection', 'lara', 'devgan', 'scientific', 'beauty', 'guerlain', 'clinique', 'lashes'}
    no_stops = [word for word in no_capitals if word not in stops]
    no_stops = [word for word in no_stops if word not in added_stops]
 
    return no_stops

With the variable added_stops, I excluded words that are bound to appear often, such as 'mascara' and the names of the brands whose products are in the dataset, in order to get more meaningful results.
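
To see what the cleaning step produces, here is a minimal sketch that mirrors data_text_cleaning on a single made-up review. It assumes a tiny hard-coded stop word set in place of the NLTK list so that it runs without downloading the corpus; the sample sentence and the names clean_demo, demo_stops and demo_added_stops are hypothetical and not part of the original analysis.

```python
import re

# Hypothetical, tiny stand-in for stopwords.words('english')
demo_stops = {'this', 'is', 'the', 'and', 'my', 'i', 'it', 'a'}
# Hypothetical domain stop words, in the spirit of added_stops
demo_added_stops = {'mascara', 'lashes', 'tarte'}

def clean_demo(text):
    # Replace non-letters with spaces, lowercase, and split into tokens
    tokens = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    # Drop both general and domain-specific stop words
    return [t for t in tokens if t not in demo_stops and t not in demo_added_stops]

# Made-up review text, for illustration only
sample = "I LOVE this Tarte mascara!! It lengthens my lashes and never clumps."
cleaned = clean_demo(sample)
print(cleaned)
```

Only the descriptive words survive, which is what keeps the later network from being dominated by the product type and brand names.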

4. Preprocess the reviews.

# successful product reviews
# Cast every review to str (so missing values do not break re.sub), then tokenize it
suc['review'] = suc['review'].astype(str).apply(data_text_cleaning)
  
# unsuccessful product reviews
un['review'] = un['review'].astype(str).apply(data_text_cleaning)

5. Create a dataframe of word co-occurrence frequency.

# successful product reviews
suc_rev = suc['review']

count_suc = {} # dict that stores the co-occurrence frequencies

for line in suc_rev:
    
    for i, a in enumerate(line):
        
        for b in line[i+1:]:
            
            if a > b:
                count_suc[b,a] = count_suc.get((b, a), 0) + 1
                
            else:
                count_suc[a,b] = count_suc.get((a, b), 0) + 1
                
# Flatten the {(term1, term2): freq} dict into one row per word pair
list_suc = [[term1, term2, freq] for (term1, term2), freq in count_suc.items()]

df_suc = pd.DataFrame(list_suc, columns=['term1','term2','freq'])
df_suc = df_suc.sort_values(by=['freq'], ascending=False) # sort by freq in descending order
df_suc = df_suc.reset_index(drop=True)
df_suc.head(20)

# unsuccessful product reviews
un_rev = un['review']

count_un = {} # dict that stores the co-occurrence frequencies

for line_u in un_rev:
    
    for i, a in enumerate(line_u):
        
        for b in line_u[i+1:]:
            
            if a>b:
                count_un[b,a] = count_un.get((b, a), 0) + 1
                
            else:
                count_un[a,b] = count_un.get((a, b), 0) + 1
                
# Flatten the {(term1, term2): freq} dict into one row per word pair
list_un = [[term1, term2, freq] for (term1, term2), freq in count_un.items()]

df_un = pd.DataFrame(list_un, columns=['term1','term2','freq'])
df_un = df_un.sort_values(by=['freq'], ascending=False) # sort by freq in descending order
df_un = df_un.reset_index(drop=True)
df_un.head(20)
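
The counting logic above can be checked on a toy input. The following is a minimal sketch, assuming three made-up tokenized reviews (toy_reviews is hypothetical); it uses itertools.combinations and collections.Counter as a compact alternative to the nested loops, and sorts each pair so that (a, b) and (b, a) share one key, just as the a > b check does.

```python
from collections import Counter
from itertools import combinations

# Hypothetical tokenized reviews (already cleaned)
toy_reviews = [
    ['love', 'volume', 'length'],
    ['volume', 'length', 'clumpy'],
    ['love', 'length'],
]

pair_counts = Counter()
for tokens in toy_reviews:
    # Every pair of token positions within one review counts once;
    # sorting the pair makes (a, b) and (b, a) the same key
    for a, b in combinations(tokens, 2):
        pair_counts[tuple(sorted((a, b)))] += 1

# Print the pairs from most to least frequent
for pair, freq in pair_counts.most_common():
    print(pair, freq)
```

Note that a review of n tokens contributes n(n-1)/2 pairs, so long reviews weigh much more heavily in the counts than short ones.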

6. Draw a semantic network.

# successful product reviews
# Build a graph for computing the centrality measures
G_centrality = nx.Graph()

# Only word pairs with a co-occurrence frequency of 700 or more become edges
top_suc = df_suc[df_suc['freq'] >= 700]
for term1, term2, freq in zip(top_suc['term1'], top_suc['term2'], top_suc['freq']):
    G_centrality.add_edge(term1, term2, weight=int(freq))
    
dgr = nx.degree_centrality(G_centrality) # degree centrality
pgr = nx.pagerank(G_centrality) # PageRank

# Sort in descending order of centrality (handy for inspecting the top words)
sorted_dgr = sorted(dgr.items(), key=operator.itemgetter(1), reverse=True)
sorted_pgr = sorted(pgr.items(), key=operator.itemgetter(1), reverse=True)

# Graph used to draw the word network
G = nx.Graph()

# Add the nodes in descending PageRank order
# Each node's size comes from its own degree centrality
for word, _ in sorted_pgr:
    G.add_node(word, nodesize=dgr[word])

# Edge weights are the co-occurrence frequencies (same threshold as above)
for term1, term2, freq in zip(top_suc['term1'], top_suc['term2'], top_suc['freq']):
    G.add_weighted_edges_from([(term1, term2, int(freq))])
    
# Scale the node sizes for plotting
sizes = [G.nodes[node]['nodesize'] * 500 for node in G]

options = {
    'edge_color': '#FFDEA2',
    'width': 1,
    'with_labels': True,
    'font_weight': 'regular'
}

nx.draw(G, node_color='#FFA07A', node_size=sizes, pos=nx.spring_layout(G, k=10, iterations=100), **options)
ax = plt.gca()
plt.title('Semantic Network for Successful Product Reviews')
plt.show()

# unsuccessful product reviews
# Build a graph for computing the centrality measures
G_centrality = nx.Graph()

# Only word pairs with a co-occurrence frequency of 15 or more become edges
top_un = df_un[df_un['freq'] >= 15]
for term1, term2, freq in zip(top_un['term1'], top_un['term2'], top_un['freq']):
    G_centrality.add_edge(term1, term2, weight=int(freq))
    
dgr = nx.degree_centrality(G_centrality) # degree centrality
pgr = nx.pagerank(G_centrality) # PageRank

# Sort in descending order of centrality (handy for inspecting the top words)
sorted_dgr = sorted(dgr.items(), key=operator.itemgetter(1), reverse=True)
sorted_pgr = sorted(pgr.items(), key=operator.itemgetter(1), reverse=True)

# Graph used to draw the word network
G = nx.Graph()

# Add the nodes in descending PageRank order
# Each node's size comes from its own degree centrality
for word, _ in sorted_pgr:
    G.add_node(word, nodesize=dgr[word])

# Edge weights are the co-occurrence frequencies (same threshold as above)
for term1, term2, freq in zip(top_un['term1'], top_un['term2'], top_un['freq']):
    G.add_weighted_edges_from([(term1, term2, int(freq))])

# Scale the node sizes for plotting
sizes = [G.nodes[node]['nodesize'] * 500 for node in G]

options = {
    'edge_color': '#FFDEA2',
    'width': 1,
    'with_labels': True,
    'font_weight': 'regular'
}

nx.draw(G, node_color='#FFA07A', node_size=sizes, pos=nx.spring_layout(G, k=10, iterations=100), **options)
ax = plt.gca()
plt.title('Semantic Network Analysis for Unsuccessful Product Reviews')
plt.show()

Degree Centrality: The more nodes a node is directly connected to, the higher its degree centrality. (Here, a high value marks a word that co-occurs with many other words, which is read as a sign of its importance.)
PageRank: It determines the relative importance of a node from the number of nodes connected to it and from how central those nodes are themselves, so a connection to an important node counts for more.
- e.g. hyperlink structure (World Wide Web)
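
The two measures can be worked through by hand on a small graph. The following is a minimal sketch in plain Python, assuming a made-up four-word network (the edges list is hypothetical). It is a simplified, unweighted illustration of the definitions above and not the networkx implementation, which also takes edge weights into account by default.

```python
# Hypothetical toy network: 'volume' is a hub connected to three other words
edges = [('volume', 'length'), ('volume', 'love'), ('volume', 'clumpy'), ('length', 'love')]

# Build an undirected adjacency map
neighbors = {}
for a, b in edges:
    neighbors.setdefault(a, set()).add(b)
    neighbors.setdefault(b, set()).add(a)

n = len(neighbors)

# Degree centrality: share of the other n - 1 nodes a node is linked to
degree_centrality = {node: len(nbrs) / (n - 1) for node, nbrs in neighbors.items()}

# PageRank by power iteration (unweighted, damping factor 0.85):
# each node passes its current rank on to its neighbors in equal shares
damping = 0.85
rank = {node: 1 / n for node in neighbors}
for _ in range(100):
    rank = {
        node: (1 - damping) / n
        + damping * sum(rank[nbr] / len(neighbors[nbr]) for nbr in neighbors[node])
        for node in neighbors
    }

# Print the words from highest to lowest PageRank
for node in sorted(rank, key=rank.get, reverse=True):
    print(node, round(degree_centrality[node], 2), round(rank[node], 3))
```

In this toy graph the hub word ranks first on both measures, while 'clumpy', which is linked only to the hub, ranks last.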

* Unauthorized copying and distribution of this post are not allowed.
