Sentiment Analysis - Supervised Learning (16.03.2022.)

Hazel Y. 2022. 4. 17. 21:59

My professor advised me to try sentiment analysis with supervised learning. At the time, however, I didn't know much about supervised learning. I should have found an existing training dataset or collected new data to build one, but I didn't. (I didn't realize that separate training data was needed.) Instead, I reused the dataset I had already collected and had been using throughout the analysis, made up of reviews of successful and unsuccessful products, as the training data too. To start with the conclusion: this analysis failed completely and was meaningless. So please don't follow the steps below exactly as written.

Anyway, the following are the steps and code I used for the supervised learning sentiment analysis.

* In this analysis, I changed the criteria for labeling the actual sentiment. In the earlier unsupervised learning analysis, I labeled four- and five-star reviews as 1 (positive), three-star reviews as 0 (neutral), and one- and two-star reviews as -1 (negative). For this supervised learning analysis, I gave 1 only to five-star reviews and -1 only to one-star reviews; all reviews with two to four stars got 0.
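To make the difference between the two labeling criteria concrete, here is a minimal sketch that applies both to every possible star rating. The function names `label_unsupervised` and `label_supervised` are hypothetical and only used for this illustration; they are not part of the analysis code below.

```python
def label_unsupervised(rating):
    """Earlier criterion: 4-5 stars -> 1, 3 stars -> 0, 1-2 stars -> -1."""
    if rating >= 4:
        return 1
    if rating == 3:
        return 0
    return -1


def label_supervised(rating):
    """Criterion used in this post: 5 stars -> 1, 1 star -> -1, 2-4 stars -> 0."""
    if rating == 5:
        return 1
    if rating == 1:
        return -1
    return 0


# Compare the two criteria for each star rating
for stars in range(1, 6):
    print(stars, label_unsupervised(stars), label_supervised(stars))
# 1 -1 -1
# 2 -1 0
# 3 0 0
# 4 1 0
# 5 1 1
```

Only the two- and four-star reviews change label: both move into the neutral class under the new criterion.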

1. Import necessary packages.

import pandas as pd

import re

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score

2. Define a function for preprocessing.

def data_text_cleaning(data):

    # Replace every character that is not an English letter with a space
    only_english = re.sub('[^a-zA-Z]', ' ', data)

    # Convert to lowercase
    no_capitals = only_english.lower()

    return no_capitals
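As a quick check of what this function does to a review, here is a small self-contained sketch. The sample strings are made up for illustration. Note that digits and punctuation each become a space, so contractions are split apart and runs of spaces are left in the text.

```python
import re


def data_text_cleaning(data):
    # Replace every character that is not an English letter with a space
    only_english = re.sub('[^a-zA-Z]', ' ', data)
    # Convert to lowercase
    return only_english.lower()


# Hypothetical sample review, not taken from the real dataset
cleaned = data_text_cleaning("Great product!! 10/10, would buy AGAIN.")

# Splitting on whitespace shows the words that survive the cleaning
print(cleaned.split())
# ['great', 'product', 'would', 'buy', 'again']

# The apostrophe is replaced too, so a contraction is split in two
print(data_text_cleaning("Don't"))
# don t
```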

3. Import datasets.

suc = pd.read_csv('suc.csv')
un = pd.read_csv('un.csv')

4. Preprocessing

# successful product reviews
# astype(str) mirrors the str() call; apply() avoids chained assignment like suc['review'][i] = ...
suc['review'] = suc['review'].astype(str).apply(data_text_cleaning)

# unsuccessful product reviews
un['review'] = un['review'].astype(str).apply(data_text_cleaning)

5. Add a column called 'sent' and label each review.

# Sentiment based on the star rating: 5 stars positive (1), 2-4 stars neutral (0), 1 star negative (-1)
rating_to_sent = {5: 1, 1: -1}

# successful product reviews
# Ratings missing from the mapping become NaN, which fillna(0) turns into the neutral label
suc['sent'] = suc['rating'].map(rating_to_sent).fillna(0).astype(int)

# unsuccessful product reviews
un['sent'] = un['rating'].map(rating_to_sent).fillna(0).astype(int)

6. Separate train and test data.

# Split into training and test data

# successful product reviews
suc_target = suc['sent']
suc_feature = suc['review']

X_strain, X_stest, y_strain, y_stest = train_test_split(suc_feature, suc_target, test_size=0.3, random_state=42)

# unsuccessful product reviews
un_target = un['sent']
un_feature = un['review']

X_utrain, X_utest, y_utrain, y_utest = train_test_split(un_feature, un_target, test_size=0.3, random_state=42)

7. Calculate prediction accuracy - CountVectorizer

# successful product reviews
# Evaluate prediction performance with CountVectorizer
cpipeline = Pipeline([('cnt_vect', CountVectorizer(stop_words='english', ngram_range=(1,2))), ('lt_clf', LogisticRegression(C=10))])

# Use the Pipeline object to train and predict with fit() and predict(); predict_proba() is needed for the ROC AUC score
cpipeline.fit(X_strain, y_strain)
spredc = cpipeline.predict(X_stest)
spredc_proba = cpipeline.predict_proba(X_stest)

acc = accuracy_score(y_stest, spredc)
auc = roc_auc_score(y_stest, spredc_proba, multi_class='ovr')
print(f"Accuracy:{acc : .3f}\nAUC score:{auc : .3f}")

# unsuccessful product reviews
# Refit the same pipeline on the unsuccessful product reviews
cpipeline.fit(X_utrain, y_utrain)
upredc = cpipeline.predict(X_utest)
upredc_proba = cpipeline.predict_proba(X_utest)

acc = accuracy_score(y_utest, upredc)
auc = roc_auc_score(y_utest, upredc_proba, multi_class='ovr')
print(f"Accuracy:{acc : .3f}\nAUC score:{auc : .3f}")

8. Calculate prediction accuracy - TfidfVectorizer

# successful product reviews
# Evaluate prediction performance with TfidfVectorizer
tpipeline = Pipeline([('tfidf_vect', TfidfVectorizer(stop_words='english', ngram_range=(1,2))), ('lt_clf', LogisticRegression(C=10))])

tpipeline.fit(X_strain, y_strain)
spredt = tpipeline.predict(X_stest)
spredt_proba = tpipeline.predict_proba(X_stest)

acc = accuracy_score(y_stest, spredt)
auc = roc_auc_score(y_stest, spredt_proba, multi_class='ovr')
print(f"Accuracy:{acc : .3f}\nAUC score:{auc : .3f}")

# unsuccessful product reviews
# Refit the same pipeline on the unsuccessful product reviews
tpipeline.fit(X_utrain, y_utrain)
upredt = tpipeline.predict(X_utest)
upredt_proba = tpipeline.predict_proba(X_utest)

acc = accuracy_score(y_utest, upredt)
auc = roc_auc_score(y_utest, upredt_proba, multi_class='ovr')
print(f"Accuracy:{acc : .3f}\nAUC score:{auc : .3f}")
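Steps 7 and 8 repeat the same fit, predict, and score sequence four times. One possible way to fold that into a single helper is sketched below. The helper name `evaluate` and the toy reviews are assumptions made up for this illustration, not part of the original analysis, and scores on six toy sentences say nothing about the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.pipeline import Pipeline


def evaluate(vectorizer, X_train, X_test, y_train, y_test):
    """Fit vectorizer + logistic regression; return (accuracy, one-vs-rest AUC)."""
    pipeline = Pipeline([('vect', vectorizer), ('lt_clf', LogisticRegression(C=10))])
    pipeline.fit(X_train, y_train)
    pred = pipeline.predict(X_test)
    proba = pipeline.predict_proba(X_test)  # class probabilities, needed for the AUC
    acc = accuracy_score(y_test, pred)
    auc = roc_auc_score(y_test, proba, multi_class='ovr')
    return acc, auc


# Hypothetical toy data with all three labels (1, 0, -1) in both splits
X_train = ["great product love it", "love this great buy",
           "terrible broke fast", "awful terrible waste",
           "okay average product", "average okay nothing special"]
y_train = [1, 1, -1, -1, 0, 0]
X_test = ["love it great", "terrible waste", "okay average"]
y_test = [1, -1, 0]

results = {}
for vectorizer in (CountVectorizer(stop_words='english', ngram_range=(1, 2)),
                   TfidfVectorizer(stop_words='english', ngram_range=(1, 2))):
    name = type(vectorizer).__name__
    results[name] = evaluate(vectorizer, X_train, X_test, y_train, y_test)
    acc, auc = results[name]
    print(f"{name}\nAccuracy:{acc : .3f}\nAUC score:{auc : .3f}")
```

With the real data, the same helper could be called once per vectorizer and per dataset, for example with `X_strain, X_stest, y_strain, y_stest`.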

Extracting features from natural language
- CountVectorizer: the simplest method (vectorizes text by simply counting how often each word, or unit, occurs)
  - However, function words such as articles, prepositions, and demonstrative pronouns occur very often but carry no special meaning.
- TfidfVectorizer (TF-IDF): a method that makes up for this weakness of CountVectorizer
  - TF-IDF = TF * IDF
    - TF (Term Frequency): how often a given word occurs in one document (here, one review)
    - IDF (Inverse Document Frequency): the inverse of DF (log-scaled in practice)
      - DF (Document Frequency): the number of documents in the dataset that contain the word
    - The higher the TF and the lower the DF, the higher the TF-IDF.
    - Hence, if a word appears often in one document but rarely in the others, it is distinctive of that document.
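
The TF-IDF idea above can be worked through by hand on a tiny example. The sketch below uses three made-up reviews and the smoothed IDF formula ln((1 + n) / (1 + DF)) + 1, which, as far as I understand the scikit-learn documentation, is what TfidfVectorizer uses by default. It leaves out the final normalization step that TfidfVectorizer applies, so the numbers are not identical to the library's output.

```python
import math
from collections import Counter

# Hypothetical mini corpus: "battery" occurs in every review
docs = [
    "battery lasts long",
    "battery died fast",
    "battery is fine",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# DF: number of documents that contain each word (set() counts a word once per document)
df = Counter(word for tokens in tokenized for word in set(tokens))


def idf(word):
    # Smoothed inverse document frequency
    return math.log((1 + n_docs) / (1 + df[word])) + 1


def tfidf(tokens):
    # TF is the raw count of the word in this document
    tf = Counter(tokens)
    return {word: count * idf(word) for word, count in tf.items()}


weights = tfidf(tokenized[0])
for word, weight in weights.items():
    print(f"{word}: {weight:.3f}")
# battery: 1.000
# lasts: 1.693
# long: 1.693
```

"battery" appears in every review, so it gets the lowest weight, while "lasts" and "long" appear in only one review and are weighted higher.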

* Unauthorized copying and distribution of this post are not allowed.
