Bachelor of Business Administration @PNU/Marketing Analytics

Word Frequency Analysis (24.02.2022.)

Hazel Y. 2022. 4. 14. 16:51


The following are the steps and the code I used for the word frequency analysis of the successful and the unsuccessful product reviews.


1. Import necessary packages.

 

import pandas as pd
import matplotlib.pyplot as plt
import re
import nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud

 

2. Download stopwords package.

 

nltk.download('stopwords')

 

3. Import datasets.

 

suc = pd.read_csv('suc.csv')
un = pd.read_csv('un.csv')
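
Both CSV files are assumed to contain one review per row in a `review` column (the column name the loops below rely on). A minimal in-memory stand-in, with made-up reviews, shows the expected shape:

```python
import pandas as pd

# Stand-in for suc.csv: a single 'review' column, one review per row (assumed shape)
suc = pd.DataFrame({'review': [
    "This mascara is AMAZING!!",
    "My lashes love it :)",
]})

print(len(suc), list(suc.columns))  # → 2 ['review']
```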

 

4. Define a preprocessing function.

 

def data_text_cleaning(data):

    # Replace every non-alphabetic character with a space
    only_english = re.sub('[^a-zA-Z]', ' ', data)

    # Lowercase and split into words
    no_capitals = only_english.lower().split()

    # Remove English stopwords plus domain-specific words that are obviously frequent
    stops = set(stopwords.words('english'))
    added_stops = ['mascara', 'mascaras', 'tarte', 'ilia', 'benefit', 'cosmetics', 'sephora', 'collection', 'lara', 'devgan', 'scientific', 'beauty', 'guerlain', 'clinique']
    no_stops = [word for word in no_capitals if word not in stops]
    no_stops = [word for word in no_stops if word not in added_stops]

    return no_stops


 

With the variable added_stops, I excluded words that are bound to be mentioned frequently, such as 'mascara', as well as the names of the brands whose products make up the dataset. By doing so, I tried to get more meaningful results.
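
As a quick sanity check, the cleaning function can be tried on a made-up review. This sketch inlines a tiny stopword set so it runs without the NLTK download; the real function uses `stopwords.words('english')` and the full `added_stops` list:

```python
import re

# Tiny inline stopword set standing in for stopwords.words('english')
stops = {'this', 'is', 'the', 'and', 'it', 'my'}
added_stops = ['mascara', 'mascaras']

def data_text_cleaning(data):
    only_english = re.sub('[^a-zA-Z]', ' ', data)          # keep letters only
    no_capitals = only_english.lower().split()             # lowercase + tokenize
    no_stops = [w for w in no_capitals if w not in stops]
    no_stops = [w for w in no_stops if w not in added_stops]
    return no_stops

print(data_text_cleaning("This mascara is AMAZING!! My lashes love it :)"))
# → ['amazing', 'lashes', 'love']
```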


5. Successful product reviews

 

  • Preprocessing
review_list = []

for i in range(len(suc)):
    review = str(suc.loc[i, 'review'])
    cleaned = data_text_cleaning(review)
    suc.at[i, 'review'] = cleaned   # store the cleaned word list back
    review_list += cleaned


 

Now, review_list contains every word from the successful product reviews except those removed via stops and added_stops. Duplicate words are intentionally kept so that their frequencies can be counted.


 

  • Extract the 50 most frequently mentioned words.
top_words = pd.Series(review_list).value_counts().head(50)
print("Top 50 words from successful product reviews")
print(top_words)
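
On a small made-up word list, the `value_counts` step works like this (the Series index holds the words, the values hold their counts, sorted most frequent first):

```python
import pandas as pd

# Made-up cleaned words, duplicates included
words = ['love', 'lashes', 'love', 'clump', 'love', 'lashes']

counts = pd.Series(words).value_counts()
print(counts.head(2))  # 'love' (3) and 'lashes' (2), the two most frequent
```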

 

 

  •  Visualization

 

  • Bar chart (in descending order, only with top 20 words)
# Visualize the frequencies
# Ascending sort so the bars run in descending order from the top
top_words.head(20).sort_values().plot(kind='barh', title='successful product reviews word counter')
plt.show()
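
The ascending `sort_values()` before `barh` matters because matplotlib draws the last row of the Series at the top of a horizontal bar chart; a toy example with made-up counts:

```python
import pandas as pd

counts = pd.Series({'love': 30, 'lashes': 22, 'clump': 9})

# Ascending order: the most frequent word ends up last,
# i.e. at the top of the barh chart
ordered = counts.sort_values()
print(list(ordered.index))  # → ['clump', 'lashes', 'love']
```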

 

 

  • Wordcloud
# Generate a word cloud

def make_wordcloud(text):
    word_max = 100
    wordcloud = WordCloud(background_color='white', max_words=word_max, max_font_size=200, height=700, width=900).generate(text)
    plt.imshow(wordcloud, interpolation='lanczos')  # smoothing of the rendered image
    plt.axis('off')                                 # hide the axis ticks and labels
    plt.show()

# Pass the cleaned words as one string; WordCloud counts the frequencies itself
make_wordcloud(' '.join(review_list))

 

 

6. Unsuccessful product reviews

 

  • Preprocessing
u_review_list = []

for i in range(len(un)):
    u_review = str(un.loc[i, 'review'])
    cleaned = data_text_cleaning(u_review)
    un.at[i, 'review'] = cleaned   # store the cleaned word list back
    u_review_list += cleaned

 

 

  • Extract the 50 most frequently mentioned words.
u_top_words = pd.Series(u_review_list).value_counts().head(50)
print("Top 50 words from unsuccessful product reviews")
print(u_top_words)

 

 

  •  Visualization

 

  • Bar chart (in descending order, only with top 20 words)
# Visualize the frequencies
# Ascending sort so the bars run in descending order from the top
u_top_words.head(20).sort_values().plot(kind='barh', title='unsuccessful product reviews word counter')
plt.show()

 

 

  • Wordcloud
# Pass the cleaned words as one string; WordCloud counts the frequencies itself
make_wordcloud(' '.join(u_review_list))

 

 

 

* Unauthorized copying and distribution of this post are not allowed.
