Bachelor of Business Administration @PNU/Marketing Analytics

Word Frequency Analysis (24.02.2022.)

Hazel Y. 2022. 4. 14. 16:51


The following are the steps and the code I used for the word frequency analysis of the successful and the unsuccessful product reviews.


1. Import necessary packages.

 

import pandas as pd
import matplotlib.pyplot as plt
import re
import nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud

 

2. Download stopwords package.

 

nltk.download('stopwords')

 

3. Import datasets.

 

suc = pd.read_csv('suc.csv')
un = pd.read_csv('un.csv')
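
Both CSV files are assumed to contain one review per row in a `review` column (the column name the loops below rely on). A minimal in-memory stand-in, with made-up reviews, shows the expected shape:

```python
import pandas as pd

# Stand-in for suc.csv: a single 'review' column, one review per row (assumed shape)
suc = pd.DataFrame({'review': [
    "This mascara is AMAZING!!",
    "My lashes love it :)",
]})

print(len(suc), list(suc.columns))  # → 2 ['review']
```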

 

4. Define a preprocessing function.

 

def data_text_cleaning(data):

    # Replace every non-alphabetic character with a space
    only_english = re.sub('[^a-zA-Z]', ' ', data)

    # Lowercase and split into words
    no_capitals = only_english.lower().split()

    # Remove English stopwords plus domain-specific words that are obviously frequent
    stops = set(stopwords.words('english'))
    added_stops = ['mascara', 'mascaras', 'tarte', 'ilia', 'benefit', 'cosmetics', 'sephora', 'collection', 'lara', 'devgan', 'scientific', 'beauty', 'guerlain', 'clinique']
    no_stops = [word for word in no_capitals if word not in stops]
    no_stops = [word for word in no_stops if word not in added_stops]

    return no_stops


 

With the variable added_stops, I excluded words that are bound to be mentioned frequently, such as 'mascara', as well as the names of the brands whose products make up the dataset. By doing so, I tried to get more meaningful results.
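
As a quick sanity check, the cleaning function can be tried on a made-up review. This sketch inlines a tiny stopword set so it runs without the NLTK download; the real function uses `stopwords.words('english')` and the full `added_stops` list:

```python
import re

# Tiny inline stopword set standing in for stopwords.words('english')
stops = {'this', 'is', 'the', 'and', 'it', 'my'}
added_stops = ['mascara', 'mascaras']

def data_text_cleaning(data):
    only_english = re.sub('[^a-zA-Z]', ' ', data)          # keep letters only
    no_capitals = only_english.lower().split()             # lowercase + tokenize
    no_stops = [w for w in no_capitals if w not in stops]
    no_stops = [w for w in no_stops if w not in added_stops]
    return no_stops

print(data_text_cleaning("This mascara is AMAZING!! My lashes love it :)"))
# → ['amazing', 'lashes', 'love']
```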


5. Successful product reviews

 

  • Preprocessing
review_list = []

for i in range(len(suc)):
    review = str(suc.loc[i, 'review'])
    cleaned = data_text_cleaning(review)
    suc.at[i, 'review'] = cleaned   # store the cleaned word list back
    review_list += cleaned


 

Now, review_list contains every word from the successful product reviews except those removed via stops and added_stops. Duplicate words are intentionally kept so that their frequencies can be counted.


 

  • Extract the 50 most frequently mentioned words.
top_words = pd.Series(review_list).value_counts().head(50)
print("Top 50 words from successful product reviews")
print(top_words)
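
On a small made-up word list, the `value_counts` step works like this (the Series index holds the words, the values hold their counts, sorted most frequent first):

```python
import pandas as pd

# Made-up cleaned words, duplicates included
words = ['love', 'lashes', 'love', 'clump', 'love', 'lashes']

counts = pd.Series(words).value_counts()
print(counts.head(2))  # 'love' (3) and 'lashes' (2), the two most frequent
```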

 

 

  •  Visualization

 

  • Bar chart (in descending order, only with top 20 words)
# Visualize the frequencies
# Ascending sort so the bars run in descending order from the top
top_words.head(20).sort_values().plot(kind='barh', title='successful product reviews word counter')
plt.show()
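
The ascending `sort_values()` before `barh` matters because matplotlib draws the last row of the Series at the top of a horizontal bar chart; a toy example with made-up counts:

```python
import pandas as pd

counts = pd.Series({'love': 30, 'lashes': 22, 'clump': 9})

# Ascending order: the most frequent word ends up last,
# i.e. at the top of the barh chart
ordered = counts.sort_values()
print(list(ordered.index))  # → ['clump', 'lashes', 'love']
```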

 

 

  • Wordcloud
# Generate a word cloud

def make_wordcloud(text):
    word_max = 100
    wordcloud = WordCloud(background_color='white', max_words=word_max, max_font_size=200, height=700, width=900).generate(text)
    plt.imshow(wordcloud, interpolation='lanczos')  # smoothing of the rendered image
    plt.axis('off')                                 # hide the axis ticks and labels
    plt.show()

# Pass the cleaned words as one string; WordCloud counts the frequencies itself
make_wordcloud(' '.join(review_list))

 

 

6. Unsuccessful product reviews

 

  • Preprocessing
u_review_list = []

for i in range(len(un)):
    u_review = str(un.loc[i, 'review'])
    cleaned = data_text_cleaning(u_review)
    un.at[i, 'review'] = cleaned   # store the cleaned word list back
    u_review_list += cleaned

 

 

  • Extract the 50 most frequently mentioned words.
u_top_words = pd.Series(u_review_list).value_counts().head(50)
print("Top 50 words from unsuccessful product reviews")
print(u_top_words)

 

 

  •  Visualization

 

  • Bar chart (in descending order, only with top 20 words)
# Visualize the frequencies
# Ascending sort so the bars run in descending order from the top
u_top_words.head(20).sort_values().plot(kind='barh', title='unsuccessful product reviews word counter')
plt.show()

 

 

  • Wordcloud
# Pass the cleaned words as one string; WordCloud counts the frequencies itself
make_wordcloud(' '.join(u_review_list))

 

 

 

* Unauthorized copying and distribution of this post are not allowed.
