
How to Make Datasets with Stop Words Removed (13.04.2022.)

Hazel Y. 2022. 4. 21. 15:35

The following steps and code build and save datasets with the stop words removed.


1. Import necessary packages.

import re

import nltk
import pandas as pd
from nltk.corpus import stopwords

 

2. Download the stopwords corpus.

nltk.download('stopwords')

 

3. Import the datasets.

suc = pd.read_csv('suc_bert.csv')
un = pd.read_csv('un_bert.csv')

 

4. Define the preprocessing function (including stop word removal).

def data_text_cleaning(data):

    # Replace every character that is not an English letter with a space
    only_english = re.sub('[^a-zA-Z]', ' ', data)

    # Convert to lowercase and split into words
    no_capitals = only_english.lower().split()

    # Remove stop words
    stops = set(stopwords.words('english'))
    no_stops = [word for word in no_capitals if word not in stops]

    return no_stops
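
To see what data_text_cleaning does to a single review, here is a minimal sketch of the same three steps on one sentence. So that it runs without downloading anything, it uses a small hand-written stop list (SAMPLE_STOPS, a hypothetical stand-in, not NLTK's actual English list), and the helper name clean_text and the sample sentence are also made up for illustration.

```python
import re

# Hypothetical mini stop list standing in for stopwords.words('english')
SAMPLE_STOPS = {"this", "is", "a", "the", "it", "i", "and", "of", "them", "s"}

def clean_text(text, stops=SAMPLE_STOPS):
    # Step 1: replace everything except English letters with a space
    only_english = re.sub('[^a-zA-Z]', ' ', text)
    # Step 2: lowercase and split on whitespace
    tokens = only_english.lower().split()
    # Step 3: drop the stop words
    return [word for word in tokens if word not in stops]

sample = "This is a GREAT product!!! I bought 2 of them, and it's worth the price."
cleaned = clean_text(sample)
print(cleaned)            # ['great', 'product', 'bought', 'worth', 'price']
print(" ".join(cleaned))  # great product bought worth price
```

Note that digits and punctuation disappear in step 1, so "it's" is split into "it" and "s" before the stop words are filtered out.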

 

5. Preprocess and save CSV files.

# successful product reviews
# Update the whole column at once; chained assignment such as
# suc['review'][i] = ... raises SettingWithCopyWarning and may not modify the DataFrame.
suc['review'] = suc['review'].apply(lambda review: " ".join(data_text_cleaning(str(review))))
suc.to_csv('suc_stopremoved.csv')

# unsuccessful product reviews
un['review'] = un['review'].apply(lambda review: " ".join(data_text_cleaning(str(review))))
un.to_csv('un_stopremoved.csv')
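
Two things are worth checking when the saved files are loaded again. The sketch below uses a tiny made-up DataFrame and an in-memory buffer instead of the real CSV files (both are assumptions for illustration only). It shows that to_csv without index=False writes the row index as an extra column, and that a review left empty after cleaning comes back as NaN.

```python
import io

import pandas as pd

# Hypothetical miniature dataset; "" mimics a review that contained only stop words
df = pd.DataFrame({"review": ["great product", ""]})

buffer = io.StringIO()
df.to_csv(buffer)   # same call shape as suc.to_csv(...): the index is written too
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded.columns.tolist())           # ['Unnamed: 0', 'review']
print(reloaded["review"].isna().tolist())  # [False, True]

# One possible way to avoid both effects when writing and reading
buffer2 = io.StringIO()
df.to_csv(buffer2, index=False)
buffer2.seek(0)
reloaded2 = pd.read_csv(buffer2, keep_default_na=False)
print(reloaded2.columns.tolist())          # ['review']
```

Whether to drop the index or keep empty reviews as empty strings depends on how the files are used in the next step, so treat the second half as an option rather than a required change.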

 

 

 

* Unauthorized copying and distribution of this post are not allowed.
