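Fine-Tuning and Deploying an NLP-Based Movie Review Sentiment Analysis Model

This post shows how to fine-tune a natural language processing model that classifies movie reviews as positive or negative.

In this project, we load a model with Hugging Face's AutoModelForSequenceClassification, fine-tune it on the IMDb dataset to build a sentiment analysis model, then upload it to the Hugging Face Hub and serve it from there.

1. Preparing the Model

First, load the distilbert-base-uncased-finetuned-sst-2-english model from Hugging Face.

This is a DistilBERT checkpoint that has already been fine-tuned for sentiment classification on SST-2. Fine-tuning it further on the IMDb dataset adapts it to movie review sentiment analysis.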

from transformers import AutoModelForSequenceClassification, AutoTokenizer, AutoConfig

model_name = 'distilbert-base-uncased-finetuned-sst-2-english'
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name)

2. Loading and Preprocessing the Dataset

Load the IMDb dataset with Hugging Face's datasets library.

!pip install datasets
from datasets import load_dataset

dataset = load_dataset('imdb')

The IMDb dataset has the following splits:

train: 25,000 reviews
test: 25,000 reviews
unsupervised: 50,000 reviews (unlabeled)

To tokenize the dataset, define a tokenize_function and apply it to every split with map.

def tokenize_function(examples):
    return tokenizer(examples['text'], padding=True, truncation=True, max_length=128)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
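
As a rough illustration of what truncation=True with max_length does to each review, and how padding=True pads a batch to its longest sequence rather than to max_length, here is a toy sketch that works on plain lists of token IDs. It does not call the real tokenizer: pad_and_truncate and the sample IDs are hypothetical, the padding ID of 0 is an assumption for illustration, and the real tokenizer also preserves special tokens when truncating, which this sketch ignores.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Toy version of truncation plus longest-in-batch padding (not the real tokenizer)."""
    # Truncation: cut every sequence down to at most max_length tokens.
    truncated = [ids[:max_length] for ids in batch]
    # padding=True pads to the longest sequence in the batch, not to max_length.
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = [[101, 7, 8, 9, 10, 11, 102], [101, 7, 102]]  # made-up token IDs
encoded = pad_and_truncate(toy_batch, max_length=5)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

The attention mask is what lets the model ignore the padded positions, which is why the tokenizer output is passed to the model as a whole.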

3. Configuring Fine-Tuning

3.1 Setting TrainingArguments

Configure how the pre-trained model will be fine-tuned on the IMDb dataset.

from transformers import Trainer, TrainingArguments

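training_args = TrainingArguments(
    output_dir='./results',          # directory where checkpoints are saved
    evaluation_strategy='epoch',     # evaluate at the end of every epoch (renamed eval_strategy in newer transformers releases)
    per_device_train_batch_size=8,   # 8 examples per training batch on each device
    per_device_eval_batch_size=8,    # 8 examples per evaluation batch on each device
    num_train_epochs=1,              # number of training epochs
    weight_decay=0.01,               # weight decay (helps limit overfitting)
)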

3.2 Setting Up the Trainer

Use Trainer to fine-tune the model.

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    tokenizer=tokenizer,
)

3.3 Training and Evaluating the Model

Run training, then evaluate on the test split.

trainer.train()
trainer.evaluate()
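
Because the Trainer above is created without a compute_metrics function, trainer.evaluate() reports the evaluation loss and runtime statistics but no accuracy. A minimal sketch of an accuracy metric follows; it could be passed as Trainer(..., compute_metrics=compute_metrics). The plain-Python implementation is an illustrative choice, not part of the original setup, and it assumes the argument unpacks into (logits, labels).

```python
def compute_metrics(eval_pred):
    """Return accuracy from (logits, labels); works with NumPy arrays or nested lists."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}


# Quick check with made-up logits and labels (not real model output).
toy_metrics = compute_metrics(([[0.1, 2.0], [1.5, -0.3], [0.2, 0.9]], [1, 0, 0]))
print(toy_metrics)
```

With a metric like this in place, the dictionary returned by trainer.evaluate() should also contain an accuracy entry alongside the loss.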

4. Running Sentiment Predictions

The fine-tuned model can now predict the sentiment of new reviews.

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)  # keep the model on the same device as the inputs
model.eval()      # disable dropout for inference

new_text = [
    "I love this movie! It was fantastic!",
    "This was the worst experience ever. I will never come back."
]

# Tokenize new_text
inputs = tokenizer(new_text, padding=True, truncation=True, max_length=128, return_tensors='pt').to(device)

# Run the prediction
with torch.no_grad():
    logits = model(**inputs).logits

# Apply softmax to turn logits into probabilities
probs = torch.nn.functional.softmax(logits, dim=-1)
predicted_classes = torch.argmax(probs, dim=-1)

for class_id in predicted_classes.tolist():
    print(config.id2label[class_id])

# Output
POSITIVE
NEGATIVE
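
To show what the softmax and argmax steps above compute, here is a plain-Python sketch for a single review. The logits are made up for illustration, and id2label is written out by hand on the assumption that it mirrors config.id2label.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # assumed to mirror config.id2label
example_logits = [-2.0, 3.0]               # made-up logits for one review
probs = softmax(example_logits)
# argmax: index of the highest probability
predicted_id = max(range(len(probs)), key=lambda i: probs[i])
print(id2label[predicted_id], round(probs[predicted_id], 4))
```

Softmax does not change which class has the largest value, so taking the argmax of the raw logits would select the same label; the probabilities are useful when a confidence score is needed.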

5. Saving the Model and Uploading to the Hugging Face Hub

5.1 Saving the Model

trainer.save_model(training_args.output_dir)

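During training, the model is saved to the results folder.

Intermediate checkpoints are stored in checkpoint folders inside it, and the final fine-tuned model is in the checkpoint-3125 folder.

(The 25,000 training reviews are processed in batches of 8, so one epoch takes 25,000 / 8 = 3,125 steps, which makes 3125 the final checkpoint.)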
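
The checkpoint number can be reproduced with a little arithmetic. The helper below is a hypothetical illustration, not a Trainer API, and it is only an approximation: it assumes no gradient accumulation and treats the device count as a simple multiplier on the batch size.

```python
import math


def total_training_steps(num_examples, per_device_batch_size, num_epochs, num_devices=1):
    """Approximate number of optimizer steps (hypothetical helper, not a Trainer API)."""
    # Each step consumes one batch per device.
    steps_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    return steps_per_epoch * num_epochs


steps = total_training_steps(num_examples=25_000, per_device_batch_size=8, num_epochs=1)
print(steps)  # 3125
```

Changing the batch size, the number of epochs, or the number of devices changes the final step count, so the checkpoint folder name will differ from checkpoint-3125 in those cases.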

5.2 Uploading to the Hugging Face Hub

Log in to Hugging Face, then upload the fine-tuned model.

Logging in requires an access token.

from huggingface_hub import notebook_login
notebook_login()

Uploading the model currently in memory

model.push_to_hub("marurun66/distilbert-base-uncased-imdb")
tokenizer.push_to_hub("marurun66/distilbert-base-uncased-imdb")
config.push_to_hub("marurun66/distilbert-base-uncased-imdb")

# "marurun66/distilbert-base-uncased-imdb" is the repository ID, in the form user-name/repository-name

These three calls upload the model, tokenizer, and config.

Uploading saved model files

from huggingface_hub import HfApi

my_model_name = 'marurun66/distilbert-base-uncased-imdb2'
api = HfApi()
repo_id = api.create_repo(repo_id=my_model_name, exist_ok=True, private=True)

api.upload_folder(
    folder_path='./results/checkpoint-3125',
    repo_id=my_model_name,
    repo_type='model',
    ignore_patterns=['.gitattributes', '.git', '.gitignore', 'README.md']
)

Once uploaded, the model can be loaded and used conveniently through a pipeline.

from transformers import pipeline

model_name = 'marurun66/distilbert-base-uncased-imdb'
nlp_pipe = pipeline('text-classification', model=model_name, tokenizer=model_name)

text = [
    "I love this movie! It was fantastic!",
    "This was the worst experience ever. I will never come back."
]

nlp_pipe(text)
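
The text-classification pipeline returns one dictionary per input, with label and score keys. As a sketch of how that output could be used to separate positive and negative reviews, the example below groups reviews by label. split_by_label and the min_score threshold are hypothetical, and sample_results is a hand-written placeholder in the same shape as pipeline output, not real model output.

```python
def split_by_label(reviews, results, min_score=0.5):
    """Group reviews by predicted label, skipping low-confidence predictions."""
    grouped = {}
    for review, result in zip(reviews, results):
        if result["score"] >= min_score:
            grouped.setdefault(result["label"], []).append(review)
    return grouped


reviews = [
    "I love this movie! It was fantastic!",
    "This was the worst experience ever. I will never come back.",
]
# Placeholder values shaped like pipeline output (not real predictions).
sample_results = [
    {"label": "POSITIVE", "score": 0.99},
    {"label": "NEGATIVE", "score": 0.99},
]
grouped = split_by_label(reviews, sample_results)
print(grouped["POSITIVE"])
print(grouped["NEGATIVE"])
```

In practice, sample_results would be replaced by the return value of nlp_pipe(text).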

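Conclusion

In this project, we used Hugging Face's AutoModelForSequenceClassification to fine-tune a sentiment analysis model on IMDb movie reviews. After training and evaluating the model, we uploaded it to the Hugging Face Hub so that it can be reused.

The model can now serve as the basis for a range of applications. For example, you could build a system that automatically analyzes movie reviews and filters them into positive and negative ones. Extending the model for use in other NLP projects is also worth exploring.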
