Natural Language Processing (NLP): AutoModelForSequenceClassification

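When you use a pretrained model to analyze sentences and make predictions in natural language processing (NLP), working with the Auto* model classes directly gives you more flexible control. This post shows how to work with a model through AutoModelForSequenceClassification.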

Classifying Text as Positive or Negative with a Fine-tuned Model

I love this movie! It was fantastic!

This is the worst experience I have ever had.

As the sun slowly began to set over the vast horizon, casting a warm golden glow across the sky, painting it with hues of orange, pink, and purple, a gentle breeze whispered through the towering trees, rustling the vibrant green leaves that danced gracefully in the wind, while birds chirped melodiously, their harmonious songs filling the tranquil evening air, and distant waves crashed rhythmically against the rugged shore, echoing nature’s timeless symphony, as a young traveler, weary from the long journey, wandered along the winding path, marveling at the breathtaking beauty of the untouched landscape, feeling a deep sense of serenity wash over them, knowing that in this fleeting moment, surrounded by the wonders of the natural world, time itself seemed to slow down, allowing them to fully embrace the present, free from the burdens of the past and the uncertainties of the future, simply existing in perfect harmony with the universe, breathing in the crisp, refreshing air, their heart beating in sync with the rhythmic pulse of nature, while the last rays of sunlight kissed the earth goodbye, gradually fading into the calm darkness of the approaching night, welcoming the first twinkling stars that would soon illuminate the endless expanse of the night sky.

Let's classify the three texts above as positive or negative with a pretrained NLP model.

texts = [
    "I love this movie! It was fantastic!",
    "This is the worst experience I have ever had.",
    # Triple-quoted string; the fourth quote mark is a literal " that becomes part of the third text.
    """"As the sun slowly began to set over the vast horizon, casting a warm golden glow across the sky, painting it with hues of orange, pink, and purple, a gentle breeze whispered through the towering trees, rustling the vibrant green leaves that danced gracefully in the wind, while birds chirped melodiously, their harmonious songs filling the tranquil evening air, and distant waves crashed rhythmically against the rugged shore, echoing nature’s timeless symphony, as a young traveler, weary from the long journey, wandered along the winding path, marveling at the breathtaking beauty of the untouched landscape, feeling a deep sense of serenity wash over them, knowing that in this fleeting moment, surrounded by the wonders of the natural world, time itself seemed to slow down, allowing them to fully embrace the present, free from the burdens of the past and the uncertainties of the future, simply existing in perfect harmony with the universe, breathing in the crisp, refreshing air, their heart beating in sync with the rhythmic pulse of nature, while the last rays of sunlight kissed the earth goodbye, gradually fading into the calm darkness of the approaching night, welcoming the first twinkling stars that would soon illuminate the endless expanse of the night sky.""",
]

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
# The Auto* model classes come in task-specific variants (e.g. AutoModelForQuestionAnswering); choose the one that fits the task.

model_name = 'distilbert-base-uncased-finetuned-sst-2-english'

# Load the pretrained sentiment-analysis model and its tokenizer.
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

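1. 🔠 Tokenizing

Tokenizing splits a sentence into tokens (words or subwords) and maps each one to an integer ID so that the model can process it.

The text is thereby converted into token IDs (input_ids) and an attention mask, which together form the model's input.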
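To make the text-to-ID step concrete, here is a minimal sketch with a hand-written toy vocabulary. The IDs are the ones the real tokenizer outputs for the first example sentence, but `TOY_VOCAB` and `toy_encode` are hypothetical names, and the simple regex split only stands in for the WordPiece subword algorithm the real tokenizer uses.

```python
import re

# Toy vocabulary: IDs copied from the tokenizer output for the first example
# sentence. The lookup logic is a simplified assumption, not the real algorithm.
TOY_VOCAB = {
    "[PAD]": 0, "[CLS]": 101, "[SEP]": 102,
    "i": 1045, "love": 2293, "this": 2023, "movie": 3185,
    "!": 999, "it": 2009, "was": 2001, "fantastic": 10392,
}


def toy_encode(text):
    """Lowercase, split punctuation from words, and map each token to its ID."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # The sequence is wrapped in the special [CLS] and [SEP] tokens.
    return [TOY_VOCAB["[CLS]"]] + [TOY_VOCAB[t] for t in tokens] + [TOY_VOCAB["[SEP]"]]


ids = toy_encode("I love this movie! It was fantastic!")
print(ids)
```

A word missing from the toy vocabulary would raise a KeyError here, whereas the real tokenizer would break it into known subword pieces.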

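Difference from the pipeline
With pipeline("text-classification"), tokenizing, the model forward pass, prediction, and post-processing all happen automatically under the hood.

When you use AutoModelForSequenceClassification directly, however, you must convert the input data yourself with AutoTokenizer beforehand.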

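inputs = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors='pt')
# padding: pad shorter sentences to the length of the longest one; truncation: cut off anything beyond max_length

padding=True: pads shorter sentences to the length of the longest sentence in the batch.
truncation=True: cuts off tokens beyond max_length (128 here); if max_length is not given, the model's maximum input length is used.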
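The effect of the two options can be sketched in plain Python. This is only an illustration under stated assumptions: `pad_and_truncate` is a hypothetical helper, the small IDs (7, 8, ...) are made-up placeholders, and keeping the final token on truncation imitates the way the real tokenizer preserves the closing [SEP] (ID 102).

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each ID list to max_length, then pad to the longest one left."""
    truncated = [
        # When cutting, keep the last token so the sequence still ends in [SEP].
        ids if len(ids) <= max_length else ids[:max_length - 1] + ids[-1:]
        for ids in batch
    ]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return input_ids, attention_mask


batch = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 11, 12, 102]]
input_ids, attention_mask = pad_and_truncate(batch, max_length=6)
print(input_ids)       # [[101, 7, 8, 102, 0, 0], [101, 7, 8, 9, 10, 102]]
print(attention_mask)  # [[1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1]]
```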

#inputs
{'input_ids': tensor([[  101,  1045,  2293,  2023,  3185,   999,  2009,  2001, 10392,   999,
           102,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0],
        [  101,  2023,  2003,  1996,  5409,  3325,  1045,  2031,  2412,  2018,
          1012,   102,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0],
        [  101,  1000,  2004,  1996,  3103,  3254,  2211,  2000,  2275,  2058,
          1996,  6565,  9154,  1010,  9179,  1037,  4010,  3585,  8652,  2408,
          1996,  3712,  1010,  4169,  2009,  2007, 20639,  2015,  1997,  4589,
          1010,  5061,  1010,  1998,  6379,  1010,  1037,  7132,  9478,  3990,
          2083,  1996, 20314,  3628,  1010, 29188,  1996, 17026,  2665,  3727,
          2008, 10948, 28266,  1999,  1996,  3612,  1010,  2096,  5055,  9610,
         14536,  2098, 11463,  7716, 19426,  1010,  2037, 25546,  6313,  2774,
          8110,  1996, 25283, 26147,  3944,  2250,  1010,  1998,  6802,  5975,
          8007, 14797,  3973,  2114,  1996, 17638,  5370,  1010, 17142,  3267,
          1521,  1055, 27768,  6189,  1010,  2004,  1037,  2402, 20174,  1010,
         16040,  2013,  1996,  2146,  4990,  1010, 13289,  2247,  1996, 12788,
          4130,  1010,  8348,  2075,  2012,  1996,  3052, 17904,  5053,  1997,
          1996, 22154,  5957,  1010,  3110,  1037,  2784,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1]])}

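The first and second sentences are padded with 0s to match the length of the longest sentence, while the third is truncated to 128 tokens.

The padding 0s carry no meaning, so they are marked with 0 in the attention mask and ignored by the model.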

2. 🛑 Attention Mask

The attention_mask tells the model to ignore the padded positions when it processes a sentence.

1 : a real token (the model should attend to it)
0 : a padding token (to be ignored)

This keeps the padding from influencing the prediction.
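One common way a mask achieves this is by pushing the attention scores of masked positions to negative infinity before the softmax, so that they receive zero weight. The sketch below illustrates that idea only. `masked_softmax` is a hypothetical helper, the scores are made up, and it assumes at least one position is unmasked. It is not the DistilBERT implementation.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over scores where positions with mask == 0 get zero weight."""
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5, 0.5]  # made-up attention scores for four positions
mask = [1, 1, 0, 0]            # the last two positions are padding
weights = masked_softmax(scores, mask)
print(weights)
```

The two padded positions end up with weight 0.0, and the remaining weight is shared between the two real tokens.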

3. 📊 Model Prediction and Logits

Passing the inputs to the model produces prediction values called logits.

Logits are unnormalized scores; to turn them into probabilities, apply softmax.

with torch.no_grad():
  logits = model(**inputs).logits

# Written out without **, the call looks like this:
# with torch.no_grad():
#   logits = model(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask']).logits

#logits
tensor([[-4.3242,  4.6727],
        [ 4.6273, -3.7186],
        [-4.0575,  4.3437]])
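The `**` in the model call is ordinary Python dictionary unpacking: each key becomes a keyword argument. A small sketch with a hypothetical stand-in function (`fake_model` is not part of any library) shows that the two call styles are equivalent.

```python
def fake_model(input_ids, attention_mask):
    """Hypothetical stand-in for a model call; just reports the argument sizes."""
    return {"n_ids": len(input_ids), "n_mask": len(attention_mask)}


inputs = {"input_ids": [101, 1045, 102], "attention_mask": [1, 1, 1]}

unpacked = fake_model(**inputs)
explicit = fake_model(input_ids=inputs["input_ids"],
                      attention_mask=inputs["attention_mask"])
print(unpacked == explicit)  # True
```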

4. 🔢 Softmax

Logits are not probabilities between 0 and 1, so the softmax function is used to convert them.

probs = torch.nn.functional.softmax(logits, dim=-1)

#probs
tensor([[1.2378e-04, 9.9988e-01],
        [9.9976e-01, 2.3730e-04],
        [2.2455e-04, 9.9978e-01]])

Each text is assigned to the class with the larger probability.
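To see what the softmax does, the same conversion can be written by hand in plain Python and applied to the logit values printed earlier. This is a sketch for illustration (`softmax` and `logit_rows` are names chosen here). Because the printed logits are rounded, the results agree with the probabilities above only to about four significant digits.

```python
import math


def softmax(row):
    """Exponentiate each score and normalize so the row sums to 1."""
    peak = max(row)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Logit values copied from the model output shown above.
logit_rows = [[-4.3242, 4.6727], [4.6273, -3.7186], [-4.0575, 4.3437]]
probs = [softmax(row) for row in logit_rows]
for row in probs:
    print([f"{p:.4e}" for p in row])
```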

5. 🎯 Class Mapping

Finally, take the index of the class with the highest probability and convert it to a label.

predicted_classes = torch.argmax(probs, dim=-1).tolist()
print(predicted_classes)

Each class index can be converted to its label name with config.id2label.

from transformers import AutoConfig

# Load the config of the same checkpoint as the model so that the label names match.
config = AutoConfig.from_pretrained(model_name)
print(config.id2label)

for i in predicted_classes:
    print(config.id2label[i])

# Result
POSITIVE
NEGATIVE
POSITIVE
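The argmax and label lookup can also be traced without torch. In this sketch the `id2label` dictionary is written out by hand, on the assumption that it matches the checkpoint's config (which is consistent with the result above), and the probabilities are copied from the softmax output.

```python
# Assumed to mirror the checkpoint's config.id2label.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Probabilities copied from the softmax output shown earlier.
probs = [
    [1.2378e-04, 9.9988e-01],
    [9.9976e-01, 2.3730e-04],
    [2.2455e-04, 9.9978e-01],
]

# argmax: the index of the largest value in each row.
predicted = [max(range(len(row)), key=row.__getitem__) for row in probs]
labels = [id2label[i] for i in predicted]
print(predicted)  # [1, 0, 1]
print(labels)     # ['POSITIVE', 'NEGATIVE', 'POSITIVE']
```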

In this way, AutoModelForSequenceClassification makes classification tasks straightforward, and its predictions are easy to use in NLP work such as sentiment analysis and sentence classification. 🚀

매일코딩
