KT AIVLE (KT AIVLE School) 5th Cohort, DX Track: Advanced Deep Learning, Understanding Language Models

gonii00 2024. 4. 20. 12:16

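Understanding Language Models

NLP (Natural Language Processing)

- A field of linguistics and machine learning focused on understanding everything related to human language
- The goal is not only to understand individual words in isolation but also to understand the context in which they appear

- Common NLP tasks: sentence classification / named entity recognition / text generation / question answering / translation and summarization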

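[ Transformer ]

Earlier NLP: RNN-based

- For a long time, the dominant approach to language modeling

- Drawbacks: hard to parallelize, long-term dependency problem, limited scalability

The arrival of the Transformer

- Paper: Google, 2017, Attention Is All You Need, https://arxiv.org/abs/1706.03762

- Overcomes the drawbacks of RNN models

- A game changer for language models

- LLMs developed thanks to the Transformer.

- Key feature: Attention, which (1) retains earlier parts of the text well and (2) picks out the words that deserve focus in context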
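
To make the idea of "focusing on the right words" concrete, here is a minimal sketch of scaled dot-product attention for a single query, written in plain Python. The token list and the query, key, and value vectors are made-up toy numbers, not values from a trained model, and real Transformers run this over many heads and positions at once.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    # Score each key against the query, scaled by sqrt(d_k)
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    # Attention weights: how much focus each token receives (they sum to 1)
    weights = softmax(scores)
    # Output: weighted average of the value vectors
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Toy example (hypothetical vectors, one per token)
tokens = ["the", "cat", "sat"]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [0.0, 2.0]

weights, output = attention(query, keys, values)
for tok, w in zip(tokens, weights):
    print(f"{tok}: {w:.3f}")
```

With these numbers the query lines up with the keys for "cat" and "sat", so those two tokens receive more weight than "the".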

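Using Transformers
< The pipeline function >
- A function that makes Transformer-based models easy to use. The name follows the same idea as an "AI pipeline" or "data pipeline": a chain of steps run in order
- It hides the individual NLP steps (preprocessing, model inference, postprocessing) so that each stage flows into the next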

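Applications

(1) Sentiment analysis

from transformers import pipeline

# Create a sentiment-analysis pipeline with an explicitly chosen model.
# Note: 'bert-base-multilingual-cased' is a base checkpoint that has not been fine-tuned
# for sentiment, so its classification head is newly initialized and its labels are not meaningful.
classifier = pipeline(task = "sentiment-analysis", model = 'bert-base-multilingual-cased')

# Create a sentiment-analysis pipeline with the default model
# Default: distilbert-base-uncased-finetuned-sst-2-english
classifier = pipeline("sentiment-analysis")

# Run the model
text = ["I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
        "I have a dream.",
        "She was so happy."]

classifier(text)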

Other uses

(2) Zero-shot classification

(3) Translation
(4) Summarization (English and Korean text)
(5) Text generation
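
The other tasks follow the same pattern as sentiment analysis: pick a task name and call the resulting pipeline. The sketch below is one possible way to organize them. The task identifiers are standard Hugging Face pipeline task names, but the example inputs, candidate labels, generation settings, and the helper name run_example are made up for illustration. Each call downloads a default English model, so Korean text would need an explicitly chosen model.

```python
# Example inputs per task. All texts, labels, and settings here are illustrative assumptions.
EXAMPLES = {
    "zero-shot-classification": {
        "inputs": "This is a course about the Transformers library.",
        "kwargs": {"candidate_labels": ["education", "politics", "business"]},
    },
    "translation_en_to_fr": {
        "inputs": "This course is produced by Hugging Face.",
        "kwargs": {},
    },
    "summarization": {
        "inputs": (
            "Transformers replaced recurrent networks in most language tasks. "
            "They process all tokens in parallel and use attention to relate "
            "distant words, which made it practical to train much larger models."
        ),
        "kwargs": {"max_length": 30, "min_length": 5},
    },
    "text-generation": {
        "inputs": "In this course, we will teach you how to",
        "kwargs": {"max_new_tokens": 20},
    },
}

def run_example(task):
    """Build the pipeline for `task` and run it on the matching example input."""
    # Imported lazily: constructing a pipeline downloads a default model
    from transformers import pipeline

    spec = EXAMPLES[task]
    pipe = pipeline(task)
    return pipe(spec["inputs"], **spec["kwargs"])

# Usage (requires transformers and network access):
# print(run_example("zero-shot-classification"))
```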

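Language Modeling Procedure

Data preprocessing: Tokenization
Tokenization
- Splits a sentence into tokens, the smallest units of data used to analyze it
- How to split (by word, subword, or character) is a design choice that a person has to make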
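
As a rough illustration of that design choice, the sketch below tokenizes the same text by word and by character, then maps word tokens to integer IDs. The function names, the tiny vocabulary, and the [PAD]/[UNK] conventions are assumptions for this toy example. Real tokenizers such as the one used later in this post rely on learned subword vocabularies.

```python
def word_tokenize(text):
    # Word-level: split on whitespace after lowercasing
    return text.lower().split()

def char_tokenize(text):
    # Character-level: every non-space character is a token
    return list(text.lower().replace(" ", ""))

def build_vocab(token_lists):
    # Reserve IDs for padding and unknown tokens, then number tokens in order of appearance
    vocab = {"[PAD]": 0, "[UNK]": 1}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab

def encode(tokens, vocab):
    # Tokens missing from the vocabulary map to [UNK]
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

sentences = ["I have a dream", "She was so happy"]
word_tokens = [word_tokenize(s) for s in sentences]
vocab = build_vocab(word_tokens)

ids = encode(word_tokenize("I was so sad"), vocab)
print(word_tokens[0])          # word-level tokens
print(char_tokenize("I am"))   # character-level tokens
print(ids)                     # "sad" is not in the vocabulary, so it becomes [UNK]
```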

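Embedding
- Converts the natural language people use into a sequence of numbers (a vector) that a machine can work with
- For a machine to process natural language, the text must first be turned into numbers before it is fed in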
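
A minimal sketch of what "turning text into vectors" looks like: an embedding is a lookup table with one row per token ID. The vocabulary, the dimension of 4, and the random values are all assumptions for illustration. In a real model the table is learned during training and is far larger.

```python
import random

random.seed(0)  # fixed seed so the toy table is reproducible

# Hypothetical vocabulary mapping tokens to integer IDs
vocab = {"[PAD]": 0, "[UNK]": 1, "i": 2, "have": 3, "a": 4, "dream": 5}
embedding_dim = 4

# One vector per vocabulary entry (random here, learned in practice)
embedding_table = [
    [random.uniform(-1, 1) for _ in range(embedding_dim)]
    for _ in range(len(vocab))
]

def embed(token_ids):
    # Look up the row of the table for each token ID
    return [embedding_table[i] for i in token_ids]

token_ids = [vocab.get(t, vocab["[UNK]"]) for t in "i have a dream".split()]
vectors = embed(token_ids)

for tok_id, vec in zip(token_ids, vectors):
    print(tok_id, [round(x, 2) for x in vec])
```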

[ Fine-tuning ]

: The process of adapting a pre-trained model to a specific task or dataset

1. Environment setup

# Install libraries
!pip install transformers==4.31.0
!pip install datasets

# Load libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from datasets import load_dataset  # for downloading datasets

# Download the emotion dataset
emotions = load_dataset("emotion")

# Dataset structure
emotions

# Label names
classes = emotions['train'].features['label'].names
classes

2. Exploring the data

# 1. Convert to a DataFrame
emotions.set_format(type="pandas")

# Extract only the train split
df = emotions["train"][:]

# Add the original string label next to the integer-encoded label
def label_int2str(row):
    return emotions["train"].features["label"].int2str(row)

# apply calls the function on each value of the column in turn
df["label_name"] = df["label"].apply(label_int2str)
df.head()

# 2. Class distribution
df['label_name'].value_counts()

sns.countplot(x = 'label_name', data = df)
plt.grid()
plt.show()

# 3. Distribution of tweet length (number of words)
df["Words Per Tweet"] = df["text"].str.split().apply(len)
sns.histplot(x = 'Words Per Tweet', data = df, bins = 30)
plt.grid()
plt.show()

sns.kdeplot(x = 'Words Per Tweet', data = df, hue = 'label_name', common_norm = False)
plt.grid()
plt.show()

# Restore the default output format (undo set_format)
emotions.reset_format()

3. Data preparation

from transformers import AutoTokenizer

# 1. Download the tokenizer
model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

tokenizer.model_input_names

# 2. Tokenize the dataset
# Function that tokenizes a batch of sentences at once
def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True)

# batch_size=None passes each split as a single batch,
# so every sentence in a split is padded to the same length
em_encoded = emotions.map(tokenize, batched=True, batch_size=None)

# Look at the contents of a single example
col_names = em_encoded["train"].column_names
sample_data = em_encoded["train"][0]
for i in col_names :
    print(i + ' :', sample_data[i])

# 3. Build datasets for TensorFlow training
# Number of samples per training batch
batch_size = 64

# Required columns: ['input_ids', 'attention_mask']
token_cols = tokenizer.model_input_names

# Build the datasets
train = em_encoded["train"].to_tf_dataset(columns=token_cols, label_cols="label",
                                          shuffle=True, batch_size=batch_size)

val = em_encoded["validation"].to_tf_dataset(columns=token_cols, label_cols="label",
                                             shuffle=False, batch_size=batch_size)

test = em_encoded["test"].to_tf_dataset(columns=token_cols, label_cols="label",
                                        shuffle=False, batch_size=batch_size)

4. Fine-tuning

from transformers import TFAutoModelForSequenceClassification
from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay
import tensorflow as tf
from tensorflow.keras.optimizers import Adam

# 1. Load the pre-trained model
# Pre-trained checkpoint to use
preTrModel = "distilbert-base-uncased"

# Number of output-layer nodes (one per emotion class)
nclass = 6

# Load the model
model_ft = TFAutoModelForSequenceClassification.from_pretrained(preTrModel, num_labels=nclass)

# 2. Additional training
# Compile and train. The model outputs raw logits, so the loss needs from_logits=True.
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model_ft.compile(optimizer = Adam(5e-5), loss = loss_fn)
# train and val are already batched by to_tf_dataset, so batch_size is not passed to fit
model_ft.fit(train, validation_data = val, epochs=5)

# 3. Prediction and evaluation
pred = model_ft.predict(test)
pred = pred.logits.argmax(axis=1)

y_test = em_encoded["test"]['label']

print(confusion_matrix(y_test, pred))
print()
print(classification_report(y_test, pred, target_names = classes))

def plot_confusion_matrix(y_true, y_pred, classes):
    cm = confusion_matrix(y_true, y_pred)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=classes)
    disp.plot(cmap=plt.cm.Blues, colorbar=False)
    plt.show()

plot_confusion_matrix(y_test, pred, classes)

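Summary

Fine-tuning procedure

1. Choose a pre-trained model
- NLP: BERT, GPT, RoBERTa, etc. (from Hugging Face)
- Computer vision: ResNet, VGGNet, EfficientNet, etc.

2. Prepare the data
- Prepare the data for the specific task
- Preprocess it into a form compatible with the pre-trained model
- NLP: tokenize the text
- Computer vision: resize images to an appropriate size

Input format required by Transformer models
- input_ids: the tokenized input sequence converted to numeric IDs
- attention_mask: lets the model ignore padded positions and attend only to the real tokens
- Padding: to make all token sequences the same length, shorter sentences are filled with the padding token (ID 0 for the tokenizer used here)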
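
The relationship between input_ids, attention_mask, and padding can be sketched by hand as follows. The helper name pad_batch is hypothetical, and the token IDs in the batch are illustrative placeholders rather than real tokenizer output. In practice the tokenizer produces both fields when called with padding=True.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence to the longest length and build the matching mask."""
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        # Short sequences are filled with the padding ID on the right
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padded position to be ignored
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two sequences of different lengths (illustrative IDs)
batch = [
    [101, 1045, 2031, 102],
    [101, 2016, 2001, 2061, 3407, 1012, 102],
]

encoded = pad_batch(batch)
for ids, mask in zip(encoded["input_ids"], encoded["attention_mask"]):
    print(ids)
    print(mask)
```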

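3. Modify the model
- In most cases, the output layer of the pre-trained model is adjusted to fit the new task
- Text classification: set the number of neurons in the output layer to the number of classes to predict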
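
To show what "matching the output layer to the number of classes" means, here is a toy classification head in plain Python: a linear layer that maps one hidden vector to one logit per class. The helper names, the hidden size of 8 (DistilBERT uses 768), and the random weights are assumptions for illustration. This is not how the library builds its head internally.

```python
import random

def make_head(hidden_size, num_labels, seed=0):
    # One weight row and one bias per class; values are random placeholders
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]
               for _ in range(num_labels)]
    bias = [0.0] * num_labels
    return weights, bias

def head_forward(hidden, weights, bias):
    # logits[c] = dot(weights[c], hidden) + bias[c]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weights, bias)]

hidden_size = 8   # shortened for readability
num_labels = 6    # six emotion classes, as in the example in this post

weights, bias = make_head(hidden_size, num_labels)
hidden = [0.5] * hidden_size          # stand-in for the encoder's sentence representation
logits = head_forward(hidden, weights, bias)

# The predicted class is the index of the largest logit
predicted = max(range(num_labels), key=lambda i: logits[i])
print(len(logits), predicted)
```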

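4. Additional training (.fit)
- Further train the model's weights on the prepared dataset
- A small learning rate is typically used
- This keeps the knowledge acquired during pre-training while adapting the model to the new task.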
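
A rough way to see why a small learning rate helps preserve pre-trained knowledge is to compare how far one update moves a weight. The sketch below uses a single plain gradient-descent step on one weight. The weight and gradient values are made up, and the training code in this post uses Adam rather than plain SGD, so this is only a simplified picture.

```python
def sgd_step(weight, grad, lr):
    # One plain gradient-descent update: move against the gradient, scaled by lr
    return weight - lr * grad

pretrained_w = 0.80   # hypothetical pre-trained weight
grad = 2.0            # hypothetical gradient from the new task

small = sgd_step(pretrained_w, grad, lr=5e-5)   # learning rate used for fine-tuning in this post
large = sgd_step(pretrained_w, grad, lr=1e-1)   # a much larger rate, for contrast

print(f"small lr: moved by {abs(pretrained_w - small):.6f}")
print(f"large lr: moved by {abs(pretrained_w - large):.6f}")
```

The small rate nudges the weight slightly, while the large rate moves it far from its pre-trained value in a single step.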