Building on AWS

13. Combining Traditional Machine Learning (ML) with Generative AI


 

This post walks through building a customer support system that combines traditional machine learning (ML) with generative AI, and explains how to develop a hybrid AI solution using AWS services such as Amazon SageMaker, Bedrock, and Comprehend.

 


1. S3 Integration

 

These are the initial steps for setting up the machine learning workflow in the Amazon SageMaker environment.

  • Integrate with Amazon S3: prepare to read data from and write data to S3.
  • Prepare SageMaker: set up the session and IAM role used for model training and deployment in SageMaker.
  • Prepare preprocessing tools: import the libraries for text vectorization, label encoding, and train/test splitting.
!pip install s3fs


import boto3
import sagemaker
from sagemaker import get_execution_role
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

role = get_execution_role()
session = sagemaker.Session()

S3_BUCKET_NAME = 'your-bucket-name'  # enter the name of the S3 bucket to use

 

2. Data Generation

 

This script generates the customer support data used to train the traditional machine learning model.

  • What it does:
    1. Generates customer support queries by category: builds a dataset by randomly sampling customer queries across six categories.
    2. Saves the generated data to a CSV file.
    3. Uploads the file to Amazon S3.
import pandas as pd
import random
import boto3
import os

S3_FILE_NAME = 'customer_support_data.csv'

# Categories & sample queries
categories = [
    "Account Issues",
    "Billing",
    "Technical Support",
    "Product Information",
    "Shipping",
    "Returns and Refunds"
]

queries = {
    "Account Issues": [
        "How do I reset my password?",
        "I can't log into my account",
        "How can I change my email address?",
        "I want to delete my account",
        "How do I update my profile information?",
        "I forgot my username",
        "How can I enable two-factor authentication?",
        "My account is locked, what should I do?",
        "Can I merge two accounts?",
        "How do I add a secondary user to my account?"
    ],
    "Billing": [
        "Why was I charged twice?",
        "How do I update my payment method?",
        "Can I get a refund for my last purchase?",
        "When will my subscription renew?",
        "How do I cancel my subscription?",
        "I don't recognize a charge on my account",
        "Can I change my billing cycle?",
        "Do you offer any discounts for annual subscriptions?",
        "How do I view my billing history?",
        "Can I get an invoice for tax purposes?"
    ],
    "Technical Support": [
        "The app keeps crashing on my phone",
        "I can't connect to the server",
        "How do I troubleshoot connection issues?",
        "The website is not loading properly",
        "I'm getting an error message when I try to upload a file",
        "How do I clear the cache on your app?",
        "The video quality is poor, how can I improve it?",
        "I'm experiencing lag, what can I do?",
        "How do I update the software?",
        "My data isn't syncing across devices"
    ],
    "Product Information": [
        "What are the dimensions of product X?",
        "Is this product compatible with my device?",
        "When will product Y be back in stock?",
        "What's the difference between model A and model B?",
        "Do you offer any warranties on your products?",
        "Can you tell me more about the features of product Z?",
        "What materials is this product made from?",
        "Is this product eco-friendly?",
        "Do you have any tutorials on how to use this product?",
        "Are replacement parts available for this product?"
    ],
    "Shipping": [
        "How long does shipping usually take?",
        "Can I change my shipping address after placing an order?",
        "Do you offer international shipping?",
        "How can I track my order?",
        "What are your shipping rates?",
        "Is express shipping available?",
        "My package is delayed, what should I do?",
        "Do you ship to PO boxes?",
        "Can I choose a specific delivery date?",
        "How are shipping costs calculated?"
    ],
    "Returns and Refunds": [
        "What is your return policy?",
        "How do I initiate a return?",
        "When will I receive my refund?",
        "Do I need to pay for return shipping?",
        "Can I exchange an item instead of returning it?",
        "I received a damaged item, what should I do?",
        "How long do I have to return an item?",
        "Can I return a gift without a receipt?",
        "Do you offer full refunds or store credit?",
        "What items are non-returnable?"
    ]
}

# Generate dummy data
data = []
for _ in range(1000):  # Generate 1000 samples
    category = random.choice(categories)
    query = random.choice(queries[category])
    data.append({"query": query, "category": category})

df = pd.DataFrame(data)

df = df.sample(frac=1, random_state=42).reset_index(drop=True)

local_file_path = "customer_support_data.csv"
df.to_csv(local_file_path, index=False)

print("Dataset statistics:")
print(df['category'].value_counts())
print("\nTotal samples:", len(df))

print("\nFirst few rows of the dataset:")
print(df.head())

# Upload to S3
try:
    s3_client = boto3.client('s3')
    s3_client.upload_file(local_file_path, S3_BUCKET_NAME, S3_FILE_NAME)
    print(f"\nFile successfully uploaded to S3 bucket: {S3_BUCKET_NAME}/{S3_FILE_NAME}")
except Exception as e:
    print(f"An error occurred while uploading to S3: {str(e)}")

print(f"\nLocal file saved as: {os.path.abspath(local_file_path)}")

 

3. Data Splitting

 

Split the data into training and test sets. This lets us evaluate how well the model generalizes from the data it was trained on to new, unseen data.

 

 

# Load the generated data
data = pd.read_csv(f's3://{S3_BUCKET_NAME}/customer_support_data.csv')
print(data.head())
print(data['category'].value_counts())

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    data['query'], data['category'], test_size=0.2, random_state=42
)
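
One optional tweak, not in the original post, is to stratify the split so each of the six categories keeps roughly the same proportion in the train and test sets. A minimal sketch, assuming the same data DataFrame:

# Optional (assumption): stratified split keeps category proportions balanced across train/test
X_train, X_test, y_train, y_test = train_test_split(
    data['query'], data['category'], test_size=0.2, random_state=42,
    stratify=data['category']
)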

 

4. Training the Classification Model (Traditional ML)

 

Vectorize the text data and encode the labels to build the training dataset, save it as a CSV, and upload it to S3. Then use Amazon SageMaker to train a multiclass classification model with the Linear Learner algorithm.

 

# CountVectorizer
vectorizer = CountVectorizer()

X_train_vec = vectorizer.fit_transform(X_train)

# Encode the labels
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)

train_df = pd.DataFrame(X_train_vec.toarray(), columns=vectorizer.get_feature_names_out())
train_df.insert(0, 'label', y_train_encoded)

# Save the training dataset (Linear Learner expects the label in the first CSV column, with no header row)
train_df.to_csv('train.csv', index=False, header=False)
session.upload_data('train.csv', bucket=S3_BUCKET_NAME, key_prefix='input')

from sagemaker.amazon.amazon_estimator import get_image_uri

train_data = sagemaker.inputs.TrainingInput(
    f's3://{S3_BUCKET_NAME}/input/train.csv',
    distribution="FullyReplicated",
    content_type="text/csv",
    s3_data_type="S3Prefix",
    record_wrapping=None,
    compression=None,
)

session = sagemaker.Session()

# Configure the estimator
container = get_image_uri(session.boto_region_name, 'linear-learner')

linear_learner = sagemaker.estimator.Estimator(
    container,
    sagemaker.get_execution_role(),
    input_mode="File",
    instance_count=1,
    instance_type='ml.m4.xlarge',
    output_path=f's3://{S3_BUCKET_NAME}/output'
)

# Set the hyperparameters
linear_learner.set_hyperparameters(
    feature_dim=len(vectorizer.get_feature_names_out()),
    predictor_type='multiclass_classifier',
    num_classes=len(label_encoder.classes_),
    mini_batch_size=100,
    epochs=10
)

# Train the model
linear_learner.fit({'train': train_data})

print("Training complete.")

 

5. Deployment & Inferencing

 

Deploy the trained model to a SageMaker endpoint so it can serve real-time inference. Given a text query, the model returns the predicted category.

 

# Deploy the endpoint
import sagemaker

linear_learner_predictor = linear_learner.deploy(
    initial_instance_count=1,
    instance_type='ml.t2.medium'
)

# Inference
from sagemaker.predictor import Predictor
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer
import numpy as np
import json

linear_learner_predictor.serializer = CSVSerializer()
linear_learner_predictor.deserializer = JSONDeserializer()

def classify_query(query):
    
    query_vector = vectorizer.transform([query]).toarray()
    result = linear_learner_predictor.predict(query_vector)
    label = np.argmax(result['predictions'][0]['score'])
    category = label_encoder.inverse_transform([label])[0]
    return category
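
A quick way to sanity-check the deployed classifier is to run it against a few held-out queries from section 3. The sketch below is an illustrative addition (not from the original post); it assumes the X_test and y_test splits from earlier and the classify_query function above:

# Illustrative check (assumption): measure accuracy on a small held-out sample
sample = X_test.sample(20, random_state=42)
correct = sum(
    classify_query(query) == true_category
    for query, true_category in zip(sample, y_test.loc[sample.index])
)
print(f"Accuracy on {len(sample)} held-out queries: {correct / len(sample):.2f}")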

 

6. Hybrid Usage (Traditional ML + Gen AI)

 

By combining the traditional ML model with a generative AI model, we classify each user request and then generate a contextual, natural-sounding response for it.

# LLM setup
def generate_conversation(bedrock_client, model_id, system_prompt, message):
    
    inference_config = {"temperature": 0.5} 

    additional_model_fields = {"top_k": 200}
    
    response = bedrock_client.converse(
        modelId=model_id,
        messages=[message],
        system=system_prompt,
        inferenceConfig=inference_config,
        additionalModelRequestFields=additional_model_fields
    )

    return response


# System prompt for each category
def get_system_prompt(query):
    SYSTEM_PROMPTS = {
        "Account Issues": "You are a customer support AI specializing in account-related issues. Your responses should be secure, focusing on user authentication and account management. Never ask for or provide sensitive information like passwords. Guide users through account recovery processes, and explain security measures clearly.",
    
        "Billing": "You are a billing support AI with expertise in invoices, payment methods, and subscription management. Provide clear explanations of charges, guide users through payment processes, and offer solutions for billing disputes. Be knowledgeable about refund policies and pricing structures.",
    
        "Technical Support": "You are a technical support AI with broad knowledge of our products and services. Provide step-by-step troubleshooting guidance, explain technical concepts in user-friendly terms, and offer workarounds when possible. If an issue seems complex, know when to escalate to human support.",
    
        "Product Information": "You are a product specialist AI with comprehensive knowledge of our product lineup. Provide detailed information on product features, compatibility, and use cases. Compare products when relevant, and suggest alternatives if a product doesn't meet the user's needs.",
    
        "Shipping": "You are a shipping and logistics AI expert. Provide accurate information on shipping times, tracking processes, and delivery options. Handle inquiries about delays professionally, and be knowledgeable about international shipping regulations and customs procedures.",
    
        "Returns and Refunds": "You are a returns and refunds specialist AI. Guide customers through the return process, explain refund timelines, and clarify return policies. Be empathetic to customer concerns while adhering to company policies. Offer alternatives to returns when appropriate, such as exchanges or store credit.",
    
        "Other": "You are a versatile customer support AI capable of handling a wide range of inquiries. Provide helpful and accurate information, and when faced with unique or complex issues, guide the customer towards the appropriate specialized department or human support."
    }
        
    category = classify_query(query)
    
    return {
        'category': category,
        'system_prompt': SYSTEM_PROMPTS.get(category, SYSTEM_PROMPTS["Other"])
    }
    
# Hybrid workflow: classify the query with the ML model, then generate a response with the LLM
def process_query(query):
    
    bedrock_client = boto3.client('bedrock-runtime')

    model_id = 'anthropic.claude-3-5-sonnet-20240620-v1:0'

    category_and_prompt = get_system_prompt(query)
    
    system_prompt = [{"text": category_and_prompt['system_prompt']}]

    message = {
        "role": "user",
        "content": [{"text": query}]
    }

    response = generate_conversation(bedrock_client, model_id, system_prompt, message)

    return {
        'category': category_and_prompt['category'],
        'response': response['output']['message']['content'][0]['text']
    }

 

7. Hybrid Workflow Inferencing

query = "Can I get my order faster?"
result = process_query(query)
print(f"Original query: {query}")
print(f"Category: {result['category']}")
print(f"Response: {result['response']}")

 

Example output:

 

 

Written on August 24, 2024

by Sang Hyun Jo
