How Do You Validate the Performance of RAG & Agent Systems?

삐롱K 2025. 6. 19. 23:48

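1. RAG System Performance Validation

1.1 Component-wise Evaluation

1.1.1 Retrieval Performance Evaluation

Metric	Formula	Description	Measurement method

Precision@K	P@K = (relevant documents in top-K) / K	Share of the top K results that are relevant	Manual labeling or automatic judgment
Recall@K	R@K = (relevant documents retrieved in top-K) / (all relevant documents)	Share of all relevant documents that were retrieved	Requires a dataset with complete relevance labels
Mean Reciprocal Rank (MRR)	MRR = (1/|Q|) Σ_i (1/rank_i)	Mean reciprocal rank of the first relevant document	Averaged over the query set Q
Normalized DCG (NDCG)	NDCG = DCG / IDCG	Rank-aware cumulative gain, normalized by the ideal ordering	Based on graded relevance scores (e.g., 0-3)
Hit Rate	HR@K = (queries with a relevant document in top-K) / (all queries)	Share of queries with at least one relevant document in the top K	Binary relevance judgment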
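
The formulas in the table can be sketched in a few lines of plain Python. This is only a minimal illustration, not the implementation used by any particular library: the function names, the list-of-IDs representation of a ranking, and the `gains` dictionary of graded relevance are assumptions made for the example. The DCG here uses the linear gain `rel / log2(rank + 1)`; some toolkits use `2^rel - 1` instead.

```python
import math

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top-k."""
    if not relevant_ids:
        return 0.0
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

def hit_at_k(ranked_ids, relevant_ids, k):
    """1.0 if any relevant document is in the top-k, else 0.0."""
    return 1.0 if any(d in relevant_ids for d in ranked_ids[:k]) else 0.0

def ndcg_at_k(ranked_ids, gains, k):
    """gains maps doc_id -> graded relevance (missing IDs count as 0)."""
    dcg = sum(gains.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranked_ids[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

Hit Rate over a query set is then just the mean of `hit_at_k` across queries, in the same way that MRR is the mean of `reciprocal_rank`.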

1.1.2 A Concrete Retrieval Evaluation Method

# Example: evaluating with the BEIR benchmark
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval

# 1. Load a dataset (MS MARCO, Natural Questions, etc.)
# GenericDataLoader expects the path to an already-downloaded BEIR dataset folder
data_folder = "datasets/msmarco"
corpus, queries, qrels = GenericDataLoader(data_folder=data_folder).load(split="test")

# 2. Run the retrieval model
retriever = YourRetriever()  # placeholder for your own retrieval model
results = retriever.retrieve(corpus, queries)  # expected shape: {query_id: {doc_id: score}}

# 3. Compute the metrics at several cutoffs
evaluator = EvaluateRetrieval()
ndcg, _map, recall, precision = evaluator.evaluate(qrels, results, [1, 3, 5, 10, 100])

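1.1.3 Generation Performance Evaluation

Metric	What it measures	How it is computed	Characteristics

Faithfulness	How faithful the generated answer is to the retrieved documents	LLM-based evaluation or an NLI model	Guards against hallucination
Answer Relevance	How relevant the generated answer is to the question	Cosine similarity, LLM judgment	Question-answer alignment
Context Relevance	How relevant the retrieved context is to the question	Relevance scoring	Filters out unnecessary information
Groundedness	Whether the answer is grounded in the provided context	Fact verification, citation checking	Factual accuracy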

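1.2 End-to-End RAG Evaluation

1.2.1 Automatic Evaluation Metrics

Metric	Formula / method	Strengths	Weaknesses

ROUGE-L	LCS-based similarity	Fast to compute, consistent	Ignores semantic differences
BERTScore	F1 over BERT embeddings	Reflects semantic similarity	High computational cost
BLEURT	Learned evaluation model	High correlation with human judgments	Domain dependence
Exact Match (EM)	Exact string match	Clear criterion	Ignores variation in phrasing
F1 Score	Token-level F1	Accounts for partial matches	Ignores word order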
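
To make the trade-offs in the table concrete, here is a small sketch of Exact Match, token-level F1, and an LCS-based ROUGE-L score. It assumes a deliberately simple normalization (lowercasing and whitespace tokenization) and reports ROUGE-L as a plain F1 of LCS precision and recall; real evaluation scripts typically also strip punctuation and articles, so treat the function names and choices here as illustrative.

```python
from collections import Counter

def normalize(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()

def exact_match(prediction, reference):
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction, reference):
    """Bag-of-tokens F1: counts overlapping tokens, ignoring their order."""
    pred, ref = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(prediction, reference):
    """F1 of LCS-based precision and recall, so token order matters."""
    pred, ref = normalize(prediction), normalize(reference)
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For the pair "paris is the capital" and "the capital is paris", Exact Match is 0, token F1 is 1.0 because the bags of tokens are identical, and the ROUGE-L score is 0.5 because the longest common subsequence ("the capital") covers only half of each string. That is the "ignores word order" weakness of F1 in action.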

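1.2.2 LLM-as-a-Judge Evaluation

# Example: scoring a RAG answer with GPT-4 as the judge
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

evaluation_prompt = """
Please evaluate the RAG system's answer against the following criteria:

1. Accuracy (1-5): Does it agree with the content of the provided documents?
2. Completeness (1-5): Does it fully answer the question?
3. Relevance (1-5): Is the content relevant to the question?
4. Consistency (1-5): Is it logically consistent?

Question: {question}
Retrieved documents: {retrieved_docs}
Generated answer: {generated_answer}

Give a score and a reason for each criterion.
"""

def evaluate_with_llm(question, docs, answer):
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # keep the judge as repeatable as possible
        messages=[{"role": "user", "content": evaluation_prompt.format(
            question=question, retrieved_docs=docs, generated_answer=answer
        )}]
    )
    # parse_scores is a user-supplied helper that extracts the per-criterion scores
    return parse_scores(response.choices[0].message.content)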
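
The `parse_scores` helper is not defined in the snippet above. One possible sketch is shown below. It assumes the judge replies in English with one "Criterion: score" line per criterion (for example `Accuracy: 4` or `Completeness (1-5): 3`); both the criterion names and that reply format are assumptions, so a production version would more likely request structured JSON output and validate it.

```python
import re

# Hypothetical criterion names; they must match whatever the judge prompt asks for.
CRITERIA = ("accuracy", "completeness", "relevance", "consistency")

def parse_scores(judge_text):
    """Extract integer 1-5 scores from free-text judge output.

    Criteria that cannot be found are simply omitted from the result,
    so the caller can detect and handle incomplete judgments.
    """
    scores = {}
    for name in CRITERIA:
        # Matches "Accuracy: 4", "Accuracy (1-5): 4", "Accuracy: 4/5", ...
        match = re.search(rf"{name}[^\n:]*:\s*([1-5])\b", judge_text, flags=re.IGNORECASE)
        if match:
            scores[name] = int(match.group(1))
    return scores
```

Returning only the criteria that were actually found makes malformed judge replies visible instead of silently turning them into zeros.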

1.3 RAG Benchmark Datasets

Dataset	Domain	Size	Characteristics	Evaluation metrics

MS MARCO	General	8.8M passages	Based on web search	MRR@10, Recall@1000
Natural Questions	Wikipedia	307K questions	Real user questions	EM, F1
TriviaQA	Trivia	95K questions	Fact-based questions	EM, F1
HotpotQA	Wikipedia	113K questions	Multi-document reasoning	EM, F1, Supporting Facts
FEVER	Wikipedia	185K claims	Fact verification	Label Accuracy, FEVER Score
RGB	Various domains	7 tasks	Comprehensive benchmark	Task-specific metrics

1.4 Advanced RAG Evaluation Methods

1.4.1 Adversarial Testing

# Adversarial test example
import numpy as np

adversarial_tests = [
    # 1. Irrelevant documents are mixed into the context
    "irrelevant_documents_test",
    # 2. The documents contain conflicting information
    "conflicting_information_test",
    # 3. Only incomplete information is available
    "incomplete_information_test",
    # 4. The question depends on time-sensitive information
    "temporal_information_test"
]

def run_adversarial_test(rag_system, test_type):
    # load_adversarial_cases and evaluate_robustness are placeholders
    # for project-specific helpers
    test_cases = load_adversarial_cases(test_type)
    results = []
    for case in test_cases:
        result = rag_system.query(case['question'])
        score = evaluate_robustness(result, case['expected_behavior'])
        results.append(score)
    return np.mean(results)

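1.4.2 Ablation Study

Component removed	What is measured	Expected effect

Re-ranking	Effect of re-ranking	Lower precision
Query Expansion	Effect of query expansion	Lower recall
Chunk Overlap	Effect of the chunking strategy	Less continuity between context chunks
Metadata Filtering	Effect of using metadata	Lower accuracy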
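
An ablation run boils down to scoring the full pipeline once and then re-scoring it with one component switched off at a time. The sketch below shows that loop. `run_ablation`, the component names, and especially `stub_evaluate` with its placeholder gains are hypothetical: the stub exists only so the loop can be executed, and its numbers are not measurements. In practice `evaluate` would run your evaluation set through the pipeline with the given components enabled.

```python
def run_ablation(evaluate, components):
    """Score the full pipeline, then re-score it with each component removed.

    evaluate: callable taking a set of enabled component names, returning a score.
    Returns (full_score, {component: score drop when it is removed}).
    """
    full_score = evaluate(set(components))
    drops = {}
    for name in components:
        reduced = set(components) - {name}
        drops[name] = full_score - evaluate(reduced)
    return full_score, drops


# Stub evaluator with placeholder numbers (NOT measurements): each enabled
# component adds a fixed amount to a base score of 0.5.
PLACEHOLDER_GAIN = {"reranking": 0.125, "query_expansion": 0.0625, "metadata_filtering": 0.03125}

def stub_evaluate(enabled):
    return 0.5 + sum(PLACEHOLDER_GAIN[c] for c in enabled)

full_score, drops = run_ablation(stub_evaluate, list(PLACEHOLDER_GAIN))
for component, drop in sorted(drops.items(), key=lambda kv: -kv[1]):
    print(f"without {component}: score drops by {drop:.3f}")
```

Because real components interact, the individual drops usually do not add up to the total gain, which is a good reason to report them per metric (precision, recall, and so on) rather than as a single number.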

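2. Agent System Performance Validation

2.1 Task-specific Evaluation

2.1.1 Tool-using Agent Evaluation

Evaluation aspect	Metric	Measurement method	Example

Tool Selection Accuracy	Share of correct tool selections	(correct tool selections) / (total selections)	Choosing a calculator vs. a search engine
Parameter Passing Accuracy	Share of correctly passed parameters	Parameter accuracy score	Passing the right arguments in an API call
Tool Chain Efficiency	Number of tool calls needed to reach the goal	Actual steps relative to the minimum required steps	Minimizing unnecessary tool calls
Error Recovery	Ability to recover from error situations	Successful recovery rate	Finding an alternative after an API error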

2.1.2 A Concrete Tool-use Evaluation Method

# ToolBench-style evaluation
class ToolUseEvaluator:
    def __init__(self):
        self.tools = ["calculator", "search", "email", "calendar"]
        # load_tool_sequences is a placeholder that is assumed to return
        # {task_id: reference tool-call sequence}
        self.ground_truth = load_tool_sequences()

    def evaluate_tool_sequence(self, agent_actions, task_id):
        gt_sequence = self.ground_truth[task_id]

        # 1. Tool selection accuracy
        tool_accuracy = self.calculate_tool_accuracy(agent_actions, gt_sequence)

        # 2. Parameter accuracy
        param_accuracy = self.calculate_param_accuracy(agent_actions, gt_sequence)

        # 3. Execution success rate
        success_rate = self.check_execution_success(agent_actions)

        # 4. Efficiency score: reference length / actual length
        #    (guarded so an empty action list does not divide by zero)
        efficiency = len(gt_sequence) / len(agent_actions) if agent_actions else 0.0

        return {
            'tool_accuracy': tool_accuracy,
            'param_accuracy': param_accuracy,
            'success_rate': success_rate,
            'efficiency': efficiency
        }
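
The two accuracy helpers called above are left undefined, so here is one way they might look. This sketch assumes each action is a dict of the form `{"tool": name, "params": {...}}` and compares the agent's calls with the reference sequence position by position. That representation and the strict positional matching are assumptions; a more lenient evaluator could align the sequences first.

```python
def tool_selection_accuracy(agent_actions, gt_sequence):
    """(correct tool selections) / (total selections made by the agent)."""
    if not agent_actions:
        return 0.0
    correct = sum(1 for a, g in zip(agent_actions, gt_sequence) if a["tool"] == g["tool"])
    return correct / len(agent_actions)

def parameter_accuracy(agent_actions, gt_sequence):
    """Among steps where the right tool was chosen, the share of expected
    parameters that were passed with exactly the expected value."""
    total = correct = 0
    for action, expected in zip(agent_actions, gt_sequence):
        if action["tool"] != expected["tool"]:
            continue
        for key, value in expected["params"].items():
            total += 1
            if action["params"].get(key) == value:
                correct += 1
    return correct / total if total else 0.0
```

For example, an agent that makes the two reference calls (one with a wrong argument) and then adds an unnecessary third call scores 2/3 on tool selection and 1/2 on parameters.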

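2.2 Planning & Reasoning Evaluation

2.2.1 Multi-step Reasoning Evaluation

Evaluation dimension	Metric	How it is computed	Example use case

Plan Quality	Logical coherence of the plan	Expert review + automated checks	Travel planning, project management
Step Accuracy	Correctness of each step	Per-step pass/fail check	Working through a math problem
Dependency Handling	Ability to manage dependencies	Detection of ordering violations	Following a cooking recipe
Adaptability	Ability to revise the plan	Score for responding to changed conditions	Real-time schedule adjustment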
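
Of the rows above, "detection of ordering violations" is the easiest to automate. The following sketch assumes the plan is available as an ordered list of step names and the task defines a prerequisite map; both the data shapes and the function name are illustrative assumptions. A step counts as violating a dependency when its prerequisite ran later or never ran at all.

```python
def find_order_violations(executed_steps, dependencies):
    """Return (prerequisite, step) pairs that were executed out of order.

    executed_steps: list of step names in the order the agent ran them.
    dependencies:   dict mapping a step to the set of steps it requires first.
    """
    position = {step: i for i, step in enumerate(executed_steps)}
    violations = []
    for step, prerequisites in dependencies.items():
        if step not in position:
            continue  # the step was never executed, so nothing to check
        for prereq in prerequisites:
            if prereq not in position or position[prereq] > position[step]:
                violations.append((prereq, step))
    return violations


# Recipe-style example: each step requires the previous one
recipe_deps = {
    "add_pasta": {"boil_water"},
    "drain": {"add_pasta"},
    "serve": {"drain"},
}
print(find_order_violations(["add_pasta", "boil_water", "serve"], recipe_deps))
```

A dependency-handling score can then be derived from the violation count, for instance as one minus the share of violated dependency edges.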

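2.2.2 GSM8K-style Math Problem Evaluation

# Step-by-step reasoning evaluation example
def evaluate_math_reasoning(agent_response, ground_truth):
    # parse_reasoning_steps, check_final_answer, evaluate_step_correctness, and
    # evaluate_logical_flow are placeholders for project-specific helpers
    steps = parse_reasoning_steps(agent_response)
    gt_steps = parse_reasoning_steps(ground_truth)

    scores = {
        # Guard against an empty parse so steps[-1] cannot raise IndexError
        'final_answer_correct': bool(steps and gt_steps)
                                and check_final_answer(steps[-1], gt_steps[-1]),
        'reasoning_steps_correct': [],
        'logical_flow_score': 0
    }

    # Check each step against the reference step at the same position
    for step, gt_step in zip(steps, gt_steps):
        step_score = evaluate_step_correctness(step, gt_step)
        scores['reasoning_steps_correct'].append(step_score)

    # Evaluate the logical flow of the whole chain
    scores['logical_flow_score'] = evaluate_logical_flow(steps)

    return scores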

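2.3 Memory & Context Management Evaluation

2.3.1 Conversation Consistency Evaluation

Evaluation aspect	What is measured	Metric	Tooling

Entity Consistency	Entity tracking accuracy	Entity reference match rate	NER + Coreference Resolution
Temporal Consistency	Consistency of time information	Detection of chronological contradictions	Timeline Extraction
Persona Consistency	Persona maintenance	Personality/style consistency score	Style Classifier
Factual Consistency	Consistency of factual information	Rate of detected contradictions	Fact Verification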

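2.3.2 Evaluating Long Conversations

# Long-term conversation consistency evaluation
# (EntityTracker, FactChecker, and TemporalChecker are placeholder components)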
class LongTermConsistencyEvaluator:
    def __init__(self):
        self.entity_tracker = EntityTracker()
        self.fact_checker = FactChecker()
        self.temporal_checker = TemporalChecker()
    
    def evaluate_conversation(self, conversation_history):
        consistency_scores = []
        
        for turn_idx, turn in enumerate(conversation_history):
            # 1. Entity consistency check against all earlier turns
            entity_score = self.entity_tracker.check_consistency(
                turn, conversation_history[:turn_idx]
            )
            
            # 2. Factual consistency check
            fact_score = self.fact_checker.verify_facts(
                turn, conversation_history[:turn_idx]
            )
            
            # 3. Temporal consistency check
            temporal_score = self.temporal_checker.check_timeline(
                turn, conversation_history[:turn_idx]
            )
            
            turn_score = {
                'entity': entity_score,
                'factual': fact_score, 
                'temporal': temporal_score,
                'overall': (entity_score + fact_score + temporal_score) / 3
            }
            
            consistency_scores.append(turn_score)
        
        return consistency_scores

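2.4 Agent Benchmarks & Datasets

2.4.1 Major Agent Benchmarks

Benchmark	Domain	Task type	Evaluation metrics	Characteristics

ToolBench	Various APIs	Tool-use	Success Rate, Efficiency	16,000+ APIs
WebShop	E-commerce	Web Navigation	Task Success Rate	Simulated shopping site built on real product data
ALFWorld	Household environment	Interactive Planning	Success Rate	Text-based environment
ScienceWorld	Science experiments	Scientific Reasoning	Task Completion	Physics/chemistry simulation
HotpotQA	Knowledge retrieval	Multi-hop QA	EM, F1	Requires complex reasoning
MATH	Mathematics	Problem Solving	Accuracy	Competition-level math problems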

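2.4.2 Comprehensive Agent Evaluation Framework

# Comprehensive agent evaluation system
# Each evaluator below is assumed to expose evaluate(output, expected) and
# return a numeric score; the evaluator classes are defined elsewhere.
import numpy as np
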
class ComprehensiveAgentEvaluator:
    def __init__(self):
        self.evaluators = {
            'tool_use': ToolUseEvaluator(),
            'reasoning': ReasoningEvaluator(), 
            'memory': MemoryEvaluator(),
            'planning': PlanningEvaluator(),
            'robustness': RobustnessEvaluator()
        }
    
    def comprehensive_evaluation(self, agent, test_suite):
        results = {}
        
        for category, evaluator in self.evaluators.items():
            category_tests = test_suite.get_tests(category)
            category_results = []
            
            for test in category_tests:
                # Run the agent
                agent_output = agent.run(test.input)
                
                # Score the output
                score = evaluator.evaluate(agent_output, test.expected)
                category_results.append(score)
            
            results[category] = {
                'scores': category_results,
                'mean': np.mean(category_results),
                'std': np.std(category_results),
                'percentile_95': np.percentile(category_results, 95)
            }
        
        # Overall score as a weighted average of the category means (weights sum to 1.0)
        weights = {'tool_use': 0.3, 'reasoning': 0.25, 'memory': 0.2, 
                  'planning': 0.15, 'robustness': 0.1}
        
        overall_score = sum(results[cat]['mean'] * weights[cat] 
                           for cat in weights.keys())
        
        results['overall'] = overall_score
        return results

3. Experimental Design and Statistical Validation

3.1 A/B Test Design

3.1.1 A/B Testing a RAG System

# RAG A/B test framework
import random

import numpy as np
from scipy import stats

class RAGABTest:
    def __init__(self, baseline_rag, experimental_rag):
        self.baseline = baseline_rag
        self.experimental = experimental_rag
        self.metrics = ['accuracy', 'latency', 'user_satisfaction']

    def run_experiment(self, test_queries, users, duration_days=14):
        # Split users 50:50 after shuffling, so list order cannot bias the groups
        shuffled = random.sample(list(users), len(users))
        baseline_users = shuffled[:len(shuffled)//2]
        experimental_users = shuffled[len(shuffled)//2:]

        # Run the experiment. collect_metrics is a placeholder that is assumed
        # to return {metric_name: list of observed values}.
        baseline_results = self.collect_metrics(
            self.baseline, test_queries, baseline_users, duration_days
        )
        experimental_results = self.collect_metrics(
            self.experimental, test_queries, experimental_users, duration_days
        )

        # Test each metric for statistical significance
        statistical_results = {}
        for metric in self.metrics:
            stat_result = self.statistical_test(
                baseline_results[metric],
                experimental_results[metric]
            )
            statistical_results[metric] = stat_result

        return statistical_results

    def statistical_test(self, baseline_data, experimental_data):
        # Normality check (Shapiro-Wilk)
        baseline_normal = stats.shapiro(baseline_data).pvalue > 0.05
        experimental_normal = stats.shapiro(experimental_data).pvalue > 0.05

        if baseline_normal and experimental_normal:
            # Both groups look normal: use the independent two-sample t-test
            statistic, p_value = stats.ttest_ind(baseline_data, experimental_data)
            test_type = "t-test"
        else:
            # Otherwise fall back to the non-parametric Mann-Whitney U test
            statistic, p_value = stats.mannwhitneyu(
                baseline_data, experimental_data, alternative='two-sided'
            )
            test_type = "mann-whitney"

        # Effect size (Cohen's d) using the pooled sample standard deviation.
        # ddof=1 gives the sample variance that the (n - 1) weights assume.
        n_b, n_e = len(baseline_data), len(experimental_data)
        pooled_std = np.sqrt(((n_b - 1) * np.var(baseline_data, ddof=1) +
                              (n_e - 1) * np.var(experimental_data, ddof=1)) /
                             (n_b + n_e - 2))

        cohens_d = (np.mean(experimental_data) - np.mean(baseline_data)) / pooled_std

        return {
            'test_type': test_type,
            'statistic': statistic,
            'p_value': p_value,
            'significant': p_value < 0.05,
            'effect_size': cohens_d,
            'baseline_mean': np.mean(baseline_data),
            'experimental_mean': np.mean(experimental_data),
            # Relative improvement over the baseline, in percent
            'improvement': (np.mean(experimental_data) - np.mean(baseline_data)) / np.mean(baseline_data) * 100
        }

3.2 Confidence Intervals and Statistical Significance

3.2.1 Bootstrap Confidence Intervals

import numpy as np

def bootstrap_confidence_interval(data, func=np.mean, n_bootstrap=10000, confidence=0.95):
    """Compute a percentile bootstrap confidence interval for func(data)."""
    bootstrap_samples = []

    for _ in range(n_bootstrap):
        # Resample with replacement, keeping the original sample size
        sample = np.random.choice(data, size=len(data), replace=True)
        bootstrap_samples.append(func(sample))

    bootstrap_samples = np.array(bootstrap_samples)

    # Take the percentiles that cut off alpha/2 in each tail
    alpha = 1 - confidence
    lower_percentile = (alpha/2) * 100
    upper_percentile = (1 - alpha/2) * 100

    ci_lower = np.percentile(bootstrap_samples, lower_percentile)
    ci_upper = np.percentile(bootstrap_samples, upper_percentile)

    return ci_lower, ci_upper, bootstrap_samples

# Usage example
accuracy_scores = [0.85, 0.87, 0.83, 0.89, 0.86, 0.88, 0.84, 0.90]
ci_lower, ci_upper, bootstrap_dist = bootstrap_confidence_interval(accuracy_scores)
print(f"95% confidence interval: [{ci_lower:.3f}, {ci_upper:.3f}]")
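
When two systems are scored on the same set of queries, the samples are paired, and it is usually more informative to bootstrap the per-query differences than to build a separate interval for each system. The sketch below is one way to do that with only the standard library. The function name, the fixed seed, and the index-based percentile lookup are choices made for this illustration.

```python
import random

def paired_bootstrap_diff(scores_a, scores_b, n_bootstrap=2000, confidence=0.95, seed=0):
    """Percentile bootstrap CI for mean(scores_b - scores_a) on paired scores.

    scores_a[i] and scores_b[i] must come from the same query.
    Returns (mean_difference, ci_lower, ci_upper).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have the same length")
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(scores_a)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]

    means = []
    for _ in range(n_bootstrap):
        # Resample query indices with replacement
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()

    alpha = 1 - confidence
    lower = means[int((alpha / 2) * n_bootstrap)]
    upper = means[min(n_bootstrap - 1, int((1 - alpha / 2) * n_bootstrap))]
    return sum(diffs) / n, lower, upper
```

If the resulting interval excludes 0, the difference between the two systems is unlikely to be an artifact of which queries happened to be sampled.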

3.3 Handling the Multiple Comparisons Problem

3.3.1 Bonferroni Correction

def multiple_comparison_correction(p_values, method='bonferroni'):
    """Adjust p-values for multiple comparisons."""
    from statsmodels.stats.multitest import multipletests

    reject, p_adjusted, alpha_sidak, alpha_bonf = multipletests(
        p_values, alpha=0.05, method=method
    )

    return {
        'original_p_values': p_values,
        'adjusted_p_values': p_adjusted,
        'significant_after_correction': reject,
        'correction_method': method
    }

# Use this when comparing several RAG variants at once
methods = ['baseline', 'improved_chunking', 'better_embedding', 'hybrid_search']
p_values = [0.03, 0.008, 0.045, 0.12]  # one p-value per comparison
correction_result = multiple_comparison_correction(p_values)
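
It helps to see what the correction does to the four example p-values. Bonferroni multiplies each p-value by the number of comparisons m (capped at 1), which is equivalent to comparing the raw p-value with alpha / m. The hand-rolled function below is only meant to make that arithmetic visible; it is an illustrative sketch, not a replacement for `multipletests`.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (adjusted p-values, reject flags) under Bonferroni correction."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    # Rejecting when p <= alpha / m is equivalent to adjusted p <= alpha
    reject = [p <= alpha / m for p in p_values]
    return adjusted, reject


adjusted, reject = bonferroni([0.03, 0.008, 0.045, 0.12])
for p_adj, r in zip(adjusted, reject):
    print(f"adjusted p = {p_adj:.3f}, significant = {r}")
```

With m = 4 the per-test threshold becomes 0.05 / 4 = 0.0125, so only the comparison with p = 0.008 stays significant, even though three of the four raw p-values were below 0.05.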

4. Real-time Monitoring and Evaluation

4.1 Online Evaluation System

4.1.1 Real-time Performance Monitoring

from collections import defaultdict
from datetime import datetime

import numpy as np

class RealTimeEvaluationSystem:
    # automatic_evaluation, send_alert, and aggregate_and_store_metrics are
    # assumed to be implemented elsewhere (for example, in a subclass)
    def __init__(self):
        self.metrics_buffer = defaultdict(list)
        self.alert_thresholds = {
            'accuracy': 0.8,
            'latency_p95': 2.0,  # seconds
            'error_rate': 0.05
        }

    def log_interaction(self, query, response, latency, user_feedback=None):
        """Log a single interaction as it happens."""
        timestamp = datetime.now()

        # Run the automatic evaluation
        auto_scores = self.automatic_evaluation(query, response)

        # Append to the metrics buffer
        self.metrics_buffer['accuracy'].append(auto_scores['accuracy'])
        self.metrics_buffer['latency'].append(latency)
        self.metrics_buffer['timestamp'].append(timestamp)

        # Compare with None so that a feedback score of 0 is still recorded
        if user_feedback is not None:
            self.metrics_buffer['user_satisfaction'].append(user_feedback)

        # Check thresholds and raise alerts
        self.check_thresholds()

        # Aggregate and persist the metrics every 100 interactions
        if len(self.metrics_buffer['accuracy']) % 100 == 0:
            self.aggregate_and_store_metrics()

    def check_thresholds(self):
        """Monitor thresholds over the 50 most recent interactions."""
        recent_accuracy = np.mean(self.metrics_buffer['accuracy'][-50:])
        recent_latency_p95 = np.percentile(self.metrics_buffer['latency'][-50:], 95)

        if recent_accuracy < self.alert_thresholds['accuracy']:
            self.send_alert(f"Accuracy dropped to {recent_accuracy:.3f}")

        if recent_latency_p95 > self.alert_thresholds['latency_p95']:
            self.send_alert(f"P95 latency increased to {recent_latency_p95:.2f}s")
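
The `error_rate` threshold is declared in `alert_thresholds` but never checked in the class above. A sliding-window counter is one simple way to monitor it. The class below is a self-contained sketch: the name `ErrorRateWindow`, the default window size, and the rule of staying silent until the window is full are all assumptions for the example.

```python
from collections import deque

class ErrorRateWindow:
    """Track the error rate over the most recent `window_size` requests."""

    def __init__(self, window_size=50, threshold=0.05):
        # deque with maxlen drops the oldest outcome automatically
        self.outcomes = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, is_error):
        self.outcomes.append(bool(is_error))

    def error_rate(self):
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self):
        # Only alert once the window is full, to avoid noisy alerts at startup
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.error_rate() > self.threshold
```

Calling `record(...)` from `log_interaction` and `should_alert()` from `check_thresholds` would be one way to wire it in.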

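4.2 Integrating User Feedback

4.2.1 Collecting Implicit Feedback

class ImplicitFeedbackCollector:
    def __init__(self):
        self.feedback_indicators = {
            'click_through': 1.0,      # clicked a link in the answer
            'follow_up_query': 0.5,    # asked a follow-up question (partially satisfied)
            'reformulation': -0.5,     # rephrased the question (dissatisfied)
            'session_end': 0.8,        # ended the session (satisfied)
            'long_dwell_time': 0.7     # long dwell time
        }

    def calculate_implicit_score(self, user_actions):
        """Compute a satisfaction score from user behavior.

        user_actions maps an action name to its weight (e.g., how often it occurred).
        """
        score = 0
        for action, weight in user_actions.items():
            if action in self.feedback_indicators:
                score += self.feedback_indicators[action] * weight

        # Map the raw score from [-1, 1] to [0, 1] and clamp anything outside
        return max(0, min(1, (score + 1) / 2))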
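
A few worked inputs show how this scoring behaves. The standalone function below repeats the same arithmetic with the same weights so it can be run on its own; the name `implicit_score` and the example action dictionaries are made up for the illustration.

```python
FEEDBACK_WEIGHTS = {
    "click_through": 1.0,
    "follow_up_query": 0.5,
    "reformulation": -0.5,
    "session_end": 0.8,
    "long_dwell_time": 0.7,
}

def implicit_score(user_actions, weights=FEEDBACK_WEIGHTS):
    """Same arithmetic as calculate_implicit_score, as a plain function."""
    raw = sum(weights[a] * count for a, count in user_actions.items() if a in weights)
    # Map [-1, 1] to [0, 1]; anything outside that range is clamped
    return max(0.0, min(1.0, (raw + 1) / 2))


print(implicit_score({}))                       # no signals: neutral 0.5
print(implicit_score({"follow_up_query": 1}))   # raw 0.5 -> 0.75
print(implicit_score({"reformulation": 2}))     # raw -1.0 -> 0.0
print(implicit_score({"click_through": 1, "long_dwell_time": 1}))  # raw 1.7 -> clamped to 1.0
```

Note that the score saturates quickly: a single click-through already reaches 1.0, so any additional positive signals in the same session no longer change the result. If finer resolution is needed, the raw score could be divided by the number of observed actions before it is mapped to [0, 1].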

4.3 Performance Prediction and Early Warning

4.3.1 Performance Degradation Prediction Model

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

class PerformanceDegradationDetector:
    def __init__(self):
        self.anomaly_detector = IsolationForest(contamination=0.1)
        self.scaler = StandardScaler()
        self.is_trained = False
        
    def train(self, historical_metrics):
        """Fit the anomaly detector on historical performance data."""
        features = np.column_stack([
            historical_metrics['accuracy'],
            historical_metrics['latency'],
            historical_metrics['error_rate'],
            historical_metrics['user_satisfaction']
        ])
        
        scaled_features = self.scaler.fit_transform(features)
        self.anomaly_detector.fit(scaled_features)
        self.is_trained = True
    
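    def detect_anomaly(self, current_metrics):
        """Check whether the current metrics look anomalous.

        Returns (is_anomaly, anomaly_score); lower scores are more anomalous.
        """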
        if not self.is_trained:
            return False, 0.0
        
        features = np.array([[
            current_metrics['accuracy'],
            current_metrics['latency'], 
            current_metrics['error_rate'],
            current_metrics['user_satisfaction']
        ]])
        
        scaled_features = self.scaler.transform(features)
        anomaly_score = self.anomaly_detector.decision_function(scaled_features)[0]
        is_anomaly = self.anomaly_detector.predict(scaled_features)[0] == -1
        
        return is_anomaly, anomaly_score
