Key Points of the Paper “Sufficient Context: A New Lens on Retrieval Augmented Generation Systems”

AIStat 2025. 5. 26. 11:24

This post summarizes the key points of the paper “Sufficient Context: A New Lens on Retrieval Augmented Generation Systems” and then presents an example Python implementation centered on the selective generation method proposed in the paper.

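Paper Summary

Problem Statement and Motivation
- RAG (Retrieval Augmented Generation) systems, which supply external context to a large language model (LLM), substantially improve factuality, but the model can still confidently produce wrong answers or be distracted by irrelevant information.
- It is unclear whether these errors arise because the context is insufficient or because the model fails to make use of context that is sufficient.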
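Introducing the Notion of Sufficient Context
- Definition: a question (Q) and context (C) pair has sufficient context if a plausible answer A′ to Q can be constructed from C alone. This can be judged without knowing the ground-truth answer (A).
- Even when the question or the context is ambiguous, the context counts as sufficient only if it provides all the information needed to interpret and disambiguate it.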
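Sufficient Context Autorater
- Classification with Gemini 1.5 Pro (1-shot) reached 93% accuracy. The label is assigned from the query-context pair alone, without the ground-truth (GT) answer.
- It showed better balanced performance than the smaller model (FLAMe) and the NLI model (TRUE-NLI).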
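
To make the autorater comparison concrete, the sketch below scores binary sufficiency predictions against human labels using accuracy, precision, recall, and F1. The function name `score_autorater` and the toy labels are illustrative assumptions, not data or code from the paper.

```python
from typing import Dict, List


def score_autorater(human: List[bool], predicted: List[bool]) -> Dict[str, float]:
    """Compare predicted sufficiency labels with human labels (True = sufficient)."""
    if len(human) != len(predicted) or not human:
        raise ValueError("label lists must be non-empty and equally long")
    pairs = list(zip(human, predicted))
    tp = sum(h and p for h, p in pairs)              # sufficient, predicted sufficient
    fp = sum((not h) and p for h, p in pairs)        # insufficient, predicted sufficient
    fn = sum(h and (not p) for h, p in pairs)        # sufficient, predicted insufficient
    tn = sum((not h) and (not p) for h, p in pairs)  # insufficient, predicted insufficient
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical labels for eight query-context pairs
human = [True, True, True, False, False, False, True, False]
predicted = [True, True, False, False, True, False, True, False]
scores = score_autorater(human, predicted)
print(scores)
```

Reporting precision and recall alongside accuracy shows whether an autorater is balanced or simply biased toward one label.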
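RAG System Performance Analysis
- Roughly half of the instances in the benchmark datasets (FreshQA, Musique, HotPotQA) have insufficient context. Evaluating model responses separately on the sufficient and insufficient subsets showed:
  - With sufficient context, the share of correct answers rises, but when the model fails to answer correctly it hallucinates more often than it abstains.
  - With insufficient context, hallucination is likewise more frequent than abstention.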
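
The stratified analysis above can be sketched as follows: group responses by the sufficiency label and compute the share of each outcome in each group. The function name `outcome_rates`, the outcome strings, and the toy counts are assumptions made for illustration; they are not results from the paper.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

OUTCOMES = ("correct", "abstain", "hallucinate")


def outcome_rates(rows: Iterable[Tuple[bool, str]]) -> Dict[bool, Dict[str, float]]:
    """rows: (is_sufficient, outcome) pairs, with outcome one of OUTCOMES."""
    counts = defaultdict(Counter)
    for sufficient, outcome in rows:
        if outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome}")
        counts[sufficient][outcome] += 1
    rates = {}
    for sufficient, counter in counts.items():
        total = sum(counter.values())
        rates[sufficient] = {name: counter[name] / total for name in OUTCOMES}
    return rates


# Hypothetical responses: 10 with sufficient context, 10 with insufficient context
rows = (
    [(True, "correct")] * 6 + [(True, "hallucinate")] * 3 + [(True, "abstain")] * 1
    + [(False, "correct")] * 2 + [(False, "hallucinate")] * 5 + [(False, "abstain")] * 3
)
rates = outcome_rates(rows)
for sufficient, shares in sorted(rates.items(), reverse=True):
    print("sufficient" if sufficient else "insufficient", shares)
```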
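Qualitative Analysis (qualitative categories)
- Cases where the model answers correctly despite insufficient context are grouped into eight categories (e.g., Yes/No questions, inference from partial fragments, use of parametric knowledge).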
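Selective Generation
- Signals: (1) the model's self-rated confidence (P(Correct) or P(True)), and (2) the binary sufficient-context label.
- A logistic regression model combining the two signals predicts the probability of hallucination, and the system answers only when that probability is below a threshold; otherwise it abstains.
- This raises the fraction of correct answers among the responses actually given by 2–10% for Gemini, GPT, and Gemma.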
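
The trade-off behind the threshold can be illustrated with a small sketch: given predicted hallucination probabilities from any predictor, lowering the threshold reduces coverage (the share of questions answered) while raising accuracy among the answered ones. The function name `selective_metrics` and all numbers below are hypothetical, chosen only to show the mechanics.

```python
from typing import List, Tuple


def selective_metrics(
    p_halluc: List[float], correct: List[bool], threshold: float
) -> Tuple[float, float]:
    """Return (coverage, selective_accuracy) when answering only if p_halluc < threshold."""
    if len(p_halluc) != len(correct) or not p_halluc:
        raise ValueError("inputs must be non-empty and equally long")
    answered = [c for p, c in zip(p_halluc, correct) if p < threshold]
    coverage = len(answered) / len(p_halluc)
    # Accuracy is measured only over the questions the system chose to answer
    selective_accuracy = sum(answered) / len(answered) if answered else 0.0
    return coverage, selective_accuracy


# Hypothetical predicted hallucination probabilities and answer correctness
p_halluc = [0.05, 0.10, 0.20, 0.35, 0.45, 0.60, 0.75, 0.90]
correct = [True, True, True, False, True, False, True, False]

for t in (1.01, 0.5, 0.3):
    cov, acc = selective_metrics(p_halluc, correct, t)
    print(f"threshold={t:.2f} coverage={cov:.3f} selective_accuracy={acc:.3f}")
```

A threshold above 1 answers everything, so it serves as the no-abstention baseline when comparing settings.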
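Fine-tuning Experiments
- Mistral-7B-Instruct was fine-tuned with LoRA, with some training examples relabeled as “I don’t know” to encourage abstention, but the hallucination rate remained high and further research is needed.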
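
As a rough sketch of the data preparation step only (not the LoRA training itself), the snippet below replaces the target answer of a fraction of training examples with an abstention string. The function name `mix_in_abstentions`, the dictionary keys, and the choice to pick examples uniformly at random are all assumptions; the summary above does not say which examples were relabeled.

```python
import random
from typing import Dict, List


def mix_in_abstentions(
    examples: List[Dict[str, str]],
    fraction: float,
    seed: int = 0,
    abstain_text: str = "I don't know",
) -> List[Dict[str, str]]:
    """Return a copy of `examples` with round(fraction * n) answers set to `abstain_text`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    n_abstain = round(fraction * len(examples))
    chosen = set(rng.sample(range(len(examples)), n_abstain))
    return [
        {**ex, "answer": abstain_text} if i in chosen else dict(ex)
        for i, ex in enumerate(examples)
    ]


# Hypothetical training examples
examples = [
    {"question": f"q{i}", "context": f"c{i}", "answer": f"a{i}"} for i in range(10)
]
mixed = mix_in_abstentions(examples, fraction=0.2)
print(sum(ex["answer"] == "I don't know" for ex in mixed), "of", len(mixed), "relabeled")
```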

Python Implementation Example

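The code below focuses on the selective generation technique and implements the following steps:

- Sufficient context autorater
- Self-rated confidence computation
- Training a logistic-regression hallucination predictor
- Deciding between abstain and answer at inference time

Set your OpenAI API key and model name as needed.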

import openai
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV

class SufficientContextAutorater:
    def __init__(self, api_key: str, model_name: str = "gpt-4o-2024-08-06"):
        # openai>=1.0 removed openai.ChatCompletion; requests go through a client object
        self.client = openai.OpenAI(api_key=api_key)
        self.model_name = model_name

    def classify(self, query: str, context: str) -> bool:
        """Return True if the model judges the context sufficient for the query."""
        prompt = f"""
You are a sufficiency autorater. Given a question and its context, answer "Yes" if the context contains
enough information to provide a definitive answer to the question, or "No" otherwise.

Question: {query}

Context:
{context}

Answer Yes or No:"""
        res = self.client.chat.completions.create(
            model=self.model_name,
            messages=[{"role": "user", "content": prompt}],
            temperature=0
        )
        # Any reply starting with "yes" counts as sufficient; an empty reply counts as insufficient
        reply = res.choices[0].message.content or ""
        return reply.strip().lower().startswith("yes")

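def get_self_confidence(query: str, context: str, api_key: str, model_name: str="gpt-4o-2024-08-06") -> float:
    # openai>=1.0 removed openai.ChatCompletion; requests go through a client object
    client = openai.OpenAI(api_key=api_key)
    prompt = f"""
Provide the most likely answer to the question with its estimated probability of correctness.

Question: {query}
Context:
{context}

Output format:
Answer: <answer>
Probability: <value between 0 and 1>"""
    res = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    # Simple parsing: extract the number after "Probability:"
    reply = res.choices[0].message.content or ""
    for line in reply.splitlines():
        if "probability" in line.lower():
            try:
                value = float(line.split(":", 1)[1].strip())
            except (IndexError, ValueError):
                continue
            # Clamp to [0, 1] in case the model returns an out-of-range value
            return min(max(value, 0.0), 1.0)
    # Fall back to zero confidence when no probability could be parsed
    return 0.0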

def prepare_dataset(instances, autorater, api_key, model_name):
    """
    instances: list of (query, context, model_answer, ground_truth)
    returns: pandas.DataFrame with features and binary label for hallucination
    """
    abstentions = {"i don't know", "idk"}
    records = []
    for q, c, ans, gt in instances:
        suff = autorater.classify(q, c)
        conf = get_self_confidence(q, c, api_key, model_name)
        ans_norm = ans.strip().lower()
        # Hallucination = a non-abstaining answer that does not appear in the ground truth
        # (substring matching is only a crude correctness check)
        is_halluc = int(ans_norm not in gt.lower() and ans_norm not in abstentions)
        records.append({"sufficient": int(suff), "confidence": conf, "halluc": is_halluc})
    return pd.DataFrame(records)

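def train_selective_model(df):
    # Features: binary sufficiency label and self-rated confidence; target: hallucination (1) or not (0)
    X = df[["sufficient","confidence"]].values
    y = df["halluc"].values
    # 5-fold cross-validation selects the regularization strength by log loss
    clf = LogisticRegressionCV(cv=5, scoring="neg_log_loss", max_iter=1000)
    clf.fit(X, y)
    return clf

def should_answer(query, context, autorater, clf, api_key, model_name, threshold=0.5):
    """Return True to answer, False to abstain."""
    suff = autorater.classify(query, context)
    conf = get_self_confidence(query, context, api_key, model_name)
    # Classes are sorted, so with labels {0, 1} column 1 is the predicted hallucination probability
    prob_halluc = clf.predict_proba([[int(suff), conf]])[0][1]
    return prob_halluc < threshold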

The example above is a skeleton implementing the selective generation method from Section 5.1 of the paper. You can build on it by wiring in real data and model API calls to run your own experiments.
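
One fragile spot in a skeleton like this is parsing the "Probability:" line, since a model may reply with a percentage or omit the line entirely. The sketch below shows one more defensive way to handle that; `parse_probability` is a hypothetical helper, and returning `None` for unparseable replies (so the caller decides the fallback) is a design assumption.

```python
import re
from typing import Optional

# Matches e.g. "Probability: 0.85", "probability : .5", or "Probability: 85%"
_PROB_RE = re.compile(r"probability\s*:\s*([0-9]*\.?[0-9]+)\s*(%?)", re.IGNORECASE)


def parse_probability(text: str) -> Optional[float]:
    """Extract a probability in [0, 1] from a model reply, or None if absent."""
    match = _PROB_RE.search(text)
    if match is None:
        return None
    value = float(match.group(1))
    if match.group(2) == "%":
        value /= 100.0  # convert a percentage to a fraction
    return min(max(value, 0.0), 1.0)  # clamp out-of-range values


print(parse_probability("Answer: Paris\nProbability: 0.85"))
print(parse_probability("Answer: Paris\nProbability: 85%"))
print(parse_probability("Answer: Paris"))
```

Returning `None` instead of 0.0 lets the caller tell "the model reported zero confidence" apart from "the reply could not be parsed".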
