[Translation] Lost in the Middle: How Language Models Use Long Contexts

Based on the paper summary you provided, here are the key points about how language models use long contexts:

  • The paper analyzes language model performance on two tasks that require locating relevant information within a long input context: multi-document question answering and key-value retrieval.

  • It finds a U-shaped performance curve: models perform best when the relevant information is at the very beginning or end of the input context, and performance degrades significantly when it sits in the middle of a long context.

  • Performance decreases substantially as the input context grows longer, even for models with extended context lengths such as GPT-3.5-Turbo (16K). Extended-context models are not necessarily better at using their full context.

  • On a key-value retrieval task that requires only matching tokens, many models still struggle to retrieve the value for a key located in the middle of 140 or more key-value pairs.

  • Preliminary analysis suggests encoder-decoder models are relatively robust to the position of relevant information when evaluated on sequences within their training-time length, but degrade on longer sequences. Query-aware contextualization (placing the query both before and after the documents or key-value pairs) improves key-value retrieval but barely changes the trends in multi-document QA.

  • Even base LMs without instruction tuning show a U-shaped performance curve, indicating the effect is not solely due to instruction tuning.

  • A case study on open-domain QA finds that model performance saturates before retriever recall plateaus, indicating that models fail to make effective use of additional retrieved documents.
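
To make the key-value retrieval task and query-aware contextualization concrete, here is a minimal Python sketch. It assumes the input is a JSON object of random UUID keys and values with the queried pair placed at a chosen position; the function names (`make_pairs`, `build_kv_prompt`), the seed, and the exact prompt wording are illustrative assumptions, not the paper's code.

```python
import json
import random
import uuid


def make_pairs(num_pairs, seed=0):
    """Generate random UUID key-value pairs (deterministic for a given seed)."""
    rng = random.Random(seed)

    def new_uuid():
        return str(uuid.UUID(int=rng.getrandbits(128)))

    return [(new_uuid(), new_uuid()) for _ in range(num_pairs)]


def build_kv_prompt(pairs, target_index, query_aware=False):
    """Return (prompt, expected_value) for the pair at target_index.

    The prompt wording is an assumption made for illustration.
    """
    key, value = pairs[target_index]
    query = f'Key: "{key}"'
    # dict() keeps insertion order, so the target stays at target_index.
    data = json.dumps(dict(pairs), indent=1)
    parts = ["Extract the value corresponding to the specified key "
             "in the JSON object below."]
    if query_aware:
        # Query-aware contextualization: state the query before the data too.
        parts.append(query)
    parts.append(data)
    parts.append(query + "\nCorresponding value:")
    return "\n\n".join(parts), value


pairs = make_pairs(140)
# Place the queried pair in the middle of the 140 pairs.
prompt, expected = build_kv_prompt(pairs, target_index=70)
aware_prompt, _ = build_kv_prompt(pairs, target_index=70, query_aware=True)
```

Varying `target_index` from 0 to `len(pairs) - 1` moves the relevant pair through the context, which is the knob the position experiments turn.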

In summary, the paper provides evidence that current language models struggle to make full use of long input contexts, with performance degrading when they must access and use information from the middle of the context. The analysis offers insight into model limitations and introduces new evaluation protocols for future long-context models.
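
The position-controlled evaluation summarized above can be sketched roughly as follows: insert the document containing the answer at each position among distractor documents, query the model, and record accuracy per position. This is a hedged sketch, not the paper's harness. `answer_fn` stands for any prompt-to-text model call, `toy_model` is a deliberately crude stand-in that only "reads" the first and last documents, and the example data and prompt layout are made up for illustration.

```python
def build_qa_prompt(question, gold_doc, distractors, gold_position):
    """Place the gold document at gold_position among the distractors."""
    docs = list(distractors)
    docs.insert(gold_position, gold_doc)
    lines = [f"Document [{i + 1}] {doc}" for i, doc in enumerate(docs)]
    return "\n".join(lines) + f"\n\nQuestion: {question}\nAnswer:"


def accuracy_by_position(answer_fn, examples, num_positions):
    """Accuracy per gold position; a hit means any answer string
    appears in the model output (case-insensitive)."""
    results = {}
    for pos in range(num_positions):
        hits = 0
        for question, gold_doc, distractors, answers in examples:
            prompt = build_qa_prompt(question, gold_doc, distractors, pos)
            output = answer_fn(prompt).lower()
            hits += any(ans.lower() in output for ans in answers)
        results[pos] = hits / len(examples)
    return results


def toy_model(prompt):
    """Hypothetical stand-in: echoes only the first and last documents."""
    docs = [ln for ln in prompt.split("\n") if ln.startswith("Document [")]
    return docs[0] + " " + docs[-1]


# Made-up example: one question, one gold document, four distractors.
examples = [(
    "What is the capital of France?",
    "The capital of France is Paris.",
    ["Berlin is the capital of Germany.",
     "Rome is the capital of Italy.",
     "Madrid is the capital of Spain.",
     "Lisbon is the capital of Portugal."],
    ["Paris"],
)]
results = accuracy_by_position(toy_model, examples, num_positions=5)
```

With the toy model, `results` is high at the first and last positions and zero in between, which mimics the U shape by construction; a real model would be plugged in as `answer_fn` to measure the actual curve.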

[2307.03172] Lost in the Middle: How Language Models Use Long Contexts
https://arxiv.org/abs/2307.03172