[ChatGPT] How Is ChatGPT’s Behavior Changing over Time?

This is an interesting analysis of how ChatGPT's behavior has changed over time between March and June 2023. Here are a few key points:

  • The study evaluated GPT-3.5 and GPT-4 on four diverse tasks: math problem solving, answering sensitive questions, code generation, and visual reasoning. This allows drift to be assessed across different capabilities.

  • Significant performance drifts were observed between the March 2023 and June 2023 versions of both models across the tasks. For example, GPT-4's accuracy on math problems dropped from 97.6% to 2.4%, while GPT-3.5's accuracy improved from 7.4% to 86.8%.

  • The changes were not uniformly positive or negative. For instance, GPT-4 became less willing to answer sensitive questions directly from March to June, suggesting improved safety, but it also provided less explanatory rationale.

  • The study found issues such as deviations from the requested prompting style (e.g., not following chain-of-thought instructions on math problems), reduced direct executability of generated code, and inconsistencies in visual reasoning.

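The executability finding can be checked mechanically: the paper reports that later model snapshots more often wrapped generated code in surrounding non-code text (such as markdown fences), so the raw response no longer runs as-is. A minimal sketch of such a check in Python, with illustrative function names not taken from the paper:

```python
import ast

def is_directly_executable(response: str) -> bool:
    """Return True if the raw model response parses as Python with no cleanup."""
    try:
        ast.parse(response)
        return True
    except SyntaxError:
        return False

def strip_markdown_fences(response: str) -> str:
    """Drop ```-fence lines that often wrap LLM-generated code."""
    kept = [line for line in response.splitlines()
            if not line.strip().startswith("```")]
    return "\n".join(kept)

raw = "```python\nprint(1 + 1)\n```"
print(is_directly_executable(raw))                          # False: fences break parsing
print(is_directly_executable(strip_markdown_fences(raw)))   # True after cleanup
```

Scoring only the raw response, as the study does for "direct executability," penalizes exactly this kind of formatting drift even when the code inside the fences is correct.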
  • The authors highlight the need for continuous monitoring and evaluation of LLMs as their behavior can vary substantially even over a short 3-month timeframe. For applications relying on LLM services, regular assessments are recommended.

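The monitoring recommendation can be made concrete as a small regression check run against each new model snapshot. This is only a sketch, assuming you keep a fixed benchmark and a stored baseline accuracy; the names and the 5% tolerance are illustrative choices, not from the paper:

```python
def accuracy(predictions, labels):
    """Fraction of benchmark items the model snapshot answered correctly."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def check_drift(current_acc: float, baseline_acc: float,
                tolerance: float = 0.05) -> bool:
    """Flag a snapshot whose accuracy moved more than `tolerance` from baseline."""
    return abs(current_acc - baseline_acc) > tolerance

# Example mirroring the paper's GPT-4 math result (97.6% -> 2.4%):
print(check_drift(0.024, 0.976))    # True: a large regression is flagged
print(check_drift(0.970, 0.976))    # False: small fluctuation is tolerated
```

Running such a check on every dated model version, rather than assuming stable behavior, is the kind of regular assessment the authors recommend for applications built on LLM services.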
Overall, this analysis provides valuable insights into how major LLMs like GPT-3.5 and GPT-4 are evolving and underscores the importance of tracking their capabilities over time rather than assuming consistency. The results reveal potential pitfalls for developers working with these models as well as areas needing improvement. More longitudinal studies like this will be important for understanding and steering the development of LLMs responsibly.

[2307.09009] How is ChatGPT's behavior changing over time?

https://arxiv.org/abs/2307.09009