Building effective agents

Over the past year, we've worked with dozens of teams building large language model (LLM) agents across industries. Consistently, the most successful implementations weren't using complex frameworks or specialized libraries. Instead, they were building with simple, composable patterns.

In this post, we share what we’ve learned from working with our customers and building agents ourselves, and give practical advice for developers on building effective agents.

What are agents?

"Agent" can be defined in several ways. Some customers define agents as fully autonomous systems that operate independently over extended periods, using various tools to accomplish complex tasks. Others use the term to describe more prescriptive implementations that follow predefined workflows. At Anthropic, we categorize all these variations as agentic systems, but draw an important architectural distinction between workflows and agents:

"AI 智能体"可以有多种定义方式。一些客户将其定义为完全自主的系统,这类系统能够长期独立运行,并运用各种工具来完成复杂任务。另一些则用这个术语来描述更具规定性的实现方式,即遵循预定义工作流程的系统。在 Anthropic,我们将所有这些变体都归类为智能代理系统,但在工作流和 AI 智能体之间做出了一个重要的架构区分:

  • Workflows are systems where LLMs and tools are orchestrated through predefined code paths.
  • Agents, on the other hand, are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks.

Below, we will explore both types of agentic systems in detail. In Appendix 1 ("Agents in Practice"), we describe two domains where customers have found particular value in using these kinds of systems.

When (and when not) to use agents

When building applications with LLMs, we recommend finding the simplest solution possible, and only increasing complexity when needed. This might mean not building agentic systems at all. Agentic systems often trade latency and cost for better task performance, and you should consider when this tradeoff makes sense.

When more complexity is warranted, workflows offer predictability and consistency for well-defined tasks, whereas agents are the better option when flexibility and model-driven decision-making are needed at scale. For many applications, however, optimizing single LLM calls with retrieval and in-context examples is usually enough.

When and how to use frameworks

There are many frameworks that make agentic systems easier to implement, including:

  • LangGraph from LangChain;
  • Amazon Bedrock's AI Agent framework;
  • Rivet, a drag and drop GUI LLM workflow builder; and
  • Vellum, another GUI tool for building and testing complex workflows.

These frameworks make it easy to get started by simplifying standard low-level tasks like calling LLMs, defining and parsing tools, and chaining calls together. However, they often create extra layers of abstraction that can obscure the underlying prompts and responses, making them harder to debug. They can also make it tempting to add complexity when a simpler setup would suffice.

We suggest that developers start by using LLM APIs directly: many patterns can be implemented in a few lines of code. If you do use a framework, ensure you understand the underlying code. Incorrect assumptions about what's under the hood are a common source of customer error.
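
For example, a complete single-call program using the Anthropic Python SDK is only a few lines (a minimal sketch; the model name is illustrative):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # illustrative model choice
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize this support ticket: ..."}],
)
print(response.content[0].text)
```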

See our cookbook for some sample implementations.

Building blocks, workflows, and agents

In this section, we’ll explore the common patterns for agentic systems we’ve seen in production. We'll start with our foundational building block — the augmented LLM — and progressively increase complexity, from simple compositional workflows to autonomous agents.

Building block: The augmented LLM

The basic building block of agentic systems is an LLM enhanced with augmentations such as retrieval, tools, and memory. Our current models can actively use these capabilities — generating their own search queries, selecting appropriate tools, and determining what information to retain.

[Diagram: The augmented LLM]

We recommend focusing on two key aspects of the implementation: tailoring these capabilities to your specific use case and ensuring they provide an easy, well-documented interface for your LLM. While there are many ways to implement these augmentations, one approach is through our recently released Model Context Protocol, which allows developers to integrate with a growing ecosystem of third-party tools with a simple client implementation.

For the remainder of this post, we'll assume each LLM call has access to these augmented capabilities.
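
As a sketch of what an augmentation looks like in practice, the snippet below exposes a hypothetical search_knowledge_base tool through the Messages API's tools parameter. The tool name, description, and schema are invented for illustration; the point is that the model generates its own queries against it:

```python
import anthropic

client = anthropic.Anthropic()

# Hypothetical retrieval tool; name, description, and schema are illustrative.
tools = [{
    "name": "search_knowledge_base",
    "description": (
        "Search the product knowledge base and return the most relevant "
        "article snippets. Use this before answering product questions."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "A concise search query."},
        },
        "required": ["query"],
    },
}]

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # illustrative model choice
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "How do I rotate my API key?"}],
)
# If the model chose to search, response.content contains a tool_use block
# holding the query the model generated on its own.
```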

Workflow: Prompt chaining

Prompt chaining decomposes a task into a sequence of steps, where each LLM call processes the output of the previous one. You can add programmatic checks (see "gate" in the diagram below) on any intermediate steps to ensure that the process is still on track.

[Diagram: The prompt chaining workflow]

When to use this workflow: This workflow is ideal for situations where the task can be easily and cleanly decomposed into fixed subtasks. The main goal is to trade off latency for higher accuracy, by making each LLM call an easier task.

Examples where prompt chaining is useful:

  • Generating marketing copy, then translating it into a different language.
  • Writing an outline of a document, checking that the outline meets certain criteria, then writing the document based on the outline.
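
Here is a minimal sketch of the first example, assuming a small llm_call helper that wraps one Messages API call (the helper is reused by the sketches that follow; the word-count gate and prompts are illustrative):

```python
import anthropic

client = anthropic.Anthropic()

def llm_call(prompt: str) -> str:
    """One prompt in, one text completion out."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # illustrative model choice
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

def marketing_pipeline(product: str) -> str:
    copy = llm_call(f"Write one paragraph of marketing copy for: {product}")
    # Gate: a programmatic check between steps keeps the chain on track.
    if len(copy.split()) > 150:
        copy = llm_call(f"Shorten to under 150 words, keeping the key claims:\n\n{copy}")
    return llm_call(f"Translate into Spanish, preserving the tone:\n\n{copy}")
```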

Workflow: Routing

Routing classifies an input and directs it to a specialized followup task. This workflow allows for separation of concerns, and building more specialized prompts. Without this workflow, optimizing for one kind of input can hurt performance on other inputs.

[Diagram: The routing workflow]

When to use this workflow: Routing works well for complex tasks where there are distinct categories that are better handled separately, and where classification can be handled accurately, either by an LLM or a more traditional classification model/algorithm.

Examples where routing is useful:

  • Directing different types of customer service queries (general questions, refund requests, technical support) into different downstream processes, prompts, and tools.
  • Routing easy/common questions to smaller models like Claude 3.5 Haiku and hard/unusual questions to more capable models like Claude 3.5 Sonnet to optimize cost and speed.
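
A sketch of this pattern, reusing the llm_call helper from the prompt chaining example; the route labels and specialized prompts are hypothetical:

```python
# Hypothetical specialized prompts, one per query category.
ROUTES = {
    "general": "You answer general product questions.\n\nCustomer: {query}",
    "refund": "You process refund requests under our refund policy.\n\nCustomer: {query}",
    "technical": "You are a technical support specialist.\n\nCustomer: {query}",
}

def handle_query(query: str) -> str:
    label = llm_call(
        f"Classify this customer query as one of: {', '.join(ROUTES)}. "
        f"Reply with the label only.\n\nQuery: {query}"
    ).strip().lower()
    # Fall back to the general route if the classifier output is unexpected.
    prompt = ROUTES.get(label, ROUTES["general"])
    return llm_call(prompt.format(query=query))
```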

Workflow: Parallelization

LLMs can sometimes work simultaneously on a task and have their outputs aggregated programmatically. This workflow, parallelization, manifests in two key variations:

  • Sectioning: Breaking a task into independent subtasks run in parallel.
  • Voting: Running the same task multiple times to get diverse outputs.

[Diagram: The parallelization workflow]

When to use this workflow: Parallelization is effective when the divided subtasks can be parallelized for speed, or when multiple perspectives or attempts are needed for higher confidence results. For complex tasks with multiple considerations, LLMs generally perform better when each consideration is handled by a separate LLM call, allowing focused attention on each specific aspect.

Examples where parallelization is useful:

  • Sectioning:
    • Implementing guardrails where one model instance processes user queries while another screens them for inappropriate content or requests. This tends to perform better than having the same LLM call handle both guardrails and the core response.
    • Automating evals for evaluating LLM performance, where each LLM call evaluates a different aspect of the model's performance on a given prompt.
  • Voting:
    • Reviewing a piece of code for vulnerabilities, where several different prompts review and flag the code if they find a problem.
    • Evaluating whether a given piece of content is inappropriate, with multiple prompts evaluating different aspects or requiring different vote thresholds to balance false positives and negatives.
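
A sketch of the voting variant, again reusing llm_call; the majority threshold is an arbitrary illustration of tuning for false positives versus false negatives:

```python
from concurrent.futures import ThreadPoolExecutor

def flag_vulnerability(code: str, n_votes: int = 3) -> bool:
    prompt = (
        "Review this code for security vulnerabilities. Start your reply "
        f"with VULNERABLE or SAFE, then explain.\n\n{code}"
    )
    # Run identical reviews in parallel; sampling gives diverse outputs.
    with ThreadPoolExecutor(max_workers=n_votes) as pool:
        votes = list(pool.map(lambda _: llm_call(prompt), range(n_votes)))
    # Aggregate programmatically: flag only on a majority vote.
    return sum(v.strip().upper().startswith("VULNERABLE") for v in votes) > n_votes / 2
```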

Workflow: Orchestrator-workers

In the orchestrator-workers workflow, a central LLM dynamically breaks down tasks, delegates them to worker LLMs, and synthesizes their results.

[Diagram: The orchestrator-workers workflow]

When to use this workflow: This workflow is well-suited for complex tasks where you can’t predict the subtasks needed (in coding, for example, the number of files that need to be changed and the nature of the change in each file likely depend on the task). Whereas it’s topographically similar, the key difference from parallelization is its flexibility — subtasks aren't pre-defined, but determined by the orchestrator based on the specific input.

Examples where orchestrator-workers is useful:

  • Coding products that make complex changes to multiple files each time.
  • Search tasks that involve gathering and analyzing information from multiple sources for possible relevant information.
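
A sketch of the pattern, reusing llm_call; note the assumption (flagged in the comments) that the orchestrator returns well-formed JSON, which production code would need to validate:

```python
import json

def orchestrate(task: str) -> str:
    # The orchestrator chooses the subtasks at runtime; they are not pre-defined.
    plan = llm_call(
        "Break the following task into independent subtasks. Reply with JSON "
        'only, e.g. {"subtasks": ["...", "..."]}.\n\n' + f"Task: {task}"
    )
    subtasks = json.loads(plan)["subtasks"]  # assumes the model emits valid JSON
    # Workers execute the dynamically chosen subtasks.
    results = [llm_call(f"Complete this subtask:\n\n{s}") for s in subtasks]
    # The orchestrator synthesizes the workers' results.
    return llm_call(
        f"Task: {task}\n\nSubtask results:\n\n" + "\n\n".join(results)
        + "\n\nSynthesize these results into a single coherent answer."
    )
```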

Workflow: Evaluator-optimizer

In the evaluator-optimizer workflow, one LLM call generates a response while another provides evaluation and feedback in a loop.

[Diagram: The evaluator-optimizer workflow]

When to use this workflow: This workflow is particularly effective when we have clear evaluation criteria, and when iterative refinement provides measurable value. The two signs of good fit are, first, that LLM responses can be demonstrably improved when a human articulates their feedback; and second, that the LLM can provide such feedback. This is analogous to the iterative writing process a human writer might go through when producing a polished document.

Examples where evaluator-optimizer is useful:

  • Literary translation where there are nuances that the translator LLM might not capture initially, but where an evaluator LLM can provide useful critiques.
  • Complex search tasks that require multiple rounds of searching and analysis to gather comprehensive information, where the evaluator decides whether further searches are warranted.
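
A sketch of the loop, reusing llm_call; the PASS convention is an illustrative stopping signal, not a prescribed protocol:

```python
def generate_with_feedback(task: str, max_rounds: int = 3) -> str:
    draft = llm_call(f"Complete this task:\n\n{task}")
    for _ in range(max_rounds):
        feedback = llm_call(
            "Evaluate the response below against the task. Reply with exactly "
            "PASS if no changes are needed; otherwise list concrete "
            f"improvements.\n\nTask: {task}\n\nResponse:\n{draft}"
        )
        if feedback.strip() == "PASS":
            break
        draft = llm_call(
            f"Task: {task}\n\nDraft:\n{draft}\n\nFeedback:\n{feedback}\n\n"
            "Rewrite the draft to address the feedback."
        )
    return draft
```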

Agents

Agents are emerging in production as LLMs mature in key capabilities — understanding complex inputs, engaging in reasoning and planning, using tools reliably, and recovering from errors. Agents begin their work with either a command from, or interactive discussion with, the human user. Once the task is clear, agents plan and operate independently, potentially returning to the human for further information or judgement. During execution, it's crucial for the agents to gain "ground truth" from the environment at each step (such as tool call results or code execution) to assess their progress. Agents can then pause for human feedback at checkpoints or when encountering blockers. The task often terminates upon completion, but it's also common to include stopping conditions (such as a maximum number of iterations) to maintain control.

Agents can handle sophisticated tasks, but their implementation is often straightforward. They are typically just LLMs using tools based on environmental feedback in a loop. It is therefore crucial to design toolsets and their documentation clearly and thoughtfully. We expand on best practices for tool development in Appendix 2 ("Prompt Engineering your Tools").
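
A sketch of such a loop using the Messages API's tool-use protocol; it reuses the client from the earlier sketches, takes tool definitions like the one shown under the augmented LLM, and assumes a hypothetical execute_tool dispatcher that runs the requested tool and returns its output as a string:

```python
def run_agent(task: str, tools: list, max_iterations: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_iterations):  # stopping condition to maintain control
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",  # illustrative model choice
            max_tokens=4096,
            tools=tools,
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            return response.content[0].text  # the model considers the task done
        messages.append({"role": "assistant", "content": response.content})
        # Feed "ground truth" from the environment back to the model.
        results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": execute_tool(block.name, block.input),  # hypothetical dispatcher
            }
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
    return "Stopped: iteration limit reached."
```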

[Diagram: Autonomous agent]

When to use agents: Agents can be used for open-ended problems where it’s difficult or impossible to predict the required number of steps, and where you can’t hardcode a fixed path. The LLM will potentially operate for many turns, and you must have some level of trust in its decision-making. Agents' autonomy makes them ideal for scaling tasks in trusted environments.

The autonomous nature of agents means higher costs, and the potential for compounding errors. We recommend extensive testing in sandboxed environments, along with the appropriate guardrails.

Examples where agents are useful:

The following examples are from our own implementations:

  • A coding agent to resolve SWE-bench tasks, which involve edits to many files based on a task description;
  • Our "computer use" reference implementation, where Claude uses a computer to accomplish tasks.

[Diagram: High-level flow of a coding agent]

Combining and customizing these patterns

These building blocks aren't prescriptive. They're common patterns that developers can shape and combine to fit different use cases. The key to success, as with any LLM features, is measuring performance and iterating on implementations. To repeat: you should consider adding complexity only when it demonstrably improves outcomes.

Summary

Success in the LLM space isn't about building the most sophisticated system. It's about building the right system for your needs. Start with simple prompts, optimize them with comprehensive evaluation, and add multi-step agentic systems only when simpler solutions fall short.

When implementing agents, we try to follow three core principles:

  1. Maintain simplicity in your agent's design.
  2. Prioritize transparency by explicitly showing the agent's planning steps.
  3. Carefully craft your agent-computer interface (ACI) through thorough tool documentation and testing.

Frameworks can help you get started quickly, but don't hesitate to reduce abstraction layers and build with basic components as you move to production. By following these principles, you can create agents that are not only powerful but also reliable, maintainable, and trusted by their users.

Acknowledgements

Written by Erik Schluntz and Barry Zhang. This work draws upon our experiences building agents at Anthropic and the valuable insights shared by our customers, for which we're deeply grateful.

Appendix 1: Agents in practice

Our work with customers has revealed two particularly promising applications for AI agents that demonstrate the practical value of the patterns discussed above. Both applications illustrate how agents add the most value for tasks that require both conversation and action, have clear success criteria, enable feedback loops, and integrate meaningful human oversight.

A. Customer support

Customer support combines familiar chatbot interfaces with enhanced capabilities through tool integration. This is a natural fit for more open-ended agents because:

  • Support interactions naturally follow a conversation flow while requiring access to external information and actions;
  • Tools can be integrated to pull customer data, order history, and knowledge base articles;
  • Actions such as issuing refunds or updating tickets can be handled programmatically; and
  • Success can be clearly measured through user-defined resolutions.

Several companies have demonstrated the viability of this approach through usage-based pricing models that charge only for successful resolutions, showing confidence in their agents' effectiveness.

B. Coding agents

The software development space has shown remarkable potential for LLM features, with capabilities evolving from code completion to autonomous problem-solving. Agents are particularly effective because:

  • Code solutions are verifiable through automated tests;
  • Agents can iterate on solutions using test results as feedback;
  • The problem space is well-defined and structured; and
  • Output quality can be measured objectively.

In our own implementation, agents can now solve real GitHub issues in the SWE-bench Verified benchmark based on the pull request description alone. However, whereas automated testing helps verify functionality, human review remains crucial for ensuring solutions align with broader system requirements.

Appendix 2: Prompt engineering your tools

No matter which agentic system you're building, tools will likely be an important part of your agent. Tools enable Claude to interact with external services and APIs by specifying their exact structure and definition in our API. When Claude responds, it will include a tool use block in the API response if it plans to invoke a tool. Tool definitions and specifications should be given just as much prompt engineering attention as your overall prompts. In this brief appendix, we describe how to prompt engineer your tools.

There are often several ways to specify the same action. For instance, you can specify a file edit by writing a diff, or by rewriting the entire file. For structured output, you can return code inside markdown or inside JSON. In software engineering, differences like these are cosmetic and can be converted losslessly from one to the other. However, some formats are much more difficult for an LLM to write than others. Writing a diff requires knowing how many lines are changing in the chunk header before the new code is written. Writing code inside JSON (compared to markdown) requires extra escaping of newlines and quotes.

Our suggestions for deciding on tool formats are the following:

  • Give the model enough tokens to "think" before it writes itself into a corner.
  • Keep the format close to what the model has seen naturally occurring in text on the internet.
  • Make sure there's no formatting "overhead" such as having to keep an accurate count of thousands of lines of code, or string-escaping any code it writes.

One rule of thumb is to think about how much effort goes into human-computer interfaces (HCI), and plan to invest just as much effort in creating good agent-computer interfaces (ACI). Here are some thoughts on how to do so:

  • Put yourself in the model's shoes. Is it obvious how to use this tool, based on the description and parameters, or would you need to think carefully about it? If so, then it's probably also true for the model. A good tool definition often includes example usage, edge cases, input format requirements, and clear boundaries from other tools.
  • How can you change parameter names or descriptions to make things more obvious? Think of this as writing a great docstring for a junior developer on your team. This is especially important when using many similar tools.
  • Test how the model uses your tools: Run many example inputs in our workbench to see what mistakes the model makes, and iterate.
  • Poka-yoke your tools. Change the arguments so that it is harder to make mistakes.

While building our agent for SWE-bench, we actually spent more time optimizing our tools than the overall prompt. For example, we found that the model would make mistakes with tools using relative filepaths after the agent had moved out of the root directory. To fix this, we changed the tool to always require absolute filepaths — and we found that the model used this method flawlessly.
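
A sketch of how such a constraint might be encoded in a tool definition (this is illustrative, not our actual SWE-bench tooling): the description states the requirement up front, and the schema rejects relative paths mechanically.

```python
# Hypothetical file-editing tool written with the ACI in mind: the
# docstring-like description leads with the absolute-path requirement,
# and the schema pattern enforces it.
edit_file_tool = {
    "name": "edit_file",
    "description": (
        "Overwrite a file with new contents. `path` must be an ABSOLUTE "
        "filepath (e.g. /repo/src/main.py); relative paths are rejected. "
        "Read the file first if you have not seen its current contents."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {
                "type": "string",
                "pattern": "^/",  # reject relative paths outright
                "description": "Absolute path to the file being edited.",
            },
            "content": {
                "type": "string",
                "description": "The complete new contents of the file.",
            },
        },
        "required": ["path", "content"],
    },
}
```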
