AI agents and tool use: how language models act on the world

AI agents pair language models with external tools that execute real actions, handling software, enterprise workflows, and computer control at frontier scale in 2026.

AI· ·3 시각 ·2026년 7월 3일

What it is

AI agents are systems in which a language model dynamically directs its own processes and tool usage, rather than simply responding to a prompt. Anthropic drew this distinction explicitly in a December 2024 research post: "workflows" follow pre-defined call sequences; "agents" decide their own sequence at runtime. The key mechanism is the tool call: the model outputs a structured request for an external capability, execution happens outside the model, and the result feeds back as input. Tools span web search, code execution, database queries, file-system operations, and arbitrary API calls.

Agentic behavior emerges when tool calls chain across multiple steps without human intervention. The dominant production architecture as of mid-2026 is "orchestrator-workers": a central LLM decomposes a task and delegates sub-tasks to worker LLMs, each using tools in narrower, predefined ways. Agentic AI is shifting from experiment to production infrastructure across software, finance, and healthcare, with direct consequences for labor markets and for the security of any system an agent can reach.

History

Function calling, the forerunner of modern tool use, entered US company OpenAI's API in June 2023, replacing prompt-engineering workarounds by embedding the model-to-tool handoff inside the API contract.

The next structural shift was the Model Context Protocol (MCP), open-sourced by US company Anthropic in November 2024. MCP defined a vendor-neutral client-server interface so any model could call any tool over a standard transport, ending per-framework bespoke wiring. By early 2026, MCP had approximately 97 million downloads and over 200 server implementations. In December 2025, Anthropic donated MCP to the Agentic AI Foundation (AAIF), a directed fund of the Linux Foundation co-founded with Block and OpenAI.

SWE-bench (2023) first measured agentic software-engineering at scale. By 2026, the leading evaluations shifted to Terminal-Bench 2.0, OSWorld-Verified, and BrowseComp, which score sustained multi-step task completion rather than single-shot accuracy.

Current state

As of 3 July 2026, Claude Mythos 5 leads the BenchLM.ai weighted agentic leaderboard at 100.0, followed by Claude Sonnet 5 (97.9) and OpenAI's GPT-5.5 (96.6). Frontier models can sustain autonomous work for close to five hours on structured tasks; task-length capacity has roughly doubled every 196 days since 2024, per Prosus's 2026 State of AI Agents report. Terminal-Bench 2.0 carries 28% of BenchLM.ai's agentic weighting, OSWorld-Verified (computer use across real GUIs) carries 24%, and BrowseComp (web-research agents) 18%.

Computer-use agents, which drive arbitrary GUIs without requiring an API, are embedded in OpenAI's GPT-5.5 and Codex, serving over 2 million weekly users. The protocol stack has settled on MCP for vertical agent-to-tool calls and Agent2Agent (A2A) for horizontal agent-to-agent delegation, as covered in Agent protocols harden: MCP goes stateless, A2A passes 150 orgs. Coding agents are the primary commercial proving ground. The orchestration layer has become the defensible moat: Meta's US$2 billion acquisition of US startup Manus in 2026 signalled that infrastructure, not raw model weights, attracts the largest capital allocations.

Relationships

This beat spans the three US frontier labs, OpenAI, Anthropic, and Google DeepMind, alongside open-weight challengers from China, including DeepSeek and Alibaba's Qwen series. Orchestration startups, Cognition, Imbue, and Manus (now Meta), sit one layer above the models. Standards governance runs through the AAIF at the Linux Foundation, where Anthropic, OpenAI, and Block co-own the MCP roadmap; Google governs A2A under the same umbrella.

The three related nodes show how the layers connect: coding agents are the leading commercial instance; computer-use models extend tool use to arbitrary GUIs; MCP and A2A are the plumbing. In a typical production flow, a coding agent calls a file-system tool via MCP, delegates a browser subtask to a sub-agent via A2A, and an orchestrator synthesizes results, the three-layer architecture around which enterprise procurement now turns.

What to watch

The 28 July 2026 MCP stateless specification and the security record of MCP Apps' sandboxed-iframe architecture. The Q3 2026 joint MCP/A2A specification, where shared authorization is the missing piece for cross-vendor multi-agent deployments. Whether open-weight agents from China, DeepSeek-V4 and Qwen 4, close the task-completion gap with frontier closed-weight models. Error and oversight regimes for computer-use agents across live enterprise systems, where a mis-step has real consequences and US and EU regulatory frameworks have not kept pace. If the 196-day doubling rate holds, agents capable of unsupervised 24-hour runs would arrive before end of 2026.

지역별 시각 · 2

▸ investor analysis

Prosus (State of AI Agents 2026) · Global · en

Prosus's 2026 survey documenting frontier models sustaining nearly five hours of autonomous work, task-length capacity doubling every 196 days, and orchestration infrastructure emerging as the new competitive moat over raw model capability.

출처 ↗

▸ benchmark tracker

BenchLM.ai (Agentic Leaderboard) · Global · en

Real-time agentic benchmark leaderboard as of July 2026, weighting Terminal-Bench 2.0 at 28%, OSWorld-Verified at 24%, and BrowseComp at 18%, with Claude Mythos 5 leading at 100.0, Claude Sonnet 5 at 97.9, and GPT-5.5 at 96.6.