What is the difference between an AI agent and a chatbot?

A chatbot responds to individual prompts in a stateless, turn-by-turn manner. An AI agent maintains state, plans multi-step tasks, uses tools (like APIs, databases, or code execution), and pursues goals autonomously across multiple interactions — often without step-by-step human guidance.

What are the most popular AI agent frameworks in 2026?

The leading frameworks include LangChain, AutoGen (Microsoft), CrewAI, the OpenAI Agents SDK, Hermes Agent (Nous Research), Openclaw, Semantic Kernel (Microsoft), and Haystack. Each framework has different strengths — from rapid prototyping to production-grade enterprise deployments.

What is MCP (Model Context Protocol)?

MCP is an open protocol developed by Anthropic that standardizes how AI models connect to external tools, data sources, and services. It defines a client-server architecture where agents discover and invoke capabilities — analogous to USB for AI tool integration — and is rapidly becoming the industry standard for agent-tool connectivity.

What is prompt injection and why is it dangerous for agents?

Prompt injection is a security attack where malicious instructions are embedded in data an agent processes — such as a web page, email, or document. It is especially dangerous for agents because their tool-use capabilities create a larger attack surface, and a compromised agent could execute harmful actions through its connected tools and APIs.

How do multi-agent systems work?

Multi-agent systems (MAS) use multiple specialized AI agents that collaborate, communicate, and coordinate to solve complex problems. Each agent typically has a defined role, and the system includes protocols for task delegation, information sharing, conflict resolution, and orchestration to produce coherent outcomes.

What is the difference between fine-tuning and RAG?

Fine-tuning permanently modifies a model's weights by training on domain-specific data, improving its inherent capabilities. RAG keeps the model unchanged but retrieves relevant external information at query time. Fine-tuning is better for teaching new skills or formats; RAG is better for giving access to frequently updated knowledge without retraining.

What does an autonomous enterprise look like?

An autonomous enterprise is an organization where AI agents handle the majority of operational, analytical, and decision-support tasks, with humans shifting to strategic oversight and exception handling. It represents the endpoint of the automation journey from RPA through AI agents, where agents run core business processes end-to-end with minimal human intervention.

Should I build AI agents in Python or TypeScript?

Python remains dominant with the deepest ecosystem of frameworks and tools. TypeScript is the fastest-growing alternative and is better if your stack is already JavaScript/TypeScript or if you're building agents integrated with web applications. Frameworks like Mastra and Vercel AI SDK are rapidly closing the feature parity gap.

How do open source agent frameworks compare to hosted platforms?

Code-first frameworks like LangGraph and CrewAI are libraries you integrate into your own application. Hosted platforms like Dify (~143k stars), LangSmith, and deepset Cloud operate at a higher abstraction level. They are often complementary — you might build agents with LangGraph and monitor them with LangSmith.

The Agent Report

Q: What is an AI agent?

An AI agent is a software system that uses a large language model (LLM) as its reasoning engine to perceive its environment, make decisions, and take actions autonomously. Unlike a chatbot, an agent can plan multi-step tasks, use external tools, and adapt its behavior based on outcomes.

Q: What is RAG in the context of AI agents?

RAG (Retrieval-Augmented Generation) grounds an LLM's responses in external knowledge by retrieving relevant documents from a database before generating an answer. In agent systems, RAG is often used as a tool — the agent queries a knowledge base, retrieves context, and uses that context to inform decisions or responses.

Q: What is sandboxing in AI agent systems?

Sandboxing runs agent code execution, tool invocations, or entire agent instances in isolated environments with restricted permissions. It prevents an agent from causing damage if it makes a mistake or is compromised — critical for agents that execute arbitrary code, access file systems, or interact with production infrastructure.

Meta MTIA: Four Custom AI Chips in Two Years — How Meta Is Powering Llama at Global Scale

2026-05-28T14:00:00+00:00

March 11, 2026 — Meta published a detailed technical overview of its Meta Training and Inference Accelerator (MTIA) family, revealing four successive chip generations — MTIA 300, 400, 450, and 500 — designed and deployed in rapid succession over roughly two years. The chips form the hardware backbone powering Llama inference, Muse Spark deployments, and the ranking and recommendation systems that drive billions of daily interactions across WhatsApp, Instagram, Facebook, and Messenger.

For the open-source AI community, the MTIA story matters because it directly shapes what Meta can deliver with Llama. Custom silicon designed in tight iteration loops with the model team means Meta can optimize the hardware-software stack end-to-end — and the fruits of that investment are now being pressed into service for GenAI workloads at unprecedented scale.

Why Custom Silicon Matters for Llama and GenAI

Every day, billions of people across Meta’s platforms use AI-powered features — personalized recommendations, real-time translation, AI assistants, content moderation, and more. Serving this workload at the lowest possible cost requires purpose-built hardware. Off-the-shelf GPUs, while powerful, carry overhead in memory bandwidth, interconnect topology, and instruction set generality that Meta cannot afford at its scale.

Meta’s response is MTIA: a family of custom ASICs developed in close partnership with Broadcom. The chip family began with two earlier generations (MTIA 100 and MTIA 200, detailed at ISCA’23 and ISCA’25) that were initially optimized for ranking and recommendation (R&R) inference — the dominant AI workload before GenAI took off.

Then GenAI happened. And Meta pivoted hard.

The Four Generations of MTIA

MTIA 300 — The Foundation

MTIA 300 was designed primarily for R&R training workloads. Key innovations include:

Built-in NIC chiplets for low-latency communication
Dedicated message engines for offloading communication collectives
Near-memory compute for reduction-based collectives

While optimized for R&R, these building blocks — low-latency, high-bandwidth communication components — proved foundational for GenAI inference in subsequent generations. MTIA 300 is in production today for R&R training.

MTIA 400 — The GenAI Pivot

As the GenAI wave surged, Meta evolved MTIA 300 into the MTIA 400, rebalancing the design to better support GenAI models while retaining R&R capability. The chip features a 72-accelerator scale-up domain and delivers performance competitive with leading commercial products. MTIA 400 has completed lab testing and is on the path to data-center deployment.

MTIA 450 — Inference-Optimized

Anticipating massive GenAI inference demand, MTIA 400 transitioned into MTIA 450 with specific optimizations for inference workloads. The standout improvement: HBM bandwidth was doubled from MTIA 400, making it significantly higher than existing commercial alternatives. Meta also introduced low-precision data types co-designed for inference workloads. MTIA 450 is scheduled for mass deployment in early 2027.

MTIA 500 — The Flagship

Continuing the GenAI inference focus, MTIA 500 increases HBM bandwidth by an additional 50% over MTIA 450 and introduces further innovations in low-precision data types. Scheduled for mass deployment in 2027, it represents the culmination of two years of relentless iteration.

The Numbers: 25× Compute Growth in Two Years

The raw specs tell the story of a team executing at remarkable velocity:

Metric	MTIA 300 → MTIA 500 Improvement
HBM Bandwidth	4.5× increase
Compute FLOPS	25× increase (MX8 → MX4 precision)
Generations	4 in under 2 years

This rapid advancement is the result of a deliberate iterative strategy. Rather than betting on a single long-cycle design, Meta builds each generation on the last using modular chiplets, incorporating the latest AI workload insights and hardware technologies on a shorter cadence. As Meta’s blog post explains: “Chip designs are based on projected workloads, but by the time the hardware reaches production — often two years later — those workloads may have shifted substantially.” The solution is to shorten the loop.

From R&R to GenAI: Why the Pivot Matters for the Open-Source Ecosystem

The MTIA journey is a case study in how quickly an organization can realign its hardware roadmap around a paradigm shift. In 2023, Meta’s dominant AI workload was ranking and recommendation — the systems that decide what you see in your feed. By 2025, GenAI had become the primary focus, with Llama and Muse Spark driving demand for inference compute at previously unimaginable scales.

For developers building on Llama, the implications are significant:

Lower inference costs: Custom silicon tailored to Llama’s architecture means Meta can offer Llama API pricing that undercuts general-purpose cloud providers. As MTIA 450 and 500 come online in 2027, margins improve further.
Tighter model-hardware co-design: When the chip team and the model team work from the same playbook, the entire stack is more efficient. Meta has confirmed it tested MTIA with Llama LLMs during development, a feedback loop that benefits both sides.
Strategic independence: By owning its silicon roadmap, Meta reduces dependence on NVIDIA and other GPU vendors — a critical factor as global AI chip supply remains constrained. Hundreds of thousands of MTIA chips are already deployed in production.

The Bigger Picture: Meta’s AI Infrastructure Bet

The MTIA program is part of a broader infrastructure strategy that includes massive data-center buildouts and a commitment to a diverse silicon portfolio. Meta has stated it will continue to leverage the best solutions available — both internally and externally — but MTIA is increasingly central to its plans.

This matters because Meta’s investment in custom silicon directly affects the open-source Llama ecosystem. Every efficiency gain in the inference stack makes it cheaper and more sustainable for Meta to run Llama-based services — and by extension, to justify continued investment in the model family.

The MTIA roadmap also signals something about Meta’s long-term intentions: the company is not outsourcing its AI future to chipmakers. By building its own accelerators, Meta retains control over the hardware-software interface — and that control translates into faster iteration, lower costs, and a competitive moat that grows deeper with each new chip generation.

What’s Next

With MTIA 450 and 500 on the horizon for 2027, and MTIA 400 entering deployment, Meta’s hardware story is only accelerating. The company has demonstrated that it can move from design to deployment faster than traditional chip-development cycles — a capability that will become increasingly valuable as AI models continue to evolve at breakneck speed.

For the open-source AI community, the takeaway is clear: Meta is building the infrastructure to run Llama and its successors at a scale that few organizations can match. Whether you access those models through the Llama API, run them on your own hardware, or fine-tune them for specific tasks, the economics of inference — and therefore the viability of open-source AI — will be shaped in part by what Meta achieves with MTIA in 2026 and 2027.

This article was researched from Meta’s official blog post “Four MTIA Chips in Two Years: Scaling AI Experiences for Billions” (March 11, 2026), the ISCA’23 and ISCA’25 papers on MTIA architecture, and Meta’s published infrastructure strategy documents. All information is current as of May 28, 2026.

BadHost: The Starlette Vulnerability That Exposed Millions of AI Agents and MCP Servers

2026-05-28T10:00:00+00:00

BadHost: The Starlette Vulnerability That Exposed Millions of AI Agents and MCP Servers

May 28, 2026 — A critical authentication bypass vulnerability in Starlette, the Python ASGI framework that underpins much of the AI infrastructure ecosystem, has put millions of AI agents and MCP (Model Context Protocol) servers at risk of data theft, credential exposure, and remote code execution.

Tracked as CVE-2026-48710 and nicknamed BadHost, the vulnerability allows attackers to bypass path-based authentication middleware with a single malformed HTTP Host header character. The flaw affects all Starlette versions prior to 1.0.1, which was released on Friday.

“Millions of AI agents and tools around the world have been imperiled by a critical vulnerability that can allow hackers to breach the servers running them and make off with sensitive data and credentials,” Ars Technica’s Dan Goodin reported.

How BadHost Works

Starlette reconstructs request.url by concatenating the HTTP Host header with the request path — without validating the Host value against RFC 9112 or RFC 3986 grammar. An attacker can send a crafted header like Host: example.com/health?x= that shifts path and query boundaries during re-parsing, making request.url.path point to a different endpoint than the one the ASGI server actually routed to.

The result: the router dispatches on the real wire path (e.g., /admin), but middleware sees the poisoned re-parsed path (e.g., /health). Any path-based security decision made in middleware can be bypassed.

X41 D-Sec, the security firm that discovered the bug during an audit sponsored by OSTIF, described it in stark terms:

“A single character injected into the HTTP Host header bypasses path-based authorization in Starlette, the routing core of FastAPI.”

The Scope: Millions of Affected Systems

Starlette receives 325 million downloads per week and is the foundation of FastAPI — the most popular Python web framework for AI applications. The downstream impact is staggering:

vLLM — where the bug was originally discovered — the leading open-source LLM inference server
LiteLLM — the widely-used LLM proxy that sits in front of dozens of model providers
MCP servers — the Model Context Protocol infrastructure that connects AI agents to external tools, databases, and APIs
Agent harnesses and eval dashboards
Google ADK-Python and Ray Serve
BentoML and other ML serving platforms

MCP servers are particularly exposed because the MCP specification mandates unauthenticated OAuth discovery endpoints, providing attackers with a reliable path to find and exploit vulnerable instances. These servers store credentials for databases, email accounts, cloud services, and internal tools — making them exceptionally valuable targets.

Data Types Exposed by Scans

X41 D-Sec’s internet-wide scan revealed a disturbing range of exposed data across vulnerable systems:

Sector	Exposed Data
Biopharma AI	Clinical trial databases, M&A data
Identity Verification	Face analysis, KYB, live PII, internal codebases
IoT/Industrial	SSH access to devices, remote code execution
Email/SaaS	Full mailbox access (read/send/delete), S3 exports
HR/Recruitment	Candidate PII, hiring pipeline data
Cloud Monitoring	AWS topology, metric queries
Cybersecurity	Asset inventory, live scanner access

Why This Matters for AI Agents

The BadHost vulnerability is emblematic of a structural risk in the AI agent ecosystem: trust at the wrong layer. Starlette, FastAPI, vLLM, and LiteLLM form the backbone of most Python-based AI infrastructure, yet the interaction between ASGI server behavior, framework URL construction, and middleware auth decisions created a vulnerability that no single component could fix alone.

As OSTIF noted in their disclosure: “This bug is a classic ‘responsibility gap’ where if this maintainer didn’t patch, thousands of exposed projects would have to individually secure their projects.”

The vulnerability also highlights a limitation of current AI-powered security tools. The researchers noted that even Claude Mythos (Anthropic’s code-scanning agent) did not find CVE-2026-48710 during Project Glasswing, because the bug spans three independent layers — each behaving correctly in isolation — rather than existing in a single codebase.

Mitigation and Response

The fix: Upgrade Starlette to version 1.0.1 or later. The patched version rejects Host headers containing invalid characters instead of using them for URL construction.

For those who cannot upgrade immediately:

Replace request.url.path with request.scope["path"] in every middleware, dependency, and decorator that makes security decisions
Deploy an RFC-compliant reverse proxy (nginx, Caddy, Traefik, HAProxy) that validates Host headers before forwarding to ASGI servers
Audit bundled and vendored Starlette — container images, virtualenvs, and pip-installed dependencies may pin vulnerable versions

A free online scanner is available at badhost.org — developed jointly by X41 D-Sec, Persistent Security Industries, and Bintech — to check if any reachable endpoint is vulnerable. The open-source repository also includes PoC exploits, Semgrep rules for static detection, and CodeQL queries for large-scale scanning.

The Takeaway

For developers building on AI agent infrastructure, BadHost is a wake-up call. The Python AI tooling ecosystem has grown so fast that foundational security assumptions at the framework layer have gone unexamined. Every team running FastAPI-based MCP servers, LLM proxies, or agent harnesses should treat this as a critical priority — scan their infrastructure, patch Starlette, and audit middleware for path-based auth patterns.

The vulnerability may have been disclosed, but the real impact depends on how quickly the ecosystem patches. With 325 million weekly downloads and MCP servers holding credentials to production systems, the window for exploitation is wide open.

Sources: Ars Technica — Millions of AI agents imperiled by critical vulnerability | OSTIF — Disclosing the BADHOST Vulnerability | badhost.org — Scanner & Details | OSV — CVE-2026-48710

Openclaw v2026.5.26 Makes Transcripts Core, Ships Faster Gateway and Production-Ready Channels

2026-05-28T10:00:00+00:00

Just two days after the v2026.5.22 release with its 4,100× model-listing optimization, Openclaw is back with v2026.5.26 — a stable release that makes transcripts a first-class core capability, delivers substantial gateway performance improvements, and brings Telegram, iMessage, WhatsApp, and Discord to genuine production-readiness.

With 375,000+ GitHub stars, 78,200+ forks, and 61 named contributors in the release changelog alone, the project continues to consolidate its position as the leading open-source claw controller for AI agents.

Transcripts Go Core

The defining architectural change in v2026.5.26 is the elevation of transcripts from a plugin-level concern to a core system capability. Every agent interaction — whether initiated via CLI, WebChat, media upload, follow-up, hook, or Codex mirror — now flows through a unified transcript pipeline.

What This Means

Transcript-backed meeting summaries — Agent conversations are captured with full source-provider metadata, cleaned user turns, and media provenance, enabling accurate post-hoc summaries
Codex mirror transcripts — Codex app-server interactions are mirrored into the same transcript store, giving operators a single pane of glass across all agent activity
CLI/TUI replay — Transcripts support deterministic replay with hooks, making debugging and auditing dramatically simpler
Media provenance — Every image, file, and generated asset is tracked with its origin context in the transcript record

The transcript capture happens at the gateway level, not the plugin level, which means every conversation path is covered — including system events, hook-generated turns, and fallback routing. This is a foundational change that lays the groundwork for compliance, audit, and training-data pipelines.

# Access transcripts via the new CLI surface
openclaw transcript list
openclaw transcript view 

Gateway Performance: Less Rediscovery, Faster Replies

The v2026.5.26 release targets one of the most pervasive sources of latency in agent gateways: repeated rediscovery of the same information. The team audited every hot path in the gateway startup and reply pipeline, adding smart caching where it matters most.

Key Optimizations

Area	Optimization	Impact
Plugin metadata	Plugin metadata snapshots are cached for the process lifetime	Reply-time skill setup no longer rescans plugin metadata on every turn
Startup warnings	Startup-warning metadata is cached and reused	Gateway startup avoids repeated filesystem scans
Auth stores	Auth env snapshots are prepared once and reused	No repeated credential resolution on every request
Model cost indexes	Model pricing metadata is cached	Usage-cost tracking is near-instant
Channel resolution	Channel routing is cached per session	No repeated dispatch table rebuilds
Session caches	Session read paths avoid cloning	Lower memory pressure under load

The most visible impact is on visible reply delivery latency. Telegram typing/progress context is preserved, slash-command startup metadata is lazy-loaded, model hydration on hot paths is avoided, Codex profiler timing is flag-gated, and context compaction maintenance is deferred until after the user-facing reply is sent. The net effect: users see responses faster, even as the gateway continues processing background work.

Four Channels Reach Production Readiness

Telegram receives the most extensive channel update in this release. Inbound text entities are preserved, overlapping DM replies are handled correctly, account-scoped topic caches keep forum context, outbound replies carry proper context, targeted bot-command mentions work reliably, durable group retry targets are maintained, and native progress callbacks keep users informed during long-running operations.

iMessage

iMessage now handles attachment roots correctly — images saved under ~/Library/Messages/Attachments are read through the existing inbound path policy. Duplicate local Messages-source accounts are deduplicated at startup, direct DM history is seeded reliably, and image/group media attachment commands work as expected. The development team also addressed the long-standing issue where channels.imessage.accounts listing both default and a named account would spawn duplicate watchers.

WhatsApp regains proper group/media behavior with restored ack identity and group-drop warnings. The update also fixes media path resolution when OPENCLAW_HOME differs from the OS home directory.

Discord

Discord voice playback reliability is significantly improved. Large model picker menus are now bucketed alphabetically when the provider list exceeds 25 items. Media captions are merged into a single message, gateway metadata is routed through the configured proxy, numeric channel IDs work for outbound sends, self-reply echoes are suppressed, and wake-name matching is tightened without breaking fuzzy wake phrases.

Voice and Talk: Full Realtime Control

The voice subsystem receives a major architectural upgrade in v2026.5.26. The team extracted a shared realtime voice SDK that provides common primitives for turn-context tracking, output activity monitoring, consult question matching, speakable-result extraction, and alias-aware forced-consult coordination. This SDK is then reused across Discord, browser voice, Google Meet, and all other voice surfaces.

Key capabilities now available:

Realtime Talk runs can be inspected, steered, cancelled, or followed up from both the Web UI and Discord voice
Wake-name handling is more tolerant of ambient noise without letting ambient speech falsely trigger agents
iOS Talk mode now features direct realtime voice sessions, a compact toolbar status indicator, and responsive voice waveform feedback
Android gains the pair-new-gateway action with improved offline voice recovery
Google Meet command bridges reuse the shared output activity tracking for local barge-in detection

Safer Content Boundaries

Security hardening continues with several important improvements:

Browser snapshot reads now honor SSRF policy before ChromeMCP or direct CDP reads
System-event text cannot spoof nested prompt markers — untrusted plugin/channel labels are sanitized before they reach the prompt
Fetched file text is wrapped as external content with metadata boundaries
ClickClack inbound sender allowlists are applied before agent dispatch
Stale device tokens are rejected during rotation
Serialized tool-call text is scrubbed from visible replies

The team also enabled the default auth rate limiter for remote non-browser HTTP gateway auth failures when gateway.auth.rateLimit is unset, while preserving the loopback exemption for local development.

Providers, Codex, and Local Models

The provider layer sees steady improvements across the board:

Named auth profiles allow multiple login configurations per provider, with migration support for Hermes, OpenCode, and Codex auth profiles
OpenAI sampling params are now forwarded through the gateway
Codex app-server resume/timeout/usage-limit recovery is hardened — Codex turn timeouts stay inside the Codex runtime boundary so they don’t poison shared app-server clients
xAI usage limits are surfaced in status output
Ollama receives top-p normalization to ensure consistent generation behavior
Local approval resolution is fixed for plugin command paths
Memory/local embeddings now run GGUF embeddings in an isolated worker sidecar — if the native embedding process crashes, the gateway degrades gracefully to keyword search instead of taking down the entire system

Observability Gets Richer

v2026.5.26 introduces several observability improvements that make it easier to understand what the gateway is doing:

Activity tab — A new ephemeral tab in the Control UI shows sanitized live tool activity summaries without persisting raw telemetry
Gateway secret-prep traces — The diagnostics pipeline now traces secret preparation, making it easier to debug auth failures
Model stream progress — Users can see streaming progress for model responses
Explicit fast-mode status — The TUI now shows when fast-mode is active
OpenTelemetry LLM spans — Content spans are now emitted through the OTLP exporter, giving operators full visibility into model interactions
Alertable telemetry — Blocked tools, model failover, stale sessions, liveness warnings, oversized payloads, and webhook ingress all generate actionable signals

The Big Picture

Openclaw v2026.5.26 is a consolidation release — it takes capabilities that were scattered across plugins, channels, and undocumented code paths and pulls them into a coherent, performant, observable core. Transcripts as a first-class feature, a faster gateway through smarter caching, and four channels reaching genuine production readiness represent meaningful progress toward the project’s vision of being the universal control plane for AI agents.

The release follows a familiar pattern: a feature-packed stable release (v2026.5.22) is followed by a hardening release that fixes edge cases, shores up performance, and closes security gaps. With v2026.5.27-beta.1 already published today (May 28) — bringing Pixverse video generation, enhanced security boundaries, and more reliable Codex runs — the release cadence remains relentless.

npm install -g openclaw

Openclaw v2026.5.26 is available now via npm install -g openclaw. Full release notes on GitHub.

DuckDuckGo Surges 28% as Users Flee Google’s AI Mode — The Great Search Rebellion?

2026-05-28T08:00:00+00:00

DuckDuckGo Surges 28% as Users Flee Google’s AI Mode — The Great Search Rebellion?

May 28, 2026 — When Google CEO Sundar Pichai told investors earlier this month that users “love” the company’s new AI Mode in Search, he may have unintentionally triggered the most visible user exodus in the search market’s recent history.

According to data reported by PC Gamer, DuckDuckGo saw nearly 28% more visits in the week immediately following Google’s insistence that people love AI Mode — a signal that a significant portion of the search-using public may be looking for an alternative to the AI-first future being pushed by the major tech platforms.

The Backlash Is Real

The numbers are hard to ignore. DuckDuckGo’s chief communications officer Kamyl Bazbaz confirmed the surge, noting that while DuckDuckGo’s own AI overviews remain popular, so does the option to filter out AI-generated images from search results.

“People just want a choice,” Bazbaz told PC Gamer. “Amen to that,” the publication’s reporter added — a sentiment that appears to resonate with a growing number of search users.

The story gained explosive traction on Hacker News, where it garnered over 835 points and 390 comments — making it one of the most-discussed stories of the day.

What Is Google AI Mode?

Google’s AI Mode, launched earlier this year, represents the company’s most aggressive push yet into AI-generated search results. Instead of displaying a traditional list of links, AI Mode generates comprehensive, conversational answers powered by Google’s Gemini models — complete with citations, follow-up suggestions, and synthesized information from multiple sources.

While Google frames this as a productivity enhancement — getting you an answer faster, without clicking through multiple pages — critics argue that it fundamentally undermines the web’s traffic economy. If users get their answers from AI summaries, the sites that produce the original content see fewer visits, less ad revenue, and ultimately less incentive to create.

This is the same tension that has plagued Google’s AI Overviews since their launch in 2024, but AI Mode takes it several steps further by making the AI response the default experience rather than a supplementary feature.

DuckDuckGo’s Counter-Positioning

DuckDuckGo has positioned itself as the anti-AI-search alternative. The privacy-focused search engine went viral earlier this year with its promise:

“Everything you do in DuckDuckGo is private, we don’t collect search histories or chats, and nothing is used for AI training.”

This message has proven remarkably effective. In a world where every major tech company is racing to inject AI into every product surface, DuckDuckGo offers a radically simple value proposition: search that just searches. No AI summaries. No model training on your queries. No personalized tracking.

The 28% traffic surge suggests this message is landing with a substantial audience — not just privacy diehards, but mainstream users who find AI Mode intrusive, slow, or untrustworthy.

The Numbers Behind the Story

The surge is particularly noteworthy given DuckDuckGo’s trajectory:

Weekly visits : Baseline → +28%
Hacker News rank : N/A → #3 trending (835 pts)
User sentiment : Stable → Rapidly growing

DuckDuckGo has been steadily growing for years, but a 28% weekly spike is extraordinary for a mature search engine. For context, DuckDuckGo processed approximately 3.5 billion searches per month in 2025. A 28% increase would represent nearly 1 billion additional searches per month — a massive shift in user behavior.

What This Means for the AI Agent Ecosystem

The DuckDuckGo surge has implications beyond the search market itself. AI agents fundamentally depend on access to high-quality, up-to-date information — and the primary way they get it is through web search APIs and indexed content.

If the push toward AI-generated search results continues to cannibalize web traffic, we could see:

A reduction in the quality of web content as publishers find it harder to monetize traffic from AI-intermediated searches
Increased reliance on specialized data sources (academic papers, newsletters, proprietary APIs) rather than general web search
A fragmentation of the search market into AI-powered and traditional segments, forcing agent developers to choose which backend to prioritize
Privacy-first search APIs becoming more attractive for agent developers who want to avoid their agent’s queries being used for training

For agent developers building search-dependent workflows, the takeaway is clear: the search landscape is fragmenting, and relying exclusively on one provider’s API carries both technical and reputational risk.

The Bigger Picture: Users Want Agency Over AI

The DuckDuckGo surge is part of a broader pattern. Recent polling and survey data consistently shows that while users find AI tools useful in specific contexts, there is growing resistance to AI being forced into every digital experience:

YouTube announced this week it will automatically label AI-generated videos, after creator backlash over undisclosed synthetic content.
Apple and Google are facing increasing scrutiny over how push notifications are handled — including AI-driven notification management that users didn’t ask for.
Google’s own data reportedly shows that AI Overviews have lower click-through rates than traditional search results, suggesting users may not actually “love” them as much as the company claims.

The common thread is user agency. People want the ability to choose when AI helps them and when it stays out of the way. DuckDuckGo’s surge proves that “AI-free” is not just a niche selling point — it’s a competitive advantage.

What’s Next?

DuckDuckGo’s 28% surge is a warning shot for Google and every other platform betting everything on AI. The message from users is clear: AI is a tool, not a mandate. The companies that treat it as optional — letting users opt in rather than forcing them to opt out — may end up winning the long game.

For DuckDuckGo, the challenge will be retaining these new users and proving that an AI-free search experience can keep up with the features and accuracy that users expect. For Google, the challenge is more fundamental: how do you convince users that AI Mode is valuable when every signal suggests they’re running away from it?

One thing is certain: the search wars just got a lot more interesting.

Sources: PC Gamer — DuckDuckGo's AI-free search saw nearly 28% more visits | Hacker News discussion

Anthropic and OpenAI Finally Found Product-Market Fit — and It’s All About Coding Agents

2026-05-28T07:00:00+00:00

Anthropic and OpenAI Finally Found Product-Market Fit — and It’s All About Coding Agents

May 28, 2026 — Is the AI industry’s massive infrastructure spend finally paying off? According to a deeply researched analysis by Simon Willison, the answer is a resounding yes — and the driver is not chatbots, not image generators, but coding agents.

Willison’s post, which rocketed to the top of Hacker News with over 830 points and 950 comments, argues that April 2026 marks a genuine inflection point for the frontier AI labs. “I think they’ve finally found product-market fit, with the coding/general-purpose agent products embodied by Claude Code/Cowork and Codex,” he writes.

Enterprise Customers Are Now Paying API Prices

The cornerstone of Willison’s argument is a seismic shift in how Anthropic and OpenAI charge their enterprise customers. At some point in the last six months, Anthropic switched its Enterprise plan to $20/seat/month plus API pricing for usage. OpenAI followed suit in April 2026, aligning Codex pricing with API token costs.

This is a dramatic departure from the flat-rate enterprise deals that characterized the 2024–2025 era. Now, companies signing year-long contracts are locked into full API prices — no more deep discounts.

“I currently subscribe to the $100/month Max plan from Anthropic and the $100/month Pro plan from OpenAI,” Willison notes. “I just ran the ccusage tool on my laptop to get an estimate of how much I would have spent if I were to pay for API tokens in the past 30 days and got $1,199.79 for Anthropic Claude Code and $980.37 for OpenAI Codex.”

That’s $2,180.16 worth of tokens for $200 — and Willison describes himself as a “moderately heavy user,” not someone running agents around the clock.

The pricing becomes even starker when you consider the latest model releases. GPT-5.5 (released April 23rd) costs 2× the API price of GPT-5.4. Opus 4.7 (April 16th) runs around 1.4× the price of Opus 4.6 when accounting for a new tokenizer. Enterprise customers face a double whammy: higher model prices and the removal of bulk discounts.

Why This Is Product-Market Fit

Willison draws a crucial distinction between popularity and profitability. ChatGPT boasts more than 900 million weekly active users, but only 50 million — 5.6% — are paying consumer subscribers.

“Charging $10–$20/month per user is an OK business, but you’d need 1–2 billion subscribers sticking around for four years to cover $1 trillion in infrastructure,” he calculates.

Coding agents change this equation entirely. These tools burn vastly more tokens than chat interfaces, but they are quickly becoming daily drivers for extremely well-compensated professionals. Companies spending $200+/month per user — or in Willison’s power-user case, ~$1,000/month per vendor — generate revenue at a scale that can meaningfully offset infrastructure costs.

“Coding agents really did change everything. These are tools which burn vastly more tokens, but are also quickly becoming daily drivers for the work carried out by extremely well-compensated professionals.”

The models released in November 2025 — GPT-5.1 and Opus 4.5 combined with their respective coding agent harnesses — elevated agents to being genuinely useful. We’ve now had six months for organizations to integrate these tools into their workflows, and the spending is following.

The Ramp-Up: Enterprise Sales Teams Are Growing Fast

As further evidence, Willison points to the open job listings at both companies:

OpenAI: 703 open jobs, of which 229 (32.6%) relate to enterprise sales and support — account executives, “Go To Market” roles, and Forward Deployed Engineers.
Anthropic: 390 open jobs, with 105 (26.9%) in enterprise-facing roles.

“It’s pleasingly ironic that these AI labs have picked a business model with such a heavy demand on human labor — enterprise sales contracts don’t close themselves without a whole lot of humans in the mix!” Willison observes.

Notably, he conducted this analysis using Claude Code itself — scraping job sites, piping data into Datasette Cloud, and analyzing with Datasette Agent. Full dogfooding.

The “AI Failure” Stories Are Actually Evidence of PMF

The narrative around companies being “shocked” by their AI bills — most notably Uber reportedly maxing out its full-year AI budget just months into 2026 — is actually evidence for the PMF thesis, not against it.

“The best advice I ever heard on pricing a product was that your customer should suck air through their teeth and then say yes. Uber’s budget overrun and Microsoft’s seat cancellations look like that effect playing out in practice.”

Microsoft’s decision to cancel Claude Code licenses — ostensibly to encourage dogfooding of Copilot CLI — was also reported to be a financial decision triggered by the June 30th end of Microsoft’s fiscal year. When your customers are making billion-dollar budget allocation decisions about your product, you’ve found product-market fit.

The Colossus Deal Changes Everything

Perhaps the most staggering data point comes from an unexpected source: SpaceX’s recent S-1 filing revealed that Anthropic signed a deal for cloud services worth $1.25 billion per month through May 2029 for access to compute capacity across the Colossus and Colossus II clusters.

The Anthropic announcement indicated this deal would allow them to “increase our usage limits for Claude Code and the Claude API,” heavily implying Colossus is being used for inference, not training. Given that Anthropic already has vast compute from other providers, the willingness to spend $1.25 billion/month from just one vendor hints at the enormous scale of inference budgets today.

A Two-Inflection-Point Story

Willison identifies two critical inflection points:

November 2025 — The capability inflection point, when GPT-5.1 and Opus 4.5, combined with their coding agent harnesses, became genuinely useful for real work.
April 2026 — The revenue inflection point, when the enterprise pricing shift and the resulting budget impacts made clear that these are real businesses, not just research projects.

“We’ll know for sure how real this moment is when the S-1 documents for the upcoming Anthropic and OpenAI IPOs give us some real, audited numbers to get our teeth into.”

What This Means for the Agent Ecosystem

For the broader AI agent ecosystem, the implications are profound:

Cursor and Copilot face direct competition from Anthropic and OpenAI’s own agent products. No wonder Cursor is investing in their own models.
Enterprise pricing at API rates means the cost of running AI agents at scale is now transparent and predictable — but expensive.
The middleman squeeze is real: Anthropic’s Claude Code directly competes with the very tools that were previously Anthropic’s biggest API customers (Cursor and GitHub Copilot were reportedly responsible for $1.2 billion of Anthropic’s then-$4 billion revenue in 2025).
Infrastructure providers (CoreWeave, Lambda, and now SpaceX) become critical — and their own IPOs will provide visibility into the AI industry’s true scale.

For developers and enterprises building on AI agents, the message is clear: the era of cheap, flat-rate agentic automation is over. But the value these tools deliver is now proven enough that organizations are willing to pay real money. That’s not a bug — it’s product-market fit.

Read the full original analysis: "I think Anthropic and OpenAI have found product-market fit" by Simon Willison

AI Agent Terminology: 55+ Terms You Need to Know in 2026

2026-05-27T22:00:00+00:00

The AI agent landscape has exploded in 2026. New frameworks launch weekly, protocols are being ratified in real time, and the vocabulary is evolving faster than most of us can keep up with. Whether you’re reading a research paper, evaluating a vendor, or debugging a multi-agent system at 2 a.m., you’ve probably hit a term that made you pause and think, “Wait — what exactly does that mean in an agent context?” For a deeper dive into the architectures and frameworks these terms describe, our Complete Guide to AI Agents provides the full technical context.

This glossary is built for that moment. It covers the terms you’ll actually encounter: the core concepts that define how agents work, the frameworks everyone is debating on Hacker News, the technical primitives that power production systems, the safety vocabulary that regulators and red teams use, and the enterprise terminology that’s shaping how companies adopt agentic AI.

We’ve aimed for clarity over exhaustiveness. Every definition is 1–3 sentences, written in plain English, and grounded in how the term is used in practice — not in a whitepaper. Think of this as a field guide, not an encyclopedia. For a historical perspective on how these frameworks evolved, our 2025 open-source agent framework comparison shows where the landscape stood before the 2026 explosion.

Core Concepts

AI Agent

A software system that uses a large language model (LLM) as its reasoning engine to perceive its environment, make decisions, and take actions autonomously. Unlike a chatbot, an agent can plan multi-step tasks, use external tools, and adapt its behavior based on outcomes.

Autonomous Agent

An AI agent capable of operating with minimal or no human supervision over extended periods. Autonomous agents set their own sub-goals, recover from errors without intervention, and persist across sessions — key for production workloads like customer support triage or infrastructure monitoring.

Multi-Agent System (MAS)

An architecture where multiple AI agents collaborate, compete, or negotiate to solve problems that are too complex for a single agent. Each agent may have a specialized role (researcher, coder, reviewer) and the system includes protocols for communication, task delegation, and conflict resolution.

Agentic AI

A term describing AI systems that exhibit goal-directed, autonomous behavior — the quality of being an agent rather than a passive tool. Agentic AI implies planning, tool use, memory, and the ability to pursue objectives over multiple steps without step-by-step human prompting.

Tool Use

The ability of an AI agent to invoke external functions, APIs, or software tools to accomplish tasks beyond text generation. Tools can include web search, code execution, file system operations, database queries, or any external capability exposed through a defined interface.

Function Calling

A specific mechanism by which an LLM outputs structured data (typically JSON) that triggers a predefined function in the host application. Function calling is the most common implementation pattern for tool use — the model decides which function to call and with what arguments based on the user’s intent.

Reasoning

The cognitive process by which an LLM breaks down complex problems, evaluates alternatives, and draws logical conclusions before acting. Advanced reasoning techniques — like step-by-step decomposition and self-verification — are what separate simple instruction-following from genuine agentic behavior.

Planning

An agent’s ability to decompose a high-level goal into a sequence of actionable steps before execution. Effective planning involves anticipating dependencies, ordering tasks correctly, and dynamically re-planning when intermediate steps fail or produce unexpected results.

Memory (Short-Term / Long-Term)

Short-term memory refers to context held within the model’s context window during a single session — the current conversation, recent tool outputs, and in-flight reasoning. Long-term memory persists across sessions via external storage (vector databases, knowledge graphs, or structured logs), allowing agents to remember user preferences, past decisions, and learned patterns over days or months.

RAG (Retrieval-Augmented Generation)

A technique that grounds an LLM’s responses in external knowledge by retrieving relevant documents from a database before generating an answer. In agent systems, RAG is often used as a tool — the agent queries a knowledge base, retrieves context, and uses that context to inform decisions or responses, reducing hallucination on factual queries.

Orchestration

The coordination layer that manages how multiple agents, tools, and workflows interact within a larger system. Orchestration handles task routing, dependency management, state tracking, and error handling — it’s the conductor that keeps a multi-agent system from descending into chaos.

Agent Loop

The core execution cycle of an AI agent: observe (gather information from the environment or tool outputs), reason (analyze and decide what to do next), act (execute a tool call or produce output), and observe again. The loop repeats until the agent determines the task is complete or a termination condition is met.

ReAct (Reasoning + Acting)

A prompting and execution pattern where the agent interleaves reasoning traces with concrete actions. Instead of thinking fully and then acting, the agent thinks a step, acts, observes the result, thinks about the result, and acts again — producing more grounded and correctable behavior than pure chain-of-thought approaches.

Chain-of-Thought (CoT)

A prompting technique that instructs the LLM to produce intermediate reasoning steps before giving a final answer. By verbalizing its thinking, the model often achieves higher accuracy on complex reasoning tasks — and makes its decision process interpretable to human observers.

Tree-of-Thought (ToT)

An extension of chain-of-thought where the LLM explores multiple reasoning paths simultaneously, evaluates them, and prunes unpromising branches — much like a search algorithm. Tree-of-thought is especially powerful for planning and problem-solving tasks where the agent must consider several possible strategies before committing.

Frameworks & Platforms

LangChain

An open-source framework for building LLM-powered applications with a focus on composability. LangChain provides abstractions for chains, agents, tools, and memory, along with a growing ecosystem of integrations — making it one of the most widely adopted starting points for agent development.

AutoGen

Microsoft’s open-source multi-agent conversation framework. AutoGen lets developers define specialized agents that communicate through structured conversations, with built-in support for human-in-the-loop patterns, code execution sandboxes, and group chat topologies.

CrewAI

A Python framework for orchestrating role-based AI agents that work together as a “crew.” CrewAI assigns each agent a defined role, goal, and backstory, then manages sequential or hierarchical task execution — popular for rapid prototyping of multi-agent workflows.

OpenAI Agents SDK

OpenAI’s official software development kit for building, testing, and deploying AI agents. The SDK provides primitives for tool definitions, guardrails, handoffs between agents, and tracing — designed to work natively with OpenAI models and the Responses API.

Claude (Anthropic)

Anthropic’s family of frontier LLMs, widely used as the reasoning engine in agent systems. Claude models are known for strong instruction following, long context windows (up to 200K tokens), native tool-use capabilities, and safety-focused design principles that make them popular for production agent deployments.

Hermes Agent

An open-source AI agent runtime and personal assistant framework by Nous Research, designed to give users full control over their agent’s skills, plugins, memory, and model backend. Hermes Agent emphasizes local-first operation, cross-platform support, and a community-driven ecosystem of shareable skills and profiles.

Openclaw

An open-source personal AI agent platform focused on multi-channel communication (Telegram, Discord, WhatsApp, Slack, email, voice) with a plugin architecture. Openclaw emphasizes multi-profile management, policy plugins for compliance, and the ability to run entirely on user-owned infrastructure.

Haystack

An open-source NLP framework by deepset for building search and retrieval pipelines. In the agent ecosystem, Haystack is commonly used to implement RAG backends, document processing, and knowledge retrieval — often as a tool invoked by higher-level agent frameworks.

Semantic Kernel

Microsoft’s open-source SDK for integrating LLMs into applications with an emphasis on enterprise scenarios. Semantic Kernel provides a plugin model, orchestration patterns, and native integration with the Microsoft ecosystem (Azure, Copilot, Teams).

Microsoft Copilot Studio

A low-code platform for building custom AI copilots and agents within the Microsoft 365 ecosystem. Copilot Studio enables organizations to create agents that work across Teams, SharePoint, Dynamics 365, and Power Platform — with built-in connectors to enterprise data sources.

Technical

MCP (Model Context Protocol)

An open protocol developed by Anthropic that standardizes how AI models connect to external tools, data sources, and services. MCP defines a client-server architecture where agents (clients) discover and invoke capabilities exposed by MCP servers — analogous to how USB standardized peripheral connections, but for AI tool integration.

ACP (Agent Communication Protocol)

An emerging standard for how AI agents communicate with each other across different frameworks and platforms. ACP aims to solve agent-to-agent interoperability — allowing a LangChain agent to delegate work to an AutoGen agent using a common message format, capability discovery mechanism, and security model.

GGUF

A file format for storing quantized LLM weights, widely used in the local and open-source model ecosystem. GGUF enables running large models on consumer hardware by bundling model architecture metadata with compressed weights that tools like llama.cpp can load efficiently.

LoRA (Low-Rank Adaptation)

A parameter-efficient fine-tuning technique that adds small, trainable adapter layers to a pre-trained model rather than modifying all weights. LoRA makes it practical to customize foundation models for specific agent tasks — like tool-calling or domain-specific reasoning — at a fraction of the cost and storage of full fine-tuning.

Quantization

The process of reducing the numerical precision of a model’s weights (e.g., from 16-bit to 4-bit) to decrease memory usage and inference latency. Quantization is essential for running capable agent models on edge devices, laptops, and cost-constrained cloud instances.

Fine-tuning

The process of further training a pre-trained LLM on a curated dataset to improve performance on a specific task or domain. In agent development, fine-tuning is used to improve tool-calling accuracy, teach domain-specific reasoning, or align model behavior with enterprise policies.

Structured Output

A capability where the LLM generates responses in a guaranteed format (typically JSON conforming to a schema) rather than free-form text. Structured output is critical for agent systems because tool calls, data extraction, and agent-to-agent messages must be machine-parseable with zero tolerance for malformed syntax.

JSON Mode

A specific LLM feature that constrains the model’s output to valid JSON. While less rigorous than full structured output with schema validation, JSON mode is widely supported and sufficient for many agent tool-calling implementations.

Rate Limiting

A mechanism that restricts how many requests an agent can make to an API or service within a given time window. Proper rate-limit handling — with exponential backoff, queuing, and graceful degradation — is essential for production agents that call external APIs without overwhelming them or exhausting budgets.

Token

The atomic unit of text that an LLM processes — roughly corresponding to a word fragment (~4 characters in English). Token count determines context window usage, API pricing, and latency, making token-aware design critical for cost-efficient agents that handle long conversations or large documents.

Context Window

The maximum number of tokens an LLM can process in a single forward pass, encompassing the system prompt, conversation history, tool outputs, and the current query. Modern agents rely on large context windows (128K–200K tokens) to maintain coherence across long, multi-turn interactions — but must still manage context strategically to avoid hitting limits.

Embedding

A numerical vector representation of text, images, or other data that captures semantic meaning in a high-dimensional space. Embeddings enable agents to perform similarity search, clustering, and retrieval — the mathematical foundation behind semantic memory and RAG systems.

Vector Database

A specialized database optimized for storing and querying high-dimensional vectors (embeddings). Vector databases power the “retrieval” half of RAG by enabling fast nearest-neighbor search across millions of documents — letting agents find semantically relevant information even when keywords don’t match.

Agent-to-Agent Communication

The mechanisms by which AI agents exchange information, delegate tasks, and coordinate actions. This can range from simple structured message passing to sophisticated protocols involving capability discovery, negotiation, and shared memory — and is a central challenge in multi-agent system design.

Safety & Alignment

Alignment

The field of ensuring that AI systems behave in accordance with human values, intentions, and safety constraints. In agent systems, alignment means the agent pursues its goals without causing unintended harm — even when the shortest path to the goal would violate ethical or operational boundaries.

RLHF (Reinforcement Learning from Human Feedback)

A training technique where human evaluators rank model outputs and those rankings are used to train a reward model that fine-tunes the LLM via reinforcement learning. RLHF has been the dominant approach for teaching models to be helpful, harmless, and aligned with user intent.

Constitutional AI

Anthropic’s alignment methodology where an AI is trained to follow a written “constitution” of principles rather than relying solely on human feedback. The model self-critiques and revises its outputs against these principles, enabling scalable oversight without requiring humans to review every output.

Red Teaming

The adversarial practice of probing an AI system for vulnerabilities, harmful behaviors, or alignment failures before deployment. Red teams simulate attacks — from prompt injection to social engineering — to identify weaknesses that need to be addressed via guardrails, fine-tuning, or architectural changes.

Prompt Injection

A security attack where malicious instructions are embedded in data that an agent processes (e.g., a web page, email, or document), causing the agent to disregard its original instructions and follow the attacker’s commands. Prompt injection is one of the most challenging unsolved security problems in agent systems.

Guardrails

Protective constraints placed around an agent’s behavior — implemented as input filters, output validators, or runtime monitors. Guardrails can enforce content policies, prevent harmful actions, validate tool calls against schemas, and ensure the agent stays within its defined operational boundaries.

Sandboxing

The practice of running agent code execution, tool invocations, or entire agent instances in isolated environments with restricted permissions. Sandboxing prevents an agent from causing damage if it makes a mistake or is compromised — critical for agents that execute arbitrary code or access file systems.

Agent Safety

The interdisciplinary field concerned with ensuring that autonomous AI agents operate reliably, predictably, and without causing harm — even in unexpected situations. Agent safety encompasses alignment, robustness, monitoring, and the design of “off-switch” mechanisms that remain under human control.

Interpretability

The study of understanding why an AI model made a specific decision, by examining its internal representations, attention patterns, or reasoning traces. In agent systems, interpretability is essential for debugging failures, building trust with users, and satisfying regulatory requirements for explainable AI.

Jailbreaking

The practice of circumventing an AI system’s safety restrictions through crafted prompts, role-playing scenarios, or encoding tricks. Agent systems face heightened jailbreak risk because their tool-use and multi-step reasoning capabilities create larger attack surfaces for bypassing guardrails.

Enterprise & Industry

SLA (Service Level Agreement)

A contractual commitment defining the expected performance, availability, and reliability of an AI agent service. For production agents, SLAs cover uptime (e.g., 99.9%), response latency, accuracy thresholds, and escalation procedures — critical for enterprise procurement and vendor evaluation.

RPA (Robotic Process Automation)

A technology for automating structured, rule-based business processes — such as data entry, invoice processing, or form submission. While traditional RPA follows fixed scripts, the industry is converging with AI agents to create “intelligent automation” that handles exceptions and unstructured data.

ERP Agent

An AI agent integrated with Enterprise Resource Planning systems (SAP, Oracle, Microsoft Dynamics) to automate workflows like order-to-cash, procurement, and financial close. ERP agents represent one of the largest enterprise adoption vectors for agentic AI, with SAP deploying 200+ production agents in 2026.

Autonomous Enterprise

A vision of the future organization where AI agents handle the majority of operational, analytical, and decision-support tasks — with humans shifting to strategic oversight, exception handling, and creative direction. The autonomous enterprise is the endpoint of the agent adoption curve that began with RPA and is accelerating through LLM-powered agents.

Digital Worker

A term used in enterprise contexts to describe an AI agent that performs a specific job function — analogous to a human employee. Digital workers have defined roles, performance metrics, access permissions, and escalation paths, and are increasingly managed alongside human teams in workforce orchestration platforms.

Compliance

The requirement that AI agent systems adhere to regulatory frameworks (GDPR, SOC 2, HIPAA, EU AI Act), industry standards, and internal governance policies. Compliance covers data handling, decision auditability, bias monitoring, and the ability to explain agent actions to regulators and auditors.

Observability

The practice of instrumenting agent systems to understand their internal state through logs, metrics, traces, and dashboards. In agent contexts, observability goes beyond traditional APM — it must capture reasoning chains, tool-call sequences, memory access patterns, and multi-agent interactions to enable debugging and optimization.

Anthropic Launches Project Glasswing — Claude Mythos Preview, $100M Cyber Defense Initiative with AWS, Apple, Google, Microsoft, and NVIDIA

2026-05-27T14:00:00+00:00

Anthropic today announced Project Glasswing, the most ambitious cross-industry cybersecurity initiative ever mounted by an AI company. The project brings together AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, the Linux Foundation, Microsoft, NVIDIA, and Palo Alto Networks around a single goal: securing the world’s most critical software before AI-augmented attackers can exploit it.

At the heart of Glasswing is Claude Mythos Preview — a new, unreleased frontier model that Anthropic describes as having crossed a threshold where AI can “surpass all but the most skilled humans at finding and exploiting software vulnerabilities.” The model has already discovered thousands of high-severity vulnerabilities, including critical flaws in every major operating system and web browser.

$100M

Model Usage Credits

$4M

Open Source Donations

Launch Partners

40+

Additional Participants

What Claude Mythos Preview Found

The model’s capabilities were demonstrated through a series of striking vulnerability discoveries conducted entirely autonomously:

🔴 A 27-year-old vulnerability in OpenBSD — one of the most security-hardened operating systems in the world, used to run firewalls and critical infrastructure. The flaw allowed an attacker to remotely crash any machine running the OS just by connecting to it.

🔴 A 16-year-old vulnerability in FFmpeg — the ubiquitous video encoding library used by countless applications. The vulnerable line of code had been hit five million times by automated testing tools without ever triggering a detection.

🔴 A chain of vulnerabilities in the Linux kernel — the software powering most of the world’s servers. Mythos autonomously found and chained together several flaws to escalate from ordinary user access to complete system control.

All reported vulnerabilities have been patched by the respective maintainers.

Benchmark Performance

Claude Mythos Preview’s security-specific capabilities far exceed any publicly evaluated model:

Benchmark	Mythos Preview	Opus 4.6	Improvement
CyberGym (Vulnerability Reproduction)	83.1%	66.6%	+16.5 pp
SWE-bench Verified	93.9%	80.8%	+13.1 pp
SWE-bench Pro (Agentic Coding)	77.8%	53.4%	+24.4 pp
Terminal-Bench 2.0	82.0%	65.4%	+16.6 pp
GPQA Diamond (Reasoning)	94.6%	91.3%	+3.3 pp
Humanity’s Last Exam (with tools)	64.7%	53.1%	+11.6 pp
OSWorld-Verified (Computer Use)	79.6%	72.7%	+6.9 pp

The model’s SWE-bench Pro score of 77.8% is particularly notable — it represents a 24.4 percentage point leap over Opus 4.6, reflecting Mythos’ ability to handle complex, multi-step software engineering tasks autonomously.

How Project Glasswing Works

The initiative is structured around defensive deployment of Claude Mythos Preview:

Launch partners (AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, Microsoft, NVIDIA, Palo Alto Networks) receive direct access to Mythos Preview for scanning their foundational systems — codebases representing a large share of the world’s shared cyberattack surface.
40+ additional organizations that build or maintain critical software infrastructure can apply for access to scan both first-party and open-source systems.
$100M in model usage credits from Anthropic covers substantial usage throughout the research preview period.
$4M in donations to open-source security organizations: $2.5M to Alpha-Omega and OpenSSF (via the Linux Foundation) and $1.5M to the Apache Software Foundation.
After the preview period, Mythos Preview will be available to participants at $25/$125 per million input/output tokens on the Claude API, Amazon Bedrock, Google Cloud’s Vertex AI, and Microsoft Foundry.

The Urgency: Why Now?

Anthropic’s announcement makes a compelling case for urgency. The core argument:

“Given the rate of AI progress, it will not be long before such capabilities proliferate, potentially beyond actors who are committed to deploying them safely.”

The company notes that Claude Mythos Preview is not yet generally available and will not be made broadly accessible. Instead, the model’s cyber capabilities are being channeled exclusively through Project Glasswing for defensive purposes. Anthropic plans to develop and refine safety safeguards with an upcoming Claude Opus model before considering broader deployment of Mythos-class capabilities.

Industry Response

The breadth of industry participation is remarkable for a single AI company initiative:

Cisco (Anthony Grieco, SVP & Chief Security & Trust Officer): “AI capabilities have crossed a threshold that fundamentally changes the urgency required to protect critical infrastructure from cyber threats, and there is no going back.”
AWS (Amy Herzog, VP and CISO): “Our teams analyze over 400 trillion network flows every day for threats, and AI is central to our ability to defend at scale.”
Microsoft (Igor Tsyganskiy, EVP of Cybersecurity): “When tested against CTI-REALM, our open-source security benchmark, Claude Mythos Preview showed substantial improvements compared to previous models.”
Google (Heather Adkins, VP of Security Engineering): “We have long believed that AI poses new challenges and opens new opportunities in cyber defense.”
Linux Foundation (Jim Zemlin, CEO): “By giving the maintainers of critical open source codebases access to a new generation of AI models that can proactively identify and fix vulnerabilities at scale, Project Glasswing offers a credible path to changing that equation.”

What This Means for the Agent Ecosystem

Project Glasswing has implications beyond cybersecurity:

A new capability tier is confirmed. Mythos Preview’s benchmark scores — particularly the 24.4pp jump on SWE-bench Pro — validate that a significant capability leap exists beyond Claude Opus 4.7. This is the model that will inform Anthropic’s next general-purpose release.
Agentic security is now a first-class use case. Autonomous vulnerability discovery and patching is one of the highest-value agent applications yet demonstrated. The model found vulnerabilities without human steering, wrote exploits autonomously, and in some cases chained multiple bugs together — all capabilities that transfer directly to non-security agent tasks. This autonomous security capability underscores the concerns raised in our agent safety trust gap analysis, which found that only 14.4% of agents receive full security approval before deployment.
The defensive vs. offensive AI debate gets real. Anthropic is explicitly withholding Mythos from general release while deploying it defensively. This sets a precedent for how frontier AI companies might gate access to especially powerful capabilities.
Cross-industry AI security coalitions become the norm. The participation of virtually every major tech company signals that AI-powered cybersecurity is shifting from competitive differentiator to shared infrastructure problem.
Open source maintainers get AI-powered help. The $4M in donations and access program means that resource-constrained open-source projects — which power the vast majority of modern software — can now benefit from frontier AI vulnerability detection.

The Bottom Line

Project Glasswing is the most significant AI security initiative to date — not just because of Claude Mythos Preview’s capabilities, but because of the unprecedented breadth of industry alignment around a defensive AI deployment model. The partnership roster reads like a who’s-who of global technology: every major cloud provider (AWS, Google Cloud, Microsoft Azure), every major chipmaker (Apple, Broadcom, NVIDIA), every major security vendor (Cisco, CrowdStrike, Palo Alto Networks), and the world’s largest financial institution (JPMorganChase).

Anthropic has committed to reporting publicly within 90 days on vulnerabilities fixed, lessons learned, and practical recommendations for how security practices should evolve in the AI era. If Project Glasswing succeeds at its stated goals, it could fundamentally reshape how the industry approaches software security — from reactive patching to proactive, AI-driven vulnerability discovery at scale.

For agent builders, the key takeaway is clear: the frontier of autonomous agent capability is advancing faster than most expected. If a model can autonomously find a 27-year-old vulnerability in OpenBSD, it can autonomously handle far more than most production agent systems ask of it today.

Block Open-Sourced Goose: How a YAML Recipe File Scaled an AI Agent to 60% of the Company

2026-05-27T12:00:00+00:00

Block — the parent company of Square, Cash App, and Afterpay — released Goose as an open-source AI agent in early 2025. A year later, the tool has 44,000+ GitHub stars, 368+ contributors, and 2,600+ forks. But the headline number isn’t open-source adoption. It’s internal adoption: roughly 60% of Block’s ~12,000 employees use Goose weekly, spanning 15 different job profiles — engineering, sales, design, product, and customer success.

The question that follows is obvious: how does a single CLI tool serve both an engineer debugging a flaky test and a product manager triaging a Jira ticket?

The answer is a 30-line YAML file.

The Architecture: Local Agent, MCP Tools, Recipe Workflow

Goose is a Rust binary that runs entirely on the user’s machine. It connects to any major LLM provider — Anthropic, OpenAI, Gemini, Mistral, xAI, or a local model via Ollama — and uses MCP (Model Context Protocol) servers as its tool surface. The architecture has three layers that each evolve independently:

The agent runtime — a core loop (plan → call tools → evaluate → repeat) that stays generic.
The extension system — every tool is an MCP server. Adding GitHub access, Jira integration, or an internal API is a config entry, not a code change.
The recipe — a YAML document bundling instructions, required extensions, parameters, and the prompt into a single shareable file.

The separation is deliberate. The agent doesn’t decide which tools to load — the recipe does. The agent doesn’t free-form its way through the task — the recipe provides a numbered sequence with checkpoints.

What a Recipe Looks Like

A recipe is a YAML file that any team member can author. Here’s an abridged example for reviewing a GitHub pull request:

name: review-pr
description: Review a GitHub PR for risk areas
params:
  pr_url:
    type: string
    description: The GitHub PR URL to review.
extensions:
  - name: developer
  - name: github
    args: ["-y", "@modelcontextprotocol/server-github"]
prompt: |
  1. Fetch the PR diff and the list of changed files.
  2. For each file, identify: behavior changes, new dependencies,
     missing tests, anything that looks rushed.
  3. Group findings by severity: must-fix, should-fix, nit.
  4. Post a single review comment with the grouped findings.

The recipe is run with a single command:

goose recipe run review-pr --params pr_url=https://github.com/org/repo/pull/42

The params block makes the recipe a function — you call it with different inputs instead of writing one per task. The extensions block loads MCP servers dynamically for the duration of the run and discards them afterward. The numbered prompt steps act as a planning skeleton — the agent doesn’t reinvent the workflow each time.

Why This Pattern Scaled

The recipe format is the architectural breakthrough that explains the adoption number. A YAML file is a thing a product manager can author. They can copy a recipe a teammate wrote, change the prompt, run it, and see what happened — no deploy, no code review, no engineering handoff.

For engineers, the value is different: a recipe is a committable artifact that lives in the same repo as the code it operates on. A team’s review workflow sits at recipes/review-pr.yaml next to the service code. New hires read the recipe to understand the workflow. Changes get reviewed like any other artifact.

The MCP extension layer is the multiplier. Every new internal capability is a one-time MCP server build, and then it’s available to every recipe. Block doesn’t write a separate “PR review agent” and “ticket triage agent.” They write one Goose binary, then ship a directory of recipes and a directory of MCP servers. Composition does the rest. This MCP-based composition pattern is a core theme in our Ultimate Guide to Open Source AI Agent Frameworks.

Goose Is Now a Foundation Project

In a move that changes the risk calculus for enterprise adoption, Goose has moved from Block’s governance to the Agentic AI Foundation under the Linux Foundation. The tool is now a community-governed project — no single company controls its roadmap.

The implications are significant. The governance risk that held back enterprise teams (“what if Block stops investing?”) is gone. Recent community activity points toward a public recipe registry, tighter MCP server interoperability, and richer parameter types including file uploads and structured objects.

What This Means for the Industry

Goose’s recipe pattern is the strongest signal yet that the future of enterprise AI agents is not about better models — it’s about workflow abstractions that non-engineers can author. The recipe is an architectural pattern, not a Goose-specific feature. The same shape works on top of Claude Code skills, Cursor agents, or any runtime that supports YAML-defined workflows and MCP tools. For a broader view of how open-source agent tools are evolving, see our Top 20 Open Source AI Agent Tools ranking.

The takeaway for any team building internal agent platforms: if your system doesn’t have an analog for the recipe, you’re going to end up with bespoke agent builds per team. The recipe is what lets one tool serve a 12,000-person company without forking.

The boring abstraction — a YAML file with a name, a prompt, and an extension list — is how you reach 60% of the company.

Hermes Agent Post-Foundation Sprint: Dashboard OAuth, Kynver Memory, Qwen 3.7-Max, and 30+ Merged PRs

2026-05-27T12:00:00+00:00

Just 11 days after the massive v0.14.0 “Foundation” release, the Hermes Agent team is showing no signs of slowing down. Today, May 27, saw a coordinated batch of 30+ merged pull requests ship across the entire stack — from infrastructure and auth to new model support and security tooling.

The numbers tell the story: the repo has climbed from 155K to 169.5K stars (+14,500 in 11 days), while forks have surged from 24,980 to 28,216, and open issues have grown to 14,219 — reflecting a community that’s not just watching but building.

169.5K

GitHub Stars

28.2K

Forks

14.2K

Open Issues

30+

PRs Merged Today

Here’s what landed in today’s sprint.

The most user-facing change in today’s batch is the Dashboard OAuth login flow. Previously, dashboard users had to configure their provider credentials manually through config files. Now the dashboard supports a full OAuth login flow — operators can log in through their identity provider directly from the dashboard UI.

The implementation is backed by the new dashboard.public_url config option (commit by @benbarclay), which allows operators behind reverse proxies to set the absolute base URL for OAuth callbacks. This fixes a common pain point for self-hosted deployments behind nginx, on-prem ingress controllers, and custom-domain Fly.io setups where X-Forwarded-Host headers aren’t reliably forwarded.

“When set, it is the complete authority — scheme + host + optional path prefix — and becomes the base for the OAuth redirect_uri.” — Commit message on HERMES_DASHBOARD_PUBLIC_URL

The config follows a clean precedence chain: env var > config.yaml > auto-detected from request headers, matching the existing dashboard.oauth.client_id pattern.

🧠 Kynver Memory Provider + AgentOS Bridge (#33158)

Memory is one of the most critical subsystems in any self-improving agent, and today Hermes gained a new backend. PR #33158 adds the Kynver memory provider alongside an AgentOS bridge.

Kynver is a specialized memory substrate for AI agents, offering persistent, queryable storage optimized for agentic workloads. The AgentOS bridge means Hermes can now leverage AgentOS-compatible memory tools and infrastructure. This is a significant expansion of Hermes’ already rich memory ecosystem, which previously depended on filesystem-based, vector-store, and other backends.

🤖 Qwen 3.7-Max Joins the Model Catalog (#32806, #33129)

Two PRs today add Qwen 3.7-Max — Alibaba’s latest frontier model — to Hermes’ model catalogs. PR #32806 adds it to the Alibaba provider list, while #33129 adds it to the alibaba-coding-plan catalog.

Qwen 3.7-Max has been making waves in the open-source AI community for its strong reasoning capabilities and competitive benchmark scores. Hermes users on the Alibaba provider can now select it via hermes model and start building agents with it immediately.

🔌 API Server Session Controls (#33134, #29302)

The API server — Hermes’ HTTP interface for programmatic access — gets a major upgrade with session control APIs. PR #33134 (salvaging #29302) introduces endpoints for:

Session management — create, list, and manage active sessions
Chat endpoints — send messages, retrieve conversation history
Fork support — branch a session into a new independent context
SSE streaming — real-time event streaming for live agent responses

This transforms the API server from a basic HTTP interface into a full-featured agent interaction platform — enabling custom UIs, CI/CD integrations, and programmatic agent orchestration.

🛡️ Security Plugins: Pattern-Matched Code Warnings (#33131)

A new plugin category lands today: security-guidance plugins. PR #33131 introduces a system that pattern-matches against dangerous code patterns in agent-written code and surfaces warnings before the code is executed.

This is especially important for self-improving agents that write and execute their own code — Hermes’ core value proposition. The security-guidance plugin catches common dangerous patterns (unsafe eval(), file-system traversal, shell injection vectors) and flags them with actionable remediation hints.

🛠️ Codex Reliability Cluster

A significant portion of today’s merged PRs focus on Codex (GitHub Copilot) provider reliability — the workhorse backend for many Hermes users:

Credential pool sync on re-auth (#33074) — fixes a bug where Codex re-authentication via hermes setup / hermes model would write fresh OAuth tokens but leave the credential pool holding stale entries, causing 401 errors on every subsequent request
Foreign-issuer reasoning on replay (#33156, salvaging #31629) — prevents HTTP 400 invalid_encrypted_content errors when switching between model providers mid-conversation (e.g., from Grok to GPT-5.5)
Transient rs_tmp reasoning state (#33146) — drops stale temporary reasoning items that could accumulate and cause failures
Null output stream handling (#33008, #33050) — normalizes response.output=None to empty lists, preventing iteration crashes
Silent-hang workaround hints (#33133, #33034) — improved user-facing hints when ChatGPT silent-hang scenarios are detected
Homebrew CI poller nudges (#33142) — the terminal tool now detects anti-pattern CI polling scripts and nudges users toward canonical green-CI snippets

💬 Telegram UX Cleanup

A cluster of three PRs addresses Telegram operational noise:

#31034 — quiets operational chatter in Telegram gateway
#31098 — ignores /start platform pings on Telegram
#31941 — hides compaction status noise

These are small but important UX improvements — reducing noise in Telegram channels where Hermes operates as a bot makes the conversation feel more natural and less “robotic.”

📊 What This Sprint Means

Eleven days after the Foundation release, Hermes Agent’s development velocity is accelerating:

The dashboard is becoming a real product — OAuth login and session control APIs point to Hermes evolving beyond a CLI-only tool into a platform with proper web UI and API access layers
Memory diversity is growing — the Kynver + AgentOS bridge means Hermes can plug into more enterprise and research-grade memory substrates
Security is front-and-center — pattern-matched security plugins for code writing is a direct response to the unique risks of self-improving agents
Daily reliability compounding — the Codex cluster alone fixes 7+ distinct failure modes that real users were hitting

The pace is remarkable: 30+ PRs merged in a single day, spanning infrastructure (auth, config, API), models (Qwen 3.7-Max), memory systems, security, and reliability. If the Foundation release was about surface area, this sprint is about depth — making every subsystem more reliable, more secure, and more capable. The project’s momentum mirrors what we’ve documented across the broader Hermes Agent community ecosystem, which has grown to 276 documented use cases and 165K GitHub stars.

With 169.5K stars and counting, Hermes Agent continues to be the fastest-growing open-source agent framework — and if today’s sprint is any indication, the next release (v0.15.0?) will be worth the wait.

Ultimate Guide to Open Source AI Agent Frameworks in 2026

2026-05-27T10:00:00+00:00

The open-source AI agent framework landscape in 2026 is both richer and more turbulent than it was even twelve months ago. The year began with two major transitions: Microsoft moved AutoGen into maintenance mode and merged it with Semantic Kernel into the new Microsoft Agent Framework (GA April 2026), while OpenAI archived its experimental Swarm library and redirected users to the production-grade Agents SDK. LangGraph hit 1.0 GA. CrewAI crossed the 1.0 threshold. And TypeScript-native frameworks like Mastra and Vercel AI SDK surged past 20,000 GitHub stars, proving that the agent revolution is not Python’s alone. For context on how these frameworks fit into the broader agent landscape, see our Complete Guide to AI Agents.

This guide is for developers and technical leaders who need to cut through the noise. We compare eight frameworks across eight criteria — language support, agent types, key features, learning curve, production readiness, best use case, GitHub stars, and 2026 momentum — with deep dives into each. The goal is not to crown a winner but to help you choose the right tool for your use case, team, and stack.

Quick links: Comparison Table · LangChain / LangGraph · AutoGen / AG2 · CrewAI · OpenAI Agents SDK · Haystack · Semantic Kernel · Mastra · Vercel AI SDK · How to Choose · FAQ

Comparison Table

The table below compares all eight frameworks across eight essential dimensions. Star counts are approximate and sourced from GitHub and third-party trackers as of early June 2026. Production readiness reflects consensus across multiple independent comparisons, not vendor claims.

Framework	Language(s)	Agent Types	Key Features	Learning Curve	Production Readiness	Best Use Case	~ GitHub Stars
LangChain / LangGraph	Python, JavaScript	Single, Multi, Hierarchical, Swarm	Stateful graphs, checkpointing, memory, human-in-the-loop, LangSmith tracing	Advanced	Mature	Complex stateful workflows, enterprise orchestration	137k / 33k
AutoGen (AG2)	Python	Multi, Conversational, GroupChat	Event-driven, async messaging, code execution sandboxes	Intermediate	Maintenance Mode	Legacy multi-agent research systems (use MAF for new builds)	~48k
CrewAI	Python	Multi, Hierarchical, Role-based	Role/Goal/Backstory agents, sequential & hierarchical processes, Flows, MCP native	Beginner	Stable	Rapid multi-agent prototyping, marketing automation	~38k
OpenAI Agents SDK	Python, TypeScript	Single, Multi (handoff)	Handoff delegation, guardrails, tracing, sandboxed execution, provider-agnostic (100+ LLMs)	Beginner	Stable	Delegation chains, TypeScript/Next.js teams, rapid prototyping	~19k
Haystack	Python	Single, Multi (pipeline agents)	Typed pipelines, 50+ document stores, RAG-native, multimodal, agentic pipelines	Intermediate	Stable	Production RAG, semantic search, question answering	~22k
Semantic Kernel	.NET, Python, Java	Single, Multi, Planner	Enterprise SDK, Azure integration, OpenTelemetry, A2A protocol, plugin architecture	Intermediate	Mature	.NET enterprise teams, Azure-native AI applications	~28k
Mastra	TypeScript	Single, Multi, Graph-based	Graph workflows (then/branch/parallel), RAG, MCP, evals, 4-tier memory, 81+ providers	Intermediate	Stable	TypeScript-native production agents, integrated framework	~21k
Vercel AI SDK	TypeScript, JavaScript	Single, Multi (tools-based)	Streaming, React hooks, 2.8M weekly downloads, Next.js native, provider-agnostic	Beginner	Mature	Web app AI features, React/Next.js teams, chatbots	~20k

A note on star counts: Star counts are a lagging indicator of community size — not of production readiness. LangGraph has roughly one-quarter the stars of LangChain but more verified enterprise deployments. AutoGen has ~48k stars but is in maintenance mode. Choose by mental model and production track record, not by GitHub popularity. For a practical, ranked view of the tools built on these frameworks, see our Top 20 Open Source AI Agent Tools guide.

Deep Dives

1. LangChain / LangGraph

The most mature agent ecosystem, for teams that need ultimate control.

LangChain is the granddaddy of the open-source LLM application ecosystem — 137,000 GitHub stars, 3,900+ contributors, and 281,000 dependent repositories as of mid-2026. But for agents specifically, LangGraph is the framework that matters. Released as a standalone library in 2024 and reaching 1.0 GA in October 2025, LangGraph models agent behavior as a directed state graph: nodes are computation steps, edges are conditional transitions, and checkpointers provide persistent state with Postgres or Redis backends.

LangGraph’s power lies in its explicitness. You define every state transition. You can pause workflows mid-execution for human approval. You can rewind to any checkpoint during debugging. This makes it the go-to for enterprises — confirmed deployments include Klarna (853 employee-equivalent agents, saving $60M), Uber (~21,000 developer-hours saved), LinkedIn, Cisco, JPMorgan, and Elastic. LangSmith provides monitoring and tracing for observability at scale.

The trade-off is complexity. LangGraph has a steep learning curve — expect a multi-day ramp-up before you’re productive. For teams that don’t need stateful orchestration, LangGraph is overkill. But for production systems where failure is expensive and audit trails are mandatory, nothing else in the open-source ecosystem matches its depth.

2026 momentum: Deep Agents (launched March 2026) adds built-in planning, filesystem-based context management, and sub-agent spawning on top of LangGraph — pushing it further toward batteries-included, without sacrificing the underlying graph model.

2. AutoGen / AG2

Microsoft’s multi-agent pioneer — now in maintenance mode.

AutoGen was the framework that sparked the multi-agent revolution. Originally from Microsoft Research, it introduced event-driven, conversational multi-agent systems where agents collaborate through message-passing rather than rigid pipelines. At its peak in 2025, AutoGen amassed ~55,000 GitHub stars and proved that multi-agent setups could outperform single-agent solutions on benchmarks like GAIA.

But 2026 brought a major reset. In early 2026, Microsoft announced that AutoGen was entering maintenance mode — bug fixes only, no new features. The team merged AutoGen’s orchestration ideas with Semantic Kernel’s production infrastructure into the Microsoft Agent Framework (MAF), which reached 1.0 GA on April 3, 2026. MAF ships as a unified SDK for .NET and Python under Microsoft.Agents.AI, with Semantic Kernel as the foundation layer and AutoGen-style graph workflows on top.

The community fork AG2 continues AutoGen development independently, but its long-term trajectory is uncertain. For new projects, the unambiguous guidance from Microsoft and independent analysts is: start with MAF, not AutoGen.

Best remaining use case: Teams with existing AutoGen 0.2 or 0.4 deployments that aren’t ready to migrate, or researchers who need AutoGen’s specific conversational multi-agent paradigm for academic work. For everyone else, the migration path leads to MAF or to alternative frameworks.

3. CrewAI

The simplest path to multi-agent orchestration.

If LangGraph is a precision instrument, CrewAI is a power tool for multi-agent workflows. The framework’s mental model is intuitive: define agents with roles, goals, and backstories, then assign them tasks in a sequential or hierarchical process. A working multi-agent crew can be scaffolded in under 10 minutes via the CLI — the fastest time-to-value of any framework in this comparison.

CrewAI hit 1.0 GA in October 2025 and has since added significant capabilities: CrewAI Flows (event-driven workflows with @start, @listen, and @router decorators for complex branching), native MCP server support (v1.10.x), and streaming tool calls. The crewAIInc/crewAI repository has grown to approximately 38,000 GitHub stars.

CrewAI’s strength is accessibility. The role-based abstraction maps naturally to how teams think about delegation — researcher, writer, reviewer — making it popular for content generation pipelines, marketing automation, customer service triage, and rapid prototypes. The company claims ~1.4 billion automations per month and 60% Fortune 500 adoption (though these figures are not independently audited).

The trade-off: CrewAI is less suited for deeply stateful, long-running agents that need persistent memory across sessions. For those use cases, LangGraph’s checkpointing model is more appropriate. But for teams that need to orchestrate multiple agents quickly without building a state machine from scratch, CrewAI remains the most approachable option.

4. OpenAI Agents SDK

Lightweight, official, and provider-agnostic.

OpenAI’s Agents SDK, released in March 2025, is the successor to the experimental Swarm library (archived March 2025). It’s a minimalist, open-source toolkit built around a single elegant primitive: the handoff. Agents can delegate tasks to other agents, enabling triage-and-specialist architectures with remarkably little code.

Despite the name, the SDK is provider-agnostic — it works with 100+ LLMs, not just OpenAI models. It ships for both Python (v0.17.3 as of May 2026) and TypeScript (v0.8.3 as of April 2026), making it one of the few frameworks with first-class support in both ecosystems. Key features include built-in guardrails (parallel input validation that halts on failure), tracing via OpenAI’s observability platform, and — as of April 2026 — sandboxed code execution with providers like Modal, E2B, Cloudflare, and Vercel.

The Agents SDK has approximately 19,000 GitHub stars (Python repo) and roughly 10.3 million monthly PyPI downloads. Its learning curve is the gentlest of any framework here: if you can write a Python function and decorate it as a tool, you can build an agent.

Best for: Teams that want a lightweight, official toolkit without the abstraction overhead of LangChain or CrewAI. Particularly strong for delegation-heavy workflows (triage → specialist → response) and for TypeScript/Next.js teams that want the same SDK across frontend and backend. The main limitation is that the handoff model, while elegant, is less expressive than LangGraph’s state graphs for complex branching and looping workflows.

5. Haystack

The RAG specialist with growing agent capabilities.

Haystack, built by Berlin-based deepset (~$45.6M raised), occupies a distinct niche: it is the framework you choose when retrieval quality is the primary constraint. While other frameworks treat RAG as a feature, Haystack was built around it from day one. Its pipeline architecture — where components (Retriever, Ranker, PromptBuilder, Generator) are composed into typed, directed graphs — maps directly to the structure of production search and question-answering systems.

The Haystack 2.x rewrite modernized the framework significantly, adding agentic pipelines, multimodal support (text + images), and a growing component ecosystem. With approximately 22,000 GitHub stars and an Apache 2.0 license, Haystack provides 50+ document store integrations, hybrid retrieval strategies, and a REST API for deployment.

Haystack’s agent capabilities are structured as “agentic pipelines” — agents that can reason, use tools (including Haystack components as tools), and iterate within the pipeline framework. This is a different mental model from LangGraph’s freeform graphs or CrewAI’s role-playing, but it’s well-suited for use cases where the primary workflow is retrieval-centric and agents assist within that pipeline.

Best for: Production RAG systems where retrieval precision and pipeline predictability matter more than open-ended agent autonomy. Teams building semantic search, enterprise knowledge management, or customer support chatbots with grounded answers. Not the best choice for purely conversational multi-agent systems where retrieval is secondary. Deepset also offers a managed cloud platform (Haystack Enterprise) for teams that want a hosted solution.

6. Semantic Kernel

The enterprise-grade .NET SDK — now part of Microsoft Agent Framework.

Semantic Kernel (SK) is Microsoft’s open-source AI orchestration SDK, purpose-built for the .NET ecosystem with additional support for Python and Java. As of mid-2026, it has approximately 28,000 GitHub stars and has become the default answer for enterprise .NET teams asking “how do we build AI agents without leaving our stack?”

SK’s architecture centers on plugins (reusable AI functions, equivalent to tools in other frameworks), planners (agents that chain plugins to accomplish goals), and memories (vector-backed semantic storage). It is model-agnostic, supporting OpenAI, Azure OpenAI, Anthropic, Google, and local models. Enterprise features include OpenTelemetry integration for observability, Azure AI Foundry deployment, and support for Google’s Agent-to-Agent (A2A) protocol for cross-framework interoperability.

The major 2026 development was SK’s merger with AutoGen into the Microsoft Agent Framework (MAF) 1.0, which shipped GA on April 3, 2026. In MAF, SK provides the production foundation (stability, telemetry, enterprise integration) while AutoGen contributes the multi-agent orchestration patterns. The unified Microsoft.Agents.AI SDK ships as first-class packages for both .NET and Python with identical API shapes.

Best for: .NET and C# enterprise teams building AI agents on Azure. Organizations that need deep integration with the Microsoft ecosystem — Azure AI Foundry, Microsoft 365, Power Platform. Teams that value long-term support stability over cutting-edge experimentation. The main limitation is that SK’s Python experience, while solid, is secondary to its .NET-native design — Python-first teams may find other frameworks more idiomatic.

7. Mastra

The TypeScript-native contender with graph-based workflows.

Mastra represents the new wave of TypeScript-first agent frameworks. Built by the team behind Gatsby (YC W25, $13M seed round), Mastra provides an integrated, opinionated framework where agents, workflows, RAG, memory, and evaluation live in a single coherent package — no stitching together of separate libraries required.

Mastra’s workflow engine models agent orchestration as composable graphs with then(), branch(), and parallel() primitives, plus suspend/resume for human-in-the-loop patterns. The .network() method turns any agent into a router that delegates to sub-agents. Memory is structured across four tiers: message history, working memory, semantic recall, and RAG — a more comprehensive model than most Python frameworks offer. MCP support is built-in, and Mastra connects to 81 providers covering 2,436+ models via the Vercel AI SDK.

With approximately 21,000 GitHub stars and accelerating adoption (Replit used Mastra to improve Agent 3 task success from 80% to 96%), Mastra is carving out a position as “the most complete TypeScript agent framework.” The framework has attracted enterprise users including Marsh McLennan (75,000 employees) and SoftBank.

Best for: TypeScript teams that want a single integrated framework rather than assembling agents from separate libraries. Developers who value type safety, IDE autocomplete, and Zod schema validation in their agent pipelines. Projects where observability (built-in tracing and eval harness) and MCP connectivity are requirements from day one. The main limitation is that Mastra’s ecosystem is younger than the Python equivalents — fewer community contributions, third-party integrations, and Stack Overflow answers — though the trajectory is steeply upward.

8. Vercel AI SDK

The web developer’s agent toolkit — massive adoption, streaming-first.

The Vercel AI SDK is not a traditional “agent framework” in the same sense as LangGraph or CrewAI, but it is the most downloaded TypeScript AI toolkit by an enormous margin: 2.8 million weekly npm downloads and approximately 20,000 GitHub stars. Built by the creators of Next.js, the SDK is designed to add AI features to web applications with minimal friction.

Its architectural philosophy is streaming-first. The useChat and useCompletion React hooks handle the full lifecycle of AI interactions — streaming responses, tool calls, loading states, and error handling — with a few lines of code. The SDK is provider-agnostic, supporting OpenAI, Anthropic, Google, Mistral, and dozens of others through a unified interface. As of 2026, the SDK has grown agentic capabilities: the generateText and streamText functions support tool calling, multi-step reasoning, and structured output generation, effectively enabling single-agent workflows.

The SDK’s agent capabilities are lighter than dedicated frameworks — you won’t find built-in multi-agent orchestration, persistent memory, or human-in-the-loop checkpointing. But for the most common use case in web applications — an AI feature that uses tools, streams responses, and generates structured data — the Vercel AI SDK is faster to implement than any Python alternative.

Best for: React and Next.js developers adding AI chat, tool use, or structured generation to web applications. Projects where streaming UX is a priority. Teams that want the DX advantages of TypeScript-native tooling. Not the right choice for complex multi-agent systems or backend-only agent deployments — pair it with LangGraph.js or Mastra for those cases.

How to Choose a Framework

The right framework depends on your team’s language, your use case’s complexity, and your production requirements. Here’s a practical decision guide:

Choose by Language

Python-first team → LangGraph, CrewAI, or OpenAI Agents SDK. LangGraph for complex stateful workflows, CrewAI for rapid multi-agent prototyping, OpenAI Agents SDK for lightweight delegation chains.
TypeScript/JavaScript-first team → Mastra or Vercel AI SDK. Mastra for full agent applications with RAG and evals, Vercel AI SDK for web app AI features.
.NET enterprise team → Semantic Kernel / Microsoft Agent Framework. The only framework with first-class .NET support and deep Azure integration.
Mixed Python + TypeScript → OpenAI Agents SDK (ships for both) or LangGraph (LangGraph.js for TypeScript).

Choose by Complexity

Simple single-agent with tools → OpenAI Agents SDK or Vercel AI SDK. Both minimize boilerplate for the most common agent patterns.
Multi-agent orchestration, fast prototype → CrewAI. Role-based abstractions get you from idea to working crew in under 10 minutes.
Complex stateful workflows, human-in-the-loop → LangGraph. The checkpointing and graph model are unmatched for production-critical systems.
RAG-first with agent augmentation → Haystack. When retrieval quality is more important than agent autonomy.

Choose by Production Posture

Enterprise, long-term support → LangGraph or Semantic Kernel/MAF. Both have 1.0 releases, stable APIs, and verified enterprise deployments.
Startup, rapid iteration → CrewAI or Mastra. Fastest time-to-value, growing fast, backed by venture funding.
Lightweight, low commitment → OpenAI Agents SDK. Minimal dependencies, works with any LLM provider.

Special Cases

Building on Azure → Microsoft Agent Framework (Semantic Kernel + AutoGen merged). Native Azure AI Foundry deployment, Entra ID auth, OpenTelemetry.
Building on Vercel/Next.js → Vercel AI SDK. Native streaming, React hooks, Edge runtime support.
Building a coding agent → Consider LangGraph Deep Agents for planning + sub-agent spawning, or Mastra for TypeScript-native tool execution.

If you’re migrating from AutoGen: The official successor is the Microsoft Agent Framework (GA April 2026). For teams not on the Microsoft stack, CrewAI or LangGraph are the most common migration targets based on community discussion.

Frequently Asked Questions

Which framework has the most GitHub stars?

LangChain has the most stars at approximately 137,000, followed by AutoGen (~48k) and CrewAI (~38k). However, star counts are not a reliable measure of production readiness — LangGraph (~33k stars) has more verified enterprise deployments than any framework with more stars. AutoGen, despite ~48k stars, is now in maintenance mode.

What’s the difference between LangChain and LangGraph?

LangChain is the broader ecosystem — a platform for building LLM applications with chains, agents, tools, and output parsing. LangGraph is a specific library within that ecosystem focused on stateful, graph-based agent orchestration with checkpointing and human-in-the-loop patterns. For agent-specific work in 2026, LangGraph is the recommended starting point within the LangChain ecosystem.

Is AutoGen dead?

Not dead, but in maintenance mode as of early 2026. Microsoft is no longer adding features to AutoGen and has merged its orchestration concepts into the Microsoft Agent Framework (MAF). The community fork AG2 continues development, but for new projects, MAF or alternative frameworks are recommended.

Which framework is best for beginners?

CrewAI has the gentlest learning curve for multi-agent systems — its role-based abstraction is intuitive and the CLI scaffolds working crews in minutes. For single-agent applications, the OpenAI Agents SDK is similarly approachable with minimal boilerplate.

Can I use these frameworks with local/open-source models?

Yes, all eight frameworks are provider-agnostic or support multiple providers. LangGraph, OpenAI Agents SDK, Mastra, and Vercel AI SDK all support 80+ LLM providers including local models via Ollama, vLLM, or similar. Haystack and Semantic Kernel have built-in support for local models. CrewAI supports any model via LiteLLM integration.

Which framework is most “production-ready”?

LangGraph consistently ranks #1 in production-readiness across independent comparisons, with confirmed enterprise deployments at Klarna, Uber, Cisco, LinkedIn, JPMorgan, and Elastic. Semantic Kernel (via Microsoft Agent Framework) is the most production-ready option for .NET/Azure teams, with GA 1.0 guarantees and long-term support commitments.

Should I build agents in Python or TypeScript?

Python remains the dominant language for AI agent development, with the deepest ecosystem of frameworks, tools, and community resources. TypeScript is the fastest-growing alternative and is the better choice if your application stack is already JavaScript/TypeScript or if you’re building agents that integrate deeply with web applications. The gap is narrowing rapidly — frameworks like Mastra and Vercel AI SDK are closing the feature parity gap with their Python counterparts.

How do these compare to hosted platforms like Dify or LangSmith?

This guide focuses on code-first frameworks — libraries and SDKs you integrate into your own application. Hosted platforms like Dify (~143k stars, visual builder), LangSmith (observability), and deepset Cloud (managed Haystack) operate at a higher level of abstraction. They’re often complementary: you might build agents with LangGraph and monitor them with LangSmith, or use CrewAI for orchestration and Dify for the end-user interface.