🇫🇷 Cet article est aussi disponible en français.

MIT & Microsoft's Murakkab Cuts AI Agent Energy Use by 3.7× — The Infrastructure Breakthrough No One Is Talking About

MIT & Microsoft's Murakkab Cuts AI Agent Energy Use by 3.7× — The Infrastructure Breakthrough No One Is Talking About
🇫🇷 Cet article est aussi disponible en français.
📑 Table of Contents

TL;DR — MIT and Microsoft Research have unveiled Murakkab, a resource-efficient orchestration system for agentic AI workflows that slashes GPU usage by 2.8×, energy consumption by 3.7×, and cloud costs by 4.3× — all while maintaining service-level objectives. The system introduces a declarative abstraction that decouples what a workflow does from how it runs, enabling a profile-guided optimizer to dynamically reconfigure execution across models, hardware, and tool configurations. In a world where data centers are projected to consume over 1,000 TWh of electricity by 2026, Murakkab represents the first production-grade attempt to bring AI agents’ infrastructure footprint under control.


The Dirty Secret of Agentic Workflows

AI agents are eating the cloud — and they’re doing it with terrible table manners.

An agentic workflow is no longer a single LLM call. It’s a system of multiple autonomous agents, each invoking different models (GPT-5.5 for reasoning, Claude Opus 5 for coding, a cheaper model for summarization), calling external tools (APIs, databases, Python interpreters), and chaining these steps into multi-turn, multi-branch execution graphs. What used to be a simple prompt → response pipeline has become a tangled web of model invocations, tool calls, and routing decisions.

(Source: MIT News — Improving the speed and energy-efficiency of AI agents)

The problem is that each of these components — the models, the tools, the execution orchestrator — is managed by a different layer of the stack. The LLM provider handles model scaling. The cloud scheduler handles GPU allocation. The agent framework handles workflow routing. Nobody sees the full picture.

The result is massive resource waste. A 2025 study by the Lawrence Berkeley National Laboratory projected that AI data centers could consume up to 12% of total U.S. electricity by 2028. The IEA estimates that data centers already consume 415 TWh annually (~1.5% of global electricity), and the Brookings Institution warns that this could approach 1,050 TWh by 2026 — roughly the consumption of a mid-sized country.

(Source: IEA — Energy Demand from AI)

(Source: Brookings — Global Energy Demands Within the AI Regulatory Landscape)

“Agentic workflows are getting very complicated and quickly becoming the backbone of what cloud providers are doing,” says Gohar Chaudhry, lead author of the Murakkab paper. He’s not exaggerating — every major cloud provider has an agent framework now (Azure AI Agent Service, Google Agent Development Kit, AWS Bedrock Agents), and each one faces the same resource allocation problem.

(Source: MIT News — June 25, 2026)

Enter Murakkab: A Cross-Layer Optimizer for Agentic Systems

Murakkab (Arabic for “compound” or “composite” — a nod to the Compound AI Systems research at Microsoft) introduces two architectural principles that fundamentally change how agentic workflows are deployed:

1. Declarative Workflow Specification

Instead of hard-coding model choices, GPU allocations, and tool configurations into the workflow code, developers describe what the workflow should accomplish in plain language. Murakkab’s specification language decouples the application logic from the execution configuration.

This means a developer writes: “Classify this support ticket, route it to the right team, and draft a response” — and Murakkab’s optimizer decides which model handles classification (a small, fast one), which handles drafting (a more capable one), and on which hardware each step runs.

(Source: arXiv — Murakkab: Resource-Efficient Agentic Workflow Orchestration)

2. Adaptive, SLO-Aware Runtime

The system’s profile-guided optimizer continuously evaluates workflow performance against user-defined Service Level Objectives (SLOs) — typically a trade-off between accuracy, latency, cost, and energy. During each optimization epoch, the optimizer considers current workflow load and available resources (including spot instances) and dynamically reconfigures:

  • Which model variant serves each workflow step
  • How many GPU instances are allocated
  • Whether steps can be co-located on shared hardware
  • When to preempt low-priority tasks for high-priority ones

The key insight is that existing cloud schedulers and agent frameworks can’t do this because they operate in silos. The cloud scheduler doesn’t know that two different workflow steps could share a GPU without conflict. The agent framework doesn’t know that a cheaper model would satisfy the accuracy requirement for step #3. Murakkab sees the whole stack.

The Numbers: Real Efficiency Gains

The Murakkab paper reports results from testing on diverse agentic workflows — including code generation pipelines, data processing chains, and multi-model reasoning tasks. The headline numbers are striking:

Metric Improvement Context
GPU usage ↓ 2.8× Same workloads, fewer GPUs
Energy consumption ↓ 3.7× Directly addresses the AI energy crisis
Cloud cost ↓ 4.3× Across compute, inference, and orchestration
Workflow completion speed ↑ 3.4× Through better resource allocation
Energy efficiency ↑ 4.5× Per-unit-of-work metric

(Source: arXiv — Murakkab (2508.18298))

To put the 4.3× cost reduction in perspective: if an enterprise is spending $50,000/month on GPU inference for their agentic workflows, Murakkab brings that down to roughly $11,600/month — without sacrificing quality. For large-scale deployments running hundreds of concurrent agent workflows, the savings run into millions annually.

More importantly, the 3.7× energy reduction isn’t just about cost. At a time when every hyperscaler is scrambling to secure power for new data centers (Microsoft alone has signed PPAs for over 10 GW of renewable energy), efficiency gains of this magnitude directly translate to reduced carbon footprint and the ability to deploy more AI workloads without building new data centers.

How It Works Under the Hood

Murakkab’s optimization pipeline runs in cycles:

  1. Profile Collection — The system observes running workflows, collecting performance data on each model-tool-hardware combination: latency percentiles (p50, p95, p99), GPU memory footprint, token throughput, and energy draw.

  2. SLO Translation — User-defined objectives (“max 500ms latency for classification step,” “min 95% accuracy for code generation”) are translated into a multi-dimensional optimization problem.

  3. Configuration Search — The optimizer explores the Pareto frontier across accuracy, latency, cost, and energy. It uses global visibility across all running workflows to identify co-location opportunities — two workflows that both need a summarization step can share a single GPU instance, for example.

  4. Dynamic Reconfiguration — The adaptive runtime applies the optimal configuration, hot-swapping model endpoints and reallocating GPU resources without interrupting in-flight workflows.

What makes this novel is that existing systems treat each agentic workflow as an isolated black box. The workflow calls a model, the model provider serves it, the cloud bills for it. Murakkab opens the black box and reasons across the entire stack — something that neither Kubernetes, nor vLLM, nor any existing LLM orchestration platform can do today.

Why This Matters Now (And What It Means for the Ecosystem)

The timing of Murakkab’s release isn’t coincidental. Three converging trends make this research urgently relevant:

1. The Multi-Model Reality

In 2024, most AI agent deployments used a single model (usually GPT-4 or Claude). In 2026, every serious agentic workflow uses multiple models — a reasoning model for planning, a fast model for tool calling, a specialized model for code generation, and sometimes an open-weight model for sensitive data. Each model has different performance characteristics, cost profiles, and hardware requirements.

This multi-model pattern is now the norm, not the exception. The Forbes AI 50 2026 list shows that 78% of the top private AI companies ship multi-model agent systems. But manually tuning which model handles which task across dozens of workflow steps is a combinatorial nightmare. Murakkab automates this.

2. The Energy Reckoning

The AI industry is facing a credibility problem on sustainability. NVIDIA’s Vera CPU and Rubin GPU — both announced at GTC 2026 — promise significant efficiency improvements at the hardware level. But hardware efficiency alone won’t solve the problem if software keeps wasting cycles.

As the G7 AI Leaders Summit in June 2026 made clear, governments are increasingly scrutinizing AI’s environmental impact. France’s Minister for AI, Clara Chappaz, explicitly called for “efficiency-by-design” in AI infrastructure at the summit. Murakkab, as an open research system (arXiv paper, MIT license), represents exactly the kind of software-level efficiency breakthrough that policymakers are demanding.

3. Enterprise AI Adoption at Scale

The Gartner forecast for AI agent spending predicts that enterprise spending on agentic AI platforms will reach $41 billion by 2028. But the biggest barrier to adoption isn’t model capability — it’s cost predictability.

When a single agentic workflow can spin up 5 different model endpoints across 3 cloud regions and produce an unpredictable bill, CFOs get nervous. Murakkab’s SLO-driven cost optimization provides the kind of budget predictability that enterprises need to scale from pilot projects to production deployments.

Murakkab vs. Existing Solutions: Where It Fits

System Scope Cross-Layer Optimization Multi-Tenant Energy-Aware
vLLM / TGI Single model serving
Kubernetes + GPU operator Container orchestration
LangGraph / CrewAI Workflow logic
Murakkab Full-stack workflow

Murakkab doesn’t replace any of these systems — it sits on top of them. It uses existing model servers (vLLM, TGI) and container orchestration (Kubernetes), but adds a cross-layer optimization plane that reasons about the whole workflow, not just individual components.

The closest analog in the infrastructure world is something like Uber’s Peloton scheduler for big data workloads — a system that understood the full DAG of a Spark job and could optimize resource allocation globally. Murakkab does the same for agentic workflows.

Limitations and Open Questions

The Murakkab paper is clear-eyed about its current limitations:

  • Cold-start latency — The profile-guided optimizer needs a warm-up period to collect sufficient performance data. For new or infrequent workflow patterns, the system falls back to conservative defaults.
  • Model-agnostic, but not tool-agnostic — While Murakkab can optimize across different LLM providers, optimizing across tool providers (different API gateways, different database backends) is a harder problem because tool performance characteristics are less predictable than model inference.
  • Research prototype — Murakkab is a research system, not a product. The paper is from MIT and Microsoft Research, and while the code is open-source, there’s no managed service or enterprise support yet.
  • SLO definition burden — The declarative specification is powerful but requires developers to think explicitly about accuracy-latency-cost trade-offs. For teams used to “just call GPT-5.5 and see what happens,” this is a cultural shift.

FAQ

Q: Is Murakkab a commercial product or a research project?

It’s a research project from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) in collaboration with Microsoft Research. The paper was published on arXiv in August 2025, and the MIT News feature (June 25, 2026) marks its public debut. The code is open-source, but there’s no managed service.

Q: Can I use Murakkab with my existing LangGraph or CrewAI workflows?

In theory, yes — Murakkab is designed as an orchestration layer that sits below the workflow definition. However, integration requires implementing the declarative specification interface, which isn’t a drop-in replacement. Expect community adapters to emerge if the system gains traction.

Q: How does Murakkab compare to NVIDIA’s upcoming efficiency improvements on Vera/Rubin?

They’re complementary. NVIDIA’s hardware improvements (Vera CPU + Rubin GPU) optimize at the silicon level. Murakkab optimizes at the software orchestration level. Combined, they could deliver multiplicative efficiency gains — a 2× improvement in hardware efficiency × a 3.7× improvement in software efficiency = roughly 7× better overall.

Q: Does Murakkab work with open-weight models like Llama 4.5?

Yes. Murakkab is model-agnostic. It can route workflow steps to any model that exposes a compatible API, including self-hosted open-weight models. In fact, open-weight models benefit more from Murakkab’s optimization because their deployment characteristics (GPU memory, token throughput) vary more widely than closed-source API models.

Q: What’s the catch with the 4.3× cost reduction? Are there trade-offs?

The 4.3× figure represents the maximum observed reduction across tested workflows. Real-world results will vary depending on workflow complexity, the diversity of models used, and how aggressive the SLOs are. Stricter latency requirements reduce the optimization headroom. But even conservative configurations showed 2-3× cost reductions.


Further Reading