>_Skillful

MCP Server Performance Benchmarks: What the Numbers Actually Mean for Production

Latency, throughput, and resource consumption benchmarks across popular MCP servers, plus a methodology for testing that reflects real production conditions.

March 28, 2026 · Basel Ismail

Why Benchmarking MCP Servers Is Harder Than It Looks

Benchmarking MCP servers isn't like benchmarking a REST API. The protocol introduces its own overhead, tool invocation patterns vary wildly between agents, and most popular testing frameworks weren't designed with the Model Context Protocol in mind. If you run a naive throughput test, you'll get numbers that look great on a spreadsheet and mean almost nothing in a real agent loop.

The servers people actually use in production (the Filesystem, GitHub, Postgres, and Brave Search MCP servers, for example) have very different performance profiles depending on what the agent is asking them to do. A single tool call to read a file isn't the same as a tool call that triggers a database query with a large result set. Treating them as equivalent is where most benchmark comparisons go wrong.

The Methodology That Actually Reflects Production

Simulate Agent Behavior, Not HTTP Clients

The right way to benchmark an MCP server is to drive it the way an agent drives it. That means using the MCP SDK directly (the TypeScript or Python client), sending tools/call requests in sequence and in parallel, and measuring from the client's perspective, not from inside the server process.

A useful test harness looks roughly like this: spin up the server process, establish the stdio or SSE transport connection, send an initialization handshake, then run your tool call sequences. Measure wall-clock latency from the moment you send the JSON-RPC request to the moment you receive the complete response. Don't subtract transport overhead; that overhead is real and your agent pays it.
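
A minimal sketch of that measurement loop in Python. The SDK call is stubbed out here so the example is self-contained: fake_call_tool stands in for a real tool invocation through the MCP Python client (the sleep simulates server plus transport time), and the timing deliberately wraps the whole round trip.

```python
import asyncio
import time

async def fake_call_tool(name: str, arguments: dict) -> dict:
    """Stub standing in for a real MCP client tool call over stdio or SSE.
    The 2 ms sleep is a placeholder for server + transport time."""
    await asyncio.sleep(0.002)
    return {"content": [{"type": "text", "text": "ok"}]}

async def timed_call(name: str, arguments: dict) -> float:
    """Wall-clock latency from request send to complete response, in ms.
    Transport overhead is deliberately included -- the agent pays it too."""
    start = time.perf_counter()
    await fake_call_tool(name, arguments)
    return (time.perf_counter() - start) * 1000.0

async def main() -> list[float]:
    # Sequential baseline: one call at a time, 50 samples.
    return [await timed_call("read_file", {"path": "/tmp/example"}) for _ in range(50)]

latencies = asyncio.run(main())
print(f"samples={len(latencies)} fastest={min(latencies):.2f} ms")
```

In a real harness you would replace fake_call_tool with the SDK's session-level tool call after the initialization handshake; everything else stays the same.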

Three Workload Profiles Worth Testing

For any MCP server you're evaluating, run at least three workload profiles. First, single-threaded sequential calls, which shows you baseline latency per tool invocation. Second, concurrent calls with a parallelism factor of 4 to 8, which is realistic for agents that fan out tool calls. Third, a sustained load test over 5 to 10 minutes, which surfaces memory leaks and connection pool exhaustion that short tests miss entirely.
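
The three profiles can share one small runner. This sketch uses the same kind of stubbed tool call as above (fake_call_tool is a placeholder, not a real SDK call), with asyncio.Semaphore providing the parallelism factor for the concurrent and sustained profiles.

```python
import asyncio
import time

async def fake_call_tool(name: str, arguments: dict) -> None:
    await asyncio.sleep(0.002)  # placeholder for a real MCP tool call

async def timed(name: str, arguments: dict) -> float:
    t0 = time.perf_counter()
    await fake_call_tool(name, arguments)
    return (time.perf_counter() - t0) * 1000.0

async def sequential(n: int) -> list[float]:
    """Profile 1: single-threaded sequential calls (baseline latency)."""
    return [await timed("read_file", {}) for _ in range(n)]

async def concurrent(n: int, parallelism: int = 8) -> list[float]:
    """Profile 2: fan-out with a bounded parallelism factor."""
    sem = asyncio.Semaphore(parallelism)
    async def one() -> float:
        async with sem:
            return await timed("read_file", {})
    return list(await asyncio.gather(*[one() for _ in range(n)]))

async def sustained(seconds: float = 300.0, parallelism: int = 4) -> list[float]:
    """Profile 3: sustained load until a deadline; long runs surface
    memory leaks and pool exhaustion that short tests miss."""
    deadline = time.monotonic() + seconds
    results: list[float] = []
    while time.monotonic() < deadline:
        results.extend(await concurrent(parallelism, parallelism))
    return results
```

Run sustained with seconds=300 or more against a real server; the short profiles are cheap enough to run on every candidate.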

For each profile, collect p50, p95, and p99 latency. The p50 tells you what typical feels like. The p99 tells you what your agent experiences when things get slightly busy. In agentic workflows, a slow tool call blocks the entire reasoning loop, so p99 matters more than it does in traditional web services.
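
Computing those percentiles needs no dependencies; a nearest-rank implementation is enough for benchmark reporting:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the samples are less than or equal to it."""
    ordered = sorted(samples)
    k = math.ceil(p / 100.0 * len(ordered)) - 1
    return ordered[max(0, min(k, len(ordered) - 1))]

def report(samples: list[float]) -> dict:
    """The three numbers worth collecting for every workload profile."""
    return {p: percentile(samples, p) for p in (50, 95, 99)}
```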

Latency Numbers Across Popular Servers

The numbers below come from testing on an M2 MacBook Pro (local development) and a c6i.xlarge EC2 instance (4 vCPU, 8 GB RAM, Ubuntu 22.04), using the official server implementations from the MCP servers repository and community-maintained packages. All servers were run with default configuration unless noted.

Filesystem MCP Server

The official Filesystem server is about as fast as you would expect for local I/O. Reading a small file (under 10 KB) via read_file clocks in at roughly 2 to 5 ms p50 on local disk. Writing a file is similar. The p99 jumps to around 15 to 20 ms under concurrent load, which is acceptable for most use cases.

Where it gets interesting is with list_directory on large directories. A directory with 10,000 entries can push p50 latency above 80 ms because the server serializes the entire listing into a single JSON response. If your agent is navigating large codebases, this is worth profiling specifically. One mitigation is to use search_files with a pattern instead of listing and filtering client-side.
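
You can observe the serialization cost in isolation by timing json.dumps on a synthetic 10,000-entry listing. This only mimics the shape of the response (the actual server is TypeScript and the entry format differs), so treat the numbers as illustrative of the pattern, not a reproduction of the benchmark.

```python
import json
import time

# Synthetic stand-in for a 10,000-entry directory listing.
entries = [{"name": f"file_{i}.py", "type": "file"} for i in range(10_000)]

t0 = time.perf_counter()
payload = json.dumps({"entries": entries})  # one monolithic response body
elapsed_ms = (time.perf_counter() - t0) * 1000.0

print(f"{len(payload):,} bytes serialized in {elapsed_ms:.1f} ms")
```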

GitHub MCP Server

The GitHub MCP server is network-bound by definition. Tool calls like get_file_contents or search_repositories are essentially wrappers around the GitHub REST API, so your latency floor is GitHub's API response time plus the server's serialization overhead.

In practice, p50 latency for get_file_contents on a public repo runs 180 to 320 ms depending on GitHub's API latency at the time. The MCP server itself adds roughly 5 to 10 ms of overhead on top of that. Rate limiting is a more pressing concern than raw latency; the server doesn't implement automatic retry with backoff by default, so sustained agent workloads can hit 403s and stall.
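
One workaround is a client-side wrapper that retries with exponential backoff and jitter. A sketch, where RateLimited is a hypothetical stand-in for however your client surfaces a 403 or 429:

```python
import asyncio
import random

class RateLimited(Exception):
    """Hypothetical stand-in for a rate-limit error surfaced by the client."""

async def call_with_backoff(call, *, retries: int = 4, base_delay: float = 0.5):
    """Retry a rate-limited tool call with exponential backoff and jitter,
    since the server does not retry upstream API failures itself."""
    for attempt in range(retries + 1):
        try:
            return await call()
        except RateLimited:
            if attempt == retries:
                raise  # out of retries; let the agent see the failure
            # Exponential backoff (0.5s, 1s, 2s, ...) with 50-100% jitter
            # so concurrent sessions don't retry in lockstep.
            delay = base_delay * (2 ** attempt) * (0.5 + random.random() / 2)
            await asyncio.sleep(delay)
```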

Postgres MCP Server

The Postgres MCP server (the community mcp-server-postgres package) shows interesting behavior under concurrent load. For simple queries returning small result sets, p50 latency is around 8 to 15 ms over a local connection. That's mostly query execution time; the MCP overhead is minimal.

Large result sets are where things get expensive. A query returning 50,000 rows will serialize the entire result into a single JSON string before returning it to the agent. In testing, a 50,000-row result set pushed response size above 8 MB and p50 latency above 2,000 ms. The practical lesson is that your agent's prompts need to enforce LIMIT clauses, or you need a server wrapper that truncates results before serialization.
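
A sketch of such a wrapper: cap the rows before serialization and flag the truncation so the agent knows the result is partial and can refine its query. The field names here are illustrative, not part of any server's actual response schema.

```python
import json

def truncate_result(rows: list[dict], max_rows: int = 500) -> str:
    """Serialize at most max_rows rows into the tool response, recording
    the true row count so the agent can tell the result was cut off."""
    truncated = len(rows) > max_rows
    body = {
        "rows": rows[:max_rows],
        "row_count": len(rows),   # total rows the query actually returned
        "truncated": truncated,   # hint for the agent to add a LIMIT clause
    }
    return json.dumps(body)
```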

Brave Search MCP Server

Brave Search MCP is similar to GitHub in that it's network-bound. The Brave Search API typically responds in 300 to 600 ms, and the MCP server adds negligible overhead. The more relevant constraint is the API's rate limits, which at the free tier cap you at 1 request per second. For agents making multiple search calls per reasoning step, this becomes a hard throughput ceiling.

Throughput: What Concurrent Agents Actually See

Throughput for MCP servers is best expressed as tool calls per second, not requests per second, because a single agent session might issue many tool calls over a long-lived connection. When you run 8 concurrent agent sessions against the Filesystem server, you can sustain around 200 to 400 tool calls per second on a 4-core machine before CPU becomes the bottleneck.

The GitHub and Brave Search servers are rate-limited by their upstream APIs long before they hit any local resource ceiling. For the Postgres server, the database connection pool is typically the first constraint. The default configuration uses a single connection, which means concurrent tool calls queue behind each other. Configuring a pool of 5 to 10 connections improves throughput by roughly 3 to 4x under concurrent load.

stdio vs SSE Transport

Most MCP servers default to stdio transport, which is fine for single-agent use but creates a process-per-session model that doesn't scale horizontally. SSE (Server-Sent Events) transport allows multiple clients to connect to a single server process, which is significantly more efficient for multi-agent deployments.

In benchmarking, SSE transport adds about 1 to 3 ms of latency per call compared to stdio, but allows you to serve 10 to 20 concurrent agent sessions from a single server process instead of spawning 10 to 20 separate processes. For anything beyond a single developer workflow, SSE is the right choice.

Resource Consumption Patterns

Memory

The TypeScript-based MCP servers (Filesystem, GitHub, Brave Search) typically idle at 40 to 80 MB RSS. Under sustained load with large responses, that can climb to 150 to 300 MB due to V8's garbage collection behavior. The Python-based servers tend to have a lower baseline footprint but can grow faster under load, because CPython's allocator holds on to freed memory and rarely returns it to the operating system.

The Postgres server deserves special attention if you're returning large result sets. Each tool call that returns a large dataset will allocate memory for the full JSON serialization before sending it. Under concurrent load with large queries, it's straightforward to push a server process above 500 MB. Monitoring RSS over time during your load tests will catch this before it becomes a production incident.
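
A minimal Linux-only RSS sampler, reading VmRSS from /proc (it won't work on macOS, where you'd reach for psutil or ps instead). Pass the server's PID to sample the server process rather than the harness itself.

```python
import time

def rss_bytes(pid: str = "self") -> int:
    """Current resident set size of a process on Linux, in bytes.
    VmRSS in /proc/<pid>/status is reported in kB."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) * 1024
    return 0

def sample_rss(duration_s: float = 600.0, interval_s: float = 5.0,
               pid: str = "self") -> list[int]:
    """Poll RSS over a load test. A steady upward trend that never
    recovers after requests complete is the leak signature to watch for."""
    samples: list[int] = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append(rss_bytes(pid))
        time.sleep(interval_s)
    return samples
```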

CPU

For I/O-bound servers like GitHub and Brave Search, CPU usage is negligible, typically under 5% even under concurrent load. For the Filesystem server doing directory listings or the Postgres server serializing large result sets, CPU can spike significantly during the JSON serialization phase. A single large directory listing or large query result can peg a CPU core for 100 to 500 ms.

This matters because during that serialization spike, other concurrent tool calls queue up. If your agent issues several tool calls in parallel and one of them triggers a large serialization, the others will see elevated latency even if their own work is trivial.

What Numbers Actually Matter for Production Decisions

After running these benchmarks, a few metrics stand out as genuinely decision-relevant versus merely interesting.

P99 latency under your expected concurrent load is the number that predicts user-visible slowness in agent workflows. If your agent runs 4 tool calls per reasoning step and p99 is 500 ms, your agent loop is adding at least 500 ms of tail latency per step, compounding across multiple steps.

Memory growth over time, not peak memory, is what causes production incidents. A server that uses 200 MB at peak but returns to 80 MB after each request is fine. A server that grows from 80 MB to 400 MB over 2 hours of sustained load and never comes back down will eventually OOM in a containerized environment.

Throughput ceiling relative to your agent concurrency tells you whether you need to run multiple server instances. If your deployment runs 20 concurrent agent sessions and the server saturates at 10, you need horizontal scaling or a different architecture.

Error rates under load are often ignored in benchmarks but are critical in practice. Some servers start returning errors or timing out before they hit their CPU or memory ceiling. Run your load tests long enough to observe error rates, not just latency distributions.

Using Skillful.sh Security Scores Alongside Performance Data

Performance is only one dimension of MCP server evaluation. A server that's fast but has a C or D security grade on Skillful.sh, indicating dependency vulnerabilities or prompt injection risks, isn't a good production choice regardless of its benchmark numbers.

When comparing servers using Skillful.sh's side-by-side comparison tool, the adoption metrics (GitHub stars, directory presence count, download trends) give you a rough proxy for how battle-tested a server is. A server with a Mature adoption stage has likely had more eyes on edge cases and performance issues than a New-stage server, even if the New server looks faster in a controlled benchmark.

The practical workflow is to filter by security grade first (A or B for production use), then compare adoption stage, then run your own performance benchmarks against the shortlisted candidates under workloads that match your actual agent behavior. Benchmark numbers from someone else's test environment are a starting point, not a conclusion.

Running Your Own Benchmarks

The MCP TypeScript SDK includes enough client tooling to build a basic benchmark harness in an afternoon. The core loop is: connect via stdio or SSE, call initialize, then run your tool call sequences in a loop while recording timestamps. Libraries like autocannon aren't directly applicable, but you can adapt the measurement patterns.

For a more structured approach, the k6 load testing tool can drive SSE-transport MCP servers if you write a custom protocol handler. This gives you k6's built-in percentile reporting and time-series output without building your own statistics layer.

Keep your test scenarios close to what your agents actually do. If your agent reads files and then writes summaries, benchmark that sequence, not just isolated reads or isolated writes. The compound latency of realistic tool call sequences is almost always higher than the sum of individual tool call latencies because of connection state, caching behavior, and serialization patterns that only appear in sequence.

