Spaghetti Bench: AI Coding Agents vs Concurrency Bugs

The Blind Spot in Every AI Coding Benchmark

Spaghetti Bench, a new benchmark from CMU’s PASTA Lab, exposes a blind spot in how we evaluate AI coding agents: concurrency bugs. SWE-bench, the de facto standard for measuring coding agents, contains exactly one race condition among hundreds of tasks. That roughly 0.3% share masks a deeper problem: the tools we use to evaluate AI agents are structurally blind to the class of bugs behind some of the most insidious production failures. Published in February 2026, Spaghetti Bench tested six frontier coding agents on 39 Java concurrency bugs and found that, without specialized testing tooling, even the most capable models ship patches that look correct under normal test runs but silently fail under the specific thread interleavings that trigger the actual bug.

Why Standard Verification Fails for Concurrency

SWE-bench’s validation strategy is straightforward: run the failing test on the patched code, confirm it passes, and check that existing tests don’t regress. This works well for deterministic bugs such as null pointer exceptions, incorrect logic, and missing error handling. Concurrency bugs are a different animal entirely. Whether a race condition manifests depends on thread scheduling, which is nondeterministic by definition. A test for a concurrency bug can pass 100, 200, or 1,000 consecutive times while the code still harbors the defect; the problematic interleaving simply never occurred during those runs.
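
To see why, consider a minimal, hypothetical example (not drawn from the benchmark): a classic check-then-act race on lazy initialization. The stress loop below can complete many runs without ever hitting the bad interleaving, which is exactly the failure mode of test-based verification.

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example: a check-then-act race that repeated test runs can
// miss entirely, because the buggy interleaving depends on the scheduler.
public class LazyInitRace {
    static final AtomicInteger constructions = new AtomicInteger();
    static class Resource { Resource() { constructions.incrementAndGet(); } }

    static Resource instance;                    // no volatile, no lock

    static Resource get() {
        if (instance == null) {                  // check ...
            instance = new Resource();           // ... then act: two threads can both get here
        }
        return instance;
    }

    public static void main(String[] args) throws Exception {
        for (int run = 0; run < 1_000; run++) {
            instance = null;
            constructions.set(0);
            CyclicBarrier start = new CyclicBarrier(2);
            Runnable task = () -> {
                try { start.await(); get(); } catch (Exception e) { throw new RuntimeException(e); }
            };
            Thread a = new Thread(task), b = new Thread(task);
            a.start(); b.start(); a.join(); b.join();
            if (constructions.get() > 1) {       // the race fired: duplicate construction
                System.out.println("race observed on run " + run);
                return;
            }
        }
        // Reaching this line does not mean the code is correct; the bad
        // interleaving simply never happened in 1,000 tries.
        System.out.println("1,000 runs passed; the bug is still there");
    }
}
```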

Spaghetti Bench’s authors, Vasu Vikram, Ao Li, and Rohan Padhye, used Fray—a controlled concurrency testing platform for the JVM published at OOPSLA 2025—to systematically explore thread schedules and verify whether patches actually resolved the race condition. Fray deterministically simulates different interleavings and can find bugs within seconds that thousands of random test runs would miss.
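
The principle is easy to sketch. The toy below illustrates systematic schedule exploration, not Fray’s actual API or algorithm: enumerate every interleaving of two threads’ atomic steps and check an invariant after each one. Even this brute-force version finds the lost-update bug deterministically, where random stress testing may never hit it.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of controlled schedule exploration (NOT Fray's API):
// enumerate every interleaving of two threads' steps and check an invariant.
public class InterleavingExplorer {
    static int counter, local1, local2;   // shared state plus per-thread "registers"
    static int buggy, total;

    interface Step { void run(); }

    static void explore(List<Step> prefix, List<Step> a, int i, List<Step> b, int j) {
        if (i == a.size() && j == b.size()) {
            counter = 0;                      // fresh state for this schedule
            for (Step s : prefix) s.run();    // replay the full interleaving
            total++;
            if (counter != 2) buggy++;        // invariant: both increments applied
            return;
        }
        if (i < a.size()) {
            prefix.add(a.get(i));
            explore(prefix, a, i + 1, b, j);
            prefix.remove(prefix.size() - 1);
        }
        if (j < b.size()) {
            prefix.add(b.get(j));
            explore(prefix, a, i, b, j + 1);
            prefix.remove(prefix.size() - 1);
        }
    }

    public static void main(String[] args) {
        // Each thread performs a non-atomic increment: read, then write.
        List<Step> t1 = List.of(() -> local1 = counter, () -> counter = local1 + 1);
        List<Step> t2 = List.of(() -> local2 = counter, () -> counter = local2 + 1);
        explore(new ArrayList<>(), t1, 0, t2, 0);
        // Prints "4 of 6 interleavings lose an update"
        System.out.println(buggy + " of " + total + " interleavings lose an update");
    }
}
```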

Spaghetti Bench Design: Two Tiers of Difficulty

Spaghetti Bench splits its 39 tasks into two categories designed to probe different skill levels:

  • Spaghetti-Easy (28 tasks): Single-file Java programs of 100–200 lines sourced from SCTBench. Each isolates a specific concurrency pattern—lock-free queues, producer-consumer races, memory ordering issues. The bug is self-contained and the fix is localized.
  • Spaghetti-Hard (11 tasks): Real concurrency bugs from Apache Kafka’s codebase. These are embedded in a large, complex system with state management across multiple components, making diagnosis and fix significantly harder.

Six frontier models were evaluated via the OpenHands agentic framework: Claude Opus 4.5, Claude Sonnet 4.5, GPT-5.2, Qwen 3 Coder 480B, Gemini 3 Pro Preview, and Gemini 3 Flash Preview. Each model ran in two configurations—without Fray (standard bash, file editor, and test runner) and with Fray as an additional tool call. In both cases, Fray served as the final arbiter of correctness.

The Results: Easy Bugs Are Fixable, Real-World Bugs Are Not

On Spaghetti-Easy, the gap between tooling configurations is stark. Without Fray, top models like GPT-5.2 and Claude Opus 4.5 already achieve strong results (95.7% and 92.9% pass@1, respectively). With Fray access, those numbers climb further—GPT-5.2 hits 100%, Claude Opus 4.5 reaches 99.3%, and Gemini 3 Pro jumps from 67.9% to 90.7%, a 22.8-point improvement. Fray catches the false positives where agents produced superficially correct patches that a naive test runner would greenlight.

Spaghetti-Easy Results (28 tasks)

Model                  Without Fray   With Fray   Change
GPT-5.2                    95.7%        100.0%     +4.3%
Claude Sonnet 4.5          93.6%         95.7%     +2.1%
Claude Opus 4.5            92.9%         99.3%     +6.4%
Qwen 3 Coder 480B          70.0%         75.7%     +5.7%
Gemini 3 Pro Preview       67.9%         90.7%    +22.8%

The Spaghetti-Hard results tell a different story. Performance collapses across the board, with or without Fray. The best result is Claude Sonnet 4.5 with Fray at 36.4%—meaning nearly two-thirds of real-world Kafka concurrency bugs remain unfixed even with optimal tooling. GPT-5.2 shows the largest absolute gain (+21.8% with Fray), but still only reaches 43.6%. Qwen 3 Coder actually regresses (-10.9%) when given Fray, suggesting the model struggles to interpret and act on concurrency testing feedback effectively.

Spaghetti-Hard Results (11 tasks)

Model                  Without Fray   With Fray   Change
GPT-5.2                    21.8%         43.6%    +21.8%
Claude Sonnet 4.5          32.7%         36.4%     +3.7%
Claude Opus 4.5            30.9%         34.5%     +3.6%
Gemini 3 Pro Preview       12.7%         14.5%     +1.8%
Qwen 3 Coder 480B          18.2%          7.3%    -10.9%

Case Study: When Plausible Patches Hide Real Bugs

The benchmark’s case studies illustrate why the gap between “test passes” and “bug fixed” matters. In the WorkStealQueue task (Spaghetti-Easy), Claude Sonnet 4.5 without Fray changed a single comparison operator (<= to <) and declared the fix complete. The boundary condition was indeed wrong, but the deeper problems remained: a check-then-act atomicity violation and a non-volatile array read. Standard test runs never caught them. Fray found the bug at iteration 2,616. With Fray access, the same model rewrote the method entirely, making the arrays volatile, adding double-checked locking, and handling mid-operation resizing. The fix passed 100,000 iterations.
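
The shape of that deeper problem is worth seeing. The sketch below is illustrative only (it is not the benchmark’s WorkStealQueue code): a bounds check and the action that depends on it happen without mutual exclusion, so another thread can invalidate the check in between; the fix makes check and act a single atomic step under a lock.

```java
import java.util.Arrays;

// Illustrative only; not the benchmark's WorkStealQueue. The bug shape:
// a check and the act that depends on it are not one atomic step.
class GrowableBuffer {
    private final Object lock = new Object();
    private volatile Object[] slots = new Object[4]; // volatile so readers see resizes
    private int count;

    // Buggy shape: between the check and the act, another thread can change
    // count or replace slots, corrupting the buffer.
    void putRacy(Object x) {
        if (count == slots.length) {             // check ...
            slots = Arrays.copyOf(slots, slots.length * 2);
        }
        slots[count++] = x;                      // ... then act: not atomic
    }

    // Fixed shape: the check, the resize, and the write happen under one
    // lock, so no other thread can interleave between them.
    void putSafe(Object x) {
        synchronized (lock) {
            if (count == slots.length) {
                slots = Arrays.copyOf(slots, slots.length * 2);
            }
            slots[count++] = x;
        }
    }
}
```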

In the KAFKA-17402 task (Spaghetti-Hard), GPT-5.2 correctly diagnosed that tasks appeared in multiple internal queues simultaneously and produced a deduplication wrapper. The fix passed Gradle unit tests. Fray found the failure at iteration 15. The actual Kafka developer fix was fundamentally different—a single line moved inside a lock boundary to ensure atomic state transitions. The agent treated the symptom; the developer treated the cause.
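
In code, the difference between the two strategies looks roughly like this (a hypothetical sketch, not the actual KAFKA-17402 patch): deduplication hides the duplicate-enqueue symptom, while moving the state transition inside the lock removes the race itself.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch of "symptom fix" vs. "root-cause fix"; not the actual
// KAFKA-17402 code or the agent's patch.
class TaskScheduler {
    private final Object lock = new Object();
    private final Queue<String> pending = new ArrayDeque<>();
    private final Set<String> enqueued = new HashSet<>();
    private volatile String state = "IDLE";

    // Symptom fix: deduplicate so double-enqueues are swallowed, but the
    // state transition still happens outside the lock and still races.
    void submitDeduped(String task) {
        synchronized (lock) {
            if (enqueued.add(task)) pending.add(task);
        }
        state = "SCHEDULED";                 // still outside the lock boundary
    }

    // Root-cause fix: one line moved inside the lock, so the enqueue and the
    // state transition become a single atomic step.
    void submitAtomic(String task) {
        synchronized (lock) {
            pending.add(task);
            state = "SCHEDULED";             // atomic with the enqueue
        }
    }
}
```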

What This Means for Benchmark Trust and AI-Driven Development

The broader benchmark landscape in 2026 already reflects unease about the gap between lab scores and production reality. A Kili Technology analysis found a 37% gap between lab benchmark scores and real-world deployment performance, with agent consistency dropping from 60% on a single run to 25% across eight consecutive runs. Spaghetti Bench adds a concrete dimension to that concern: when verification itself is unreliable, you cannot trust any score.

The practical takeaway for teams deploying AI coding agents is clear. Standard test-pass rates are an inadequate signal for concurrency work. If your agent produces a patch and the tests go green, you still need deterministic verification to be confident the race condition is actually resolved. The Spaghetti Bench researchers suggest three directions for improvement:

  1. Interactive interleaving exploration: Tools that let agents selectively pause, resume, and rearrange thread execution at breakpoints to understand how race conditions manifest, rather than just receiving a pass/fail verdict. (A toy version of this idea is sketched after this list.)
  2. Targeted diagnostic feedback: Instead of “test failed,” agents need information about which specific interleaving triggered the failure and which invariant was violated, enabling more precise reasoning about the root cause.
  3. Domain-specific training data: Concurrency bugs are relatively rare in open-source issue trackers, which may explain why agents struggle with the initial diagnosis. Targeted datasets could improve root-cause reasoning for this class of defects.
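
As a rough illustration of the first direction (hypothetical; not an existing agent tool), the driver below uses semaphores to decide which thread takes its next step, so a specific racy schedule can be forced and replayed deterministically:

```java
import java.util.concurrent.Semaphore;

// Toy sketch of "interactive interleaving exploration": a driver chooses
// which thread may take its next step, making a racy schedule replayable.
public class ScheduleDriver {
    static int counter, local1, local2;

    // One gate per thread; the driver releases exactly one permit per step.
    static final Semaphore gate1 = new Semaphore(0), gate2 = new Semaphore(0);
    static final Semaphore done = new Semaphore(0);

    static void step(Semaphore gate, Runnable action) {
        try { gate.acquire(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        action.run();
        done.release();                      // tell the driver this step finished
    }

    public static void main(String[] args) throws Exception {
        Thread t1 = new Thread(() -> {
            step(gate1, () -> local1 = counter);      // read
            step(gate1, () -> counter = local1 + 1);  // write
        });
        Thread t2 = new Thread(() -> {
            step(gate2, () -> local2 = counter);
            step(gate2, () -> counter = local2 + 1);
        });
        t1.start(); t2.start();

        // Force the racy schedule: both reads first, then both writes.
        for (Semaphore g : new Semaphore[]{gate1, gate2, gate1, gate2}) {
            g.release();
            done.acquire();                  // wait until that step has run
        }
        t1.join(); t2.join();
        System.out.println("counter = " + counter + " (expected 2; the lost update gives 1)");
    }
}
```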

Implementation Checklist for Teams Using AI Coding Agents on Concurrent Code

  1. Integrate a deterministic concurrency testing tool (Fray for JVM, Shuttle for Rust, or equivalent) into your agent’s tool belt before delegating concurrency bug fixes.
  2. Configure the agent to use controlled interleaving verification as the final validation step, not just repeated test execution.
  3. Set correctness thresholds based on deterministic verification, not naive test-pass counts. A patch that passes 1,000 random runs but fails under systematic interleaving exploration is not ready to ship.
  4. For real-world codebases, supplement agent output with human review of the fix strategy; agents currently struggle to distinguish symptom treatments from root-cause fixes in complex concurrent systems.
  5. Benchmark your agent against Spaghetti Bench before trusting it on production concurrency work. The open-source repository supports any LiteLLM-compatible model.

Frequently Asked Questions

What makes Spaghetti Bench different from SWE-bench?

SWE-bench evaluates agents on deterministic bugs where test validation is reliable. Spaghetti Bench focuses exclusively on concurrency bugs—race conditions that depend on nondeterministic thread scheduling. Standard test runs can pass even when the bug remains, so Spaghetti Bench uses Fray’s controlled interleaving exploration as the verification oracle, revealing false positives that SWE-bench’s methodology would miss entirely.

Can I run Spaghetti Bench against my own models?

Yes. The benchmark is open source on GitHub under the MIT license. It supports any LiteLLM-compatible model ID and can run via Docker or manual setup with Java 21 and Python 3.10+. You can run individual tasks or the full 39-task suite, with or without Fray tooling enabled.

Does Fray work for languages other than Java?

Fray is a JVM-specific tool. However, the Spaghetti Bench authors note plans to extend the benchmark to other languages with mature concurrency testing tools, specifically mentioning Rust with Shuttle. The underlying principle—controlled schedule exploration as a verification oracle—is language-agnostic and applicable wherever nondeterminism makes standard testing unreliable.

Why does Qwen 3 Coder perform worse with Fray on Spaghetti-Hard?

The benchmark results show Qwen 3 Coder 480B dropping from 18.2% to 7.3% pass@1 on Spaghetti-Hard when given Fray access. The authors attribute this to difficulty interpreting concurrency testing feedback—the model receives failure signals from Fray but fails to incorporate them into improved solutions, potentially making its patches worse through misguided iterations. This highlights that better tooling only helps if the model can reason effectively about the tool’s output.
