The $600 AI Revolution: How Apple’s Secret Chip Is Changing Local Compute Forever
Thirteen months ago, running a frontier-level language model at usable speeds meant spending $6,000 on hardware. Today, a $600 Mac Mini can run a superior model at the same speed—and that’s just the beginning of what Apple never wanted you to know about their Neural Engine.
A developer named maderix just pulled off something remarkable. Using Claude (an AI assistant) to reverse-engineer Apple’s undocumented neural processing unit, he cracked open a black box that Apple has kept sealed since 2017. What he found reveals both the current state of local AI and where it’s heading next.
The DeepSeek Moment, Reversed
Let’s start with the numbers that matter. In January 2025, a Hugging Face engineer demonstrated how to run DeepSeek R1 at roughly 5 tokens per second on a $6,000 custom rig. It was impressive—a frontier model running locally at acceptable speeds.
Fast forward to March 2026. The same 5 tps? A $600 AOOSTAR mini PC running Qwen3-27B at Q4 quantization. Not only is the hardware 10× cheaper—the model is better. If you want genuinely usable speeds, Qwen3.5-35B-A3B at Q4/Q5 delivers 17-20 tps on similar hardware.
This isn’t just incremental progress. It’s a complete inversion of the economics of local AI.
The Reddit community tracking these developments has been asking the right question: at this pace, could a 4B model match frontier performance within a year? The answer depends on understanding a different kind of progress happening in parallel—the hardware side.
Inside the Black Box: Apple’s Neural Engine
Apple doesn’t talk about their Neural Engine. They publish no ISA documentation, no architecture guides, no programming manuals. Everything goes through CoreML, which wraps the hardware in layers of abstraction that make it nearly impossible to understand what’s actually happening.
So maderix did what Apple didn’t want: he reverse-engineered it.
Working with Claude Opus as a collaborative partner, he mapped the entire software stack from CoreML down to the IOKit kernel driver. He discovered 40+ private classes in AppleNeuralEngine.framework, including `_ANEClient`, `_ANEModel`, and `_ANEInMemoryModel`—the keys to bypassing CoreML entirely.
What he found changes how we should think about on-device AI.
The Architecture
The M4 Neural Engine is not a GPU. It’s not a CPU. It’s a graph execution engine—a fixed-function accelerator that takes a compiled neural network graph and executes it as one atomic operation. You don’t issue individual multiply-accumulate instructions. You submit a program describing an entire computation graph, and the hardware runs it end-to-end.
The M4’s ANE (codename H16G) has 16 cores with a queue depth of 127 evaluation requests. It has independent DVFS (dynamic voltage/frequency scaling) and hard power gating that drops it to exactly 0 milliwatts when idle.
The Performance Reality
Apple claims “38 TOPS” for the M4 Neural Engine. Here’s what that actually means.
Direct benchmarking that bypasses CoreML's overhead puts the true peak at 19 TFLOPS in FP16. The "38 TOPS" number comes from the industry convention of counting INT8 operations at twice the FP16 rate. But the ANE doesn't actually execute INT8 faster: it dequantizes INT8 weights to FP16 before compute.
This matters because it reveals something about the hardware: it’s fundamentally an FP16 processor with some INT8 memory optimizations, not a true INT8 compute engine.
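The marketing arithmetic is easy to reproduce: under the industry convention, one INT8 multiply-accumulate is counted as two operations relative to FP16, so the measured FP16 peak maps directly onto Apple's headline figure. A back-of-the-envelope check, not a benchmark:

```python
fp16_peak_tflops = 19.0   # measured FP16 peak from direct benchmarking
int8_convention = 2       # INT8 counted at 2x the FP16 rate, by convention only

# The hardware dequantizes INT8 to FP16 before compute, so the 2x is
# a counting convention, not a real speedup.
marketed_tops = fp16_peak_tflops * int8_convention
assert marketed_tops == 38.0   # Apple's "38 TOPS" claim, reconstructed
```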
The SRAM Cliff
The benchmarks revealed a critical performance characteristic. At 2048×2048 matrix multiplication, the ANE delivers 5.7 TFLOPS. At 4096×4096, it drops to 4.0 TFLOPS—a 30% reduction.
The culprit: SRAM capacity. The working set for a matrix multiply is three matrices (A, B, C). At 2048×2048 in FP16, that's 24 MB, which fits in on-chip SRAM. At 4096×4096, it's 96 MB, roughly 3× larger than the ~32 MB of SRAM, forcing spills to DRAM.
For practitioners, this means keeping tensor footprints under 32 MB per operation is critical for peak performance.
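The working-set arithmetic behind the cliff is simple enough to script. This helper estimates whether a matmul's three-matrix footprint fits in the ~32 MB on-chip SRAM (the function name and the exact threshold are illustrative; the sizes come from the benchmarks above):

```python
def matmul_working_set_mb(m: int, k: int, n: int, bytes_per_elem: int = 2) -> float:
    """Combined footprint of A[m,k], B[k,n], C[m,n] in MB (FP16 = 2 bytes)."""
    elems = m * k + k * n + m * n
    return elems * bytes_per_elem / (1024 ** 2)

SRAM_MB = 32  # approximate on-chip SRAM reported in the benchmarks

for size in (2048, 4096):
    ws = matmul_working_set_mb(size, size, size)
    verdict = "fits in SRAM" if ws <= SRAM_MB else "spills to DRAM"
    print(f"{size}x{size}: {ws:.0f} MB -> {verdict}")
```

Running this reproduces the article's numbers: 24 MB at 2048×2048 (in SRAM) and 96 MB at 4096×4096 (spilled).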
The Convolution Secret
Here’s something Apple has never documented: the ANE is fundamentally a convolution engine. Expressing the same computation as a 1×1 convolution instead of a matrix multiply yields dramatically better throughput.
A matrix multiply C[M,N] = A[M,K] @ B[K,N] can be reshaped as:
- Input: (1, K, 1, M)
- Weight: (N, K, 1, 1)
- Output: (1, N, 1, M)
Same FLOPs, same result, but the ANE’s convolution datapath handles it much more efficiently—roughly 3× faster.
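The equivalence of the two layouts can be checked numerically. The NumPy sketch below models a 1×1 convolution as a per-position dot product over channels; it demonstrates the reshaping identity, not actual ANE execution (the final transpose accounts for the channel-major output layout):

```python
import numpy as np

M, K, N = 8, 16, 4
A = np.random.randn(M, K)
B = np.random.randn(K, N)

# Layouts from the list above: input (1, K, 1, M), weight (N, K, 1, 1).
x = A.T.reshape(1, K, 1, M)      # K channels; the "spatial" axis carries M
w = B.T.reshape(N, K, 1, 1)      # N output channels, K input channels

# A 1x1 convolution reduces over input channels at each spatial position.
out = np.einsum('bkhm,nkhw->bnhm', x, w)   # output shape (1, N, 1, M)

# out[0, n, 0, m] equals (A @ B)[m, n]: same FLOPs, same result.
assert np.allclose(out[0, :, 0, :].T, A @ B)
```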
The CoreML Tax
How much performance does CoreML leave on the table? For small operations, the overhead is severe—2-4× slower than direct API access. The gap narrows for throughput-heavy workloads where compute time dominates, but for latency-sensitive applications like LLM token decoding, the CoreML tax is significant.
If you’re building production on-device AI, the lesson is clear: CoreML is convenient, but it’s leaving performance on the table.
The Efficiency Story That Matters
If throughput were the only metric, GPUs would always win. But the ANE’s real advantage is efficiency.
At peak load, the M4 ANE delivers 6.6 TFLOPS per watt. For context:
| Hardware | Efficiency (TFLOPS/W) |
|----------|-----------------------|
| M4 ANE | 6.6 |
| M4 GPU | ~1.0 |
| H100 GPU | ~0.13 |
| A100 GPU | ~0.08 |
The ANE is roughly 50× more efficient per FLOP than an H100, and over 80× more efficient than an A100. Yes, the H100 has 50× more total throughput. But for on-device inference running off a battery, the ANE is extraordinary.
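The ratios are worth computing explicitly, since per-joule efficiency is the whole argument. The figures are copied straight from the table above:

```python
# TFLOPS per watt, from the table above.
efficiency = {"M4 ANE": 6.6, "M4 GPU": 1.0, "H100": 0.13, "A100": 0.08}

ane = efficiency["M4 ANE"]
for name, tflops_per_w in efficiency.items():
    # How much more work per joule the ANE does than each alternative.
    print(f"ANE vs {name}: {ane / tflops_per_w:.0f}x work per joule")
```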
This efficiency gap has implications beyond laptops. It suggests a future where AI compute isn’t concentrated in massive data centers but distributed across billions of devices, each running models locally at a fraction of the energy cost.
Training on Inference Hardware
The most surprising part of maderix’s project? He trained a model on the Neural Engine.
The ANE was designed exclusively for inference. Apple’s tooling assumes frozen weights, static graphs. But through careful work—cracking the weight blob format, working around compilation limits, building a custom training pipeline—maderix trained a 110M parameter MicroGPT model on a chip that was never meant to train anything.
The challenges were significant. CoreML’s file-based compilation path requires writing MIL text to disk for every weight update—unacceptable for training loops that need thousands of iterations. The solution came through `_ANEInMemoryModelDescriptor`, a private class that accepts MIL text directly in memory. But getting it working required solving several non-obvious issues: the milText parameter wants NSData (not NSString), weights must be passed as a dictionary (not a single buffer), and even the “in-memory” path internally writes to a temp directory.
Apple also imposes a 119-compile limit before requiring a restart—a safeguard that makes sense for inference but becomes a bottleneck for training. The workaround involves managing the compilation cache and structuring training to minimize recompilation steps.
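One way to stay under the limit is to key a compile cache on graph topology alone, so that weight updates reuse an already-compiled program instead of triggering a recompile. A toy sketch of that bookkeeping (the 119 figure comes from the write-up; the class and its API are illustrative, not real ANE interfaces):

```python
class CompileBudget:
    """Track compilations and reuse compiled programs per graph topology."""

    def __init__(self, limit: int = 119, compile_fn=None):
        self.limit = limit
        self.compile_fn = compile_fn or (lambda graph: f"compiled:{graph}")
        self.cache = {}        # graph topology -> compiled program
        self.compiles = 0

    def get(self, topology: str):
        # Weight updates don't change topology, so repeat calls hit the
        # cache instead of burning one of the allowed compilations.
        if topology not in self.cache:
            if self.compiles >= self.limit:
                raise RuntimeError("compile limit reached; restart required")
            self.cache[topology] = self.compile_fn(topology)
            self.compiles += 1
        return self.cache[topology]

budget = CompileBudget()
for step in range(1000):                  # 1000 training iterations...
    program = budget.get("microgpt-fwd-bwd")
print(budget.compiles)                    # ...but only one compilation
```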
Practical implications today? Limited. You can’t train large models on a single ANE. But LoRA fine-tuning for 3B-7B models should be feasible—the parameter counts for LoRA adapters are small enough that the compilation overhead becomes manageable. And in theory, a cluster of ANE devices could train larger models by distributing the workload.
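The LoRA feasibility claim is easy to sanity-check with parameter counts. Assuming rank-16 adapters on the four attention projections of a 7B-class model (32 layers, hidden size 4096; all figures here are illustrative, not from the write-up):

```python
def lora_params(layers: int, hidden: int, rank: int, targets: int = 4) -> int:
    """Trainable LoRA params: each target matrix gets A[hidden, rank] + B[rank, hidden]."""
    return layers * targets * 2 * hidden * rank

n = lora_params(layers=32, hidden=4096, rank=16)
print(f"{n / 1e6:.1f}M trainable params")   # tiny next to ~7B frozen weights
```

At roughly 17M trainable parameters, the adapter graphs are orders of magnitude smaller than the frozen model, which is why the compilation overhead becomes manageable.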
The real significance is proving it’s possible. Hardware designed for inference can be repurposed for training with the right software stack. This opens questions about what other “inference-only” hardware might be capable of with similar reverse engineering effort.
Practical Guidance for Practitioners
If you’re building on-device AI today, here’s what the ANE research means for you:
Maximizing ANE Throughput
1. Deep graphs, not wide. Chain 16-64 operations in one MIL program. Single operations waste 70% of capacity.
2. Conv over matmul. Express matrix operations as 1×1 convolutions for 3× speedup.
3. Stay under 32 MB. Keep per-tensor footprint in SRAM. DRAM spills kill throughput.
4. Avoid dispatch-limited ops. Anything under ~1ms is dominated by the 0.095ms dispatch overhead.
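Points 1 and 4 follow from a simple amortization model. Using the 0.095 ms dispatch figure above, the useful-work fraction for a given graph depth can be estimated (the 0.05 ms per-op duration is an illustrative choice, not a measurement):

```python
DISPATCH_MS = 0.095  # per-submission overhead reported in the benchmarks

def utilization(op_ms: float, ops_per_graph: int) -> float:
    """Fraction of wall time spent computing when ops share one dispatch."""
    compute = op_ms * ops_per_graph
    return compute / (compute + DISPATCH_MS)

for depth in (1, 16, 64):
    print(f"depth {depth:3d}: {utilization(0.05, depth):.0%} useful work")
```

A single 0.05 ms op spends roughly two-thirds of its wall time on dispatch, consistent with the "wastes 70% of capacity" figure; chaining 16-64 ops pushes utilization past 90%.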
When to Use What
- ANE for: Large batch inference, deep graphs with 16+ layers, energy-constrained scenarios, sustained throughput
- SME (CPU matrix extension) for: Single-token decode, custom operations, small matrices, FP32+ precision needs
The ideal LLM inference strategy on M4 is hybrid: prefill (large batch, high throughput) on ANE, decode (single token, latency-sensitive) on SME.
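A hybrid scheduler along those lines reduces to a routing rule. The names (`Backend`, `route`) and exact thresholds are illustrative; the decision logic follows the guidance above:

```python
from enum import Enum

class Backend(Enum):
    ANE = "ane"   # high-throughput: batch prefill, deep graphs
    SME = "sme"   # low-latency: single-token decode on the CPU matrix unit

def route(tokens_in_batch: int, graph_depth: int) -> Backend:
    """Send prefill-style work to the ANE, decode-style work to SME."""
    if tokens_in_batch > 1 and graph_depth >= 16:
        return Backend.ANE   # large batch + deep graph amortizes dispatch cost
    return Backend.SME       # single token: latency dominates, SME wins

assert route(tokens_in_batch=512, graph_depth=32) is Backend.ANE  # prefill
assert route(tokens_in_batch=1, graph_depth=32) is Backend.SME    # decode
```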
The Bigger Picture
The combination of rapidly improving small models and increasingly efficient hardware points to a future that looks very different from today’s cloud-centric AI.
Thirteen months ago, local AI was a niche hobby for enthusiasts with deep pockets. Today, frontier-quality inference is accessible to anyone with a mid-range computer. Tomorrow, the question won’t be “can I run this locally?” but “why would I send my data to a server?”
The economics are shifting. A $600 device running a 27B model at 5 tps, with 80× better energy efficiency than cloud GPUs—that’s not just progress. That’s a new category of computing.
Apple didn’t intend for us to know this. The Neural Engine was supposed to remain a black box, accessed only through their controlled APIs. But the research community, increasingly aided by AI tools like Claude, is cracking open these boxes faster than companies can seal them. The irony is worth noting: AI systems are now being used to reverse-engineer the hardware that runs them, accelerating the pace of discovery in a recursive loop.
The next DeepSeek moment won’t come from a better model. It will come from better hardware understanding—and the realization that the infrastructure for local AI is already sitting on millions of desks, waiting to be unlocked.
For developers and organizations building AI applications, this shift has strategic implications. Betting entirely on cloud APIs means depending on infrastructure you don’t control, with costs that scale linearly with usage. The alternative—local inference on efficient hardware—offers fixed costs, privacy by default, and no network latency. The tradeoff has always been capability. But as the numbers show, that tradeoff is evaporating faster than most people realize.
FAQ
Can I actually train models on Apple Silicon?
Yes, though with limitations. The research demonstrates training a 110M parameter model on the ANE. LoRA fine-tuning for 3B-7B models should be practical. Full training of large models still requires traditional GPU infrastructure.
Is CoreML fast enough for production?
It depends on your latency requirements. For throughput-heavy batch processing, the overhead is acceptable. For real-time applications like token-by-token LLM generation, direct ANE access via private APIs can be 2-4× faster.
What’s the practical model size limit for local inference?
With Q4 quantization, a 35B model fits comfortably on a Mac Mini with unified memory. The constraint is less about raw capacity and more about SRAM—the 32 MB on-chip limit determines whether you hit peak throughput or fall into DRAM-bound performance.
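The fit claim follows from quantization arithmetic: a Q4 weight takes roughly half a byte, plus a little overhead for scales and zero-points. A rough estimate, assuming ~4.5 effective bits per weight (the exact figure varies by quantization scheme):

```python
def q4_weight_gb(params_billions: float, bits_per_weight: float = 4.5) -> float:
    """Approximate in-memory weight size at ~4.5 bits/weight (4-bit + metadata)."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(f"{q4_weight_gb(35):.1f} GB")  # needs a unified-memory config above ~20 GB
```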
Should I wait for the next hardware generation?
The efficiency gains from the ANE approach (6.6 TFLOPS/W vs 0.13 for H100) suggest that the next frontier isn’t raw power but intelligent specialization. Apple’s approach—fixed-function hardware for specific workloads—points toward a future where AI compute is distributed and efficient rather than centralized and power-hungry.
How do I access the private ANE APIs?
The research provides code at github.com/maderix/ANE. Note that private APIs are undocumented and may change between macOS versions. For production use, this introduces maintenance risk that needs to be weighed against performance gains.
References
- maderix. “Inside the M4 Apple Neural Engine, Part 1: Reverse Engineering.” Substack, February 2026.
- maderix. “Inside the M4 Apple Neural Engine, Part 2: ANE Benchmarks.” Substack, February 2026.
- Reddit r/LocalLLaMA. “13 months since the DeepSeek moment, how far have we gone running models locally?” March 2026.
- Reddit r/LocalLLaMA. “Reverse engineered Apple Neural Engine(ANE) to train Microgpt.” February 2026.
- Hollance. "neural-engine" GitHub repository. Community ANE documentation.
- Apple. "ml-ane-transformers" GitHub repository. Reference transformer implementations for the ANE.