Our First Public Release: Qwen3.5 0.8B, 4B, 9B at 2.5-bit
We are releasing our first public quantized models — Qwen3.5 0.8B, 4B, and 9B at 2.5-bit — alongside a new CLI tool built on ExecuTorch to run them locally on Mac.
Why 2.5-bit?
Standard 4-bit quantization is well understood, but it leaves significant efficiency on the table. At 2.5 bits per weight, we hit a density point where the model fits comfortably within constrained VRAM budgets without the quality collapse you typically see below 3-bit.
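To make the density point concrete, here is a back-of-envelope estimate of raw weight storage at different bit widths. This is illustrative arithmetic only: real footprints add quantization scales, the KV cache, and runtime overhead on top of weight storage, and we use the 1 GB = 10^9 bytes convention.

```python
def weight_gb(n_params: float, bits: float) -> float:
    """Raw weight storage in GB (1 GB = 1e9 bytes) at a given bit width."""
    return n_params * bits / 8 / 1e9

n = 4e9  # a 4B-parameter model
for bits in (16, 4, 3, 2.5):
    print(f"{bits:>4}-bit: {weight_gb(n, bits):.2f} GB")
```

At 2.5-bit, the raw weights of a 4B model take about 1.25 GB, which is why the total footprint can land under 2 GB even with overhead included.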
What we measured
Our evaluation pipeline checks perplexity, downstream task accuracy, and latency under realistic batch sizes. The 2.5-bit variant holds up where it matters:
- Perplexity: Within 2% of the FP16 baseline on standard benchmarks
- VRAM: 1.9 GB vs 7.5 GB at FP16 — a 4x reduction
- Throughput: Comparable tokens/sec on consumer GPUs
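For reference, perplexity is just the exponentiated mean negative log-likelihood per token, so a "within 2%" comparison is straightforward to compute. A minimal sketch — the per-token NLL values below are made up for illustration, not our measured numbers:

```python
import math

def perplexity(nll_per_token: list[float]) -> float:
    """Perplexity = exp of the mean negative log-likelihood per token."""
    return math.exp(sum(nll_per_token) / len(nll_per_token))

# Hypothetical per-token NLLs for an FP16 baseline and a quantized variant.
fp16_ppl = perplexity([2.10, 1.95, 2.30, 2.05])
quant_ppl = perplexity([2.12, 1.96, 2.32, 2.06])
rel_gap = (quant_ppl - fp16_ppl) / fp16_ppl
print(f"baseline {fp16_ppl:.2f}, quantized {quant_ppl:.2f}, gap {rel_gap:.1%}")
```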
The CLI: run it locally on Mac
Alongside the models, we are shipping a CLI tool built on ExecuTorch that lets you run these models directly on your Mac. Install it, point it at a model, and start generating — no Python environment, no server process, no Docker.
```shell
tear run qwen3.5-4b-2.5bit --prompt "Explain quantization in one paragraph"
```
ExecuTorch compiles models into a format optimized for on-device execution, with native support for Apple Silicon acceleration via CoreML and Metal delegates. The CLI wraps this into a single binary with streaming output, conversation history, and configurable generation parameters.
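The delegate idea — route each part of the graph to whichever backend claims it, and fall back to a portable implementation otherwise — can be sketched in a few lines. This is the general shape of the pattern, not ExecuTorch's actual API; all class and function names here are hypothetical:

```python
from typing import Callable

class Delegate:
    """Hypothetical backend delegate: declares the ops it supports and runs them."""
    def __init__(self, name: str, supported_ops: set[str], run: Callable):
        self.name = name
        self.supported_ops = supported_ops
        self.run = run

def dispatch(op: str, args, delegates: list[Delegate], fallback: Callable):
    """Send an op to the first delegate that supports it, else the fallback."""
    for d in delegates:
        if op in d.supported_ops:
            return d.run(op, args)
    return fallback(op, args)

# Illustrative setup: a CoreML-style delegate that handles matmul,
# with a portable fallback for everything else.
coreml = Delegate("coreml", {"matmul"}, lambda op, args: ("coreml", op))
portable = lambda op, args: ("portable", op)
result = dispatch("matmul", (), [coreml], portable)
```

The point of the pattern is that adding a Metal or XNNPACK backend means registering another delegate, not forking the execution path.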
Why build a new CLI?
Existing inference tools are excellent and we use them as reference points. But they were not the right foundation for what we are building, for a few reasons:
- Quantization format control. Our quantization pipeline produces custom mixed-precision layouts that don’t map cleanly onto GGUF’s supported formats. Rather than force our models into someone else’s quantization scheme, we wanted end-to-end control from training through inference.
- ExecuTorch’s delegate system. ExecuTorch lets us target specific hardware backends (CoreML, Metal, XNNPACK) through delegates without maintaining separate codepaths. As we expand to other devices, this matters more than raw throughput on a single platform.
- Tighter integration with our eval loop. We run the same inference backend in CI that ships to users. Building on ExecuTorch means our evaluation numbers reflect exactly what users get, with no translation layer between benchmark and production.
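As one illustration of how a fractional average bit width can arise in a mixed-precision layout: quantizing half the weight groups at 2-bit and half at 3-bit averages 2.5 bits per weight. Below is a minimal per-group symmetric quantizer — a generic sketch of the idea, not the actual scheme shipped in this release:

```python
def quantize_group(weights: list[float], bits: int):
    """Symmetric per-group quantization: scale into the int range, round, clamp."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 1 for 2-bit, 3 for 3-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid zero scale
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize_group(q: list[int], scale: float) -> list[float]:
    """Map quantized integers back to approximate float weights."""
    return [v * scale for v in q]

# Mixed layout: alternate groups at 2-bit and 3-bit -> 2.5 bits average.
bit_plan = [2, 3, 2, 3]
avg_bits = sum(bit_plan) / len(bit_plan)  # 2.5
```

In practice the bit allocation is chosen per layer or per group based on sensitivity, which is exactly the kind of layout decision that doesn’t map onto a fixed menu of quantization types.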
Most existing tools optimize for broad compatibility across many model families and quantization formats. We are optimizing for a narrower set of models where we control the full stack from quantization to runtime. Different goals, different tools.
What’s next
This is the first in a series of public drops. We are staging a second release now and will share details as evaluation completes. We will also be dropping more custom models built on top of Qwen3.5 0.8B.