# LLM inference, optimized for your Mac
Continuous batching and virtually unlimited SSD-backed KV caching, managed directly from your menu bar.
## Features
### Inference

- Continuous Batching — Handle multiple concurrent requests with the mlx-lm `BatchGenerator`
- Multi-Model Serving — Load LLM, embedding, and reranker models simultaneously, with LRU eviction
- Reasoning Model Support — Automatic `<think>` tag handling for DeepSeek and MiniMax models
- Harmony Protocol — Native support for gpt-oss models via the openai-harmony parser
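To illustrate the `<think>` handling above, here is a minimal sketch of splitting a reasoning model's raw output into its reasoning trace and final answer. The tag name comes from the feature list, but the splitting rules are illustrative assumptions, not oMLX's actual parser.

```python
import re

# Assumed convention: reasoning models emit <think>...</think> before the answer.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) extracted from raw model output."""
    match = THINK_RE.search(text)
    if not match:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Remove the first think block; what remains is the user-facing answer.
    answer = THINK_RE.sub("", text, count=1).strip()
    return reasoning, answer

raw = "<think>2+2 is 4.</think>The answer is 4."
print(split_reasoning(raw))  # ('2+2 is 4.', 'The answer is 4.')
```

A server would typically surface the two parts as separate fields (e.g. reasoning vs. content) in the streamed response.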
### Caching
- Paged KV Cache — Block-based with prefix sharing and copy-on-write (vLLM-inspired)
- SSD Tiered Caching — Automatic GPU to SSD offloading for virtually unlimited context caching
- Hybrid Cache — Mixed KVCache + RotatingKVCache for complex architectures (Gemma3, etc.)
- Persistent Cache — KV cache blocks survive server restarts via safetensors storage
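The paged-cache idea above can be sketched in a few lines: fixed-size blocks of tokens are deduplicated by content, so sequences sharing a prompt prefix share physical blocks, and reference counts make copy-on-write possible. All names, the block size, and the dedup-by-content scheme are illustrative assumptions, not oMLX's actual implementation.

```python
from dataclasses import dataclass

BLOCK_SIZE = 4  # tokens per block (illustrative; real systems use larger blocks)

@dataclass
class Block:
    tokens: tuple        # token ids stored in this block
    ref_count: int = 1   # how many sequences reference this block

class PagedCache:
    """Toy sketch of vLLM-style paged KV caching with prefix sharing."""

    def __init__(self):
        self.blocks: list[Block] = []      # physical block table
        self.index: dict[tuple, int] = {}  # full-block content -> block id

    def allocate(self, tokens: list[int]) -> list[int]:
        """Map a token sequence to physical block ids, reusing shared blocks."""
        table = []
        for i in range(0, len(tokens), BLOCK_SIZE):
            chunk = tuple(tokens[i:i + BLOCK_SIZE])
            if len(chunk) == BLOCK_SIZE and chunk in self.index:
                bid = self.index[chunk]           # prefix hit: share the block
                self.blocks[bid].ref_count += 1   # copy-on-write bookkeeping
            else:
                bid = len(self.blocks)
                self.blocks.append(Block(chunk))
                if len(chunk) == BLOCK_SIZE:      # only full blocks are shareable
                    self.index[chunk] = bid
            table.append(bid)
        return table

cache = PagedCache()
a = cache.allocate([1, 2, 3, 4, 5, 6, 7, 8, 9])   # blocks for sequence A
b = cache.allocate([1, 2, 3, 4, 5, 6, 7, 8, 42])  # shares A's first two blocks
print(a, b)  # [0, 1, 2] [0, 1, 3]
```

A real implementation stores attention keys/values (not token ids) in each block and frees a block only when its `ref_count` drops to zero; cold blocks can then be offloaded to SSD as described above.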
### API

- OpenAI Compatible — `/v1/chat/completions`, `/v1/completions`, `/v1/models`, `/v1/embeddings`
- Anthropic Compatible — `/v1/messages` with streaming support
- Tool Calling — JSON, Qwen, Gemma, MiniMax, GLM formats + MCP integration
- Structured Output — JSON mode and JSON Schema validation
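Because the server speaks the OpenAI wire format, any OpenAI client works against it. Below is a dependency-free sketch using only the standard library; the base URL/port and model name are assumptions (check the admin dashboard for the actual values), and `json_mode` shows the JSON-mode structured-output flag.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumed port; adjust to your server

def build_chat_request(model: str, prompt: str, json_mode: bool = False) -> dict:
    """Build a /v1/chat/completions payload in the OpenAI wire format."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    if json_mode:
        # OpenAI-style structured output: constrain the reply to valid JSON.
        payload["response_format"] = {"type": "json_object"}
    return payload

def chat(model: str, prompt: str) -> str:
    """Send one chat request to the local server and return the reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(build_chat_request(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

The same payload works with the official `openai` Python client by pointing its `base_url` at the local server.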
### macOS App

- Native Menubar App — PyObjC-based, not Electron; Start/Stop/Restart from the menu bar
- Admin Dashboard — Real-time monitoring, built-in chat, and per-model settings at `/admin`
- Model Downloader — Search and download MLX models from HuggingFace in the dashboard
- Auto-Update Check — GitHub Releases-based update notifications
- Signed & Notarized — Developer ID signed, Apple-notarized DMG distribution
## Requirements
- Apple Silicon (M1/M2/M3/M4)
- macOS 14.0+ (Sonoma)
## Install
Download the DMG, drag oMLX to Applications, and launch — that's it.
