Releases: mudler/LocalAI
v4.1.3
What's Changed
Bug fixes 🐛
- fix(token): login via legacy api keys by @mudler in #9249
- fix(anthropic): do not emit empty tokens and fix SSE tool calls by @mudler in #9258
- fix(gpu): better detection for MacOS and Thor by @mudler in #9263
👒 Dependencies
- chore(deps): bump google.golang.org/grpc from 1.79.3 to 1.80.0 by @dependabot[bot] in #9253
- chore(deps): bump github.com/jaypipes/ghw from 0.23.0 to 0.24.0 by @dependabot[bot] in #9250
- chore(deps): bump github.com/aws/aws-sdk-go-v2/config from 1.32.12 to 1.32.14 by @dependabot[bot] in #9256
- chore(deps): bump go.opentelemetry.io/otel/exporters/prometheus from 0.64.0 to 0.65.0 by @dependabot[bot] in #9254
Other Changes
- chore: ⬆️ Update ggml-org/llama.cpp to d0a6dfeb28a09831d904fc4d910ddb740da82834 by @localai-bot in #9259
- docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #9260
- chore: ⬆️ Update ace-step/acestep.cpp to e0c8d75a672fca5684c88c68dbf6d12f58754258 by @localai-bot in #9261
- chore: ⬆️ Update leejet/stable-diffusion.cpp to 8afbeb6ba9702c15d41a38296f2ab1fe5c829fa0 by @localai-bot in #9262
Full Changelog: v4.1.2...v4.1.3
v4.1.2
What's Changed
Bug fixes 🐛
- fix(autoparser): correctly pass by logprobs by @mudler in #9239
- fix(chat): do not retry if we had chatdeltas or tooldeltas from backend by @mudler in #9244
Other Changes
- Update index.yaml and add Qwen3.5 model files by @ER-EPR in #9237
- chore: ⬆️ Update ggml-org/llama.cpp to 761797ffdf2ce3f118e82c663b1ad7d935fbd656 by @localai-bot in #9243
- chore: ⬆️ Update leejet/stable-diffusion.cpp to 7397ddaa86f4e8837d5261724678cde0f36d4d89 by @localai-bot in #9242
- docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #9241
Full Changelog: v4.1.1...v4.1.2
v4.1.1
This is a patch release that addresses a few regressions from the last release and prepares for the upcoming Gemma 4. Most importantly, it:
- Fixes Gemma 4 tokenization with llama.cpp
- Shows the login page in API-key-only mode
- Includes small fixes to improve Anthropic API compatibility
What's Changed
Other Changes
- docs: Update Home Assistant integrations list by @loryanstrant in #9206
- chore: ⬆️ Update ggml-org/llama.cpp to a1cfb645307edc61a89e41557f290f441043d3c2 by @localai-bot in #9203
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #9210
- chore: bump inference defaults from unsloth by @github-actions[bot] in #9219
- docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #9214
- chore: ⬆️ Update ggml-org/llama.cpp to d006858316d4650bb4da0c6923294ccd741caefd by @localai-bot in #9215
- fix(ui): pass by staticApiKeyRequired to show login when only api key is configured by @mudler in #9220
- feat(gemma4): add thinking support by @mudler in #9221
- fix(nats): improve error handling by @mudler in #9222
- feat(autoparser): prefer chat deltas from backends when emitted by @mudler in #9224
- fix(anthropic): show null index when not present, default to 0 by @mudler in #9225
- feat(api): Allow coding agents to interactively discover how to control and configure LocalAI by @richiejp in #9084
- chore(refactor): use interface by @mudler in #9226
- fix(reasoning): accumulate and strip reasoning tags from autoparser results by @mudler in #9227
- chore(model-gallery): ⬆️ update checksum by @localai-bot in #9233
- chore: ⬆️ Update ggml-org/llama.cpp to b8635075ffe27b135c49afb9a8b5c434bd42c502 by @localai-bot in #9231
New Contributors
- @github-actions[bot] made their first contribution in #9219
Full Changelog: v4.1.0...v4.1.1
v4.1.0
🎉 LocalAI 4.1.0 Release! 🚀
LocalAI 4.1.0 is out! 🔥
Just weeks after the landmark 4.0, we're back with another massive drop. This release turns LocalAI into a production-grade AI platform: spin up a distributed cluster with smart routing and autoscaling, lock it down with built-in auth and per-user quotas, fine-tune models without leaving the UI, and much more. If 4.0 was the foundation, 4.1 is the control tower.
| Feature | Summary |
|---|---|
| 🌐 Distributed Mode | Run LocalAI as a cluster — smart routing, node groups, drain/resume, min/max autoscaling. |
| 🔐 Users & Auth | Built-in user management with OIDC, invite mode, API keys, and admin impersonation. |
| 📊 Quota System | Per-user usage quotas with predictive analytics and breakdown dashboards. |
| 🧪 Fine-Tuning | (experimental) Fine-tune models with TRL, auto-export to GGUF, and import back — all from the UI. |
| ⚗️ Quantization | (experimental) New backend for on-the-fly model quantization. |
| 🔧 Pipeline Editor | Visual model pipeline editor in the React UI. |
| 🤖 Standalone Agents | Run agents from the CLI with local-ai agent run. |
| 🧠 Smart Inferencing | Auto inference defaults from Unsloth, tool parsing fallback, and min_p support. |
| 🎬 Media History | Browse past generated images and media in Studio pages. |
New: full setup walkthrough (long version): https://www.youtube.com/watch?v=cMVNnlqwfw4
🚀 Key Features
🌐 Distributed Mode: scaling LocalAI horizontally
Run LocalAI as a distributed cluster and let it figure out where to send your requests. No more single-node bottlenecks.
- Smart Routing: Requests are routed to nodes ordered by available VRAM — the beefiest, free GPU gets the job.
- Node Groups: Pin models to specific node groups for workload isolation (e.g., "gpu-heavy" vs "cpu-light").
- Autoscaling: Built-in min/max autoscaler with a node reconciler that manages the lifecycle automatically.
- Drain & Resume: Gracefully drain nodes for maintenance and bring them back with a single API call.
- Cluster Dashboard: See your entire cluster status at a glance from the home page.
- Smart Model Transfer: Transfer models between nodes via S3 or peer-to-peer.
distributed-mode.mp4
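The routing rule above ("the beefiest, free GPU gets the job", skipping drained nodes) boils down to a simple selection. A minimal Python sketch of the idea; the field names (`draining`, `free_vram_mb`) are illustrative, not LocalAI's actual node schema:

```python
def pick_node(nodes):
    """Pick the non-draining node with the most free VRAM.

    Illustrative sketch of the smart-routing idea; the 'draining' and
    'free_vram_mb' fields are hypothetical, not LocalAI's real schema.
    """
    candidates = [n for n in nodes if not n.get("draining")]
    return max(candidates, key=lambda n: n["free_vram_mb"], default=None)

nodes = [
    {"name": "a", "free_vram_mb": 8000},
    {"name": "b", "free_vram_mb": 24000},
    {"name": "c", "free_vram_mb": 48000, "draining": True},
]
# Node "c" has the most VRAM but is draining, so "b" gets the request.
```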
🔐 Users, Authentication & Quotas
LocalAI now ships with a complete multi-user platform — perfect for teams, classrooms, or any shared deployment.
- User Management: Create, edit, and manage users from the React UI.
- OIDC/OAuth: Plug in your identity provider for SSO — Google, Keycloak, Authentik, you name it.
- Invite Mode: Restrict registration to invite-only with admin approval.
- API Keys: Per-user API key management.
- Admin Powers: Admins can impersonate users for debugging.
- Quota System: Set per-user usage quotas and enforce limits.
- Usage Analytics: Predictive usage dashboard with per-user breakdown statistics.
Users and quota:
usersquota-1775167475876.mp4
Usage metrics per user:
usage.mp4
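Conceptually, the quota enforcement described above is a check-and-record step per request. A toy sketch, assuming a token-based limit; the field names and semantics are illustrative, not LocalAI's actual quota implementation:

```python
from dataclasses import dataclass

@dataclass
class UserQuota:
    """Toy per-user token quota (illustrative, not LocalAI's schema)."""
    limit_tokens: int
    used_tokens: int = 0

    def try_consume(self, requested: int) -> bool:
        """Reject the request if it would push usage past the limit,
        otherwise record the consumption and allow it."""
        if self.used_tokens + requested > self.limit_tokens:
            return False
        self.used_tokens += requested
        return True

q = UserQuota(limit_tokens=1000)
# A request that fits is recorded; one that would overflow is rejected.
```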
🧪 Fine-Tuning & Quantization
No more juggling external tools. Fine-tune and quantize directly inside LocalAI.
- Fine-Tuning with TRL (Experimental): Train LoRA adapters with Hugging Face TRL, auto-export to GGUF, and import the result straight back into LocalAI. Includes a built-in evals framework to validate your work.
- Quantization Backend: Spin up the new quantization backend to create optimized model variants on-the-fly.
quantize-fine-tune.mp4
🎨 UI
The React UI keeps getting better. This release adds serious power-user features:
- Model Pipeline Editor: Visually wire up model pipelines — no YAML editing required.
- Per-Model Backend Logs: Drill into logs scoped to individual models for laser-focused debugging.
- Media History: Studio pages now remember your past generations — images, audio, and more.
- Searchable Model/Backend Selector: Quickly find models and backends with inline search and filtering.
- Structured Error Toasts: Errors now link directly to traces — one click from "something broke" to "here's why."
- Tracing Settings: Inline tracing config restored with a cleaner UI.
talk.mp4
🤖 Agents & Inference
- Standalone Agent Mode: Run agents straight from the terminal with `local-ai agent run`. Supports single-turn `--prompt` mode and pool-based configurations from `pool.json`.
- Streaming Tool Calls: Agent mode tool calls now stream in real time, with interleaved thinking fixed.
- Inferencing Defaults: Automatic inference parameters sourced from Unsloth and applied to all endpoints and gallery models; your models just work better out of the box.
- Tool Parsing Fallback: When native tool call parsing fails, an iterative fallback parser kicks in automatically.
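To illustrate the fallback idea: when the backend's native tool-call channel yields nothing parseable, a scanner can walk the raw text looking for the first valid JSON object that resembles a tool call. This is a hypothetical sketch of the approach, not LocalAI's actual parser:

```python
import json

def fallback_parse_tool_call(text: str):
    """Scan raw model output for the first JSON object with a 'name'
    field. Hypothetical sketch of an iterative fallback parser; the
    'name' heuristic is assumed, not LocalAI's exact logic."""
    dec = json.JSONDecoder()
    i = text.find("{")
    while i != -1:
        try:
            obj, _ = dec.raw_decode(text, i)
            if isinstance(obj, dict) and "name" in obj:
                return obj
        except json.JSONDecodeError:
            pass
        i = text.find("{", i + 1)
    return None

call = fallback_parse_tool_call(
    'thinking... {"name": "search", "arguments": {"q": "x"}}'
)
```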
🛠️ Under the Hood
- Repeated Log Merging: Noisy terminals? Repeated log lines are now collapsed automatically.
- Jetson/Tegra GPU Detection: First-class NVIDIA Jetson/Tegra platform detection.
- Intel SYCL Fix: Auto-disables `mmap` for SYCL backends to prevent crashes.
- llama.cpp Portability: Bundled `libdl`, `librt`, and `libpthread` for improved cross-platform support.
- HF_ENDPOINT Mirror: The downloader now rewrites Hugging Face URIs with `HF_ENDPOINT` for corporate/mirror setups.
- Transformers >5.0: Bumped to HuggingFace Transformers >5.0 with generic model loading.
- API Improvements: Proper 404s for missing models, unescaped model names, unified inferencing paths with automatic retry on transient errors.
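The `HF_ENDPOINT` rewrite mentioned above amounts to swapping the `huggingface.co` host for a configured mirror while keeping the path intact. A minimal sketch of that URI surgery (the function and its behavior are illustrative, not LocalAI's downloader code):

```python
from urllib.parse import urlparse, urlunparse

def rewrite_hf_uri(uri: str, hf_endpoint: str) -> str:
    """Swap the huggingface.co host for the configured mirror,
    preserving the resource path. Illustrative sketch only."""
    src, dst = urlparse(uri), urlparse(hf_endpoint)
    if src.netloc != "huggingface.co":
        return uri  # leave non-HF URIs untouched
    return urlunparse(src._replace(scheme=dst.scheme, netloc=dst.netloc))

mirrored = rewrite_hf_uri(
    "https://huggingface.co/org/model/resolve/main/f.gguf",
    "https://hf-mirror.example.com",
)
```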
🐞 Fixes & Improvements
- Embeddings: Implemented `encoding_format=base64` for the embeddings endpoint.
- Kokoro TTS: Fixed the phonemization model not downloading during installation.
- Realtime API: Fixed Opus codec backend selection alias in development mode.
- Gallery Filtering: Fixed exact tag matching for model gallery filters.
- Open Responses: Fixed the required `ORItemParam.Arguments` field being omitted; `ORItemParam.Summary` is now always populated.
- Tracing: Fixed settings not loading from `runtime_settings.json`.
runtime_settings.json. - UI: Fixed watchdog field mapping, model list refresh on deletion, backend display in model config, MCP button ordering.
- Downloads: Fixed directory removal during fallback attempts; improved retry logic.
- Model Paths: Fixed `baseDir` assignment to use `ModelPath` correctly.
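On the `encoding_format=base64` fix: by the OpenAI convention (which this feature follows), the base64 payload decodes to a packed array of little-endian float32 values. A self-contained round-trip sketch, no server required:

```python
import base64
import struct

def decode_base64_embedding(b64: str) -> list[float]:
    """Decode a base64 embedding payload into floats, assuming the
    OpenAI wire convention of packed little-endian float32 values."""
    raw = base64.b64decode(b64)
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))

# Fabricated vector for the demo (values exactly representable in float32).
vec = [0.25, -1.5, 3.0]
payload = base64.b64encode(struct.pack(f"<{len(vec)}f", *vec)).decode()
decoded = decode_base64_embedding(payload)
```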
❤️ Thank You
LocalAI is a community-powered FOSS movement. Every star, every PR, every bug report matters.
If you believe in privacy-first, self-hosted AI:
- ⭐ Star the repo — it helps more than you think
- 🛠️ Contribute code, docs, or feedback
- 📣 Share with your team, your community, your world
Let's keep building the future of open AI — together. 💪
✅ Full Changelog
📋 Click to expand full changelog
What's Changed
Bug fixes 🐛
- fix: Change baseDir assignment to use ModelPath by @mudler in #9010
- fix(ui): correctly map watchdog fields by @mudler in #9022
- fix(api): unescape model names by @mudler in #9024
- fix(ui): Add tracing inline settings back and create UI tests by @richiejp in #9027
- Always populate ORItemParam.Summary by @tv42 in #9049
- fix(ui): correctly display backend if specified in the model config, re-order MCP buttons by @mudler in #9053
- fix(ui): Refresh model list on deletion by @richiejp in #9059
- fix(openresponses): do not omit required field ORItemParam.Arguments by @tv42 in #9074
- fix: Add tracing settings loading from runtime_settings.json by @localai-bot in #9081
- fix: use exact tag matching for model gallery tag filtering by @majiayu000 in #9041
- fix(realtime): Set the alias for opus so the development backend can be selected by @richiejp in #9083
- fix(llama.cpp): bundle libdl, librt, libpthread in llama-cpp backend by @mudler in #9099
- fix(download): do not remove dst dir until we try all fallbacks by @mudler in #9100
- fix(auth): do not allow to register in invite mode by @mudler in #9101
- fix(downloader): Rewrite full https HF URI with HF_ENDPOINT by @richiejp in #9107
- fix: implement encoding_format=base64 for embeddings endpoint by @walcz-de in #9135
- fix(coqui,nemo,voxcpm): Add dependencies to allow CI to progress by @richiejp in #9142
- fix(voxcpm): Force using a recent voxcpm version to kick the dependency solver by @richiejp in #9150
- fix: huggingface repo change the file name so Update index.yaml is needed by @ER-EPR in #9163
- fix(kokoro): Download phonemization model during installation by @richiejp in #9165...
v4.0.0
🎉 LocalAI 4.0.0 Release! 🚀
LocalAI 4.0.0 is out!
This major release transforms LocalAI into a complete AI orchestration platform. We've embedded agentic and hybrid search capabilities directly into the core, completely overhauled the user interface with React for a modern experience, and are thrilled to introduce Agenthub, a brand-new community hub to easily share and import agents. Alongside these massive updates, we've introduced powerful new features like Canvas mode for code artifacts, MCP apps, and full MCP client-side support.
| Feature | Summary |
|---|---|
| Agentic Orchestration & Agenthub | Native agent management with memory, skills, and the new Agenthub for community sharing. |
| Revamped React UI | Complete frontend rewrite for lightning-fast performance and modern UX. |
| Canvas Mode | Preview code blocks and artifacts side-by-side in the chat interface. |
| MCP Client-Side | Full Model Context Protocol support, MCP Apps, and tool streaming in chat. |
| WebRTC Realtime | WebRTC support for low-latency realtime audio conversations. |
| New Backends | Added experimental MLX Distributed, fish-speech, ace-step.cpp, and faster-qwen3-tts. |
| Infrastructure | Podman documentation, shell completion, and persistent data path separation. |
🚀 Key Features
🤖 Native Agentic Orchestration & Agenthub
LocalAI now includes agentic capabilities embedded directly in the core. You can manage, import, start, and stop agents via the new UI.
- 🌐 Agenthub: We are launching Agenthub! This is a centralized community space to share common agents and import them effortlessly into your LocalAI instance.
- Agent Management: Full lifecycle management via the React UI. Create Agents, connect them to Slack, configure MCP servers and skills.
- Skills Management: Centralized skill database for AI agents.
- Memory: Agents can utilize memory with Hybrid search (PostgreSQL) or embedded in-memory storage (Chromem).
- Observability: New "Events" column in the Agents list to track observables and status.
- 📚 Documentation: Dive into the new capabilities in our official Agents documentation.
agents.mp4
🎨 Revamped UI & Canvas Mode
The Web interface has been completely migrated to React, bringing a smoother experience and powerful new capabilities:
- Canvas Mode: Enable "canvas mode" in the chat to see code blocks and artifacts generated by the LLM in a dedicated preview bar on the right.
- System View: Tabbed navigation separating Models and Backends for better organization.
- Model Size Warnings: Visual warnings when model storage exceeds system RAM to prevent lockups.
- Traces: Improved trace display using accordions for better readability.
model-fit-canvas-mode.mp4
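The model-size warning above is essentially a storage-vs-RAM comparison surfaced in the UI. A minimal sketch of that check; the message format and threshold (plain "exceeds RAM") are illustrative, not LocalAI's exact logic:

```python
def model_fit_warning(model_bytes: int, ram_bytes: int):
    """Return a warning string when a model's on-disk size exceeds
    system RAM, else None. Illustrative sketch of the UI check."""
    if model_bytes <= ram_bytes:
        return None
    gib = 1024 ** 3
    return (f"model needs {model_bytes / gib:.1f} GiB "
            f"but only {ram_bytes / gib:.1f} GiB RAM is available")

# An 8 GiB model fits in 16 GiB RAM; a 13 GiB model does not fit in 8 GiB.
```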
🔌 MCP Apps & Client-Side Support
We’ve expanded support for the Model Context Protocol (MCP):
- MCP Apps: Select which servers to enable for the chat directly from the UI.
- Tool Streaming: Tools from MCP servers are automatically injected into the standard chat interface.
- Client-Side Support: Full client-side integration for MCP tools and streaming.
- Disable Option: Add `LOCALAI_DISABLE_MCP` to completely disable MCP support for security.
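An environment-variable kill switch like this typically gates the feature at startup. A sketch in the spirit of `LOCALAI_DISABLE_MCP`; the exact truthy values LocalAI accepts are assumed here ("1", "true", "yes"), so check the docs for specifics:

```python
import os

def mcp_enabled(env=os.environ) -> bool:
    """Return False when the MCP kill switch is set. The accepted
    truthy values are assumptions, not LocalAI's documented set."""
    return env.get("LOCALAI_DISABLE_MCP", "").lower() not in ("1", "true", "yes")

# Unset means MCP stays on; setting the variable turns it off.
```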
🎵 New Backends, Audio & Video Enhancements
- MLX Distributed (Experimental): We've added an experimental backend for running distributed workloads using Apple's MLX framework! Check out the docs here.
- New Audio Backends: Introduced fish-speech, ace-step.cpp, and faster-qwen3-tts (CUDA-only).
- WebRTC Realtime: WebRTC support added to the Realtime API and Talk page for better low-latency audio handling.
- TTS Improvements: Added `sample_rate` support via post-processing and multi-voice support for Qwen TTS.
- Video Generation: Fixed model selection dropdown sync and added `vllm-omni` backend detection.
🛠️ Infrastructure & Developer Experience
- Data Separation: New `--data-path` CLI flag and `LOCALAI_DATA_PATH` env var to separate persistent data (agents, skills) from configuration.
- Shell Completion: Dynamic completion scripts for bash, zsh, and fish.
- Podman Support: Dedicated documentation for Podman installation and rootless configuration.
- Gallery & Models: Model storage size display with RAM warnings, and fallback URI resolution for backend installation failures.
- Deprecations: HuggingFace backend support removed, and AIO images dropped to focus on main images.
🐞 Fixes & Improvements
- Logging: Fixed watchdog spamming logs when no interval was configured; downgraded health check logs to debug.
- CUDA Detection: Improved GPU vendor checks to prevent false CUDA detection on CPU-only hosts with runtime libs.
- Compatibility: Renamed `json_verbose` to `verbose_json` for OpenAI spec compliance (fixes Nextcloud integration).
- Embedding: Fixed embedding dimension truncation to return full native dimensions.
- Permissions: Changed model install file permissions to 0644 to ensure server readability.
- Windows Docker: Added named volumes to Docker Compose files for Windows compatibility.
- Model Reload: Models now reload automatically after editing the YAML config (e.g., `context_size`).
- Chat: Fixed an issue where thinking/reasoning blocks were sent to the LLM.
- Audio: Fixed img2img pipeline in diffusers backend and Qwen TTS duplicate argument error.
Known issues
- The `diffusers` backend currently fails to build (due to CI limit exhaustion) and is not part of this release; the previous version is still available. We are looking into it, but if you want to help and know someone at GitHub who could support us with better ARM runners, please reach out!
❤️ Thank You
LocalAI is a true FOSS movement — built by contributors, powered by community.
If you believe in privacy-first AI:
- ✅ Star the repo
- 💬 Contribute code, docs, or feedback
- 📣 Share with others
Your support keeps this stack alive.
✅ Full Changelog
📋 Click to expand full changelog
What's Changed
Breaking Changes 🛠
- Remove HuggingFace backend support by @localai-bot in #8971
- chore: drop AIO images by @mudler in #9004
Bug fixes 🐛
- fix(cli): Fix watchdog running constantly and spamming logs by @nanoandrew4 in #8624
- fix(api): Downgrade health/readiness check to debug by @nanoandrew4 in #8625
- fix: rename json_verbose to verbose_json by @lukasdotcom in #8627
- fix(chatterbox): add support for cuda13/aarch64 by @mudler in #8653
- fix: reload model after editing YAML config (issue #8647) by @localai-bot in #8652
- fix(chat): do not send thinking/reasoning messages to the LLM by @mudler in #8656
- fix: change file permissions from 0600 to 0644 in InstallModel by @localai-bot in #8657
- fix: Add named volumes for Windows Docker compatibility by @localai-bot in #8661
- fix(gallery): add fallback URI resolution for backend installation by @localai-bot in #8663
- fix: whisper breaking on cuda-13 (use absolute path for CUDA directory detection) by @localai-bot in #8678
- fix(gallery): clean up partially downloaded backend on installation failure by @localai-bot in #8679
- fix: properly sync model selection dropdown in video generation UI by @localai-bot in #8680
- fix: allow reranking models configured with known_usecases by @localai-bot in #8681
- fix: return full embedding dimensions instead of truncating trailing zeros (#8721) by @localai-bot in #8755
- fix: Add vllm-omni backend to video generation model detection (#8659) by @localai-bot in #8781
- fix(qwen-tts): duplicate instruct argument in voice design mode by @Weathercold in #8842
- Fix image upload processing and img2img pipeline in diffusers backend by @attilagyorffy in #8879
- fix: gate CUDA directory checks on GPU vendor to prevent false CUDA detection by @sozercan in #8942
- fix(llama-cpp): Set enable_thinking in the correct place by @richiejp in #8973
Exciting New Features 🎉
- feat(traces): Use accordian instead of pop-ups by @richiejp in #8626
- chore: remove install.sh script and documentation references by @localai-bot in #8643
- docs: add Podman installation documentation by @localai-bot in htt...
v3.12.1
This is a patch release to tag the new llama.cpp version, which fixes incompatibilities with Qwen 3 Coder.
What's Changed
Other Changes
- docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #8611
- feat(traces): Add backend traces by @richiejp in #8609
- chore: ⬆️ Update ggml-org/llama.cpp to b908baf1825b1a89afef87b09e22c32af2ca6548 by @localai-bot in #8612
- chore: drop bark.cpp leftovers from pipelines by @mudler in #8614
- fix: merge openresponses messages by @mudler in #8615
- chore: ⬆️ Update ggml-org/llama.cpp to ba3b9c8844aca35ecb40d31886686326f22d2214 by @localai-bot in #8613
Full Changelog: v3.12.0...v3.12.1
v3.12.0
🎉 LocalAI 3.12.0 Release! 🚀
LocalAI 3.12.0 is out!
| Feature | Summary |
|---|---|
| Multi-modal Realtime | Send text, images, and audio in real-time conversations for richer interactions. |
| Voxtral Backend | New high-quality text-to-speech backend added. |
| Multi-GPU Support | Improved Diffusers performance with multiple GPUs. |
| Legacy CPU Optimization | Enhanced compatibility for older processors. |
| UI Theme & Layout | Improved UI theme (dark/light variants) and navigation |
| Realtime Stability | Multiple fixes for audio, image, and model handling. |
| Logging Improvements | Reduced excessive logs and optimized processing. |
Local Stack Family
Liking LocalAI? LocalAI is part of an integrated suite of AI infrastructure tools, you might also like:
- LocalAGI - AI agent orchestration platform with OpenAI Responses API compatibility and advanced agentic capabilities
- LocalRecall - MCP/REST API knowledge base system providing persistent memory and storage for AI agents
- 🆕 Cogito - Go library for building intelligent, co-operative agentic software and LLM-powered workflows, focused on improving results for small, open-source language models while scaling to any LLM. Powers LocalAGI and LocalAI's MCP/agentic capabilities
- 🆕 Wiz - Terminal-based AI agent accessible via Ctrl+Space keybinding. Portable, local-LLM friendly shell assistant with TUI/CLI modes, tool execution with approval, MCP protocol support, and multi-shell compatibility (zsh, bash, fish)
- 🆕 SkillServer - Simple, centralized skills database for AI agents via MCP. Manages skills as Markdown files with MCP server integration, web UI for editing, Git synchronization, and full-text search capabilities
❤️ Thank You
LocalAI is a true FOSS movement — built by contributors, powered by community.
If you believe in privacy-first AI:
- ✅ Star the repo
- 💬 Contribute code, docs, or feedback
- 📣 Share with others
Your support keeps this stack alive.
✅ Full Changelog
📋 Click to expand full changelog
What's Changed
Bug fixes 🐛
- security: validate URLs to prevent SSRF in content fetching endpoints by @kolega-ai-dev in #8476
- fix(realtime): Use user provided voice and allow pipeline models to have no backend by @richiejp in #8415
- fix(realtime): Sampling and websocket locking by @richiejp in #8521
- fix(realtime): Send proper image data to backend by @richiejp in #8547
- fix: prevent excessive logging in capability detection by @localai-bot in #8552
- fix(voxcpm): pin setuptools by @mudler in #8556
- fix(llama-cpp): populate tensor_buft_override buffer so llama-cpp properly performs fit calculations by @cvpcs in #8560
- fix: pin neutts-air to known working commit by @localai-bot in #8566
- fix: improve watchdown logics by @mudler in #8591
- fix(llama-cpp): Pass parameters when using embedded template by @richiejp in #8590
- fix(realtime): Better support for thinking models and setting model parameters by @richiejp in #8595
- fix(realtime): Limit buffer sizes to prevent DoS by @richiejp in #8596
- fix(ui): improve view on mobile by @mudler in #8598
- fix(diffusers): sd_embed is not always available by @mudler in #8602
- fix: do not keep track model if not existing by @mudler in #8603
Exciting New Features 🎉
- feat(stablediffusion-ggml): Improve legacy CPU support for stablediffusion-ggml backend by @cvpcs in #8461
- feat(voxtral): add voxtral backend by @mudler in #8451
- feat(diffusers): add experimental support for sd_embed-style prompt embedding by @cvpcs in #8504
- chore: improve log levels verbosity by @localai-bot in #8528
- feat(realtime): Allow sending text, image and audio conversation items" by @richiejp in #8524
- chore: compute capabilities once by @mudler in #8555
- feat(ui): left navbar, dark/light theme by @mudler in #8594
- fix: multi-GPU support for Diffusers (Issue #8575) by @localai-bot in #8605
🧠 Models
- chore(model gallery): Add Ministral 3 family of models (aside from base versions) by @rampa3 in #8467
- chore(model gallery): add voxtral (which is only available in development) by @mudler in #8532
- chore(model gallery): Add npc-llm-3-8b by @rampa3 in #8498
- chore(model gallery): add nemo-asr by @mudler in #8533
- chore(model gallery): add voxcpm, whisperx, moonshine-tiny by @mudler in #8534
- chore(model gallery): add neutts by @mudler in #8535
- chore(model gallery): add vllm-omni models by @mudler in #8536
- chore(model-gallery): ⬆️ update checksum by @localai-bot in #8540
- feat(gallery): Add nanbeige4.1-3b by @richiejp in #8551
- chore(model-gallery): ⬆️ update checksum by @localai-bot in #8593
- chore(model-gallery): ⬆️ update checksum by @localai-bot in #8600
👒 Dependencies
- chore(deps): bump github.com/anthropics/anthropic-sdk-go from 1.20.0 to 1.22.0 by @dependabot[bot] in #8482
- chore(deps): bump github.com/jaypipes/ghw from 0.21.2 to 0.22.0 by @dependabot[bot] in #8484
- chore(deps): bump github.com/onsi/ginkgo/v2 from 2.28.0 to 2.28.1 by @dependabot[bot] in #8483
- chore(deps): bump github.com/alecthomas/kong from 1.13.0 to 1.14.0 by @dependabot[bot] in #8481
- chore(deps): bump github.com/openai/openai-go/v3 from 3.17.0 to 3.19.0 by @dependabot[bot] in #8485
- chore: bump cogito by @mudler in #8568
- fix(gallery): Use YAML v3 to avoid merging maps with incompatible keys by @richiejp in #8580
- chore(deps): bump google.golang.org/grpc from 1.78.0 to 1.79.1 by @dependabot[bot] in #8583
- chore(deps): bump github.com/jaypipes/ghw from 0.22.0 to 0.23.0 by @dependabot[bot] in #8587
- chore(deps): bump github.com/modelcontextprotocol/go-sdk from 1.2.0 to 1.3.0 by @dependabot[bot] in #8585
- chore(deps): bump cogito and add new options to the agent config by @mudler in #8601
Other Changes
- docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #8462
- docs: update model gallery documentation to reference main repository by @veeceey in #8452
- chore: ⬆️ Update ggml-org/whisper.cpp to 4b23ff249e7f93137cb870b28fb27818e074c255 by @localai-bot in #8463
- chore: ⬆️ Update ggml-org/llama.cpp to e06088da0fa86aa444409f38dff274904931c507 by @localai-bot in #8464
- chore: ⬆️ Update antirez/voxtral.c to c9e8773a2042d67c637fc492c8a655c485354080 by @localai-bot in #8477
- chore: ⬆️ Update ggml-org/llama.cpp to 262364e31d1da43596fe84244fba44e94a0de64e by @localai-bot in #8479
- chore: ⬆️ Update ggml-org/whisper.cpp to 764482c3175d9c3bc6089c1ec84df7d1b9537d83 by @localai-bot in #8478
- chore: ⬆️ Update ggml-org/llama.cpp to 57487a64c88c152ac72f3aea09bd1cc491b2f61e by @localai-bot in #8499
- chore: ⬆️ Update ggml-org/llama.cpp to 4d3daf80f8834e0eb5148efc7610513f1e263653 by @localai-bot in #8513
- chore: ⬆️ Update ggml-org/llama.cpp to 338085c69e486b7155e5b03d7b5087e02c0e2528 by @localai-bot in #8538
- fix: update moonshine API, add setuptools to voxcpm requirements by @mudler in #8541
- chore: ⬆️ Update ggml-org/llama.cpp to 05a6f0e8946914918758db767f6eb04bc1e38507 by @localai-bot in #8553
- chore: ⬆️ Update ggml-org/llama.cpp to 01d8eaa28d57bfc6d06e30072085ed0ef12e06c5 by @localai-bot in #8567
- chore: ⬆️ Update...
v3.11.0
🎉 LocalAI 3.11.0 Release! 🚀
LocalAI 3.11.0 is a massive update for Audio and Multimodal capabilities.
We are introducing Realtime Audio Conversations, a dedicated Music Generation UI, and a massive expansion of ASR (Speech-to-Text) and TTS backends. Whether you want to talk to your AI, clone voices, transcribe with speaker identification, or generate songs, this release has you covered.
Check out the highlights below!
📌 TL;DR
| Feature | Summary |
|---|---|
| Realtime Audio | Native support for audio conversations, enabling fluid voice interactions similar to OpenAI's Realtime API. Documentation |
| Music Generation UI | New UI interface for MusicGen (Ace-Step), allowing you to generate music from text prompts directly in the browser. |
| New ASR Backends | Added WhisperX (with Speaker Diarization), VibeVoice, Qwen-ASR, and Nvidia NeMo. |
| TTS Streaming | Text-to-Speech now supports streaming mode for lower latency responses. (VoxCPM only for now) |
| vLLM Omni | Added support for vLLM Omni, expanding our high-performance inference capabilities. |
| Speaker Diarization | Native support for identifying different speakers in transcriptions via WhisperX. |
| Hardware Expansion | Expanded build support for CUDA 12/13, L4T (Jetson), SBSA, and better Metal (Apple Silicon) integration with MLX backends |
| Breaking Changes | ExLlama (deprecated) and Bark (unmaintained) backends have been removed. |
🚀 New Features & Major Enhancements
🎙️ Realtime Audio Conversations
LocalAI 3.11.0 introduces native support for Realtime Audio Conversations.
- Enables fluid, low-latency voice interaction with agents.
- Logic handled directly within the LocalAI pipeline for seamless audio-in/audio-out workflows.
- Support for STT/TTS and voice-to-voice models (experimental)
- Support for tool calls
🗣️ Talk to your LocalAI: This brings us one step closer to a fully local, voice-native assistant experience compatible with standard client implementations.
Check here for detailed documentation.
🎵 Music Generation UI & Ace-Step
We have added a dedicated interface for music generation!
- New Backend: Support for Ace-Step (MusicGen) via the `ace-step` backend.
- Web UI Integration: Generate musical clips directly from the LocalAI Web UI.
- Simple text-to-music workflow (e.g., "Lo-fi hip hop beat for studying").
🎧 Massive ASR (Speech-to-Text) Expansion
This release significantly broadens our transcription capabilities with four new backends:
- WhisperX: Provides fast transcription with Speaker Diarization (identifying who is speaking).
- VibeVoice: Now also supports ASR alongside TTS.
- Qwen-ASR: Support for Qwen's powerful speech recognition models.
- Nvidia NeMo: Initial support for NeMo ASR.
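To show what diarized output enables: once segments carry speaker labels, a transcript becomes a simple label-plus-text rendering. The segment schema below is illustrative, not WhisperX's or LocalAI's exact response format:

```python
def format_diarized(segments):
    """Render diarized ASR segments into a readable transcript.
    The 'speaker'/'text' schema is an assumption for illustration."""
    return "\n".join(f"[{s['speaker']}] {s['text']}" for s in segments)

transcript = format_diarized([
    {"speaker": "SPEAKER_00", "text": "How are you?"},
    {"speaker": "SPEAKER_01", "text": "Fine, thanks."},
])
```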
🗣️ TTS Streaming & New Voices
Text-to-Speech gets a speed boost and new options:
- Streaming Support: TTS endpoints now support streaming, reducing the "time-to-first-audio" significantly.
- VoxCPM: Added support for the VoxCPM backend.
- Qwen-TTS: Added support for Qwen-TTS models
- Piper Voices: Added most remaining Piper voices from Hugging Face to the gallery.
🛠️ Hardware & Backend Updates
- vLLM Omni: A new backend integration for vLLM Omni models.
- Extended Platform Support: Major work on MLX to improve compatibility across CUDA 12, CUDA 13, L4T (Nvidia Jetson), SBSA, and macOS Metal.
- GGUF Cleanup: Dropped redundant VRAM estimation logic for GGUF loading, relying on more accurate internal measurements.
⚠️ Breaking Changes
To keep the project lean and maintainable, we have removed some older backends:
- ExLlama: Removed (deprecated in favor of newer loaders like ExLlamaV2 or llama.cpp).
- Bark: Removed (the upstream project is unmaintained; we recommend using the new TTS alternatives).
🚀 The Complete Local Stack for Privacy-First AI
| Project | Description |
|---|---|
| LocalAI | The free, Open Source OpenAI alternative. Drop-in replacement REST API compatible with OpenAI specifications for local AI inferencing. No GPU required. |
| LocalAGI | Local AI agent management platform. Drop-in replacement for OpenAI's Responses API, supercharged with advanced agentic capabilities and a no-code UI. |
| LocalRecall | RESTful API and knowledge base management system providing persistent memory and storage capabilities for AI agents. Works alongside LocalAI and LocalAGI. |
❤️ Thank You
LocalAI is a true FOSS movement — built by contributors, powered by community.
If you believe in privacy-first AI:
- ✅ Star the repo
- 💬 Contribute code, docs, or feedback
- 📣 Share with others
Your support keeps this stack alive.
✅ Full Changelog
📋 Click to expand full changelog
What's Changed
Breaking Changes 🛠
Bug fixes 🐛
- fix(ui): correctly display selected image model by @dedyf5 in #8208
- fix(ui): take account of reasoning in token count calculation by @mudler in #8324
- fix: drop gguf VRAM estimation (now redundant) by @mudler in #8325
- fix(api): Add missing field in initial OpenAI streaming response by @acon96 in #8341
- fix(realtime): Include noAction function in prompt template and handle tool_choice by @richiejp in #8372
- fix: filter GGUF and GGML files from model list by @Yaroslav98214 in #8397
- fix(qwen-asr): Remove contagious slop (DEFAULT_GOAL) from Makefile by @richiejp in #8431
Exciting New Features 🎉
- feat(vllm-omni): add new backend by @mudler in #8188
- feat(vibevoice): add ASR support by @mudler in #8222
- feat: add VoxCPM tts backend by @mudler in #8109
- feat(realtime): Add audio conversations by @richiejp in #6245
- feat(qwen-asr): add support to qwen-asr by @mudler in #8281
- feat(tts): add support for streaming mode by @mudler in #8291
- feat(api): Add transcribe response format request parameter & adjust STT backends by @nanoandrew4 in #8318
- feat(whisperx): add whisperx backend for transcription with speaker diarization by @eureka928 in #8299
- feat(mlx): Add support for CUDA12, CUDA13, L4T, SBSA and CPU by @mudler in #8380
- feat(musicgen): add ace-step and UI interface by @mudler in #8396
- fix(api)!: Stop model prior to deletion by @nanoandrew4 in #8422
- feat(nemo): add Nemo (only asr for now) backend by @mudler in #8436
🧠 Models
- chore(model gallery): add qwen3-tts to model gallery by @mudler in #8187
- chore(model gallery): Add most of not yet present Piper voices from Hugging Face by @rampa3 in #8202
- chore: drop bark which is unmaintained by @mudler in #8207
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #8220
- chore(model gallery): Add entry for Mistral Small 3.1 with mmproj by @rampa3 in https://git...
v3.10.1
This is a small patch release intended to provide bug fixes and minor polish. Alongside, we also added support for Qwen-TTS, which was released just yesterday.
- Fix reasoning detection on reasoning and instruct models
- Support reasoning blocks with openresponses
- API fixes to correctly run LTX-2
- Support Qwen3-TTS!
What's Changed
Bug fixes 🐛
- fix(reasoning): support models with reasoning without starting thinking tag by @mudler in #8132
- fix(tracing): Create trace buffer on first request to enable tracing at runtime by @richiejp in #8148
- fix(videogen): drop incomplete endpoint, add GGUF support for LTX-2 by @mudler in #8160
Exciting New Features 🎉
- feat(openresponses): Support reasoning blocks by @mudler in #8133
- feat: detect thinking support from backend automatically if not explicitly set by @mudler in #8167
- feat(qwen-tts): add Qwen-tts backend by @mudler in #8163
🧠 Models
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #8128
- chore(model gallery): add flux 2 and flux 2 klein by @mudler in #8141
- chore(model-gallery): ⬆️ update checksum by @localai-bot in #8153
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #8157
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #8170
👒 Dependencies
- chore(deps): bump github.com/mudler/cogito from 0.7.2 to 0.8.1 by @dependabot[bot] in #8124
Other Changes
- feat(swagger): update swagger by @localai-bot in #8098
- chore: ⬆️ Update ggml-org/llama.cpp to 287a33017b32600bfc0e81feeb0ad6e81e0dd484 by @localai-bot in #8100
- chore: ⬆️ Update leejet/stable-diffusion.cpp to 2efd19978dd4164e387bf226025c9666b6ef35e2 by @localai-bot in #8099
- docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #8120
- chore: ⬆️ Update leejet/stable-diffusion.cpp to a48b4a3ade9972faf0adcad47e51c6fc03f0e46d by @localai-bot in #8121
- chore: ⬆️ Update ggml-org/llama.cpp to 959ecf7f234dc0bc0cd6829b25cb0ee1481aa78a by @localai-bot in #8122
- chore(deps): Bump llama.cpp to '1c7cf94b22a9dc6b1d32422f72a627787a4783a3' by @mudler in #8136
- chore: drop noisy logs by @mudler in #8142
- chore: ⬆️ Update ggml-org/llama.cpp to ad8d85bd94cc86e89d23407bdebf98f2e6510c61 by @localai-bot in #8145
- chore: ⬆️ Update ggml-org/whisper.cpp to 7aa8818647303b567c3a21fe4220b2681988e220 by @localai-bot in #8146
- feat(swagger): update swagger by @localai-bot in #8150
- chore(diffusers): add 'av' to requirements.txt by @mudler in #8155
- chore: ⬆️ Update leejet/stable-diffusion.cpp to 329571131d62d64a4f49e1acbef49ae02544fdcd by @localai-bot in #8152
- chore: ⬆️ Update ggml-org/llama.cpp to c301172f660a1fe0b42023da990bf7385d69adb4 by @localai-bot in #8151
- chore: ⬆️ Update ggml-org/llama.cpp to a5eaa1d6a3732bc0f460b02b61c95680bba5a012 by @localai-bot in #8165
- chore: ⬆️ Update leejet/stable-diffusion.cpp to 5e4579c11d0678f9765463582d024e58270faa9c by @localai-bot in #8166
Full Changelog: v3.10.0...v3.10.1
v3.10.0
🎉 LocalAI 3.10.0 Release! 🚀
LocalAI 3.10.0 is big on agent capabilities, multi-modal support, and cross-platform reliability.
We've added native Anthropic API support, launched a new Video Generation UI, introduced Open Responses API compatibility, and enhanced performance with a unified GPU backend system.
For a full tour, see below!
📌 TL;DR
| Feature | Summary |
|---|---|
| Anthropic API Support | Fully compatible /v1/messages endpoint for seamless drop-in replacement of Claude. |
| Open Responses API | Native support for stateful agents with tool calling, streaming, background mode, and multi-turn conversations, passing all official acceptance tests. |
| Video & Image Generation Suite | New video gen UI + LTX-2 support for text-to-video and image-to-video. |
| Unified GPU Backends | GPU libraries (CUDA, ROCm, Vulkan) packaged inside backend containers — works out of the box on Nvidia, AMD, and ARM64 (Experimental). |
| Tool Streaming & XML Parsing | Full support for streaming tool calls and XML-formatted tool outputs. |
| System-Aware Backend Gallery | Only see backends your system can run (e.g., hide MLX on Linux). |
| Crash Fixes | Prevents crashes on AVX-only CPUs (Intel Sandy/Ivy Bridge) and fixes VRAM reporting on AMD GPUs. |
| Request Tracing | Debug agents & fine-tuning with memory-based request/response logging. |
| Moonshine Backend | Ultra-fast transcription engine for low-end devices. |
| Pocket-TTS | Lightweight, high-fidelity text-to-speech with voice cloning. |
| Vulkan arm64 builds | We now build backends and images for Vulkan on arm64 as well. |
🚀 New Features & Major Enhancements
🤖 Open Responses API: Build Smarter, Autonomous Agents
LocalAI now supports the OpenAI Responses API, enabling powerful agentic workflows locally.
- Stateful conversations via response_id — resume and manage long-running agent sessions.
- Background mode: Run agents asynchronously and fetch results later.
- Streaming support for tools, images, and audio.
- Built-in tools: Web search, file search, and computer use (via MCP integrations).
- Multi-turn interaction with dynamic context and tool use.
✅ Ideal for developers building agents that can browse, analyze files, or interact with systems — all on your local machine.
🔧 How to Use:
- Set response_id in your request to maintain session state across calls.
- Use background: true to run agents asynchronously.
- Retrieve results via GET /api/v1/responses/{response_id}.
- Enable streaming with stream: true to receive partial responses and tool calls in real time.
📌 Tip: Use response_id to build agent orchestration systems that persist context and avoid redundant computation.
Our support passes all the official acceptance tests.
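The workflow above can be sketched as a tiny payload builder. This is a minimal, hedged example: the endpoint and field names (response_id, background, stream) follow the description in these notes, and "your-model" is a placeholder; check the LocalAI API documentation for the exact schema of your version.

```python
import json

BASE_URL = "http://localhost:8080"  # assumption: LocalAI's default address

def build_request(prompt, response_id=None, background=False, stream=False):
    """Build a Responses API payload using the options described above.

    Field names follow the release notes and are assumptions; verify
    them against the API docs for your LocalAI version.
    """
    payload = {"model": "your-model", "input": prompt}  # placeholder model
    if response_id:
        payload["response_id"] = response_id  # resume an existing session
    if background:
        payload["background"] = True          # run asynchronously, fetch later
    if stream:
        payload["stream"] = True              # receive partial events
    return payload

# First turn, then a follow-up that reuses the session state:
first = build_request("Summarize the repo README")
follow = build_request("Now list action items",
                       response_id="resp_123", background=True)
print(json.dumps(follow, indent=2))
```

A background request like `follow` would then be polled via GET /api/v1/responses/{response_id} as described above.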
🧠 Anthropic Messages API: Clone Claude Locally
LocalAI now fully supports the Anthropic messages API.
- Use https://api.localai.host/v1/messages as a drop-in replacement for Claude.
- Full tool/function calling support, just like OpenAI.
- Streaming and non-streaming responses.
- Compatible with anthropic-sdk-go, LangChain, and other tooling.
🔥 Perfect for teams migrating from Anthropic to local inference with full feature parity.
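A minimal request body in the Anthropic Messages schema might look like the sketch below; the base URL and model name are placeholders for your own deployment, and the field layout follows the standard Anthropic Messages format.

```python
import json

BASE_URL = "http://localhost:8080"  # placeholder: point at your LocalAI instance

def messages_request(user_text, model="your-model", max_tokens=1024, stream=False):
    """Build a payload matching the Anthropic Messages API schema."""
    body = {
        "model": model,
        "max_tokens": max_tokens,  # required field in the Anthropic schema
        "messages": [{"role": "user", "content": user_text}],
    }
    if stream:
        body["stream"] = True  # request SSE streaming
    return BASE_URL + "/v1/messages", body

url, body = messages_request("Hello from LocalAI")
print(url)
print(json.dumps(body, indent=2))
```

Because the schema matches, existing Anthropic clients only need their base URL repointed at LocalAI.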
🎥 Video Generation: From Text to Video in the Web UI
- New dedicated video generation page with intuitive controls.
- LTX-2 is supported.
- Supports text-to-video and image-to-video workflows.
- Built on top of diffusers with full compatibility.
📌 How to Use:
- Go to /video in the web UI.
- Enter a prompt (e.g., "A cat walking on a moonlit rooftop").
- Optionally upload an image for image-to-video generation.
- Adjust parameters like fps, num_frames, and guidance_scale.
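As a rough guide to how these parameters interact, here is a hedged sketch of a generation config; the values are illustrative rather than recommended defaults, and only the parameter names listed above are taken from the release notes.

```python
# Illustrative settings you might enter on the /video page
# (values are examples, not tuned defaults):
video_params = {
    "prompt": "A cat walking on a moonlit rooftop",
    "image": None,           # optional: source image for image-to-video
    "fps": 24,               # frames per second of the output clip
    "num_frames": 96,        # total frames; 96 at 24 fps is a 4-second clip
    "guidance_scale": 7.0,   # higher values follow the prompt more strictly
}

# Clip length falls out of the two frame parameters:
duration_s = video_params["num_frames"] / video_params["fps"]
print(f"Requested clip length: {duration_s:.1f}s")  # 4.0s
```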
⚙️ Unified GPU Backends: Acceleration Works Out of the Box
A major architectural upgrade: GPU libraries (CUDA, ROCm, Vulkan) are now packaged inside backend containers.
- Single image: You no longer need to pull a GPU-specific image; any image works whether or not you have a GPU.
- No more manual GPU driver setup — just run the image and get acceleration.
- Works on Nvidia (CUDA), AMD (ROCm), and ARM64 (Vulkan).
- Vulkan arm64 builds enabled
- Reduced image complexity, faster builds, and consistent performance.
🚀 This means latest/master images now support GPU acceleration on all platforms — no extra config!
Note: this is experimental, please help us by filing an issue if something doesn't work!
🧩 Tool Streaming & Advanced Parsing
Enhance your agent workflows with richer tool interaction.
- Streaming tool calls: Receive partial tool arguments in real time (e.g., input_json_delta).
- XML-style tool call parsing: Models that return tools in XML format (<function>...</function>) are now properly parsed alongside text.
- Works across all backends (llama.cpp, vLLM, diffusers, etc.).
💡 Enables more natural, real-time interaction with agents that use structured tool outputs.
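To illustrate the XML convention, here is a small stand-alone parser that separates plain text from <function>...</function> blocks. It is a simplified sketch, not LocalAI's actual parser, which may accept additional formats.

```python
import re

def extract_tool_calls(text):
    """Split model output into plain text and XML-style tool call bodies.

    Illustrative parser for the <function>...</function> convention;
    the real LocalAI parser may handle more variants.
    """
    pattern = re.compile(r"<function>(.*?)</function>", re.DOTALL)
    calls = pattern.findall(text)          # inner content of each call
    plain = pattern.sub("", text).strip()  # surrounding free-form text
    return plain, calls

output = ('Checking the weather. '
          '<function>{"name": "get_weather", "arguments": {"city": "Rome"}}</function>')
plain, calls = extract_tool_calls(output)
print(plain)   # the text kept alongside the parsed call
print(calls)
```

This mirrors the behavior described above: the tool call is extracted while the surrounding text is preserved for the user-facing reply.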
🌐 System-Aware Backend Gallery: Only Compatible Backends Show
The backend gallery now shows only backends your system can run.
- Auto-detects system capabilities (CPU, GPU, MLX, etc.).
- Hides unsupported backends (e.g., MLX on Linux, CUDA on AMD).
- Shows detected capabilities in the hero section.
🎤 New TTS Backends: Pocket-TTS
Add expressive voice generation to your apps with Pocket-TTS.
- Real-time text-to-speech with voice cloning support (requires HF login).
- Lightweight, fast, and open-source.
- Available in the model gallery.
🗣️ Perfect for voice agents, narrators, or interactive assistants.
❗ Note: Voice cloning requires HF authentication and a registered voice model.
🔍 Request Tracing: Debug Your Agents
Trace requests and responses in memory — great for fine-tuning and agent debugging.
- Enable via runtime setting or API.
- Logs are stored in memory and dropped once the maximum size is reached.
- Fetch logs via GET /api/v1/trace.
- Export to JSON for analysis.
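The drop-after-max-size behavior can be modeled with a bounded deque; this is a toy illustration of the idea, not LocalAI's implementation, and the size limit shown is arbitrary.

```python
from collections import deque

class TraceBuffer:
    """Toy model of the in-memory trace store: once the maximum number
    of entries is reached, the oldest entries are dropped first."""

    def __init__(self, max_entries=1000):
        self._entries = deque(maxlen=max_entries)  # bounded ring buffer

    def record(self, request, response):
        self._entries.append({"request": request, "response": response})

    def dump(self):
        # Roughly what GET /api/v1/trace would return for JSON export
        return list(self._entries)

buf = TraceBuffer(max_entries=2)
buf.record({"path": "/v1/chat/completions"}, {"status": 200})
buf.record({"path": "/v1/messages"}, {"status": 200})
buf.record({"path": "/v1/responses"}, {"status": 200})  # evicts the oldest
print(len(buf.dump()))  # 2
```

Keeping traces in a bounded in-memory buffer means tracing can stay enabled at runtime without unbounded memory growth.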
🪄 New 'Reasoning' Field: Extract Thinking Steps
LocalAI now automatically detects and extracts thinking tags from model output.
- Supports both SSE and non-SSE modes.
- Displays reasoning steps in the chat UI (under "Thinking" tab).
- Fixes issue where thinking content appeared as part of final answer.
🚀 Moonshine Backend: Faster Transcription for Low-End Devices
Add Moonshine, an ONNX-based transcription engine, for fast, lightweight speech-to-text.
- Optimized for low-end devices (Raspberry Pi, older laptops).
- One of the fastest transcription engines available.
- Supports live transcription.
🛠️ Fixes & Stability Improvements
🔧 Prevent BMI2 Crashes on AVX-Only CPUs
Fixed crashes on older Intel CPUs (Ivy Bridge, Sandy Bridge) that lack BMI2 instructions.
- Now safely falls back to llama-cpp-fallback (SSE2 only).
- No more EOF errors during model warmup.
✅ Ensures LocalAI runs smoothly on older hardware.
📊 Fix Swapped VRAM Usage on AMD GPUs
Correctly parses rocm-smi output: used and total VRAM are now displayed correctly.
- Fixes misreported memory usage on dual-Radeon setups.
- Handles HIP_VISIBLE_DEVICES properly (e.g., when using only the discrete GPU).
🚀 The Complete Local Stack for Privacy-First AI
LocalAI |
The free, Open Source OpenAI alternative. Drop-in replacement REST API compatible with OpenAI specifications for local AI inferencing. No GPU required. |
LocalAGI |
Local AI agent management platform. Drop-in replacement for OpenAI's Responses API, supercharged with advanced agentic capabilities and a no-code UI. |
LocalRecall |
RESTful API and knowledge base management system providing persistent memory and storage capabilities for AI agents. Works alongside LocalAI and LocalAGI. |
