
Releases: mudler/LocalAI

v4.1.3

06 Apr 23:05
fdc9f7b


What's Changed

Bug fixes 🐛

  • fix(token): login via legacy api keys by @mudler in #9249
  • fix(anthropic): do not emit empty tokens and fix SSE tool calls by @mudler in #9258
  • fix(gpu): better detection for MacOS and Thor by @mudler in #9263

👒 Dependencies

  • chore(deps): bump google.golang.org/grpc from 1.79.3 to 1.80.0 by @dependabot[bot] in #9253
  • chore(deps): bump github.com/jaypipes/ghw from 0.23.0 to 0.24.0 by @dependabot[bot] in #9250
  • chore(deps): bump github.com/aws/aws-sdk-go-v2/config from 1.32.12 to 1.32.14 by @dependabot[bot] in #9256
  • chore(deps): bump go.opentelemetry.io/otel/exporters/prometheus from 0.64.0 to 0.65.0 by @dependabot[bot] in #9254

Other Changes

  • chore: ⬆️ Update ggml-org/llama.cpp to d0a6dfeb28a09831d904fc4d910ddb740da82834 by @localai-bot in #9259
  • docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #9260
  • chore: ⬆️ Update ace-step/acestep.cpp to e0c8d75a672fca5684c88c68dbf6d12f58754258 by @localai-bot in #9261
  • chore: ⬆️ Update leejet/stable-diffusion.cpp to 8afbeb6ba9702c15d41a38296f2ab1fe5c829fa0 by @localai-bot in #9262

Full Changelog: v4.1.2...v4.1.3

v4.1.2

06 Apr 08:54
ad232fd


What's Changed

Bug fixes 🐛

  • fix(autoparser): correctly pass by logprobs by @mudler in #9239
  • fix(chat): do not retry if we had chatdeltas or tooldeltas from backend by @mudler in #9244

Exciting New Features 🎉

  • feat(llama.cpp): wire speculative decoding settings by @mudler in #9238

Other Changes

  • Update index.yaml and add Qwen3.5 model files by @ER-EPR in #9237
  • chore: ⬆️ Update ggml-org/llama.cpp to 761797ffdf2ce3f118e82c663b1ad7d935fbd656 by @localai-bot in #9243
  • chore: ⬆️ Update leejet/stable-diffusion.cpp to 7397ddaa86f4e8837d5261724678cde0f36d4d89 by @localai-bot in #9242
  • docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #9241

Full Changelog: v4.1.1...v4.1.2

v4.1.1

05 Apr 00:06


This is a patch release to address a few regressions from the last release and to support the upcoming Gemma 4, most importantly:

  • Fixes Gemma 4 tokenization with llama.cpp
  • Shows the login page when only an API key is configured
  • Small fixes to improve Anthropic API compatibility

What's Changed

Other Changes

  • docs: Update Home Assistant integrations list by @loryanstrant in #9206
  • chore: ⬆️ Update ggml-org/llama.cpp to a1cfb645307edc61a89e41557f290f441043d3c2 by @localai-bot in #9203
  • chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #9210
  • chore: bump inference defaults from unsloth by @github-actions[bot] in #9219
  • docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #9214
  • chore: ⬆️ Update ggml-org/llama.cpp to d006858316d4650bb4da0c6923294ccd741caefd by @localai-bot in #9215
  • fix(ui): pass by staticApiKeyRequired to show login when only api key is configured by @mudler in #9220
  • feat(gemma4): add thinking support by @mudler in #9221
  • fix(nats): improve error handling by @mudler in #9222
  • feat(autoparser): prefer chat deltas from backends when emitted by @mudler in #9224
  • fix(anthropic): show null index when not present, default to 0 by @mudler in #9225
  • feat(api): Allow coding agents to interactively discover how to control and configure LocalAI by @richiejp in #9084
  • chore(refactor): use interface by @mudler in #9226
  • fix(reasoning): accumulate and strip reasoning tags from autoparser results by @mudler in #9227
  • chore(model-gallery): ⬆️ update checksum by @localai-bot in #9233
  • chore: ⬆️ Update ggml-org/llama.cpp to b8635075ffe27b135c49afb9a8b5c434bd42c502 by @localai-bot in #9231

New Contributors

  • @github-actions[bot] made their first contribution in #9219

Full Changelog: v4.1.0...v4.1.1

v4.1.0

02 Apr 22:14
e9f10f2


🎉 LocalAI 4.1.0 Release! 🚀




LocalAI 4.1.0 is out! 🔥

Just weeks after the landmark 4.0, we're back with another massive drop. This release turns LocalAI into a production-grade AI platform: spin up a distributed cluster with smart routing and autoscaling, lock it down with built-in auth and per-user quotas, fine-tune models without leaving the UI, and much more. If 4.0 was the foundation, 4.1 is the control tower.

Feature Summary

  • 🌐 Distributed Mode: Run LocalAI as a cluster — smart routing, node groups, drain/resume, min/max autoscaling.
  • 🔐 Users & Auth: Built-in user management with OIDC, invite mode, API keys, and admin impersonation.
  • 📊 Quota System: Per-user usage quotas with predictive analytics and breakdown dashboards.
  • 🧪 Fine-Tuning (experimental): Fine-tune models with TRL, auto-export to GGUF, and import back — all from the UI.
  • ⚗️ Quantization (experimental): New backend for on-the-fly model quantization.
  • 🔧 Pipeline Editor: Visual model pipeline editor in the React UI.
  • 🤖 Standalone Agents: Run agents from the CLI with local-ai agent run.
  • 🧠 Smart Inferencing: Auto inference defaults from Unsloth, tool parsing fallback, and min_p support.
  • 🎬 Media History: Browse past generated images and media in Studio pages.

New: full setup walkthrough (long version): https://www.youtube.com/watch?v=cMVNnlqwfw4

🚀 Key Features

🌐 Distributed Mode: scaling LocalAI horizontally

Run LocalAI as a distributed cluster and let it figure out where to send your requests. No more single-node bottlenecks.

  • Smart Routing: Requests are routed to nodes ordered by available VRAM — the beefiest, free GPU gets the job.
  • Node Groups: Pin models to specific node groups for workload isolation (e.g., "gpu-heavy" vs "cpu-light").
  • Autoscaling: Built-in min/max autoscaler with a node reconciler that manages the lifecycle automatically.
  • Drain & Resume: Gracefully drain nodes for maintenance and bring them back with a single API call.
  • Cluster Dashboard: See your entire cluster status at a glance from the home page.
  • Smart Model Transfer: Transfer models via S3 or peer-to-peer.
distributed-mode.mp4
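The routing rule above (order nodes by available VRAM, respect node groups, skip draining nodes) can be sketched in a few lines. This is an illustrative Python sketch of the idea, not LocalAI's actual scheduler; the type and function names are hypothetical:

```python
# Illustrative sketch (not LocalAI's implementation): route each request
# to the eligible node with the most free VRAM.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    group: str
    free_vram_mb: int
    draining: bool = False

def pick_node(nodes, group=None):
    """Return the eligible node with the most free VRAM, or None."""
    eligible = [
        n for n in nodes
        if not n.draining and (group is None or n.group == group)
    ]
    # "The beefiest, free GPU gets the job": take the max by available VRAM.
    return max(eligible, key=lambda n: n.free_vram_mb, default=None)

nodes = [
    Node("a", "gpu-heavy", 8_000),
    Node("b", "gpu-heavy", 24_000),
    Node("c", "cpu-light", 0),
]
assert pick_node(nodes).name == "b"
assert pick_node(nodes, group="cpu-light").name == "c"
```

Pinning a model to a node group then reduces to passing `group` when selecting a node, and draining a node simply removes it from the eligible set.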

🔐 Users, Authentication & Quotas

LocalAI now ships with a complete multi-user platform — perfect for teams, classrooms, or any shared deployment.

  • User Management: Create, edit, and manage users from the React UI.
  • OIDC/OAuth: Plug in your identity provider for SSO — Google, Keycloak, Authentik, you name it.
  • Invite Mode: Restrict registration to invite-only with admin approval.
  • API Keys: Per-user API key management.
  • Admin Powers: Admins can impersonate users for debugging.
  • Quota System: Set per-user usage quotas and enforce limits.
  • Usage Analytics: Predictive usage dashboard with per-user breakdown statistics.
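As a rough illustration of how per-user quota enforcement like this works, here is a minimal Python sketch; the class, field, and unit choices are hypothetical and not LocalAI's API:

```python
# Hypothetical sketch of per-user quota enforcement; names are
# illustrative, not taken from LocalAI.
from collections import defaultdict

class QuotaTracker:
    def __init__(self, limits):
        self.limits = dict(limits)    # user -> max tokens per period
        self.used = defaultdict(int)  # user -> tokens consumed so far

    def allow(self, user, tokens):
        """Record usage and return True only if the request fits the quota."""
        limit = self.limits.get(user)
        if limit is not None and self.used[user] + tokens > limit:
            return False  # over quota: reject without recording usage
        self.used[user] += tokens
        return True

tracker = QuotaTracker({"alice": 1000})
assert tracker.allow("alice", 600)
assert not tracker.allow("alice", 500)  # would exceed the 1000-token limit
assert tracker.allow("bob", 10_000)     # no limit configured for bob
```

The per-user breakdown dashboards mentioned above would then just aggregate the `used` counters over time.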

Users and quota:

usersquota-1775167475876.mp4

Usage metrics per user:

usage.mp4

🧪 Fine-Tuning & Quantization

No more juggling external tools. Fine-tune and quantize directly inside LocalAI.

  • Fine-Tuning with TRL (Experimental): Train LoRA adapters with Hugging Face TRL, auto-export to GGUF, and import the result straight back into LocalAI. Includes a built-in evals framework to validate your work.
  • Quantization Backend: Spin up the new quantization backend to create optimized model variants on-the-fly.
quantize-fine-tune.mp4

🎨 UI

The React UI keeps getting better. This release adds serious power-user features:

  • Model Pipeline Editor: Visually wire up model pipelines — no YAML editing required.
  • Per-Model Backend Logs: Drill into logs scoped to individual models for laser-focused debugging.
  • Media History: Studio pages now remember your past generations — images, audio, and more.
  • Searchable Model/Backend Selector: Quickly find models and backends with inline search and filtering.
  • Structured Error Toasts: Errors now link directly to traces — one click from "something broke" to "here's why."
  • Tracing Settings: Inline tracing config restored with a cleaner UI.
talk.mp4

🤖 Agents & Inference

  • Standalone Agent Mode: Run agents straight from the terminal with local-ai agent run. Supports single-turn --prompt mode and pool-based configurations from pool.json.
  • Streaming Tool Calls: Agent mode tool calls now stream in real-time, with interleaved thinking fixed.
  • Inferencing Defaults: Automatic inference parameters sourced from Unsloth and applied to all endpoints and gallery models, so your models just work better out of the box.
  • Tool Parsing Fallback: When native tool call parsing fails, an iterative fallback parser kicks in automatically.
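The fallback idea (try native parsing first, then iteratively scan the raw output for something parseable) can be sketched as follows. This assumes tool calls appear as JSON objects embedded in model output; the function is illustrative, not LocalAI's actual parser:

```python
# Illustrative iterative fallback parser for tool calls embedded in
# free-form model output (hypothetical, not LocalAI's implementation).
import json

def parse_tool_call(text):
    """Try strict JSON first; fall back to scanning for an embedded object."""
    try:
        return json.loads(text)  # well-formed, native tool call
    except json.JSONDecodeError:
        pass
    # Fallback: for each candidate "{" start, try progressively shorter
    # slices until one decodes to an object that looks like a tool call.
    start = text.find("{")
    while start != -1:
        for end in range(len(text), start, -1):
            try:
                obj = json.loads(text[start:end])
            except json.JSONDecodeError:
                continue
            if isinstance(obj, dict) and "name" in obj:
                return obj
        start = text.find("{", start + 1)
    return None

messy = 'Sure! Calling the tool: {"name": "search", "arguments": {"q": "x"}} done.'
call = parse_tool_call(messy)
assert call["name"] == "search"
```

A real parser would bound the scan and stream partial results, but the recover-from-surrounding-prose shape is the same.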

🛠️ Under the Hood

  • Repeated Log Merging: Noisy terminals? Repeated log lines are now collapsed automatically.
  • Jetson/Tegra GPU Detection: First-class NVIDIA Jetson/Tegra platform detection.
  • Intel SYCL Fix: Auto-disables mmap for SYCL backends to prevent crashes.
  • llama.cpp Portability: Bundled libdl, librt, libpthread for improved cross-platform support.
  • HF_ENDPOINT Mirror: Downloader now rewrites HuggingFace URIs with HF_ENDPOINT for corporate/mirror setups.
  • Transformers >5.0: Bumped to HuggingFace Transformers >5.0 with generic model loading.
  • API Improvements: Proper 404s for missing models, unescaped model names, unified inferencing paths with automatic retry on transient errors.
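The HF_ENDPOINT rewrite mentioned above amounts to swapping the huggingface.co host for the mirror's. A minimal Python sketch of that behaviour, approximated from the release note rather than taken from LocalAI's downloader:

```python
# Sketch of an HF_ENDPOINT-style rewrite: replace the huggingface.co host
# in a download URI with a configured mirror (approximated behaviour).
import os
from urllib.parse import urlsplit, urlunsplit

def rewrite_hf_uri(uri, endpoint=None):
    endpoint = endpoint or os.environ.get("HF_ENDPOINT")
    if not endpoint:
        return uri  # no mirror configured: leave the URI untouched
    parts = urlsplit(uri)
    if parts.netloc != "huggingface.co":
        return uri  # only HuggingFace URIs are rewritten
    mirror = urlsplit(endpoint)
    return urlunsplit((mirror.scheme, mirror.netloc, parts.path,
                       parts.query, parts.fragment))

src = "https://huggingface.co/org/model/resolve/main/model.gguf"
assert rewrite_hf_uri(src, "https://hf-mirror.example.com") == \
    "https://hf-mirror.example.com/org/model/resolve/main/model.gguf"
assert rewrite_hf_uri("https://example.com/x", "https://hf-mirror.example.com") == \
    "https://example.com/x"
```

This is what makes corporate mirrors work transparently: the model path stays the same and only the host (and scheme) change.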

🐞 Fixes & Improvements

  • Embeddings: Implemented encoding_format=base64 for the embeddings endpoint.
  • Kokoro TTS: Fixed phonemization model not downloading during installation.
  • Realtime API: Fixed Opus codec backend selection alias in development mode.
  • Gallery Filtering: Fixed exact tag matching for model gallery filters.
  • Open Responses: Fixed required ORItemParam.Arguments field being omitted; ORItemParam.Summary now always populated.
  • Tracing: Fixed settings not loading from runtime_settings.json.
  • UI: Fixed watchdog field mapping, model list refresh on deletion, backend display in model config, MCP button ordering.
  • Downloads: Fixed directory removal during fallback attempts; improved retry logic.
  • Model Paths: Fixed baseDir assignment to use ModelPath correctly.
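For the embeddings fix above, a client requesting encoding_format=base64 receives each float32 vector as a base64 string instead of a JSON array, and decodes it like this (a generic sketch of the OpenAI-style convention, not LocalAI-specific code):

```python
# Decode a base64-encoded embedding (OpenAI-style convention): the raw
# bytes are little-endian float32 values packed back to back.
import base64
import struct

def decode_embedding(b64: str) -> list[float]:
    raw = base64.b64decode(b64)
    # Each float32 is 4 bytes, little-endian.
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))

# Round-trip demonstration with a vector exactly representable in float32.
vec = [0.5, -1.0, 2.0]
encoded = base64.b64encode(struct.pack(f"<{len(vec)}f", *vec)).decode()
assert decode_embedding(encoded) == vec
```

Base64 transport is noticeably more compact than a JSON float array for large embedding dimensions, which is why clients commonly request it.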

❤️ Thank You

LocalAI is a community-powered FOSS movement. Every star, every PR, every bug report matters.

If you believe in privacy-first, self-hosted AI:

  • Star the repo — it helps more than you think
  • 🛠️ Contribute code, docs, or feedback
  • 📣 Share with your team, your community, your world

Let's keep building the future of open AI — together. 💪


✅ Full Changelog


What's Changed

Bug fixes 🐛

  • fix: Change baseDir assignment to use ModelPath by @mudler in #9010
  • fix(ui): correctly map watchdog fields by @mudler in #9022
  • fix(api): unescape model names by @mudler in #9024
  • fix(ui): Add tracing inline settings back and create UI tests by @richiejp in #9027
  • Always populate ORItemParam.Summary by @tv42 in #9049
  • fix(ui): correctly display backend if specified in the model config, re-order MCP buttons by @mudler in #9053
  • fix(ui): Refresh model list on deletion by @richiejp in #9059
  • fix(openresponses): do not omit required field ORItemParam.Arguments by @tv42 in #9074
  • fix: Add tracing settings loading from runtime_settings.json by @localai-bot in #9081
  • fix: use exact tag matching for model gallery tag filtering by @majiayu000 in #9041
  • fix(realtime): Set the alias for opus so the development backend can be selected by @richiejp in #9083
  • fix(llama.cpp): bundle libdl, librt, libpthread in llama-cpp backend by @mudler in #9099
  • fix(download): do not remove dst dir until we try all fallbacks by @mudler in #9100
  • fix(auth): do not allow to register in invite mode by @mudler in #9101
  • fix(downloader): Rewrite full https HF URI with HF_ENDPOINT by @richiejp in #9107
  • fix: implement encoding_format=base64 for embeddings endpoint by @walcz-de in #9135
  • fix(coqui,nemo,voxcpm): Add dependencies to allow CI to progress by @richiejp in #9142
  • fix(voxcpm): Force using a recent voxcpm version to kick the dependency solver by @richiejp in #9150
  • fix: huggingface repo change the file name so Update index.yaml is needed by @ER-EPR in #9163
  • fix(kokoro): Download phonemization model during installation by @richiejp in #9165...

v4.0.0

14 Mar 18:18



🎉 LocalAI 4.0.0 Release! 🚀




LocalAI 4.0.0 is out!

This major release transforms LocalAI into a complete AI orchestration platform. We’ve embedded agentic and hybrid search capabilities directly into the core, completely overhauled the user interface with React for a modern experience, and are thrilled to introduce Agenthub, a brand-new community hub to easily share and import agents. Alongside these massive updates, we've introduced powerful new features like Canvas mode for code artifacts, MCP apps, and full MCP client-side support.

Feature Summary

  • Agentic Orchestration & Agenthub: Native agent management with memory, skills, and the new Agenthub for community sharing.
  • Revamped React UI: Complete frontend rewrite for lightning-fast performance and modern UX.
  • Canvas Mode: Preview code blocks and artifacts side-by-side in the chat interface.
  • MCP Client-Side: Full Model Context Protocol support, MCP Apps, and tool streaming in chat.
  • WebRTC Realtime: WebRTC support for low-latency realtime audio conversations.
  • New Backends: Added experimental MLX Distributed, fish-speech, ace-step.cpp, and faster-qwen3-tts.
  • Infrastructure: Podman documentation, shell completion, and persistent data path separation.

🚀 Key Features

🤖 Native Agentic Orchestration & Agenthub

LocalAI now includes agentic capabilities embedded directly in the core. You can manage, import, start, and stop agents via the new UI.

  • 🌐 Agenthub: We are launching Agenthub! This is a centralized community space to share common agents and import them effortlessly into your LocalAI instance.
  • Agent Management: Full lifecycle management via the React UI. Create Agents, connect them to Slack, configure MCP servers and skills.
  • Skills Management: Centralized skill database for AI agents.
  • Memory: Agents can utilize memory with Hybrid search (PostgreSQL) or embedded in-memory storage (Chromem).
  • Observability: New "Events" column in the Agents list to track agent events and status.
  • 📚 Documentation: Dive into the new capabilities in our official Agents documentation.
agents.mp4

🎨 Revamped UI & Canvas Mode

The Web interface has been completely migrated to React, bringing a smoother experience and powerful new capabilities:

  • Canvas Mode: Enable "canvas mode" in the chat to see code blocks and artifacts generated by the LLM in a dedicated preview bar on the right.
  • System View: Tabbed navigation separating Models and Backends for better organization.
  • Model Size Warnings: Visual warnings when model storage exceeds system RAM to prevent lockups.
  • Traces: Improved trace display using accordions for better readability.
model-fit-canvas-mode.mp4
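The model-size warning above boils down to a single comparison between a model's on-disk size and available system RAM. A minimal sketch with illustrative names (the threshold logic is an assumption, not LocalAI's exact rule):

```python
# Illustrative check behind a "model larger than system RAM" warning.
def model_fit_warning(model_bytes: int, system_ram_bytes: int):
    """Return a warning string if the model cannot fit in RAM, else None."""
    if model_bytes > system_ram_bytes:
        return "model is larger than system RAM; loading may lock up the host"
    return None

GIB = 1024 ** 3
assert model_fit_warning(40 * GIB, 32 * GIB) is not None
assert model_fit_warning(8 * GIB, 32 * GIB) is None
```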

🔌 MCP Apps & Client-Side Support

We’ve expanded support for the Model Context Protocol (MCP):

  • MCP Apps: Select which servers to enable for the chat directly from the UI.
  • Tool Streaming: Tools from MCP servers are automatically injected into the standard chat interface.
  • Client-Side Support: Full client-side integration for MCP tools and streaming.
  • Disable Option: Set the LOCALAI_DISABLE_MCP environment variable to completely disable MCP support for security.
mcp apps

🎵 New Backends, Audio & Video Enhancements

  • MLX Distributed (Experimental): We've added an experimental backend for running distributed workloads using Apple's MLX framework! Check out the docs here.
  • New Audio Backends: Introduced fish-speech, ace-step.cpp, and faster-qwen3-tts (CUDA-only).
  • WebRTC Realtime: WebRTC support added to the Realtime API and Talk page for better low-latency audio handling.
  • TTS Improvements: Added sample_rate support via post-processing and multi-voice support for Qwen TTS.
  • Video Generation: Fixed model selection dropdown sync and added vllm-omni backend detection.

🛠️ Infrastructure & Developer Experience

  • Data Separation: New --data-path CLI flag and LOCALAI_DATA_PATH env var to separate persistent data (agents, skills) from configuration.
  • Shell Completion: Dynamic completion scripts for bash, zsh, and fish.
  • Podman Support: Dedicated documentation for Podman installation and rootless configuration.
  • Gallery & Models: Model storage size display with RAM warnings, and fallback URI resolution for backend installation failures.
  • Deprecations: HuggingFace backend support removed, and AIO images dropped to focus on main images.

🐞 Fixes & Improvements

  • Logging: Fixed watchdog spamming logs when no interval was configured; downgraded health check logs to debug.
  • CUDA Detection: Improved GPU vendor checks to prevent false CUDA detection on CPU-only hosts with runtime libs.
  • Compatibility: Renamed json_verbose to verbose_json for OpenAI spec compliance (fixes Nextcloud integration).
  • Embedding: Fixed embedding dimension truncation to return full native dimensions.
  • Permissions: Changed model install file permissions to 0644 to ensure server readability.
  • Windows Docker: Added named volumes to Docker Compose files for Windows compatibility.
  • Model Reload: Models now reload automatically after editing YAML config (e.g., context_size).
  • Chat: Fixed issue where thinking/reasoning blocks were sent to the LLM.
  • Audio: Fixed img2img pipeline in diffusers backend and Qwen TTS duplicate argument error.

Known issues

  • The diffusers backend currently fails to build (due to CI limit exhaustion) and is not part of this release (the previous version is still available). We are looking into it, but if you want to help and know someone at GitHub who could support us with better ARM runners, please reach out!

❤️ Thank You

LocalAI is a true FOSS movement — built by contributors, powered by community.

If you believe in privacy-first AI:

  • Star the repo
  • 💬 Contribute code, docs, or feedback
  • 📣 Share with others

Your support keeps this stack alive.


✅ Full Changelog


What's Changed

Breaking Changes 🛠

Bug fixes 🐛

  • fix(cli): Fix watchdog running constantly and spamming logs by @nanoandrew4 in #8624
  • fix(api): Downgrade health/readiness check to debug by @nanoandrew4 in #8625
  • fix: rename json_verbose to verbose_json by @lukasdotcom in #8627
  • fix(chatterbox): add support for cuda13/aarch64 by @mudler in #8653
  • fix: reload model after editing YAML config (issue #8647) by @localai-bot in #8652
  • fix(chat): do not send thinking/reasoning messages to the LLM by @mudler in #8656
  • fix: change file permissions from 0600 to 0644 in InstallModel by @localai-bot in #8657
  • fix: Add named volumes for Windows Docker compatibility by @localai-bot in #8661
  • fix(gallery): add fallback URI resolution for backend installation by @localai-bot in #8663
  • fix: whisper breaking on cuda-13 (use absolute path for CUDA directory detection) by @localai-bot in #8678
  • fix(gallery): clean up partially downloaded backend on installation failure by @localai-bot in #8679
  • fix: properly sync model selection dropdown in video generation UI by @localai-bot in #8680
  • fix: allow reranking models configured with known_usecases by @localai-bot in #8681
  • fix: return full embedding dimensions instead of truncating trailing zeros (#8721) by @localai-bot in #8755
  • fix: Add vllm-omni backend to video generation model detection (#8659) by @localai-bot in #8781
  • fix(qwen-tts): duplicate instruct argument in voice design mode by @Weathercold in #8842
  • Fix image upload processing and img2img pipeline in diffusers backend by @attilagyorffy in #8879
  • fix: gate CUDA directory checks on GPU vendor to prevent false CUDA detection by @sozercan in #8942
  • fix(llama-cpp): Set enable_thinking in the correct place by @richiejp in #8973

Exciting New Features 🎉

  • feat(traces): Use accordian instead of pop-ups by @richiejp in #8626
  • chore: remove install.sh script and documentation references by @localai-bot in #8643
  • docs: add Podman installation documentation by @localai-bot in htt...

v3.12.1

21 Feb 13:49
fcecc12


This is a patch release to tag the new llama.cpp version, which fixes incompatibilities with Qwen 3 Coder.

What's Changed

Other Changes

  • docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #8611
  • feat(traces): Add backend traces by @richiejp in #8609
  • chore: ⬆️ Update ggml-org/llama.cpp to b908baf1825b1a89afef87b09e22c32af2ca6548 by @localai-bot in #8612
  • chore: drop bark.cpp leftovers from pipelines by @mudler in #8614
  • fix: merge openresponses messages by @mudler in #8615
  • chore: ⬆️ Update ggml-org/llama.cpp to ba3b9c8844aca35ecb40d31886686326f22d2214 by @localai-bot in #8613

Full Changelog: v3.12.0...v3.12.1

v3.12.0

20 Feb 18:16


🎉 LocalAI 3.12.0 Release! 🚀




LocalAI 3.12.0 is out!

Feature Summary

  • Multi-modal Realtime: Send text, images, and audio in real-time conversations for richer interactions.
  • Voxtral Backend: New high-quality text-to-speech backend added.
  • Multi-GPU Support: Improved Diffusers performance with multiple GPUs.
  • Legacy CPU Optimization: Enhanced compatibility for older processors.
  • UI Theme & Layout: Improved UI theme (dark/light variants) and navigation.
  • Realtime Stability: Multiple fixes for audio, image, and model handling.
  • Logging Improvements: Reduced excessive logs and optimized processing.

Local Stack Family

Liking LocalAI? LocalAI is part of an integrated suite of AI infrastructure tools; you might also like:

  • LocalAGI - AI agent orchestration platform with OpenAI Responses API compatibility and advanced agentic capabilities
  • LocalRecall - MCP/REST API knowledge base system providing persistent memory and storage for AI agents
  • 🆕 Cogito - Go library for building intelligent, co-operative agentic software and LLM-powered workflows, focused on improving results for small, open-source language models while scaling to any LLM. Powers LocalAGI and LocalAI's MCP/agentic capabilities
  • 🆕 Wiz - Terminal-based AI agent accessible via Ctrl+Space keybinding. Portable, local-LLM friendly shell assistant with TUI/CLI modes, tool execution with approval, MCP protocol support, and multi-shell compatibility (zsh, bash, fish)
  • 🆕 SkillServer - Simple, centralized skills database for AI agents via MCP. Manages skills as Markdown files with MCP server integration, web UI for editing, Git synchronization, and full-text search capabilities

❤️ Thank You

LocalAI is a true FOSS movement — built by contributors, powered by community.

If you believe in privacy-first AI:

  • Star the repo
  • 💬 Contribute code, docs, or feedback
  • 📣 Share with others

Your support keeps this stack alive.


✅ Full Changelog


What's Changed

Bug fixes 🐛

  • security: validate URLs to prevent SSRF in content fetching endpoints by @kolega-ai-dev in #8476
  • fix(realtime): Use user provided voice and allow pipeline models to have no backend by @richiejp in #8415
  • fix(realtime): Sampling and websocket locking by @richiejp in #8521
  • fix(realtime): Send proper image data to backend by @richiejp in #8547
  • fix: prevent excessive logging in capability detection by @localai-bot in #8552
  • fix(voxcpm): pin setuptools by @mudler in #8556
  • fix(llama-cpp): populate tensor_buft_override buffer so llama-cpp properly performs fit calculations by @cvpcs in #8560
  • fix: pin neutts-air to known working commit by @localai-bot in #8566
  • fix: improve watchdown logics by @mudler in #8591
  • fix(llama-cpp): Pass parameters when using embedded template by @richiejp in #8590
  • fix(realtime): Better support for thinking models and setting model parameters by @richiejp in #8595
  • fix(realtime): Limit buffer sizes to prevent DoS by @richiejp in #8596
  • fix(ui): improve view on mobile by @mudler in #8598
  • fix(diffusers): sd_embed is not always available by @mudler in #8602
  • fix: do not keep track model if not existing by @mudler in #8603

Exciting New Features 🎉

  • feat(stablediffusion-ggml): Improve legacy CPU support for stablediffusion-ggml backend by @cvpcs in #8461
  • feat(voxtral): add voxtral backend by @mudler in #8451
  • feat(diffusers): add experimental support for sd_embed-style prompt embedding by @cvpcs in #8504
  • chore: improve log levels verbosity by @localai-bot in #8528
  • feat(realtime): Allow sending text, image and audio conversation items" by @richiejp in #8524
  • chore: compute capabilities once by @mudler in #8555
  • feat(ui): left navbar, dark/light theme by @mudler in #8594
  • fix: multi-GPU support for Diffusers (Issue #8575) by @localai-bot in #8605

🧠 Models

  • chore(model gallery): Add Ministral 3 family of models (aside from base versions) by @rampa3 in #8467
  • chore(model gallery): add voxtral (which is only available in development) by @mudler in #8532
  • chore(model gallery): Add npc-llm-3-8b by @rampa3 in #8498
  • chore(model gallery): add nemo-asr by @mudler in #8533
  • chore(model gallery): add voxcpm, whisperx, moonshine-tiny by @mudler in #8534
  • chore(model gallery): add neutts by @mudler in #8535
  • chore(model gallery): add vllm-omni models by @mudler in #8536
  • chore(model-gallery): ⬆️ update checksum by @localai-bot in #8540
  • feat(gallery): Add nanbeige4.1-3b by @richiejp in #8551
  • chore(model-gallery): ⬆️ update checksum by @localai-bot in #8593
  • chore(model-gallery): ⬆️ update checksum by @localai-bot in #8600

👒 Dependencies

  • chore(deps): bump github.com/anthropics/anthropic-sdk-go from 1.20.0 to 1.22.0 by @dependabot[bot] in #8482
  • chore(deps): bump github.com/jaypipes/ghw from 0.21.2 to 0.22.0 by @dependabot[bot] in #8484
  • chore(deps): bump github.com/onsi/ginkgo/v2 from 2.28.0 to 2.28.1 by @dependabot[bot] in #8483
  • chore(deps): bump github.com/alecthomas/kong from 1.13.0 to 1.14.0 by @dependabot[bot] in #8481
  • chore(deps): bump github.com/openai/openai-go/v3 from 3.17.0 to 3.19.0 by @dependabot[bot] in #8485
  • chore: bump cogito by @mudler in #8568
  • fix(gallery): Use YAML v3 to avoid merging maps with incompatible keys by @richiejp in #8580
  • chore(deps): bump google.golang.org/grpc from 1.78.0 to 1.79.1 by @dependabot[bot] in #8583
  • chore(deps): bump github.com/jaypipes/ghw from 0.22.0 to 0.23.0 by @dependabot[bot] in #8587
  • chore(deps): bump github.com/modelcontextprotocol/go-sdk from 1.2.0 to 1.3.0 by @dependabot[bot] in #8585
  • chore(deps): bump cogito and add new options to the agent config by @mudler in #8601

Other Changes

  • docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #8462
  • docs: update model gallery documentation to reference main repository by @veeceey in #8452
  • chore: ⬆️ Update ggml-org/whisper.cpp to 4b23ff249e7f93137cb870b28fb27818e074c255 by @localai-bot in #8463
  • chore: ⬆️ Update ggml-org/llama.cpp to e06088da0fa86aa444409f38dff274904931c507 by @localai-bot in #8464
  • chore: ⬆️ Update antirez/voxtral.c to c9e8773a2042d67c637fc492c8a655c485354080 by @localai-bot in #8477
  • chore: ⬆️ Update ggml-org/llama.cpp to 262364e31d1da43596fe84244fba44e94a0de64e by @localai-bot in #8479
  • chore: ⬆️ Update ggml-org/whisper.cpp to 764482c3175d9c3bc6089c1ec84df7d1b9537d83 by @localai-bot in #8478
  • chore: ⬆️ Update ggml-org/llama.cpp to 57487a64c88c152ac72f3aea09bd1cc491b2f61e by @localai-bot in #8499
  • chore: ⬆️ Update ggml-org/llama.cpp to 4d3daf80f8834e0eb5148efc7610513f1e263653 by @localai-bot in #8513
  • chore: ⬆️ Update ggml-org/llama.cpp to 338085c69e486b7155e5b03d7b5087e02c0e2528 by @localai-bot in #8538
  • fix: update moonshine API, add setuptools to voxcpm requirements by @mudler in #8541
  • chore: ⬆️ Update ggml-org/llama.cpp to 05a6f0e8946914918758db767f6eb04bc1e38507 by @localai-bot in #8553
  • chore: ⬆️ Update ggml-org/llama.cpp to 01d8eaa28d57bfc6d06e30072085ed0ef12e06c5 by @localai-bot in #8567
  • chore: ⬆️ Update...

v3.11.0

07 Feb 21:31
944874d


🎉 LocalAI 3.11.0 Release! 🚀




LocalAI 3.11.0 is a massive update for Audio and Multimodal capabilities.

We are introducing Realtime Audio Conversations, a dedicated Music Generation UI, and a massive expansion of ASR (Speech-to-Text) and TTS backends. Whether you want to talk to your AI, clone voices, transcribe with speaker identification, or generate songs, this release has you covered.

Check out the highlights below!


📌 TL;DR

Feature Summary

  • Realtime Audio: Native support for audio conversations, enabling fluid voice interactions similar to OpenAI's Realtime API. See the documentation for details.
  • Music Generation UI: New UI for MusicGen (Ace-Step), allowing you to generate music from text prompts directly in the browser.
  • New ASR Backends: Added WhisperX (with Speaker Diarization), VibeVoice, Qwen-ASR, and Nvidia NeMo.
  • TTS Streaming: Text-to-Speech now supports streaming mode for lower-latency responses (VoxCPM only for now).
  • vLLM Omni: Added support for vLLM Omni, expanding our high-performance inference capabilities.
  • Speaker Diarization: Native support for identifying different speakers in transcriptions via WhisperX.
  • Hardware Expansion: Expanded build support for CUDA 12/13, L4T (Jetson), and SBSA, plus better Metal (Apple Silicon) integration with MLX backends.
  • Breaking Changes: ExLlama (deprecated) and Bark (unmaintained) backends have been removed.

🚀 New Features & Major Enhancements

🎙️ Realtime Audio Conversations

LocalAI 3.11.0 introduces native support for Realtime Audio Conversations.

  • Enables fluid, low-latency voice interaction with agents.
  • Logic handled directly within the LocalAI pipeline for seamless audio-in/audio-out workflows.
  • Support for STT/TTS and voice-to-voice models (experimental).
  • Support for tool calls.

🗣️ Talk to your LocalAI: This brings us one step closer to a fully local, voice-native assistant experience compatible with standard client implementations.

Check here for detailed documentation.


🎵 Music Generation UI & Ace-Step

We have added a dedicated interface for music generation!

  • New Backend: Support for Ace-Step (MusicGen) via the ace-step backend.
  • Web UI Integration: Generate musical clips directly from the LocalAI Web UI.
  • Simple text-to-music workflow (e.g., "Lo-fi hip hop beat for studying").
Screenshot 2026-02-07 at 23-32-00 LocalAI - Generate sound with ace-step-turbo

🎧 Massive ASR (Speech-to-Text) Expansion

This release significantly broadens our transcription capabilities with four new backends:

  1. WhisperX: Provides fast transcription with Speaker Diarization (identifying who is speaking).
  2. VibeVoice: Now also supports ASR alongside TTS.
  3. Qwen-ASR: Support for Qwen's powerful speech recognition models.
  4. Nvidia NeMo: Initial support for NeMo ASR.

🗣️ TTS Streaming & New Voices

Text-to-Speech gets a speed boost and new options:

  • Streaming Support: TTS endpoints now support streaming, reducing the "time-to-first-audio" significantly.
  • VoxCPM: Added support for the VoxCPM backend.
  • Qwen-TTS: Added support for Qwen-TTS models
  • Piper Voices: Added most remaining Piper voices from Hugging Face to the gallery.

🛠️ Hardware & Backend Updates

  • vLLM Omni: A new backend integration for vLLM Omni models.
  • Extended Platform Support: Major work on MLX to improve compatibility across CUDA 12, CUDA 13, L4T (Nvidia Jetson), SBSA, and macOS Metal.
  • GGUF Cleanup: Dropped redundant VRAM estimation logic for GGUF loading, relying on more accurate internal measurements.

⚠️ Breaking Changes

To keep the project lean and maintainable, we have removed some older backends:

  • ExLlama: Removed (deprecated in favor of newer loaders like ExLlamaV2 or llama.cpp).
  • Bark: Removed (the upstream project is unmaintained; we recommend using the new TTS alternatives).

🚀 The Complete Local Stack for Privacy-First AI


LocalAI

The free, Open Source OpenAI alternative. Drop-in replacement REST API compatible with OpenAI specifications for local AI inferencing. No GPU required.

Link: https://github.com/mudler/LocalAI

LocalAGI Logo

LocalAGI

Local AI agent management platform. Drop-in replacement for OpenAI's Responses API, supercharged with advanced agentic capabilities and a no-code UI.

Link: https://github.com/mudler/LocalAGI

LocalRecall Logo

LocalRecall

RESTful API and knowledge base management system providing persistent memory and storage capabilities for AI agents. Works alongside LocalAI and LocalAGI.

Link: https://github.com/mudler/LocalRecall


❤️ Thank You

LocalAI is a true FOSS movement — built by contributors, powered by community.

If you believe in privacy-first AI:

  • Star the repo
  • 💬 Contribute code, docs, or feedback
  • 📣 Share with others

Your support keeps this stack alive.


✅ Full Changelog

📋 Click to expand full changelog

What's Changed

Breaking Changes 🛠

  • chore(exllama): drop backend now almost deprecated by @mudler in #8186

Bug fixes 🐛

  • fix(ui): correctly display selected image model by @dedyf5 in #8208
  • fix(ui): take account of reasoning in token count calculation by @mudler in #8324
  • fix: drop gguf VRAM estimation (now redundant) by @mudler in #8325
  • fix(api): Add missing field in initial OpenAI streaming response by @acon96 in #8341
  • fix(realtime): Include noAction function in prompt template and handle tool_choice by @richiejp in #8372
  • fix: filter GGUF and GGML files from model list by @Yaroslav98214 in #8397
  • fix(qwen-asr): Remove contagious slop (DEFAULT_GOAL) from Makefile by @richiejp in #8431

Exciting New Features 🎉

  • feat(vllm-omni): add new backend by @mudler in #8188
  • feat(vibevoice): add ASR support by @mudler in #8222
  • feat: add VoxCPM tts backend by @mudler in #8109
  • feat(realtime): Add audio conversations by @richiejp in #6245
  • feat(qwen-asr): add support to qwen-asr by @mudler in #8281
  • feat(tts): add support for streaming mode by @mudler in #8291
  • feat(api): Add transcribe response format request parameter & adjust STT backends by @nanoandrew4 in #8318
  • feat(whisperx): add whisperx backend for transcription with speaker diarization by @eureka928 in #8299
  • feat(mlx): Add support for CUDA12, CUDA13, L4T, SBSA and CPU by @mudler in #8380
  • feat(musicgen): add ace-step and UI interface by @mudler in #8396
  • fix(api)!: Stop model prior to deletion by @nanoandrew4 in #8422
  • feat(nemo): add Nemo (only asr for now) backend by @mudler in #8436

🧠 Models

  • chore(model gallery): add qwen3-tts to model gallery by @mudler in #8187
  • chore(model gallery): Add most of not yet present Piper voices from Hugging Face by @rampa3 in #8202
  • chore: drop bark which is unmaintained by @mudler in #8207
  • chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #8220
  • chore(model gallery): Add entry for Mistral Small 3.1 with mmproj by @rampa3 in https://git...
Read more

v3.10.1

23 Jan 14:21
923ebbb

Choose a tag to compare

This is a small patch release providing bug fixes and minor polish. It also adds support for Qwen3-TTS, which was released just yesterday.

  • Fix reasoning detection on reasoning and instruct models
  • Support reasoning blocks with openresponses
  • API fixes to correctly run LTX-2
  • Support Qwen3-TTS!

What's Changed

Bug fixes 🐛

  • fix(reasoning): support models with reasoning without starting thinking tag by @mudler in #8132
  • fix(tracing): Create trace buffer on first request to enable tracing at runtime by @richiejp in #8148
  • fix(videogen): drop incomplete endpoint, add GGUF support for LTX-2 by @mudler in #8160

Exciting New Features 🎉

  • feat(openresponses): Support reasoning blocks by @mudler in #8133
  • feat: detect thinking support from backend automatically if not explicitly set by @mudler in #8167
  • feat(qwen-tts): add Qwen-tts backend by @mudler in #8163

🧠 Models

  • chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #8128
  • chore(model gallery): add flux 2 and flux 2 klein by @mudler in #8141
  • chore(model-gallery): ⬆️ update checksum by @localai-bot in #8153
  • chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #8157
  • chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #8170

👒 Dependencies

  • chore(deps): bump github.com/mudler/cogito from 0.7.2 to 0.8.1 by @dependabot[bot] in #8124

Other Changes

  • feat(swagger): update swagger by @localai-bot in #8098
  • chore: ⬆️ Update ggml-org/llama.cpp to 287a33017b32600bfc0e81feeb0ad6e81e0dd484 by @localai-bot in #8100
  • chore: ⬆️ Update leejet/stable-diffusion.cpp to 2efd19978dd4164e387bf226025c9666b6ef35e2 by @localai-bot in #8099
  • docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #8120
  • chore: ⬆️ Update leejet/stable-diffusion.cpp to a48b4a3ade9972faf0adcad47e51c6fc03f0e46d by @localai-bot in #8121
  • chore: ⬆️ Update ggml-org/llama.cpp to 959ecf7f234dc0bc0cd6829b25cb0ee1481aa78a by @localai-bot in #8122
  • chore(deps): Bump llama.cpp to '1c7cf94b22a9dc6b1d32422f72a627787a4783a3' by @mudler in #8136
  • chore: drop noisy logs by @mudler in #8142
  • chore: ⬆️ Update ggml-org/llama.cpp to ad8d85bd94cc86e89d23407bdebf98f2e6510c61 by @localai-bot in #8145
  • chore: ⬆️ Update ggml-org/whisper.cpp to 7aa8818647303b567c3a21fe4220b2681988e220 by @localai-bot in #8146
  • feat(swagger): update swagger by @localai-bot in #8150
  • chore(diffusers): add 'av' to requirements.txt by @mudler in #8155
  • chore: ⬆️ Update leejet/stable-diffusion.cpp to 329571131d62d64a4f49e1acbef49ae02544fdcd by @localai-bot in #8152
  • chore: ⬆️ Update ggml-org/llama.cpp to c301172f660a1fe0b42023da990bf7385d69adb4 by @localai-bot in #8151
  • chore: ⬆️ Update ggml-org/llama.cpp to a5eaa1d6a3732bc0f460b02b61c95680bba5a012 by @localai-bot in #8165
  • chore: ⬆️ Update leejet/stable-diffusion.cpp to 5e4579c11d0678f9765463582d024e58270faa9c by @localai-bot in #8166

Full Changelog: v3.10.0...v3.10.1

v3.10.0

18 Jan 21:00
5f403b1

Choose a tag to compare

🎉 LocalAI 3.10.0 Release! 🚀




LocalAI 3.10.0 is big on agent capabilities, multi-modal support, and cross-platform reliability.

We've added native Anthropic API support, launched a new Video Generation UI, introduced Open Responses API compatibility, and enhanced performance with a unified GPU backend system.

For a full tour, see below!


📌 TL;DR

| Feature | Summary |
|---|---|
| Anthropic API Support | Fully compatible /v1/messages endpoint for seamless drop-in replacement of Claude. |
| Open Responses API | Native support for stateful agents with tool calling, streaming, background mode, and multi-turn conversations, passing all official acceptance tests. |
| Video & Image Generation Suite | New video generation UI + LTX-2 support for text-to-video and image-to-video. |
| Unified GPU Backends | GPU libraries (CUDA, ROCm, Vulkan) packaged inside backend containers — works out of the box on Nvidia, AMD, and ARM64 (experimental). |
| Tool Streaming & XML Parsing | Full support for streaming tool calls and XML-formatted tool outputs. |
| System-Aware Backend Gallery | Only see backends your system can run (e.g., hide MLX on Linux). |
| Crash Fixes | Prevents crashes on AVX-only CPUs (Intel Sandy/Ivy Bridge) and fixes VRAM reporting on AMD GPUs. |
| Request Tracing | Debug agents & fine-tuning with memory-based request/response logging. |
| Moonshine Backend | Ultra-fast transcription engine for low-end devices. |
| Pocket-TTS | Lightweight, high-fidelity text-to-speech with voice cloning. |
| Vulkan arm64 Builds | We now build Vulkan backends and images for arm64 as well. |

🚀 New Features & Major Enhancements

🤖 Open Responses API: Build Smarter, Autonomous Agents

LocalAI now supports the OpenAI Responses API, enabling powerful agentic workflows locally.

  • Stateful conversations via response_id — resume and manage long-running agent sessions.
  • Background mode: Run agents asynchronously and fetch results later.
  • Streaming support for tools, images, and audio.
  • Built-in tools: Web search, file search, and computer use (via MCP integrations).
  • Multi-turn interaction with dynamic context and tool use.

✅ Ideal for developers building agents that can browse, analyze files, or interact with systems — all on your local machine.

🔧 How to Use:

  • Set response_id in your request to maintain session state across calls.
  • Use background: true to run agents asynchronously.
  • Retrieve results via GET /api/v1/responses/{response_id}.
  • Enable streaming with stream: true to receive partial responses and tool calls in real time.

📌 Tip: Use response_id to build agent orchestration systems that persist context and avoid redundant computation.
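
As a rough sketch of the workflow above, a background request body might look like the following (field names follow the OpenAI Responses API convention; the model name is a placeholder, not a value from this release):

```python
import json

# Hypothetical Responses API request body. "local-model" stands in for
# any model installed in your LocalAI instance.
request_body = {
    "model": "local-model",
    "input": "Summarize the results of today's agent run.",
    "background": True,   # run asynchronously, fetch the result later
    "stream": False,
}

payload = json.dumps(request_body, indent=2)
print(payload)
```

The response to such a request carries an id that you can then poll via GET /api/v1/responses/{response_id}, as described above.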

Our support passes all the official acceptance tests:

(Screenshot: Open Responses API acceptance tests passing)

🧠 Anthropic Messages API: Clone Claude Locally

LocalAI now fully supports the Anthropic messages API.

  • Use https://api.localai.host/v1/messages as a drop-in replacement for Claude.
  • Full tool/function calling support, just like OpenAI.
  • Streaming and non-streaming responses.
  • Compatible with anthropic-sdk-go, LangChain, and other tooling.

🔥 Perfect for teams migrating from Anthropic to local inference with full feature parity.
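
To illustrate the drop-in shape of the endpoint, here is a minimal sketch of an Anthropic-style /v1/messages request body (the model name and token limit are placeholders; check your deployment for real values):

```python
import json

# Hypothetical request body for the Anthropic-compatible /v1/messages
# endpoint. Structure mirrors the upstream Messages API.
body = {
    "model": "local-model",
    "max_tokens": 256,
    "stream": True,  # SSE streaming, mirroring the upstream API
    "messages": [
        {"role": "user", "content": "Hello from a local Claude-compatible client!"}
    ],
}

print(json.dumps(body, indent=2))
```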


🎥 Video Generation: From Text to Video in the Web UI

  • New dedicated video generation page with intuitive controls.
  • LTX-2 is supported.
  • Supports text-to-video and image-to-video workflows.
  • Built on top of diffusers with full compatibility.

📌 How to Use:

  • Go to /video in the web UI.
  • Enter a prompt (e.g., "A cat walking on a moonlit rooftop").
  • Optionally upload an image for image-to-video generation.
  • Adjust parameters like fps, num_frames, and guidance_scale.
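
The parameters listed above can be sketched as a generation payload. The parameter names (fps, num_frames, guidance_scale) come from the UI; the surrounding structure is an illustrative assumption, not LocalAI's exact request schema:

```python
import json

# Hypothetical video-generation parameter set; tune values to taste.
params = {
    "prompt": "A cat walking on a moonlit rooftop",
    "fps": 24,              # output frame rate
    "num_frames": 48,       # clip length in frames (2s at 24 fps)
    "guidance_scale": 3.5,  # prompt adherence vs. creativity
}

print(json.dumps(params, indent=2))
```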

⚙️ Unified GPU Backends: Acceleration Works Out of the Box

A major architectural upgrade: GPU libraries (CUDA, ROCm, Vulkan) are now packaged inside backend containers.

  • Single image: You no longer need to pull a GPU-specific image. Any image works whether or not you have a GPU.
  • No more manual GPU driver setup — just run the image and get acceleration.
  • Works on Nvidia (CUDA), AMD (ROCm), and ARM64 (Vulkan).
  • Vulkan arm64 builds enabled
  • Reduced image complexity, faster builds, and consistent performance.

🚀 This means latest/master images now support GPU acceleration on all platforms — no extra config!

Note: this is experimental, please help us by filing an issue if something doesn't work!


🧩 Tool Streaming & Advanced Parsing

Enhance your agent workflows with richer tool interaction.

  • Streaming tool calls: Receive partial tool arguments in real time (e.g., input_json_delta).
  • XML-style tool call parsing: Models that return tools in XML format (<function>...</function>) are now properly parsed alongside text.
  • Works across all backends (llama.cpp, vLLM, diffusers, etc.).

💡 Enables more natural, real-time interaction with agents that use structured tool outputs.
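
To make the XML-style parsing concrete, here is an illustrative sketch (not LocalAI's actual parser) of splitting a model reply into plain text and <function>...</function> tool calls:

```python
import json
import re

# Example model output mixing prose with an XML-wrapped tool call.
reply = (
    "Let me check the weather. "
    '<function>{"name": "get_weather", "arguments": {"city": "Rome"}}</function>'
)

# Extract each tool call's JSON body, then strip the tags from the text.
calls = [json.loads(m) for m in re.findall(r"<function>(.*?)</function>", reply, re.S)]
text = re.sub(r"<function>.*?</function>", "", reply, flags=re.S).strip()

print(text)              # prose portion of the reply
print(calls[0]["name"])  # parsed tool-call name
```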


🌐 System-Aware Backend Gallery: Only Compatible Backends Show

The backend gallery now shows only backends your system can run.

  • Auto-detects system capabilities (CPU, GPU, MLX, etc.).
  • Hides unsupported backends (e.g., MLX on Linux, CUDA on AMD).
  • Shows detected capabilities in the hero section.

🎤 New TTS Backends: Pocket-TTS

Add expressive voice generation to your apps with Pocket-TTS.

  • Real-time text-to-speech with voice cloning support (requires HF login).
  • Lightweight, fast, and open-source.
  • Available in the model gallery.

🗣️ Perfect for voice agents, narrators, or interactive assistants.
Note: Voice cloning requires HF authentication and a registered voice model.


🔍 Request Tracing: Debug Your Agents

Trace requests and responses in memory — great for fine-tuning and agent debugging.

  • Enable via runtime setting or API.
  • Logs are stored in memory and dropped once the maximum size is reached.
  • Fetch logs via GET /api/v1/trace.
  • Export to JSON for analysis.

🪄 New 'Reasoning' Field: Extract Thinking Steps

LocalAI now automatically detects and extracts thinking tags from model output.

  • Supports both SSE and non-SSE modes.
  • Displays reasoning steps in the chat UI (under "Thinking" tab).
  • Fixes an issue where thinking content appeared as part of the final answer.
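
The extraction described above can be sketched as follows (the tag name varies by model; <think> is a common convention and is assumed here for illustration):

```python
import re

# Example model output: reasoning wrapped in <think> tags, then the answer.
output = "<think>The user wants 2+2, which is 4.</think>The answer is 4."

# Split reasoning from the final answer; fall back to the raw output
# when no thinking tag is present.
m = re.match(r"\s*<think>(.*?)</think>\s*(.*)", output, re.S)
reasoning, answer = (m.group(1), m.group(2)) if m else ("", output)

print(reasoning)  # shown under the "Thinking" tab
print(answer)     # shown as the chat reply
```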

🚀 Moonshine Backend: Faster Transcription for Low-End Devices

Add Moonshine, an ONNX-based transcription engine, for fast, lightweight speech-to-text.

  • Optimized for low-end devices (Raspberry Pi, older laptops).
  • One of the fastest transcription engines available.
  • Supports live transcription.

🛠️ Fixes & Stability Improvements

🔧 Prevent BMI2 Crashes on AVX-Only CPUs

Fixed crashes on older Intel CPUs (Ivy Bridge, Sandy Bridge) that lack BMI2 instructions.

  • Now safely falls back to llama-cpp-fallback (SSE2 only).
  • No more EOF errors during model warmup.

✅ Ensures LocalAI runs smoothly on older hardware.


📊 Fix Swapped VRAM Usage on AMD GPUs

rocm-smi output is now parsed correctly: used and total VRAM are no longer swapped.

  • Fixes misreported memory usage on dual-Radeon setups.
  • Handles HIP_VISIBLE_DEVICES properly (e.g., when using only discrete GPU).


Read more