Releases: mudler/LocalAI
v4.1.3
What's Changed
Bug fixes 🐛
- fix(token): login via legacy api keys by @mudler in #9249
- fix(anthropic): do not emit empty tokens and fix SSE tool calls by @mudler in #9258
- fix(gpu): better detection for MacOS and Thor by @mudler in #9263
👒 Dependencies
- chore(deps): bump google.golang.org/grpc from 1.79.3 to 1.80.0 by @dependabot[bot] in #9253
- chore(deps): bump github.com/jaypipes/ghw from 0.23.0 to 0.24.0 by @dependabot[bot] in #9250
- chore(deps): bump github.com/aws/aws-sdk-go-v2/config from 1.32.12 to 1.32.14 by @dependabot[bot] in #9256
- chore(deps): bump go.opentelemetry.io/otel/exporters/prometheus from 0.64.0 to 0.65.0 by @dependabot[bot] in #9254
Other Changes
- chore: ⬆️ Update ggml-org/llama.cpp to d0a6dfeb28a09831d904fc4d910ddb740da82834 by @localai-bot in #9259
- docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #9260
- chore: ⬆️ Update ace-step/acestep.cpp to e0c8d75a672fca5684c88c68dbf6d12f58754258 by @localai-bot in #9261
- chore: ⬆️ Update leejet/stable-diffusion.cpp to 8afbeb6ba9702c15d41a38296f2ab1fe5c829fa0 by @localai-bot in #9262
Full Changelog: v4.1.2...v4.1.3
v4.1.2
What's Changed
Bug fixes 🐛
- fix(autoparser): correctly pass by logprobs by @mudler in #9239
- fix(chat): do not retry if we had chatdeltas or tooldeltas from backend by @mudler in #9244
Other Changes
- Update index.yaml and add Qwen3.5 model files by @ER-EPR in #9237
- chore: ⬆️ Update ggml-org/llama.cpp to 761797ffdf2ce3f118e82c663b1ad7d935fbd656 by @localai-bot in #9243
- chore: ⬆️ Update leejet/stable-diffusion.cpp to 7397ddaa86f4e8837d5261724678cde0f36d4d89 by @localai-bot in #9242
- docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #9241
Full Changelog: v4.1.1...v4.1.2
v4.1.1
This is a patch release that addresses a few regressions from the last release and prepares for the upcoming Gemma 4. Most importantly, it:
- Fixes Gemma 4 tokenization with llama.cpp
- Shows the login page in API-key-only mode
- Includes small fixes to improve Anthropic API compatibility
What's Changed
Other Changes
- docs: Update Home Assistant integrations list by @loryanstrant in #9206
- chore: ⬆️ Update ggml-org/llama.cpp to a1cfb645307edc61a89e41557f290f441043d3c2 by @localai-bot in #9203
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #9210
- chore: bump inference defaults from unsloth by @github-actions[bot] in #9219
- docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #9214
- chore: ⬆️ Update ggml-org/llama.cpp to d006858316d4650bb4da0c6923294ccd741caefd by @localai-bot in #9215
- fix(ui): pass by staticApiKeyRequired to show login when only api key is configured by @mudler in #9220
- feat(gemma4): add thinking support by @mudler in #9221
- fix(nats): improve error handling by @mudler in #9222
- feat(autoparser): prefer chat deltas from backends when emitted by @mudler in #9224
- fix(anthropic): show null index when not present, default to 0 by @mudler in #9225
- feat(api): Allow coding agents to interactively discover how to control and configure LocalAI by @richiejp in #9084
- chore(refactor): use interface by @mudler in #9226
- fix(reasoning): accumulate and strip reasoning tags from autoparser results by @mudler in #9227
- chore(model-gallery): ⬆️ update checksum by @localai-bot in #9233
- chore: ⬆️ Update ggml-org/llama.cpp to b8635075ffe27b135c49afb9a8b5c434bd42c502 by @localai-bot in #9231
New Contributors
- @github-actions[bot] made their first contribution in #9219
Full Changelog: v4.1.0...v4.1.1
v4.1.0
🎉 LocalAI 4.1.0 Release! 🚀
LocalAI 4.1.0 is out! 🔥
Just weeks after the landmark 4.0, we're back with another massive drop. This release turns LocalAI into a production-grade AI platform: spin up a distributed cluster with smart routing and autoscaling, lock it down with built-in auth and per-user quotas, fine-tune models without leaving the UI, and much more. If 4.0 was the foundation, 4.1 is the control tower.
| Feature | Summary |
|---|---|
| 🌐 Distributed Mode | Run LocalAI as a cluster — smart routing, node groups, drain/resume, min/max autoscaling. |
| 🔐 Users & Auth | Built-in user management with OIDC, invite mode, API keys, and admin impersonation. |
| 📊 Quota System | Per-user usage quotas with predictive analytics and breakdown dashboards. |
| 🧪 Fine-Tuning | (experimental) Fine-tune models with TRL, auto-export to GGUF, and import back — all from the UI. |
| ⚗️ Quantization | (experimental) New backend for on-the-fly model quantization. |
| 🔧 Pipeline Editor | Visual model pipeline editor in the React UI. |
| 🤖 Standalone Agents | Run agents from the CLI with local-ai agent run. |
| 🧠 Smart Inferencing | Auto inference defaults from Unsloth, tool parsing fallback, and min_p support. |
| 🎬 Media History | Browse past generated images and media in Studio pages. |
New: full setup walkthrough (long version): https://www.youtube.com/watch?v=cMVNnlqwfw4
🚀 Key Features
🌐 Distributed Mode: scaling LocalAI horizontally
Run LocalAI as a distributed cluster and let it figure out where to send your requests. No more single-node bottlenecks.
- Smart Routing: Requests are routed to nodes ordered by available VRAM — the beefiest, free GPU gets the job.
- Node Groups: Pin models to specific node groups for workload isolation (e.g., "gpu-heavy" vs "cpu-light").
- Autoscaling: Built-in min/max autoscaler with a node reconciler that manages the lifecycle automatically.
- Drain & Resume: Gracefully drain nodes for maintenance and bring them back with a single API call.
- Cluster Dashboard: See your entire cluster status at a glance from the home page.
- Smart Model Transfer: Transfer models between nodes via S3 or peer-to-peer.
distributed-mode.mp4
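The routing rule above ("the beefiest, free GPU gets the job", skipping drained nodes) boils down to a simple selection. A minimal Python sketch of the idea; the field names (`draining`, `free_vram_mb`) are illustrative, not LocalAI's actual node schema:

```python
def pick_node(nodes):
    """Pick the non-draining node with the most free VRAM.

    Illustrative sketch of the smart-routing idea; the 'draining' and
    'free_vram_mb' fields are hypothetical, not LocalAI's real schema.
    """
    candidates = [n for n in nodes if not n.get("draining")]
    return max(candidates, key=lambda n: n["free_vram_mb"], default=None)

nodes = [
    {"name": "a", "free_vram_mb": 8000},
    {"name": "b", "free_vram_mb": 24000},
    {"name": "c", "free_vram_mb": 48000, "draining": True},
]
# Node "c" has the most VRAM but is draining, so "b" gets the request.
```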
🔐 Users, Authentication & Quotas
LocalAI now ships with a complete multi-user platform — perfect for teams, classrooms, or any shared deployment.
- User Management: Create, edit, and manage users from the React UI.
- OIDC/OAuth: Plug in your identity provider for SSO — Google, Keycloak, Authentik, you name it.
- Invite Mode: Restrict registration to invite-only with admin approval.
- API Keys: Per-user API key management.
- Admin Powers: Admins can impersonate users for debugging.
- Quota System: Set per-user usage quotas and enforce limits.
- Usage Analytics: Predictive usage dashboard with per-user breakdown statistics.
Users and quota:
usersquota-1775167475876.mp4
Usage metrics per user:
usage.mp4
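Conceptually, the quota enforcement described above is a check-and-record step per request. A toy sketch, assuming a token-based limit; the field names and semantics are illustrative, not LocalAI's actual quota implementation:

```python
from dataclasses import dataclass

@dataclass
class UserQuota:
    """Toy per-user token quota (illustrative, not LocalAI's schema)."""
    limit_tokens: int
    used_tokens: int = 0

    def try_consume(self, requested: int) -> bool:
        """Reject the request if it would push usage past the limit,
        otherwise record the consumption and allow it."""
        if self.used_tokens + requested > self.limit_tokens:
            return False
        self.used_tokens += requested
        return True

q = UserQuota(limit_tokens=1000)
# A request that fits is recorded; one that would overflow is rejected.
```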
🧪 Fine-Tuning & Quantization
No more juggling external tools. Fine-tune and quantize directly inside LocalAI.
- Fine-Tuning with TRL (Experimental): Train LoRA adapters with Hugging Face TRL, auto-export to GGUF, and import the result straight back into LocalAI. Includes a built-in evals framework to validate your work.
- Quantization Backend: Spin up the new quantization backend to create optimized model variants on-the-fly.
quantize-fine-tune.mp4
🎨 UI
The React UI keeps getting better. This release adds serious power-user features:
- Model Pipeline Editor: Visually wire up model pipelines — no YAML editing required.
- Per-Model Backend Logs: Drill into logs scoped to individual models for laser-focused debugging.
- Media History: Studio pages now remember your past generations — images, audio, and more.
- Searchable Model/Backend Selector: Quickly find models and backends with inline search and filtering.
- Structured Error Toasts: Errors now link directly to traces — one click from "something broke" to "here's why."
- Tracing Settings: Inline tracing config restored with a cleaner UI.
talk.mp4
🤖 Agents & Inference
- Standalone Agent Mode: Run agents straight from the terminal with `local-ai agent run`. Supports single-turn `--prompt` mode and pool-based configurations from `pool.json`.
- Streaming Tool Calls: Agent mode tool calls now stream in real time, with interleaved thinking fixed.
- Inferencing Defaults: Automatic inference parameters sourced from Unsloth and applied to all endpoints and gallery models; your models just work better out of the box.
- Tool Parsing Fallback: When native tool call parsing fails, an iterative fallback parser kicks in automatically.
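To illustrate the fallback idea: when the backend's native tool-call channel yields nothing parseable, a scanner can walk the raw text looking for the first valid JSON object that resembles a tool call. This is a hypothetical sketch of the approach, not LocalAI's actual parser:

```python
import json

def fallback_parse_tool_call(text: str):
    """Scan raw model output for the first JSON object with a 'name'
    field. Hypothetical sketch of an iterative fallback parser; the
    'name' heuristic is assumed, not LocalAI's exact logic."""
    dec = json.JSONDecoder()
    i = text.find("{")
    while i != -1:
        try:
            obj, _ = dec.raw_decode(text, i)
            if isinstance(obj, dict) and "name" in obj:
                return obj
        except json.JSONDecodeError:
            pass
        i = text.find("{", i + 1)
    return None

call = fallback_parse_tool_call(
    'thinking... {"name": "search", "arguments": {"q": "x"}}'
)
```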
🛠️ Under the Hood
- Repeated Log Merging: Noisy terminals? Repeated log lines are now collapsed automatically.
- Jetson/Tegra GPU Detection: First-class NVIDIA Jetson/Tegra platform detection.
- Intel SYCL Fix: Auto-disables `mmap` for SYCL backends to prevent crashes.
- llama.cpp Portability: Bundled `libdl`, `librt`, and `libpthread` for improved cross-platform support.
- HF_ENDPOINT Mirror: The downloader now rewrites Hugging Face URIs with `HF_ENDPOINT` for corporate/mirror setups.
- Transformers >5.0: Bumped to HuggingFace Transformers >5.0 with generic model loading.
- API Improvements: Proper 404s for missing models, unescaped model names, unified inferencing paths with automatic retry on transient errors.
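The `HF_ENDPOINT` rewrite mentioned above amounts to swapping the `huggingface.co` host for a configured mirror while keeping the path intact. A minimal sketch of that URI surgery (the function and its behavior are illustrative, not LocalAI's downloader code):

```python
from urllib.parse import urlparse, urlunparse

def rewrite_hf_uri(uri: str, hf_endpoint: str) -> str:
    """Swap the huggingface.co host for the configured mirror,
    preserving the resource path. Illustrative sketch only."""
    src, dst = urlparse(uri), urlparse(hf_endpoint)
    if src.netloc != "huggingface.co":
        return uri  # leave non-HF URIs untouched
    return urlunparse(src._replace(scheme=dst.scheme, netloc=dst.netloc))

mirrored = rewrite_hf_uri(
    "https://huggingface.co/org/model/resolve/main/f.gguf",
    "https://hf-mirror.example.com",
)
```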
🐞 Fixes & Improvements
- Embeddings: Implemented `encoding_format=base64` for the embeddings endpoint.
- Kokoro TTS: Fixed the phonemization model not downloading during installation.
- Realtime API: Fixed Opus codec backend selection alias in development mode.
- Gallery Filtering: Fixed exact tag matching for model gallery filters.
- Open Responses: Fixed the required `ORItemParam.Arguments` field being omitted; `ORItemParam.Summary` is now always populated.
- Tracing: Fixed settings not loading from `runtime_settings.json`.
runtime_settings.json. - UI: Fixed watchdog field mapping, model list refresh on deletion, backend display in model config, MCP button ordering.
- Downloads: Fixed directory removal during fallback attempts; improved retry logic.
- Model Paths: Fixed `baseDir` assignment to use `ModelPath` correctly.
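On the `encoding_format=base64` fix: by the OpenAI convention (which this feature follows), the base64 payload decodes to a packed array of little-endian float32 values. A self-contained round-trip sketch, no server required:

```python
import base64
import struct

def decode_base64_embedding(b64: str) -> list[float]:
    """Decode a base64 embedding payload into floats, assuming the
    OpenAI wire convention of packed little-endian float32 values."""
    raw = base64.b64decode(b64)
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))

# Fabricated vector for the demo (values exactly representable in float32).
vec = [0.25, -1.5, 3.0]
payload = base64.b64encode(struct.pack(f"<{len(vec)}f", *vec)).decode()
decoded = decode_base64_embedding(payload)
```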
❤️ Thank You
LocalAI is a community-powered FOSS movement. Every star, every PR, every bug report matters.
If you believe in privacy-first, self-hosted AI:
- ⭐ Star the repo — it helps more than you think
- 🛠️ Contribute code, docs, or feedback
- 📣 Share with your team, your community, your world
Let's keep building the future of open AI — together. 💪
✅ Full Changelog
📋 Click to expand full changelog
What's Changed
Bug fixes 🐛
- fix: Change baseDir assignment to use ModelPath by @mudler in #9010
- fix(ui): correctly map watchdog fields by @mudler in #9022
- fix(api): unescape model names by @mudler in #9024
- fix(ui): Add tracing inline settings back and create UI tests by @richiejp in #9027
- Always populate ORItemParam.Summary by @tv42 in #9049
- fix(ui): correctly display backend if specified in the model config, re-order MCP buttons by @mudler in #9053
- fix(ui): Refresh model list on deletion by @richiejp in #9059
- fix(openresponses): do not omit required field ORItemParam.Arguments by @tv42 in #9074
- fix: Add tracing settings loading from runtime_settings.json by @localai-bot in #9081
- fix: use exact tag matching for model gallery tag filtering by @majiayu000 in #9041
- fix(realtime): Set the alias for opus so the development backend can be selected by @richiejp in #9083
- fix(llama.cpp): bundle libdl, librt, libpthread in llama-cpp backend by @mudler in #9099
- fix(download): do not remove dst dir until we try all fallbacks by @mudler in #9100
- fix(auth): do not allow to register in invite mode by @mudler in #9101
- fix(downloader): Rewrite full https HF URI with HF_ENDPOINT by @richiejp in #9107
- fix: implement encoding_format=base64 for embeddings endpoint by @walcz-de in #9135
- fix(coqui,nemo,voxcpm): Add dependencies to allow CI to progress by @richiejp in #9142
- fix(voxcpm): Force using a recent voxcpm version to kick the dependency solver by @richiejp in #9150
- fix: huggingface repo change the file name so Update index.yaml is needed by @ER-EPR in #9163
- fix(kokoro): Download phonemization model during installation by @richiejp in #9165...
v4.0.0
🎉 LocalAI 4.0.0 Release! 🚀
LocalAI 4.0.0 is out!
This major release transforms LocalAI into a complete AI orchestration platform. We've embedded agentic and hybrid search capabilities directly into the core, completely overhauled the user interface with React for a modern experience, and are thrilled to introduce Agenthub, a brand-new community hub to easily share and import agents. Alongside these massive updates, we've introduced powerful new features like Canvas mode for code artifacts, MCP apps, and full MCP client-side support.
| Feature | Summary |
|---|---|
| Agentic Orchestration & Agenthub | Native agent management with memory, skills, and the new Agenthub for community sharing. |
| Revamped React UI | Complete frontend rewrite for lightning-fast performance and modern UX. |
| Canvas Mode | Preview code blocks and artifacts side-by-side in the chat interface. |
| MCP Client-Side | Full Model Context Protocol support, MCP Apps, and tool streaming in chat. |
| WebRTC Realtime | WebRTC support for low-latency realtime audio conversations. |
| New Backends | Added experimental MLX Distributed, fish-speech, ace-step.cpp, and faster-qwen3-tts. |
| Infrastructure | Podman documentation, shell completion, and persistent data path separation. |
🚀 Key Features
🤖 Native Agentic Orchestration & Agenthub
LocalAI now includes agentic capabilities embedded directly in the core. You can manage, import, start, and stop agents via the new UI.
- 🌐 Agenthub: We are launching Agenthub! This is a centralized community space to share common agents and import them effortlessly into your LocalAI instance.
- Agent Management: Full lifecycle management via the React UI. Create Agents, connect them to Slack, configure MCP servers and skills.
- Skills Management: Centralized skill database for AI agents.
- Memory: Agents can utilize memory with Hybrid search (PostgreSQL) or embedded in-memory storage (Chromem).
- Observability: New "Events" column in the Agents list to track observables and status.
- 📚 Documentation: Dive into the new capabilities in our official Agents documentation.
agents.mp4
🎨 Revamped UI & Canvas Mode
The Web interface has been completely migrated to React, bringing a smoother experience and powerful new capabilities:
- Canvas Mode: Enable "canvas mode" in the chat to see code blocks and artifacts generated by the LLM in a dedicated preview bar on the right.
- System View: Tabbed navigation separating Models and Backends for better organization.
- Model Size Warnings: Visual warnings when model storage exceeds system RAM to prevent lockups.
- Traces: Improved trace display using accordions for better readability.
model-fit-canvas-mode.mp4
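The model-size warning above is essentially a storage-vs-RAM comparison surfaced in the UI. A minimal sketch of that check; the message format and threshold (plain "exceeds RAM") are illustrative, not LocalAI's exact logic:

```python
def model_fit_warning(model_bytes: int, ram_bytes: int):
    """Return a warning string when a model's on-disk size exceeds
    system RAM, else None. Illustrative sketch of the UI check."""
    if model_bytes <= ram_bytes:
        return None
    gib = 1024 ** 3
    return (f"model needs {model_bytes / gib:.1f} GiB "
            f"but only {ram_bytes / gib:.1f} GiB RAM is available")

# An 8 GiB model fits in 16 GiB RAM; a 13 GiB model does not fit in 8 GiB.
```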
🔌 MCP Apps & Client-Side Support
We’ve expanded support for the Model Context Protocol (MCP):
- MCP Apps: Select which servers to enable for the chat directly from the UI.
- Tool Streaming: Tools from MCP servers are automatically injected into the standard chat interface.
- Client-Side Support: Full client-side integration for MCP tools and streaming.
- Disable Option: Add `LOCALAI_DISABLE_MCP` to completely disable MCP support for security.
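An environment-variable kill switch like this typically gates the feature at startup. A sketch in the spirit of `LOCALAI_DISABLE_MCP`; the exact truthy values LocalAI accepts are assumed here ("1", "true", "yes"), so check the docs for specifics:

```python
import os

def mcp_enabled(env=os.environ) -> bool:
    """Return False when the MCP kill switch is set. The accepted
    truthy values are assumptions, not LocalAI's documented set."""
    return env.get("LOCALAI_DISABLE_MCP", "").lower() not in ("1", "true", "yes")

# Unset means MCP stays on; setting the variable turns it off.
```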
🎵 New Backends, Audio & Video Enhancements
- MLX Distributed (Experimental): We've added an experimental backend for running distributed workloads using Apple's MLX framework! Check out the docs here.
- New Audio Backends: Introduced fish-speech, ace-step.cpp, and faster-qwen3-tts (CUDA-only).
- WebRTC Realtime: WebRTC support added to the Realtime API and Talk page for better low-latency audio handling.
- TTS Improvements: Added `sample_rate` support via post-processing and multi-voice support for Qwen TTS.
- Video Generation: Fixed model selection dropdown sync and added `vllm-omni` backend detection.
🛠️ Infrastructure & Developer Experience
- Data Separation: New `--data-path` CLI flag and `LOCALAI_DATA_PATH` env var to separate persistent data (agents, skills) from configuration.
- Shell Completion: Dynamic completion scripts for bash, zsh, and fish.
- Podman Support: Dedicated documentation for Podman installation and rootless configuration.
- Gallery & Models: Model storage size display with RAM warnings, and fallback URI resolution for backend installation failures.
- Deprecations: HuggingFace backend support removed, and AIO images dropped to focus on main images.
🐞 Fixes & Improvements
- Logging: Fixed watchdog spamming logs when no interval was configured; downgraded health check logs to debug.
- CUDA Detection: Improved GPU vendor checks to prevent false CUDA detection on CPU-only hosts with runtime libs.
- Compatibility: Renamed `json_verbose` to `verbose_json` for OpenAI spec compliance (fixes Nextcloud integration).
- Embedding: Fixed embedding dimension truncation to return full native dimensions.
- Permissions: Changed model install file permissions to 0644 to ensure server readability.
- Windows Docker: Added named volumes to Docker Compose files for Windows compatibility.
- Model Reload: Models now reload automatically after editing the YAML config (e.g., `context_size`).
- Chat: Fixed an issue where thinking/reasoning blocks were sent to the LLM.
- Audio: Fixed img2img pipeline in diffusers backend and Qwen TTS duplicate argument error.
Known issues
- The `diffusers` backend currently fails to build (due to CI limit exhaustion) and is not part of this release; the previous version is still available. We are looking into it, but if you want to help and know someone at GitHub who could support us with better ARM runners, please reach out!
❤️ Thank You
LocalAI is a true FOSS movement — built by contributors, powered by community.
If you believe in privacy-first AI:
- ✅ Star the repo
- 💬 Contribute code, docs, or feedback
- 📣 Share with others
Your support keeps this stack alive.
✅ Full Changelog
📋 Click to expand full changelog
What's Changed
Breaking Changes 🛠
- Remove HuggingFace backend support by @localai-bot in #8971
- chore: drop AIO images by @mudler in #9004
Bug fixes 🐛
- fix(cli): Fix watchdog running constantly and spamming logs by @nanoandrew4 in #8624
- fix(api): Downgrade health/readiness check to debug by @nanoandrew4 in #8625
- fix: rename json_verbose to verbose_json by @lukasdotcom in #8627
- fix(chatterbox): add support for cuda13/aarch64 by @mudler in #8653
- fix: reload model after editing YAML config (issue #8647) by @localai-bot in #8652
- fix(chat): do not send thinking/reasoning messages to the LLM by @mudler in #8656
- fix: change file permissions from 0600 to 0644 in InstallModel by @localai-bot in #8657
- fix: Add named volumes for Windows Docker compatibility by @localai-bot in #8661
- fix(gallery): add fallback URI resolution for backend installation by @localai-bot in #8663
- fix: whisper breaking on cuda-13 (use absolute path for CUDA directory detection) by @localai-bot in #8678
- fix(gallery): clean up partially downloaded backend on installation failure by @localai-bot in #8679
- fix: properly sync model selection dropdown in video generation UI by @localai-bot in #8680
- fix: allow reranking models configured with known_usecases by @localai-bot in #8681
- fix: return full embedding dimensions instead of truncating trailing zeros (#8721) by @localai-bot in #8755
- fix: Add vllm-omni backend to video generation model detection (#8659) by @localai-bot in #8781
- fix(qwen-tts): duplicate instruct argument in voice design mode by @Weathercold in #8842
- Fix image upload processing and img2img pipeline in diffusers backend by @attilagyorffy in #8879
- fix: gate CUDA directory checks on GPU vendor to prevent false CUDA detection by @sozercan in #8942
- fix(llama-cpp): Set enable_thinking in the correct place by @richiejp in #8973
Exciting New Features 🎉
- feat(traces): Use accordian instead of pop-ups by @richiejp in #8626
- chore: remove install.sh script and documentation references by @localai-bot in #8643
- docs: add Podman installation documentation by @localai-bot in htt...
v3.12.1
This is a patch release to tag the new llama.cpp version, which fixes incompatibilities with Qwen 3 Coder.
What's Changed
Other Changes
- docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #8611
- feat(traces): Add backend traces by @richiejp in #8609
- chore: ⬆️ Update ggml-org/llama.cpp to b908baf1825b1a89afef87b09e22c32af2ca6548 by @localai-bot in #8612
- chore: drop bark.cpp leftovers from pipelines by @mudler in #8614
- fix: merge openresponses messages by @mudler in #8615
- chore: ⬆️ Update ggml-org/llama.cpp to ba3b9c8844aca35ecb40d31886686326f22d2214 by @localai-bot in #8613
Full Changelog: v3.12.0...v3.12.1
v3.12.0
🎉 LocalAI 3.12.0 Release! 🚀
LocalAI 3.12.0 is out!
| Feature | Summary |
|---|---|
| Multi-modal Realtime | Send text, images, and audio in real-time conversations for richer interactions. |
| Voxtral Backend | New high-quality text-to-speech backend added. |
| Multi-GPU Support | Improved Diffusers performance with multiple GPUs. |
| Legacy CPU Optimization | Enhanced compatibility for older processors. |
| UI Theme & Layout | Improved UI theme (dark/light variants) and navigation |
| Realtime Stability | Multiple fixes for audio, image, and model handling. |
| Logging Improvements | Reduced excessive logs and optimized processing. |
Local Stack Family
Liking LocalAI? LocalAI is part of an integrated suite of AI infrastructure tools, you might also like:
- LocalAGI - AI agent orchestration platform with OpenAI Responses API compatibility and advanced agentic capabilities
- LocalRecall - MCP/REST API knowledge base system providing persistent memory and storage for AI agents
- 🆕 Cogito - Go library for building intelligent, co-operative agentic software and LLM-powered workflows, focused on improving results for small, open-source language models while scaling to any LLM. Powers LocalAGI and LocalAI's MCP/agentic capabilities
- 🆕 Wiz - Terminal-based AI agent accessible via Ctrl+Space keybinding. Portable, local-LLM friendly shell assistant with TUI/CLI modes, tool execution with approval, MCP protocol support, and multi-shell compatibility (zsh, bash, fish)
- 🆕 SkillServer - Simple, centralized skills database for AI agents via MCP. Manages skills as Markdown files with MCP server integration, web UI for editing, Git synchronization, and full-text search capabilities
❤️ Thank You
LocalAI is a true FOSS movement — built by contributors, powered by community.
If you believe in privacy-first AI:
- ✅ Star the repo
- 💬 Contribute code, docs, or feedback
- 📣 Share with others
Your support keeps this stack alive.
✅ Full Changelog
📋 Click to expand full changelog
What's Changed
Bug fixes 🐛
- security: validate URLs to prevent SSRF in content fetching endpoints by @kolega-ai-dev in #8476
- fix(realtime): Use user provided voice and allow pipeline models to have no backend by @richiejp in #8415
- fix(realtime): Sampling and websocket locking by @richiejp in #8521
- fix(realtime): Send proper image data to backend by @richiejp in #8547
- fix: prevent excessive logging in capability detection by @localai-bot in #8552
- fix(voxcpm): pin setuptools by @mudler in #8556
- fix(llama-cpp): populate tensor_buft_override buffer so llama-cpp properly performs fit calculations by @cvpcs in #8560
- fix: pin neutts-air to known working commit by @localai-bot in #8566
- fix: improve watchdown logics by @mudler in #8591
- fix(llama-cpp): Pass parameters when using embedded template by @richiejp in #8590
- fix(realtime): Better support for thinking models and setting model parameters by @richiejp in #8595
- fix(realtime): Limit buffer sizes to prevent DoS by @richiejp in #8596
- fix(ui): improve view on mobile by @mudler in #8598
- fix(diffusers): sd_embed is not always available by @mudler in #8602
- fix: do not keep track model if not existing by @mudler in #8603
Exciting New Features 🎉
- feat(stablediffusion-ggml): Improve legacy CPU support for stablediffusion-ggml backend by @cvpcs in #8461
- feat(voxtral): add voxtral backend by @mudler in #8451
- feat(diffusers): add experimental support for sd_embed-style prompt embedding by @cvpcs in #8504
- chore: improve log levels verbosity by @localai-bot in #8528
- feat(realtime): Allow sending text, image and audio conversation items" by @richiejp in #8524
- chore: compute capabilities once by @mudler in #8555
- feat(ui): left navbar, dark/light theme by @mudler in #8594
- fix: multi-GPU support for Diffusers (Issue #8575) by @localai-bot in #8605
🧠 Models
- chore(model gallery): Add Ministral 3 family of models (aside from base versions) by @rampa3 in #8467
- chore(model gallery): add voxtral (which is only available in development) by @mudler in #8532
- chore(model gallery): Add npc-llm-3-8b by @rampa3 in #8498
- chore(model gallery): add nemo-asr by @mudler in #8533
- chore(model gallery): add voxcpm, whisperx, moonshine-tiny by @mudler in #8534
- chore(model gallery): add neutts by @mudler in #8535
- chore(model gallery): add vllm-omni models by @mudler in #8536
- chore(model-gallery): ⬆️ update checksum by @localai-bot in #8540
- feat(gallery): Add nanbeige4.1-3b by @richiejp in #8551
- chore(model-gallery): ⬆️ update checksum by @localai-bot in #8593
- chore(model-gallery): ⬆️ update checksum by @localai-bot in #8600
👒 Dependencies
- chore(deps): bump github.com/anthropics/anthropic-sdk-go from 1.20.0 to 1.22.0 by @dependabot[bot] in #8482
- chore(deps): bump github.com/jaypipes/ghw from 0.21.2 to 0.22.0 by @dependabot[bot] in #8484
- chore(deps): bump github.com/onsi/ginkgo/v2 from 2.28.0 to 2.28.1 by @dependabot[bot] in #8483
- chore(deps): bump github.com/alecthomas/kong from 1.13.0 to 1.14.0 by @dependabot[bot] in #8481
- chore(deps): bump github.com/openai/openai-go/v3 from 3.17.0 to 3.19.0 by @dependabot[bot] in #8485
- chore: bump cogito by @mudler in #8568
- fix(gallery): Use YAML v3 to avoid merging maps with incompatible keys by @richiejp in #8580
- chore(deps): bump google.golang.org/grpc from 1.78.0 to 1.79.1 by @dependabot[bot] in #8583
- chore(deps): bump github.com/jaypipes/ghw from 0.22.0 to 0.23.0 by @dependabot[bot] in #8587
- chore(deps): bump github.com/modelcontextprotocol/go-sdk from 1.2.0 to 1.3.0 by @dependabot[bot] in #8585
- chore(deps): bump cogito and add new options to the agent config by @mudler in #8601
Other Changes
- docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #8462
- docs: update model gallery documentation to reference main repository by @veeceey in #8452
- chore: ⬆️ Update ggml-org/whisper.cpp to 4b23ff249e7f93137cb870b28fb27818e074c255 by @localai-bot in #8463
- chore: ⬆️ Update ggml-org/llama.cpp to e06088da0fa86aa444409f38dff274904931c507 by @localai-bot in #8464
- chore: ⬆️ Update antirez/voxtral.c to c9e8773a2042d67c637fc492c8a655c485354080 by @localai-bot in #8477
- chore: ⬆️ Update ggml-org/llama.cpp to 262364e31d1da43596fe84244fba44e94a0de64e by @localai-bot in #8479
- chore: ⬆️ Update ggml-org/whisper.cpp to 764482c3175d9c3bc6089c1ec84df7d1b9537d83 by @localai-bot in #8478
- chore: ⬆️ Update ggml-org/llama.cpp to 57487a64c88c152ac72f3aea09bd1cc491b2f61e by @localai-bot in #8499
- chore: ⬆️ Update ggml-org/llama.cpp to 4d3daf80f8834e0eb5148efc7610513f1e263653 by @localai-bot in #8513
- chore: ⬆️ Update ggml-org/llama.cpp to 338085c69e486b7155e5b03d7b5087e02c0e2528 by @localai-bot in #8538
- fix: update moonshine API, add setuptools to voxcpm requirements by @mudler in #8541
- chore: ⬆️ Update ggml-org/llama.cpp to 05a6f0e8946914918758db767f6eb04bc1e38507 by @localai-bot in #8553
- chore: ⬆️ Update ggml-org/llama.cpp to 01d8eaa28d57bfc6d06e30072085ed0ef12e06c5 by @localai-bot in #8567
- chore: ⬆️ Update...
v3.11.0
🎉 LocalAI 3.11.0 Release! 🚀
LocalAI 3.11.0 is a massive update for Audio and Multimodal capabilities.
We are introducing Realtime Audio Conversations, a dedicated Music Generation UI, and a massive expansion of ASR (Speech-to-Text) and TTS backends. Whether you want to talk to your AI, clone voices, transcribe with speaker identification, or generate songs, this release has you covered.
Check out the highlights below!
📌 TL;DR
| Feature | Summary |
|---|---|
| Realtime Audio | Native support for audio conversations, enabling fluid voice interactions similar to OpenAI's Realtime API. Documentation |
| Music Generation UI | New UI interface for MusicGen (Ace-Step), allowing you to generate music from text prompts directly in the browser. |
| New ASR Backends | Added WhisperX (with Speaker Diarization), VibeVoice, Qwen-ASR, and Nvidia NeMo. |
| TTS Streaming | Text-to-Speech now supports streaming mode for lower latency responses. (VoxCPM only for now) |
| vLLM Omni | Added support for vLLM Omni, expanding our high-performance inference capabilities. |
| Speaker Diarization | Native support for identifying different speakers in transcriptions via WhisperX. |
| Hardware Expansion | Expanded build support for CUDA 12/13, L4T (Jetson), SBSA, and better Metal (Apple Silicon) integration with MLX backends |
| Breaking Changes | ExLlama (deprecated) and Bark (unmaintained) backends have been removed. |
🚀 New Features & Major Enhancements
🎙️ Realtime Audio Conversations
LocalAI 3.11.0 introduces native support for Realtime Audio Conversations.
- Enables fluid, low-latency voice interaction with agents.
- Logic handled directly within the LocalAI pipeline for seamless audio-in/audio-out workflows.
- Support for STT/TTS and voice-to-voice models (experimental)
- Support for tool calls
🗣️ Talk to your LocalAI: This brings us one step closer to a fully local, voice-native assistant experience compatible with standard client implementations.
Check here for detailed documentation.
🎵 Music Generation UI & Ace-Step
We have added a dedicated interface for music generation!
- New Backend: Support for Ace-Step (MusicGen) via the `ace-step` backend.
- Web UI Integration: Generate musical clips directly from the LocalAI Web UI.
- Simple text-to-music workflow (e.g., "Lo-fi hip hop beat for studying").
🎧 Massive ASR (Speech-to-Text) Expansion
This release significantly broadens our transcription capabilities with four new backends:
- WhisperX: Provides fast transcription with Speaker Diarization (identifying who is speaking).
- VibeVoice: Now also supports ASR alongside TTS.
- Qwen-ASR: Support for Qwen's powerful speech recognition models.
- Nvidia NeMo: Initial support for NeMo ASR.
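To show what diarized output enables: once segments carry speaker labels, a transcript becomes a simple label-plus-text rendering. The segment schema below is illustrative, not WhisperX's or LocalAI's exact response format:

```python
def format_diarized(segments):
    """Render diarized ASR segments into a readable transcript.
    The 'speaker'/'text' schema is an assumption for illustration."""
    return "\n".join(f"[{s['speaker']}] {s['text']}" for s in segments)

transcript = format_diarized([
    {"speaker": "SPEAKER_00", "text": "How are you?"},
    {"speaker": "SPEAKER_01", "text": "Fine, thanks."},
])
```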
🗣️ TTS Streaming & New Voices
Text-to-Speech gets a speed boost and new options:
- Streaming Support: TTS endpoints now support streaming, reducing the "time-to-first-audio" significantly.
- VoxCPM: Added support for the VoxCPM backend.
- Qwen-TTS: Added support for Qwen-TTS models
- Piper Voices: Added most remaining Piper voices from Hugging Face to the gallery.
🛠️ Hardware & Backend Updates
- vLLM Omni: A new backend integration for vLLM Omni models.
- Extended Platform Support: Major work on MLX to improve compatibility across CUDA 12, CUDA 13, L4T (Nvidia Jetson), SBSA, and macOS Metal.
- GGUF Cleanup: Dropped redundant VRAM estimation logic for GGUF loading, relying on more accurate internal measurements.
⚠️ Breaking Changes
To keep the project lean and maintainable, we have removed some older backends:
- ExLlama: Removed (deprecated in favor of newer loaders like ExLlamaV2 or llama.cpp).
- Bark: Removed (the upstream project is unmaintained; we recommend using the new TTS alternatives).
🚀 The Complete Local Stack for Privacy-First AI
| Project | Description |
|---|---|
| LocalAI | The free, Open Source OpenAI alternative. Drop-in replacement REST API compatible with OpenAI specifications for local AI inferencing. No GPU required. |
| LocalAGI | Local AI agent management platform. Drop-in replacement for OpenAI's Responses API, supercharged with advanced agentic capabilities and a no-code UI. |
| LocalRecall | RESTful API and knowledge base management system providing persistent memory and storage capabilities for AI agents. Works alongside LocalAI and LocalAGI. |
❤️ Thank You
LocalAI is a true FOSS movement — built by contributors, powered by community.
If you believe in privacy-first AI:
- ✅ Star the repo
- 💬 Contribute code, docs, or feedback
- 📣 Share with others
Your support keeps this stack alive.
✅ Full Changelog
📋 Click to expand full changelog
What's Changed
Breaking Changes 🛠
Bug fixes 🐛
- fix(ui): correctly display selected image model by @dedyf5 in #8208
- fix(ui): take account of reasoning in token count calculation by @mudler in #8324
- fix: drop gguf VRAM estimation (now redundant) by @mudler in #8325
- fix(api): Add missing field in initial OpenAI streaming response by @acon96 in #8341
- fix(realtime): Include noAction function in prompt template and handle tool_choice by @richiejp in #8372
- fix: filter GGUF and GGML files from model list by @Yaroslav98214 in #8397
- fix(qwen-asr): Remove contagious slop (DEFAULT_GOAL) from Makefile by @richiejp in #8431
Exciting New Features 🎉
- feat(vllm-omni): add new backend by @mudler in #8188
- feat(vibevoice): add ASR support by @mudler in #8222
- feat: add VoxCPM tts backend by @mudler in #8109
- feat(realtime): Add audio conversations by @richiejp in #6245
- feat(qwen-asr): add support to qwen-asr by @mudler in #8281
- feat(tts): add support for streaming mode by @mudler in #8291
- feat(api): Add transcribe response format request parameter & adjust STT backends by @nanoandrew4 in #8318
- feat(whisperx): add whisperx backend for transcription with speaker diarization by @eureka928 in #8299
- feat(mlx): Add support for CUDA12, CUDA13, L4T, SBSA and CPU by @mudler in #8380
- feat(musicgen): add ace-step and UI interface by @mudler in #8396
- fix(api)!: Stop model prior to deletion by @nanoandrew4 in #8422
- feat(nemo): add Nemo (only asr for now) backend by @mudler in #8436
🧠 Models
- chore(model gallery): add qwen3-tts to model gallery by @mudler in #8187
- chore(model gallery): Add most of not yet present Piper voices from Hugging Face by @rampa3 in #8202
- chore: drop bark which is unmaintained by @mudler in #8207
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #8220
- chore(model gallery): Add entry for Mistral Small 3.1 with mmproj by @rampa3 in https://git...
v3.10.1
This is a small patch release intended to provide bug fixes and minor polish. Alongside, we also added support for Qwen-TTS, which was released just yesterday.
- Fix reasoning detection on reasoning and instruct models
- Support reasoning blocks with openresponses
- API fixes to correctly run LTX-2
- Support Qwen3-TTS!
What's Changed
Bug fixes 🐛
- fix(reasoning): support models with reasoning without starting thinking tag by @mudler in #8132
- fix(tracing): Create trace buffer on first request to enable tracing at runtime by @richiejp in #8148
- fix(videogen): drop incomplete endpoint, add GGUF support for LTX-2 by @mudler in #8160
Exciting New Features 🎉
- feat(openresponses): Support reasoning blocks by @mudler in #8133
- feat: detect thinking support from backend automatically if not explicitly set by @mudler in #8167
- feat(qwen-tts): add Qwen-tts backend by @mudler in #8163
🧠 Models
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #8128
- chore(model gallery): add flux 2 and flux 2 klein by @mudler in #8141
- chore(model-gallery): ⬆️ update checksum by @localai-bot in #8153
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #8157
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #8170
👒 Dependencies
- chore(deps): bump github.com/mudler/cogito from 0.7.2 to 0.8.1 by @dependabot[bot] in #8124
Other Changes
- feat(swagger): update swagger by @localai-bot in #8098
- chore: ⬆️ Update ggml-org/llama.cpp to 287a33017b32600bfc0e81feeb0ad6e81e0dd484 by @localai-bot in #8100
- chore: ⬆️ Update leejet/stable-diffusion.cpp to 2efd19978dd4164e387bf226025c9666b6ef35e2 by @localai-bot in #8099
- docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #8120
- chore: ⬆️ Update leejet/stable-diffusion.cpp to a48b4a3ade9972faf0adcad47e51c6fc03f0e46d by @localai-bot in #8121
- chore: ⬆️ Update ggml-org/llama.cpp to 959ecf7f234dc0bc0cd6829b25cb0ee1481aa78a by @localai-bot in #8122
- chore(deps): Bump llama.cpp to '1c7cf94b22a9dc6b1d32422f72a627787a4783a3' by @mudler in #8136
- chore: drop noisy logs by @mudler in #8142
- chore: ⬆️ Update ggml-org/llama.cpp to ad8d85bd94cc86e89d23407bdebf98f2e6510c61 by @localai-bot in #8145
- chore: ⬆️ Update ggml-org/whisper.cpp to 7aa8818647303b567c3a21fe4220b2681988e220 by @localai-bot in #8146
- feat(swagger): update swagger by @localai-bot in #8150
- chore(diffusers): add 'av' to requirements.txt by @mudler in #8155
- chore: ⬆️ Update leejet/stable-diffusion.cpp to 329571131d62d64a4f49e1acbef49ae02544fdcd by @localai-bot in #8152
- chore: ⬆️ Update ggml-org/llama.cpp to c301172f660a1fe0b42023da990bf7385d69adb4 by @localai-bot in #8151
- chore: ⬆️ Update ggml-org/llama.cpp to a5eaa1d6a3732bc0f460b02b61c95680bba5a012 by @localai-bot in #8165
- chore: ⬆️ Update leejet/stable-diffusion.cpp to 5e4579c11d0678f9765463582d024e58270faa9c by @localai-bot in #8166
Full Changelog: v3.10.0...v3.10.1
v3.10.0
🎉 LocalAI 3.10.0 Release! 🚀
LocalAI 3.10.0 is big on agent capabilities, multi-modal support, and cross-platform reliability.
We've added native Anthropic API support, launched a new Video Generation UI, introduced Open Responses API compatibility, and enhanced performance with a unified GPU backend system.
For a full tour, see below!
📌 TL;DR
| Feature | Summary |
|---|---|
| Anthropic API Support | Fully compatible /v1/messages endpoint for seamless drop-in replacement of Claude. |
| Open Responses API | Native support for stateful agents with tool calling, streaming, background mode, and multi-turn conversations, passing all official acceptance tests. |
| Video & Image Generation Suite | New video gen UI + LTX-2 support for text-to-video and image-to-video. |
| Unified GPU Backends | GPU libraries (CUDA, ROCm, Vulkan) packaged inside backend containers — works out of the box on Nvidia, AMD, and ARM64 (Experimental). |
| Tool Streaming & XML Parsing | Full support for streaming tool calls and XML-formatted tool outputs. |
| System-Aware Backend Gallery | Only see backends your system can run (e.g., hide MLX on Linux). |
| Crash Fixes | Prevents crashes on AVX-only CPUs (Intel Sandy/Ivy Bridge) and fixes VRAM reporting on AMD GPUs. |
| Request Tracing | Debug agents & fine-tuning with memory-based request/response logging. |
| Moonshine Backend | Ultra-fast transcription engine for low-end devices. |
| Pocket-TTS | Lightweight, high-fidelity text-to-speech with voice cloning. |
| Vulkan arm64 builds | We now build backends and images for Vulkan on arm64 as well. |
🚀 New Features & Major Enhancements
🤖 Open Responses API: Build Smarter, Autonomous Agents
LocalAI now supports the OpenAI Responses API, enabling powerful agentic workflows locally.
- Stateful conversations via response_id — resume and manage long-running agent sessions.
- Background mode: Run agents asynchronously and fetch results later.
- Streaming support for tools, images, and audio.
- Built-in tools: Web search, file search, and computer use (via MCP integrations).
- Multi-turn interaction with dynamic context and tool use.
✅ Ideal for developers building agents that can browse, analyze files, or interact with systems — all on your local machine.
🔧 How to Use:
- Set response_id in your request to maintain session state across calls.
- Use background: true to run agents asynchronously.
- Retrieve results via GET /api/v1/responses/{response_id}.
- Enable streaming with stream: true to receive partial responses and tool calls in real time.
📌 Tip: Use response_id to build agent orchestration systems that persist context and avoid redundant computation.
Our support passes all the official acceptance tests.
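The workflow above can be sketched as a tiny payload builder. This is a minimal, hedged example: the endpoint and field names (response_id, background, stream) follow the description in these notes, and "your-model" is a placeholder; check the LocalAI API documentation for the exact schema of your version.

```python
import json

BASE_URL = "http://localhost:8080"  # assumption: LocalAI's default address

def build_request(prompt, response_id=None, background=False, stream=False):
    """Build a Responses API payload using the options described above.

    Field names follow the release notes and are assumptions; verify
    them against the API docs for your LocalAI version.
    """
    payload = {"model": "your-model", "input": prompt}  # placeholder model
    if response_id:
        payload["response_id"] = response_id  # resume an existing session
    if background:
        payload["background"] = True          # run asynchronously, fetch later
    if stream:
        payload["stream"] = True              # receive partial events
    return payload

# First turn, then a follow-up that reuses the session state:
first = build_request("Summarize the repo README")
follow = build_request("Now list action items",
                       response_id="resp_123", background=True)
print(json.dumps(follow, indent=2))
```

A background request like `follow` would then be polled via GET /api/v1/responses/{response_id} as described above.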
🧠 Anthropic Messages API: Clone Claude Locally
LocalAI now fully supports the Anthropic messages API.
- Use https://api.localai.host/v1/messages as a drop-in replacement for Claude.
- Full tool/function calling support, just like OpenAI.
- Streaming and non-streaming responses.
- Compatible with anthropic-sdk-go, LangChain, and other tooling.
🔥 Perfect for teams migrating from Anthropic to local inference with full feature parity.
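A minimal request body in the Anthropic Messages schema might look like the sketch below; the base URL and model name are placeholders for your own deployment, and the field layout follows the standard Anthropic Messages format.

```python
import json

BASE_URL = "http://localhost:8080"  # placeholder: point at your LocalAI instance

def messages_request(user_text, model="your-model", max_tokens=1024, stream=False):
    """Build a payload matching the Anthropic Messages API schema."""
    body = {
        "model": model,
        "max_tokens": max_tokens,  # required field in the Anthropic schema
        "messages": [{"role": "user", "content": user_text}],
    }
    if stream:
        body["stream"] = True  # request SSE streaming
    return BASE_URL + "/v1/messages", body

url, body = messages_request("Hello from LocalAI")
print(url)
print(json.dumps(body, indent=2))
```

Because the schema matches, existing Anthropic clients only need their base URL repointed at LocalAI.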
🎥 Video Generation: From Text to Video in the Web UI
- New dedicated video generation page with intuitive controls.
- LTX-2 is supported.
- Supports text-to-video and image-to-video workflows.
- Built on top of diffusers with full compatibility.
📌 How to Use:
- Go to /video in the web UI.
- Enter a prompt (e.g., "A cat walking on a moonlit rooftop").
- Optionally upload an image for image-to-video generation.
- Adjust parameters like fps, num_frames, and guidance_scale.
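As a rough guide to how these parameters interact, here is a hedged sketch of a generation config; the values are illustrative rather than recommended defaults, and only the parameter names listed above are taken from the release notes.

```python
# Illustrative settings you might enter on the /video page
# (values are examples, not tuned defaults):
video_params = {
    "prompt": "A cat walking on a moonlit rooftop",
    "image": None,           # optional: source image for image-to-video
    "fps": 24,               # frames per second of the output clip
    "num_frames": 96,        # total frames; 96 at 24 fps is a 4-second clip
    "guidance_scale": 7.0,   # higher values follow the prompt more strictly
}

# Clip length falls out of the two frame parameters:
duration_s = video_params["num_frames"] / video_params["fps"]
print(f"Requested clip length: {duration_s:.1f}s")  # 4.0s
```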
⚙️ Unified GPU Backends: Acceleration Works Out of the Box
A major architectural upgrade: GPU libraries (CUDA, ROCm, Vulkan) are now packaged inside backend containers.
- Single image: You no longer need to pull a GPU-specific image; any image works whether or not you have a GPU.
- No more manual GPU driver setup — just run the image and get acceleration.
- Works on Nvidia (CUDA), AMD (ROCm), and ARM64 (Vulkan).
- Vulkan arm64 builds enabled
- Reduced image complexity, faster builds, and consistent performance.
🚀 This means latest/master images now support GPU acceleration on all platforms — no extra config!
Note: this is experimental, please help us by filing an issue if something doesn't work!
🧩 Tool Streaming & Advanced Parsing
Enhance your agent workflows with richer tool interaction.
- Streaming tool calls: Receive partial tool arguments in real time (e.g., input_json_delta).
- XML-style tool call parsing: Models that return tools in XML format (<function>...</function>) are now properly parsed alongside text.
- Works across all backends (llama.cpp, vLLM, diffusers, etc.).
💡 Enables more natural, real-time interaction with agents that use structured tool outputs.
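To illustrate the XML convention, here is a small stand-alone parser that separates plain text from <function>...</function> blocks. It is a simplified sketch, not LocalAI's actual parser, which may accept additional formats.

```python
import re

def extract_tool_calls(text):
    """Split model output into plain text and XML-style tool call bodies.

    Illustrative parser for the <function>...</function> convention;
    the real LocalAI parser may handle more variants.
    """
    pattern = re.compile(r"<function>(.*?)</function>", re.DOTALL)
    calls = pattern.findall(text)          # inner content of each call
    plain = pattern.sub("", text).strip()  # surrounding free-form text
    return plain, calls

output = ('Checking the weather. '
          '<function>{"name": "get_weather", "arguments": {"city": "Rome"}}</function>')
plain, calls = extract_tool_calls(output)
print(plain)   # the text kept alongside the parsed call
print(calls)
```

This mirrors the behavior described above: the tool call is extracted while the surrounding text is preserved for the user-facing reply.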
🌐 System-Aware Backend Gallery: Only Compatible Backends Show
The backend gallery now shows only backends your system can run.
- Auto-detects system capabilities (CPU, GPU, MLX, etc.).
- Hides unsupported backends (e.g., MLX on Linux, CUDA on AMD).
- Shows detected capabilities in the hero section.
🎤 New TTS Backends: Pocket-TTS
Add expressive voice generation to your apps with Pocket-TTS.
- Real-time text-to-speech with voice cloning support (requires HF login).
- Lightweight, fast, and open-source.
- Available in the model gallery.
🗣️ Perfect for voice agents, narrators, or interactive assistants.
❗ Note: Voice cloning requires HF authentication and a registered voice model.
🔍 Request Tracing: Debug Your Agents
Trace requests and responses in memory — great for fine-tuning and agent debugging.
- Enable via runtime setting or API.
- Logs are stored in memory and dropped once the maximum size is reached.
- Fetch logs via GET /api/v1/trace.
- Export to JSON for analysis.
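The drop-after-max-size behavior can be modeled with a bounded deque; this is a toy illustration of the idea, not LocalAI's implementation, and the size limit shown is arbitrary.

```python
from collections import deque

class TraceBuffer:
    """Toy model of the in-memory trace store: once the maximum number
    of entries is reached, the oldest entries are dropped first."""

    def __init__(self, max_entries=1000):
        self._entries = deque(maxlen=max_entries)  # bounded ring buffer

    def record(self, request, response):
        self._entries.append({"request": request, "response": response})

    def dump(self):
        # Roughly what GET /api/v1/trace would return for JSON export
        return list(self._entries)

buf = TraceBuffer(max_entries=2)
buf.record({"path": "/v1/chat/completions"}, {"status": 200})
buf.record({"path": "/v1/messages"}, {"status": 200})
buf.record({"path": "/v1/responses"}, {"status": 200})  # evicts the oldest
print(len(buf.dump()))  # 2
```

Keeping traces in a bounded in-memory buffer means tracing can stay enabled at runtime without unbounded memory growth.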
🪄 New 'Reasoning' Field: Extract Thinking Steps
LocalAI now automatically detects and extracts thinking tags from model output.
- Supports both SSE and non-SSE modes.
- Displays reasoning steps in the chat UI (under "Thinking" tab).
- Fixes issue where thinking content appeared as part of final answer.
🚀 Moonshine Backend: Faster Transcription for Low-End Devices
Add Moonshine, an ONNX-based transcription engine, for fast, lightweight speech-to-text.
- Optimized for low-end devices (Raspberry Pi, older laptops).
- One of the fastest transcription engines available.
- Supports live transcription.
🛠️ Fixes & Stability Improvements
🔧 Prevent BMI2 Crashes on AVX-Only CPUs
Fixed crashes on older Intel CPUs (Ivy Bridge, Sandy Bridge) that lack BMI2 instructions.
- Now safely falls back to llama-cpp-fallback (SSE2 only).
- No more EOF errors during model warmup.
✅ Ensures LocalAI runs smoothly on older hardware.
📊 Fix Swapped VRAM Usage on AMD GPUs
Correctly parses rocm-smi output: used and total VRAM are now displayed correctly.
- Fixes misreported memory usage on dual-Radeon setups.
- Handles HIP_VISIBLE_DEVICES properly (e.g., when using only the discrete GPU).
🚀 The Complete Local Stack for Privacy-First AI
LocalAI |
The free, Open Source OpenAI alternative. Drop-in replacement REST API compatible with OpenAI specifications for local AI inferencing. No GPU required. |
LocalAGI |
Local AI agent management platform. Drop-in replacement for OpenAI's Responses API, supercharged with advanced agentic capabilities and a no-code UI. |
LocalRecall |
RESTful API and knowledge base management system providing persistent memory and storage capabilities for AI agents. Works alongside LocalAI and LocalAGI. |
