diff --git a/docs/en/solutions/ecosystem/nacos/How_to_Configure_Nacos_Hot_Standby_with_Shared_MySQL.md b/docs/en/solutions/ecosystem/nacos/How_to_Configure_Nacos_Hot_Standby_with_Shared_MySQL.md new file mode 100644 index 00000000..260f0b01 --- /dev/null +++ b/docs/en/solutions/ecosystem/nacos/How_to_Configure_Nacos_Hot_Standby_with_Shared_MySQL.md @@ -0,0 +1,156 @@ +--- +kind: + - Solution +products: + - Alauda Application Services +ProductsVersion: + - 3.18,4.0,4.1 +--- + +# How to Configure Nacos Hot Standby for Configuration Disaster Recovery (Shared MySQL) + +## Introduction + +This guide describes a hot-standby disaster-recovery (DR) topology for **Nacos 2.5 configuration data** built on top of a DR-capable MySQL. A primary Nacos cluster ("Nacos A") and a standby Nacos cluster ("Nacos B") share the same logical MySQL (typically an MGR-based DR setup). A third-tier load balancer fronts both clusters so that the active endpoint can be switched quickly during a site failure. + +This plan applies to Nacos 2.5.x on ACP 3.18, 4.0, and 4.1. + +## Scope and Limitations + +This plan covers **configuration-resource hot standby only**. It does not provide multi-write — both Nacos clusters writing to the same database concurrently will corrupt data. + +Other limitations to communicate to customers up-front: + +1. **Data consistency**: because both clusters write to the same logical database, high concurrent writes or replication lag can produce transient inconsistencies between active and standby that briefly affect service discovery accuracy. The plan tolerates this for config workloads but not for naming workloads (see "Ephemeral services do not replicate" below). +2. **Resource contention**: dual-cluster access concentrates load on one database. Heavy write traffic can cause lock contention. +3. **Switch-over complexity**: when failing over from A to B, residual DB load or replication lag can extend the switch window. + +> **Risk**: This plan is for **read-side hot standby** of configuration data. Do **not** use it in active-active write mode — concurrent writes risk data divergence. + +## How Nacos Refresh Works + +Nacos-server has two reconciliation paths against MySQL: + +| Mechanism | Default interval | Setting | +| --- | --- | --- | +| Full dump | 6 hours | `DUMP_ALL_INTERVAL_IN_MINUTE` (constant) | +| Incremental dump | 30 seconds | `dumpChangeWorkerInterval` (hidden) gated by `dumpChangeOn` (hidden, default `true`) — Nacos 2.5 | + +Practically, a direct write to the underlying database (with the matching timestamp update) propagates to the Nacos-server cache in roughly 30 seconds, and from there to clients via push. + +For the upstream implementation, see the Nacos source under `console/src/main/java/com/alibaba/nacos/console/controller` and the related dump-task classes in `config/src/main/java/com/alibaba/nacos/config/server/service/dump`. + +## Architecture + +```text + ┌──────────────────────────┐ + │ External LB / F5 │ + (active) ─────────►│ 8848/http 9848/grpc │◄───────── (standby) + └──────────────────────────┘ + │ │ + ▼ ▼ + ┌───────────────┐ ┌───────────────┐ + │ Nacos A │ │ Nacos B │ + │ (primary) │ │ (standby) │ + └───────┬───────┘ └───────┬───────┘ + │ │ + └────────────┬───────────┘ + ▼ + DR-capable MySQL (MGR / equivalent) +``` + +- Nacos A is the active cluster that clients hit. +- Nacos B reads the same database and stands by, ready to take over. +- The LB has two listeners — `8848/http` and `9848/grpc` — and the gRPC port **must** be exactly `http + 1000`. 
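A quick way to confirm the listener pairing from any machine that can reach the LB is sketched below; the hostname is a placeholder and the readiness path assumes the default `/nacos` context path:

```bash
# Placeholder address — replace with the LB endpoint that fronts Nacos A/B.
LB_HOST=nacos-dr.example.com
HTTP_PORT=8848
GRPC_PORT=$((HTTP_PORT + 1000))   # Nacos 2.x clients derive the gRPC port as http + 1000

# HTTP listener: the readiness endpoint should answer through the LB.
curl -s "http://${LB_HOST}:${HTTP_PORT}/nacos/v1/console/health/readiness"; echo

# gRPC listener: curl cannot speak gRPC, but a plain TCP probe confirms the
# listener is open and routed alongside the HTTP one.
timeout 3 bash -c "</dev/tcp/${LB_HOST}/${GRPC_PORT}" && echo "gRPC listener reachable"
```

If the HTTP check answers but the TCP probe does not, 2.x clients will typically connect on `8848` and then fail to establish the gRPC channel — the usual symptom of a mis-paired listener.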
+ +## Prerequisites + +1. A DR-capable MySQL — Nacos supports MySQL or any MySQL-protocol-compatible store. The team's recommended DR pattern is the internal MGR hot-standby scheme (consult your DR runbook for that database). +2. The MySQL endpoint must be reachable from both Nacos clusters. +3. The user/database (`nacos_config` and the Nacos user) must be created exactly once on the shared logical MySQL. +4. An external load balancer or F5 device in front of both Nacos clusters. +5. Both Nacos clusters can be installed using [How to Deploy Nacos 2.5](./How_to_Deploy_Nacos_2.5.md). This document only spells out the deltas between a standalone install and the DR install. + +## Procedure + +### 1. Provision the DR MySQL + +Deploy the DR MySQL according to the team's MySQL DR runbook (MGR hot standby, or equivalent). Initialize the Nacos schema **only once** — Nacos B will reuse Nacos A's tables. + +### 2. Deploy Nacos A (Primary) + +Follow [How to Deploy Nacos 2.5](./How_to_Deploy_Nacos_2.5.md). When configuring `db.host`, point it at the DR MySQL endpoint accessible from cluster A. + +### 3. Deploy Nacos B (Standby) + +Follow the same plan, with two deltas: + +1. `db.host` must point at the DR MySQL endpoint accessible from cluster B (typically a different reader/writer route, but ultimately the same logical database). +2. The **JWT signing key must match Nacos A**. A user that logs in against A receives a token signed with A's key; once the LB cuts to B, B must accept that token, so both clusters need the same key. + +> **Note**: You do **not** need to copy the Nacos admin user across — it lives in the `users` table of the shared MySQL, so B inherits it automatically. `Server Identity Key` / `Server Identity Value` are **intra-cluster peer-auth** headers used between Nacos pods of the same cluster; they may differ between A and B without breaking the DR plan. + +### 4. Provision the External Load Balancer + +Configure two listeners on the LB: + +| Port | Protocol | Purpose | +| --- | --- | --- | +| `8848` | HTTP | Initial connection / handshake. | +| `9848` | gRPC | Real-time push between Nacos and clients. Must be exactly `8848 + 1000`. | + +Both listeners initially point at Nacos A. + +## Failover + +### Replication-Lag Budget + +The total propagation delay to a client request answered by the standby has two components: + +1. Database replication lag — determined by the underlying MySQL DR mechanism. +2. Nacos cache refresh — bounded by the 30-second incremental dump cycle (with some database-scan variance). + +Nacos DR RTO is therefore **at least `database DR RTO + 30 seconds`**, and can exceed that under sustained database load or while a longer full-dump pass is running. + +### Failover Steps + +1. **Verify standby data integrity** — log in to Nacos B's dashboard and confirm the expected configs are present and current. If MySQL replication has gaps, surface them now so the operator knows what may be missing. +2. **Cut the LB over to Nacos B** — switch both the `8848/http` and `9848/grpc` listeners simultaneously. Mismatched endpoints (HTTP pointing to A, gRPC pointing to B) will break push semantics. + +## Verification + +The verification scenarios below assume the LB still points at Nacos A. + +### Configuration Sync — Create + +1. Create config `test.yaml` on Nacos A with values `a: 1`, `b: 2`. +2. Open Nacos B's dashboard and verify `test.yaml` is visible with the same data within ~30 seconds. + +### Configuration Sync — Update + +1. On Nacos A, change `a` to `111`. +2. 
Refresh Nacos B's dashboard and confirm `a` is now `111`. + +### Configuration Sync — Delete + +1. Delete `test.yaml` on Nacos A. +2. Confirm it is gone from Nacos B. + +### Naming Data Behavior + +Naming data splits along the ephemeral/persistent boundary: + +- **Ephemeral instances** (heartbeat-driven, the default for most Spring Cloud / Dubbo apps) live only in each Nacos cluster's memory. They do **not** replicate. +- **Persistent instances** (`ephemeral=false`) are stored in MySQL `instances`-style tables and therefore *do* appear on the standby — but the live health state, push subscriptions, and dispatcher state are still in-memory, so persistent registrations cannot be served seamlessly through this DR scheme either. + +Run these to confirm that ephemeral naming traffic is **not** part of this plan: + +1. Register an ephemeral instance against Nacos A — it should not appear in Nacos B. +2. Register an ephemeral instance against Nacos B — it should not appear in Nacos A. + +For workloads that need DR coverage of naming data, treat that as a separate design and do **not** rely on this configuration-only plan. + +## References + +1. +2. diff --git a/docs/en/solutions/ecosystem/nacos/How_to_Deploy_Nacos_2.2.md b/docs/en/solutions/ecosystem/nacos/How_to_Deploy_Nacos_2.2.md new file mode 100644 index 00000000..a3688baa --- /dev/null +++ b/docs/en/solutions/ecosystem/nacos/How_to_Deploy_Nacos_2.2.md @@ -0,0 +1,258 @@ +--- +kind: + - Solution +products: + - Alauda Application Services +ProductsVersion: + - 3.18,4.0,4.1 +--- + +# How to Deploy Nacos 2.2 + +## Introduction + +This guide explains how to deliver a production-ready **Nacos 2.2.3** cluster on Alauda Container Platform (ACP) using the Nacos Chart from the Alauda application catalog. Use this document when a customer's SDK is still pinned to a 2.2-compatible client; otherwise prefer the newer [How to Deploy Nacos 2.5](./How_to_Deploy_Nacos_2.5.md) plan. + +> **Note**: "Primary" replaces the previously used term "Master" for the leading Nacos node in a cluster. + +## Pre-Delivery Notice + +1. **IPv6 is not supported.** +2. Nacos versions that the community has explicitly marked end-of-life cannot be supported by Alauda R&D. +3. The community provides no major-version upgrade path, so Alauda also has no in-place upgrade path. To move to a new major version, redeploy from scratch. +4. Alauda only supports Nacos clusters delivered using this plan. Customer-built Nacos clusters are out of scope. +5. Alauda's support covers troubleshooting, vulnerability patches, and bug fixes layered on top of the community release. +6. The Nacos version delivered by this plan is **2.2.3**. Nacos 2.1 has a known HA bug; customers running below 2.2.3 should be upgraded (by redeploy) to 2.2.3. +7. **Confirm SDK compatibility before delivery.** The most common customer issue is a client SDK that pre-dates 2.2.3 — apps then break in unpredictable ways. + +To check whether your application SDK version is compatible with this Nacos version, see the [Spring Cloud Alibaba component version table](https://github.com/alibaba/spring-cloud-alibaba/wiki/%E7%89%88%E6%9C%AC%E8%AF%B4%E6%98%8E#%E7%BB%84%E4%BB%B6%E7%89%88%E6%9C%AC%E5%85%B3%E7%B3%BB). + +## Architecture Overview + +- Nacos is delivered through a Helm Chart and is installed from the platform App Store. +- The cluster defaults to **three nodes** for high availability and can be scaled to any **odd number ≥ 3** (5, 7, …) to suit larger deployments. 
The Chart sets Kubernetes readiness/liveness probes by default. +- External access can be exposed through `NodePort` or `LoadBalancer`. On ACP, **ALB is the LoadBalancer implementation**; the Web-console verification section below uses an ALB listener. An Istio Ingress Gateway is also supported when one is already deployed in the cluster. +- Monitoring is enabled by default; customers can scrape Nacos metrics with Grafana. +- The plan does **not** cover cross-site DR replication or data migration. +- Major-version upgrades are achieved by destroying the old cluster and redeploying the new version. + +## Prerequisites + +### 1. Violet CLI + +Download the `violet` tool matching your cluster version from **App Store > App Onboarding**. + +### 2. Storage Class + +A working `StorageClass` is required. + +> **Known issue**: With TopoLVM, a physical-node restart has been observed to cause Nacos data loss. If you must use TopoLVM, plan node maintenance carefully. Other CSI drivers backed by network storage are safer. + +### 3. MySQL + +The Nacos community lists MySQL 5.6.5 as the absolute minimum, but **this plan requires MySQL 5.7.6 or higher** because the bootstrap SQL below uses `CREATE USER IF NOT EXISTS`, which MySQL 5.6 does not support (the clause was added in 5.7.6). MySQL 5.6 is also community-EOL (since 2021). You can use customer-provided MySQL or the Alauda Application Services MySQL Operator. + +> **Known issue — MySQL Router < 8.0.35**: Nacos connecting through MySQL Router prior to 8.0.35 fails with `Couldn't read RSA public key from server`. MySQL Router 8.0.35 fixes this. Alauda Application Services ships MySQL 8.0.36 starting in ACP 3.17 (and back-ported to small versions of 3.14 / 3.16). + +## Procedure + +### 1. Upload the Nacos Material Package + +Sign in to Alauda Cloud with a tenant account and download the `nacos` artifact from the App Marketplace. Then push the Nacos package into the target business cluster: + +```bash +violet push \ + --platform-address \ + --clusters \ + --platform-username \ + --platform-password \ + nacos-v2.2.3.tgz +``` + +Sign in to the platform as an administrator, switch to the **Nacos** project and namespace in the App Store, and confirm that the Nacos package is visible. + +### 2. Create the Nacos User and Database in MySQL + +The block to use depends on whether MySQL Router 8.0.35+ (which fixed the RSA-key handshake bug) is in front of MySQL: if you sit behind an older MySQL Router use `mysql_native_password`, otherwise prefer `caching_sha2_password`. Replace `` and `` with the values you intend to configure in the Nacos Chart. + +#### MySQL Router ≥ 8.0.35 (or direct connection to MySQL) + +```sql +CREATE DATABASE IF NOT EXISTS nacos_config; +CREATE USER IF NOT EXISTS ''@'%' + IDENTIFIED WITH caching_sha2_password BY ''; +GRANT ALL PRIVILEGES ON nacos_config.* TO ''@'%'; +FLUSH PRIVILEGES; +``` + +#### MySQL Router < 8.0.35 + +Compatible with MySQL server `8.0.x` (pre-`8.0.35` Router) and `5.7.6+` — the user must use the legacy `mysql_native_password` auth plugin to avoid the Router RSA-key handshake bug noted in Prerequisites. + +```sql +CREATE DATABASE IF NOT EXISTS nacos_config; +CREATE USER IF NOT EXISTS ''@'%' + IDENTIFIED WITH mysql_native_password BY ''; +GRANT ALL PRIVILEGES ON nacos_config.* TO ''@'%'; +FLUSH PRIVILEGES; +``` + +### 3. Deploy the Nacos Chart + +In the App Store, switch to the **Nacos** project and namespace, locate the Nacos chart, and click **Deploy**. + +Most parameters have sane defaults. 
The following fields deserve attention: + +| Field | Notes | +| --- | --- | +| `name` | The instance name; `nacos` is a sensible default. | +| `displayName` | Display name, typically `Nacos`. | +| `templateVersion` | For fresh environments only one version is usually shown; on upgrades, pick the newest. | +| Image registry | Must match the registry where the material was pushed; otherwise pulls will fail. | +| `-XX:InitialRAMPercentage` | Default `75.0`. JDK requires at least one decimal place. | +| `-XX:MaxRAMPercentage` | Default `75.0`. Same JDK requirement. | +| Resources | Lab-validated defaults: request 2 cores / 2.5 Gi, limit 2 cores / 4 Gi. Scale to actual load. | +| Deployment mode | `cluster` (default) for three-node HA; `standalone` for single node. Production must use `cluster`. | +| Startup mode | `naming` (default) — Nacos acts as registry only. `config` — config center only. `all` — both. | +| Context path | Default `/nacos`. If changed, replace `/nacos` in all verification URLs below. | +| Admin password | Default `nacos`. Use a strong custom password. Nacos 2.2.3 honours password changes made in the Web console even after restart. | +| `Server Identity Key` | Header key for inter-node auth. For private networks, `identitykey` is fine. Replaces the pre-1.4.1 User-Agent scheme. | +| `Server Identity Value` | Matching header value, e.g. `identityvalue`. | +| Data StorageClass | Name of the StorageClass that backs Nacos data, e.g. `sc-topolvm`. | +| Log StorageClass | Name of the StorageClass for logs. Keeping logs on a separate class protects the data PV from log-driven exhaustion. | +| `db.host` | MySQL host. When using the platform internal MySQL service, include the namespace: `.`. | +| `db.port` | MySQL port. Default `3306`. | +| `db.name` | MySQL database name. Default `nacos_config`. | +| `db.user` | MySQL user used by Nacos (and by the init container that creates the schema). Default `nacos`. | +| `db.password` | Password matching the user above. | + +> **Warning**: Redeploying the Chart wipes the underlying database. Back up first if you intend to re-create the instance. +> +> **JWT signing key**: Unlike the 2.5 chart, the Alauda 2.2 chart does **not** surface a `JWT signing key` parameter — Nacos 2.2.3 falls back to its built-in default token secret. The default is suitable for inter-namespace traffic on a trusted network but is publicly known, so do not rely on it as a security boundary. If you need a custom key, override `nacos.core.auth.default.token.secret.key` in `application.properties` through the chart's advanced options (and remember the same base64 / decoded-≥-32-bytes rule called out in the 2.5 doc). + +## Verification + +### 1. API Verification + +`exec` into any non-Nacos pod in the cluster: + +```bash +kubectl -n exec -it -- sh +``` + +In the commands below, replace `` with `..svc.cluster.local`, `` with the Nacos service port (default `8848`), and `` with the access token returned by the login call. 
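If you prefer to script the checks, the placeholders can be set once as shell variables and the token captured for reuse — a minimal sketch, assuming the default admin credentials and a hypothetical service name and namespace:

```bash
# Hypothetical values — substitute your own Nacos service name, namespace, and port.
NACOS_HOST=nacos.nacos-ns.svc.cluster.local
NACOS_PORT=8848

# Log in once and pull the accessToken field out of the JSON response.
TOKEN=$(curl -s -X POST "http://${NACOS_HOST}:${NACOS_PORT}/nacos/v1/auth/login" \
  -d 'username=nacos&password=nacos' \
  | sed -E 's/.*"accessToken":"([^"]+)".*/\1/')
echo "token: ${TOKEN}"
```

The individual calls follow for step-by-step verification.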
+ +#### Acquire a Token + +```bash +curl -X POST 'http://:/nacos/v1/auth/login' \ + -d 'username=nacos&password=nacos' +``` + +Sample response: + +```json +{"accessToken":"eyJhbGciOiJI...","tokenTtl":18000,"globalAdmin":true} +``` + +#### Register an Instance + +```bash +curl -X POST 'http://:/nacos/v1/ns/instance?serviceName=nacos.naming.serviceName&ip=20.18.7.10&port=8080&accessToken=' +``` + +#### Discover Instances + +```bash +curl -X GET 'http://:/nacos/v1/ns/instance/list?serviceName=nacos.naming.serviceName&accessToken=' +``` + +> **Note**: The registered instance will report `"healthy":false` because this verification only POSTs a registration and never sends heartbeats. For an ephemeral registration, "unhealthy without heartbeats" is the expected steady state. + +#### Publish Configuration + +```bash +curl -X POST "http://:/nacos/v1/cs/configs?dataId=nacos.cfg.dataId&group=test&content=helloWorld&accessToken=" +``` + +#### Retrieve Configuration + +```bash +curl -X GET "http://:/nacos/v1/cs/configs?dataId=nacos.cfg.dataId&group=test&accessToken=" +``` + +> **Note**: The examples above use the v1 OpenAPI for simplicity. Nacos 2.x also exposes a [v2 OpenAPI](https://nacos.io/docs/next/manual/user/open-api/) (`/nacos/v2/...`) with JSON bodies and a different auth path (`/nacos/v2/auth/user/login`) — useful for production tooling, but the v1 calls shown here are the quickest manual smoke test. + +### 2. Web Console Verification + +The Nacos console is exposed through ALB. Confirm ALB is deployed first, then add a listener: + +| Field | Value | +| --- | --- | +| Port | Any free port. | +| Protocol | `TCP`. | +| Algorithm | Round-Robin (default). | +| Internal route group | `nacos`, port `8848` (Nacos default). | +| Session affinity | `Source IP hash`. | +| Backend protocol | `TCP`. | + +Open `http://:/nacos`. The default credentials are `nacos / nacos` — change them immediately on first login. + +## FAQ + +### Q1. Memory usage exceeds 80% when a 1.x client connects (Nacos resources 4c8g) + +Temporarily scale up the Nacos resources to absorb the load, then migrate the client to a 2.x SDK. The root cause is high-frequency heartbeats from 1.x clients that the server cannot reclaim. + +Upstream issue: . + +### Q2. After a graceful shutdown of a Nacos client application, the data Nacos reports is inconsistent + +Monitor Nacos disk and memory. Disk exhaustion or memory pressure degrades Nacos performance and produces inconsistent reads. + +### Q3. HA Nacos on TopoLVM drops out of sync after a host restart + +Affected Nacos versions: **2.2.3 and below.** + +- On **2.2.3**, the cluster ends up in a divergent state but is recoverable: restart the offline Nacos pod and it rejoins. +- On versions **below 2.2.3**, the divergence is unrecoverable — redeploy to 2.2.3 (or later). + +Upstream issue: . + +### Q4. Nacos pod is in `CrashLoopBackOff` with `User limit of inotify instances reached or too many open files` + +The host inotify quota is exhausted (often by other workloads on the same node). + +Raise the limits on the host: + +```text +fs.inotify.max_queued_events = 32768 +fs.inotify.max_user_instances = 65536 +fs.inotify.max_user_watches = 1048576 +``` + +Also raise `nofile` in `/etc/security/limits.conf` if applications keep many descriptors open. Review applications that create and destroy inotify instances frequently and pool their usage. + +### Q5. 
Nacos pod logs `UnknownHostException jmenv.tbsite.net` + +The Nacos peer-finder plugin failed to write `cluster.conf` (often because the API server is overloaded or temporarily unreachable), so Nacos falls back to a hard-coded Taobao-internal endpoint (`jmenv.tbsite.net`). Verify API server health and restart the Nacos pods once it is stable. Upstream code reference: [`alibaba/nacos` "tbsite" search](https://github.com/search?q=repo%3Aalibaba%2Fnacos%20tbsite&type=code). + +### Q6. Nacos client logs `Ignore the empty nacos configuration and get it based on dataId` + +Nacos resolves configs by composing names; the log line is expected during startup. With older clients, the **file format used by the client** matters — `bootstrap.yaml` succeeds where `bootstrap.properties` may not retrieve configs cleanly. An Alauda-internal Spring Cloud demo lives at `https://gitlab-ce.alauda.cn/middleware/nacos-spring-cloud-example` (ask your Alauda contact for an exported copy if you do not have access to that GitLab). + +### Q7. What MySQL size should Nacos use? + +| Scale | CPU (vCores) | Memory (RAM) | Storage (SSD) | InnoDB Buffer Pool | +| --- | --- | --- | --- | --- | +| Small / Test | 2 | 4 GB | 50 GB+ | 2–3 GB | +| Medium production | 4 | 8–16 GB | 100–250 GB+ | 4–12 GB | +| Large production | 8+ | 16–32 GB+ | 250–500 GB+ | 12–24 GB+ | + +- **Small** — lab, dev, or low-microservice-density early production. +- **Medium** — stable microservice production with clear performance/availability expectations. +- **Large** — high-throughput, mission-critical production where availability and data safety are paramount. + +### Q8. After a Nacos 2.2.3 ephemeral instance is taken offline, the pod remains registered under the service + +Known community issue affecting Nacos 2.2.3. Fixed in 2.3.x. Track upstream: . To eliminate the symptom permanently, redeploy Nacos using the [Nacos 2.5 plan](./How_to_Deploy_Nacos_2.5.md). diff --git a/docs/en/solutions/ecosystem/nacos/How_to_Deploy_Nacos_2.5.md b/docs/en/solutions/ecosystem/nacos/How_to_Deploy_Nacos_2.5.md new file mode 100644 index 00000000..7df62f9b --- /dev/null +++ b/docs/en/solutions/ecosystem/nacos/How_to_Deploy_Nacos_2.5.md @@ -0,0 +1,245 @@ +--- +kind: + - Solution +products: + - Alauda Application Services +ProductsVersion: + - 3.18,4.0,4.1 +--- + +# How to Deploy Nacos 2.5 + +## Introduction + +This guide explains how to deliver a production-ready Nacos 2.5.x cluster on Alauda Container Platform (ACP) using the Nacos Chart from the Alauda application catalog. The plan covers prerequisites, MySQL initialization, Chart parameters, post-deployment verification through the OpenAPI and Web console, and a Grafana dashboard for ongoing monitoring. + +> **Note**: "Primary" replaces the previously used term "Master" for the leading Nacos node in a cluster. + +## Pre-Delivery Notice + +1. Nacos versions that the community has explicitly marked end-of-life cannot be supported by Alauda R&D. +2. The community provides no major-version upgrade path, so Alauda also has no in-place upgrade path for customers. To move to a new major version, redeploy from scratch. +3. Alauda only supports Nacos clusters delivered using this plan. Customer-built Nacos clusters are out of scope. +4. Alauda's support covers troubleshooting, vulnerability patches, and bug fixes layered on top of the community release. +5. The Nacos version delivered by this plan is **2.5.1**. 
+ +To check whether your application SDK version is compatible with this Nacos version, see the [Spring Cloud Alibaba component version table](https://github.com/alibaba/spring-cloud-alibaba/wiki/%E7%89%88%E6%9C%AC%E8%AF%B4%E6%98%8E#%E7%BB%84%E4%BB%B6%E7%89%88%E6%9C%AC%E5%85%B3%E7%B3%BB). + +## Architecture Overview + +- Nacos is delivered through a Helm Chart and is installed from the platform App Store. +- The cluster defaults to **three nodes** for high availability and can be scaled to any **odd number ≥ 3** (5, 7, …) to suit larger deployments. The Chart sets Kubernetes readiness/liveness probes by default. +- External access can be exposed through `NodePort` or `LoadBalancer`. On ACP, **ALB is the LoadBalancer implementation**, and the Web-console verification section below uses an ALB listener; if you exposed Nacos via `NodePort` instead, substitute a NodePort Service for the ALB listener. +- Monitoring is enabled by default; a dedicated Grafana dashboard is provided in this guide. +- The plan does **not** cover cross-site DR replication or data migration. For cross-site DR, see the companion document on Nacos hot standby. +- Major-version upgrades are achieved by destroying the old cluster and redeploying the new version. + +## Prerequisites + +### 1. Violet CLI + +Download the `violet` tool matching your cluster version from **App Store > App Onboarding**. + +### 2. Storage Class + +A working `StorageClass` is required. + +> **Known issue**: With TopoLVM, a physical-node restart has been observed to cause Nacos data loss. If you must use TopoLVM, plan node maintenance carefully. Other CSI drivers backed by network storage are safer. + +### 3. MySQL + +The Nacos community lists MySQL 5.6.5 as the absolute minimum, but **this plan requires MySQL 5.7.6 or higher** because the bootstrap SQL below uses `CREATE USER IF NOT EXISTS`, which MySQL 5.6 does not support (the clause was added in 5.7.6). MySQL 5.6 is also community-EOL (since 2021). You can use customer-provided MySQL or the Alauda Application Services MySQL Operator. + +> **Known issue — MySQL Router < 8.0.35**: Nacos connecting through MySQL Router prior to 8.0.35 fails with `Couldn't read RSA public key from server`. MySQL Router 8.0.35 fixes this. Alauda Application Services ships MySQL 8.0.36 starting in ACP 3.17 (and back-ported to small versions of 3.14 / 3.16). + +## Procedure + +### 1. Upload the Nacos Material Package + +Sign in to Alauda Cloud with a tenant account and download the `nacos` artifact from the App Marketplace. Then push it into the target business cluster: + +```bash +violet push \ + --platform-address \ + --clusters \ + --platform-username \ + --platform-password \ + nacos-v2.5.x-yyyy.tgz +``` + +Sign in to the platform as an administrator, switch to the **Nacos** project and namespace in the App Store, and confirm that the Nacos package is visible. + +### 2. Create the Nacos User and Database in MySQL + +The block to use depends on whether MySQL Router 8.0.35+ (which fixed the RSA-key handshake bug) is in front of MySQL: pick `mysql_native_password` for Router `< 8.0.35`; otherwise — including direct MySQL connections — prefer `caching_sha2_password`. Replace `` and `` with the values you intend to configure in the Nacos Chart. 
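If it is unclear which block applies, the relevant versions can be checked up front — a sketch, assuming the `mysql` client is installed and using placeholder connection details:

```bash
# Check the MySQL server version (placeholder endpoint and admin account).
mysql -h <mysql-endpoint> -P 3306 -u <admin-user> -p -e "SELECT VERSION();"

# When connections go through MySQL Router, also check the Router version on its host;
# anything below 8.0.35 means the mysql_native_password block applies.
mysqlrouter --version
```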
+ +#### MySQL Router ≥ 8.0.35 (or direct connection to MySQL) + +```sql +CREATE DATABASE IF NOT EXISTS nacos_config; +CREATE USER IF NOT EXISTS ''@'%' + IDENTIFIED WITH caching_sha2_password BY ''; +GRANT ALL PRIVILEGES ON nacos_config.* TO ''@'%'; +FLUSH PRIVILEGES; +``` + +#### MySQL Router < 8.0.35 + +Compatible with MySQL server `8.0.x` (pre-`8.0.35` Router) and `5.7.6+` — the user must use the legacy `mysql_native_password` auth plugin to avoid the Router RSA-key handshake bug noted in Prerequisites. + +```sql +CREATE DATABASE IF NOT EXISTS nacos_config; +CREATE USER IF NOT EXISTS ''@'%' + IDENTIFIED WITH mysql_native_password BY ''; +GRANT ALL PRIVILEGES ON nacos_config.* TO ''@'%'; +FLUSH PRIVILEGES; +``` + +### 3. Deploy the Nacos Chart + +In the App Store, switch to the **Nacos** project and namespace, locate the Nacos chart, and click **Deploy**. + +Most parameters have sane defaults. The following fields deserve attention: + +| Field | Notes | +| --- | --- | +| `name` | The instance name; `nacos` is a sensible default. | +| `displayName` | Display name, typically `Nacos`. | +| `templateVersion` | For fresh environments only one version is usually shown; on upgrades, pick the newest. | +| Image registry | Must match the registry where the material was pushed; otherwise pulls will fail. | +| `-XX:InitialRAMPercentage` | Default `75.0`. JDK requires at least one decimal place. | +| `-XX:MaxRAMPercentage` | Default `75.0`. Same JDK requirement. | +| Resources | Lab-validated defaults: request 2 cores / 2.5 Gi, limit 2 cores / 4 Gi. Scale to actual load. | +| Deployment mode | `cluster` (default) for three-node HA; `standalone` for single node. Production must use `cluster`. The cluster size can be any odd number ≥ 3. | +| Context path | Default `/nacos`. If changed, replace `/nacos` in all verification URLs below. | +| Admin password | Default `nacos`. Use a strong custom password. In Nacos 2.5 the password change in the Web console is honoured after restart. | +| `Server Identity Key` | Header key for inter-node auth. For private networks, `identitykey` is fine. Replaces the pre-1.4.1 User-Agent scheme. | +| `Server Identity Value` | Matching header value, e.g. `identityvalue`. | +| `JWT signing key` | Used to sign user-login JWTs (HS256 / RFC 7518). Must be a **base64-encoded** string whose **decoded** value is at least **32 bytes** long — i.e. the base64 string itself is at least **44 characters**. A shorter key causes Nacos to refuse to start. | +| Data StorageClass | Name of the StorageClass that backs Nacos data, e.g. `sc-topolvm`. | +| Log StorageClass | Name of the StorageClass for logs. Keeping logs on a separate class protects the data PV from log-driven exhaustion. | +| `db.host` | MySQL host. When using the platform internal MySQL service, include the namespace: `.`. | +| `db.port` | MySQL port. Default `3306`. | +| `db.name` | MySQL database name. Default `nacos_config`. | +| `db.user` | MySQL user used by Nacos (and by the init container that creates the schema). Default `nacos`. | +| `db.password` | Password matching the user above. | + +> **Warning**: Redeploying the Chart wipes the underlying database. Back up first if you intend to re-create the instance. + +## Verification + +### 1. API Verification + +`exec` into any non-Nacos pod in the cluster: + +```bash +kubectl -n exec -it -- sh +``` + +In the commands below, replace `` with `..svc.cluster.local`, `` with the Nacos service port (default `8848`), and `` with the access token returned by the login call. 
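If you prefer to script the checks, the placeholders can be set once as shell variables and the token captured for reuse — a minimal sketch with hypothetical service/namespace values and the default credentials:

```bash
# Hypothetical values — substitute your own Nacos service name, namespace, and port.
NACOS_HOST=nacos.nacos-ns.svc.cluster.local
NACOS_PORT=8848

# Log in once and extract the accessToken field from the JSON response.
TOKEN=$(curl -s -X POST "http://${NACOS_HOST}:${NACOS_PORT}/nacos/v1/auth/login" \
  -d 'username=nacos&password=nacos' \
  | sed -E 's/.*"accessToken":"([^"]+)".*/\1/')
echo "token: ${TOKEN}"
```

The individual calls follow for step-by-step verification.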
+ +#### Acquire a Token + +```bash +curl -X POST 'http://:/nacos/v1/auth/login' \ + -d 'username=nacos&password=nacos' +``` + +Sample response: + +```json +{"accessToken":"eyJhbGciOiJI...","tokenTtl":18000,"globalAdmin":true} +``` + +#### Register an Instance + +```bash +curl -X POST 'http://:/nacos/v1/ns/instance?serviceName=nacos.naming.serviceName&ip=20.18.7.10&port=8080&accessToken=' +``` + +#### Discover Instances + +```bash +curl -X GET 'http://:/nacos/v1/ns/instance/list?serviceName=nacos.naming.serviceName&accessToken=' +``` + +> **Note**: The registered instance will report `"healthy":false` because this verification only POSTs a registration and never sends heartbeats. For an ephemeral registration, "unhealthy without heartbeats" is the expected steady state. + +#### Publish Configuration + +```bash +curl -X POST "http://:/nacos/v1/cs/configs?dataId=nacos.cfg.dataId&group=test&content=helloWorld&accessToken=" +``` + +#### Retrieve Configuration + +```bash +curl -X GET "http://:/nacos/v1/cs/configs?dataId=nacos.cfg.dataId&group=test&accessToken=" +``` + +> **Note**: The examples above use the v1 OpenAPI for simplicity. Nacos 2.x also exposes a [v2 OpenAPI](https://nacos.io/docs/next/manual/user/open-api/) (`/nacos/v2/...`) with JSON request bodies and a different auth path (`/nacos/v2/auth/user/login`) — useful for production tooling, but the v1 calls shown here are the quickest manual smoke test. + +### 2. Web Console Verification + +The Nacos console is exposed through ALB. Confirm ALB is deployed first, then add a listener: + +| Field | Value | +| --- | --- | +| Port | Any free port. | +| Protocol | `TCP`. | +| Algorithm | Round-Robin (default). | +| Internal route group | `nacos`, port `8848` (Nacos default). | +| Session affinity | `Source IP hash`. | +| Backend protocol | `TCP`. | + +Open `http://:/nacos`. The default credentials are `nacos / nacos` — change them immediately on first login. + +## Monitoring Dashboard + +Alauda ships an ACP-native `nacos-dashboard.yaml` (a custom Dashboard resource for ACP's monitoring stack) — ask your Alauda contact for the file and apply it with: + +```bash +kubectl create -f nacos-dashboard.yaml +``` + +Once applied, find the Nacos dashboard under **Platform Management > Operations Center > Monitoring > Dashboards**. + +If you prefer a community alternative, the Nacos team publishes a standard Grafana dashboard (Nacos exposes Prometheus metrics at `/nacos/actuator/prometheus`) — import it from the official Nacos repository (`console/src/main/resources/static/img/nacos_dashboard.json`) into your own Grafana. + +## FAQ + +### Q1. After a graceful shutdown of a Nacos client application, the data Nacos reports is inconsistent + +Monitor Nacos disk, memory, and CPU. Disk exhaustion or memory pressure degrades Nacos performance and produces inconsistent reads — adjust resources to stay clear of these limits. + +### Q2. Nacos pod is in `CrashLoopBackOff` with `User limit of inotify instances reached or too many open files` + +The host inotify quota is exhausted (often by other workloads on the same node). Raise the limits on the host: + +```text +fs.inotify.max_queued_events = 32768 +fs.inotify.max_user_instances = 65536 +fs.inotify.max_user_watches = 1048576 +``` + +Also raise `nofile` in `/etc/security/limits.conf` if applications on the node keep many descriptors open, and audit applications that churn inotify instances so they pool them. + +### Q3. 
Nacos pod logs `UnknownHostException jmenv.tbsite.net` + +The Nacos peer-finder plugin failed to write `cluster.conf` (often because the API server is overloaded or temporarily unreachable), so Nacos falls back to a hard-coded Taobao-internal endpoint (`jmenv.tbsite.net`). Verify API server health and restart the Nacos pods once it is stable. Upstream code reference: [`alibaba/nacos` "tbsite" search](https://github.com/search?q=repo%3Aalibaba%2Fnacos%20tbsite&type=code). + +### Q4. Nacos client logs `Ignore the empty nacos configuration and get it based on dataId` + +Nacos resolves configs by composing names; the log line is expected during startup. With older clients, the **file format used by the client** matters — `bootstrap.yaml` succeeds where `bootstrap.properties` may not retrieve configs cleanly. An Alauda-internal Spring Cloud demo lives at `https://gitlab-ce.alauda.cn/middleware/nacos-spring-cloud-example` (ask your Alauda contact for an exported copy if you do not have access to that GitLab). + +### Q5. What MySQL size should Nacos use? + +| Scale | CPU (vCores) | Memory (RAM) | Storage (SSD) | InnoDB Buffer Pool | +| --- | --- | --- | --- | --- | +| Small / Test | 2 | 4 GB | 50 GB+ | 2–3 GB | +| Medium production | 4 | 8–16 GB | 100–250 GB+ | 4–12 GB | +| Large production | 8+ | 16–32 GB+ | 250–500 GB+ | 12–24 GB+ | + +- **Small** — lab, dev, or low-microservice-density early production. +- **Medium** — stable microservice production with clear performance/availability expectations. +- **Large** — high-throughput, mission-critical production where availability and data safety are paramount.
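To check whether an existing MySQL instance already matches the sizing above, the buffer pool can be inspected — and, on MySQL 5.7.5 and later, resized online — as in this sketch (connection details are placeholders):

```bash
# Inspect the current buffer pool size in bytes (placeholder endpoint and admin account).
mysql -h <mysql-endpoint> -u <admin-user> -p \
  -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size';"

# Example: grow the pool online to 4 GiB for a medium production profile.
# Persist the same value in the server's my.cnf so it survives a restart.
mysql -h <mysql-endpoint> -u <admin-user> -p \
  -e "SET GLOBAL innodb_buffer_pool_size = 4 * 1024 * 1024 * 1024;"
```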