我做了个工具让 8GB 显卡跑 30B 模型从 3 tok/s 提到 21 tok/s，记录一下技术发现

V2EX = way to explore

V2EX 是一个关于分享和探索的地方

现在注册

已注册用户请登录

最近在折腾本地大模型，发现一个核心问题：Ollama 和 LM Studio 能让模型跑起来，但参数全靠猜——上下文长度、KV cache 类型、MoE expert 放哪、ubatch 多大……用默认参数基本是在浪费显卡。

于是做了个工具自动找最优配置，过程中踩了不少坑，记录一下。

核心发现

1. MoE 模型的 offload 策略决定了一切

Qwen3-30B-A3B 是 MoE 架构，在 8GB 显卡上：

LM Studio 默认把所有层塞进显存 → 7549MB （ 93%），3 tok/s
只把 attention 层放 GPU ，MoE expert 层走 CPU → 2603MB （ 32%），21 tok/s

快了 7 倍，显存反而省了 65%。关键是 llama.cpp 支持这个，但你得自己识别哪些 tensor 是 MoE expert （.ffn_.*_exps. 这类命名），然后手动配。

2. KV cache 类型影响比大多数人想的大

同一张 8GB 显卡跑 Llama 3.1 8B ，不同 KV cache 配置速度差异：

配置	ctx	速度
iso3+iso3 ，4 slot	8K	19.4 tok/s
q8_0+q4_0 ，1 slot	8K	38.2 tok/s
f16+f16 ，1 slot	8K	51.7 tok/s
f16+f16 ，1 slot （自动）	64K	26.2 tok/s

f16 比 iso3 快将近 3 倍。但 f16 显存占用更大，所以正确策略是：先算 f16 KV cache 占多少显存，装得下就用 f16 ，装不下再降级。

公式：KV_MB = 2 × layers × kv_heads × head_dim × ctx × bytes / 1024²

3. oobabooga 公式用来预测 ctx 上限

社区里流传的 oobabooga 显存估算公式，原本用来预测装载模型后剩余显存能支持多大 ctx 。但这个公式是基于 q8_0/f16 拟合的，用 iso3 的时候会严重高估显存需求，导致 ctx 只算出 4K 。

最后放弃公式预测，改成二分探测：从 min(nativeCtx, 65536) 开始，OOM 就减半，最多探 5 次，让 llama-server 自己告诉我能跑多少。Llama 3.1 8B 的 ctx 从 4K 直接到 64K 。

4. parallel slot 数量对单用户场景影响巨大

llama.cpp 默认开 4 个并行 slot （为了多用户并发），但单用户场景下这会把 VRAM 分成 4 份。

关掉多余 slot （--parallel 1）之后：18.5 → 38.2 tok/s ，直接翻倍。

5. ubatch 实测比理论更可靠

ubatch 128 vs 512 的性能差异跟模型和显卡都有关系，没有通用最优值。实测结论：

8K ctx：ubatch 512 比 128 快 7.6%
64K ctx：ubatch 512 比 128 快 21.6%

直接 benchmark 两个值取快的，比查文档猜靠谱。

6. 对话压缩不要用模型生成摘要

最初方案是上下文满了之后调本地模型生成摘要——结果单 slot 阻塞，直接超时。

改成纯算法提取：保留头部（ system prompt + 首轮对话）和尾部（最近 8K tokens ），中间部分提取代码路径、函数名、文件名、TODO 等关键信息。压缩率 73%，耗时 <1ms 。

用了哪些技术，实现了什么功能

llama.cpp — 推理引擎核心

直接调用 llama.cpp 的 llama-server ，所有参数（ ctx 、KV cache 类型、线程数、ubatch 、mlock 、tensor split ）都通过启动参数注入。Kaiwu 本质上是一个参数决策层，不改推理引擎本身。

IsoQuant / TurboQuant — 3-bit KV cache 压缩

集成了 johndpope 的 turboquant fork （feature/planarquant-kv-cache），支持 -ctk iso3 -ctv iso3 参数。iso3 的压缩系数实测 0.73 ，理论值 0.75 ，在 VRAM 紧张的设备（ 8GB ）上可以把 KV cache 占用压缩到 q8_0 的一半。但有约 600MB 固定解码 buffer 开销，VRAM 充裕时反而比 f16 慢 8%，所以策略是 VRAM > 16GB 才默认开 iso3 。

oobabooga 显存估算公式 — ctx 上限预测（已放弃）

社区流传的公式用来预测剩余显存能支持多大 ctx ，基于 q8_0/f16 拟合。iso3 场景下高估显存需求，导致 ctx 只算出 4K 。最终改成二分探测代替公式，让 llama-server 自己决定能跑多少。

GQA 架构识别 — KV cache 精准估算

Qwen3 等新模型用 GQA （ Grouped Query Attention ），kv_heads 远小于 attention_heads 。KV cache 大小公式里用的是 kv_heads 而不是 heads ，不识别这一点会高估 3-4 倍。通过读 GGUF metadata 拿到准确的 kv_heads 值再做计算。

MoE tensor 识别 — 自动 expert offload

读取模型的 tensor 名称列表，匹配 .ffn_.*_exps. 模式识别出 MoE expert 层，自动决定把这部分路由到 CPU 。不需要用户手动指定，也不需要提前知道模型架构。

Extractive Summary — 零延迟对话压缩

上下文到 75% 时触发，纯算法提取：保留 system prompt 、首轮对话、最近 8K tokens ，中间部分按关键词权重保留（代码路径、函数名、文件名、TODO 、命令行等）。不调用任何模型，压缩耗时 <1ms ，73% 压缩率。最初试过调本地模型生成摘要，单 slot 阻塞直接超时，这条路走不通。

GitHub Actions CI — 跨平台自动编译

turboquant fork 需要自己编译带 iso3 支持的 llama-server 。用 GitHub Actions 同时编译 Windows （ MSVC ）和 Linux （ GCC ）版本，CUDA 12.4 ，覆盖 sm_75/80/86/89 架构，RTX 50 系列通过 PTX JIT 运行时支持。踩了三个 MSVC 编译坑（ extern "C" 声明改定义、M_PI 未定义、全局符号缺失），记录在 PROGRESS.md 里。

工具

把上面这些逻辑都自动化了，叫开物（ Kaiwu ）。一行命令启动，参数全部自动找，结果缓存起来，第二次 2 秒启动。

GitHub： https://github.com/val1813/kaiwu

OpenAI 兼容 API ，Continue / Cursor / Claude Code 直接接。

有遇到类似问题的欢迎交流，尤其是 MoE offload 和 KV cache 这块踩坑挺深的。

第 1 条附言 · 14 小时 28 分钟前

上线一天感谢大家试用我们加紧优化了 8 个版本
从 v0.1.1 到 v0.1.8 ，Kaiwu 解决的核心问题可以分三类：

让模型真正跑起来（稳定性）
v0.1.1 发布时，iso3 KV cache 检测时机有 bug——warmup 用 iso3 参数启动 llama-server ，但还没确认 binary 支不支持，导致所有 ctx 探测失败、误报 OOM （ v0.1.2 修）。Blackwell 架构（ RTX 50 系）的 iso3 检测超时只有 10 秒，但 SM120 首次启动需要 PTX JIT 编译约 30 秒，直接超时失败（ v0.1.4 修）。MoE offload 的 -ot 正则路由 expert 层到 CPU 实际没生效，所有层都上 GPU ，8GB 装不下 13GB 模型（ v0.1.6 修，改用 --cpu-moe ）。kaiwu run /path/to/model.gguf 传绝对路径时实际走下载流程而不是用本地文件（ v0.1.6 修）。

让 warmup 找到真正最优的配置（调参质量）
iso3 每次启动都重新检测，RTX 50 系每次要等 30 秒（ v0.1.5 加磁盘缓存）。小模型（<2GB ） ubatch 用 512 导致--kv-unified 预分配 OOM （ v0.1.5 修，降到 128 ）。MoE offload 的 KV cache 选择用全量模型大小估算 GPU 占用，算出来是负数，导致 warmup 全 OOM （ v0.1.6 修，信任 --fit on 处理层分配）。MoE warmup 超时只有 60 秒，13GB 模型从内存加载来不及（ v0.1.6 延长到 180 秒）。MoE warmup 用 18 tok/s 阈值，但 PCIe 带宽限制下 MoE 最多 13-15tok/s ，阈值永远达不到，warmup 总是 fallback 到最小 ctx （ v0.1.7 改 8 ，v0.1.8 直接去掉阈值——MoE 速度跟 ctx 无关，找最大能跑的 ctx 就行）。

让更多客户端能用（兼容性）
新版 Cursor 和 Claude Code 调用 /responses （不带 /v1/ 前缀），proxy 只注册了 /v1/responses ，返回 404 （ v0.1.7 修）。发布包没有打包 OpenSSL DLL ，用户需要手动装 OpenSSL 才能跑 llama-server （ v0.1.6 CI 改动，自动打包）。

再次感谢各位热心指导如有问题随时反馈我尽快优化

大模型

显卡

优化

84 条回复 • 2026-04-26 08:42:10 +08:00

KaiWuBOSS

1 天前

llmbbs.ai 欢迎交流。

zrlhk

1 天前

看起来显卡还是不够...:
本地大模型部署器 vv0.1.1 · llama.cpp b8864
by llmbbs.ai · 本地 AI 技术社区

[1/6] Probing hardware...
GPU: NVIDIA GeForce RTX 3080 (SM86, 10240 MB VRAM, 760 GB/s)
RAM: 47 GB UNKNOWN
OS: windows amd64

[2/6] Selecting configuration...
Model: Gemma 4 26B A4B It (moe, 19B total / 1B active)
Quant: Q3_K_S (11.4 GB)
Mode: moe_offload (experts on CPU)
Accel: Flash Attention

[3/6] Checking files...
Using bundled iso3 binary: llama-server-cuda.exe
Binary: llama-server-cuda.exe [cached]
Model: gemma-4-26B-A4B-it.Q3_K_S.gguf [cached]

[4/6] Preflight check...
✓ VRAM sufficient

[5/6] Warmup benchmark...
Probe 1: ctx=256K ... OOM
Probe 2: ctx=128K ... OOM
Probe 3: ctx=64K ... OOM
Probe 4: ctx=32K ... OOM
Probe 5: ctx=16K ... OOM
Probe 6: ctx=8K ... OOM
⚠️ Warmup failed: all ctx probes failed (tried down to 4K)
Using default parameters

[6/6] Starting server...
llama-server 不支持 iso3 ，回退到 q8_0/q4_0
Waiting for llama-server to be ready (port 11434)...
⚠️ 显存不足，降低上下文至 4K 重试...
Waiting for llama-server to be ready (port 11434)...
Error: failed to start llama-server: 连续 2 次启动失败，即使最小上下文(4K)也无法运行
建议：选择更小的量化或使用 MoE offload 模型
Usage:
kaiwu run <model> [flags]

Flags:
--bench Run benchmark after starting
--ctx-size int 手动指定上下文大小（ 0=自动）
--fast Skip warmup, use cached profile
-h, --help help for run
--reset 清除缓存，重新 warmup 探测最优参数

KaiWuBOSS

1 天前

哥您多大显存？

tangping

1 天前

试试去了 🙌

KaiWuBOSS

1 天前

我马上优化一版空了再试试 gemma4 支持 ios3 的呀判定有问题

zrlhk

1 天前

@KaiWuBOSS 是 win10, 显存 10G ，内存 48G ，我在下载 qwen3-30b 试试看

damontian

1 天前 via Android

大佬，16g 显存，64g 内存，跑哪个模型最合适？

KaiWuBOSS

1 天前

@zrlhk 我正在对你这个进行修复 1 你是正常 0.1.1 吗我看代码怎么显示你没编译涡轮量化 2 我回退策略太大了我调整一版我无论如何让你跑起来顺畅跑起来

KaiWuBOSS

1 天前

@damontian 直接上 30b 模型你选你喜欢的 50 系列看 nvfp 的

zrlhk

1 天前

@KaiWuBOSS 嗯，是 0.1.1 版本。是的，就是编译涡轮量化不知道怎么弄

KaiWuBOSS

1 天前

@damontian 换 Qwen3-30B-A3B
这个模型专为低显存优化
3080 10GB 跑起来没问题

ntdll

1 天前

看起来是 nVidia only

如果有用 AMD + Windows 的组合，可以尝试把 llama.cpp 的后端改成 vulkan ，会比 ROCm 的推理速度快上一档。在 Linux 上，我试下来是 ROCm 更快，但 Windows 相反。

KaiWuBOSS

1 天前

@zrlhk 我的错我的上传脚本有问题晚点推 0.1.2 你要方便可以试试 qwen3 应该没问题

sentinelK

1 天前

这个机制和 llama.cpp 的 -- fit on 区别是？

KaiWuBOSS

1 天前

@ntdll 是的得等新的 cude 现在只支持 n 卡 llama-server-cuda.exe：
用 CUDA 编译的，只能跑在 N 卡
Release 包里只有这一个版本

KaiWuBOSS

1 天前

@sentinelK 我也参考了他的 fiton 但他没有涡轮量化另外我还做了上下文优化相比而言我这个不用调参而且是硬件最大上下文最优显存
-fit on 是随机削层，Kaiwu 是精准分层。

--fit on：显存不够就把后面几层丢给 CPU ，
不管是什么层，速度损失大。

Kaiwu：专门识别 MoE 的专家层，
只把专家层放 CPU ，注意力层全在 GPU ，
速度损失极小——这就是为什么
同样 8GB 显存，Kaiwu 能跑出 21 tok/s ，
LM Studio 只有 3 tok/s 。

KaiWuBOSS

1 天前

0.1.1 版 ios3 脚本没上传上正在编译 0.1.2 估计三个小时后发布

KaiWuBOSS

1 天前

第一次发仓库项目没经验 😰

hongdengdao

1 天前

4060ti+5060ti 双卡没有识别出来,只出来了 4060ti

KaiWuBOSS

1 天前

@hongdengdao 奇怪我特意在我双 4090 电脑测试能识别的我去看看代码

KaiWuBOSS

1 天前

@hongdengdao 哥跑一下 nvidia-smi 看输出是一个显卡还是 2 个我这个读驱动的有代码支持的

hongdengdao

1 天前

hongdengdao

1 天前

.\kaiwu.exe run Qwen3.6-27B-Q4_K_M.gguf

██╗ ██╗ █████╗ ██╗██╗ ██╗██╗ ██╗
██║ ██╔╝██╔══██╗██║██║ ██║██║ ██║
█████╔╝ ███████║██║██║ █╗ ██║██║ ██║
██╔═██╗ ██╔══██║██║██║███╗██║██║ ██║
██║ ██╗██║ ██║██║╚███╔███╔╝╚██████╔╝
╚═╝ ╚═╝╚═╝ ╚═╝╚═╝ ╚══╝╚══╝ ╚═════╝
本地大模型部署器 vv0.1.1 · llama.cpp b8864
by llmbbs.ai · 本地 AI 技术社区

[1/6] Probing hardware...
GPU: NVIDIA GeForce RTX 4060 Ti (SM89, 16380 MB VRAM, 0 GB/s)
RAM: 61 GB DDR5
OS: windows amd64

[2/6] Selecting configuration...
Model: Qwen3.6-27B (dense, 28B)
Quant: Q4_K_M (15.7 GB)
Mode: full_gpu
Accel: Flash Attention

[3/6] Checking files...
Using bundled iso3 binary: llama-server-cuda.exe
Binary: llama-server-cuda.exe [cached]
Model: Qwen3.6-27B-Q4_K_M.gguf [cached]

[4/6] Preflight check...
✓ VRAM sufficient

[5/6] Warmup benchmark...
Probe 1: ctx=8K ... OOM
Probe 2: ctx=4K ... OOM
⚠️ Warmup failed: all ctx probes failed (tried down to 4K)
Using default parameters

[6/6] Starting server...
llama-server 不支持 iso3 ，回退到 q8_0/q4_0
Waiting for llama-server to be ready (port 11434)...
⚠️ 显存不足，降低上下文至 4K 重试...
Waiting for llama-server to be ready (port 11434)...
Error: failed to start llama-server: 连续 2 次启动失败，即使最小上下文(4K)也无法运行
建议：选择更小的量化或使用 MoE offload 模型
Usage:
kaiwu run <model> [flags]

pengzhizhuo

1 天前

这个咋样？

(base) PS E:\kaiwu-windows-amd64> .\kaiwu.exe run Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf

██╗ ██╗ █████╗ ██╗██╗ ██╗██╗ ██╗
██║ ██╔╝██╔══██╗██║██║ ██║██║ ██║
█████╔╝ ███████║██║██║ █╗ ██║██║ ██║
██╔═██╗ ██╔══██║██║██║███╗██║██║ ██║
██║ ██╗██║ ██║██║╚███╔███╔╝╚██████╔╝
╚═╝ ╚═╝╚═╝ ╚═╝╚═╝ ╚══╝╚══╝ ╚═════╝
本地大模型部署器 vv0.1.1 · llama.cpp b8864
by llmbbs.ai · 本地 AI 技术社区

[1/6] Probing hardware...
GPU: NVIDIA GeForce RTX 4060 Laptop GPU (SM89, 8188 MB VRAM, 0 GB/s)
RAM: 63 GB DDR5
OS: windows amd64

[2/6] Selecting configuration...
Model: Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive (moe, 36B total / 1B active)
Quant: Q4_K_M (19.7 GB)
Mode: moe_offload (experts on CPU)
Accel: Flash Attention

[3/6] Checking files...
Using bundled iso3 binary: llama-server-cuda.exe
Binary: llama-server-cuda.exe [cached]
Model: Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf [cached]

[4/6] Preflight check...
✓ VRAM sufficient

[5/6] Warmup benchmark...
Probe 1: ctx=256K ... 22.1 tok/s
Tune ubatch: ub=128 → 22.3 tok/s; ub=512 → 20.7 tok/s;
✓ 22.3 tok/s @ 256K ctx
Saved profile: C:\Users\pzz\.kaiwu\profiles\qwen3.6-35b-a3b-uncensored-hauhaucs-aggressive-q4_k_m_sm89_8188mb_ddr5.json
✓ 22.3 tok/s

[6/6] Starting server...
Waiting for llama-server to be ready (port 11434)...
llama-server started (PID 49380, port 11434)
Kaiwu proxy started (port 11435)
2026/04/24 22:03:23 Kaiwu proxy listening on :11435 → llama-server :11434

┌─────────────────────────────────────────────────┐
│ Ready — Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive @ 22.3 tok/s │
│ API: http://127.0.0.1:11435/v1/chat/completions │
│ 模型文件夹: E:\model │
└─────────────────────────────────────────────────┘

运行 kaiwu inject 接入 IDE · Ctrl+C 停止
─ 实时监控 · 空载 ─────────────────── 每 2s 刷新 ─
reuse:1024 · KV:q8_0 · 256K ctx · ub128 · mlock
速度显存内存 GPU 温度
— tok/s 5.5/8 GB 47.0/64 GB 2% 58°CC
[..........] [======....] [=======...] [..........] [=====.....]
─────────────────────────────────────────────────────────
上下文 [....................] 0.0K / 256K 余 256.0K

正在停止服务...
✓ llama-server 已停止
✓ Kaiwu proxy 已停止

KaiWuBOSS

1 天前

@hongdengdao 不好意思我对双显卡判定之前太简单现在优化了已经发布 0.1.2 了你看看还有没有问题下载后先 kaiwu run xx.gguf --reset 然后再跑应该没问题了

KaiWuBOSS

1 天前

产品最大担忧我让 tok/s 最优值设定在 20 左右就是让 kaiwu 在这个速度下寻找上下文和显存最优解如果用户显存有效又期盼高速度这个就不合适了。之所以 20 是因为网上说 20 就是甜区了，但我觉得有点慢了。

mingtdlb

1 天前

看抖音说，同样的配置，llama.cpp 跑比 ollama 跑吞吐更高？

osilinka

1 天前

现在不是有 turboquant 版本？
这个会不会效果更好

KaiWuBOSS

1 天前

@osilinka 还没吧，官方还没编译把已经出来了吗

KaiWuBOSS

1 天前

@mingtdlb ollama 用的是 llama 架构你可以试试 lm 也是这个架构体验更好但要自己调参

hongdengdao

1 天前

这个应该有点问题,双卡 32g 显存, 8k-64k 肯定是可以运行的

.\kaiwu.exe run Qwen3.6-27B-Q4_K_M.gguf

██╗ ██╗ █████╗ ██╗██╗ ██╗██╗ ██╗
██║ ██╔╝██╔══██╗██║██║ ██║██║ ██║
█████╔╝ ███████║██║██║ █╗ ██║██║ ██║
██╔═██╗ ██╔══██║██║██║███╗██║██║ ██║
██║ ██╗██║ ██║██║╚███╔███╔╝╚██████╔╝
╚═╝ ╚═╝╚═╝ ╚═╝╚═╝ ╚══╝╚══╝ ╚═════╝
本地大模型部署器 vv0.1.2 · llama.cpp b8864
by llmbbs.ai · 本地 AI 技术社区

[1/6] Probing hardware...
GPU: NVIDIA GeForce RTX 4060 Ti × 2 (SM89, 16380 MB VRAM each, 0 GB/s)
RAM: 61 GB DDR5
OS: windows amd64

[2/6] Selecting configuration...
Model: Qwen3.6-27B (dense, 28B)
Quant: Q4_K_M (15.7 GB)
Mode: full_gpu
Accel: Flash Attention

[3/6] Checking files...
Using bundled iso3 binary: llama-server-cuda.exe
Binary: llama-server-cuda.exe [cached]
Model: Qwen3.6-27B-Q4_K_M.gguf [cached]

[4/6] Preflight check...
✓ VRAM sufficient

[5/6] Warmup benchmark...
Probe 1: ctx=256K ... OOM
Probe 2: ctx=128K ... OOM
Probe 3: ctx=64K ... OOM
Probe 4: ctx=32K ... OOM
Probe 5: ctx=16K ... OOM
Probe 6: ctx=8K ... OOM
⚠️ Warmup failed: all ctx probes failed (tried down to 4K)
Using default parameters

[6/6] Starting server...
llama-server 不支持 iso3 ，回退到 q8_0/q4_0
Waiting for llama-server to be ready (port 11434)...
⚠️ 显存不足，降低上下文至 64K 重试...
Waiting for llama-server to be ready (port 11434)...
⚠️ 显存不足，降低上下文至 32K 重试...
Waiting for llama-server to be ready (port 11434)...
Error: failed to start llama-server: 3 次启动均失败，建议选择更小的模型
Usage:
kaiwu run <model> [flags],

coefu

1 天前

@KaiWuBOSS #25 单机多卡 tensor parallel + offload ，比单卡 + offload 复杂多了。

xJogger

1 天前 via Android

6G 显存加 40G 内存是不是告别比赛了

KaiWuBOSS

1 天前

@hongdengdao 已经定位问题了正在抢修感谢反馈

KaiWuBOSS

1 天前

@coefu 刚找 opus 拿到解决方案搜一下。结论很清楚，有三种方案，按稳定性排序。

---

## 最稳的方案：nvidia-smi XML 输出

NVIDIA 官方的 Go collector 就是用 `nvidia-smi -q -x` 的 XML 输出解析多卡信息，这种方式在所有驱动版本和 GPU 型号上输出格式一致。

```go
// 执行：nvidia-smi -q -x
// 输出是结构化 XML ，直接解析
type NvidiaSMI struct {
GPUs []GPU `xml:"gpu"`
}
type GPU struct {
Name string `xml:"product_name"`
Memory struct {
Total string `xml:"total"`
Free string `xml:"free"`
} `xml:"fb_memory_usage"`
ComputeCap string `xml:"compute_cap"`
}
```

优点：
- 不受 `\r\n` 影响
- 不受分隔符格式影响
- Tesla/Quadro/GeForce 全部统一格式
- 不需要 CGO ，纯 Go

---

## 最精确的方案：go-nvml

NVIDIA 官方维护的 Go NVML 绑定，直接调用 `libnvidia-ml.so`，不需要解析任何文本输出，API 向后兼容所有驱动版本。

```go
import "github.com/NVIDIA/go-nvml/pkg/nvml"

nvml.Init()
count, _ := nvml.DeviceGetCount()
for i := 0; i < count; i++ {
device, _ := nvml.DeviceGetHandleByIndex(i)
name, _ := device.GetName()
mem, _ := device.GetMemoryInfo()
// mem.Total, mem.Free, mem.Used 精确到字节
}
```

缺点：
- 目前只支持 Linux ，Windows 不支持。
- 需要 CGO ，编译复杂度增加
- 跨平台打包麻烦

---

## 结论

```
对 Kaiwu 的最优方案：

主路径：nvidia-smi -q -x （ XML 解析）
- Linux + Windows 都支持
- 不需要 CGO
- 一次改好，多卡识别永久稳定
- Kaiwu 的目标用户主要是 Windows

备用路径：go-nvml （仅 Linux ）
- 将来如果要精确读带宽、温度等
- 作为 Linux 上的增强路径

兜底：环境变量手动指定
KAIWU_GPUS="12288,12288,12288"
```

让 Opus 把 `probe_windows.go` 和 `probe_linux.go` 里的 csv 解析全部改成 XML 解析，这是一劳永逸的方案，之后所有多卡识别问题都解决了。

kevan

1 天前

大佬,问一下我 5070TI 16GB + 32GB 内存用哪个模型比较合适? 想用来跑小龙虾

gswgudujian

1 天前

llama-server.exe 能跑起来，Qwen3-Coder-30B-A3B-Instruct-Q3_K_S.gguf 使用自带下载的 Qwen3-30B-A3B-UD-Q3_K_XL.gguf 跑不动，按你介绍应该跑的更快才是，怎么都跑不起来，我使 4090 16G+64G 内存+13 代 I9 处理器;

kevan

1 天前

kaiwu.exe run Qwen3-30B-A3B

██╗ ██╗ █████╗ ██╗██╗ ██╗██╗ ██╗
██║ ██╔╝██╔══██╗██║██║ ██║██║ ██║
█████╔╝ ███████║██║██║ █╗ ██║██║ ██║
██╔═██╗ ██╔══██║██║██║███╗██║██║ ██║
██║ ██╗██║ ██║██║╚███╔███╔╝╚██████╔╝
╚═╝ ╚═╝╚═╝ ╚═╝╚═╝ ╚══╝╚══╝ ╚═════╝
本地大模型部署器 vv0.1.2 · llama.cpp b8864
by llmbbs.ai · 本地 AI 技术社区

[1/6] Probing hardware...
GPU: NVIDIA GeForce RTX 5070 Ti (SM120, 16303 MB VRAM, 0 GB/s)
RAM: 31 GB UNKNOWN
OS: windows amd64

[2/6] Selecting configuration...
Model: Qwen3-30B-A3B (moe, 30B total / 3B active)
Quant: ud-q3-k-xl (14.0 GB)
Mode: full_gpu
Accel: Flash Attention + MTP (native)

[3/6] Checking files...
Using bundled iso3 binary: llama-server-cuda.exe
Binary: llama-server-cuda.exe [cached]
Model: Qwen3-30B-A3B-UD-Q3_K_XL.gguf [cached]

[4/6] Preflight check...
llama-server 不支持 iso3 ，回退到 q8_0/q4_0
✓ VRAM sufficient

[5/6] Warmup benchmark...
Probe 1: ctx=8K ... OOM
Probe 2: ctx=4K ... OOM
⚠️ Warmup failed: all ctx probes failed (tried down to 4K)
Using default parameters

[6/6] Starting server...
Waiting for llama-server to be ready (port 11434)...
⚠️ 显存不足，降低上下文至 4K 重试...
Waiting for llama-server to be ready (port 11434)...
Error: failed to start llama-server: 连续 2 次启动失败，即使最小上下文(4K)也无法运行

NVIDIA GeForce RTX 5070 Ti: 16303 MB VRAM
模型 Qwen3-30B-A3B: ~14336 MB
KV cache (4K, q4_0): ~96 MB
预估总需: ~15456 MB

建议:
1. 选择更小的量化 (Q2_K)
2. 选择更小的模型
3. 使用 MoE offload 模型（ experts 放 CPU RAM ）
Usage:
kaiwu run <model> [flags]

Flags:
--bench Run benchmark after starting
--ctx-size int 手动指定上下文大小（ 0=自动）
--fast Skip warmup, use cached profile
-h, --help help for run
--llama-server string 使用自定义 llama-server 二进制（完整路径）
--reset 清除缓存，重新 warmup 探测最优参数

醉了?怎么和介绍的不一样呢?
介绍说 8GB 显存都能跑,我 16G 显存怎么不行啊?

rune15

1 天前

D:\AI\models>kaiwu run Qwen3-30B-A3B

██╗ ██╗ █████╗ ██╗██╗ ██╗██╗ ██╗
██║ ██╔╝██╔══██╗██║██║ ██║██║ ██║
█████╔╝ ███████║██║██║ █╗ ██║██║ ██║
██╔═██╗ ██╔══██║██║██║███╗██║██║ ██║
██║ ██╗██║ ██║██║╚███╔███╔╝╚██████╔╝
╚═╝ ╚═╝╚═╝ ╚═╝╚═╝ ╚══╝╚══╝ ╚═════╝
本地大模型部署器 vv0.1.1 · llama.cpp b8864
by llmbbs.ai · 本地 AI 技术社区

[1/6] Probing hardware...
GPU: NVIDIA GeForce RTX 4070 Ti (SM89, 12282 MB VRAM, 0 GB/s)
RAM: 31 GB DDR4
OS: windows amd64

[2/6] Selecting configuration...
Model: Qwen3-30B-A3B (moe, 30B total / 3B active)
Quant: ud-q3-k-xl (14.0 GB)
Mode: moe_offload (experts on CPU)
Accel: Flash Attention + MTP (native)

[3/6] Checking files...
Using bundled iso3 binary: llama-server-cuda.exe
Binary: llama-server-cuda.exe [cached]
Downloading model: Qwen3-30B-A3B-UD-Q3_K_XL.gguf
From: https://hf-mirror.com/unsloth/Qwen3-30B-A3B-GGUF/resolve/main/Qwen3-30B-A3B-UD-Q3_K_XL.gguf
Downloading 100% |████████████████████████████████████████████████████████████████████████████████████████████████████████████████| (14/14 GB, 25 MB/s) [9m10s:0s]
Model: Qwen3-30B-A3B-UD-Q3_K_XL.gguf [cached]

[4/6] Preflight check...
✓ VRAM sufficient

[5/6] Warmup benchmark...
Probe 1: ctx=128K ... OOM
Probe 2: ctx=64K ... OOM
Probe 3: ctx=32K ... OOM
Probe 4: ctx=16K ... OOM
Probe 5: ctx=8K ... OOM
⚠️ Warmup failed: all ctx probes failed (tried down to 4K)
Using default parameters

[6/6] Starting server...
llama-server 不支持 iso3 ，回退到 q8_0/q4_0
Waiting for llama-server to be ready (port 11434)...
⚠️ 显存不足，降低上下文至 4K 重试...
Waiting for llama-server to be ready (port 11434)...
Error: failed to start llama-server: 连续 2 次启动失败，即使最小上下文(4K)也无法运行
建议：选择更小的量化或使用 MoE offload 模型
Usage:
kaiwu run <model> [flags]

Flags:
--bench Run benchmark after starting
--ctx-size int 手动指定上下文大小（ 0=自动）
--fast Skip warmup, use cached profile
-h, --help help for run
--reset 清除缓存，重新 warmup 探测最优参数

我的 4070-Ti 也同样加载不了

KaiWuBOSS

1 天前

@kevan 我自己是 5060 显卡测试时候 jti 缓存过了你是新环境所以超时了。不好意思已经在新版本延长了时间。另外 50 系列显卡你去找 nvpf 的模型这类对 50 特别加速的感谢体验🙏🏻

KaiWuBOSS

1 天前

@rune15 哥不好意思 0.1.1 当时发布时候脚本上传把 iso3turbo 的 fork 没编译进去后来的版本都有了你能不能再试试最新的还不行你打我

KaiWuBOSS

1 天前

目前对多卡的尤其是不同多卡的支持还有些不足我看看怎么优化

KaiWuBOSS

1 天前

@rune15：升级到 v0.1.4 就好，irm https://raw.githubusercontent.com/val1813/kaiwu/main/install.ps1 | iex 然后 kaiwu run Qwen3-30B-A3B --reset 然后 kaiwu run 你的模型就行了

xnplus

1 天前

等我试试 10 年前的外星人，6950x+1080 dual ，再插个 4060 ，哈哈

kevan

1 天前 via Android

@KaiWuBOSS 本来没什么欲望的，看到你的介绍感觉焕发新生一样，傻瓜式安装，能不能推荐一下具体安装方法，你说的那些太资深

beginor

1 天前

新版本的 llama.cpp 支持 --cpu-moe 参数，即将所有的 MoE 权重放在 CPU ，是这个意思么？

https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md

KaiWuBOSS

1 天前

@kevan 打开 powershell 然后输入这段命令 irm https://raw.githubusercontent.com/val1813/kaiwu/main/install.ps1 | iex
然后等提示好了（约 700m 左右）然后你就 kaiwu run xx/xx.gguf(你的模型地址，必须是 gguf 格式)
最后会给你一段 127.0.0.1:***的链接就成功了你可以填到你自己的软件去或者 kaiwu injet codex 或者 claude code 直接用这些软件

KaiWuBOSS

1 天前

@beginor 支持的但他和 moeoffload 机制不完全一致欢迎尝试对比我测试环境太少

gcod

1 天前

PS C:\Windows\system32> kaiwu run Qwen3-1.7B --ctx-size 2048

██╗ ██╗ █████╗ ██╗██╗ ██╗██╗ ██╗
██║ ██╔╝██╔══██╗██║██║ ██║██║ ██║
█████╔╝ ███████║██║██║ █╗ ██║██║ ██║
██╔═██╗ ██╔══██║██║██║███╗██║██║ ██║
██║ ██╗██║ ██║██║╚███╔███╔╝╚██████╔╝
╚═╝ ╚═╝╚═╝ ╚═╝╚═╝ ╚══╝╚══╝ ╚═════╝
本地大模型部署器 vv0.1.4 · llama.cpp b8864
by llmbbs.ai · 本地 AI 技术社区

[1/6] Probing hardware...
GPU: NVIDIA GeForce GTX 1660 Ti (SM75, 6144 MB VRAM, 0 GB/s)
RAM: 31 GB DDR4
OS: windows amd64

[2/6] Selecting configuration...
Model: Qwen3-1.7B (dense, 2B)
Quant: q5-k-m (1.2 GB)
Mode: full_gpu
Accel: Flash Attention

[3/6] Checking files...
Using bundled iso3 binary: llama-server-cuda.exe
Binary: llama-server-cuda.exe [cached]
Model: Qwen3-1.7B-Q5_K_M.gguf [cached]

[4/6] Preflight check...
llama-server 不支持 iso3 (或首次 JIT 编译超时)，回退到 q8_0/q4_0
✓ VRAM sufficient

[5/6] Warmup benchmark...
用户指定 ctx=2048 ，跳过缓存
User override: ctx=2K ... ⚠️ Warmup failed: user-specified ctx=2K failed to start (OOM?)
Using default parameters

[6/6] Starting server...
Waiting for llama-server to be ready (port 11434)...
⚠️ 显存不足，降低上下文至 4K 重试...
Waiting for llama-server to be ready (port 11434)...
Error: failed to start llama-server: 连续 2 次启动失败，即使最小上下文(4K)也无法运行

NVIDIA GeForce GTX 1660 Ti: 6144 MB VRAM
模型 Qwen3-1.7B: ~1228 MB
KV cache (4K, q4_0): ~112 MB
预估总需: ~2364 MB

建议:
1. 选择更小的量化 (Q2_K)
2. 选择更小的模型
3. 使用 MoE offload 模型（ experts 放 CPU RAM ）
Usage:
kaiwu run <model> [flags]

Flags:
--bench Run benchmark after starting
--ctx-size int 手动指定上下文大小（ 0=自动）
--fast Skip warmup, use cached profile
-h, --help help for run
--llama-server string 使用自定义 llama-server 二进制（完整路径）
--reset 清除缓存，重新 warmup 探测最优参数

7 年前的老机子了 1660 Ti😮‍💨

KaiWuBOSS

1 天前

@gcod 感谢反馈之前在一个个修 n 卡的 sm 版本问题马上上 0.1.5 版本把所有 sm 版本一次解决就能跑了还能再试试 4b

KaiWuBOSS

1 天前

0.1.5 已经更新修复 n 卡 sm 版本适应性问题另外读取带宽对小旧老显卡进行了优化

poorcai

1 天前

大佬，24GB 内存的 mac mini m4 适合跑什么模型？主要是平时还用来做开发。

rune15

1 天前

@KaiWuBOSS 收到，我再试试

xnplus

23 小时 35 分钟前

自动下载的模型，怎么知道列表？

stefwoo

23 小时 21 分钟前

很棒的工作，回家试试我的 3090

ideniece

23 小时 17 分钟前

我在运行本地模型，为什么去提示去下载？

kaiwu.exe run .\Qwen3.5-35B-A3B-Q4_K_M.gguf
[2/6] Selecting configuration...
Model: Qwen3.5-35B-A3B (moe, 37B total / 1B active)
Quant: Q4_K_M (20.5 GB)
Mode: moe_offload (experts on CPU)
Accel: Flash Attention + SWA-Full (hybrid arch)

[3/6] Checking files...
Using bundled iso3 binary: llama-server-cuda.exe
Binary: llama-server-cuda.exe [cached]
Downloading model: Qwen3.5-35B-A3B-Q4_K_M.gguf
From: https://hf-mirror.com//resolve/main/Qwen3.5-35B-A3B-Q4_K_M.gguf
Error: failed to ensure model file: failed to download model: failed to download: Get "https://hf-mirror.com//resolve/main/Qwen3.5-35B-A3B-Q4_K_M.gguf": EOF

ideniece

23 小时 11 分钟前

没事了。
@ideniece #56

kevan

22 小时 37 分钟前

@KaiWuBOSS 大佬,为什么我下载 0.1.6 版本还是不行啊??????

>kaiwu run Qwen3-30B-A3B-UD-Q3_K_XL.gguf --reset

██╗ ██╗ █████╗ ██╗██╗ ██╗██╗ ██╗
██║ ██╔╝██╔══██╗██║██║ ██║██║ ██║
█████╔╝ ███████║██║██║ █╗ ██║██║ ██║
██╔═██╗ ██╔══██║██║██║███╗██║██║ ██║
██║ ██╗██║ ██║██║╚███╔███╔╝╚██████╔╝
╚═╝ ╚═╝╚═╝ ╚═╝╚═╝ ╚══╝╚══╝ ╚═════╝
本地大模型部署器 vv0.1.6 · llama.cpp b8864
by llmbbs.ai · 本地 AI 技术社区

[1/6] Probing hardware...
GPU: NVIDIA GeForce RTX 5070 Ti (SM120, 16303 MB VRAM, 896 GB/s)
RAM: 31 GB UNKNOWN
OS: windows amd64
⚠️ CUDA 13.2 detected — known bug with low-bit quantization
If you see garbled output, downgrade driver to CUDA 13.1
Warning: RTX 50 series with CUDA 13.2 detected
Kaiwu will use CUDA 12.4 binary for stability.

[2/6] Selecting configuration...
Model: Qwen3-30B-A3B (moe, 29B total / 2B active)
Quant: Q3_K_M (12.9 GB)
Mode: full_gpu
Accel: Flash Attention

[3/6] Checking files...
Using bundled iso3 binary: llama-server-cuda.exe
Binary: llama-server-cuda.exe [cached]
Model: Qwen3-30B-A3B-UD-Q3_K_XL.gguf [cached]

[4/6] Preflight check...
⚠ RTX 50 系首次启动需要 JIT 编译 (~30s)，请稍候...
llama-server 不支持 iso3 ，回退到 q8_0/q4_0
✓ VRAM sufficient

[5/6] Warmup benchmark...
已清除缓存，重新探测
Probe 1: ctx=8K ... OOM
Probe 2: ctx=4K ... OOM
⚠️ Warmup failed: all ctx probes failed (tried down to 4K)
Using default parameters

[6/6] Starting server...
Waiting for llama-server to be ready (port 11434)...
⚠️ 显存不足，降低上下文至 4K 重试...
Waiting for llama-server to be ready (port 11434)...
Error: failed to start llama-server: 连续 2 次启动失败，即使最小上下文(4K)也无法运行

NVIDIA GeForce RTX 5070 Ti: 16303 MB VRAM
模型 Qwen3-30B-A3B: ~13189 MB
KV cache (4K, q4_0): ~96 MB
预估总需: ~14309 MB

建议:
1. 选择更小的量化 (Q4_K_M 或 Q2_K)
2. 选择更小的模型

Usage:
kaiwu run <model> [flags]

Flags:
--bench Run benchmark after starting
--ctx-size int 手动指定上下文大小（ 0=自动）
--fast Skip warmup, use cached profile
-h, --help help for run
--llama-server string 使用自定义 llama-server 二进制（完整路径）
--reset 清除缓存，重新 warmup 探测最优参数

C:\Kevan\AI\kaiwu-windows-amd64>

ImINH

22 小时 17 分钟前

llmbbs 论坛 bug 有点多，楼主方便拉个群组沟通吗？

gcod

21 小时 41 分钟前

@KaiWuBOSS 4B 果然也不行...

PS C:\Windows\system32> irm https://raw.githubusercontent.com/val1813/kaiwu/main/install.ps1 | iex
Kaiwu Installer
===============

Detected: windows/amd64
Fetching latest release...
Latest version: v0.1.6
Downloading https://github.com/val1813/kaiwu/releases/download/v0.1.6/kaiwu-windows-amd64.zip...

Kaiwu installed successfully!

Kaiwu v0.1.6

Get started:
kaiwu run Qwen3-30B-A3B

Note: restart your terminal for PATH changes to take effect.
PS C:\Windows\system32> kaiwu run Qwen3-4B

██╗ ██╗ █████╗ ██╗██╗ ██╗██╗ ██╗
██║ ██╔╝██╔══██╗██║██║ ██║██║ ██║
█████╔╝ ███████║██║██║ █╗ ██║██║ ██║
██╔═██╗ ██╔══██║██║██║███╗██║██║ ██║
██║ ██╗██║ ██║██║╚███╔███╔╝╚██████╔╝
╚═╝ ╚═╝╚═╝ ╚═╝╚═╝ ╚══╝╚══╝ ╚═════╝
本地大模型部署器 vv0.1.6 · llama.cpp b8864
by llmbbs.ai · 本地 AI 技术社区

[1/6] Probing hardware...
GPU: NVIDIA GeForce GTX 1660 Ti (SM75, 6144 MB VRAM, 288 GB/s)
RAM: 31 GB DDR4
OS: windows amd64

[2/6] Selecting configuration...
Model: Qwen3-4B (dense, 4B)
Quant: q5-k-m (2.8 GB)
Mode: full_gpu
Accel: Flash Attention

[3/6] Checking files...
Using bundled iso3 binary: llama-server-cuda.exe
Binary: llama-server-cuda.exe [cached]
Model: Qwen3-4B-Q5_K_M.gguf [cached]

[4/6] Preflight check...
llama-server 不支持 iso3 ，回退到 q8_0/q4_0
✓ VRAM sufficient

[5/6] Warmup benchmark...
Probe 1: ctx=8K ... OOM
Probe 2: ctx=4K ... OOM
⚠️ Warmup failed: all ctx probes failed (tried down to 4K)
Using default parameters

[6/6] Starting server...
Waiting for llama-server to be ready (port 11434)...
⚠️ 显存不足，降低上下文至 4K 重试...
Waiting for llama-server to be ready (port 11434)...
Error: failed to start llama-server: 连续 2 次启动失败，即使最小上下文(4K)也无法运行

NVIDIA GeForce GTX 1660 Ti: 6144 MB VRAM
模型 Qwen3-4B: ~2867 MB
KV cache (4K, q4_0): ~112 MB
预估总需: ~4003 MB

建议:
1. 运行 kaiwu run qwen3-4b --reset 重新探测参数
2. 模型较小但仍 OOM ，可能是参数配置问题，请升级到最新版本

Usage:
kaiwu run <model> [flags]

Flags:
--bench Run benchmark after starting
--ctx-size int 手动指定上下文大小（ 0=自动）
--fast Skip warmup, use cached profile
-h, --help help for run
--llama-server string 使用自定义 llama-server 二进制（完整路径）
--reset 清除缓存，重新 warmup 探测最优参数

PS C:\Windows\system32> kaiwu run qwen3-4b --reset

██╗ ██╗ █████╗ ██╗██╗ ██╗██╗ ██╗
██║ ██╔╝██╔══██╗██║██║ ██║██║ ██║
█████╔╝ ███████║██║██║ █╗ ██║██║ ██║
██╔═██╗ ██╔══██║██║██║███╗██║██║ ██║
██║ ██╗██║ ██║██║╚███╔███╔╝╚██████╔╝
╚═╝ ╚═╝╚═╝ ╚═╝╚═╝ ╚══╝╚══╝ ╚═════╝
本地大模型部署器 vv0.1.6 · llama.cpp b8864
by llmbbs.ai · 本地 AI 技术社区

[1/6] Probing hardware...
GPU: NVIDIA GeForce GTX 1660 Ti (SM75, 6144 MB VRAM, 288 GB/s)
RAM: 31 GB DDR4
OS: windows amd64

[2/6] Selecting configuration...
Model: Qwen3-4B (dense, 4B)
Quant: q5-k-m (2.8 GB)
Mode: full_gpu
Accel: Flash Attention

[3/6] Checking files...
Using bundled iso3 binary: llama-server-cuda.exe
Binary: llama-server-cuda.exe [cached]
Model: Qwen3-4B-Q5_K_M.gguf [cached]

[4/6] Preflight check...
llama-server 不支持 iso3 ，回退到 q8_0/q4_0
✓ VRAM sufficient

[5/6] Warmup benchmark...
已清除缓存，重新探测
Probe 1: ctx=8K ... OOM
Probe 2: ctx=4K ... OOM
⚠️ Warmup failed: all ctx probes failed (tried down to 4K)
Using default parameters

[6/6] Starting server...
Waiting for llama-server to be ready (port 11434)...
⚠️ 显存不足，降低上下文至 4K 重试...
Waiting for llama-server to be ready (port 11434)...
Error: failed to start llama-server: 连续 2 次启动失败，即使最小上下文(4K)也无法运行

NVIDIA GeForce GTX 1660 Ti: 6144 MB VRAM
模型 Qwen3-4B: ~2867 MB
KV cache (4K, q4_0): ~112 MB
预估总需: ~4003 MB

建议:
1. 运行 kaiwu run qwen3-4b --reset 重新探测参数
2. 模型较小但仍 OOM ，可能是参数配置问题，请升级到最新版本

Usage:
kaiwu run <model> [flags]

Flags:
--bench Run benchmark after starting
--ctx-size int 手动指定上下文大小（ 0=自动）
--fast Skip warmup, use cached profile
-h, --help help for run
--llama-server string 使用自定义 llama-server 二进制（完整路径）
--reset 清除缓存，重新 warmup 探测最优参数

CFM880

21 小时 31 分钟前

PS C:\Users\cfm880\Downloads> kaiwu run .\Qwen3-Coder-30B-APEX-I-Quality.gguf

██╗ ██╗ █████╗ ██╗██╗ ██╗██╗ ██╗
██║ ██╔╝██╔══██╗██║██║ ██║██║ ██║
█████╔╝ ███████║██║██║ █╗ ██║██║ ██║
██╔═██╗ ██╔══██║██║██║███╗██║██║ ██║
██║ ██╗██║ ██║██║╚███╔███╔╝╚██████╔╝
╚═╝ ╚═╝╚═╝ ╚═╝╚═╝ ╚══╝╚══╝ ╚═════╝
本地大模型部署器 vv0.1.6 · llama.cpp b8864
by llmbbs.ai · 本地 AI 技术社区

[1/6] Probing hardware...
GPU: NVIDIA GeForce RTX 3080 Laptop GPU (SM86, 16384 MB VRAM, 760 GB/s)
RAM: 63 GB DDR4
OS: windows amd64
⚠️ CUDA 13.2 detected — known bug with low-bit quantization
If you see garbled output, downgrade driver to CUDA 13.1

[2/6] Selecting configuration...
Model: Qwen3-Coder-30B-A3B-Instruct (moe, 22B total / 1B active)
Quant: Q6_K (18.1 GB)
Mode: moe_offload (experts on CPU)
Accel: Flash Attention

[3/6] Checking files...
Using bundled iso3 binary: llama-server-cuda.exe
Binary: llama-server-cuda.exe [cached]
Model: Qwen3-Coder-30B-APEX-I-Quality.gguf [cached]

[4/6] Preflight check...
✓ VRAM sufficient

[5/6] Warmup benchmark...
Probe 1: ctx=128K ... 13.6 tok/s (< 18, too slow)
Probe 2: ctx=64K ... 14.6 tok/s (< 18, too slow)
Probe 3: ctx=32K ... 15.7 tok/s (< 18, too slow)
Probe 4: ctx=16K ... 14.8 tok/s (< 18, too slow)
Probe 5: ctx=8K ... 13.8 tok/s (< 18, too slow)
Tune ubatch: ub=128 → 14.5 tok/s; ub=512 → 14.6 tok/s;
✓ 14.6 tok/s @ 32K ctx
Saved profile: C:\Users\cfm880\.kaiwu\profiles\qwen3-coder-30b-apex-i-quality_sm86_16384mb_ddr4.json
✓ 14.6 tok/s

[6/6] Starting server...
Waiting for llama-server to be ready (port 11434)...
llama-server started (PID 17428, port 11434)
Kaiwu proxy started (port 11435)

┌─────────────────────────────────────────────────┐
2026/04/25 14:35:09 Kaiwu proxy listening on :11435 → llama-server :11434
│ Ready — Qwen3-Coder-30B-A3B-Instruct @ 14.6 tok/s │
│ API: http://127.0.0.1:11435/v1/chat/completions │
│ 模型文件夹: C:\Users\cfm880\.kaiwu\models │
└─────────────────────────────────────────────────┘

运行 kaiwu inject 接入 IDE · Ctrl+C 停止
─ 实时监控 · 空载 ─────────────────── 每 2s 刷新 ─
reuse:1024 · KV:f16 · 32K ctx · ub512 · mlock
速度显存内存 GPU 温度
— tok/s 6.4/16 GB 30.4/64 GB 0% 50°CC
[..........] [====......] [====......] [..........] [=====.....]
─────────────────────────────────────────────────────────
上下文 [....................] 0.0K / 32K 余 32.0K

CFM880

20 小时 57 分钟前

■ unexpected status 404 Not Found: 404 page not found, url: http://127.0.0.1:11435/responses

lizhenda

19 小时 43 分钟前

干货满满，好贴要收藏起来！

diudiuu

19 小时 9 分钟前

这个 dgx spark 带宽就是 274g,比如部署 gemma4 31b 16bf 的,理论值也就 4token/s,我用 llama.cpp 部署也就到达 2.5token/s,这个是靠什么优化的.

还是我理解的不对

KaiWuBOSS

17 小时 57 分钟前

@diudiuu 请问是用过 kaiwu 对比过的吗

KaiWuBOSS

17 小时 56 分钟前

@ImINH 嗯 ai 自己写的没用别人的架构一直在修有兴趣参与管理吗

ImINH

17 小时 10 分钟前

@KaiWuBOSS 管理可能没有精力，但是可以在论坛分享本地部署的东西，因为我们的产品也是本地 AI 的方案。所以可以论坛上放个社群的链接，比如 tg 啥的，沟通方便一些，遇到网站的 bug 也能及时反馈。

shen09darkareas

17 小时 10 分钟前

windows 有些电脑要安装 Win64OpenSSL-3_6_2.exe 这个依赖，不然跑不起来

kubecoder

16 小时 8 分钟前

PS E:\kaiwu> kaiwu run qwen3-4b --reset

██╗ ██╗ █████╗ ██╗██╗ ██╗██╗ ██╗
██║ ██╔╝██╔══██╗██║██║ ██║██║ ██║
█████╔╝ ███████║██║██║ █╗ ██║██║ ██║
██╔═██╗ ██╔══██║██║██║███╗██║██║ ██║
██║ ██╗██║ ██║██║╚███╔███╔╝╚██████╔╝
╚═╝ ╚═╝╚═╝ ╚═╝╚═╝ ╚══╝╚══╝ ╚═════╝
本地大模型部署器 vv0.1.5 · llama.cpp b8864
by llmbbs.ai · 本地 AI 技术社区

[1/6] Probing hardware...
GPU: NVIDIA GeForce RTX 3060 (SM86, 12288 MB VRAM, 360 GB/s)
RAM: 31 GB DDR4
OS: windows amd64

[2/6] Selecting configuration...
Model: Qwen3-4B (dense, 4B)
Quant: q8-0 (4.4 GB)
Mode: full_gpu
Accel: Flash Attention

[3/6] Checking files...
Using bundled iso3 binary: llama-server-cuda.exe
Binary: llama-server-cuda.exe [cached]
Model: Qwen3-4B-Q8_0.gguf [cached]

[4/6] Preflight check...
llama-server 不支持 iso3 ，回退到 q8_0/q4_0
✓ VRAM sufficient

[5/6] Warmup benchmark...
已清除缓存，重新探测
Probe 1: ctx=32K ... OOM
Probe 2: ctx=16K ... OOM
Probe 3: ctx=8K ... OOM
⚠️ Warmup failed: all ctx probes failed (tried down to 4K)
Using default parameters

[6/6] Starting server...
Waiting for llama-server to be ready (port 11434)...
⚠️ 显存不足，降低上下文至 16K 重试...
Waiting for llama-server to be ready (port 11434)...
⚠️ 显存不足，降低上下文至 8K 重试...
Waiting for llama-server to be ready (port 11434)...
Error: failed to start llama-server: 连续 3 次启动失败，即使最小上下文(4K)也无法运行

NVIDIA GeForce RTX 3060: 12288 MB VRAM
模型 Qwen3-4B: ~4505 MB
KV cache (4K, q4_0): ~144 MB
预估总需: ~5673 MB

建议:
1. 运行 kaiwu run qwen3-4b --reset 重新探测参数
2. 模型较小但仍 OOM ，可能是参数配置问题，请升级到最新版本

Usage:
kaiwu run <model> [flags]

Flags:
--bench Run benchmark after starting
--ctx-size int 手动指定上下文大小（ 0=自动）
--fast Skip warmup, use cached profile
-h, --help help for run
--llama-server string 使用自定义 llama-server 二进制（完整路径）
--reset 清除缓存，重新 warmup 探测最优参数

PS E:\kaiwu>

shen09darkareas

15 小时 54 分钟前

@kubecoder 试着安装 Win64OpenSSL-3_6_2.exe 依赖，https://slproweb.com/products/Win32OpenSSL.html 这个项目 windows 缺这个

olivergrace006

15 小时 30 分钟前 via Android

好久没有看到这么硬核有用的技术贴了，楼主牛逼

KaiWuBOSS

15 小时 15 分钟前

@kubecoder 哥你重新更新下 0.1.6 看看问题还在不记得再次使用记得要 reset

KaiWuBOSS

15 小时 13 分钟前

@shen09darkareas 谢谢提醒我马上把 dll 加进去就不折腾用户了

KaiWuBOSS

15 小时 4 分钟前

@CFM880 url: http://127.0.0.1:11435/responses

这是 OpenAI 新版 API 的端点
Responses API （ 2025 年新增）
用于流式响应的新格式

Kaiwu 的 proxy 只实现了：
/v1/chat/completions ✅
/v1/models ✅

没有实现：
/responses ❌

用户用的客户端（可能是新版 Cursor 或 Claude Code ）
在调用新的 /responses 端点
Kaiwu proxy 不认识这个路径，返回 404 我马上来优化麻烦看到 0.1.7 发布后再试试谢谢了

KaiWuBOSS

14 小时 57 分钟前

@ImINH 嗯很好的建议之前就希望有专家能给建议我周一来整我还不知道什么 tg 我还只会 qq.vx

ravecn2014

14 小时 33 分钟前

本地大模型部署器 vv0.1.6 · llama.cpp b8864
by llmbbs.ai · 本地 AI 技术社区

[1/6] Probing hardware...
GPU: NVIDIA GeForce RTX 3090 × 2 (SM86, 24576 MB VRAM each, 936 GB/s)
RAM: 251 GB DDR4
OS: linux amd64

[2/6] Selecting configuration...
Model: Qwen3.6-35B-A3B (moe, 35B total / 3B active)
Quant: ud-q5-k-xl (25.0 GB)
Mode: full_gpu
Accel: Flash Attention + MTP (native) + NVLink + SWA-Full (hybrid arch)

[3/6] Checking files...
Using bundled iso3 binary: llama-server-cuda
Binary: llama-server-cuda [cached]
Model: Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf [cached]

[4/6] Preflight check...
✓ VRAM sufficient

[5/6] Warmup benchmark...
Probe 1: ctx=256K ... OOM
Probe 2: ctx=128K ... OOM
Probe 3: ctx=64K ... OOM
Probe 4: ctx=32K ... OOM
Probe 5: ctx=16K ... OOM
Probe 6: ctx=8K ... OOM
⚠️ Warmup failed: all ctx probes failed (tried down to 4K)
Using default parameters

[6/6] Starting server...
Waiting for llama-server to be ready (port 11434)...
⚠️ 显存不足，降低上下文至 128K 重试...
Waiting for llama-server to be ready (port 11434)...
⚠️ 显存不足，降低上下文至 64K 重试...
Waiting for llama-server to be ready (port 11434)...
Error: failed to start llama-server: 3 次启动均失败，建议选择更小的模型
Usage:
kaiwu run <model> [flags]

Flags:
--bench Run benchmark after starting
--ctx-size int 手动指定上下文大小（ 0=自动）
--fast Skip warmup, use cached profile
-h, --help help for run
--llama-server string 使用自定义 llama-server 二进制（完整路径）
--reset 清除缓存，重新 warmup 探测最优参数

KaiWuBOSS

14 小时 25 分钟前

@ravecn2014 仍然是多显卡问题这方面还得再优化我想想有没有更好方法

diudiuu

14 小时 8 分钟前

@KaiWuBOSS 没有用 kuaiwu,我给的参数是实际自己部署的

我就是好奇这个已经是带宽的最大能力了,还能有优化空间吗?

有什么新的思路

KaiWuBOSS

14 小时 0 分钟前

@ravecn2014 claude 发现个方法我马上试试:
v0.1.9 — 多卡 tensor split 优化多卡 tensor split 从纯按显存比例改为按显存×带宽加权。异构多卡（如 3090+4090+5060 ）分配更合理——弱卡少分层，不拖慢整体多卡显示改为逐卡列出（型号、显存、带宽、分配比例）
--fit on 现在对 full_gpu 和 moe_offload 两种模式都无条件启用（之前 fallback 路径的 moe_offload 漏了）加速特性显示新增 tensor split 比例（多卡无 NVLink 时）

老师麻烦等我 0.1.9 编译好发布再测试一遍应该能好如果不行告诉我我跟进

osilinka

13 小时 44 分钟前

很牛，感觉可以建个公司找风投了！

KaiWuBOSS

13 小时 34 分钟前

@diudiuu 你的分析基本正确：
带宽是 decode 阶段的瓶颈
31B dense bf16 理论值就是 4-5 tok/s
llama.cpp 跑到 2.5 tok/s 是正常的（未充分优化）

但有两个方向可以突破：

1. 换 Gemma 4 26B MoE 版本
同等文件大小，速度快 6 倍（实测 70 tok/s ）
因为每次 token 只激活 4B 参数

2. 降量化
BF16 → Q4_K_M：约 11 tok/s
BF16 → NVFP4 （ DGX Spark 支持）：约 52 tok/s

Kaiwu 的原理就是自动做这些判断：
识别 dense vs MoE
根据带宽选最优量化
找到速度/质量/上下文的最优平衡

对 DGX Spark 这种统一内存架构
Kaiwu 会把它当高带宽设备处理
自动选更高精度的量化（不需要 q4_0 ）

coefu

12 小时 8 分钟前

@KaiWuBOSS #35 单机多卡 llama.cpp row 模式，对多卡是要做按比例 row split 的，如果 N 张卡都放不下整个，offload 到 cpu/mem 的部分，这之间怎么配比，你这个方案里没有。

比如，1 2080 11G ，1 3090 24G ，跑 unsloth/Qwen3.6-35B-A3B-GGUF Q8 ，38.5G ，多卡 parallel ，类似这种切分的解。

我阐述的问题是，N 不同的情况下，可能没有通用解。每个 N ，都是单独的。

coefu

12 小时 6 分钟前

哥们儿的态度，我很欣赏。

KaiWuBOSS

3 小时 32 分钟前

@coefu 这个问题之前确实没想过你提示很到位，我刚搜了下，说 lammacpp 也回复说自己也没搞定。我想了下，cpu 不应该只做存放，应该也要做运算，–– cpu-moe 是支持的。我们计划后面版本验证下，如果 cpu 计算后丢给 gpu 能不能提速，如果最小验证成功我们就上线，具体：
attention 层 → GPU （计算密集）
MoE expert → CPU （并行激活，利用多核）
KV cache 管理 → CPU 异步处理
三者同时跑，不互相等待。现在只是思路，后面看最小验证成功就能上线。