V2EX › hongdengdao 的所有回复 › 第 1 页 / 共 7 页

V2EX = way to explore

V2EX 是一个关于分享和探索的地方

For Existing Member Sign In

1 2 3 4 5 6 7

❮

❯

4 月 24 日

回复了 KaiWuBOSS 创建的主题 › Local LLM › 我做了个工具让 8GB 显卡跑 30B 模型从 3 tok/s 提到 21 tok/s，记录一下技术发现

这个应该有点问题,双卡 32g 显存, 8k-64k 肯定是可以运行的

.\kaiwu.exe run Qwen3.6-27B-Q4_K_M.gguf

██╗ ██╗ █████╗ ██╗██╗ ██╗██╗ ██╗
██║ ██╔╝██╔══██╗██║██║ ██║██║ ██║
█████╔╝ ███████║██║██║ █╗ ██║██║ ██║
██╔═██╗ ██╔══██║██║██║███╗██║██║ ██║
██║ ██╗██║ ██║██║╚███╔███╔╝╚██████╔╝
╚═╝ ╚═╝╚═╝ ╚═╝╚═╝ ╚══╝╚══╝ ╚═════╝
本地大模型部署器 vv0.1.2 · llama.cpp b8864
by llmbbs.ai · 本地 AI 技术社区

[1/6] Probing hardware...
GPU: NVIDIA GeForce RTX 4060 Ti × 2 (SM89, 16380 MB VRAM each, 0 GB/s)
RAM: 61 GB DDR5
OS: windows amd64

[2/6] Selecting configuration...
Model: Qwen3.6-27B (dense, 28B)
Quant: Q4_K_M (15.7 GB)
Mode: full_gpu
Accel: Flash Attention

[3/6] Checking files...
Using bundled iso3 binary: llama-server-cuda.exe
Binary: llama-server-cuda.exe [cached]
Model: Qwen3.6-27B-Q4_K_M.gguf [cached]

[4/6] Preflight check...
✓ VRAM sufficient

[5/6] Warmup benchmark...
Probe 1: ctx=256K ... OOM
Probe 2: ctx=128K ... OOM
Probe 3: ctx=64K ... OOM
Probe 4: ctx=32K ... OOM
Probe 5: ctx=16K ... OOM
Probe 6: ctx=8K ... OOM
⚠️ Warmup failed: all ctx probes failed (tried down to 4K)
Using default parameters

[6/6] Starting server...
llama-server 不支持 iso3 ，回退到 q8_0/q4_0
Waiting for llama-server to be ready (port 11434)...
⚠️ 显存不足，降低上下文至 64K 重试...
Waiting for llama-server to be ready (port 11434)...
⚠️ 显存不足，降低上下文至 32K 重试...
Waiting for llama-server to be ready (port 11434)...
Error: failed to start llama-server: 3 次启动均失败，建议选择更小的模型
Usage:
kaiwu run <model> [flags],

4 月 24 日

回复了 KaiWuBOSS 创建的主题 › Local LLM › 我做了个工具让 8GB 显卡跑 30B 模型从 3 tok/s 提到 21 tok/s，记录一下技术发现

.\kaiwu.exe run Qwen3.6-27B-Q4_K_M.gguf

██╗ ██╗ █████╗ ██╗██╗ ██╗██╗ ██╗
██║ ██╔╝██╔══██╗██║██║ ██║██║ ██║
█████╔╝ ███████║██║██║ █╗ ██║██║ ██║
██╔═██╗ ██╔══██║██║██║███╗██║██║ ██║
██║ ██╗██║ ██║██║╚███╔███╔╝╚██████╔╝
╚═╝ ╚═╝╚═╝ ╚═╝╚═╝ ╚══╝╚══╝ ╚═════╝
本地大模型部署器 vv0.1.1 · llama.cpp b8864
by llmbbs.ai · 本地 AI 技术社区

[1/6] Probing hardware...
GPU: NVIDIA GeForce RTX 4060 Ti (SM89, 16380 MB VRAM, 0 GB/s)
RAM: 61 GB DDR5
OS: windows amd64

[2/6] Selecting configuration...
Model: Qwen3.6-27B (dense, 28B)
Quant: Q4_K_M (15.7 GB)
Mode: full_gpu
Accel: Flash Attention

[3/6] Checking files...
Using bundled iso3 binary: llama-server-cuda.exe
Binary: llama-server-cuda.exe [cached]
Model: Qwen3.6-27B-Q4_K_M.gguf [cached]

[4/6] Preflight check...
✓ VRAM sufficient

[5/6] Warmup benchmark...
Probe 1: ctx=8K ... OOM
Probe 2: ctx=4K ... OOM
⚠️ Warmup failed: all ctx probes failed (tried down to 4K)
Using default parameters

[6/6] Starting server...
llama-server 不支持 iso3 ，回退到 q8_0/q4_0
Waiting for llama-server to be ready (port 11434)...
⚠️ 显存不足，降低上下文至 4K 重试...
Waiting for llama-server to be ready (port 11434)...
Error: failed to start llama-server: 连续 2 次启动失败，即使最小上下文(4K)也无法运行
建议：选择更小的量化或使用 MoE offload 模型
Usage:
kaiwu run <model> [flags]

4 月 24 日

回复了 KaiWuBOSS 创建的主题 › Local LLM › 我做了个工具让 8GB 显卡跑 30B 模型从 3 tok/s 提到 21 tok/s，记录一下技术发现