llama-bench -m <model path>
| test \ backend (t/s) | ipex-llm | sycl | vulkan |
|---|---|---|---|
| pp512 | 458.82 | 192.35 | 64.35 |
| tg128 | 7.09 | 6.55 | 11.60 |
RTX 4090 (driver 572.60) + X670E, llama-b4820-bin-win-cuda-cu12.4-x64
pp512: 2291.15 t/s, tg128: 40.55 t/s
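To make the numbers above easier to compare, here is a minimal Python sketch that divides each backend's throughput by a chosen baseline. The figures are copied from the results above; the `speedup` helper and the dictionary layout are my own illustration, not part of llama-bench.

```python
# Throughput in tokens/s, copied verbatim from the results above.
results = {
    "ipex-llm": {"pp512": 458.82, "tg128": 7.09},
    "sycl": {"pp512": 192.35, "tg128": 6.55},
    "vulkan": {"pp512": 64.35, "tg128": 11.60},
    "cuda (4090)": {"pp512": 2291.15, "tg128": 40.55},
}


def speedup(results, baseline):
    """Return each backend's throughput divided by the baseline backend's."""
    base = results[baseline]
    return {
        name: {test: value / base[test] for test, value in tests.items()}
        for name, tests in results.items()
    }


# Hypothetical choice of baseline; any key of `results` works.
ratios = speedup(results, baseline="sycl")
for name, tests in ratios.items():
    print(f"{name:12s} pp512 x{tests['pp512']:.2f}  tg128 x{tests['tg128']:.2f}")
```

With sycl as the baseline, ipex-llm is faster at prompt processing (pp512), while vulkan is faster at token generation (tg128) and much slower at pp512.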
https://github.com/ProjectPhysX/OpenCL-Benchmark/releases/download/v1.8/OpenCL-Benchmark-Windows.exe
.-----------------------------------------------------------------------------.
|----------------.------------------------------------------------------------|
| Device ID    0 | Intel(R) Arc(TM) A770 Graphics                             |
| Device ID    1 | Intel(R) Arc(TM) A750 Graphics                             |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | Intel(R) Arc(TM) A770 Graphics                             |
| Device Vendor  | Intel(R) Corporation                                       |
| Device Driver  | 32.0.101.6559 (Windows)                                    |
| OpenCL Version | OpenCL C 3.0                                               |
| Compute Units  | 512 at 2400 MHz (4096 cores, 19.661 TFLOPs/s)              |
| Memory, Cache  | 16255 MB VRAM, 16384 KB global / 64 KB local               |
| Buffer Limits  | 4095 MB global, 4194296 KB constant                        |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| FP64  compute                                          not supported        |
| FP32  compute                                        12.196 TFLOPs/s (2/3 ) |
| FP16  compute                                        18.425 TFLOPs/s ( 1x ) |
| INT64 compute                                         1.191  TIOPs/s (1/16) |
| INT32 compute                                         5.687  TIOPs/s (1/4 ) |
| INT16 compute                                        30.045  TIOPs/s ( 2x ) |
| INT8  compute                                        29.282  TIOPs/s ( 1x ) |
| Memory Bandwidth ( coalesced read      )                        223.97 GB/s |
| Memory Bandwidth ( coalesced      write)                        432.86 GB/s |
| Memory Bandwidth (misaligned read      )                        400.16 GB/s |
| Memory Bandwidth (misaligned      write)                        438.62 GB/s |
| PCIe   Bandwidth (send                 )                          9.30 GB/s |
| PCIe   Bandwidth (   receive           )                          9.00 GB/s |
| PCIe   Bandwidth (        bidirectional)            (Gen4 x16)    9.90 GB/s |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 1                                                          |
| Device Name    | Intel(R) Arc(TM) A750 Graphics                             |
| Device Vendor  | Intel(R) Corporation                                       |
| Device Driver  | 32.0.101.6559 (Windows)                                    |
| OpenCL Version | OpenCL C 3.0                                               |
| Compute Units  | 448 at 2400 MHz (3584 cores, 17.203 TFLOPs/s)              |
| Memory, Cache  | 8095 MB VRAM, 16384 KB global / 64 KB local                |
| Buffer Limits  | 3967 MB global, 4062248 KB constant                        |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| FP64  compute                                          not supported        |
| FP32  compute                                        10.693 TFLOPs/s (2/3 ) |
| FP16  compute                                        16.177 TFLOPs/s ( 1x ) |
| INT64 compute                                         1.090  TIOPs/s (1/16) |
| INT32 compute                                         5.043  TIOPs/s (1/3 ) |
| INT16 compute                                        26.553  TIOPs/s ( 2x ) |
| INT8  compute                                        26.611  TIOPs/s ( 2x ) |
| Memory Bandwidth ( coalesced read      )                        210.06 GB/s |
| Memory Bandwidth ( coalesced      write)                        434.85 GB/s |
| Memory Bandwidth (misaligned read      )                        399.86 GB/s |
| Memory Bandwidth (misaligned      write)                        441.22 GB/s |
| PCIe   Bandwidth (send                 )                          9.35 GB/s |
| PCIe   Bandwidth (   receive           )                          9.04 GB/s |
| PCIe   Bandwidth (        bidirectional)            (Gen4 x16)    9.94 GB/s |
|-----------------------------------------------------------------------------|
|-----------------------------------------------------------------------------|
| Done. Press Enter to exit.                                                  |
'-----------------------------------------------------------------------------'
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA GeForce RTX 4090                                    |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 572.60 (Windows)                                           |
| OpenCL Version | OpenCL C 3.0                                               |
| Compute Units  | 128 at 2535 MHz (16384 cores, 83.067 TFLOPs/s)             |
| Memory, Cache  | 24563 MB VRAM, 3584 KB global / 48 KB local                |
| Buffer Limits  | 6140 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| FP64  compute                                         1.401 TFLOPs/s (1/64) |
| FP32  compute                                        85.239 TFLOPs/s ( 1x ) |
| FP16  compute                                        88.567 TFLOPs/s ( 1x ) |
| INT64 compute                                         4.204  TIOPs/s (1/24) |
| INT32 compute                                        44.164  TIOPs/s (1/2 ) |
| INT16 compute                                        38.203  TIOPs/s (1/2 ) |
| INT8  compute                                       133.384  TIOPs/s ( 2x ) |
| Memory Bandwidth ( coalesced read      )                        925.72 GB/s |
| Memory Bandwidth ( coalesced      write)                        898.38 GB/s |
| Memory Bandwidth (misaligned read      )                        923.73 GB/s |
| Memory Bandwidth (misaligned      write)                        212.93 GB/s |
| PCIe   Bandwidth (send                 )                         15.66 GB/s |
| PCIe   Bandwidth (   receive           )                         14.80 GB/s |
| PCIe   Bandwidth (        bidirectional)            (Gen4 x16)   15.24 GB/s |
|-----------------------------------------------------------------------------|
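As a rough cross-check of the output above, the measured FP32 throughput can be divided by the theoretical figure printed on each device's "Compute Units" line. This is only a sketch: the values are copied from the logs above, and the `fp32_efficiency` name is hypothetical.

```python
# (theoretical TFLOPs/s from "Compute Units", measured FP32 TFLOPs/s),
# both copied from the OpenCL-Benchmark output above.
devices = {
    "Arc A770": (19.661, 12.196),
    "Arc A750": (17.203, 10.693),
    "RTX 4090": (83.067, 85.239),
}


def fp32_efficiency(theoretical, measured):
    """Fraction of the theoretical FP32 rate reached in the benchmark."""
    return measured / theoretical


efficiency = {
    name: fp32_efficiency(theoretical, measured)
    for name, (theoretical, measured) in devices.items()
}
for name, value in efficiency.items():
    print(f"{name}: {value:.1%} of theoretical FP32")
```

Both Arc cards reach roughly 62% of the printed theoretical figure, in line with the "(2/3)" hint the tool prints, while the 4090 slightly exceeds its printed figure.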