forked from ggml-org/llama.cpp
[FEATURE REQUEST] - Turbo Quant #2075
Open
Description
It looks like a new method for handling the KV cache has arrived, one that improves over Q8 KV caching in both speed and memory footprint. IMO, it is likely to become a feature in llama.cpp, as I have seen several people using TurboQuant.
The thread below covers the subject in detail:
TurboQuant - Extreme KV Cache Quantization
I figured I should draw attention to this, as it seems very promising.
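For context on the Q8 baseline being compared against: a minimal sketch of per-block absmax 8-bit quantization, which is the general idea behind Q8-style KV-cache storage. This is an illustrative example only, not llama.cpp's actual implementation and not TurboQuant's method (the linked thread has those details); the block size and function names here are assumptions.

```python
import numpy as np

def quantize_q8(x: np.ndarray, block: int = 32):
    """Per-block absmax 8-bit quantization (illustrative sketch).

    Each block of `block` floats is stored as int8 values plus one
    fp32 scale, roughly a 4x reduction versus fp32 storage.
    """
    x = x.reshape(-1, block)
    # One scale per block: largest magnitude maps to +/-127.
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero
    q = np.round(x / scale).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_q8(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Reconstruct approximate fp32 values from int8 + per-block scale."""
    return (q.astype(np.float32) * scale).reshape(-1)

# Round-trip a fake KV-cache slice and check the reconstruction error.
rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)
q, s = quantize_q8(x)
err = float(np.abs(dequantize_q8(q, s) - x).max())
```

The worst-case round-trip error is half a quantization step (scale / 2), which is why Q8 KV caches are usually close to lossless; sub-8-bit schemes like the one discussed in the thread have to work harder to keep that error down.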