Large language models (LLMs) are called "large" because of their size in bytes, not their intelligence. With billions of parameters, each typically stored as a four-byte (32-bit) float, they strain storage and RAM limits, particularly VRAM on GPUs. Quantization addresses this by storing each weight in fewer bits, which shrinks the model significantly. Codeically illustrates the idea by compressing a gigabyte-sized model to just 63 MB through bit reduction. Models can also be offloaded from VRAM to system memory or disk, but this can severely hurt performance. Lowering 32-bit floating-point weights to 8-bit cuts memory usage by about 75%; going from 16-bit to 8-bit halves it. Using GPT-2 and quotes from the internet, Codeically trained a model whose weights are stored as 4-bit integers. Initial attempts with zeroed weights produced unusable output, but refining the approach yielded a functional chatbot that generates text at 78 tokens per second on a CPU, showing that compact models can deliver surprisingly engaging (if nonsensical) results without an extensive hardware investment.
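The size arithmetic and the basic idea of bit reduction can be sketched in a few lines of Python. This is a minimal illustration and not the method used in the source: the function names, the one-billion-parameter count, and the choice of symmetric absmax quantization are assumptions made for the example.

```python
def model_size_bytes(num_params: int, bits_per_param: int) -> int:
    """Bytes needed for the weights alone (ignores metadata and activations)."""
    return num_params * bits_per_param // 8


def quantize_absmax(weights, bits=8):
    """Symmetric absmax quantization (an assumed scheme, chosen for simplicity).

    Maps floats onto signed integers in [-qmax, qmax] and returns the
    integers together with the scale needed to map them back.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the quantized integers."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    params = 1_000_000_000  # hypothetical parameter count
    fp32 = model_size_bytes(params, 32)
    for bits in (16, 8, 4):
        size = model_size_bytes(params, bits)
        print(f"{bits:>2}-bit: {size / 1e9:.2f} GB, {1 - size / fp32:.0%} smaller than 32-bit")

    weights = [0.82, -0.31, 0.05, -1.20]  # made-up example weights
    q, scale = quantize_absmax(weights, bits=4)
    print(q, dequantize(q, scale))
```

The rounding error per weight is at most half the scale, so fewer bits means a coarser grid and a noisier model, which is the trade-off behind the compact but less coherent chatbot described above.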
