Code_ML
GGUF > GGML
llama.cpp is a C/C++ implementation of inference for the LLaMA (Large Language Model Meta AI) models, which Meta (formerly Facebook) released as open-weight large language models. The llama.cpp repository makes it possible to run these models efficiently and conveniently on local machines, and is designed to work with far more limited resources than cloud-scale deployments.
ggml is a lightweight, high-performance C tensor library designed for machine learning, focused on efficient memory management and inference for large models, specifically large language models (LLMs). It's particularly useful for running models like LLaMA on systems with limited resources, as it improves memory efficiency and speeds up processing.
📌 GGUF is a new format introduced by the llama.cpp team on August 21st, 2023. It is a replacement for GGML.
GGUF has the same type of layout as GGML, with metadata and tensor data in a single file, but it is also designed to be backwards-compatible. The key difference is that the hyperparameters are no longer stored as a flat list of values; the new format uses key-value lookup tables, which accommodate new or shifting values without breaking existing readers.
Basically, GGUF (i.e. "GPT-Generated Unified Format"), the successor of GGML, is a file format for (typically quantized) models that lets users run an LLM on the CPU while offloading some of its layers to the GPU for a speed-up.
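A minimal sketch of that CPU + partial-GPU-offload setup, assuming the llama-cpp-python bindings and a locally downloaded GGUF file (the model path, layer count, and prompt below are placeholders):

```python
from llama_cpp import Llama

# Load a quantized GGUF model; n_gpu_layers controls how many
# transformer layers are offloaded to the GPU (-1 = all, 0 = CPU only).
llm = Llama(
    model_path="./models/llama-2-7b.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=20,   # offload 20 layers, keep the rest on the CPU
    n_ctx=2048,        # context window
    n_threads=8,       # CPU threads used for the non-offloaded layers
)

output = llm("Q: What does GGUF stand for? A:", max_tokens=64, stop=["Q:"])
print(output["choices"][0]["text"])
```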
📌 GGML is a C tensor library designed for machine learning; it makes it possible to run LLMs on the CPU alone or with part of the work offloaded to a GPU.
Llama.cpp has dropped support for the GGML format and now only supports GGUF
GGUF contains all the metadata it needs in the model file (no need for other files like tokenizer_config.json) except the prompt template
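A quick way to verify that the metadata really lives inside the .gguf file is to read its key-value fields directly. This is a sketch assuming the `gguf` Python package published alongside llama.cpp; the file name is a placeholder:

```python
from gguf import GGUFReader

reader = GGUFReader("model.gguf")  # placeholder file name

# Metadata is stored as key-value fields in the same file as the tensors.
for key in reader.fields:
    print(key)  # e.g. general.architecture, tokenizer.ggml.tokens, ...

# Tensor descriptions (name, shape, quantization type) are listed there too.
for tensor in reader.tensors:
    print(tensor.name, tensor.shape, tensor.tensor_type)
```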
llama.cpp has a script to convert *.safetensors model files into *.gguf
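A sketch of that conversion plus a follow-up quantization step, driven from Python via subprocess. The script name `convert_hf_to_gguf.py` and the `llama-quantize` binary match recent llama.cpp checkouts but may differ in older versions; the paths are placeholders:

```python
import subprocess

# 1) Convert a Hugging Face *.safetensors checkpoint directory to GGUF (f16).
subprocess.run(
    ["python", "convert_hf_to_gguf.py", "./my-hf-model",
     "--outfile", "./my-model-f16.gguf", "--outtype", "f16"],
    check=True,
)

# 2) Quantize the f16 GGUF down to 4-bit (Q4_K_M) for CPU-friendly inference.
subprocess.run(
    ["./llama-quantize", "./my-model-f16.gguf",
     "./my-model-Q4_K_M.gguf", "Q4_K_M"],
    check=True,
)
```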
Transformers & llama.cpp support CPU, GPU, and MPS (Apple Metal) inference
Because llama.cpp is compiled C/C++ code, GGUF inference is multithreaded.
The GGML format recently changed to GGUF, which is designed to be extensible so that new features don't break compatibility with existing models. It also centralizes all the metadata in one file: special tokens, RoPE scaling parameters, etc. In short, it addresses a few historical pain points and should be future-proof.
https://www.techpowerup.com/gpu-specs/radeon-rx-580.c2938
https://huggingface.co/models?pipeline_tag=image-text-to-text
https://huggingface.co/HuggingFaceTB/SmolVLM-256M-Instruct