LLM Inference
16 AI tools in the LLM Inference category
A high-throughput and memory-efficient inference and serving engine for LLMs (see the usage sketch after this list).
SGLang is a fast serving framework for large language models and vision language models.
NVIDIA framework for LLM inference (transitioned to TensorRT-LLM).
Speeds up long-context LLM inference with approximate, dynamic sparse attention computation, cutting pre-filling latency by up to 10x on an A100 while maintaining accuracy.
A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
Run LLMs and batch jobs on any cloud. Get maximum cost savings, highest GPU availability, and managed execution, all with a simple interface.
MII enables low-latency and high-throughput inference, similar to vLLM, powered by DeepSpeed.
Inference for text embeddings, written in Rust, under the HFOIL license.
A high-throughput and low-latency inference and serving framework for LLMs and VLMs.
A distributed implementation of llama.cpp that lets you run 70B-level LLMs on your everyday devices.
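
Several of the engines listed above expose a similar Python API for offline batch generation. The first entry's description matches vLLM's project tagline (and the MII entry compares itself to vLLM), so the following is a minimal sketch assuming vLLM; the prompts and the small example model are illustrative placeholders, not a recommendation:

    # Minimal offline batch generation with vLLM (sketch; model choice is illustrative).
    from vllm import LLM, SamplingParams

    prompts = [
        "The capital of France is",
        "In one sentence, explain KV-cache paging:",
    ]

    # Nucleus sampling with a short completion budget.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

    # LLM() downloads the weights and starts the in-process engine.
    llm = LLM(model="facebook/opt-125m")

    # generate() runs all prompts through the engine's batching scheduler.
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        print(f"Prompt: {output.prompt!r}")
        print(f"Completion: {output.outputs[0].text!r}")

The batched generate() call is where the throughput claims in these taglines come from: the engine schedules many requests through one model concurrently rather than processing prompts one at a time.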