LLM Inference
15 AI tools in the LLM Inference category
SGLang
SGLang is a fast serving framework for large language models and vision language models.
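SGLang exposes an OpenAI-compatible HTTP API once a server is launched (e.g. via `python -m sglang.launch_server`). A minimal client sketch, assuming a server on port 30000 and a placeholder model id:

```python
import json
from urllib import request

# Build a request against SGLang's OpenAI-compatible chat endpoint.
# Model id and port are placeholders; match them to your launch command.
payload = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder
    "messages": [{"role": "user", "content": "What is SGLang?"}],
    "max_tokens": 64,
}
req = request.Request(
    "http://localhost:30000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment once a server is actually running:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the endpoint follows the OpenAI schema, the same payload works with any OpenAI-compatible client library by pointing its base URL at the local server.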
TGI
A toolkit for deploying and serving large language models (LLMs).
TensorRT-LLM
NVIDIA framework for LLM inference.
FasterTransformer
NVIDIA framework for LLM inference (transitioned to TensorRT-LLM).
MInference
Speeds up long-context LLM inference by computing attention with approximate, dynamic sparse methods, reducing pre-filling latency on an A100 by up to 10x while maintaining accuracy.
exllama
A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
mistral.rs
Blazingly fast LLM inference.
SkyPilot
Run LLMs and batch jobs on any cloud. Get maximum cost savings, highest GPU availability, and managed execution -- all with a simple interface.
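SkyPilot jobs are declared as task YAML files and launched with `sky launch`, which provisions the cheapest available cloud and GPU matching the requested resources. A hypothetical task sketch (model id, GPU type, and vLLM serving command are illustrative placeholders, not part of SkyPilot itself):

```yaml
# task.yaml — hypothetical SkyPilot task serving an LLM with vLLM.
resources:
  accelerators: A100:1   # placeholder GPU request

setup: |
  pip install vllm

run: |
  python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct --port 8000
```

Launching with `sky launch task.yaml` handles provisioning, setup, and execution; the same file runs unchanged across supported clouds.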
DeepSpeed-Mii
MII enables low-latency, high-throughput inference, similar to vLLM; it is powered by DeepSpeed.
Text-Embeddings-Inference
Inference for text embeddings in Rust (HFOIL license).
Infinity
Inference for text embeddings in Python.
LMDeploy
A high-throughput and low-latency inference and serving framework for LLMs and VLMs.
Liger-Kernel
Efficient Triton Kernels for LLM Training.
prima.cpp
A distributed implementation of llama.cpp that lets you run 70B-level LLMs on your everyday devices.
deploy-llms-with-ansible
Easily deploy any LLM on a VM with minimal configuration, using Ansible.