exllama
A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
mistral.rs
Blazingly fast LLM inference.
SkyPilot
Run LLMs and batch jobs on any cloud. Get maximum cost savings, highest GPU availability, and managed execution -- all with a simple interface.
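A minimal sketch of launching a GPU job through SkyPilot's Python API; the setup/run commands, accelerator type, and cluster name below are illustrative assumptions, not part of any specific recipe.

```python
# Sketch: provision a cloud GPU and run a job with SkyPilot's Python API.
# The commands, accelerator spec, and cluster name are placeholder assumptions.
import sky

task = sky.Task(
    setup="pip install -r requirements.txt",
    run="python serve_llm.py --port 8000",
)
task.set_resources(sky.Resources(accelerators="A100:1"))

# SkyPilot picks a cloud/region with available GPUs, provisions it, and runs the task.
sky.launch(task, cluster_name="llm-serve")
```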
DeepSpeed-MII
MII provides low-latency and high-throughput inference, similar to vLLM, powered by DeepSpeed.
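A minimal sketch of local generation with MII's non-persistent pipeline API, assuming a recent MII release and a GPU large enough for the (example) model:

```python
# Sketch: local text generation with DeepSpeed-MII's non-persistent pipeline.
# The model name and generation settings are example assumptions.
import mii

pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
responses = pipe(["DeepSpeed is", "Seattle is"], max_new_tokens=64)
for response in responses:
    print(response.generated_text)
```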
Text-Embeddings-Inference
Inference for text embeddings in Rust, under the HFOIL license.
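A small sketch of querying a Text-Embeddings-Inference server over its HTTP `/embed` endpoint; it assumes the server was already started separately (e.g. via its Docker image) and the port and inputs are placeholders.

```python
# Sketch: request embeddings from a running Text-Embeddings-Inference server.
# Assumes a server with an embedding model loaded is listening on localhost:8080.
import requests

resp = requests.post(
    "http://localhost:8080/embed",
    json={"inputs": ["What is deep learning?", "LLM inference on any device"]},
    timeout=30,
)
resp.raise_for_status()
embeddings = resp.json()  # one embedding vector (list of floats) per input string
print(len(embeddings), len(embeddings[0]))
```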
Infinity
Inference for text embeddings in Python.
Liger-Kernel
Efficient Triton Kernels for LLM Training.
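A hedged sketch of patching a Hugging Face Llama model with Liger's Triton kernels before loading it; the model id and dtype are example assumptions.

```python
# Sketch: swap in Liger's Triton kernels for Llama modules before instantiating the model.
# Assumes liger-kernel and transformers are installed and a CUDA GPU is available.
import torch
from liger_kernel.transformers import apply_liger_kernel_to_llama
from transformers import AutoModelForCausalLM

apply_liger_kernel_to_llama()  # patches Llama layers (RMSNorm, SwiGLU, etc.) in place

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # example model id
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
```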
prima.cpp
A distributed implementation of llama.cpp that lets you run 70B-level LLMs on your everyday devices.
deploy-llms-with-ansible
Easily deploy any LLM on a VM with minimal configuration, using Ansible.
Swiss Army Llama
Comprehensive set of tools for working with local LLMs for various tasks.
wechat-chatgpt
Use ChatGPT on WeChat via Wechaty.
Serge
A chat interface crafted with llama.cpp for running Alpaca models. No API keys, entirely self-hosted!
IntelliServer
Simplifies the evaluation of LLMs by providing a unified microservice to access and test multiple AI models.
Search with Lepton
Build your own conversational search engine using less than 500 lines of code by [LeptonAI](https://github.com/leptonai).
Robocorp
Create, deploy and operate Actions using Python anywhere to enhance your AI agents and assistants. Batteries included with an extensive set of libraries, helpers and logging.
Tune Studio
Playground for devs to finetune & deploy LLMs
talkd.ai dialog
Simple API for deploying any RAG or LLM that you want, with support for adding plugins.
Wllama
WebAssembly binding for llama.cpp, enabling in-browser LLM inference.
GPUStack
An open-source GPU cluster manager for running LLMs
MNN-LLM
An on-device inference framework, including LLM inference on device (mobile phone/PC/IoT).