LLM Explorer: A Curated Large Language Model Directory and Analytics

LLM Hosting

GPU Hosting, Serverless LLMs, and LLM Inference Services
Explore our directory of GPU hosting providers and serverless inference services for large language models (LLMs), including servers that run models via frameworks such as vLLM and Ollama. The list covers serverless LLM endpoints with API integration, dedicated GPU servers, and options for fine-tuning your own models. Compare features, performance, and pricing to find the hosting service that fits your application, whether you need high-performance GPUs for LLMs or a convenient serverless environment for inference. Aimed at developers and businesses seeking efficient LLM deployment and inference.

LLM Inference Frameworks

LLM (large language model) inference frameworks are tools for deploying and serving LLMs for tasks such as text generation, translation, and content summarization. They address the challenges posed by the models' size and computational requirements: they integrate with a wide range of LLM architectures, provide high-throughput serving, support distributed inference, and offer optimizations such as continuous batching and paged attention.
vLLM (github.com/vllm-project/vllm)
llama.cpp (github.com/ggerganov/llama.cpp)
SkyPilot (github.com/skypilot-org/skypilot)
TGI (github.com/huggingface/text-generation-i)
TensorRT (developer.nvidia.com/tensorrt-getting-st)
MLX (github.com/ml-explore/mlx)
LoRAX (github.com/predibase/lorax)
Titan (titanml.co)
exllamav2 (github.com/turboderp/exllamav2)
NeuralMagic (neuralmagic.com)
Ollama (ollama.ai)
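Continuous batching, one of the optimizations mentioned above, keeps the GPU busy by admitting new sequences into the running batch as soon as earlier ones finish, rather than waiting for the whole batch to drain. The sketch below is a toy, framework-agnostic illustration of the idea; every name in it is hypothetical, not any framework's real API:

```python
from dataclasses import dataclass, field
from collections import deque

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    generated: list = field(default_factory=list)

def fake_decode_step(req):
    # Stand-in for one forward pass producing one token for a sequence.
    req.generated.append(f"tok{len(req.generated)}")

def continuous_batching(requests, batch_size=2):
    """Toy continuous-batching loop: finished sequences are evicted and
    waiting ones admitted every step, keeping the batch full."""
    waiting = deque(requests)
    running, finished = [], []
    while waiting or running:
        # Admit new requests as soon as slots free up -- the key difference
        # from static batching, which waits for the whole batch to complete.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        for req in running:
            fake_decode_step(req)
        still_running = []
        for req in running:
            (finished if len(req.generated) >= req.max_new_tokens
             else still_running).append(req)
        running = still_running
    return finished

done = continuous_batching([Request("a", 1), Request("b", 3), Request("c", 2)])
```

Real frameworks such as vLLM combine this scheduling with paged attention so that the KV cache of evicted sequences can be reclaimed immediately.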

GPU Hosting with API for LLM Inference

GPU hosting with an API for LLM inference provides GPU resources together with an application programming interface (API) for running large language models. Users tap the computational power of GPUs programmatically instead of managing hardware themselves. Such services are offered by the major clouds, including AWS, Azure, and Google Cloud, as well as a growing number of specialized providers.
HuggingFace Endpoint (huggingface.co/inference-endpoints)
Modelbit (modelbit.com)
Haven (haven.run)
Replicate (replicate.com)
BaseTen (baseten.co)
Modal (modal.com)
Mystic (mystic.ai)
Salad (salad.com)
SaturnCloud (saturncloud.io)
DataRobot Algorithmia (datarobot.com/platform/deploy-and-run)
DataBricks (docs.databricks.com/en/machine-learning/)
Kaggle (kaggle.com)
Google Colab (colab.google)
QBlocks (qblocks.cloud)
DataCrunch (datacrunch.io/inference)
DStack (dstack.ai)
CloudFlare (ai.cloudflare.com)
Predibase (predibase.com)
Encloud (encloud.tech)
MosaicML (mosaicml.com)
SeaPlane (seaplane.io)
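Calling such a hosted endpoint usually amounts to a single authenticated HTTP POST with a JSON body. The sketch below assembles (but does not send) a request using only the Python standard library; the URL path, JSON field names, and bearer-token auth are assumptions modeled on common provider APIs, so check each provider's own documentation:

```python
import json
import urllib.request

def build_inference_request(base_url, api_key, model, prompt, max_tokens=256):
    """Assemble an HTTP request for a hosted LLM endpoint.
    Path and field names are illustrative; every provider defines
    its own schema."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",          # path varies by provider
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # common auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_inference_request("https://api.example.com", "KEY",
                              "my-model", "Hello")
# urllib.request.urlopen(req) would then perform the call.
```

Keeping request construction separate from sending makes the payload easy to inspect and test before any network traffic occurs.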

Serverless LLM Hosting and Endpoints for LLM Inference

Serverless inference lets you deploy and scale ML models without managing hardware: resources adjust to demand and you pay only for what you use, which is cost-effective for variable traffic. Typical serverless endpoints allocate roughly 1-6 GB of memory, so they suit models within that range; larger models may not fit or may suffer high latency, and for real-time, low-latency needs other hosting options should be considered. Serverless deployment also simplifies exposing an LLM as an API, reducing costs through features such as scale-to-zero and secure, offline endpoints. Overall it offers an easy, cost-efficient way to deploy LLMs for many applications.
together.ai (together.ai), token-based pricing
Mistral AI Platform (mistral.ai), token-based pricing
AWS BedRock (aws.amazon.com/bedrock), token-based pricing
Anyscale (anyscale.com/endpoints), token-based pricing
Lamini.ai (lamini.ai), token-based pricing
OpenPipe (openpipe.ai), token-based pricing
Fireworks AI (app.fireworks.ai), token-based pricing
OpenRouter (openrouter.ai), token-based pricing
DeepInfra (deepinfra.com), token-based pricing
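Per-token pricing, common to the endpoints above, trades a higher marginal cost per request for zero idle cost. A rough break-even calculation (with placeholder prices; real rates vary widely by provider and model) shows the monthly token volume at which renting a dedicated GPU starts to pay off:

```python
def breakeven_tokens_per_month(price_per_1k_tokens, gpu_hourly_rate,
                               hours_per_month=730):
    """Monthly token volume above which an always-on dedicated GPU is
    cheaper than per-token serverless billing. Ignores utilization limits,
    throughput ceilings, and egress costs -- a first-order estimate only."""
    monthly_gpu_cost = gpu_hourly_rate * hours_per_month
    return monthly_gpu_cost / price_per_1k_tokens * 1000

# Placeholder rates: $0.0005 per 1K tokens vs. a $1.20/hour GPU.
threshold = breakeven_tokens_per_month(price_per_1k_tokens=0.0005,
                                       gpu_hourly_rate=1.20)
```

Below the threshold, per-token billing wins because you pay nothing while idle; above it, the fixed hourly cost amortizes over enough tokens to come out ahead.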

GPU Hosting

GPU hosting provides on-demand access to powerful graphics processing units (GPUs) in data centers or cloud platforms, billed by subscription or by the hour, making it a flexible option for high-demand applications. It is essential for workloads that need significant parallel compute: machine-learning model training and inference, scientific simulation, 3D rendering, video transcoding and editing, gaming, and the development of artificial intelligence systems in general.
Paperspace Gradient (paperspace.com/deployments)
AWS SageMaker (aws.amazon.com/sagemaker)
Azure AI Machine Learning Studio (studio.azureml.net)
Google Vertex AI (cloud.google.com/vertex-ai)
NVIDIA Triton Inference Server (developer.nvidia.com/triton-inference-se)
TensorDock (tensordock.com/product-marketplace)
TrueFoundry (truefoundry.com/llmops)
Latitude (latitude.sh/accelerate/pricing)
Banana (banana.dev)
Beam Cloud (beam.cloud)
Lightning (lightning.ai)
Genesis Cloud (genesiscloud.com)
Vultr (vultr.com/pricing)
ScaleWay (scaleway.com/en)
CudoCompute (cudocompute.com)
Unweave (unweave.io)
Vagon (vagon.io)
LeaderGPU (leadergpu.com)
CirraScale (cirrascale.com)
Vast.AI (vast.ai)
Immers Cloud (en.immers.cloud/gpu)
Fal.ai (fal.ai)
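When sizing a GPU plan from providers like those above, a quick capacity check is whether the model's weights fit in VRAM. The rule of thumb below (an illustrative back-of-the-envelope estimate, not any provider's formula) multiplies parameter count by precision and adds headroom for activations and the KV cache:

```python
def estimate_vram_gb(n_params_billion, bytes_per_param=2, overhead=1.2):
    """Rough GPU memory needed to serve a model: parameter count times
    bytes per parameter (2 for fp16/bf16, 4 for fp32, 1 for int8),
    plus ~20% headroom for activations and KV cache. A rule of thumb,
    not a guarantee -- long contexts and large batches need more."""
    weights_gb = n_params_billion * bytes_per_param  # 1e9 params * bytes / 1e9
    return weights_gb * overhead

# A 7B-parameter model served in fp16 (2 bytes per parameter):
needed = estimate_vram_gb(7)   # roughly 16.8 GB, so a 24 GB card fits
```

Quantizing to int8 or 4-bit roughly halves or quarters the weight footprint, which is why quantized formats are popular on smaller consumer GPUs.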
Original data from HuggingFace, OpenCompass and various public git repos.
Release v2024022003