GPU Hosting, Serverless LLM, and LLM Inference Services
Explore a directory of GPU hosting providers and serverless LLM inference services built on engines such as vLLM and Ollama for large language models (LLMs). The list includes serverless LLM endpoints with API integration, powerful GPU servers, and options for fine-tuning your own models. Compare features, performance, and pricing to find the LLM hosting service that fits your application, whether you need high-performance GPUs for training or a convenient serverless environment for inference. Ideal for developers and businesses seeking efficient large language model deployment.
LLM Inference Frameworks
LLM (Large Language Model) inference frameworks are the tools and libraries used to deploy and serve LLMs for tasks such as text generation, translation, and summarization. They exist to tame the large model sizes and heavy compute requirements of LLMs: they integrate with common model formats, deliver high-throughput serving, support distributed inference, and provide optimizations such as continuous batching and paged attention. A minimal usage sketch follows the list below.
vLLM | github.com/vllm-project/vllm
llama.cpp | github.com/ggerganov/llama.cpp
SkyPilot | github.com/skypilot-org/skypilot
TGI | github.com/huggingface/text-generation-inference
TensorRT | developer.nvidia.com/tensorrt-getting-started
MLX | github.com/ml-explore/mlx
LoRAX | github.com/predibase/lorax
Titan | titanml.co
exllamav2 | github.com/turboderp/exllamav2
NeuralMagic | neuralmagic.com
Ollama | ollama.ai
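As an illustration of how these frameworks are used, here is a minimal offline-inference sketch with vLLM. It assumes `pip install vllm` and a CUDA-capable GPU; the model name is only an example, so swap in whatever model you actually serve.

```python
# Minimal vLLM offline inference; assumes a CUDA GPU and `pip install vllm`.
from vllm import LLM, SamplingParams

# The model name is illustrative; any model vLLM supports will work.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
params = SamplingParams(temperature=0.7, max_tokens=128)

# vLLM batches these prompts on the GPU (continuous batching) and
# manages the KV cache with paged attention under the hood.
outputs = llm.generate(
    ["What is paged attention?", "Summarize continuous batching in one line."],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```

For production serving, most of these frameworks also ship an HTTP server (recent vLLM releases expose a `vllm serve` command, for example) so the same engine can sit behind an OpenAI-compatible API.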
GPU Hosting with API for LLM Inference
GPU hosting with an API for LLM inference means a provider supplies both the GPU resources and an application programming interface (API) for running large language models on them, so users tap GPU compute programmatically instead of managing hardware. Many cloud services offer this model, from the major clouds (AWS, Azure, Google Cloud) to the specialized providers listed below. A generic calling pattern is sketched after the list.
RunPod | runpod.io (featured)
  Pricing: A100 80GB $1.89/hr, H100 80GB $3.89/hr, A40 48GB $0.79/hr, RTX 4090 24GB $0.74/hr, RTX A6000 48GB $0.79/hr
HuggingFace Endpoint | huggingface.co/inference-endpoints
Modelbit | modelbit.com
Haven | haven.run
Replicate | replicate.com
BaseTen | baseten.co
Modal | modal.com
Mystic | mystic.ai
Salad | salad.com
SaturnCloud | saturncloud.io
DataRobot Algorithmia | datarobot.com/platform/deploy-and-run
DataBricks | docs.databricks.com/en/machine-learning/
Kaggle | kaggle.com
Google Colab | colab.google
QBlocks | qblocks.cloud
DataCrunch | datacrunch.io/inference
DStack | dstack.ai
CloudFlare | ai.cloudflare.com
Predibase | predibase.com
Encloud | encloud.tech
MosaicML | mosaicml.com
SeaPlane | seaplane.io
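Concrete request formats differ per provider, but the overall pattern is the same: POST a prompt to an authenticated endpoint and read back the generated text. The sketch below is generic; the endpoint URL, environment variable, and JSON fields are hypothetical placeholders, so consult your provider's API reference for the real ones.

```python
# Generic pattern for calling a GPU-hosted model over HTTP.
# The URL, env var, and JSON schema below are hypothetical placeholders.
import os
import requests

ENDPOINT = "https://api.example-gpu-host.com/v1/generate"  # hypothetical
API_KEY = os.environ["GPU_HOST_API_KEY"]                   # hypothetical

resp = requests.post(
    ENDPOINT,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"prompt": "Explain LLM inference in one sentence.", "max_tokens": 64},
    timeout=60,
)
resp.raise_for_status()  # surface HTTP errors early
print(resp.json())       # response schema is provider-specific
```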
Serverless LLM Hosting and Endpoints for LLM Inference
Serverless inference lets you deploy and scale ML models without managing hardware: the platform adjusts resources to demand and charges only for actual usage. This is cost-effective for variable traffic, though typical serverless endpoints allocate only about 1-6 GB of memory, so models that need more than that may not fit or may suffer high latency. For strict real-time, low-latency requirements, dedicated GPU hosting is usually a better fit.
Serverless deployment also simplifies exposing an LLM as an API, and features such as scale-to-zero and secure, isolated endpoints keep costs down. Many of the providers below bill per token and speak an OpenAI-compatible protocol; a client sketch follows the list.
Together AI | together.ai | per-token pricing
Mistral AI Platform | mistral.ai | per-token pricing
AWS BedRock | aws.amazon.com/bedrock | per-token pricing
Anyscale | anyscale.com/endpoints | per-token pricing
Lamini.ai | lamini.ai | per-token pricing
OpenPipe | openpipe.ai | per-token pricing
Fireworks AI | app.fireworks.ai | per-token pricing
OpenRouter | openrouter.ai | per-token pricing
DeepInfra | deepinfra.com | per-token pricing
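Several of the endpoints above (OpenRouter, Together AI, Fireworks AI, and DeepInfra, among others) are OpenAI-compatible, so the standard `openai` Python client works by overriding the base URL. The base URL and model id below are examples for OpenRouter; substitute your provider's values.

```python
# Calling an OpenAI-compatible serverless endpoint; assumes `pip install openai`.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # provider-specific base URL
    api_key="YOUR_PROVIDER_KEY",              # key issued by the provider
)

resp = client.chat.completions.create(
    model="mistralai/mistral-7b-instruct",    # model id as the provider names it
    messages=[{"role": "user",
               "content": "Why does scale-to-zero suit bursty traffic?"}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```

Because the wire protocol is shared, switching providers is often just a matter of changing the base URL, key, and model id.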
GPU Hosting
GPU hosting provides on-demand access to powerful Graphics Processing Units (GPUs) in data centers or cloud platforms, billed by the hour or by subscription. It is the standard choice for workloads that need massive parallel compute: machine learning training and inference, scientific simulation, 3D rendering, video transcoding and editing, and gaming. Compared with buying hardware, renting lets you match GPU class and count to the job at hand and release capacity when it sits idle. A quick sanity check for a freshly rented instance is sketched after the list below.
TensorWave | go.tensorwave.com (featured)
  AMD MI300X GPUs available now
Paperspace Gradient | paperspace.com/deployments
AWS SageMaker | aws.amazon.com/sagemaker
Azure AI Machine Learning Studio | studio.azureml.net
Google Vertex AI | cloud.google.com/vertex-ai
NVIDIA Triton Inference Server | developer.nvidia.com/triton-inference-server
TensorDock | tensordock.com/product-marketplace
TrueFoundry | truefoundry.com/llmops
Latitude | latitude.sh/accelerate/pricing
Banana | banana.dev
Beam Cloud | beam.cloud
Lightning | lightning.ai
Genesis Cloud | genesiscloud.com
Vultr | vultr.com/pricing
ScaleWay | scaleway.com/en
CudoCompute | cudocompute.com
Unweave | unweave.io
Vagon | vagon.io
LeaderGPU | leadergpu.com
CirraScale | cirrascale.com
Vast.AI | vast.ai
Immers Cloud | en.immers.cloud/gpu
Fal.ai | fal.ai
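After provisioning an instance from any of these hosts, it is worth confirming the GPU is actually visible to your framework before launching a long job. Here is a minimal check with PyTorch, assuming a CUDA build of torch is installed on the instance:

```python
# Sanity-check a rented GPU: confirm visibility and time a large matmul.
# Assumes `pip install torch` with CUDA support on the host.
import time
import torch

assert torch.cuda.is_available(), "No CUDA GPU detected on this host"
print("GPU:", torch.cuda.get_device_name(0))

x = torch.randn(8192, 8192, device="cuda")
torch.cuda.synchronize()            # finish allocation before timing
start = time.time()
y = x @ x                           # runs on the GPU
torch.cuda.synchronize()            # wait for the kernel to complete
print(f"8192x8192 matmul: {time.time() - start:.3f} s")
```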
Original data from HuggingFace, OpenCompass and various public git repos.
Release v2024072803