
Vision Proxy

Intelligent load balancing across your GPU fleet. Route vision and chat requests to available backends automatically. No single point of failure.

Get a Demo

The Routing Layer Your Cluster Needs

One endpoint for your application. Multiple GPU backends behind the scenes.

As your AI Metal Cluster grows from one GPU node to many, you need a way to distribute requests across them. Vision Proxy is that layer.

Your application talks to a single endpoint. The proxy figures out which backend has capacity, routes the request there, and returns the response. If a backend goes down, traffic shifts automatically. Your application never knows the difference.

  • Transparent to your application — Same API, same models, just more capacity
  • Scales horizontally — Add GPU nodes and the proxy includes them automatically
  • No code changes — Point your application at the proxy endpoint and you are done
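The routing idea above can be sketched in a few lines. This is a minimal illustration, not the Vision Proxy implementation: the backend names and the capacity model (an in-flight counter against a per-node limit) are invented for the example.

```python
# Sketch: one route() call stands in for the proxy. It scans the fleet
# and forwards to the first backend that can take another request.
from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    in_flight: int       # requests currently being processed
    max_concurrent: int  # capacity limit for this node

    @property
    def available(self) -> bool:
        return self.in_flight < self.max_concurrent

def route(backends: list[Backend]) -> Backend:
    """Pick the first backend that has room for another request."""
    for b in backends:
        if b.available:
            return b
    raise RuntimeError("no backend has capacity")

fleet = [
    Backend("gpu-a", in_flight=4, max_concurrent=4),  # saturated
    Backend("gpu-b", in_flight=1, max_concurrent=4),  # has room
]
print(route(fleet).name)  # gpu-b
```

The application never sees this logic; it only ever calls the proxy's single endpoint.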

Request Flow

  • Your Application — sends a request to a single proxy endpoint
  • Vision Proxy — checks capacity, translates model names, picks a backend
  • GPU Backends — A (Available), B (Busy), C (Available); the request lands on an available node

Key Features

Everything you need to manage a multi-node GPU cluster

Capacity Checking

The proxy knows which backends have room and which are saturated. Requests are only sent to backends that can handle them, preventing queue buildup and timeouts.
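As a hedged sketch of that bookkeeping, the proxy can be pictured as a tracker of in-flight requests per backend, admitting a request only when the node is below its limit. The limits and names here are illustrative assumptions.

```python
# Sketch: per-backend in-flight counters with a hard concurrency limit.
class CapacityTracker:
    def __init__(self, limits: dict[str, int]):
        self.limits = limits                      # backend -> max concurrent
        self.in_flight = {b: 0 for b in limits}   # backend -> current load

    def has_room(self, backend: str) -> bool:
        return self.in_flight[backend] < self.limits[backend]

    def acquire(self, backend: str) -> None:
        """Reserve a slot before forwarding a request."""
        if not self.has_room(backend):
            raise RuntimeError(f"{backend} is saturated")
        self.in_flight[backend] += 1

    def release(self, backend: str) -> None:
        """Free the slot once the response is returned."""
        self.in_flight[backend] -= 1

tracker = CapacityTracker({"gpu-a": 2})
tracker.acquire("gpu-a")
tracker.acquire("gpu-a")
print(tracker.has_room("gpu-a"))  # False
```

Because saturated nodes are never handed new work, queues cannot build up behind a busy GPU.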

Round-Robin Distribution

Requests are distributed evenly across available backends. This prevents any single node from being overloaded while others sit idle.
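Round-robin itself is simple rotation over the list of available backends. A minimal sketch, with invented backend names:

```python
# Sketch: endless rotation over a backend list, one pick per request.
from itertools import islice

def round_robin(backends):
    """Yield backends in rotation, endlessly."""
    i = 0
    while True:
        yield backends[i % len(backends)]
        i += 1

picker = round_robin(["gpu-a", "gpu-b", "gpu-c"])
print(list(islice(picker, 5)))  # ['gpu-a', 'gpu-b', 'gpu-c', 'gpu-a', 'gpu-b']
```

Combined with the capacity check, rotation only ever cycles over nodes that currently have room.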

Model Name Translation

Different backends may run models under different internal names. The proxy translates your request to the correct model name for whichever backend handles it. One API, regardless of backend.
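Conceptually this is a per-backend alias table consulted just before forwarding. All model and backend names below are made up for illustration; they are not real Vision Proxy identifiers.

```python
# Sketch: rewrite the request's public model name to whatever internal
# name the chosen backend registered it under.
ALIASES = {
    "gpu-a": {"vision-large": "vl-int8"},
    "gpu-b": {"vision-large": "vision-large-v2"},
}

def translate(model: str, backend: str) -> str:
    """Map a public model name to the backend's internal name (or pass through)."""
    return ALIASES.get(backend, {}).get(model, model)

print(translate("vision-large", "gpu-a"))  # vl-int8
print(translate("vision-large", "gpu-b"))  # vision-large-v2
```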

Automatic Failover

If a backend goes offline, the proxy routes around it. No manual intervention needed. When it comes back, it is automatically included in the rotation again.
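Failover falls out of re-checking health on every request: an unhealthy node is skipped, and a recovered node rejoins as soon as its next check passes. The health flags below are simulated booleans standing in for real health checks.

```python
# Sketch: route to the first healthy backend; a recovered node is picked
# up automatically because health is consulted per request.
def dispatch(backends: dict[str, bool], handler=lambda b: f"handled by {b}"):
    """backends maps name -> healthy. Route to the first healthy node."""
    for name, healthy in backends.items():
        if healthy:
            return handler(name)
    raise RuntimeError("all backends are down")

fleet = {"gpu-a": False, "gpu-b": True}
print(dispatch(fleet))  # handled by gpu-b
fleet["gpu-a"] = True   # node comes back; the next request can use it
print(dispatch(fleet))  # handled by gpu-a
```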

Why This Matters

Without a proxy layer, every application needs to know about every GPU backend. It needs to handle failover, capacity checking, and model name differences. That is complexity your application should not carry.

Vision Proxy absorbs that complexity. Your application sends a request to one URL. Everything else is handled.

  • Without Proxy — each app manages its own backends
  • With Vision Proxy — one endpoint, automatic routing

Scale Without Rewiring

Adding a new GPU node to your cluster? The proxy picks it up. No application redeployment, no configuration changes on the client side.

High Availability

No single point of failure in your GPU fleet. If one node goes down for maintenance or hardware issues, the rest keep serving requests.

Transparent to Applications

Your application layer does not need to know how many backends you have or which one handled its request. It just gets a response.


Scale Your GPU Fleet With Confidence

See how Vision Proxy manages multi-node deployments in a live demo.

Request a Demo