How It Works

The big picture

syndicAI provisions dedicated GPU instances on the spot market, deploys optimized inference engines (vLLM or llama.cpp), and exposes an OpenAI-compatible API for your squad. The entire lifecycle — from provisioning to billing — is managed through syndicAI's control plane so you never touch infrastructure directly.

Your tools (Cursor, aider, SDK)
    ↓  HTTPS + API Key
Squad Server (GPU Node)
    ├── Satellite Proxy (NestJS) — auth, rate limiting, telemetry
    └── Inference Engine (vLLM)  — model serving, token generation
    ↓  Management data only
syndicAI Control Plane
    ├── API (Cloudflare Workers)  — CRUD, billing, lifecycle
    ├── Dashboard (Angular SPA)   — squad management UI
    └── Database (Cloudflare D1)  — accounts, squads, usage
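Because the squad server speaks the OpenAI wire format, any OpenAI-compatible client can talk to it directly. The sketch below builds the request an editor or SDK would send; the base URL, API key, and model name are placeholders, not real values — use whatever your squad's dashboard shows.

```typescript
// Sketch: the request a tool sends to a Squad Server's
// OpenAI-compatible endpoint. URL, key, and model are hypothetical.

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Assemble the URL, headers, and JSON body of a chat completion call.
function buildChatRequest(
  baseUrl: string,
  apiKey: string,
  model: string,
  messages: ChatMessage[],
) {
  return {
    url: `${baseUrl}/v1/chat/completions`,
    method: "POST" as const,
    headers: {
      Authorization: `Bearer ${apiKey}`, // validated by the satellite
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model, messages }),
  };
}

const req = buildChatRequest(
  "https://my-squad.example.com", // placeholder squad server address
  "sk-squad-example",             // placeholder squad member API key
  "qwen2.5-coder-32b",            // whatever model your squad runs
  [{ role: "user", content: "Write a binary search in Go." }],
);
// Send with: fetch(req.url, { method: req.method, headers: req.headers, body: req.body })
```

The same shape works from Cursor, aider, or any SDK that lets you override the OpenAI base URL — point it at the squad server and supply your member key.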

The key architectural principle: token data (your prompts, code, and model responses) never leaves the GPU node. Only management data (usage counts, health checks, lifecycle events) flows to syndicAI's central systems.

Satellite architecture

Each Squad Server runs a "satellite" — a lightweight NestJS proxy that sits between your tools and the inference engine. The satellite handles:

  • Authentication: Validates API keys against syndicAI's control plane
  • Request routing: Forwards validated requests to the local vLLM instance
  • Usage telemetry: Counts tokens processed and reports aggregate usage (not content) to central
  • Health monitoring: Reports server health status for the dashboard

The satellite runs on the same GPU node as the inference engine. There's no intermediate hop, no third-party routing, and no central proxy. Your requests go directly to your dedicated hardware.
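The satellite's request path can be sketched in a few lines. This is a simplified stand-in, not the real NestJS implementation: the key check is a local allow-list where the real satellite validates against the control plane, and all names are illustrative.

```typescript
// Simplified sketch of the satellite's two jobs on each request:
// gate by API key, and count tokens (never content) for telemetry.

type UsageCounter = { requests: number; tokens: number };

// API-key gate. The real satellite validates against syndicAI's
// control plane; a local Set stands in for that here.
function isAuthorized(apiKey: string, validKeys: Set<string>): boolean {
  return validKeys.has(apiKey);
}

// Aggregate usage reported to central: counts only, no prompt or
// completion text ever enters this structure.
function recordUsage(
  counter: UsageCounter,
  promptTokens: number,
  completionTokens: number,
): UsageCounter {
  return {
    requests: counter.requests + 1,
    tokens: counter.tokens + promptTokens + completionTokens,
  };
}

const keys = new Set(["sk-alice", "sk-bob"]); // hypothetical member keys
let usage: UsageCounter = { requests: 0, tokens: 0 };

if (isAuthorized("sk-alice", keys)) {
  // e.g. a request that used 120 prompt + 480 completion tokens
  usage = recordUsage(usage, 120, 480);
}
```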

Inference engine

syndicAI uses vLLM as the primary inference engine for most models. vLLM provides:

  • Continuous batching: Multiple squad members' requests are batched efficiently, so 5–10 concurrent users experience minimal performance degradation
  • PagedAttention: Efficient GPU memory management for long context windows
  • Speculative decoding: Faster token generation for supported models
  • OpenAI-compatible API: Native /v1/chat/completions and /v1/models endpoints

For smaller models (like Qwen2.5-Coder-32B), llama.cpp may be used for its lower overhead.
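To make continuous batching concrete, here is a toy model of why it helps: a request that arrives while the engine is mid-generation joins the very next decode step instead of waiting for the current batch to drain. This is a deliberately simplified illustration; vLLM's actual scheduler is far more sophisticated.

```typescript
// Toy illustration of continuous batching: each decode step advances
// every active sequence by one token, and new arrivals join immediately.

interface Seq {
  id: string;
  remaining: number; // tokens left to generate
}

// One decode step: every active sequence emits a token; finished
// sequences leave the batch; arrivals join for the next step.
function step(active: Seq[], arrivals: Seq[]): Seq[] {
  const advanced = active
    .map((s) => ({ ...s, remaining: s.remaining - 1 }))
    .filter((s) => s.remaining > 0);
  return advanced.concat(arrivals);
}

let batch: Seq[] = [{ id: "a", remaining: 2 }];
// "b" arrives while "a" is still generating — no waiting for "a" to finish
batch = step(batch, [{ id: "b", remaining: 3 }]);
```

With static batching, "b" would have queued until "a"'s batch completed; with continuous batching it starts generating on the next step, which is why 5–10 concurrent squad members see little degradation.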

Lifecycle

A Squad Server goes through these states:

  1. Creating — GPU instance is being provisioned on the spot market
  2. Provisioning — Docker container is deploying, model weights are loading
  3. Active — Server is running and accepting requests
  4. Idle — No requests received for the idle timeout period; server is running but quiet
  5. Stopped — Server has been auto-stopped after reaching the daily GPU-hour limit
  6. Restarting — Server is spinning back up (on the next day or by manual trigger)

Auto-start and auto-stop ensure you consume GPU-hours only when your squad is actively coding.
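The six states above form a small state machine. The sketch below encodes one plausible set of transitions inferred from the descriptions (the actual control-plane logic may allow others):

```typescript
// Lifecycle states as a state machine. Transitions are inferred from
// the state descriptions above, not taken from the real control plane.

type SquadState =
  | "creating"
  | "provisioning"
  | "active"
  | "idle"
  | "stopped"
  | "restarting";

const transitions: Record<SquadState, SquadState[]> = {
  creating: ["provisioning"],
  provisioning: ["active"],
  active: ["idle", "stopped"],      // idle timeout, or daily limit hit
  idle: ["active", "stopped"],      // wakes on a request, or limit hit
  stopped: ["restarting"],          // next day, or manual trigger
  restarting: ["active"],
};

function canTransition(from: SquadState, to: SquadState): boolean {
  return transitions[from].includes(to);
}
```

Note that a stopped server cannot jump straight to active — it must pass through restarting, which matches the "spinning back up" step described above.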

Security model

  • TLS everywhere: All connections use HTTPS — from your tools to the satellite, and from the satellite to syndicAI's APIs
  • API key authentication: Each squad member has their own API key. Keys are validated against syndicAI's control plane on each request
  • Satellite-first data isolation: Token data (prompts, completions, code context) stays on the GPU node. syndicAI's central systems never see, store, or process your code
  • No logging of token content: The satellite reports aggregate usage metrics (token counts, request counts) but never logs the content of requests or responses
  • Managed infrastructure: GPU instances run in isolated containers with no shared tenancy. Your squad's server is yours alone
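The "management data only" boundary can be shown as a type-level contract: given a completed inference call, the satellite derives a report that structurally cannot carry content. Field names here are illustrative, not the satellite's real schema.

```typescript
// Sketch of the data-isolation boundary: the usage report type has no
// fields for prompt or completion text, so content cannot leak into it.
// Field names are illustrative.

interface InferenceCall {
  prompt: string;          // stays on the GPU node
  completion: string;      // stays on the GPU node
  promptTokens: number;
  completionTokens: number;
}

interface UsageReport {
  promptTokens: number;
  completionTokens: number;
  timestamp: number;
}

function toUsageReport(call: InferenceCall, now: number): UsageReport {
  // Deliberately copies only counts — never call.prompt or call.completion.
  return {
    promptTokens: call.promptTokens,
    completionTokens: call.completionTokens,
    timestamp: now,
  };
}

const report = toUsageReport(
  {
    prompt: "refactor this function",
    completion: "function refactored() { /* ... */ }",
    promptTokens: 42,
    completionTokens: 310,
  },
  1_700_000_000,
);
```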

GPU spot market

syndicAI provisions GPU instances from the spot market — the same datacenter-class hardware (NVIDIA A100, H100) used by major AI labs, available at a fraction of on-demand pricing.

The spot market makes high-end GPUs accessible:

  • A100 80GB: Typically $1.00–1.60/hour on the spot market
  • H100 80GB: Typically $2.00–3.20/hour on the spot market
  • Multi-GPU configurations: 2× or 4× GPU setups for larger models

syndicAI handles all the complexity of spot market provisioning — instance selection, availability monitoring, automatic migration if a spot instance is reclaimed, and graceful shutdown/restart.
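The spot-price ranges above make back-of-envelope budgeting straightforward. As a worked example (usage figures are hypothetical), a squad running a single A100 80GB for 6 GPU-hours a day at $1.30/hour:

```typescript
// Back-of-envelope squad cost at spot prices. The 6 hours/day and
// $1.30/hour figures are example inputs, not syndicAI quotes.

function dailyCost(gpuHours: number, pricePerHour: number): number {
  return gpuHours * pricePerHour;
}

const perDay = dailyCost(6, 1.3);   // ≈ $7.80/day
const perMonth = perDay * 30;       // ≈ $234/month for the whole squad
```

Split across a 5-person squad, that is well under $50 per member per month, which is the economics the daily GPU-hour limit and auto-stop behavior are designed to protect.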