Briefing · April 2026

Ubicloud AI

Open-source inference endpoints, EuroGPT Enterprise,
and B200 virtualization

TL;DR — AI Strategy

  • Pivoted from raw GPU rentals to managed inference PaaS — GPU GitHub Actions runners deprecated Dec 31, 2025; GPU VMs repositioned as private/enterprise-only
  • Open-weight only — no Claude/GPT/Gemini re-hosting; every model on the platform is open-weight
  • Three product surfaces: inference endpoints (dev API), EuroGPT Enterprise (SaaS), private B200 VMs (enterprise/BYOC)
  • Production runtime: vLLM V1 with FlashAttention-3, FlashInfer, speculative decoding, prefix caching
  • Signature technical work: open-source virtualization of NVIDIA HGX B200 using QEMU 10.1+ plus NVIDIA Fabric Manager's Shared NVSwitch Multitenancy
  • AI footprint: Germany (Falkenstein, Helsinki, EuroGPT processing) + Türkiye Istanbul Private Location for B200

Product Surface

Two API surfaces:

| Surface | Base URL | Purpose | Auth |
| --- | --- | --- | --- |
| Management | https://api.ubicloud.com | Manage API keys, endpoints, projects | Bearer JWT |
| Inference data plane | https://{model}.ai.ubicloud.com/v1 | OpenAI-compatible inference | Bearer API key |

Per-model subdomain pattern — each model gets its own hostname (e.g. llama-3-3-70b-turbo.ai.ubicloud.com/v1). There is no unified inference host.

SDK support: any OpenAI-compatible SDK (Python openai, JS); first-party Ruby SDK + ubi CLI (beta).

Free tier: 500,000 tokens / month.
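
Because each model has its own hostname, clients are configured per model rather than per provider. A minimal sketch of building the data-plane URL and an OpenAI-compatible chat request, stdlib only (the model ID and key are illustrative; nothing is sent on the wire):

```python
import json

def inference_base_url(model_id: str) -> str:
    # Per-model subdomain pattern: {model}.ai.ubicloud.com/v1
    return f"https://{model_id}.ai.ubicloud.com/v1"

def chat_request(model_id: str, prompt: str, api_key: str):
    """Build URL, headers, and body for POST /v1/chat/completions."""
    url = inference_base_url(model_id) + "/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",  # data-plane API key, not the management JWT
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": model_id,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return url, headers, body

url, _, _ = chat_request("llama-3-3-70b-turbo", "Hello", "YOUR_API_KEY")
print(url)  # https://llama-3-3-70b-turbo.ai.ubicloud.com/v1/chat/completions
```

The same URL drops straight into any OpenAI SDK as `base_url`, with the platform API key as the client key.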

OpenAI Compatibility

Documented and working against the per-model base URL:

  • POST /v1/chat/completions — non-streaming
  • POST /v1/chat/completions with stream=True — SSE streaming
  • POST /v1/chat/completions with response_format={"type":"json_object"} — JSON mode
  • POST /v1/chat/completions with tools=[...], tool_choice="auto" — function/tool calling
  • /v1/embeddings — implied by Qwen3-Embedding-8B launch (endpoint path not explicitly documented)

Not documented or not offered: /v1/completions (legacy), /v1/models on data plane, audio, image, batch API, fine-tuning API.
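
The documented variants differ only in the request body. A sketch of the three payload shapes, following the OpenAI chat-completions schema (the model ID and tool schema are illustrative; nothing is sent):

```python
BASE = {
    "model": "llama-3-3-70b-turbo",
    "messages": [{"role": "user", "content": "What's the weather in Istanbul?"}],
}

# SSE streaming: same endpoint, chunked responses
streaming = {**BASE, "stream": True}

# JSON mode: constrain output to a single JSON object
json_mode = {**BASE, "response_format": {"type": "json_object"}}

# Tool calling: advertise a function schema; the model may answer with a tool_call
tool_calling = {
    **BASE,
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool, not a platform built-in
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    "tool_choice": "auto",
}
```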

Model Catalog (Confirmed Public)

| Model ID | Family | Role | First seen |
| --- | --- | --- | --- |
| llama-3-3-70b-turbo | Llama 3.3 70B | Chat | Feb 2025 |
| mistral-small-3 | Mistral Small 3 (24B) | Chat | Feb 2025 |
| ds-r1-qwen-32b | DeepSeek-R1-Distill-Qwen-32B | Reasoning | Feb–Mar 2025 |
| DeepSeek V3 | DeepSeek V3 | Chat | Jun 2025 |
| DeepSeek R1 | DeepSeek R1 | Reasoning | Jun 2025 |
| Qwen2.5-VL-72B | Qwen 2.5 VL | Vision-language | Jul 2025 |
| Qwen3 VL | Qwen 3 VL | Vision-language | Oct 2025 |
| Qwen3-Embedding-8B | Qwen 3 Embedding | Text embeddings | Mar 2026 |
| Llama Guard 3 | Meta | Moderation (EuroGPT) | Nov 2024 |
| Llama 3.1 405B | Meta | Chat (EuroGPT) | Nov 2024 |

Open-weight only. No Llama 4 in public materials. Context windows and quantization not published per-model.

Public Pricing

Per-token pricing is dashboard-only for most chat models; only two models are publicly priced on the web:

| Model | Price | Notes |
| --- | --- | --- |
| Qwen2.5-VL-72B | $0.80 / M tokens (input + output) | Jul 2025 |
| Qwen3-Embedding-8B | $0.05 / M input tokens | Mar 2026 |
| Free tier | 500k tokens / month | Feb 2025 |

March 2026 addition: a new GET /project/{id}/inference-endpoint API returns the full price table programmatically, with separate per_million_prompt_tokens and per_million_completion_tokens fields.
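
With separate prompt and completion rates, per-request cost is a weighted sum; a sketch using the two documented field names (the rates below are made up; real ones come from the API):

```python
def request_cost(prompt_tokens: int, completion_tokens: int,
                 per_million_prompt_tokens: float,
                 per_million_completion_tokens: float) -> float:
    """Dollar cost of one request from per-million-token rates."""
    return (prompt_tokens * per_million_prompt_tokens
            + completion_tokens * per_million_completion_tokens) / 1_000_000

# Illustrative rates only:
cost = request_cost(12_000, 800,
                    per_million_prompt_tokens=0.50,
                    per_million_completion_tokens=1.50)
print(f"${cost:.4f}")  # $0.0072
```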

Positioning claims (Ubicloud-authored): "3–10x lower than comparable offerings" for cloud overall; "3x lower than US alternatives" for EuroGPT. "10x cheaper than OpenAI" is NOT a Ubicloud claim — that phrasing came from third-party research.

Hardware Stack

| GPU | Status | First public mention |
| --- | --- | --- |
| NVIDIA A100 | Preview (Germany) | May 2025 |
| NVIDIA H100 | Production (prior GPU VMs) | |
| NVIDIA HGX B200 | Production (Türkiye Istanbul, on request) | Oct 2025 |
| NVIDIA RTX PRO 6000 | On request | Dec 2025 |

Not offered in public materials: H200, L40S, MI300X.

B200 partitioning via Shared NVSwitch Multitenancy

| Partition size | When added |
| --- | --- |
| 1-GPU, 2-GPU | Oct 2025 launch |
| 4-GPU, 8-GPU | Nov 2025 |

Inside a partition: full NVLink/NVSwitch bandwidth. Across partitions: isolated. Fabric Manager enforces routing.

B200 Virtualization — Signature Tech Work

Ubicloud wrote the "missing manual" on open-source virtualization of NVIDIA HGX B200. Stack:

  • QEMU 10.1+ (not Cloud Hypervisor) — B200 needs multi-level PCIe topology that Cloud Hypervisor's flat topology can't produce; 10.1 added BAR-mapping optimizations critical for B200's 256 GB Region 2 BAR per GPU
  • VFIO-PCI passthrough — vfio-pci.ids=10de:2901, intel_iommu=on iommu=pt; blacklist nouveau/nvidia/nvidia_drm
  • nvidia-open driver on guest (proprietary stack can't drive B200)
  • NVIDIA Fabric Manager in FABRIC_MODE=1 (Shared NVSwitch Multitenancy) on host; fmpm CLI for partition management
  • Host/guest driver versions must match exactly (e.g., 580.95.05)

Competitive point: the entire stack is open source, so operators can replicate it. The writeup reached the HN front page on Dec 15, 2025.
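
The host-side pieces above reduce to a small amount of configuration. A sketch under common distro conventions (the device ID 10de:2901 and FABRIC_MODE=1 are the values quoted; exact file locations vary by distribution):

```
# Kernel cmdline (e.g. GRUB_CMDLINE_LINUX): IOMMU passthrough; vfio-pci claims the B200s
intel_iommu=on iommu=pt vfio-pci.ids=10de:2901

# modprobe blacklist: keep host GPU drivers off the passthrough devices
blacklist nouveau
blacklist nvidia
blacklist nvidia_drm

# fabricmanager.cfg: Shared NVSwitch Multitenancy mode on the host
FABRIC_MODE=1
```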

vLLM V1 Internals

Production runtime is vLLM V1. Three main components:

  • AsyncLLM — async wrapper for tokenization/detokenization; talks to the engine over IPC, bypassing the Python GIL
  • EngineCore — busy loop: pull from input queue, run scheduler + one forward pass per step
  • Scheduler — continuous batching via max_num_batched_tokens; all requests finish prefill before decode
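
The scheduler's token-budget behavior can be sketched in a few lines. A toy model, assuming each step charges one decode token per running request and then admits waiting prefills FCFS under max_num_batched_tokens (the real V1 scheduler also handles preemption, chunked prefill, and KV-cache pressure):

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: str
    prompt_tokens: int

def schedule_step(waiting: deque, running: list, max_num_batched_tokens: int) -> int:
    """One EngineCore-style step: decode tokens for running requests,
    then admit waiting prefills FCFS while the token budget allows."""
    budget = max_num_batched_tokens - len(running)  # decode costs 1 token/request
    while waiting and waiting[0].prompt_tokens <= budget:
        req = waiting.popleft()
        budget -= req.prompt_tokens                 # prefill consumes the full prompt
        running.append(req)
    return max_num_batched_tokens - budget          # tokens batched this step

waiting = deque([Request("a", 700), Request("b", 500), Request("c", 100)])
running: list = []
print(schedule_step(waiting, running, 1024))  # 700: only "a" fits; "b" waits (FCFS)
print(schedule_step(waiting, running, 1024))  # 601: "a" decodes 1, "b" + "c" prefill
```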

Optimization layer

  • FlashAttention-3 for forward passes
  • FlashInfer (integrated Feb 2025) as high-performance kernel generator
  • PagedAttention-lineage block-based KV cache, dynamically allocated
  • Speculative decoding on DeepSeek R1 32B (Mar 2025)
  • Prefix caching referenced in Dewey.py deep-research demo
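
Prefix caching works because the KV cache is block-based: keying each block on everything before it lets requests that share a prompt prefix reuse the same blocks. A toy sketch (block size and keying are illustrative, not vLLM's actual scheme):

```python
BLOCK = 4  # tokens per KV block (vLLM uses larger blocks, e.g. 16)

class PrefixCache:
    def __init__(self):
        self.blocks = {}  # chained prefix key -> block id
        self.next_id = 0

    def lookup_or_allocate(self, tokens):
        """Return (cached token count, block ids); full blocks only."""
        hits, ids, prefix = 0, [], ()
        for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
            prefix = prefix + tuple(tokens[i:i + BLOCK])  # key covers the whole prefix
            if prefix in self.blocks:
                hits += BLOCK                             # KV already computed, reuse it
            else:
                self.blocks[prefix] = f"blk{self.next_id}"
                self.next_id += 1
            ids.append(self.blocks[prefix])
        return hits, ids

cache = PrefixCache()
shared = list(range(8))                               # e.g. a shared system prompt
cache.lookup_or_allocate(shared + [100, 101, 102, 103])
hits, _ = cache.lookup_or_allocate(shared + [200, 201, 202, 203])
print(hits)  # 8: the shared 8-token prefix is served from cache
```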

Not covered publicly: multi-worker load balancing, health checks, auto-restart, model hot-swap.

EuroGPT Enterprise

The consumer/SaaS face of Ubicloud AI. Available at eurogpt.ubicloud.com.

  • €19 per user per month — framed as 3x cheaper than ChatGPT Enterprise / Copilot
  • LLM: Meta Llama 3.1 405B (open weights)
  • Moderation: Llama Guard 3 (optional, input + output)
  • Embeddings: E5-Mistral-7B for RAG with a private knowledge base
  • Web search: DuckDuckGo (privacy-preserving)
  • Data residency: "Data remains in Germany, including all GPU processing"
  • Training: "No customer data or metadata used for training purposes"
  • Security: encryption in transit + envelope encryption at rest, key rotation, file upload
  • SSO: OIDC at platform level (Jul 2025); EuroGPT-specific SSO not explicitly documented

Not disclosed: quantization of the 405B deployment. Not offered: private API for EuroGPT — raw API consumers use Inference Endpoints directly.

Strategic Pivot: GPU Rentals → Inference PaaS

Before (2024)

Offered raw GPU rentals (RTX 4000 Ada / H100) as GitHub Actions runners and GPU VMs.

Inflection (2025)

Recognized the CapEx-heavy raw-GPU race against CoreWeave, Lambda, AWS P5, Azure NDv5 as structurally unviable for a seed-stage company. Moved up-stack to managed inference PaaS + dedicated enterprise GPU (private locations).

After (Dec 31, 2025)

  • GPU GitHub Actions runners deprecated
  • GPU VMs repositioned as private/enterprise deployments (B200, RTX PRO 6000 on request)
  • Open-weight inference endpoints become the primary AI front door
  • EuroGPT Enterprise becomes the productized SaaS face

Implication: Ubicloud is no longer competing on GPU-hours; it is competing on tokens and on the quality of the managed inference stack.

Positioning

| Competitor class | Examples | Ubicloud's angle |
| --- | --- | --- |
| Closed-model LLM vendors | OpenAI, Anthropic | Open-weight only; lower price; EU residency; no training use |
| Fast-inference specialists | Groq, Together, Fireworks, DeepInfra | Same model class; adds full IaaS underneath + EuroGPT SaaS on top |
| GPU clouds | CoreWeave, Lambda, AWS P5 | Open-source B200 virtualization; control plane on GitHub; BYOC option |
| GPU-on-demand | RunPod, Vast.ai | Managed-first; GDPR-native; EuroGPT SaaS |
| European sovereign AI | Mistral La Plateforme, Aleph Alpha | Broader IaaS (compute + K8s + Postgres) beyond just models |

Differentiators actually claimable

  • End-to-end AGPL-3.0 stack (hypervisor → vLLM → UI)
  • Proven B200 virtualization (with public technical writeup)
  • Germany-resident EuroGPT turnkey product
  • Strong Postgres heritage → good RAG / vector story when paired with managed Postgres

Gaps & What's Missing

  • No public SLA, rate limits, latency, or throughput numbers for inference endpoints
  • No public per-token pricing for chat/reasoning models (only Qwen2.5-VL and Qwen3-Embedding are priced on the web) — dashboard-only
  • Not offered: batch inference API, fine-tuning / LoRA, image generation, audio (Whisper/TTS), multimodal beyond vision-language input
  • No public EU AI Act role classification (provider vs deployer) despite operating EuroGPT and open inference
  • No named AI customers in public materials; no case studies beyond Ubicloud's own Dewey.py deep-research demo
  • No benchmarks vs CoreWeave / Lambda / AWS P5 on B200 workloads; vs OpenAI / Groq / Together on inference throughput or latency
  • Istanbul B200 hosting provider not publicly named — framed as "Private Location" / on-request

Key AI Sources