Briefing · April 2026
Ubicloud AI
Open-source inference endpoints, EuroGPT Enterprise,
and B200 virtualization
TL;DR — AI Strategy
- Pivoted from raw GPU rentals to managed inference PaaS — GPU GitHub Actions runners deprecated Dec 31, 2025; GPU VMs repositioned as private/enterprise-only
- Open-weight only — no Claude/GPT/Gemini re-hosting; every model on the platform is open-weight
- Three product surfaces: inference endpoints (dev API), EuroGPT Enterprise (SaaS), private B200 VMs (enterprise/BYOC)
- Production runtime: vLLM V1 with FlashAttention-3, FlashInfer, speculative decoding, prefix caching
- Signature technical work: open-source virtualization of NVIDIA HGX B200 using QEMU 10.1+ and NVIDIA Fabric Manager's Shared NVSwitch Multitenancy
- AI footprint: Falkenstein (Germany) and Helsinki data centers, with EuroGPT processing in Germany; Türkiye Istanbul Private Location for B200
Product Surface
Two API surfaces:
| Surface | Base URL | Purpose | Auth |
| --- | --- | --- | --- |
| Management | https://api.ubicloud.com | Manage API keys, endpoints, projects | Bearer JWT |
| Inference data plane | https://{model}.ai.ubicloud.com/v1 | OpenAI-compatible inference | Bearer API key |
Per-model subdomain pattern — each model gets its own hostname (e.g. llama-3-3-70b-turbo.ai.ubicloud.com/v1). There is no unified inference host.
SDK support: any OpenAI-compatible SDK (Python openai, JS); first-party Ruby SDK + ubi CLI (beta).
Free tier: 500,000 tokens / month.
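Because of the per-model subdomain pattern, pointing any OpenAI-compatible SDK at a model is just a base-URL swap. A minimal sketch (model IDs from the catalog below; the helper function itself is illustrative, no network calls):

```python
# Build the per-model inference base URL used by OpenAI-compatible SDKs.
# The {model}.ai.ubicloud.com/v1 pattern is documented; this helper is not.
def inference_base_url(model_id: str) -> str:
    return f"https://{model_id}.ai.ubicloud.com/v1"

# An OpenAI-compatible client would then be configured roughly as:
#   OpenAI(base_url=inference_base_url("llama-3-3-70b-turbo"), api_key=...)
print(inference_base_url("llama-3-3-70b-turbo"))
```

Since there is no unified inference host, switching models means constructing a new client against a new hostname, not changing a `model` field alone.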
OpenAI Compatibility
Documented and working against the per-model base URL:
POST /v1/chat/completions — non-streaming
POST /v1/chat/completions with stream=True — SSE streaming
POST /v1/chat/completions with response_format={"type":"json_object"} — JSON mode
POST /v1/chat/completions with tools=[...], tool_choice="auto" — function/tool calling
/v1/embeddings — implied by Qwen3-Embedding-8B launch (endpoint path not explicitly documented)
Not documented or not offered: /v1/completions (legacy), /v1/models on data plane, audio, image, batch API, fine-tuning API.
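The documented features map onto standard OpenAI request bodies. A hedged sketch that assembles (but does not send) a tool-calling request; the schema follows the OpenAI API, and the weather tool is a made-up example:

```python
import json

# OpenAI-compatible /v1/chat/completions body exercising tool calling.
# "get_weather" is a hypothetical tool, not part of the platform.
body = {
    "model": "llama-3-3-70b-turbo",
    "messages": [{"role": "user", "content": "Weather in Berlin?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    "tool_choice": "auto",
}
payload = json.dumps(body)  # POSTed with Authorization: Bearer <API key>
print(sorted(body.keys()))
```

Streaming (`stream=True`) and JSON mode (`response_format={"type": "json_object"}`) are additional top-level fields on the same body.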
Model Catalog (Confirmed Public)
| Model ID | Family | Role | First seen |
| --- | --- | --- | --- |
| llama-3-3-70b-turbo | Llama 3.3 70B | Chat | Feb 2025 |
| mistral-small-3 | Mistral Small 3 (24B) | Chat | Feb 2025 |
| ds-r1-qwen-32b | DeepSeek-R1-Distill-Qwen-32B | Reasoning | Feb–Mar 2025 |
| DeepSeek V3 | DeepSeek V3 | Chat | Jun 2025 |
| DeepSeek R1 | DeepSeek R1 | Reasoning | Jun 2025 |
| Qwen2.5-VL-72B | Qwen 2.5 VL | Vision-language | Jul 2025 |
| Qwen3 VL | Qwen 3 VL | Vision-language | Oct 2025 |
| Qwen3-Embedding-8B | Qwen 3 Embedding | Text embeddings | Mar 2026 |
| Llama Guard 3 | Meta | Moderation (EuroGPT) | Nov 2024 |
| Llama 3.1 405B | Meta | Chat (EuroGPT) | Nov 2024 |
Open-weight only. No Llama 4 in public materials. Context windows and quantization not published per-model.
Public Pricing
Per-token pricing is dashboard-only for most chat models; only two models are publicly priced on the web:

| Model | Price | Notes |
| --- | --- | --- |
| Qwen2.5-VL-72B | $0.80 / M tokens (input + output) | Jul 2025 |
| Qwen3-Embedding-8B | $0.05 / M input tokens | Mar 2026 |
| Free tier | 500k tokens / month | Feb 2025 |
March 2026 addition: a new `GET /project/{id}/inference-endpoint` API returns the full price table programmatically, with separate `per_million_prompt_tokens` and `per_million_completion_tokens` fields.
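Given those two per-million fields, cost accounting is simple arithmetic. The response row below is an assumed shape (only the two field names come from the March 2026 API addition); the embedding price matches the public $0.05/M figure:

```python
# Hypothetical row from GET /project/{id}/inference-endpoint. The
# per_million_* field names are documented; the surrounding shape is assumed.
price = {"model": "Qwen3-Embedding-8B",
         "per_million_prompt_tokens": 0.05,
         "per_million_completion_tokens": 0.0}

def cost_usd(price: dict, prompt_tokens: int, completion_tokens: int) -> float:
    # Cost = tokens/1M * per-million rate, summed over prompt and completion.
    return (prompt_tokens / 1e6 * price["per_million_prompt_tokens"]
            + completion_tokens / 1e6 * price["per_million_completion_tokens"])

print(cost_usd(price, 2_000_000, 0))  # 2M embedding input tokens
```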
Positioning claims (Ubicloud-authored): "3–10x lower than comparable offerings" for cloud overall; "3x lower than US alternatives" for EuroGPT. "10x cheaper than OpenAI" is NOT a Ubicloud claim — that phrasing came from third-party research.
Hardware Stack
| GPU | Status | First public mention |
| --- | --- | --- |
| NVIDIA A100 | Preview (Germany) | May 2025 |
| NVIDIA H100 | Production (prior GPU VMs) | — |
| NVIDIA HGX B200 | Production (Türkiye Istanbul, on request) | Oct 2025 |
| NVIDIA RTX PRO 6000 | On request | Dec 2025 |
Not offered in public materials: H200, L40S, MI300X.
B200 partitioning via Shared NVSwitch Multitenancy
| Partition size | When added |
| --- | --- |
| 1-GPU, 2-GPU | Oct 2025 launch |
| 4-GPU, 8-GPU | Nov 2025 |
Inside a partition: full NVLink/NVSwitch bandwidth. Across partitions: isolated. Fabric Manager enforces routing.
B200 Virtualization — Signature Tech Work
Ubicloud wrote the "missing manual" on open-source virtualization of NVIDIA HGX B200. Stack:
- QEMU 10.1+ (not Cloud Hypervisor) — B200 needs multi-level PCIe topology that Cloud Hypervisor's flat topology can't produce; 10.1 added BAR-mapping optimizations critical for B200's 256 GB Region 2 BAR per GPU
- VFIO-PCI passthrough — `vfio-pci.ids=10de:2901`, `intel_iommu=on iommu=pt`; blacklist nouveau/nvidia/nvidia_drm on the host
- nvidia-open driver in the guest (the proprietary driver stack can't drive B200)
- NVIDIA Fabric Manager in `FABRIC_MODE=1` (Shared NVSwitch Multitenancy) on the host; `fmpm` CLI for partition management
- Host and guest driver versions must match exactly (e.g., 580.95.05)
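The host-side items above translate into a kernel command line plus a modprobe blacklist roughly like the following. The file paths and exact layout are an assumption; the PCI ID, IOMMU flags, and blacklisted modules are the ones listed above:

```
# /etc/default/grub — enable IOMMU passthrough and bind the B200
# (PCI ID 10de:2901) to vfio-pci at boot
GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt vfio-pci.ids=10de:2901"

# /etc/modprobe.d/blacklist-nvidia.conf — keep host GPU drivers off the device
blacklist nouveau
blacklist nvidia
blacklist nvidia_drm
```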
Competitive point: entire stack is open source; operators can replicate it. Reached HN front page Dec 15, 2025.
vLLM V1 Internals
Production runtime is vLLM V1. Three main components:
- AsyncLLM — async wrapper for tokenization/detokenization; talks to engine via IPC (bypasses Python GIL)
- EngineCore — busy loop: pull from input queue, run scheduler + one forward pass per step
- Scheduler — continuous batching via `max_num_batched_tokens`; all requests finish prefill before decode
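The scheduler's token-budget behavior can be sketched as a toy loop. This is a simplification: `max_num_batched_tokens` is the real knob; the request representation, chunking rule, and everything else here is illustrative:

```python
# Toy continuous-batching step: pack waiting requests into one batch under a
# max_num_batched_tokens budget, the way vLLM's scheduler caps per-step work.
# Real vLLM also handles preemption, KV-block accounting, decode requests, etc.
def schedule_step(waiting: list[tuple[str, int]], max_num_batched_tokens: int):
    batch, budget = [], max_num_batched_tokens
    for req_id, ntokens in waiting:
        take = min(ntokens, budget)  # a long prefill may be cut to fit
        if take == 0:
            break                    # budget exhausted for this step
        batch.append((req_id, take))
        budget -= take
    return batch

print(schedule_step([("a", 3000), ("b", 2000), ("c", 500)], 4096))
```

Each engine step then runs one forward pass over exactly the tokens the scheduler admitted, which is what keeps per-step latency bounded regardless of queue depth.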
Optimization layer
- FlashAttention-3 for forward passes
- FlashInfer (integrated Feb 2025) as high-performance kernel generator
- PagedAttention-lineage block-based KV cache, dynamically allocated
- Speculative decoding on DeepSeek R1 32B (Mar 2025)
- Prefix caching referenced in Dewey.py deep-research demo
Not covered publicly: multi-worker load balancing, health checks, auto-restart, model hot-swap.
EuroGPT Enterprise
The consumer/SaaS face of Ubicloud AI. Available at eurogpt.ubicloud.com.
- €19 per user per month — framed as 3x cheaper than ChatGPT Enterprise / Copilot
- LLM: Meta Llama 3.1 405B (open weights)
- Moderation: Llama Guard 3 (optional, input + output)
- Embeddings: E5 Mistral 7B for RAG with a private knowledge base
- Web search: DuckDuckGo (privacy-preserving)
- Data residency: "Data remains in Germany, including all GPU processing"
- Training: "No customer data or metadata used for training purposes"
- Security: encryption in transit + envelope encryption at rest, key rotation, file upload
- SSO: OIDC at platform level (Jul 2025); EuroGPT-specific SSO not explicitly documented
Not disclosed: quantization of the 405B deployment. Not offered: private API for EuroGPT — raw API consumers use Inference Endpoints directly.
Strategic Pivot: GPU Rentals → Inference PaaS
Before (2024)
Offered raw GPU rentals (RTX 4000 Ada / H100) as GitHub Actions runners and GPU VMs.
Inflection (2025)
Recognized the CapEx-heavy raw-GPU race against CoreWeave, Lambda, AWS P5, Azure NDv5 as structurally unviable for a seed-stage company. Moved up-stack to managed inference PaaS + dedicated enterprise GPU (private locations).
After (Dec 31, 2025)
- GPU GitHub Actions runners deprecated
- GPU VMs repositioned as private/enterprise deployments (B200, RTX PRO 6000 on request)
- Open-weight inference endpoints become the primary AI front door
- EuroGPT Enterprise becomes the productized SaaS face
Implication: Ubicloud is no longer competing on GPU-hours; it is competing on tokens and on the quality of the managed inference stack.
Positioning
| Competitor class | Examples | Ubicloud's angle |
| --- | --- | --- |
| Closed-model LLM vendors | OpenAI, Anthropic | Open-weight only; lower price; EU residency; no training use |
| Fast-inference specialists | Groq, Together, Fireworks, DeepInfra | Same model class; adds full IaaS underneath + EuroGPT SaaS on top |
| GPU clouds | CoreWeave, Lambda, AWS P5 | Open-source B200 virtualization; control plane on GitHub; BYOC option |
| GPU-on-demand | RunPod, Vast.ai | Managed-first; GDPR-native; EuroGPT SaaS |
| European sovereign AI | Mistral (La Plateforme), Aleph Alpha | Broader IaaS (compute + K8s + Postgres) beyond just models |
Differentiators actually claimable
- End-to-end AGPL-3.0 stack (hypervisor → vLLM → UI)
- Proven B200 virtualization (with public technical writeup)
- Germany-resident EuroGPT turnkey product
- Strong Postgres heritage → good RAG / vector story when paired with managed Postgres
Gaps & What's Missing
- No public SLA, rate limits, latency, or throughput numbers for inference endpoints
- No public per-token pricing for chat/reasoning models (only Qwen2.5-VL and Qwen3-Embedding priced on web) — dashboard-only
- Not offered: batch inference API, fine-tuning / LoRA, image generation, audio (Whisper/TTS), multimodal beyond vision-language input
- No public EU AI Act role classification (provider vs deployer) despite operating EuroGPT and open inference
- No named AI customers in public materials; no case studies beyond Ubicloud's own Dewey.py deep-research demo
- No benchmarks vs CoreWeave / Lambda / AWS P5 on B200 workloads; vs OpenAI / Groq / Together on inference throughput or latency
- Istanbul B200 hosting provider not publicly named — framed as "Private Location" / on-request