Briefing · April 2026
Ubicloud AI
Open-source inference endpoints, EuroGPT Enterprise,
and B200 virtualization
TL;DR — AI Strategy
- Pivoted from raw GPU rentals to managed inference PaaS — GPU GitHub Actions runners deprecated Dec 31, 2025; GPU VMs repositioned as private/enterprise-only
- Open-weight only — no Claude/GPT/Gemini re-hosting; every model on the platform is open-weight
- Three product surfaces: inference endpoints (dev API), EuroGPT Enterprise (SaaS), private B200 VMs (enterprise/BYOC)
- Production runtime: vLLM V1 with FlashAttention-3, FlashInfer, speculative decoding, prefix caching
- Signature technical work: open-source virtualization of NVIDIA HGX B200 using QEMU 10.1+ and NVIDIA Fabric Manager's Shared NVSwitch Multitenancy
- AI footprint: Falkenstein (Germany) and Helsinki data centers, with EuroGPT processing in Germany; Türkiye Istanbul Private Location for B200
Product Surface
Two API surfaces:
| Surface | Base URL | Purpose | Auth |
| --- | --- | --- | --- |
| Management | https://api.ubicloud.com | Manage API keys, endpoints, projects | Bearer JWT |
| Inference data plane | https://{model}.ai.ubicloud.com/v1 | OpenAI-compatible inference | Bearer API key |
Per-model subdomain pattern — each model gets its own hostname (e.g. llama-3-3-70b-turbo.ai.ubicloud.com/v1). There is no unified inference host.
SDK support: any OpenAI-compatible SDK (Python openai, JS); first-party Ruby SDK + ubi CLI (beta).
Free tier: 500,000 tokens / month.
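Because of the per-model subdomain pattern, pointing any OpenAI-compatible SDK at a model is just a base-URL swap. A minimal sketch (model IDs from the catalog below; the helper function itself is illustrative, no network calls):

```python
# Build the per-model inference base URL used by OpenAI-compatible SDKs.
# The {model}.ai.ubicloud.com/v1 pattern is documented; this helper is not.
def inference_base_url(model_id: str) -> str:
    return f"https://{model_id}.ai.ubicloud.com/v1"

# An OpenAI-compatible client would then be configured roughly as:
#   OpenAI(base_url=inference_base_url("llama-3-3-70b-turbo"), api_key=...)
print(inference_base_url("llama-3-3-70b-turbo"))
```

Since there is no unified inference host, switching models means constructing a new client against a new hostname, not changing a `model` field alone.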
OpenAI Compatibility
Documented and working against the per-model base URL:
POST /v1/chat/completions — non-streaming
POST /v1/chat/completions with stream=True — SSE streaming
POST /v1/chat/completions with response_format={"type":"json_object"} — JSON mode
POST /v1/chat/completions with tools=[...], tool_choice="auto" — function/tool calling
/v1/embeddings — implied by Qwen3-Embedding-8B launch (endpoint path not explicitly documented)
Not documented or not offered: /v1/completions (legacy), /v1/models on data plane, audio, image, batch API, fine-tuning API.
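The documented features map onto standard OpenAI request bodies. A hedged sketch that assembles (but does not send) a tool-calling request; the schema follows the OpenAI API, and the weather tool is a made-up example:

```python
import json

# OpenAI-compatible /v1/chat/completions body exercising tool calling.
# "get_weather" is a hypothetical tool, not part of the platform.
body = {
    "model": "llama-3-3-70b-turbo",
    "messages": [{"role": "user", "content": "Weather in Berlin?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    "tool_choice": "auto",
}
payload = json.dumps(body)  # POSTed with Authorization: Bearer <API key>
print(sorted(body.keys()))
```

Streaming (`stream=True`) and JSON mode (`response_format={"type": "json_object"}`) are additional top-level fields on the same body.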
Model Catalog (Confirmed Public)
| Model ID | Family | Role | First seen |
| --- | --- | --- | --- |
| llama-3-3-70b-turbo | Llama 3.3 70B | Chat | Feb 2025 |
| mistral-small-3 | Mistral Small 3 (24B) | Chat | Feb 2025 |
| ds-r1-qwen-32b | DeepSeek-R1-Distill-Qwen-32B | Reasoning | Feb–Mar 2025 |
| DeepSeek V3 | DeepSeek V3 | Chat | Jun 2025 |
| DeepSeek R1 | DeepSeek R1 | Reasoning | Jun 2025 |
| Qwen2.5-VL-72B | Qwen 2.5 VL | Vision-language | Jul 2025 |
| Qwen3 VL | Qwen 3 VL | Vision-language | Oct 2025 |
| Qwen3-Embedding-8B | Qwen 3 Embedding | Text embeddings | Mar 2026 |
| Llama Guard 3 | Meta | Moderation (EuroGPT) | Nov 2024 |
| Llama 3.1 405B | Meta | Chat (EuroGPT) | Nov 2024 |
Open-weight only. No Llama 4 in public materials. Context windows and quantization not published per-model.
Public Pricing
Per-token pricing is dashboard-only for most chat models; only two models are publicly priced on the web:

| Model | Price | Notes |
| --- | --- | --- |
| Qwen2.5-VL-72B | $0.80 / M tokens (input + output) | Jul 2025 |
| Qwen3-Embedding-8B | $0.05 / M input tokens | Mar 2026 |
| Free tier | 500k tokens / month | Feb 2025 |
March 2026 addition: a new `GET /project/{id}/inference-endpoint` API returns the full price table programmatically, with separate `per_million_prompt_tokens` and `per_million_completion_tokens` fields.
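Given those two per-million fields, cost accounting is simple arithmetic. The response row below is an assumed shape (only the two field names come from the March 2026 API addition); the embedding price matches the public $0.05/M figure:

```python
# Hypothetical row from GET /project/{id}/inference-endpoint. The
# per_million_* field names are documented; the surrounding shape is assumed.
price = {"model": "Qwen3-Embedding-8B",
         "per_million_prompt_tokens": 0.05,
         "per_million_completion_tokens": 0.0}

def cost_usd(price: dict, prompt_tokens: int, completion_tokens: int) -> float:
    # Cost = tokens/1M * per-million rate, summed over prompt and completion.
    return (prompt_tokens / 1e6 * price["per_million_prompt_tokens"]
            + completion_tokens / 1e6 * price["per_million_completion_tokens"])

print(cost_usd(price, 2_000_000, 0))  # 2M embedding input tokens
```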
Positioning claims (Ubicloud-authored): "3–10x lower than comparable offerings" for cloud overall; "3x lower than US alternatives" for EuroGPT. "10x cheaper than OpenAI" is NOT a Ubicloud claim — that phrasing came from third-party research.
Hardware Stack
| GPU | Status | First public mention |
| --- | --- | --- |
| NVIDIA A100 | Preview (Germany) | May 2025 |
| NVIDIA H100 | Production (prior GPU VMs) | — |
| NVIDIA HGX B200 | Production (Türkiye Istanbul, on request) | Oct 2025 |
| NVIDIA RTX PRO 6000 | On request | Dec 2025 |
Not offered in public materials: H200, L40S, MI300X.
B200 partitioning via Shared NVSwitch Multitenancy
| Partition size | When added |
| --- | --- |
| 1-GPU, 2-GPU | Oct 2025 launch |
| 4-GPU, 8-GPU | Nov 2025 |
Inside a partition: full NVLink/NVSwitch bandwidth. Across partitions: isolated. Fabric Manager enforces routing.
B200 Virtualization — Signature Tech Work
Ubicloud wrote the "missing manual" on open-source virtualization of NVIDIA HGX B200. Stack:
- QEMU 10.1+ (not Cloud Hypervisor) — B200 needs multi-level PCIe topology that Cloud Hypervisor's flat topology can't produce; 10.1 added BAR-mapping optimizations critical for B200's 256 GB Region 2 BAR per GPU
- VFIO-PCI passthrough — `vfio-pci.ids=10de:2901`, `intel_iommu=on iommu=pt`; blacklist nouveau/nvidia/nvidia_drm on the host
- nvidia-open driver in the guest (the proprietary driver stack can't drive B200)
- NVIDIA Fabric Manager in `FABRIC_MODE=1` (Shared NVSwitch Multitenancy) on the host; `fmpm` CLI for partition management
- Host and guest driver versions must match exactly (e.g., 580.95.05)
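The host-side items above translate into a kernel command line plus a modprobe blacklist roughly like the following. The file paths and exact layout are an assumption; the PCI ID, IOMMU flags, and blacklisted modules are the ones listed above:

```
# /etc/default/grub — enable IOMMU passthrough and bind the B200
# (PCI ID 10de:2901) to vfio-pci at boot
GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt vfio-pci.ids=10de:2901"

# /etc/modprobe.d/blacklist-nvidia.conf — keep host GPU drivers off the device
blacklist nouveau
blacklist nvidia
blacklist nvidia_drm
```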
Competitive point: entire stack is open source; operators can replicate it. Reached HN front page Dec 15, 2025.
vLLM V1 Internals
Production runtime is vLLM V1. Three main components:
- AsyncLLM — async wrapper for tokenization/detokenization; talks to engine via IPC (bypasses Python GIL)
- EngineCore — busy loop: pull from input queue, run scheduler + one forward pass per step
- Scheduler — continuous batching via `max_num_batched_tokens`; all requests finish prefill before decode
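The scheduler's token-budget behavior can be sketched as a toy loop. This is a simplification: `max_num_batched_tokens` is the real knob; the request representation, chunking rule, and everything else here is illustrative:

```python
# Toy continuous-batching step: pack waiting requests into one batch under a
# max_num_batched_tokens budget, the way vLLM's scheduler caps per-step work.
# Real vLLM also handles preemption, KV-block accounting, decode requests, etc.
def schedule_step(waiting: list[tuple[str, int]], max_num_batched_tokens: int):
    batch, budget = [], max_num_batched_tokens
    for req_id, ntokens in waiting:
        take = min(ntokens, budget)  # a long prefill may be cut to fit
        if take == 0:
            break                    # budget exhausted for this step
        batch.append((req_id, take))
        budget -= take
    return batch

print(schedule_step([("a", 3000), ("b", 2000), ("c", 500)], 4096))
```

Each engine step then runs one forward pass over exactly the tokens the scheduler admitted, which is what keeps per-step latency bounded regardless of queue depth.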
Optimization layer
- FlashAttention-3 for forward passes
- FlashInfer (integrated Feb 2025) as high-performance kernel generator
- PagedAttention-lineage block-based KV cache, dynamically allocated
- Speculative decoding on DeepSeek R1 32B (Mar 2025)
- Prefix caching referenced in Dewey.py deep-research demo
Not covered publicly: multi-worker load balancing, health checks, auto-restart, model hot-swap.
EuroGPT Enterprise
The consumer/SaaS face of Ubicloud AI. Available at eurogpt.ubicloud.com.
- €19 per user per month — framed as 3x cheaper than ChatGPT Enterprise / Copilot
- LLM: Meta Llama 3.1 405B (open weights)
- Moderation: Llama Guard 3 (optional, input + output)
- Embeddings: E5 Mistral 7B for RAG with a private knowledge base
- Web search: DuckDuckGo (privacy-preserving)
- Data residency: "Data remains in Germany, including all GPU processing"
- Training: "No customer data or metadata used for training purposes"
- Security: encryption in transit + envelope encryption at rest, key rotation, file upload
- SSO: OIDC at platform level (Jul 2025); EuroGPT-specific SSO not explicitly documented
Not disclosed: quantization of the 405B deployment. Not offered: private API for EuroGPT — raw API consumers use Inference Endpoints directly.
Strategic Pivot: GPU Rentals → Inference PaaS
Before (2024)
Offered raw GPU rentals (RTX 4000 Ada / H100) as GitHub Actions runners and GPU VMs.
Inflection (2025)
Recognized the CapEx-heavy raw-GPU race against CoreWeave, Lambda, AWS P5, Azure NDv5 as structurally unviable for a seed-stage company. Moved up-stack to managed inference PaaS + dedicated enterprise GPU (private locations).
After (Dec 31, 2025)
- GPU GitHub Actions runners deprecated
- GPU VMs repositioned as private/enterprise deployments (B200, RTX PRO 6000 on request)
- Open-weight inference endpoints become the primary AI front door
- EuroGPT Enterprise becomes the productized SaaS face
Implication: Ubicloud is no longer competing on GPU-hours; it is competing on tokens and on the quality of the managed inference stack.
Positioning
| Competitor class | Examples | Ubicloud's angle |
| --- | --- | --- |
| Closed-model LLM vendors | OpenAI, Anthropic | Open-weight only; lower price; EU residency; no training use |
| Fast-inference specialists | Groq, Together, Fireworks, DeepInfra | Same model class; adds full IaaS underneath + EuroGPT SaaS on top |
| GPU clouds | CoreWeave, Lambda, AWS P5 | Open-source B200 virtualization; control plane on GitHub; BYOC option |
| GPU-on-demand | RunPod, Vast.ai | Managed-first; GDPR-native; EuroGPT SaaS |
| European sovereign AI | Mistral (La Plateforme), Aleph Alpha | Broader IaaS (compute + K8s + Postgres) beyond just models |
Differentiators actually claimable
- End-to-end AGPL-3.0 stack (hypervisor → vLLM → UI)
- Proven B200 virtualization (with public technical writeup)
- Germany-resident EuroGPT turnkey product
- Strong Postgres heritage → good RAG / vector story when paired with managed Postgres
Gaps & What's Missing
- No public SLA, rate limits, latency, or throughput numbers for inference endpoints
- No public per-token pricing for chat/reasoning models (only Qwen2.5-VL and Qwen3-Embedding priced on web) — dashboard-only
- Not offered: batch inference API, fine-tuning / LoRA, image generation, audio (Whisper/TTS), multimodal beyond vision-language input
- No public EU AI Act role classification (provider vs deployer) despite operating EuroGPT and open inference
- No named AI customers in public materials; no case studies beyond Ubicloud's own Dewey.py deep-research demo
- No benchmarks vs CoreWeave / Lambda / AWS P5 on B200 workloads; vs OpenAI / Groq / Together on inference throughput or latency
- Istanbul B200 hosting provider not publicly named — framed as "Private Location" / on-request