Two architectures dominate real-time AI avatar platforms in 2026. From the outside they look similar: the user speaks, the avatar responds. Underneath, they work in opposite ways and produce very different results across bandwidth, cost, latency, and deployment flexibility.
This is a breakdown of both from first principles — using the numbers that actually matter in production.
Architecture 1: Cloud Streaming
In a cloud-streaming architecture, the avatar lives on a server. The full pipeline — speech recognition, language model, text-to-speech, avatar animation, and frame rendering — runs server-side. The rendered output is encoded as a video stream and delivered to the client over WebRTC or a similar protocol.
The client is essentially a video player. It receives a stream, decodes frames, plays audio. The device’s GPU does nothing except basic video decode.
Data flow:
User speaks → Cloud: ASR → LLM → TTS → Avatar render → H.264 encode
↓
1–2 MB/s video stream to client
↓
Client: decode → display
Platforms using this architecture: HeyGen LiveAvatar, Tavus, Anam.ai, and most cloud-first interactive avatar systems.
What Cloud Streaming Gets Right
Rendering quality ceiling is high. A cloud GPU renders with no compromise — full polygon budgets, high-resolution textures. The client’s hardware capability is irrelevant because the device just decodes video.
Where Cloud Streaming Breaks Down
Bandwidth is non-negotiable. A video stream at standard quality requires 1–2 MB/s (8–16 Mbit/s) sustained per session. This is a physical constraint of video encoding at usable quality, not a configuration you can tune around. Packet loss causes visual artifacts. Jitter causes lip-sync drift. Insufficient bandwidth causes degradation or stalls.
Latency has a structural floor above 3 seconds. Every step in the cloud pipeline adds delay: the audio round trip, ASR processing, LLM inference, TTS synthesis, avatar rendering, video encoding, stream delivery, and client decode. Summed, these stages put traditional cloud-rendered avatar pipelines at more than 3 seconds end to end.
Per-session cost scales with GPU time. Cloud GPU rendering is the expensive component. The industry average for cloud-streamed interactive avatar sessions runs approximately $0.15/minute. At meaningful session volumes, this becomes the dominant infrastructure cost.
Connectivity is a hard dependency. If the connection degrades below the video stream’s minimum viable bitrate, the experience breaks. There is no graceful degraded mode — either the stream works or it doesn’t.
Architecture 2: On-Device Rendering (Spatius)
Spatius separates the rendering from the driving inference. The cloud GPU runs a lightweight driving model that takes TTS audio as input and outputs FLAME expression parameters — a compact mathematical description of how the avatar’s face should move at each moment. These parameters stream to the client at 10–20 KB/s.
The client runs AvatarKit, Spatius’s rendering SDK. AvatarKit receives the FLAME parameters, applies them to the 3DGS (3D Gaussian Splatting) avatar model stored on-device, and renders the result locally. Audio and visual output are aligned on-device.
Data flow:
[Customer-built: ASR → LLM → TTS audio]
↓
Spatius cloud: lightweight driving model → FLAME expression parameters
↓
10–20 KB/s parameter stream
↓
AvatarKit (client): 3DGS render + audio alignment → display
Important: Spatius does not provide ASR, LLM, or TTS. These are customer-built components. Spatius handles the driving model (cloud, lightweight GPU inference) and AvatarKit (client, rendering only — zero inference cost).
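To see why a parameter stream can be so small, it helps to work out how many bytes each animation frame gets at the quoted 10–20 KB/s. The sketch below is a back-of-the-envelope calculation, not a description of the actual wire format. It assumes parameters are sent at 25 frames per second, that 1 KB is 1,000 bytes, and that values are uncompressed 32-bit floats with no framing overhead. All three are assumptions made for illustration.

```python
def bytes_per_frame(stream_bytes_per_s: float, fps: float) -> float:
    """Average number of bytes available for one frame's parameters."""
    return stream_bytes_per_s / fps


def float32_values_per_frame(stream_bytes_per_s: float, fps: float) -> int:
    """How many 4-byte floats fit in one frame's budget (framing overhead ignored)."""
    return int(bytes_per_frame(stream_bytes_per_s, fps) // 4)


if __name__ == "__main__":
    ASSUMED_FPS = 25  # assumption: parameter rate matches the entry-level render rate
    for rate_kb in (10, 20):
        rate = rate_kb * 1000  # assumption: 1 KB = 1000 bytes
        print(
            f"{rate_kb} KB/s -> {bytes_per_frame(rate, ASSUMED_FPS):.0f} bytes/frame, "
            f"{float32_values_per_frame(rate, ASSUMED_FPS)} float32 values/frame"
        )
```

Under these assumptions each frame has a budget of a few hundred bytes, which is enough for a vector of expression coefficients but orders of magnitude too small for encoded pixels.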
What On-Device Rendering Gets Right
Bandwidth drops by ~99%. 10–20 KB/s versus 1–2 MB/s is not a marginal difference. It’s the difference between requiring dedicated fiber and working on shared 4G. A 20-device deployment needs roughly 200–400 KB/s total — negligible on any business internet connection.
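The bandwidth figures above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 KB = 1,000 bytes, 1 MB = 1,000,000 bytes); the function names are illustrative and not part of any SDK.

```python
KB = 1_000      # bytes, decimal units assumed
MB = 1_000_000  # bytes, decimal units assumed


def reduction_pct(before_bytes_per_s: float, after_bytes_per_s: float) -> float:
    """Percentage reduction in bandwidth going from `before` to `after`."""
    return (1 - after_bytes_per_s / before_bytes_per_s) * 100


def to_mbit_per_s(bytes_per_s: float) -> float:
    """Convert bytes per second to megabits per second."""
    return bytes_per_s * 8 / 1_000_000


def fleet_kb_per_s(devices: int, per_device_kb_per_s: float) -> float:
    """Total bandwidth for a deployment of identical devices, in KB/s."""
    return devices * per_device_kb_per_s


if __name__ == "__main__":
    print("reduction, low end :", round(reduction_pct(1 * MB, 10 * KB), 1), "%")
    print("reduction, high end:", round(reduction_pct(2 * MB, 20 * KB), 1), "%")
    print("video stream (Mbit/s):", to_mbit_per_s(1 * MB), "to", to_mbit_per_s(2 * MB))
    print("param stream (Mbit/s):", to_mbit_per_s(10 * KB), "to", to_mbit_per_s(20 * KB))
    print("20 devices (KB/s):", fleet_kb_per_s(20, 10), "to", fleet_kb_per_s(20, 20))
```

Converting to megabits per second makes the comparison with ISP-advertised link speeds direct: the video stream needs 8–16 Mbit/s per session, while the parameter stream needs well under 1 Mbit/s.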
End-to-end latency is under 1.5 seconds. On the Spatius side, going from audio input to avatar animation adds less than 300ms. With an optimized end-to-end voice AI stack, total response latency can drop below 1 second. Compared with the 3+ seconds of traditional cloud rendering, this is a fundamental difference in conversational feel.
Cost structure changes. The Spatius Scale plan runs $0.007/minute ($0.42/hour). The industry average for cloud-streamed avatar sessions is approximately $0.15/minute (about $9/hour), more than 20× higher. At those per-minute rates alone, a $5,000 budget buys roughly 11,900 hours of Spatius sessions, versus approximately 556 hours at the industry average.
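The cost comparison reduces to simple arithmetic on the two per-minute rates quoted above. The sketch below models only those rates; any plan fees, minimums, or volume discounts are not modeled, so treat the output as an approximation rather than a quote.

```python
SPATIUS_PER_MIN = 0.007   # USD per minute, Scale plan rate quoted in the text
INDUSTRY_PER_MIN = 0.15   # USD per minute, industry average quoted in the text


def hourly_rate(per_minute: float) -> float:
    """Convert a per-minute rate to a per-hour rate."""
    return per_minute * 60


def hours_for_budget(budget: float, per_minute: float) -> float:
    """Session hours a budget buys at a flat per-minute rate (no fees modeled)."""
    return budget / hourly_rate(per_minute)


if __name__ == "__main__":
    budget = 5_000
    print("Spatius  $/hr:", round(hourly_rate(SPATIUS_PER_MIN), 2))
    print("Industry $/hr:", round(hourly_rate(INDUSTRY_PER_MIN), 2))
    print("Rate ratio   :", round(INDUSTRY_PER_MIN / SPATIUS_PER_MIN, 1))
    print("Spatius hours :", round(hours_for_budget(budget, SPATIUS_PER_MIN)))
    print("Industry hours:", round(hours_for_budget(budget, INDUSTRY_PER_MIN)))
```

On the rates alone, the ratio comes out a little above 21×, and the same budget stretches from a few hundred hours to more than ten thousand.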
Entry-level hardware is sufficient. Because AvatarKit does only rendering and audio alignment, with no inference, entry-level SoCs handle the workload stably at 25fps. Mid-range hardware runs at 30–60fps. Officially supported chipsets include G88, S565, 8189, and RK3576, none of which require a dedicated GPU.
Network degradation affects latency, not rendering quality. A momentary connection drop delays the next FLAME parameter batch. The avatar stays on screen, rendered locally, so the user sees a brief pause in the avatar’s speech rather than a frozen or artifacted video frame.
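One way a client could get this behavior is to keep rendering from the most recent parameters whenever the next batch is late. The sketch below illustrates that hold-last-pose idea in isolation. `HoldLastPoseBuffer` and its methods are hypothetical names invented for this example; this is not AvatarKit's API and makes no claim about how AvatarKit is implemented.

```python
from collections import deque
from typing import Deque, Iterable, Optional, Sequence, Tuple

Params = Sequence[float]  # one frame of expression parameters (illustrative type)


class HoldLastPoseBuffer:
    """Hypothetical client-side buffer: plays queued frames, holds the last pose when starved."""

    def __init__(self) -> None:
        self._queue: Deque[Params] = deque()
        self._last: Optional[Params] = None

    def push_batch(self, frames: Iterable[Params]) -> None:
        """Called when a batch of parameter frames arrives from the network."""
        self._queue.extend(frames)

    def next_frame(self) -> Tuple[Optional[Params], bool]:
        """Called once per render tick. Returns (params, is_fresh).

        If the queue is empty (network stall), the previous pose is returned
        with is_fresh=False so the renderer can keep drawing a valid avatar.
        Returns (None, False) only before any frame has ever arrived.
        """
        if self._queue:
            self._last = self._queue.popleft()
            return self._last, True
        return self._last, False


if __name__ == "__main__":
    buf = HoldLastPoseBuffer()
    buf.push_batch([[0.1], [0.2]])
    print(buf.next_frame())  # fresh frame
    print(buf.next_frame())  # fresh frame
    print(buf.next_frame())  # stall: last pose held, not fresh
    buf.push_batch([[0.3]])
    print(buf.next_frame())  # playback resumes
```

The design point is that a stall changes what the renderer is told to draw, not whether it can draw: the image stays sharp and simply stops moving until parameters resume.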
What On-Device Rendering Gives Up
Rendering quality ceiling is hardware-dependent. The highest visual fidelity requires a more capable client GPU. For premium visual experiences on constrained hardware, there’s a trade-off.
SDK integration is required. AvatarKit (npm @spatialwalk/avatarkit on Web, AvatarKit.xcframework on iOS, Gradle ai.spatialwalk:avatarkit on Android) must be integrated into the application.
The avatar model lives on the device. The 3DGS model (~5–10 MB) is downloaded to the device on first use.
Side-by-Side Comparison
| Dimension | Cloud Streaming | Spatius (On-Device) |
|---|---|---|
| Bandwidth per session | 1–2 MB/s | 10–20 KB/s (~99% less) |
| End-to-end latency | >3 seconds | <1.5 seconds |
| Added latency, audio→avatar | High (encode+stream+decode) | <300ms |
| Cost per hour | ~$9/hr (industry avg $0.15/min) | $0.42/hr (Scale plan) |
| Rendering location | Cloud GPU | Client device |
| Cloud GPU involved? | Yes (heavy — full render) | Yes (light — driving model only) |
| Works on entry-level devices | Yes (video decode only) | Yes (25fps on entry-level SoC) |
| Works on 1–2 Mbps connections | Marginal / unreliable | Yes (10–20 KB/s ≈ 0.08–0.16 Mbps) |
| Connectivity fallback | Stream degrades/stops | Audio-only mode (15s timeout) |
| Platform SDK | Varies | Web / iOS / Android |
| You provide ASR+LLM+TTS | No (platform-managed) | Yes (customer-built) |
Choosing Between Architectures
Cloud streaming is the right fit when:
- You need maximum visual fidelity and have a reliable high-bandwidth environment
- You want a fully managed AI pipeline (ASR + LLM + TTS + rendering handled by the vendor)
- Session volume is low and cost-per-minute is not a binding constraint
On-device rendering (Spatius) is the right fit when:
- You’re deploying to bandwidth-constrained environments: retail kiosks, field devices, mobile-first, emerging markets
- You’re projecting high session volumes where the cost difference ($0.42/hr vs ~$9/hr) is material
- You need native iOS and Android SDK coverage alongside Web
- You want full control over your AI stack — choosing your own ASR, LLM, and TTS providers
- You need the avatar to degrade gracefully under connectivity interruptions rather than fail hard
Try the Architecture in Your Browser
Spatius’s playground runs AvatarKit in your browser using WebGL/WebGPU. The avatar is rendered on your device, not streamed from a server. Open DevTools → Network while talking to the avatar: you’ll see small parameter packets in the 10–20 KB/s range, not a sustained video stream at 1–2 MB/s.
Related Reading
Hardware requirements → AI Avatar on Entry-Level Chipsets: How On-Device Rendering Works on Budget Hardware
Performance numbers → Comparing AI Avatar Platforms for Speed: Latency, Bandwidth, and Performance in 2026
Real deployment scenarios → AI Avatars for Edge Deployments: Kiosks, Retail, and Low-Bandwidth Environments
The full landscape → Interactive Avatar: The Complete Guide to Real-Time AI Avatars in 2026