On-Device AI Avatar vs Cloud Streaming: Architecture, Bandwidth, and Cost in 2026

Spatius Team
Jun 6, 2026 · 9 min read

Two architectures dominate real-time AI avatar platforms in 2026. From the outside they look similar: the user speaks, the avatar responds. Underneath, they work in opposite ways and produce very different results across bandwidth, cost, latency, and deployment flexibility.

This is a breakdown of both from first principles — using the numbers that actually matter in production.


Architecture 1: Cloud Streaming

In a cloud-streaming architecture, the avatar lives on a server. The full pipeline — speech recognition, language model, text-to-speech, avatar animation, and frame rendering — runs server-side. The rendered output is encoded as a video stream and delivered to the client over WebRTC or a similar protocol.

The client is essentially a video player. It receives a stream, decodes frames, plays audio. The device’s GPU does nothing except basic video decode.

Data flow:

User speaks → Cloud: ASR → LLM → TTS → Avatar render → H.264 encode
  ↓ 1–2 MB/s video stream to client
Client: decode → display

Platforms using this architecture: HeyGen LiveAvatar, Tavus, Anam.ai, and most cloud-first interactive avatar systems.

What Cloud Streaming Gets Right

Rendering quality ceiling is high. A cloud GPU renders with no compromise — full polygon budgets, high-resolution textures. The client’s hardware capability is irrelevant because the device just decodes video.

Where Cloud Streaming Breaks Down

Bandwidth is non-negotiable. A video stream at standard quality requires 1–2 MB/s (8–16 Mbps) sustained per session. This follows from what video encoding needs at usable quality; it is not a setting you can tune away. Packet loss causes visual artifacts. Jitter causes lip-sync drift. Insufficient bandwidth causes degradation or stalls.

Latency has a structural floor. Every step in the cloud pipeline adds delay: the audio round-trip, ASR processing, LLM inference, TTS synthesis, avatar rendering, video encoding, stream delivery, and client decode. Summed, these give traditional cloud-rendered avatar pipelines an end-to-end latency greater than 3 seconds.

Per-session cost scales with GPU time. Cloud GPU rendering is the expensive component. The industry average for cloud-streamed interactive avatar sessions runs approximately $0.15/minute. At meaningful session volumes, this becomes the dominant infrastructure cost.

Connectivity is a hard dependency. If the connection degrades below the video stream’s minimum viable bitrate, the experience breaks. There is no graceful degraded mode — either the stream works or it doesn’t.


Architecture 2: On-Device Rendering (Spatius)

Spatius separates the rendering from the driving inference. The cloud GPU runs a lightweight driving model that takes TTS audio as input and outputs FLAME expression parameters — a compact mathematical description of how the avatar’s face should move at each moment. These parameters stream to the client at 10–20 KB/s.

The client runs AvatarKit, Spatius’s rendering SDK. AvatarKit receives the FLAME parameters, applies them to the 3DGS (3D Gaussian Splatting) avatar model stored on-device, and renders the result locally. Audio and visual output are aligned on-device.
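A back-of-the-envelope estimate shows why a parameter stream can be this small. The sketch below is illustrative only: the coefficient counts (100 expression values plus 15 pose values), the float32 encoding, and the 25 fps update rate are assumptions, since Spatius's actual wire format is not described here.

```python
def param_stream_kb_per_s(n_expression=100, n_pose=15, bytes_per_value=4, fps=25):
    """Estimate the raw size of a per-frame face-parameter stream in KB/s.

    All defaults are assumptions for illustration, not Spatius's real format.
    Uses decimal units (1 KB = 1000 bytes) and ignores packet overhead.
    """
    bytes_per_frame = (n_expression + n_pose) * bytes_per_value
    return bytes_per_frame * fps / 1000


print(param_stream_kb_per_s())  # 11.5
```

Under these assumptions each frame is a few hundred bytes, which lands inside the 10–20 KB/s range quoted above without any compression.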

Data flow:

[Customer-built: ASR → LLM → TTS audio]
  ↓
Spatius cloud: lightweight driving model → FLAME expression parameters
  ↓ 10–20 KB/s parameter stream
AvatarKit (client): 3DGS render + audio alignment → display

Important: Spatius does not provide ASR, LLM, or TTS. These are customer-built components. Spatius handles the driving model (cloud, lightweight GPU inference) and AvatarKit (client, rendering only — zero inference cost).

What On-Device Rendering Gets Right

Bandwidth drops by ~99%. 10–20 KB/s versus 1–2 MB/s is not a marginal difference. It’s the difference between requiring dedicated fiber and working on shared 4G. A 20-device deployment needs roughly 200–400 KB/s total — negligible on any business internet connection.
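The bandwidth figures are easy to check by hand. This small sketch uses only the per-session rates quoted above; the helper names are ours, and decimal units (1 KB = 1000 bytes) are assumed.

```python
def reduction_pct(video_bytes_per_s, param_bytes_per_s):
    """Percentage of bandwidth saved by sending parameters instead of video."""
    return 100 * (1 - param_bytes_per_s / video_bytes_per_s)


def fleet_kb_per_s(devices, per_device_kb_per_s):
    """Aggregate bandwidth for a fleet of identical devices, in KB/s."""
    return devices * per_device_kb_per_s


def to_mbps(bytes_per_s):
    """Convert bytes per second to megabits per second."""
    return bytes_per_s * 8 / 1_000_000


print(f"{reduction_pct(1_000_000, 10_000):.0f}% less")       # low end of both ranges
print(fleet_kb_per_s(20, 10), "to", fleet_kb_per_s(20, 20), "KB/s for 20 devices")
print(to_mbps(1_000_000), "to", to_mbps(2_000_000), "Mbps per video session")
```

The last line is worth keeping in mind when reading link speeds: 1–2 MB/s of video is 8–16 Mbps on the wire, while the whole 20-device parameter fleet fits in about 1.6–3.2 Mbps.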

End-to-end latency is under 1.5 seconds. Going from audio input to avatar motion adds less than 300 ms on the Spatius side. With an optimized end-to-end voice AI stack, total response latency can drop below 1 second. Compared with the 3+ seconds of traditional cloud rendering, this is a fundamental difference in conversational feel.
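One way to use these numbers is as a latency budget for the customer-built stages. In the sketch below, the 300 ms and 1,500 ms figures come from this article; the per-stage values in `example_stack` are hypothetical placeholders, not measurements of any provider.

```python
SPATIUS_ADDED_MS = 300  # upper bound quoted for the audio-to-avatar step
TARGET_MS = 1500        # end-to-end target quoted for this architecture


def remaining_budget_ms(target_ms=TARGET_MS, avatar_ms=SPATIUS_ADDED_MS):
    """Milliseconds left for the customer-built ASR, LLM, and TTS stages."""
    return target_ms - avatar_ms


def fits_budget(stage_ms, target_ms=TARGET_MS, avatar_ms=SPATIUS_ADDED_MS):
    """True if the customer stages plus the avatar step meet the target."""
    return sum(stage_ms.values()) + avatar_ms <= target_ms


# Hypothetical stage timings, for illustration only.
example_stack = {"asr": 200, "llm_first_token": 400, "tts_first_audio": 250}

print(remaining_budget_ms())       # 1200
print(fits_budget(example_stack))  # True
```

Swapping in your own measured stage timings shows quickly whether a given ASR, LLM, and TTS combination leaves room for the avatar step.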

Cost structure changes. The Spatius Scale plan runs $0.007/minute ($0.42/hour). The industry average for cloud-streamed avatar sessions is approximately $0.15/minute ($9/hour), more than 20× higher. At those per-minute rates, a $5,000 budget buys approximately 11,900 hours of sessions on Spatius; the industry average buys approximately 556 hours.
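The comparison can be rechecked from the two per-minute rates alone. This sketch deliberately ignores plan fees, minimums, and volume tiers, none of which are specified here, so treat its output as an upper bound on hours rather than a quote.

```python
SPATIUS_USD_PER_MIN = 0.007   # Scale plan rate quoted above
INDUSTRY_USD_PER_MIN = 0.15   # approximate industry average quoted above


def hours_for_budget(budget_usd, usd_per_minute):
    """Session hours a budget buys at a flat per-minute rate."""
    return budget_usd / usd_per_minute / 60


def cost_ratio(high_rate, low_rate):
    """How many times more expensive the higher rate is."""
    return high_rate / low_rate


budget = 5_000
print(f"Spatius:  {hours_for_budget(budget, SPATIUS_USD_PER_MIN):,.0f} h")
print(f"Industry: {hours_for_budget(budget, INDUSTRY_USD_PER_MIN):,.0f} h")
print(f"Ratio:    {cost_ratio(INDUSTRY_USD_PER_MIN, SPATIUS_USD_PER_MIN):.1f}x")
```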

Entry-level hardware is sufficient. Because AvatarKit does only rendering and audio alignment, with no inference, entry-level SoCs handle the workload stably at 25 fps. Mid-range hardware runs at 30–60 fps. Officially supported chipsets include G88, S565, 8189, and RK3576, none of which require a dedicated GPU.

Network degradation affects latency, not rendering quality. A momentary connection drop delays the next FLAME parameter batch. The currently rendered frame continues uninterrupted. The user sees a brief pause in the avatar’s speech — not a frozen or artifacted video frame.
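The behavior described above amounts to holding the last known pose when the next batch is late. The following is a minimal sketch of that idea, not AvatarKit's implementation or API; `PoseBuffer` and its methods are hypothetical names.

```python
from collections import deque


class PoseBuffer:
    """Hypothetical client-side buffer that never leaves the renderer without a pose."""

    def __init__(self, neutral_pose):
        self._queue = deque()
        self._current = neutral_pose  # shown until the first batch arrives

    def push(self, frames):
        """Append a batch of parameter frames received from the network."""
        self._queue.extend(frames)

    def next_pose(self):
        """Return the pose for the next render tick.

        If the queue is empty (batch delayed), keep returning the last pose
        so the avatar pauses instead of freezing on a corrupted frame.
        """
        if self._queue:
            self._current = self._queue.popleft()
        return self._current


buf = PoseBuffer(neutral_pose=[0.0])
buf.push([[0.1], [0.2]])
print(buf.next_pose(), buf.next_pose(), buf.next_pose())  # [0.1] [0.2] [0.2]
```

The third call returns the held pose because no new batch has arrived; rendering continues at full frame rate while the motion pauses.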

What On-Device Rendering Gives Up

Rendering quality ceiling is hardware-dependent. The highest visual fidelity requires a more capable client GPU. For premium visual experiences on constrained hardware, there’s a trade-off.

SDK integration is required. AvatarKit (npm @spatialwalk/avatarkit on Web, AvatarKit.xcframework on iOS, Gradle ai.spatialwalk:avatarkit on Android) must be integrated into the application.

The avatar model must be stored on the device. The 3DGS model (~5–10 MB) is downloaded to the device on first use.


Side-by-Side Comparison

| Dimension | Cloud Streaming | Spatius (On-Device) |
| --- | --- | --- |
| Bandwidth per session | 1–2 MB/s | 10–20 KB/s (~99% less) |
| End-to-end latency | >3 seconds | <1.5 seconds |
| Avatar→audio additional latency | High (encode + stream + decode) | <300 ms |
| Cost per hour | ~$9/hr (industry avg $0.15/min) | $0.42/hr (Scale plan) |
| Rendering location | Cloud GPU | Client device |
| Cloud GPU involved? | Yes (heavy: full render) | Yes (light: driving model only) |
| Works on entry-level devices | Yes (video decode only) | Yes (25 fps on entry-level SoC) |
| Works on 1–2 Mbps connections | Marginal / unreliable | Unaffected (10–20 KB/s) |
| Connectivity fallback | Stream degrades/stops | Audio-only mode (15 s timeout) |
| Platform SDK | Varies | Web / iOS / Android |
| You provide ASR + LLM + TTS | No (platform-managed) | Yes (customer-built) |

Choosing Between Architectures

Cloud streaming is the right fit when:

  • You need maximum visual fidelity and have a reliable high-bandwidth environment
  • You want a fully managed AI pipeline (ASR + LLM + TTS + rendering handled by the vendor)
  • Session volume is low and cost-per-minute is not a binding constraint

On-device rendering (Spatius) is the right fit when:

  • You’re deploying to bandwidth-constrained environments: retail kiosks, field devices, mobile-first, emerging markets
  • You’re projecting high session volumes where the cost difference ($0.42/hr vs ~$9/hr) is material
  • You need native iOS and Android SDK coverage alongside Web
  • You want full control over your AI stack — choosing your own ASR, LLM, and TTS providers
  • You need the avatar to degrade gracefully under connectivity interruptions rather than fail hard

Try the Architecture in Your Browser

Spatius’s playground runs AvatarKit in your browser using WebGL/WebGPU. The avatar rendering is happening on your device, not being streamed from a server. Open DevTools → Network while talking to the avatar: you’ll see small parameter packets in the 10–20 KB/s range, not a sustained video stream at 1–2 MB/s.


Hardware requirements → AI Avatar on Entry-Level Chipsets: How On-Device Rendering Works on Budget Hardware

Performance numbers → Comparing AI Avatar Platforms for Speed: Latency, Bandwidth, and Performance in 2026

Real deployment scenarios → AI Avatars for Edge Deployments: Kiosks, Retail, and Low-Bandwidth Environments

The full landscape → Interactive Avatar: The Complete Guide to Real-Time AI Avatars in 2026

on-device avatar cloud streaming AI avatar architecture bandwidth latency