Live Avatar SDK: How to Add a Real-Time Live Avatar to Your App Without Cloud Video Streaming (2026)

Search “live avatar” and you’ll mostly get movie results. Search it as a developer, and you mean something very specific: a digital human face that listens, talks back, and reacts in real time — driven by your own AI stack, rendered live inside your app. That’s a live avatar, and the thing that makes it practical to ship is the live avatar SDK underneath it.

This guide is about that layer. We’ll cover what a live avatar SDK is responsible for (and what it isn’t), why the rendering architecture decides whether your live avatar feels instant or laggy, and how to add a real-time live avatar to a Web, iOS, or Android app without piping heavy video down from the cloud.

What “live avatar” actually means in 2026

A pre-rendered talking-head video is not a live avatar. Neither is a video you generate from a script and play back later — that’s video generation, a different product category entirely (it’s what platforms like Synthesia and HeyGen’s video tools are built for).

A live avatar is interactive. The defining traits:

It responds to live input — usually a user’s voice — and replies in the moment.
Its lip movement and expression are driven in real time, not baked into a clip.
The conversation is two-way: the user can interrupt, change topic, and the avatar adapts.

In other words, a live avatar is the visible face on top of a real-time conversational pipeline. The pipeline turns speech into a reply; the live avatar turns that reply into a face that’s actually speaking it. For the broader category framing, see our complete guide to interactive avatars.

What a live avatar SDK is responsible for — and what it isn’t

This is the part most teams get wrong when they scope a project, so it’s worth being precise. A real-time avatar system has three separate layers:

The AI agent (the “brain”) — automatic speech recognition (ASR), your large language model or RAG logic, and text-to-speech (TTS). This is your stack. A live avatar SDK like Spatius does not provide ASR, LLM, or TTS. You bring your own, or wire up providers you already use.
The avatar (the “face”) — the digital human identity. With Spatius this is either a stock avatar or a custom one built from a single photo.
The avatar SDK (the “driving engine”) — this is the live avatar SDK itself. It takes the AI-generated audio, drives the face so the mouth and expression match, and renders the result on screen, in sync, in real time.

So the SDK’s job is narrow and deep: take audio in, produce a lip-synced, expressive, rendered avatar out — fast enough to feel live. Everything upstream (deciding what to say) stays with you. That separation is what lets you swap LLMs or voices without touching the avatar layer at all.

Why rendering architecture is the whole ballgame

Here’s where live avatar platforms diverge, and where most of the latency and cost come from.

Cloud-streaming approach. The cloud renders the avatar into video frames and streams that video to the user’s device. Video is heavy: a sustained session needs roughly 1–2 MB/s of bandwidth, and end-to-end latency commonly runs over 3 seconds. A flaky network (a classroom, a kiosk, a phone on cellular) is exactly where a “live” avatar stops feeling live.

On-device rendering approach. Instead of streaming video, the cloud does only lightweight driving inference and sends down compact Motion data (driving parameters): on the order of 10–20 KB/s (80–160 kbps), what our docs describe as a “~100 kbps stream.” The client SDK then renders the 3D avatar locally and aligns it with the audio. Because the heavy rendering happens on the device, the data crossing the network is roughly two orders of magnitude smaller than streamed video, and end-to-end latency comes in under 1.5 seconds (depending on your voice AI stack), with the avatar-driving step adding under 300 ms.
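To make the bandwidth gap concrete, here is a small back-of-envelope calculation in Python. It uses only the rates quoted above; the 10-minute session length and the helper names are our own illustrative assumptions, not measurements.

```python
# Back-of-envelope comparison of network load, using the rates quoted above.
# The 10-minute session length is an arbitrary illustration, not a measurement.

KB = 1_000        # decimal kilobyte, in bytes
MB = 1_000_000    # decimal megabyte, in bytes


def session_bytes(rate_bytes_per_s: float, minutes: float) -> float:
    """Total bytes transferred for a session at a constant rate."""
    return rate_bytes_per_s * minutes * 60


def to_kbps(rate_bytes_per_s: float) -> float:
    """Convert bytes per second to kilobits per second."""
    return rate_bytes_per_s * 8 / 1_000


video_low, video_high = 1 * MB, 2 * MB        # cloud-streamed video, bytes/s
motion_low, motion_high = 10 * KB, 20 * KB    # Motion data, bytes/s

minutes = 10
print(f"video:  {session_bytes(video_low, minutes) / MB:.0f}-"
      f"{session_bytes(video_high, minutes) / MB:.0f} MB")
print(f"motion: {session_bytes(motion_low, minutes) / MB:.0f}-"
      f"{session_bytes(motion_high, minutes) / MB:.0f} MB")
print(f"motion bitrate: {to_kbps(motion_low):.0f}-{to_kbps(motion_high):.0f} kbps")
print(f"ratio: {video_low / motion_low:.0f}x")
```

With these inputs, a ten-minute session works out to 600–1200 MB of streamed video versus 6–12 MB of Motion data, a 100x difference.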

Spatius takes the on-device approach. The flow looks like this:

Your audio (from your TTS) → Motion Server (cloud) → Motion data (driving parameters) → client SDK (renders the 3DGS avatar + syncs audio) → a live, audio-visually synced avatar on screen.
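One way to reason about this flow is as a latency budget. The sketch below takes the two limits quoted earlier (under 1.5 seconds end to end, under 300 ms for avatar driving) and checks how much room your own voice stack leaves. The stage names and sample timings are hypothetical placeholders, and the stages are assumed to run one after another, which is a simplification.

```python
# Latency budget check along the flow above. The two limits come from the
# figures quoted in this article; the stage names and sample timings are
# hypothetical placeholders, not measurements of any real stack.

END_TO_END_TARGET_MS = 1500   # "under 1.5 seconds" end to end
AVATAR_DRIVING_MAX_MS = 300   # "under 300 ms" for the avatar-driving step


def remaining_budget_ms(voice_stack_ms: dict[str, float]) -> float:
    """Milliseconds left after the voice stack and a worst-case driving step.

    Assumes the stages run sequentially, which overstates latency for
    pipelines that stream between stages.
    """
    return END_TO_END_TARGET_MS - AVATAR_DRIVING_MAX_MS - sum(voice_stack_ms.values())


# Hypothetical timings for your own ASR, LLM, and TTS stages.
sample = {"asr": 200, "llm_first_token": 500, "tts_first_audio": 300}
left = remaining_budget_ms(sample)
print(f"budget left: {left} ms", "(ok)" if left >= 0 else "(over target)")
```

Swap in timings measured from your own providers; a negative result means the voice stack, not the avatar layer, is what pushes the session past the target.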

A practical note on cost that follows directly from this: moving rendering to the device minimizes cloud GPU cost instead of concentrating all rendering on servers. There is still lightweight driving inference running in the cloud, but the device itself does only rendering, with no inference. That is why a live avatar can run on hardware with no dedicated GPU. We go deep on the trade-offs in on-device AI avatar vs cloud streaming.

What this buys you: device reach and resilience

Because the device only renders, the bar for “what can run a live avatar” drops dramatically. Spatius runs across roughly 99% of mainstream Android, iOS, and Web devices. Sub-$1000 and mid-range hardware holds a stable 30–60 fps; even entry-level SoCs with no dedicated GPU stay around 25 fps. We’ve documented specific chipsets, and what to expect on each, in AI avatars on entry-level chipsets.

Two reliability behaviors matter for anything you’d call “live”:

Fallback. If the WebSocket connection fails to establish within 15 seconds, the SDK switches to an audio-only fallback mode: audio keeps playing uninterrupted and only the animation pauses. A network hiccup degrades gracefully instead of dropping the session.
Interrupt. Users can cut the avatar off mid-sentence. Calling interrupt() clears the current playback and buffered data, so the avatar stops immediately and listens — which is what a real conversation requires.
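The two behaviors above can be sketched in a few lines of Python. This is a conceptual model only: PlaybackSketch, start, enqueue, and slow_connect are hypothetical names invented for illustration, and none of it is the Spatius SDK API. It shows a connection attempt bounded by a timeout that degrades to audio-only, and an interrupt that discards everything buffered.

```python
import asyncio
from collections import deque

CONNECT_TIMEOUT_S = 15.0  # the 15-second window described above


class PlaybackSketch:
    """Hypothetical stand-in for a client session; not the Spatius SDK API."""

    def __init__(self) -> None:
        self.mode = "idle"
        self.buffered = deque()   # queued chunks awaiting playback
        self.playing = False

    async def start(self, connect, timeout_s: float = CONNECT_TIMEOUT_S) -> str:
        """Try to connect; fall back to audio-only on timeout or connection error."""
        try:
            await asyncio.wait_for(connect(), timeout=timeout_s)
            self.mode = "avatar"
        except (asyncio.TimeoutError, OSError):
            self.mode = "audio_only"   # audio continues, animation pauses
        return self.mode

    def enqueue(self, chunk: bytes) -> None:
        """Queue a chunk for playback."""
        self.buffered.append(chunk)
        self.playing = True

    def interrupt(self) -> None:
        """Drop current playback and everything buffered, so the user can speak."""
        self.buffered.clear()
        self.playing = False


async def slow_connect() -> None:
    """Simulates a connection that does not complete within the timeout."""
    await asyncio.sleep(1.0)


async def demo() -> tuple[str, int]:
    session = PlaybackSketch()
    # A short timeout keeps the demo fast; the real window is 15 seconds.
    mode = await session.start(slow_connect, timeout_s=0.05)
    session.enqueue(b"\x00\x00" * 160)
    session.interrupt()
    return mode, len(session.buffered)


print(asyncio.run(demo()))
```

The point of the design is that neither path ends the session: a failed connection changes the mode, and an interrupt empties the queue, but the conversation keeps going.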

How to add a real-time live avatar to your app

You have three integration paths, depending on how much control you need:

Mode	Latency	Dev effort	Platforms	Best for
Basic Mode	Medium	Low	Web / iOS / Android	Getting started fast, mobile apps
LiveKit Plugin	Ultra-low	Low	Web only	Projects already running LiveKit Agents
Custom Mode	Ultra-low	Medium–high	Web / iOS / Android	Teams that need full control of the transport layer
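If it helps to see the table as a decision rule, here is one possible reading of it in Python. The function and its parameter names are hypothetical; only the mode names and platform support come from the table, and the ordering of the checks is our own simplification.

```python
# Decision helper derived from the table above. The function and parameter
# names are hypothetical; only the mode names and platform support come
# from the table.

SUPPORTED_PLATFORMS = {"web", "ios", "android"}


def choose_mode(platform: str, uses_livekit_agents: bool = False,
                needs_transport_control: bool = False) -> str:
    """Pick an integration mode for 'web', 'ios', or 'android'."""
    platform = platform.lower()
    if platform not in SUPPORTED_PLATFORMS:
        raise ValueError(f"unsupported platform: {platform!r}")
    if needs_transport_control:
        return "Custom Mode"
    # The LiveKit Plugin is Web-only, so other platforms skip this branch.
    if uses_livekit_agents and platform == "web":
        return "LiveKit Plugin"
    return "Basic Mode"


print(choose_mode("web", uses_livekit_agents=True))
print(choose_mode("ios", uses_livekit_agents=True))
print(choose_mode("android", needs_transport_control=True))
```

A mobile team that needs ultra-low latency may still prefer Custom Mode over the Basic Mode this helper returns, so treat the output as a starting point.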

A few specifics worth knowing before you start:

Web ships as the npm package @spatialwalk/avatarkit (WebGL/WebGPU). For LiveKit transport on web, use @spatialwalk/avatarkit-rtc.
iOS is AvatarKit.xcframework (Metal). Note: the iOS simulator doesn’t support Metal rendering, so you have to test on a real device.
Android is the Gradle dependency ai.spatialwalk:avatarkit (Vulkan).
Server-side, there’s a Python SDK (pip: spatius) and a Go SDK. A JS server SDK is coming soon but not released yet.
If you go the LiveKit Plugin route, it’s Web-only today (not iOS/Android), and it pins livekit-client to version 2.16.1 — other versions can cause connection or audio issues.

Audio in is mono 16-bit PCM (s16le), default 16000 Hz, with no automatic resampling — so make sure your TTS output format matches. The Python SDK additionally accepts Ogg Opus.
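Since there is no automatic resampling, it is worth checking your TTS output before sending it. The sketch below uses only Python’s standard wave module and assumes your TTS can write a WAV file for inspection (raw PCM has no header to check). The helper names are our own, and nothing here calls the Spatius SDK.

```python
import io
import wave

EXPECTED_CHANNELS = 1        # mono
EXPECTED_SAMPLE_WIDTH = 2    # bytes per sample, i.e. 16-bit
EXPECTED_RATE_HZ = 16000     # default sample rate quoted above


def check_wav_format(wav_file) -> list[str]:
    """Return the mismatches between a WAV file and the expected input format."""
    problems = []
    with wave.open(wav_file, "rb") as w:
        if w.getnchannels() != EXPECTED_CHANNELS:
            problems.append(f"channels={w.getnchannels()}, expected {EXPECTED_CHANNELS}")
        if w.getsampwidth() != EXPECTED_SAMPLE_WIDTH:
            problems.append(f"sample_width={w.getsampwidth()} bytes, "
                            f"expected {EXPECTED_SAMPLE_WIDTH} bytes")
        if w.getframerate() != EXPECTED_RATE_HZ:
            problems.append(f"sample_rate={w.getframerate()} Hz, "
                            f"expected {EXPECTED_RATE_HZ} Hz")
    return problems


def make_silent_wav(rate_hz: int, channels: int = 1, seconds: float = 0.1) -> io.BytesIO:
    """Build a small in-memory 16-bit WAV of silence for the demo."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(2)
        w.setframerate(rate_hz)
        w.writeframes(b"\x00\x00" * channels * int(rate_hz * seconds))
    buf.seek(0)
    return buf


print(check_wav_format(make_silent_wav(16000)))   # matching format
print(check_wav_format(make_silent_wav(24000)))   # a 24 kHz TTS output
```

The first call returns an empty list; the second reports the 24000 Hz rate as a mismatch, which is the cue to change the TTS output setting or resample before sending. check_wav_format also accepts a file path in place of the in-memory buffer.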

The fastest way to feel the latency for yourself before writing any code is the Spatius Playground — talk to a live avatar in the browser. When you’re ready to wire it into a real pipeline, the working voice agent demo on GitHub (Web, iOS, Android, Flutter clients) is the shortest path, and the Spatius docs cover credentials, session setup, and LiveKit usage.

How a live avatar compares to the “interactive avatar” tools you’ve seen

If you’ve evaluated HeyGen’s interactive avatar product, the difference is architectural, not cosmetic. HeyGen’s live avatar streams rendered video from the cloud; an on-device SDK streams only Motion data and renders locally. That’s the gap that shows up as bandwidth and latency under real conditions. We break down exactly when each makes sense in “HeyGen interactive avatar: what it does, where it falls short” and in “HeyGen interactive avatar vs. alternatives.”

Pricing, briefly

A live avatar is something users run for minutes at a time, so per-minute cost matters more than a sticker price. Spatius has a permanent free tier (500 credits/month, ~50 minutes), and on the Scale plan the effective rate is $0.007/min — about $0.42 per hour of conversation. (That $0.42/hour figure is the Scale-plan rate specifically; Free and Starter rates differ.) Full breakdown on the pricing page, and a cost-per-minute comparison across platforms in the cheapest real-time AI avatar API in 2026.
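The per-minute arithmetic is easy to adapt to your own traffic. The sketch below uses only the figures quoted above; the session counts are illustrative, and the credits-per-minute value is inferred from the approximate “~50 minutes” figure, so treat it as a rough estimate.

```python
# Figures from the paragraph above; the session sizes below are illustrative.
SCALE_RATE_USD_PER_MIN = 0.007
FREE_CREDITS_PER_MONTH = 500
FREE_MINUTES_PER_MONTH = 50     # approximate, per the "~50 minutes" above


def scale_cost_usd(minutes: float) -> float:
    """Cost of a given number of conversation minutes at the Scale-plan rate."""
    return round(SCALE_RATE_USD_PER_MIN * minutes, 4)


# Inferred from two quoted figures, so only an approximation.
credits_per_minute = FREE_CREDITS_PER_MONTH / FREE_MINUTES_PER_MONTH

print(f"Scale plan, 1 hour: ${scale_cost_usd(60):.2f}")
print(f"Scale plan, 1,000 five-minute sessions: ${scale_cost_usd(1000 * 5):.2f}")
print(f"Implied free-tier usage: about {credits_per_minute:.0f} credits per minute")
```

At the Scale rate, an hour costs $0.42 and 1,000 five-minute sessions cost $35.00; Free and Starter rates would need their own constants.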

The takeaway

A “live avatar” is only as live as its weakest link, and for most teams that link is the network. If your avatar’s rendering lives in the cloud, your users’ bandwidth becomes your latency. Move rendering to the device, stream only Motion data, and a real-time live avatar becomes something you can ship to a phone on cellular, a kiosk on store Wi-Fi, or a budget tablet in a classroom — and have it still feel live.

Spin one up in the Playground, or grab the SDK from the docs and build.