
AI Avatar for Virtual Assistants: Build an On-Device Agent That Works on Any Budget Hardware

Spatius Team
Jun 14, 2026 · 10 min read

Virtual assistant deployments have a reliability problem that rarely shows up in demos: the real users are on budget tablets in a retail store, a hospital waiting room, or a field office with intermittent Wi-Fi. The demo ran on a MacBook Pro with a wired connection. The production deployment runs on a $150 Android kiosk with a shared corporate network.

This guide covers what actually matters when deploying an AI avatar virtual assistant — and why architecture, not visual quality, is the decision that determines whether it works.

What an AI Avatar Virtual Assistant Actually Is

An AI avatar virtual assistant is a real-time interactive digital human that handles a defined set of tasks on behalf of a user — answering questions, guiding workflows, conducting intake processes, or providing information — with a face and voice that responds in real time to what the user says.

It’s different from a chatbot with a picture. The face is animated in real time based on synthesized speech. The conversation is driven by a live LLM or a scripted dialogue flow. The interaction feels like talking to a person because the facial expressions, lip sync, and audio are synchronized and responsive.

Where AI avatar virtual assistants are being deployed in 2026:

  • Kiosk-based reception and intake: retail, healthcare, government offices
  • In-store product assistance: retail and hospitality
  • Education and training: language learning platforms, corporate onboarding
  • HR tech: AI interviewers, candidate screening, role-play coaching
  • AI hardware: dedicated devices with built-in avatar interfaces

The hardware and network constraints matter in these contexts because none of these environments reliably offers the sustained bandwidth that cloud-streaming avatar architectures require.

The Problem with Cloud-Streaming Avatars in Production

Cloud-streaming approaches render the avatar’s face on a cloud GPU and deliver it to the client as a video stream. The video quality is consistent. The problem is what it requires from the deployment environment.

Bandwidth: sustained 1–2 MB/s per session is the typical requirement for cloud-rendered avatar video. On a shared corporate Wi-Fi network handling 20 kiosk sessions simultaneously, that’s 20–40 MB/s of sustained throughput — before any other network traffic. On a mobile connection, it’s the difference between the session working and not.

Latency: cloud rendering adds its own processing time to the voice AI pipeline. Total end-to-end latency — from a user finishing a sentence to the avatar beginning its response — frequently exceeds 3 seconds for cloud-streaming platforms under realistic network conditions.

Cost at scale: when you’re paying for cloud GPU rendering per minute of session time, the per-minute rate compounds with your concurrent session count. Virtual assistant deployments that handle thousands of daily sessions make this the dominant infrastructure cost.

Device coverage: cloud streaming moves rendering to the server, so any device that can play video can display a cloud-streamed avatar. The trade-off is that the device requirement becomes a network requirement: any bandwidth constraint, whether from the device’s connection or the shared network, degrades the session.

On-Device Rendering: Why It Works for Virtual Assistants

On-device rendering inverts the architecture. Instead of streaming video from a cloud GPU, Spatius Motion Server transmits only lightweight Motion data to an SDK that runs on the user’s device. The device does the rendering.

The result is approximately 10–20 KB/s of bandwidth per session instead of 1–2 MB/s. That’s a reduction of roughly 99% in data transfer per session, which changes the math on network requirements, device compatibility, and cost at scale entirely.

In a kiosk deployment running 20 simultaneous sessions:

  • Cloud streaming: 20–40 MB/s sustained bandwidth required
  • On-device rendering: 200–400 KB/s sustained bandwidth required
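The comparison above is simple multiplication, and it is worth rerunning with your own session count. The sketch below uses only the per-session ranges quoted in this article and treats 1 MB/s as 1,000 KB/s; the function names are illustrative and not part of any SDK.

```python
# Per-session bandwidth ranges quoted in the article, in KB/s.
# Assumption: 1 MB/s is treated as 1,000 KB/s.
CLOUD_STREAMING_KBPS = (1_000, 2_000)
ON_DEVICE_KBPS = (10, 20)


def aggregate_kbps(per_session_kbps, sessions):
    """Return the (low, high) sustained bandwidth for `sessions` concurrent sessions."""
    low, high = per_session_kbps
    return low * sessions, high * sessions


def reduction_percent(baseline_kbps, reduced_kbps):
    """Percentage of bandwidth saved when moving from baseline to reduced."""
    return 100 * (1 - reduced_kbps / baseline_kbps)


if __name__ == "__main__":
    sessions = 20
    cloud = aggregate_kbps(CLOUD_STREAMING_KBPS, sessions)
    device = aggregate_kbps(ON_DEVICE_KBPS, sessions)
    print(f"Cloud streaming: {cloud[0] / 1000:.0f}-{cloud[1] / 1000:.0f} MB/s")
    print(f"On-device:       {device[0]}-{device[1]} KB/s")
    saved = reduction_percent(CLOUD_STREAMING_KBPS[0], ON_DEVICE_KBPS[0])
    print(f"Reduction:       {saved:.0f}%")
```

Changing `sessions` to the number of kiosks on a given network segment shows quickly whether the existing uplink is likely to cope without an upgrade.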

The on-device approach also decouples rendering quality from network quality. A session running at 10–20 KB/s maintains the same visual output whether the connection is fast or slow, because Motion data is small enough to transmit reliably over weak connections.

For virtual assistant hardware specifically, Spatius’s on-device rendering SDK runs stably on entry-level chipsets, including G88, S565, 8189, RK3576, and similar SoCs, at 25+ fps with no dedicated GPU required. Mid-range and flagship devices achieve 30–60 fps. The SDK renders through the platform’s native graphics layer: WebGL/WebGPU on web, Metal on iOS, Vulkan on Android.

For a detailed breakdown of what entry-level chipset performance looks like in practice: AI Avatar on Entry-Level Chipsets: How On-Device Rendering Works on Budget Hardware

Where to Find AI Avatar Services for Virtual Assistants

The practical answer to where to find these services is: look past the demo, look at the architecture.

Most platforms can produce a compelling 60-second demo video. The evaluation questions that surface real differences:

What’s the minimum bandwidth requirement per concurrent session? This determines whether your deployment environment is viable without infrastructure upgrades.

Which devices is the rendering SDK certified for? Confirm that the list covers the chipsets in your intended hardware, not just flagship devices.

Does the platform provide the AI, or just the face? For virtual assistant use cases, you almost certainly want to bring your own LLM and voice pipeline. Your use case requires domain-specific knowledge that no off-the-shelf AI can provide out of the box. Platforms that let you integrate BYO ASR, LLM, and TTS give you this flexibility; bundled platforms don’t.

How does the session behave when connectivity is interrupted? In production deployments, connections drop. A platform with graceful fallback — like Spatius’s auto-fallback to audio-only mode after 15 seconds of WebSocket failure — keeps the session usable. One that drops entirely loses the user.
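When evaluating this behavior, it helps to model the policy so you can test how your own UI reacts to it. The sketch below is not the Spatius SDK’s API; `FallbackMonitor` and its methods are hypothetical names, and the only figure taken from the article is the 15-second threshold.

```python
from dataclasses import dataclass
from typing import Optional

FALLBACK_AFTER_SECONDS = 15.0  # threshold quoted in the article


@dataclass
class FallbackMonitor:
    """Tracks how long the connection has been down and picks a render mode."""

    threshold: float = FALLBACK_AFTER_SECONDS
    down_since: Optional[float] = None  # timestamp of the first failure, if any

    def on_disconnect(self, now: float) -> None:
        # Keep the earliest failure time so repeated errors do not reset the clock.
        if self.down_since is None:
            self.down_since = now

    def on_reconnect(self) -> None:
        self.down_since = None

    def mode(self, now: float) -> str:
        """Return "avatar" while healthy or briefly down, else "audio_only"."""
        if self.down_since is not None and now - self.down_since >= self.threshold:
            return "audio_only"
        return "avatar"
```

With a disconnect recorded at `now=100.0`, `mode(110.0)` still returns `"avatar"` and `mode(115.0)` returns `"audio_only"`; calling `on_reconnect()` restores `"avatar"`. Passing timestamps in explicitly, rather than reading the clock inside the class, keeps the policy easy to unit test.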

What does concurrent pricing look like at your expected session volume? Per-minute rates that seem reasonable at one session compound quickly at 50 or 500.

Building an AI Avatar Virtual Assistant with Spatius

Spatius provides the rendering layer. You provide the AI.

The typical virtual assistant integration path:

  1. Your ASR transcribes the user’s speech into text
  2. Your LLM processes the input against your knowledge base or dialogue flow and generates a response
  3. Your TTS converts the response to audio (mono 16-bit PCM; 16 kHz is the default, and 8 kHz through 48 kHz are all supported)
  4. Spatius SDK sends the audio to Motion Server, receives Motion data, and renders the avatar’s facial animation on the user’s device in sync with the audio
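The four steps above can be sketched as one function per conversational turn. This is a minimal outline under stated assumptions: `run_turn`, `check_sample_rate`, and the four callables are hypothetical placeholders for your own ASR, LLM, TTS, and avatar integration, and none of them are Spatius SDK calls. Only the audio format limits come from the article.

```python
from typing import Callable

# Audio format limits quoted in the article: mono 16-bit PCM, 8 kHz to 48 kHz.
DEFAULT_SAMPLE_RATE_HZ = 16_000
MIN_SAMPLE_RATE_HZ = 8_000
MAX_SAMPLE_RATE_HZ = 48_000


def check_sample_rate(sample_rate_hz: int) -> int:
    """Reject sample rates outside the supported range."""
    if not MIN_SAMPLE_RATE_HZ <= sample_rate_hz <= MAX_SAMPLE_RATE_HZ:
        raise ValueError(f"unsupported sample rate: {sample_rate_hz} Hz")
    return sample_rate_hz


def run_turn(
    user_audio: bytes,
    asr: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str, int], bytes],
    render_avatar: Callable[[bytes, int], None],
    sample_rate_hz: int = DEFAULT_SAMPLE_RATE_HZ,
) -> str:
    """Run one conversational turn and return the reply text."""
    rate = check_sample_rate(sample_rate_hz)
    transcript = asr(user_audio)      # step 1: speech to text
    reply = llm(transcript)           # step 2: generate the response
    reply_pcm = tts(reply, rate)      # step 3: text to mono 16-bit PCM
    render_avatar(reply_pcm, rate)    # step 4: hand the audio to the avatar layer
    return reply
```

Because every stage is passed in as a callable, each one can be replaced with a stub during testing, and swapping the LLM or TTS vendor does not touch the avatar step.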

The cloud component in Spatius is Motion Server, which outputs Motion data rather than rendered video frames. The device renders the 3DGS avatar model locally. This is why the bandwidth stays at 10–20 KB/s regardless of the LLM or voice stack complexity.

Concurrency limits by plan:

  • Free: 2 concurrent sessions, 500 credits/month (~50 minutes)
  • Starter ($19/month): 5 concurrent sessions, 22,000 credits (~2,200 minutes)
  • Scale ($299/month): 40 concurrent sessions, 400,000 credits (~40,000 minutes)
  • Enterprise: unlimited concurrent sessions

For kiosk deployments, the Scale plan’s 40 concurrent sessions and roughly $0.007/minute rate cover most mid-scale deployments. At that rate, a $5,000 monthly budget buys about 714,000 minutes, or approximately 11,900 hours of session time.
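Estimates like this are easy to reproduce from the plan figures. The sketch below assumes linear pricing, meaning cost scales with minutes at a flat rate. The article does not confirm how usage beyond the included credits is billed, and Enterprise pricing is not listed, so treat any result as a rough planning number. All names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    monthly_usd: float
    concurrent_sessions: int
    included_minutes: int


# Figures copied from the plan list above; Enterprise is omitted (no price given).
PLANS = [
    Plan("Free", 0.0, 2, 50),
    Plan("Starter", 19.0, 5, 2_200),
    Plan("Scale", 299.0, 40, 40_000),
]


def effective_rate(plan: Plan) -> float:
    """Monthly price divided by included minutes, in USD per minute."""
    return plan.monthly_usd / plan.included_minutes


def smallest_plan_for(concurrent_sessions: int) -> Plan:
    """Return the first listed plan whose concurrency limit is high enough."""
    for plan in PLANS:
        if plan.concurrent_sessions >= concurrent_sessions:
            return plan
    raise ValueError("needs Enterprise: exceeds the listed concurrency limits")


def hours_for_budget(budget_usd: float, usd_per_minute: float) -> float:
    """Session hours a budget buys at a flat per-minute rate (linear pricing assumed)."""
    return budget_usd / usd_per_minute / 60
```

For a 20-kiosk site, `smallest_plan_for(20)` returns the Scale plan. Passing the per-minute rate into `hours_for_budget` explicitly makes it easy to see how sensitive the estimate is to rounding in that rate.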

Real-time team collaboration scenarios: for deployments where human agents supervise avatar sessions or can take over, the BYO architecture gives you full access to the conversation data stream — transcripts, LLM outputs, session state — because it flows through your own pipeline. There’s no vendor black box between the AI and your workforce management tools.

Integration Modes for Virtual Assistant Deployments

Spatius offers three integration modes with different trade-offs:

Basic Mode: simplest setup, suitable for web, iOS, and Android. Connects the SDK to Spatius Motion Server directly. Best for getting to a working demo quickly; typical for mobile virtual assistant apps.

LiveKit Plugin: ultra-low latency, web only. For teams already using LiveKit Agents for their voice pipeline. Reduces the additional latency contribution from the avatar layer to near-zero.

Custom Mode: full control over the transport layer; available on web, iOS, and Android. For deployments that need custom signaling, specific routing, or advanced session management.

For most kiosk virtual assistant deployments, Basic Mode on Android covers the hardware and provides straightforward SDK integration. For browser-based kiosk interfaces where you’re using LiveKit for voice, the LiveKit Plugin is the better choice.
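The guidance above reduces to a short decision rule. The helper below is one possible encoding of it, not an official selector: `choose_mode` and its parameters are hypothetical, and the ordering (custom transport needs first, then LiveKit on web, then Basic) is an assumption about priorities.

```python
from enum import Enum


class Mode(Enum):
    BASIC = "Basic Mode"
    LIVEKIT_PLUGIN = "LiveKit Plugin"
    CUSTOM = "Custom Mode"


SUPPORTED_PLATFORMS = {"web", "ios", "android"}


def choose_mode(
    platform: str,
    uses_livekit: bool = False,
    needs_custom_transport: bool = False,
) -> Mode:
    """Pick an integration mode from the trade-offs described in this section."""
    platform = platform.lower()
    if platform not in SUPPORTED_PLATFORMS:
        raise ValueError(f"unknown platform: {platform}")
    if needs_custom_transport:
        return Mode.CUSTOM            # custom signaling, routing, or session management
    if uses_livekit and platform == "web":
        return Mode.LIVEKIT_PLUGIN    # the plugin is web only
    return Mode.BASIC                 # simplest setup on web, iOS, and Android
```

An Android kiosk resolves to Basic Mode, while a browser-based kiosk with a LiveKit voice pipeline resolves to the LiveKit Plugin. A LiveKit user on iOS falls back to Basic Mode because the plugin is web only.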

Try the live demo before integrating: Spatius Playground

For Real-Time Collaboration Use Cases

For deployments where multiple team members interact with or monitor avatar sessions — training environments, panel interviews, supervisory monitoring — the key question is whether the platform exposes session data in real time.

Because Spatius operates as a rendering layer within your own stack, session data (transcripts, LLM outputs, conversation state) lives in your pipeline by default. You decide what to store, what to expose to your collaboration tools, and how to route session handoffs to human agents. This is qualitatively different from platforms where the AI is bundled and the session data lives in the vendor’s system.
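Since the session data already passes through your own pipeline, exposing it to collaboration tools can be as simple as a publish/subscribe fan-out. The sketch below is a hypothetical in-process example. `SessionEvent`, `SessionBus`, and the event kinds are invented for illustration, and a real deployment would more likely publish to a message queue or WebSocket.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SessionEvent:
    session_id: str
    kind: str      # e.g. "transcript", "llm_output", "handoff_requested"
    payload: str


@dataclass
class SessionBus:
    """Fans out events from your own pipeline to any number of subscribers."""

    subscribers: List[Callable[[SessionEvent], None]] = field(default_factory=list)

    def subscribe(self, handler: Callable[[SessionEvent], None]) -> None:
        self.subscribers.append(handler)

    def publish(self, event: SessionEvent) -> None:
        # Deliver the event to every subscriber in registration order.
        for handler in self.subscribers:
            handler(event)


if __name__ == "__main__":
    bus = SessionBus()
    bus.subscribe(lambda e: print(f"[dashboard] {e.session_id} {e.kind}: {e.payload}"))
    bus.publish(SessionEvent("kiosk-7", "transcript", "Where is radiology?"))
```

A supervisor dashboard and a handoff router can subscribe to the same bus independently, which keeps the decision about what to store or expose inside your own code.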

For a broader overview of building conversational AI avatar solutions: Conversational AI Avatar for Customer Service: The Complete Platform Guide (2026)


Tags: AI avatar, virtual assistant, kiosk, edge rendering, on-device SDK