
Build a Custom AI Avatar from a Single Photo: How 3DGS and Facial Matching Work in 2026

Spatius Team
Jun 14, 2026 · 10 min read

Creating a photorealistic custom AI avatar used to require a professional 3D scanning rig, hours of manual modeling work, and a rendering infrastructure that could handle the compute demands of real-time playback. In 2026, the process starts with a single photograph and ends with a 5–10 MB model file that renders at 30–60 fps on a mid-range mobile device.

Understanding how this works — specifically how 3D Gaussian Splatting (3DGS) and the facial matching pipeline behind it function — matters for anyone building interactive AI avatar experiences with a custom persona.

What “Facial Matching SDK” Actually Means

A facial matching SDK, in the context of real-time AI avatars, typically refers to two distinct capabilities that are often conflated:

Identity-preserving avatar reconstruction: building a 3D model of a specific person’s face from input data (photos, video, or scan), such that the resulting avatar is recognizably that person.

Real-time facial drive and expression matching: animating that model in real time based on audio input, Motion data, or motion capture, such that the avatar’s face moves in a way that matches what the person would naturally look like saying the same thing.

Most platforms that offer “custom avatar” functionality handle one or both of these. The quality gap between approaches is large, and it’s primarily determined by the underlying 3D representation technology.

The Problem with Traditional Custom Avatar Approaches

Before 3DGS became practical for real-time rendering, the standard approaches for photorealistic custom avatar creation were:

3D mesh scanning: requires controlled lighting, multi-camera rigs, and professional environments. Produces high-quality models, but input requirements make it inaccessible for most consumer or enterprise use cases.

Monocular video reconstruction: estimates 3D structure from a video clip of the subject. More accessible, but quality degrades significantly under uncontrolled lighting and at profile or extreme-angle views. The resulting mesh often has visible artifacts in hair and skin texture.

Texture mapping on a generic base model: starts with a standard 3D human model and applies a photographic texture of the target face. Fast to generate but recognizably artificial — texture-mapped avatars don’t capture the depth, specular highlights, and volumetric properties of a real face.

Video puppeteering: takes a driving video of a performer and applies their expressions to a reference photo of the target face using neural rendering techniques. High quality for pre-rendered video but impractical for real-time use due to compute requirements.

How 3DGS Changes the Picture

3D Gaussian Splatting is a scene representation technique developed in academic computer vision that represents a 3D scene as a collection of oriented, colored, partially transparent 3D Gaussians — mathematical splats — rather than as a polygon mesh. The key properties that make it useful for real-time avatar rendering are:

High visual fidelity from sparse input: a 3DGS model trained on a set of photos taken from different angles can reproduce a highly realistic appearance, including the soft translucent look of skin that comes from subsurface scattering, view-dependent specular highlights, and volumetric hair. Mesh-based approaches struggle to capture these details cleanly.

Compact output: a complete 3DGS avatar model can be as small as 5–10 MB. For comparison, traditional high-fidelity 3D character assets for games often run 50–200 MB with texture maps included.

Real-time rendering via rasterization: the Gaussian splats can be rendered efficiently using standard GPU rasterization pipelines. WebGL and WebGPU, which run in modern browsers without plugins, are sufficient for 30–60 fps rendering of 3DGS avatars at typical session video dimensions.

No dedicated GPU required: because the rendering is efficient and the model is compact, 3DGS avatars render stably on entry-level hardware — mobile chipsets, embedded displays, budget Android devices — that would fail to maintain real-time performance with traditional high-fidelity avatar rendering.
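As a rough illustration of the representation and the size figures above, the sketch below models a single splat and estimates how many splats fit in a given file size. The field names and the uncompressed 4-byte-per-value layout follow the common 3DGS parameterization and are assumptions; the article does not describe the actual Spatius model format.

```python
from dataclasses import dataclass
import math


@dataclass
class Gaussian3D:
    """One splat. Field names are illustrative, not a real file format."""
    position: tuple  # (x, y, z) center of the Gaussian
    scale: tuple     # per-axis standard deviations (sx, sy, sz)
    rotation: tuple  # orientation as a quaternion (w, x, y, z)
    opacity: float   # 0 = fully transparent, 1 = fully opaque
    color: tuple     # base RGB; real models add view-dependent terms


def rotation_matrix(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def covariance(g):
    """Shape of the splat: Sigma = R S S^T R^T (symmetric by construction)."""
    r = rotation_matrix(g.rotation)
    s2 = [s * s for s in g.scale]
    return [
        [sum(r[i][k] * s2[k] * r[j][k] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]


def bytes_per_gaussian(sh_degree, bytes_per_value=4):
    """Bytes for one splat in an uncompressed layout (an assumption).

    Counts position (3), scale (3), rotation (4), opacity (1), and
    spherical-harmonic color coefficients: (degree + 1)^2 per RGB channel.
    """
    sh_values = 3 * (sh_degree + 1) ** 2
    return (3 + 3 + 4 + 1 + sh_values) * bytes_per_value


def gaussians_in_budget(budget_mb, sh_degree, bytes_per_value=4):
    """How many splats fit in budget_mb megabytes (1 MB = 10^6 bytes)."""
    return int(budget_mb * 1_000_000) // bytes_per_gaussian(sh_degree, bytes_per_value)
```

Under these assumptions, a 10 MB file holds roughly 42,000 splats with degree-3 spherical harmonics, or about 178,000 with base color only. Production formats may quantize or compress these values, so real counts could differ.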

From Single Photo to Avatar: The Spatius Pipeline

The Spatius avatar creation pipeline takes a single photograph as input and produces a 3DGS avatar model:

  1. Face detection and crop: the pipeline extracts the face region and normalizes for scale, orientation, and lighting.

  2. 3D geometry estimation: a neural model estimates the 3D geometry of the face from the single view. This is the hardest step in the single-photo case — inferring depth, self-occluded regions, and back-of-head geometry from a 2D image.

  3. Multi-view synthesis: the geometry estimate is used to synthesize plausible views of the face from multiple angles, creating a pseudo-multi-view dataset for the 3DGS training step.

  4. 3DGS model training: the 3D Gaussian Splatting optimization runs on the synthesized multi-view data, fitting a set of Gaussians to reproduce the target appearance.

  5. Motion data binding: the resulting 3DGS model is bound to Spatius’s Motion data driving space. This binding is what allows the model to be driven in real time from the lightweight Motion data stream Spatius transmits during sessions.

  6. Model export: the final model file (5–10 MB) is stored and associated with the Avatar ID in your Spatius account.

Total time from photo to deployable model: approximately 3 hours in current production. Once the model is generated, it’s permanently associated with the Avatar ID and can be used in unlimited sessions without re-generation.
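The six stages above form a strictly ordered chain in which each stage consumes the outputs of the earlier ones. A minimal sketch of that data flow follows; the stage names, the `Job` structure, and `run_pipeline` are hypothetical and do not correspond to any Spatius API.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """State carried between stages (all field names are hypothetical)."""
    photo_path: str
    artifacts: dict = field(default_factory=dict)  # stage name -> stage output
    completed: list = field(default_factory=list)  # stage names, in run order


# Stage names mirror the six steps listed above.
STAGES = [
    "face_detect_and_crop",
    "geometry_estimation",
    "multi_view_synthesis",
    "gaussian_training",
    "motion_data_binding",
    "model_export",
]


def run_pipeline(job, implementations):
    """Run every stage in order; each stage reads the job and returns its output."""
    for name in STAGES:
        stage = implementations[name]
        job.artifacts[name] = stage(job)  # later stages can read earlier outputs
        job.completed.append(name)
    return job
```

In this sketch, the training stage would read `job.artifacts["multi_view_synthesis"]` rather than the original photo, which reflects the point that 3DGS optimization runs on the synthesized views.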

Quality notes: single-photo reconstruction cannot fully recover information that isn’t visible in the photo, such as extreme profile views, the back of the head, and regions occluded by hair. The synthesis step makes informed estimates, but they will not be accurate in every case. For use cases where a specific individual’s likeness must be highly accurate (branded spokespeople, personalized education), providing multiple reference photos improves reconstruction quality.

The Facial Matching SDK in Practice

The “facial matching” component of Spatius’s avatar creation isn’t matching in the face-recognition sense — it’s not identifying who a person is from an image. It’s about maintaining the identity-specific properties of a face across the expression space: the shape, texture, and distinctive features of a specific person that remain recognizable when they smile, frown, or speak.

This is technically distinct from face swapping or deepfake techniques. The 3DGS model is built from the subject’s actual appearance, not from another model’s geometry with a texture applied. Motion data driving constrains expression to the space of natural human expressions, which prevents the uncanny distortions common in less constrained approaches.

Some applications need accurate lip sync across multiple languages and for technical terms (mathematical notation, proper nouns, non-native phonemes); education and language learning are the primary cases. For these, Spatius supports multilingual lip sync, including mathematical symbols. Spatius documents this as one of the primary differentiators for educational use cases.

Custom Avatar vs Stock Avatar: When to Use Which

Spatius includes a high-fidelity stock avatar, free for commercial use, that you can deploy immediately without any custom model generation. For many use cases (general customer service, product demos, HR intake), a stock avatar is the faster and lower-maintenance choice.

Custom avatar generation (using the facial matching pipeline) makes sense when:

  • Brand identity requires a specific persona: a branded spokesperson, a named learning character, a company mascot
  • User personalization: the user sees their own likeness or a likeness they’ve chosen as the avatar (consent and privacy implications apply)
  • Localization: different regional markets may respond better to a locally familiar face

Avatar generation uses a separate quota system and does not consume credits. Avatar generation is currently in Beta, with quota granted by the Spatius team. Failed generation attempts automatically refund the quota.

Rendering on Device: Why Model Size Matters

The 5–10 MB model size is not incidental — it’s a direct consequence of the 3DGS representation and a core requirement for on-device rendering at scale.

Consider the deployment math for an application that offers 10 custom avatar personas to users. Each model is 5–10 MB. On first session load, the model downloads and caches on the device. Subsequent sessions load from cache. The total storage footprint per persona is similar to a single high-resolution image.
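The download-once, load-from-cache behavior described above can be sketched as a cache-first loader. This is a minimal illustration, not the SDK's implementation: `load_model`, the `.model` file extension, and the `download` callable are all hypothetical stand-ins.

```python
from pathlib import Path


def load_model(avatar_id, cache_dir, download):
    """Return model bytes, downloading only when the cache has no copy.

    download: callable taking avatar_id and returning bytes (hypothetical;
    stands in for whatever network call a real client would make).
    """
    cache_dir = Path(cache_dir)
    cached = cache_dir / f"{avatar_id}.model"
    if cached.exists():
        return cached.read_bytes()  # later sessions: no network transfer
    data = download(avatar_id)      # first session: one-time download
    cache_dir.mkdir(parents=True, exist_ok=True)
    tmp = cached.with_name(cached.name + ".tmp")
    tmp.write_bytes(data)
    tmp.replace(cached)             # rename so a partial file is never read
    return data
```

Writing to a temporary file and renaming it is a design choice that keeps an interrupted download from leaving a truncated model in the cache.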

For comparison, a cloud-streaming approach doesn’t cache a model locally — it streams rendered video for each session. This means network bandwidth is consumed every session, on every device, for every user.

Take 10 users, each loading a session for the first time: 10 × 10 MB = 100 MB of model data, transferred once per user. Streaming the same sessions costs 10 × (session duration in minutes × 60 × 1–2 MB/s) of video, and that cost recurs every session. For a 5-minute session that is 3,000–6,000 MB, already 30 to 60 times the model download, and the gap widens with every repeat session because the cached model is not downloaded again.
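The comparison can be written as two small functions. The sketch uses the article's own figures as inputs and, as a simplifying assumption, ignores the lightweight Motion data stream that on-device sessions still receive; the function and parameter names are illustrative.

```python
def model_download_mb(users, model_mb):
    """One-time transfer: each user downloads the model once, then uses the cache."""
    return users * model_mb


def streaming_mb(users, session_minutes, rate_mb_per_s, sessions_per_user=1):
    """Recurring transfer: every session streams video for its full duration."""
    return users * sessions_per_user * session_minutes * 60 * rate_mb_per_s
```

For example, `model_download_mb(10, 10)` gives 100 MB regardless of how many sessions follow, while `streaming_mb(10, 5, 1.0, sessions_per_user=20)` gives 60,000 MB for twenty 5-minute sessions per user at the low end of the 1–2 MB/s range.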

This is why on-device rendering — and the compact 3DGS model format that makes it feasible — changes the economics of custom avatar deployment at scale.

For a deeper look at how on-device and cloud-streaming architectures compare on bandwidth and cost: On-Device AI Avatar vs Cloud Streaming: Architecture, Bandwidth, and Cost in 2026

Creating 3D Hologram Effects with a Custom Avatar

The same 3DGS model that powers a standard rectangular video frame can be used to create hologram and transparent-layer visual effects. Because 3DGS natively supports 3D layer separation — the avatar renders as a distinct volumetric object rather than a texture on a flat surface — it can be composited as a transparent layer on top of other content.

In practice, this means a custom avatar can appear to float above a presentation, a retail product display, or a dynamic dashboard without the artificial edge artifacts of traditional green-screen cutout approaches.
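One way to picture the transparent-layer effect: if the renderer blends splats against a transparent background, each pixel ends up with a color and a coverage (alpha) value, and that pair can be placed over any other content. The sketch below assumes standard front-to-back alpha blending of depth-sorted splats, as commonly used in 3DGS renderers; the function names are illustrative and not part of any Spatius API.

```python
def composite_pixel(splats, min_transmittance=1e-4):
    """Blend depth-sorted splat contributions for one pixel, front to back.

    splats: iterable of (rgb, alpha) pairs, nearest first, where alpha is the
    splat's opacity already weighted by its 2D Gaussian falloff at this pixel.
    Returns (rgb, coverage); rgb is premultiplied by coverage.
    """
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light still passing through
    for rgb, alpha in splats:
        weight = alpha * transmittance
        for c in range(3):
            color[c] += weight * rgb[c]
        transmittance *= 1.0 - alpha
        if transmittance < min_transmittance:
            break  # pixel is effectively opaque; later splats are hidden
    return tuple(color), 1.0 - transmittance


def over(avatar_rgb, coverage, background_rgb):
    """Place a premultiplied avatar pixel over an opaque background pixel."""
    return tuple(a + (1.0 - coverage) * b for a, b in zip(avatar_rgb, background_rgb))
```

Because coverage is fractional rather than a binary mask, pixels at soft boundaries such as hair let part of the background show through, which is one plausible reason the result avoids hard cutout edges.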

For a technical walkthrough of building this effect on Web, iOS, and Android: How to Make a Hologram Out of an AI Avatar
For the broader hologram design and use case guide: AI Hologram: How to Build a Holographic AI Avatar in 2026

Getting Started

The fastest path to testing 3DGS avatar rendering before building a custom model is the Spatius Playground — a live session with a stock avatar that runs on the same rendering engine: spatius.ai/playground

Custom avatar generation (Beta) requires a Spatius account. Quota is granted by the team upon request at hello@spatius.ai.

For SDK documentation and model management APIs: docs.spatius.ai

