Avatar Studio: Browser-Native Perception R&D Platform
#worksona #avatar #perception #face-recognition #voice-emotion #motion-capture #browser-native
David Olsson

Avatar Studio is a browser-native R&D platform containing seven self-contained perception applications that form the research substrate for the Worksona Avatar stack. The applications cover face recognition and 3D mesh mapping (face-api.js, 468-point landmark mesh), facial emotion detection, voice emotion analysis (Web Audio API, frequency-domain feature extraction), body motion capture (MediaPipe Pose), hand tracking (MediaPipe Hands), eye gaze estimation, and a Three.js creative visualization layer that synthesizes perception streams into an expressive 3D avatar render.
Each application opens as a standalone HTML file with CDN-loaded dependencies. There is no build step, no local server, and no shared runtime state between applications.
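As a sketch of the zero-infrastructure pattern, a standalone app file looks roughly like this. The file name, markup, and pinned CDN path are illustrative, not the actual studio files:

```html
<!-- face-mesh.html: illustrative standalone app skeleton -->
<!DOCTYPE html>
<html>
<head>
  <!-- CDN-loaded dependency; no build step, no local server -->
  <script src="https://cdn.jsdelivr.net/npm/face-api.js/dist/face-api.min.js"></script>
</head>
<body>
  <video id="cam" autoplay muted playsinline></video>
  <canvas id="overlay"></canvas>
  <script>
    // Each app owns its entire pipeline: camera -> model -> render.
    navigator.mediaDevices.getUserMedia({ video: true })
      .then((stream) => { document.getElementById('cam').srcObject = stream; });
  </script>
</body>
</html>
```

Opening the file directly in a browser is the whole deployment story, which is what keeps the R&D loop at open-a-file speed.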
Why is it useful?
Perception capability for avatars requires iterating on models, sensor configurations, and visual feedback loops before committing to a production stack. The zero-infrastructure constraint is deliberate: R&D moves faster when the loop is open-a-file rather than restart-the-server. Each of the seven applications is independently runnable, so face tracking experiments do not require voice analysis to be operational. This isolation prevents cross-concern failures from blocking progress on any individual perception module.
The 468-point facial mesh provides enough landmark density for expressive avatar animation. We can detect not just gross expressions but subtle sub-expressions across mouth, eye, and brow regions: the difference between a polite smile and a genuine one is representable in the coordinate data. The voice emotion module operates in the frequency domain rather than on raw waveforms, which keeps the feature extraction lightweight enough to run in a browser audio worklet without perceptible latency.
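To make the polite-versus-genuine distinction concrete, here is a minimal sketch of a Duchenne-smile proxy computed from landmark coordinates. The index map, baseline, and 0.8 narrowing threshold are hypothetical placeholders; the real 468-point mesh has its own index layout:

```javascript
// Sketch: Duchenne-smile proxy from 2D landmarks (image y grows downward).
// `idx` maps semantic names to mesh indices; indices here are placeholders.
const smileGenuineness = (pts, idx, eyeOpenBaseline) => {
  // Mouth-corner lift: corners sitting above the lip center suggests a smile
  const lift = pts[idx.lipCenter].y -
    (pts[idx.mouthLeft].y + pts[idx.mouthRight].y) / 2;
  // Eye narrowing (orbicularis oculi) is the marker of a genuine smile
  const eyeOpen = pts[idx.eyeLower].y - pts[idx.eyeUpper].y;
  // Assumed rule: smile plus eyes narrowed below 80% of the neutral baseline
  return { lift, eyeOpen, genuine: lift > 0 && eyeOpen < 0.8 * eyeOpenBaseline };
};
```

The same pattern generalizes to brow raise or lip press: pick a small set of landmarks, derive a scalar, and compare it against a per-user neutral baseline.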
How and where does it apply?
Avatar Studio is upstream R&D for the Worksona Avatar system. The perception data collected and validated here informs the runtime avatar pipeline architecture. MediaPipe body and hand tracking outputs map to animation rig coordinate systems. The voice emotion analysis runs on real-time microphone input and produces a five-emotion probability vector (angry, fearful, happy, sad, neutral) that drives avatar affective state in the downstream pipeline.
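The shape of the five-emotion vector can be sketched as a softmax over per-class scores. The scoring itself is a placeholder here; only the output contract (five named probabilities summing to one) reflects the text above:

```javascript
// The five affective classes consumed by the downstream avatar pipeline
const EMOTIONS = ['angry', 'fearful', 'happy', 'sad', 'neutral'];

// Sketch: normalize raw class scores into the 5-class probability vector.
const emotionVector = (scores) => {
  // Numerically stable softmax: subtract the max before exponentiating
  const max = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return Object.fromEntries(EMOTIONS.map((e, i) => [e, exps[i] / sum]));
};
```

Because the output is a proper probability distribution, the downstream pipeline can blend it over time (e.g. an exponential moving average) without renormalizing.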
The Three.js creative visualization layer serves a dual function: it is a technical testbed for perception fusion logic and a stakeholder demonstration environment where multiple perception streams can be observed rendering into a single coherent avatar representation in real time.
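A minimal sketch of the fusion step feeding that render. The payload field names are assumptions, not the studio's actual schema; the key behavior shown is that each stream updates independently, so a dropped frame on one channel falls back to the last known value:

```javascript
// Sketch: fuse independent perception streams into one avatar state.
// Any stream may be absent this tick; `??` keeps the previous value.
const fusePerception = (prev, { face, pose, hands, voice }) => ({
  faceMesh: face ?? prev.faceMesh,    // 468-point landmark frame
  bodyPose: pose ?? prev.bodyPose,    // MediaPipe Pose keypoints
  handState: hands ?? prev.handState, // MediaPipe Hands landmarks
  emotion: voice ?? prev.emotion,     // 5-class probability vector
  timestamp: Date.now(),
});
```

This last-known-value fallback is what lets the Three.js layer render one coherent avatar even when the perception modules run at different frame rates.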
```mermaid
graph TD
A[Camera Input] --> B[face-api.js<br/>468-point mesh]
A --> C[MediaPipe Pose<br/>body keypoints]
A --> D[MediaPipe Hands<br/>hand tracking]
E[Microphone] --> F[Web Audio API<br/>voice emotion]
B & C & D & F --> G[Perception Fusion]
G --> H[Three.js Avatar<br/>3D expressive render]
G --> I[Emotion Vector<br/>5-class probability]
```
The voice emotion feature extraction below illustrates the frequency-band decomposition we use. Three bands capture prosody, voice quality, and breathiness respectively; these are the three acoustic dimensions most correlated with emotional state in the literature we validated against.
```javascript
const analyzeVoiceEmotion = (analyserNode) => {
  const bufferLength = analyserNode.frequencyBinCount;
  const dataArray = new Float32Array(bufferLength);
  analyserNode.getFloatFrequencyData(dataArray); // per-bin magnitudes in dB
  // Width of one FFT bin in Hz
  const binHz = analyserNode.context.sampleRate / analyserNode.fftSize;
  // Mean linear power within [lowHz, highHz)
  const bandEnergy = (data, lowHz, highHz) => {
    const lo = Math.floor(lowHz / binHz);
    const hi = Math.min(Math.ceil(highHz / binHz), data.length);
    let sum = 0;
    for (let i = lo; i < hi; i++) sum += 10 ** (data[i] / 10); // dB -> linear
    return hi > lo ? sum / (hi - lo) : 0;
  };
  // Extract energy in emotion-relevant frequency bands
  const lowEnergy = bandEnergy(dataArray, 0, 500);      // prosody
  const midEnergy = bandEnergy(dataArray, 500, 2000);   // voice quality
  const highEnergy = bandEnergy(dataArray, 2000, 8000); // breathiness
  // classifyEmotion: the band-weight classifier under active refinement
  return classifyEmotion({ lowEnergy, midEnergy, highEnergy });
};
```
The three-band decomposition is intentionally coarse at this stage. As we accumulate labeled samples from the R&D environment, we will refine the band boundaries and the classifier weights before promoting the module to the production avatar pipeline.
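One way the refinement from labeled samples could proceed is an incremental per-emotion centroid update over the three band energies. This nearest-centroid scheme is an assumption for illustration, not the committed production design:

```javascript
// Sketch: incrementally refine one emotion's feature centroid from a
// newly labeled sample. Assumed scheme, not the committed classifier.
const updateCentroid = (centroid, count, sample) => {
  const n = count + 1;
  const next = {};
  // Incremental mean over { lowEnergy, midEnergy, highEnergy }
  for (const k of Object.keys(sample)) {
    next[k] = centroid[k] + (sample[k] - centroid[k]) / n;
  }
  return { centroid: next, count: n };
};
```

Running one such accumulator per emotion class yields data-driven reference points against which new band-energy triples can be classified, and the same accumulated statistics can inform where to move the band boundaries themselves.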