24-Hour AI Companion System Overview Plan

Her

Her-v2 Evolution Plan: Building a 24-Hour AI Companion System

The Journey from ASR-LLM-TTS to ASR-Agents-TTS

Project Vision: Create a true 24-hour AI companion that not only converses, but truly understands your life, remembers your stories, perceives your emotions, and integrates into your social network.


🎯 Core Design Philosophy

Three Levels of Anthropomorphization

Level 1: Perceptual Anthropomorphization
Not just hearing what you say, but understanding how you say it: multi-dimensional VAD and emotion recognition pick up your tone, emotions, and intentions, like a friend who truly listens.
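One way to picture "multi-dimensional VAD" is as a turn-end decision that combines the signals named in the architecture diagram (silence, decibel level, voice detection, semantic completeness, voiceprint). The sketch below is only an illustration under assumptions: the `FrameSignals` fields, the `should_trigger` function, and every threshold are hypothetical and do not come from the project.

```python
from dataclasses import dataclass


@dataclass
class FrameSignals:
    """Hypothetical per-segment signals; names and ranges are assumptions."""
    silence_ms: float         # trailing silence since the last voiced frame
    level_db: float           # segment loudness in dBFS (negative values)
    voice_prob: float         # probability that the audio is speech, 0..1
    semantic_complete: float  # estimated completeness of the utterance, 0..1
    speaker_match: float      # voiceprint similarity to the enrolled user, 0..1


def should_trigger(
    s: FrameSignals,
    *,
    min_db: float = -45.0,           # assumed loudness floor
    min_voice: float = 0.6,          # assumed speech-probability floor
    min_speaker: float = 0.7,        # assumed voiceprint-match floor
    end_silence_ms: float = 600.0,   # pause that may end a complete sentence
    long_silence_ms: float = 1500.0  # pause that ends the turn regardless
) -> bool:
    """Return True when the buffered speech should be sent on to ASR/Agents."""
    # Gate: ignore audio that is too quiet, not speech, or not the enrolled user.
    if s.level_db < min_db or s.voice_prob < min_voice or s.speaker_match < min_speaker:
        return False
    # A long silence closes the turn even if the sentence looks unfinished.
    if s.silence_ms >= long_silence_ms:
        return True
    # A short pause closes the turn only when the sentence looks complete.
    return s.silence_ms >= end_silence_ms and s.semantic_complete >= 0.5
```

With these assumed thresholds, a 700 ms pause after a complete-sounding sentence triggers a response, while the same pause in the middle of a sentence does not.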

Level 2: Memory Anthropomorphization
Not mechanical database retrieval, but human-like memory: some things are remembered clearly, some fade gradually, and some resurface suddenly in specific contexts. The A-MEM system links these memories into an interconnected knowledge network.
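The "remembered, forgotten, suddenly recalled" behaviour can be approximated with a score that decays over time but is lifted again when the current context resembles the memory. The following is a minimal sketch under assumptions, not A-MEM's actual algorithm: the `Memory` fields, the 72-hour half-life, and the scoring formula are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Memory:
    text: str
    importance: float            # 0..1, assumed to be assigned when stored
    age_hours: float             # time since the memory was last reinforced
    embedding: Sequence[float]   # semantic vector of the memory


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_score(m: Memory, context: Sequence[float], half_life_hours: float = 72.0) -> float:
    """Blend time-decayed retention with similarity to the current context."""
    # Retention halves every `half_life_hours` (exponential forgetting).
    retention = m.importance * 0.5 ** (m.age_hours / half_life_hours)
    # Context similarity can resurface a memory that has otherwise faded.
    relevance = max(0.0, cosine(m.embedding, context))
    return retention + (1.0 - retention) * relevance
```

Under this formula, a month-old memory scores close to zero in an unrelated conversation but close to one when the conversation returns to the same topic.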

Level 3: Interactive Anthropomorphization
Not passive responses, but proactive care: morning greetings, reminders when the weather changes, comfort when it notices you are feeling down, even well-timed humor, so that the AI becomes part of your life rather than just a tool.

Ultimate Goals of Human-Computer Interaction

Low Latency: End-to-end <1.5s for natural, fluid conversation
High Intelligence: Complex task handling through Agents architecture
Strong Memory: Building user social networks, understanding relationship contexts
Emotional Resonance: Recognizing and responding to emotional changes
Environmental Awareness: Passive listening with intelligent intervention
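The low-latency goal is the only one in this list with a concrete number, so it is the easiest to check in code. As a rough sketch, a per-stage timer can add up the time spent in each stage and compare the total with the 1.5 s target. The `LatencyTracker` class and the stage names used with it are hypothetical and not part of the project.

```python
import time
from contextlib import contextmanager

BUDGET_S = 1.5  # end-to-end target stated in the goals above


class LatencyTracker:
    """Accumulates wall-clock time per named pipeline stage."""

    def __init__(self) -> None:
        self.stages: dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        # Time the body of the `with` block and add it to the stage total.
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.stages[name] = self.stages.get(name, 0.0) + elapsed

    def total(self) -> float:
        return sum(self.stages.values())

    def within_budget(self, budget_s: float = BUDGET_S) -> bool:
        return self.total() < budget_s
```

A caller would wrap each stage, for example `with tracker.stage("asr"): ...`, and log `tracker.stages` whenever `within_budget()` returns False to see which stage is responsible.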


📐 System Architecture Design

Overall Architecture: ASR-Agents-TTS

┌────────────────────────────────────────────────────────────────────────┐
│                        Android/Embedded Client                         │
│ [Background Service]  [Audio Stream]  [Smart VAD]  [Trigger Decision]  │
└───────────────────────────────────┬────────────────────────────────────┘
                                    │ WebSocket/gRPC
┌───────────────────────────────────┴────────────────────────────────────┐
│                P2P Penetration Server (Personal Server)                │
│                                                                        │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │                 Intelligent VAD Decision Engine                  │  │
│  │     Silence Detection + Decibel Analysis + Voice Detection +     │  │
│  │                Semantic Completeness + Voiceprint                │  │
│  └────────────────────────────────┬─────────────────────────────────┘  │
│                                   │                                    │
│  ┌────────────────────────────────▼─────────────────────────────────┐  │
│  │                            ASR System                            │  │
│  │              SenseVoice (Multilingual+Emotion+VAD)               │  │
│  │          Real-time Transcription  Dialogue or Self-talk          │  │
│  └────────────────────────────────┬─────────────────────────────────┘  │
│                                   │                                    │
│  ┌────────────────────────────────▼─────────────────────────────────┐  │
│  │                    Master Agent (Coordinator)                    │  │
│  │                     Qwen2.5-7B / gpt-oss:20b                     │  │
│  │                                                                  │  │
│  │  ┌────────────────────────────────────────────────────────────┐  │  │
│  │  │                      Agents Ecosystem                      │  │  │
│  │  │                                                            │  │  │
│  │  │      ┌──────────┐      ┌──────────┐      ┌──────────┐      │  │  │
│  │  │      │  Memory  │      │ Emotion  │      │ Context  │      │  │  │
│  │  │      │  Agent   │      │  Agent   │      │  Agent   │      │  │  │
│  │  │      └────┬─────┘      └────┬─────┘      └────┬─────┘      │  │  │
│  │  │           │                 │                 │            │  │  │
│  │  │      ┌────▼─────┐      ┌────▼─────┐      ┌────▼─────┐      │  │  │
│  │  │      │  Dialog  │      │   Tool   │      │  Social  │      │  │  │
│  │  │      │  Agent   │      │  Agent   │      │  Agent   │      │  │  │
│  │  │      └──────────┘      └──────────┘      └──────────┘      │  │  │
│  │  │                                                            │  │  │
│  │  └─────────────────────────────┬──────────────────────────────┘  │  │
│  │                                │                                 │  │
│  │                 ┌──────────────┴──────────────┐                  │  │
│  │                 │                             │                  │  │
│  │         [Local Decision]                 [Cloud API]             │  │
│  │        Quick response for            Complex reasoning/          │  │
│  │           simple tasks                 tool invocation           │  │
│  └────────────────────────────────┬─────────────────────────────────┘  │
│                                   │                                    │
│  ┌────────────────────────────────▼─────────────────────────────────┐  │
│  │                Memory & Personality System (Core)                │  │
│  │          Vector Semantic Matching - Metadata Matching -          │  │
│  │             Multimodal Memory Fusion (Experimental)              │  │
│  │     ChromaDB(Vector) + Neo4j(Graph) + PostgreSQL(Structured)     │  │
│  └──────────────────────────────────────────────────────────────────┘  │
│                                                                        │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │                            TTS System                            │  │
│  │      GPT-SoVITS / Index-TTS  Emotionalized Voice Synthesis       │  │
│  └──────────────────────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────────────────┘
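The diagram has the Master Agent splitting work between a local decision path (quick responses for simple tasks) and a cloud API path (complex reasoning and tool invocation). A minimal sketch of such a router follows. It rests on assumptions: the `Turn` fields, the `choose_route` function, and the word and step limits are hypothetical, and the article does not say how the real coordinator decides.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    LOCAL = "local"   # quick response from the on-device or personal-server model
    CLOUD = "cloud"   # complex reasoning or tool invocation via a cloud API


@dataclass
class Turn:
    transcript: str
    needs_tool: bool = False   # assumed flag, e.g. set by the Tool Agent
    reasoning_steps: int = 1   # assumed estimate of task complexity


def choose_route(turn: Turn, max_local_words: int = 30, max_local_steps: int = 1) -> Route:
    """Pick the cheapest path that can plausibly handle the turn."""
    # Tool calls and multi-step reasoning go to the cloud path.
    if turn.needs_tool or turn.reasoning_steps > max_local_steps:
        return Route.CLOUD
    # Long utterances are treated as complex; this is a crude stand-in heuristic.
    if len(turn.transcript.split()) > max_local_words:
        return Route.CLOUD
    return Route.LOCAL
```

For example, `choose_route(Turn("good morning"))` returns `Route.LOCAL`, while a turn flagged with `needs_tool=True` is sent to `Route.CLOUD`. Keeping simple turns local is one way to stay inside the latency target, since they skip a network round trip.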
