Voice & Multimodal

AI agents that process and generate speech, audio, music, images, and video. Covers real-time voice interfaces, speech-to-text (STT) and text-to-speech (TTS) pipelines, multimodal foundation models, and agents that perceive the world through more than one input modality.