Biomimetic Voice Decomposition and Recomposition System

P.I.: Young H. Cho, PhD
University of Southern California/Information Sciences Institute

Figure: Conventional Spectrogram vs. Time-Frequency Hybrid Map

Digital signal processing of complex audio signals has been studied for many decades.  The most common methods of processing audio are frequency spectrum analysis and time domain pattern analysis.  While these traditional methods are computationally attractive, they face a fundamental limit on time-frequency resolution.  In short, higher resolution in the time domain forces lower resolution in the frequency domain, while higher resolution in the frequency domain necessitates lower time resolution.  Analyses that require high resolution in both domains therefore have to work within this limit and filter out unintended processing residuals.
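
The trade-off described above can be illustrated with a short numerical sketch.  It is not part of the system described here; it assumes a conventional short-time Fourier transform (the usual basis of a spectrogram) with a window of N samples, for which one analysis frame spans N / fs seconds and the frequency bins are spaced fs / N Hz apart.  The function name stft_resolution and the 16 kHz sample rate are hypothetical choices made only for illustration.

```python
from typing import Tuple


def stft_resolution(sample_rate_hz: float, window_length: int) -> Tuple[float, float]:
    """Return (frame duration in seconds, frequency bin spacing in Hz).

    Assumes a plain short-time Fourier transform whose FFT size equals
    the window length (no zero padding).
    """
    if sample_rate_hz <= 0 or window_length <= 0:
        raise ValueError("sample rate and window length must be positive")
    time_resolution_s = window_length / sample_rate_hz
    freq_resolution_hz = sample_rate_hz / window_length
    return time_resolution_s, freq_resolution_hz


if __name__ == "__main__":
    sample_rate_hz = 16_000  # assumed sample rate, chosen for illustration
    for window_length in (128, 512, 2048):
        dt, df = stft_resolution(sample_rate_hz, window_length)
        # A longer window narrows the frequency bins but widens the time frame.
        print(f"N={window_length:5d}  dt={dt * 1e3:7.2f} ms  df={df:7.2f} Hz")
```

Under these assumptions the product of the two values is always 1, so any gain in one domain is paid for in the other, whatever window length is chosen.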

This research leverages aspects of the audio processing mechanisms in the human auditory system to convert any given audio signal into clean, optimized time-frequency hybrid maps.  Given these maps, we intend to identify, isolate, and localize different sources of sound in a single audio stream, as biological systems are already capable of doing.

Our current processing system converts a human voice signal into a clean time-frequency hybrid map that allows different sounds to be isolated based on their audio patterns.  Audio segments manually isolated from this map can be re-encoded to produce sounds that appear to originate from different parts of the human vocal anatomy.

The algorithms developed under this research are being integrated into voice manipulation software that features manipulation of intonation (including the emotion conveyed by the voice) and conversion of one voice to another through filters.  This tool can be used to enhance voice synthesis and modification in the entertainment industry to produce desirable human voices (for example, in computer-animated movies and advanced audio tuning in music).  The technology can also be applied to improve the quality of text-to-speech generation and to reduce the size of the database for voice recognition systems.

A Sample Output of Our Voice Manipulation Program

The Original *.WAV files

The converted audio files using Audacity's Pitch Change Function

The converted audio files using our technology