Reported by: Dr. Roland K C Tan,
Chairman (Term 1997/98)
On Friday, 8 May 1998,
about 12 members and 36 invited guests attended a seminar on
"Time-Scale Modification Algorithm for Audio & Speech Signal Applications"
organised by the Audio Engineering Society (AES) Singapore Section. The
topic for the seminar was based on our paper that was also presented at the
recent AES 104th Convention (preprint no: 4644)* on Sunday, 17
May 1998 in Amsterdam at the RAI Congress Centre.
The speaker for the evening was my co-author, 15-year-old student
Amerson H J Lin. The young prodigy from Raffles Institution, Singapore's
premier school, is sitting for his GCE 'O' level exam at the end of this
year and has been involved in DSP research projects under my mentorship
since January 1996. The event was held in the Staff Conference Room of the
Electrical Engineering Department at Ngee Ann Polytechnic. The audience
comprised audio professionals and tertiary students from the local TV and
radio broadcasting stations, the audio industry, and the local universities
and polytechnics. Amerson's teachers and classmates from Raffles
Institution were also present during his talk.
A high-quality time-scale modification algorithm applicable to both
digital audio and speech is a useful feature for dedicated audio systems.
A novel approach named SASOLA (sub-band analysis synchronous
overlap-and-add) allows the tempo of music to be varied by up to a factor
of two, in either expansion or reduction, without affecting the pitch of
the musical instruments or the characteristics of the singers' voices.
[Photo: Amerson Lin, a 15-year-old prodigy from Singapore's premier school
Raffles Institution, presenting a talk on his research findings at Ngee Ann
Polytechnic - photograph by Mr. Michael Teh, Committee Member.]
Time-scale modification algorithms originally developed for speech, such
as the pitch synchronous overlap-and-add (PSOLA) technique, can produce
excellent results. However, they may not perform as well with audio signals
because accurate pitch prediction is difficult to achieve on an audio
waveform. The proposed SASOLA algorithm offers an alternative to the
existing time-scale modification algorithms owing to its computational
efficiency and higher output sound quality. SASOLA uses sub-band analysis
and is based on the time-domain synchronous overlap-and-add (SOLA)
technique originally developed for speech.
Overlap-and-add concatenates two frames of speech/audio samples (that is,
the analysis frame and the synthesis frame) at the alignment point in the
region of overlap where the two frames are most similar. In SOLA, this
point is found by maximising the cross-correlation function between the
analysis frame and the synthesis frame in the overlapping region.
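The alignment search described above can be sketched in a few lines of
Python. This is a minimal illustration under stated assumptions, not the
implementation from the paper: the function names, the normalised form of
the cross-correlation, the linear cross-fade, and the parameters `overlap`
and `max_shift` are all hypothetical choices made for the example.

```python
import math

def best_offset(synth, analysis, overlap, max_shift):
    """Return the shift k in 0..max_shift at which analysis[k:k+overlap]
    is most similar to the last `overlap` samples of synth.
    Similarity is the normalised cross-correlation (an assumption)."""
    tail = synth[-overlap:]
    best_k, best_score = 0, -math.inf
    for k in range(max_shift + 1):
        seg = analysis[k:k + overlap]
        if len(seg) < overlap:
            break  # not enough samples left in the analysis frame
        num = sum(a * b for a, b in zip(tail, seg))
        den = math.sqrt(sum(a * a for a in tail) * sum(b * b for b in seg))
        score = num / den if den > 0 else 0.0
        if score > best_score:
            best_k, best_score = k, score
    return best_k

def overlap_add(synth, analysis, overlap, max_shift):
    """Append the analysis frame to synth at the best alignment point,
    cross-fading linearly over the overlap region.
    Returns the extended signal and the chosen shift."""
    k = best_offset(synth, analysis, overlap, max_shift)
    seg = analysis[k:]
    tail = synth[-overlap:]
    fade = [(1 - i / overlap) * tail[i] + (i / overlap) * seg[i]
            for i in range(overlap)]
    return synth[:-overlap] + fade + seg[overlap:], k
```

For a periodic input such as a pure sine wave, the search picks the shift
that brings the two frames back into phase, so the joined signal has no
discontinuity at the splice.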
[Photo: Amerson Lin (centre, in school blazer and tie) with AES members and
guests after the talk - photograph by Christopher K C Yap, Treasurer.]
Unlike speech signals, most broadband audio signals spanning the full
audible bandwidth from 20 Hz to 20 kHz contain strong transients and are
inherently non-stationary, which means that the best alignment point in
the region of overlap may not always be ideal. The resulting misalignments
contribute audible distortions which affect the pitch of the instruments
and the singer's voice. The SASOLA algorithm overcomes the problem by first
decomposing the broadband signal into smaller sub-bands before performing
overlap-and-add on the individual bands. Partitioning the audio signal into
sub-bands with narrower bandwidth makes the signal in each band more
predictable. A better alignment can thus be realised, which results in an
overall improvement in the output sound quality.
Although the computational complexity of the SOLA algorithm is lower than
that of frequency-domain processing techniques such as the short-time
Fourier transform (STFT) algorithm, a single-chip real-time hardware
implementation for both audio and speech applications at 44.1 kHz/48 kHz
sampling frequency is not viable. This is due to the compute-intensive
time-domain cross-correlation in the SOLA algorithm. However, the overall
computational efficiency can be increased by simply switching the
cross-correlation computation from the time domain to the frequency
domain, using the convolution-multiplication relationship, which can be
proven mathematically.
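The relationship in question can be illustrated with the correlation form
of the convolution theorem: for real sequences, the cross-correlation
equals the inverse transform of conj(X) multiplied by Y. The sketch below
compares a direct time-domain computation with the frequency-domain one.
It is only an illustration under assumptions: the function names are
hypothetical, the small radix-2 FFT is included just to keep the example
self-contained, and `max_lag` is assumed to be smaller than `len(y)`.

```python
import cmath

def fft(a, inverse=False):
    """Recursive radix-2 FFT; len(a) must be a power of two.
    The inverse transform is left unscaled (caller divides by N)."""
    n = len(a)
    if n == 1:
        return list(a)
    sign = 1 if inverse else -1
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def xcorr_direct(x, y, max_lag):
    """Time-domain cross-correlation r[k] = sum_n x[n] * y[n + k]."""
    return [sum(x[n] * y[n + k] for n in range(len(x)) if n + k < len(y))
            for k in range(max_lag + 1)]

def xcorr_fft(x, y, max_lag):
    """Same result via the frequency domain: r = IFFT(conj(X) * Y).
    Zero padding to at least len(x) + len(y) avoids circular wrap-around."""
    size = 1
    while size < len(x) + len(y):
        size *= 2
    X = fft(list(x) + [0.0] * (size - len(x)))
    Y = fft(list(y) + [0.0] * (size - len(y)))
    prod = [xk.conjugate() * yk for xk, yk in zip(X, Y)]
    r = fft(prod, inverse=True)
    return [r[k].real / size for k in range(max_lag + 1)]
```

The direct form costs on the order of N multiply-adds per lag, while the
transform-based form costs on the order of N log N operations for all lags
at once, which is where the saving comes from when many lags are searched.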
[Photo: Amerson H J Lin (right) receiving the plaque from Chairman AES
Singapore Section, Dr. Roland K C Tan - photograph by C S Lim, SBA.]
The sound quality obtained with a commercial technique on speech and music
signals was subjectively compared during a sound demo session that followed
the presentation. With speech signals, the sound quality was clearly
superior to that obtained with music signals, for both time-scale expansion
and reduction. This can be explained by the fact that the time-domain
waveform of a music signal is generally more complex and non-stationary
(high variation with time) than that of speech. It is difficult to achieve
a good cross-correlation between frames for music signals.
Decomposing the full audio bandwidth into smaller sub-bands therefore
reduces the complexity of the music signal in each band, making it more
"stationary". A better cross-correlation between frames can then be
achieved. After time-scale modification, the processed sub-band signals
are synthesised (combined) again to obtain the output music signal at full
bandwidth.
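The split, process and recombine structure described above can be sketched
as follows. This report does not describe the filter bank used in SASOLA,
so the example substitutes a deliberately simple stand-in: moving-average
low-pass filters whose differences form bands that sum back exactly to the
input. The function names, the `widths` parameter and the `process` hook
(standing in for per-band time-scale modification) are all assumptions
made for illustration.

```python
def moving_average(x, width):
    """Centred moving average, a crude zero-phase low-pass filter.
    The window shrinks at the edges of the signal."""
    half = width // 2
    out = []
    for n in range(len(x)):
        lo, hi = max(0, n - half), min(len(x), n + half + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out

def split_bands(x, widths):
    """Split x into len(widths) + 1 bands, lowest band first.
    widths: moving-average lengths, longest (lowest cut-off) first.
    By construction the bands sum back to x."""
    bands = []
    previous = [0.0] * len(x)
    for w in widths:
        low = moving_average(x, w)
        bands.append([a - b for a, b in zip(low, previous)])
        previous = low
    bands.append([a - b for a, b in zip(x, previous)])
    return bands

def combine_bands(bands):
    """Synthesis step: add the bands sample by sample."""
    return [sum(samples) for samples in zip(*bands)]

def process_subbands(x, widths, process=lambda band: band):
    """Apply `process` to every band, then recombine.
    In a time-scale modifier, `process` would stretch or compress each
    band by the same factor so that all bands keep a common length."""
    return combine_bands([process(b) for b in split_bands(x, widths)])
```

With the default identity `process`, the output equals the input up to
rounding error, which is a convenient sanity check before plugging in a
real per-band modification.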
[Photo: Amerson Lin presenting during the AES 104th Convention at the RAI
Congress Centre, Amsterdam - photograph by Dr. Roland K C Tan, Chairman.]
A second subjective comparison was then made between SASOLA and the
commercial technique. At twice the expansion (-50%) in particular,
time-scale modification using the commercial technique generated audible
"echo" and "stuttering" distortion effects in the background. At twice the
reduction (+100%), on the other hand, information was clearly missing.
These audible distortions were eliminated using SASOLA.
These effects were more obvious with contemporary pop music containing
instruments that produce high-transient waveforms, such as kick drums,
castanets or hi-hats. Overall, the audience felt that the results obtained
using SASOLA better retained the pitch and tone characteristics of the
original music and speech signals.
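The percentages quoted for the two test cases are consistent with reading
them as the change in tempo: doubling the duration halves the tempo (-50%),
and halving the duration doubles it (+100%). Assuming that reading, which
the report does not state explicitly, the conversion can be written as a
small helper (the function names are hypothetical):

```python
def tempo_change_percent(ratio):
    """Tempo change in percent for an output/input duration ratio.
    ratio = 2.0 means the output lasts twice as long as the input."""
    return (1.0 / ratio - 1.0) * 100.0

def duration_ratio_from_tempo(percent):
    """Inverse mapping: duration ratio for a given tempo change in percent."""
    return 1.0 / (1.0 + percent / 100.0)
```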
The technology developed is suitable for applications in the pro-audio,
communication, broadcast and entertainment industries. For example, in lip
synchronisation during voice-over work or special sound effects in
cinematography, there is no need to re-record the actor's voice or to
involve the orchestra again. This could save both time and money. With CD,
DAB and digital mixers, a DJ can vary the tempo of music for a "smooth mix"
without affecting the original music signal characteristics - a technique
currently not possible with an analogue mixer. In communication system
applications, long recorded or 'live' voice messages from a fast talker can
be slowed down to improve intelligibility. Similarly, listening time can be
shortened by speeding up music and speech recordings during playback.
[Photo: Dr. Roland K C Tan (left) with Amerson H J Lin (right) in front of
the RAI Congress Centre, Amsterdam, The Netherlands, on Sunday, 17 May 1998
- photograph by C S Lim, SBA.]
* Amerson H J Lin & Roland K C Tan, "Time-Scale Modification Algorithm for
Audio and Speech Signal Applications", presented at the 104th Convention
of the Audio Engineering Society, J. Audio Eng. Soc. (Abstracts), vol. 46,
p. 574 (1998 June), preprint 4644.
* Roland K C Tan & Amerson H J Lin, "A Time-Scale Modification Algorithm
Based on the Subband Time-Domain Technique for Broad-Band Signal
Applications", J. Audio Eng. Soc., vol. 48, no. 5, pp. 437-449 (2000 May).