Ken Lomax is teaching his synthesiser to sing. He reckons that in a few years, with a little fine-tuning, it will be possible to create a duet between Kiri Te Kanawa and Edith Piaf. Claire Neesham reports
New Scientist vol 151 issue 2046 - 07 September 96, page 36
EMOTIONS are running high. It's half-past ten at night and the heroine is preparing for one of the most intense and demanding arias of Richard Wagner's opera Tristan and Isolde. Tristan has just died, and Isolde will sing of his soul floating above the water. After three and a half hours of drama, Isolde must once again grip the audience, conveying her hopelessness and grief above the swelling orchestra before she, too, dies.
To carry such a performance convincingly, a singer must train for years and prepare carefully every day before going on stage. She must stretch and relax the many muscles she will need to project her voice, have enough fluid to keep her vocal cords lubricated, and be in the right emotional state. These are all such intensely human activities that it is difficult to imagine how a computer could ever replace an opera singer's voice.
Infinitely flexible
Ken Lomax thinks he has at least an idea of how it may be possible within a few years. At the departments of phonetics and computing at the University of Oxford, Lomax has spent two years analysing and synthesising the voices of classical singers. He hopes that "singing" will one day be an option on the average digital synthesiser, providing the opportunity to hear Elvis sing a duet with Luciano Pavarotti or Kiri Te Kanawa in harmony with Edith Piaf.
One of the reasons the human voice has not appeared on a synthesiser before now is that listeners demand such high quality. "We are all experts in the human voice," says Lomax. A string player may cringe at the sound of a synthesised violin, but everyone winces at a synthesised voice.
There are other difficulties, too. "The voice is not a rigid instrument like a cello or a piano, but an infinitely flexible apparatus," says Lomax. Everyone's vocal tract is different, and people's voices change for a host of reasons: in the short term, perhaps because they went drinking the night before, and in the long term as a natural consequence of ageing. Trained singers also introduce vibrato into their voices and use precise muscular control to give words or phrases emphasis, and to add emotion.
With so many variables, reducing a voice to the digital language of a computer is obviously going to be far harder than dealing with other musical instruments. "There are two stages to the process," says Lomax. "The first is to analyse and encode the voice of the singer, and the second is to analyse and encode their singing style." Later, when he wants his machine to "sing", the synthesiser takes the encoded voice and manipulates it according to the encoded style.
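The article does not describe Lomax's software, but the split he describes can be sketched in a few lines of Python. Everything below, from the class names to the fields, is my own invention, meant only to show the shape of the two encodings: a voice bank of sound units, and a style bank of named intention templates applied at performance time.

```python
# A minimal sketch, not Lomax's actual code: the voice is encoded as sound
# units, the singing style as named intention templates applied at performance.
from dataclasses import dataclass
import numpy as np

@dataclass
class SoundUnit:                  # encoded voice: one vowel or syllable, e.g. "la"
    label: str
    fundamental_hz: float         # pitch at which the unit was recorded
    envelopes: np.ndarray         # one row of partial amplitudes per time slice

@dataclass
class IntentionTemplate:          # encoded style: how this singer performs "crescendo", "slur", ...
    name: str
    amplitude_curve: np.ndarray   # gain over the course of the note
    pitch_curve: np.ndarray       # deviation from the written pitch, in semitones

# toy databases for one singer; the synthesiser would look up units and
# templates here when it performs
voice_bank = {"la": SoundUnit("la", 261.6, np.ones((100, 40)))}
style_bank = {"crescendo": IntentionTemplate("crescendo",
                                             np.linspace(0.2, 1.0, 100),
                                             np.zeros(100))}
```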
For the first stage, Lomax asks his subjects, who come from a choral group at the university, to sing nonsense words such as "tadameadolaeo" which contain a number of common syllables and vowels used in singing practice. He then separates out the vowels and syllables from each recording, creating what he calls "sound units", which he digitises on his Silicon Graphics workstation.
What are your intentions?
These sound units need to be in a format that allows their pitch and duration to be changed without losing their quality. To achieve this, Lomax has taken a fundamental look at the way sound is built up. He uses a mathematical device called a Fourier transform, which breaks down a complex sound into a series of pure tones of different amplitudes, known as partials (see Diagrams below).
One way to think of these harmonics is as a row of vertical spikes, each one at a different frequency and with its height equal to the amplitude. The computer can store the frequencies of the partials easily enough, because they are all multiples of the "fundamental", the pitch of the original sound. It stores the amplitudes as an "envelope"—a line that joins together the tops of all the spikes. Lomax's computer carries out this analysis a hundred times a second, slicing up the sound until it is all stored away.
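In Python, a rough version of this analysis stage might look like the sketch below. The window, frame length and number of partials are assumptions of mine, not figures from Lomax's system; the point is only how a recording becomes one row of harmonic amplitudes per 10-millisecond slice.

```python
# A rough illustration of the analysis step: slice the recording into 10 ms
# frames, estimate the amplitude of each harmonic partial, and keep those
# amplitudes as a per-slice spectral envelope.
import numpy as np

def analyse(signal, sample_rate, fundamental_hz, n_partials=40, frames_per_sec=100):
    hop = sample_rate // frames_per_sec           # 10 ms per slice at 100 frames/s
    envelopes = []
    for start in range(0, len(signal) - hop, hop):
        frame = signal[start:start + hop] * np.hanning(hop)
        spectrum = np.abs(np.fft.rfft(frame))
        freqs = np.fft.rfftfreq(hop, d=1.0 / sample_rate)
        # read off the amplitude nearest each multiple of the fundamental
        amps = [spectrum[np.argmin(np.abs(freqs - k * fundamental_hz))]
                for k in range(1, n_partials + 1)]
        envelopes.append(amps)
    return np.array(envelopes)                    # one row of partial amplitudes per slice
```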
For any time slice, the machine can reconstruct the envelope and partials just by being given the fundamental. This means that it can produce exactly the same sound, "la" or "do", at any frequency. By repeating some of the time slices, Lomax can also lengthen a sound unit, or he can link different units together to make new words.
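A hedged sketch of the resynthesis side, assuming straightforward additive synthesis of the stored partials (Lomax's actual method may well differ): the same stored unit is rebuilt at whatever fundamental is asked for, and lengthened simply by repeating time slices.

```python
# Additive resynthesis from the stored envelopes: the same "la" can be rebuilt
# at any fundamental, and stretched by repeating slices.
import numpy as np

def resynthesise(envelopes, fundamental_hz, sample_rate=44100, stretch=1,
                 frames_per_sec=100):
    hop = sample_rate // frames_per_sec
    frames = np.repeat(envelopes, stretch, axis=0)   # repeat slices to lengthen the unit
    out = np.zeros(len(frames) * hop)
    phase = np.zeros(frames.shape[1])                # running phase of each partial
    for i, amps in enumerate(frames):
        t = np.arange(hop) / sample_rate
        block = np.zeros(hop)
        for k, a in enumerate(amps, start=1):        # sum the partials for this slice
            block += a * np.sin(2 * np.pi * k * fundamental_hz * t + phase[k - 1])
            phase[k - 1] += 2 * np.pi * k * fundamental_hz * hop / sample_rate
        out[i * hop:(i + 1) * hop] = block
    return out / (np.max(np.abs(out)) + 1e-9)        # normalise to avoid clipping
```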
Much of this technique is similar to what goes on in existing digital synthesisers. What really sets Lomax's machine apart is that it can change not only the sounds that are sung but the style in which they are performed. When professionals sing, they usually know how they want to sing a piece, based on their reading of the manuscript. They decide whether to begin softly (piano), or loudly (forte), whether to sing the sounds smoothly (legato) or in a clipped, abrupt manner (staccato). Lomax has analysed what these intentions mean in terms of the frequencies and amplitudes of the sound units, and then written programs to reproduce those effects.
Lomax asks each singer to explain his or her intentions for a section of musical manuscript, and then to sing it. Given the instruction "sing middle C and then make a crescendo", for example, it is obvious that during the crescendo the amplitude of the sound "la" will increase over time. But this is not all that changes. When a performer sings softly, the high-frequency partials in the voice have relatively low amplitudes, but as the performer sings louder, the amplitudes of the high-frequency partials grow faster than those of the low-frequency partials, so the proportion changes.
This phenomenon is known as spectral tilt, and in order to realise the intention "crescendo" in physical terms, Lomax must build it into his machine along with the general rise in amplitude.
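As an illustration only, with invented numbers rather than Lomax's measured templates, realising a crescendo on the stored envelopes might combine a rising overall gain with a progressive tilt in favour of the high partials:

```python
# A simplified "crescendo" intention: overall amplitude rises, and the spectral
# tilt changes so that high partials grow faster than low ones. The curve
# shapes and the tilt figure are guesses, not measurements.
import numpy as np

def apply_crescendo(envelopes, tilt_db_per_partial=0.15):
    n_frames, n_partials = envelopes.shape
    out = envelopes.copy()
    for i in range(n_frames):
        progress = i / max(n_frames - 1, 1)           # 0.0 at the start, 1.0 at the end
        gain = 0.2 + 0.8 * progress                   # overall loudness rises
        # boost partial k by progress * tilt * k decibels: high partials grow faster
        partial_boost = 10 ** (progress * tilt_db_per_partial *
                               np.arange(n_partials) / 20.0)
        out[i] = envelopes[i] * gain * partial_boost
    return out
```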
A human feel
To do this, he builds up a series of templates for every intention and for every singer. The templates are graphs that show how particular properties—for example, pitch, spectral tilt, amplitude, vibrato rate or depth of vibrato—vary in time when that intention is performed. And it is these templates that the synthesiser uses to manipulate the sound units to give them a human feel.
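The article does not say how these templates are stored, but a plausible sketch is a set of time-varying curves, one per property, that are stretched to fit whatever note they are applied to. The field names below are my own, not Lomax's.

```python
# One possible shape for a style template: curves describing how each property
# varies over a note when a given intention is performed by a given singer.
from dataclasses import dataclass
import numpy as np

@dataclass
class Template:
    intention: str              # e.g. "crescendo", "start", "slur"
    pitch_dev: np.ndarray       # semitone deviation from the written note, per slice
    amplitude: np.ndarray       # gain per slice
    tilt: np.ndarray            # spectral tilt per slice, dB per partial
    vibrato_rate: np.ndarray    # Hz, per slice
    vibrato_depth: np.ndarray   # semitones, per slice

def sample_template(template: Template, n_slices: int) -> Template:
    """Stretch or squeeze a template's curves to fit a note lasting n_slices slices."""
    def fit(curve):
        x = np.linspace(0, 1, len(curve))
        return np.interp(np.linspace(0, 1, n_slices), x, curve)
    return Template(template.intention, fit(template.pitch_dev), fit(template.amplitude),
                    fit(template.tilt), fit(template.vibrato_rate), fit(template.vibrato_depth))
```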
Take the section of manuscript opposite. Once this is fed into the synthesiser, Lomax can press a button and the machine translates the musical notation into a series of intentions. Before it begins to perform, the machine notes the "global" intentions: such things as the number of beats to a bar and the volume.
Then it retrieves the first sound unit from its database, and applies the first intention "start" to it. When humans sing a note, they begin slightly off-pitch and then adjust. This is precisely what is stored in the template under the intention "start". The synthesiser retrieves the second sound unit and reads the intention "slur" so it slides from one to the other as best it can.
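For instance, a "start" intention could be realised as a short pitch curve that begins below the written note and settles onto it, and a "slur" as a glide from the previous note's pitch to the new one. The curve shapes below are guesses for illustration, not Lomax's measured templates.

```python
# Hedged sketches of two intentions: "start" begins slightly off-pitch and
# settles, "slur" glides smoothly between two notes.
import numpy as np

def start_pitch_curve(target_hz, n_slices, offset_semitones=-0.6, settle_fraction=0.25):
    settle = max(int(n_slices * settle_fraction), 1)   # slices spent sliding onto pitch
    dev = np.zeros(n_slices)
    dev[:settle] = np.linspace(offset_semitones, 0.0, settle)
    return target_hz * 2 ** (dev / 12.0)               # per-slice fundamental in Hz

def slur_pitch_curve(from_hz, to_hz, n_slices):
    # glide geometrically (i.e. linearly in musical pitch) from one note to the next
    return np.geomspace(from_hz, to_hz, n_slices)
```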
What does all this mean for would-be musical impresarios? If you wanted Björk to sing Isolde's final aria, for example, you would first need to sample Björk's voice to create the sound units and build up each word of the aria from them. Then you would need to feed the manuscript of the aria into the synthesiser, so that it could generate the appropriate intentions. Finally, Björk would need to demonstrate the relevant musical intentions in order to build up a series of Björk-style templates. Only then would the synthesiser be ready to perform.
Sounds simple? "Don't hold your breath," says Lomax, who is first to admit that a synthesiser capable of coping with this combination is still far away. Although he says there are "surprisingly few" musical intentions, about twenty, every singer would have to record some 2000 sound units in order for the machine to have command of the entire English language.
So how good a mimic is the machine? Jane Morgan, one of Lomax's group of singers, says that while she is impressed with how well the synthesiser reproduces individual sounds that she has sung, the transition from one sound to the next is still not right. "It is not possible yet to imitate a voice singing a whole song without it sounding rather artificial," says Morgan. This is true, agrees Lomax. But to be fair, "some of the sounds are completely convincing", he says. Lomax sees smoothing the transition between sound units as the biggest challenge in his work. And he points to research going on elsewhere that may one day help to solve these problems.
The meaning of a word, spoken or sung, can change depending on the way it is spoken. The word "alright", for example, can be a statement or a question. The difference between the two depends on how energy is distributed between the pairs of sounds, or diphones, within the word. At the Institute for Research and Coordination in Acoustics and Music (IRCAM) in Paris, where Lomax used to study, Xavier Rodet has modelled the way energy is distributed between diphones. This work could help Lomax to smooth the transitions between his sound units.
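Rodet's diphone modelling is far more sophisticated than anything that fits here, but the basic idea of spreading energy across the join, rather than simply butting two units together, can be caricatured with an equal-power crossfade:

```python
# A much-simplified illustration of smoothing the join between two sound units;
# real diphone modelling redistributes energy in a far more principled way.
import numpy as np

def join_units(a, b, sample_rate=44100, overlap_ms=60):
    n = min(int(sample_rate * overlap_ms / 1000), len(a), len(b))
    theta = np.linspace(0, np.pi / 2, n)
    fade_out, fade_in = np.cos(theta), np.sin(theta)   # cos^2 + sin^2 = 1, so power stays constant
    middle = a[-n:] * fade_out + b[:n] * fade_in
    return np.concatenate([a[:-n], middle, b[n:]])
```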
Going with the flow
And at Princeton University, New Jersey, computer scientist Perry Raymond Cook is also working on this problem, but from a completely different angle. Cook argues that, when it comes to singing, the flow of the voice is more important than accurate articulation of words. The impact of singing often has more to do with how the performer emphasises words than with the words themselves. People can tell whether an aria is sad or happy, for example, without understanding the words.
In the early 1990s, while at Stanford University's Center for Computer Research in Music and Acoustics, Cook built a computerised vocal tract in an attempt to copy what goes on inside humans when they sing. Using X-ray photographs, endoscopes and masks that measure the flow of air leaving the mouth, several people have built up profiles of how air flow, the shape of the vocal tract and the vibrations of the vocal cords change when people sing. Cook built a computer simulation of all these processes.
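Cook's physical model of the vocal tract is well beyond a short sketch, but the general source-filter idea behind such models can be caricatured as a buzzing glottal source pushed through a few resonant filters standing in for the tract. The pitch, formant frequencies and filter design below are arbitrary choices of mine, not Cook's.

```python
# A toy source-filter voice, not Cook's waveguide model: a crude glottal pulse
# train shaped by a handful of resonators that stand in for tract formants.
import numpy as np
from scipy.signal import lfilter

def toy_voice(pitch_hz=140.0, formants_hz=(700, 1200, 2600), sample_rate=44100, seconds=1.0):
    n = int(sample_rate * seconds)
    t = np.arange(n) / sample_rate
    source = 2.0 * (t * pitch_hz % 1.0) - 1.0          # sawtooth-like glottal source
    out = source
    for f in formants_hz:
        r = 0.995                                      # pole radius: narrow resonance
        b = [1.0 - r]
        a = [1.0, -2.0 * r * np.cos(2 * np.pi * f / sample_rate), r * r]
        out = lfilter(b, a, out)                       # one two-pole resonator per formant
    return out / (np.max(np.abs(out)) + 1e-9)
```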
Cook aimed to make a machine that could be used as a voice coach. And by focusing particularly on fine pitch control, he tried to smooth the transitions between sounds, giving a more human feel to the synthesised voices.
Whether or not other researchers crack the problem of making transitions between sounds more natural, Lomax is confident that within five years a natural-sounding, singing synthesiser will arrive. Singing voices, he reckons, will be recreated and mixed along with the harpsichord, electric bass and screeching siren. "A singing synthesiser will not dominate the musical market, but it will provide another avenue for artistic endeavour," he says. By contrast, Neal Tomlinson, who works for the synthesiser manufacturer Roland, is more cautious. He questions whether anyone will want to pay for a singing synthesiser when they already have a voice they can use for free.
Will the synthesised voice be a threat to real singers? "Not in the immediate future," says Lomax, who enjoys singing in barbershop quartets and agrees with his singers that there is something uniquely human about singing which cannot yet be captured digitally. The future is another matter, however. "Just watch this space," he says.
Claire Neesham