World's largest searchable audio database soon

A new database enabling visitors to search for spoken English ranging from people talking to their dogs to ministers addressing their congregation is being created at the Oxford University Phonetics Laboratory.

Professor John Coleman and his team are one of four teams to win the 'Digging into Data' competition, set up to encourage imaginative, forward-thinking research using large-scale computing in the humanities.

The resulting database will contain a year's worth of spoken English, and the project, 'Mining a Year of Speech', will create the world's largest searchable database of spoken English sound recordings.

It will be a useful resource for anyone interested in spoken English: not just phoneticians and linguists, but also teachers of English language, social historians, and interested members of the public.

Professor Coleman said: "In a world where there's more multimedia than text, audio searching is becoming a vital technology: even Google is moving into it now. We will provide the data so that it is searchable, but we can't even begin to imagine the full range of questions about language that people will want to use it for."

The team will work in partnership with Lou Burnard (Oxford University Computing Services), Professor Mark Liberman and colleagues from the University of Pennsylvania, and the British Library Sound Archive.

While the American side of the partnership will work on sound recordings in the Linguistic Data Consortium at Penn, the English team will prepare and release the four-million-word spoken part of the British National Corpus (BNC), the largest set of recordings of 'language in the wild' ever made.

Although the BNC was transcribed and published electronically many years ago, the speech recordings that accompany it have not previously been released, apart from a small sample.

It is almost unique among speech archives in that it captured huge quantities of unscripted, colloquial speech recorded by hundreds of volunteers across the country as they went about their daily lives. Professor Coleman said: "If the word 'phonetics' makes you think of elocution teachers, then think again.

"For at least a century, the scientific study of speech has sat right on the borderline of the arts and sciences, and our team is no stranger to developing cutting-edge computational technology for the analysis of spoken language."

If this collection of sound recordings were played end to end, finding a particular passage could take over a year of continuous listening.

Professor Coleman and Professor Liberman's project will use a variant of automatic speech recognition technology to label every word and every vowel and consonant in the recordings, and will build a demonstration search engine so that enquirers can rapidly find examples of the bits of spoken English they are looking for.

For example, someone interested in history might want to ask for the recording where George Bush said 'Read my lips', someone learning English might want to hear how 'misled' is pronounced, or an English pronunciation specialist might be interested in how many people pronounce 'schism' with an initial 's', and how many with an initial 'sh'.
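As a rough illustration of the kinds of query described above, the following Python sketch searches a small table of word labels. It is not the project's search engine: the `WordLabel` record, its fields, the phone symbols and every entry in `LABELS` are invented for this example, on the assumption that each labelled word carries a recording identifier, start and end times, and a sequence of vowel and consonant labels.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class WordLabel:
    """One labelled word in a recording (hypothetical layout)."""
    recording: str  # made-up recording identifier
    start: float    # start time in seconds
    end: float      # end time in seconds
    word: str       # word as it appears in the transcript
    phones: tuple   # vowel and consonant labels, in order


# Invented toy data, not taken from any real corpus.
LABELS = [
    WordLabel("rec_001", 12.4, 12.9, "schism", ("s", "I", "z", "@", "m")),
    WordLabel("rec_002", 3.1, 3.6, "schism", ("sh", "I", "z", "@", "m")),
    WordLabel("rec_003", 47.0, 47.5, "schism", ("s", "I", "z", "@", "m")),
    WordLabel("rec_003", 51.2, 51.8, "misled", ("m", "I", "s", "l", "E", "d")),
]


def find_word(labels, word):
    """Return every labelled occurrence of a word, ignoring case."""
    target = word.lower()
    return [label for label in labels if label.word.lower() == target]


def initial_sound_counts(labels, word):
    """Count how often each initial sound is used for a word."""
    return Counter(label.phones[0] for label in find_word(labels, word))


if __name__ == "__main__":
    # Where can 'misled' be heard? Print the recording and time span.
    for hit in find_word(LABELS, "misled"):
        print(hit.recording, hit.start, hit.end)
    # How many speakers begin 'schism' with 's', and how many with 'sh'?
    print(initial_sound_counts(LABELS, "schism"))
```

Because each label records where the word starts and ends, a search result can point the listener straight to the right moment in the right recording.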

It takes a trained phonetician around 10 minutes to label each word in one minute of speech, and about 100 minutes to label every vowel and consonant, so it would take over 100 years' work to label such a vast database by hand.
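The arithmetic behind that estimate can be checked in a few lines of Python. This is a back-of-the-envelope sketch that assumes exactly one year of audio and a single labeller working around the clock without breaks; the two labelling rates are the ones quoted above, and the function and constant names are chosen here for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

# Minutes of labelling work needed per minute of speech (rates quoted above).
WORD_LABEL_RATIO = 10    # labelling each word
PHONE_LABEL_RATIO = 100  # labelling every vowel and consonant


def labelling_years(audio_minutes, ratio):
    """Years of non-stop work needed to label the given amount of audio."""
    return audio_minutes * ratio / MINUTES_PER_YEAR


word_years = labelling_years(MINUTES_PER_YEAR, WORD_LABEL_RATIO)    # 10.0
phone_years = labelling_years(MINUTES_PER_YEAR, PHONE_LABEL_RATIO)  # 100.0

if __name__ == "__main__":
    print(f"Words only: {word_years:.0f} years")
    print(f"Vowels and consonants: {phone_years:.0f} years")
    print(f"Both passes: {word_years + phone_years:.0f} years")
```

Under these assumptions the two passes come to 110 years of uninterrupted work, in line with the figure of over 100 years. A labeller keeping ordinary working hours would need several times longer.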

But as all of the material in the database already has a corresponding written transcript, Professor Coleman's and Professor Liberman's teams will use speech recognition technology to index the 'year of speech' in under 15 months.