The unlimited-compute conversational telephone speech (CTS) system, previously known as Switchboard or Hub 5, was similar in structure to the previous system, but utilised improved acoustic and language models and performed automatic segmentation of the audio data.
In future we hope to make many of these available in released versions of HTK.
Groups of clustered segments were then used for MLLR adaptation, and word lattices were generated (using a 4-gram interpolated with a class trigram) with triphone HMMs trained on 70 hours of broadcast news data. This section gives a brief overview of the features of these systems and how they relate to the features present in released versions of HTK.
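MLLR adapts the Gaussian means of the acoustic model with a shared affine transform, mu' = A mu + b, estimated from the adaptation data of each segment cluster. The sketch below is a simplified stand-in: it fits the transform by ordinary least squares rather than by the occupancy-weighted maximum-likelihood estimate MLLR actually uses, and the function names are mine, not HTK's.

```python
import numpy as np

def estimate_mean_transform(means, adapt_means):
    """Least-squares fit of an affine mean transform mu' = A mu + b.

    A simplified stand-in for MLLR, which estimates the same kind of
    transform by occupancy-weighted maximum likelihood.
    means       : (G, d) original Gaussian means
    adapt_means : (G, d) per-Gaussian averages of the adaptation data
    Returns W of shape (d+1, d) acting on extended means [1, mu].
    """
    G, _ = means.shape
    xi = np.hstack([np.ones((G, 1)), means])              # extended means
    W, *_ = np.linalg.lstsq(xi, adapt_means, rcond=None)  # solve xi @ W ~ target
    return W

def apply_transform(W, means):
    """Apply the estimated transform to a set of means."""
    G = means.shape[0]
    xi = np.hstack([np.ones((G, 1)), means])
    return xi @ W
```

With clean affine adaptation data the transform is recovered exactly; in a real system the per-Gaussian statistics are noisy and the ML criterion handles the weighting.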
The system used for the Switchboard part of the April Rich Transcription evaluation used acoustic models trained with Minimum Phone Error (MPE) training. Speech was first segmented using Gaussian mixture models and a phone recogniser. Each of the systems described below represented the state of the art when it was produced: either the lowest error rate in the evaluation, or not a statistically significant difference from the lowest-error-rate system.
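The GMM-based segmentation step can be sketched as frame-level speech/non-speech classification followed by smoothing. This is a hypothetical illustration of the idea only, not the CUED segmenter (which also ran a phone recogniser); it assumes per-frame log-likelihoods under the two GMMs have already been computed.

```python
import numpy as np

def segment_speech(ll_speech, ll_nonspeech, win=5):
    """Return (start, end) frame spans labelled as speech.

    ll_speech, ll_nonspeech : per-frame log-likelihoods under a speech
    GMM and a non-speech GMM. A sliding majority vote removes spurious
    single-frame flips before contiguous runs are collected.
    """
    votes = (ll_speech > ll_nonspeech).astype(int)
    pad = win // 2
    padded = np.pad(votes, pad, mode="edge")
    smooth = np.array([padded[i:i + win].mean() > 0.5
                       for i in range(len(votes))]).astype(int)
    segments, start = [], None
    for i, v in enumerate(smooth):
        if v and start is None:
            start = i
        elif not v and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(smooth)))
    return segments
```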
The broadcast news evaluation (Hub 4) required the transcription of pre-segmented and labelled portions of broadcast news audio.
A faster version of the full system that ran in less than 10 times real time was developed. The NIST March evaluation data included data recorded over conventional telephone lines as well as data from calls over cellular channels.
The broadcast news evaluation (Hub 4) system used MLLR before lattice generation and then rescored the lattices with adapted quinphone models.
The full system included cluster-based variance normalisation, vocal-tract length normalisation (VTLN) and full-variance transforms; none of these are included in released versions of HTK up to 3. For the Hub 1 evaluation a number of other features were used, including maximum likelihood linear regression (MLLR) adaptation and the use of quinphone models in a lattice-rescoring pass with a 65k 4-gram language model.
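VTLN compensates for speaker vocal-tract length by warping the frequency axis, with the warp factor for each speaker chosen by a likelihood search under the current models. A minimal sketch of a piecewise-linear warp and a grid search follows; the knee point, grid, and function names here are illustrative assumptions, not the CUED configuration.

```python
import numpy as np

def warp_frequencies(freqs, alpha, f_cut=0.85):
    """Piecewise-linear VTLN warp of normalised frequencies in [0, 1].

    Below f_cut the axis is scaled by alpha; above it, a linear segment
    maps the remainder so that 1.0 still maps to 1.0.
    """
    freqs = np.asarray(freqs, dtype=float)
    knee = alpha * f_cut
    slope = (1.0 - knee) / (1.0 - f_cut)
    return np.where(freqs <= f_cut,
                    alpha * freqs,
                    knee + slope * (freqs - f_cut))

def pick_warp_factor(score_fn, factors=np.arange(0.88, 1.13, 0.02)):
    """Grid search: return the warp factor whose warped features score
    highest; score_fn stands in for the per-speaker model likelihood."""
    return max(factors, key=score_fn)
```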
The Hub 3 evaluation focussed on large-vocabulary transcription of clean and noisy speech. Both triphone and quinphone HMMs were trained on hours of data and used in a multi-stage recognition process: lattices were first generated with MLLR-adapted triphones and then rescored with adapted quinphones.
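The rescoring stage combines the first pass's acoustic scores with scores from richer models. A cut-down n-best analogue of lattice rescoring is sketched below; the hypothesis format, `lm_scale` value, and function name are illustrative assumptions, and real lattice rescoring works over a word graph rather than a flat list.

```python
def rescore_nbest(hyps, lm_score, lm_scale=12.0):
    """Re-rank first-pass hypotheses with a stronger language model.

    hyps     : list of (words, acoustic_logprob) pairs from a first pass
    lm_score : maps a word sequence to a log probability under the new LM
    lm_scale : weight on the LM score relative to the acoustic score
    Returns the word sequences sorted best-first by combined score.
    """
    scored = [(ac + lm_scale * lm_score(words), words) for words, ac in hyps]
    scored.sort(reverse=True)
    return [words for _, words in scored]
```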
Separate LMs were built for different sources and interpolated to form a single model. A less-than-10xRT CTS system was developed which employed 2-way system combination and lattice-based adaptation. Models for noisy environments were trained using single-pass retraining (present in V2). For a full description and results, see the Rich Transcription workshop presentation. The conversational speech evaluation (Hub 5) required the transcription of telephone conversations.
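Interpolating source-specific LMs amounts to a weighted sum of their word probabilities for each history. A minimal sketch, with made-up weights and toy probabilities (not the CUED models or weights):

```python
def interpolate_lms(models, weights):
    """Linearly interpolate word probabilities from several source LMs.

    models  : dicts mapping word -> P(word | history), one per source,
              all for the same history
    weights : interpolation weights, one per model, summing to 1
    """
    assert abs(sum(weights) - 1.0) < 1e-9
    vocab = set()
    for m in models:
        vocab.update(m)
    return {w: sum(lam * m.get(w, 0.0) for lam, m in zip(weights, models))
            for w in vocab}

# Toy example: a broadcast-news LM mixed with a conversational LM.
bn = {"the": 0.5, "news": 0.3, "uh": 0.2}
cts = {"the": 0.4, "uh": 0.5, "news": 0.1}
mixed = interpolate_lms([bn, cts], [0.7, 0.3])
```

In practice the weights are tuned to minimise perplexity on held-out data from the target domain.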
The broadcast news evaluation (Hub 4) system was an evolution of the previous system; again, combined triphone and quinphone rescoring passes were used.
However, from the sections below it can be seen that there are many other features that have been incorporated into the CUED HTK systems. The front end used was PLP with a mel-spectrum-based filterbank. The major tool currently lacking from the distributed HTK needed to reproduce these systems is a capable large-vocabulary decoder supporting trigrams and 4-grams and cross-word triphones and quinphones. The core training used an extended version of 1.
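A mel-spaced filterbank of the kind such a front end applies to the power spectrum can be sketched as triangular filters spaced evenly on the mel scale (the standard mapping mel(f) = 2595 log10(1 + f/700)). This is a generic illustration, not HTK's or the PLP front end's exact implementation.

```python
import numpy as np

def mel(f):
    """Hertz -> mel."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):
    """Mel -> hertz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters with centres spaced evenly on the mel scale.

    Returns an (n_filters, n_fft // 2 + 1) matrix to be applied to the
    power spectrum of each frame.
    """
    n_bins = n_fft // 2 + 1
    fft_freqs = np.arange(n_bins) * sample_rate / n_fft
    edges = inv_mel(np.linspace(0.0, mel(sample_rate / 2), n_filters + 2))
    fbank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        up = (fft_freqs - lo) / (mid - lo)      # rising edge of triangle
        down = (hi - fft_freqs) / (hi - mid)    # falling edge of triangle
        fbank[i] = np.clip(np.minimum(up, down), 0.0, None)
    return fbank
```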