
Template based techniques for automatic segmentation of TTS unit database

Published on Mar 1, 2016 in ICASSP (International Conference on Acoustics, Speech, and Signal Processing)
· DOI: 10.1109/ICASSP.2016.7472750
S. Adithya (UCSD: University of California, San Diego), estimated H-index: 1
Sunil Rao (ASU: Arizona State University), estimated H-index: 4
+ 3 authors
V. Ramasubramanian (PES University), estimated H-index: 11
Abstract
We address the problem of automatic segmentation of the unit database in unit-selection based TTS. We propose template-based forced-alignment segmentation in the one-pass dynamic programming (DP) framework, with several variants: (i) a multi-template representation derived by the modified K-means (MKM) algorithm; (ii) context-independent and context-dependent templates for a reduced multi-template representation; and (iii) the segmental K-means algorithm with MKM modeling of phone classes, as a template-based equivalent of the conventional embedded re-estimation procedure for HMM-based modeling and segmentation that is typical for deriving unit databases for TTS (e.g. EHMM in Festival). We first benchmark the proposed segmentation framework on the TIMIT database for phonetic segmentation, since TIMIT provides ground-truth phonetic labeling. We then apply the proposed template-based segmentation algorithms to syllabic Indian-language TTS and evaluate them using objective measures based on spectral distortions (SD) obtained on time-aligned speech utterances. We compare the results with other recent segmentation approaches, namely the group-delay (GD) based semiautomatic method, the Hybrid method, EHMM, HMM and SKM-HMM, and show that the proposed template-based approaches offer comparable and better spectral distortions, validating their ability to provide accurate high-resolution segmentation of the unit database.
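The general idea of template-based forced alignment can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single template per unit (the paper uses multi-template MKM representations), Euclidean frame distances, and simple stay / advance / skip-one local moves. The function name `forced_align` and all parameters are hypothetical and chosen only for illustration.

```python
import numpy as np

def forced_align(utt, templates):
    """Align utt (T x D frames) to a known sequence of unit templates.

    templates: list of (L_k x D) arrays, in transcription order.
    Returns (start frame of each unit in utt, total path cost).
    Illustrative assumption: one template per unit, Euclidean distance.
    """
    ref = np.vstack(templates)                       # concatenated template frames
    lengths = [len(t) for t in templates]
    unit_of = np.repeat(np.arange(len(templates)), lengths)  # state -> unit index
    T, S = len(utt), len(ref)
    # Local distance between every utterance frame and every template frame.
    dist = np.linalg.norm(utt[:, None, :] - ref[None, :, :], axis=2)
    cost = np.full((T, S), np.inf)
    back = np.zeros((T, S), dtype=int)
    cost[0, 0] = dist[0, 0]                          # path must start in first unit
    for t in range(1, T):
        for s in range(S):
            cands = [s]                              # stay on the same template frame
            if s >= 1:
                cands.append(s - 1)                  # advance (may cross into next unit)
            if s >= 2 and unit_of[s - 2] == unit_of[s]:
                cands.append(s - 2)                  # skip one frame inside a template
            prev = min(cands, key=lambda p: cost[t - 1, p])
            if np.isfinite(cost[t - 1, prev]):
                cost[t, s] = cost[t - 1, prev] + dist[t, s]
                back[t, s] = prev
    if not np.isfinite(cost[T - 1, S - 1]):
        raise ValueError("utterance too short for the given template sequence")
    # Backtrack from the last frame of the last template.
    path = np.empty(T, dtype=int)
    s = S - 1
    for t in range(T - 1, -1, -1):
        path[t] = s
        s = back[t, s]
    units = unit_of[path]
    boundaries = [0] + [t for t in range(1, T) if units[t] != units[t - 1]]
    return boundaries, float(cost[T - 1, S - 1])

# Toy usage: two synthetic "units" with clearly different feature values.
tmpl_a = np.zeros((3, 2))
tmpl_b = np.full((3, 2), 5.0)
utterance = np.vstack([np.zeros((4, 2)), np.full((5, 2), 5.0)])
boundaries, total_cost = forced_align(utterance, [tmpl_a, tmpl_b])
```

In this toy case the returned boundaries mark the frame at which each unit begins. A multi-template variant would additionally let the path choose among several templates on entering each unit.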