Template based techniques for automatic segmentation of TTS unit database

Published on Mar 1, 2016 in ICASSP (International Conference on Acoustics, Speech, and Signal Processing)
DOI: 10.1109/ICASSP.2016.7472750
S. Adithya (UCSD: University of California, San Diego), Sunil Rao (ASU: Arizona State University), + 3 authors, V. Ramasubramanian (PES University)
Abstract
We address the problem of automatic segmentation of the unit database in unit-selection based TTS and propose template based forced-alignment segmentation in the one-pass dynamic programming (DP) framework with several variants: i) multi-template representation derived by the modified K-means (MKM) algorithm, ii) context-independent and context-dependent templates for a reduced multi-template representation, iii) a segmental K-means algorithm with MKM modeling of phone classes, as a template-based equivalent of the conventional embedded re-estimation procedure for HMM based modeling and segmentation that is typical for deriving unit databases for TTS (e.g. EHMM in Festival). We first benchmark the performance of the proposed segmentation framework for phonetic segmentation on the TIMIT database, given the availability of phonetic labeling ground truth in TIMIT. We then apply the proposed template based segmentation algorithms to syllabic Indian-language TTS and benchmark the proposed segmentation using objective measures based on spectral distortion (SD) obtained on time-aligned speech utterances. We compare it with other recent segmentation approaches, namely the group-delay (GD) based semi-automatic method, the Hybrid method, EHMM, HMM and SKM-HMM, and show that the proposed template based approaches offer comparable or better spectral distortion, validating their ability to provide accurate, high-resolution segmentation of the unit database.
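The core operation of the proposed approach, forced alignment of a known phone-template sequence against an utterance by dynamic programming, can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes a single feature-matrix template per phone (the MKM variants keep several per phone class), uses plain Euclidean frame distances, and all names are illustrative.

```python
import numpy as np

def forced_align(templates, utterance):
    """Forced alignment of `utterance` (T x D feature frames) against a
    known sequence of phone `templates` (each an L_i x D array) by
    dynamic programming; returns the start frame of each phone."""
    ref = np.vstack(templates)                       # concatenated reference
    owner = np.concatenate(                          # phone index per ref frame
        [np.full(len(t), i) for i, t in enumerate(templates)])
    T, R = len(utterance), len(ref)
    # local Euclidean distances between all utterance/reference frame pairs
    d = np.linalg.norm(utterance[:, None, :] - ref[None, :, :], axis=2)
    # accumulated cost with symmetric DTW steps
    D = np.full((T, R), np.inf)
    D[0, 0] = d[0, 0]
    for t in range(T):
        for r in range(R):
            if t == 0 and r == 0:
                continue
            prev = min(D[t - 1, r] if t > 0 else np.inf,
                       D[t - 1, r - 1] if t > 0 and r > 0 else np.inf,
                       D[t, r - 1] if r > 0 else np.inf)
            D[t, r] = d[t, r] + prev
    # backtrack the optimal warping path
    t, r, path = T - 1, R - 1, [(T - 1, R - 1)]
    while (t, r) != (0, 0):
        cand = [(t - 1, r), (t - 1, r - 1), (t, r - 1)]
        t, r = min(((a, b) for a, b in cand if a >= 0 and b >= 0),
                   key=lambda p: D[p])
        path.append((t, r))
    path.reverse()
    # first utterance frame mapped to each phone = its segment boundary
    starts = {}
    for t, r in path:
        starts.setdefault(int(owner[r]), t)
    return [starts[i] for i in range(len(templates))]
```

On a toy utterance built by concatenating frames near two distinct templates, the recovered boundaries fall at the true concatenation points. The multi-template extension would score each phone against all of its MKM templates and keep the best, and the segmental K-means loop would alternate this alignment step with template re-estimation.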
References (22)
Published on Jul 1, 2015
Sunil Rao (PES University), C. Mahima (PES University), + 3 authors, V. Ramasubramanian (PES University)
We address the problem of TTS speech quality evaluation and propose a double-ended objective measure in the form of average spectral distortion between time-aligned reference and synthesized speech, where the reference signal is made available as the speech of the text input to the TTS spoken by the same speaker as the unit-database. We detail the time-aligned spectral distortion measure calculated via dynamic time-warping and apply this measure for comparison of the effectiveness of 5 different...
Published on Jan 1, 2014 in INTERSPEECH (Conference of the International Speech Communication Association)
S. Aswin Shanmugam (Indian Institute of Technology Madras), Hema A. Murthy (Indian Institute of Technology Madras)
Published on Nov 1, 2013
Hemant A. Patil (Dhirubhai Ambani Institute of Information and Communication Technology), Tanvina B. Patel (Dhirubhai Ambani Institute of Information and Communication Technology), + 28 authors, Veera Raghavendra
In this paper, we discuss a consortium effort on building text-to-speech (TTS) systems for 13 Indian languages. There are about 1652 Indian languages. A unified framework is therefore required for building TTSes for Indian languages. As Indian languages are syllable-timed, a syllable-based framework is developed. As quality of speech synthesis is of paramount interest, unit-selection synthesizers are built. Building TTS systems for low-resource languages requires that the data be caref...
Published on Jan 1, 2013
B. Ramani, S. Lilly Christina, + 10 authors, K. Samudravijaya
Published on Jul 1, 2012
V. Ramasubramanian (Siemens)
In narrow-band speech coding, specifically in the low and ultra-low bit-rate ranges, a series of efficient quantization schemes for the LP parameters, using fixed-length as well as variable-length segment quantization (VLSQ), has resulted in a progressive reduction in the bit-rate from the 2400 bits/sec baseline of the LPC-10 coder down to 300 bits/sec and less. The VLSQ framework forms a generic basis of a class of segment vocoders within which various types of segments/units and unit-modeling have been ...
Published on Jan 1, 2010
Srikanth Cherla (Siemens), V. Ramasubramanian (Siemens)
Published on Apr 1, 2009 in ICASSP (International Conference on Acoustics, Speech, and Signal Processing)
Alan W. Black (CMU: Carnegie Mellon University), John Kominek (CMU: Carnegie Mellon University)
This paper introduces a new optimization technique that moves segment labels (phone and sub-phonetic) to optimize statistical parametric speech synthesis models. The choice of objective measures is investigated thoroughly, and listening tests show the results to significantly improve the quality of the generated speech, equivalent to increasing the database size threefold.
Published on Oct 1, 2008 in ECCV (European Conference on Computer Vision)
Kaustubh Kulkarni (Siemens), Srikanth Cherla (Siemens), + 1 author, V. Ramasubramanian (Siemens)
Several researchers have addressed the problem of human action recognition using a variety of algorithms. An underlying assumption in most of these algorithms is that action boundaries are already known in a test video sequence. In this paper, we propose a fast method for continuous human action recognition in a video sequence. We propose the use of a low-dimensional feature vector which consists of (a) the projections of the width profile of the actor on to a Discrete Cosine Transform (DC...
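The width-profile feature described above can be sketched as follows. This is an illustrative assumption-laden sketch, not the paper's code: the width is taken as the foreground pixel count per row (the paper may measure silhouette extent instead), and the truncation order `k` and the explicit DCT-II basis are illustrative choices.

```python
import numpy as np

def width_profile_features(silhouette, k=8):
    """Low-dimensional frame descriptor in the spirit of the paper:
    the width profile of a binary silhouette (foreground extent per row),
    compressed to its first k DCT-II coefficients."""
    width = silhouette.sum(axis=1).astype(float)   # foreground pixels per row
    n = len(width)
    rows = np.arange(k)[:, None]                   # coefficient indices
    cols = np.arange(n)[None, :]                   # row positions
    basis = np.cos(np.pi * rows * (2 * cols + 1) / (2 * n))  # DCT-II basis
    return basis @ width
```

Keeping only the leading DCT coefficients acts as a smooth low-pass summary of the body shape per frame, which is what makes per-frame matching cheap enough for continuous recognition.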
Published on Jun 1, 2008 in CVPR (Computer Vision and Pattern Recognition)
Srikanth Cherla (Siemens), Kaustubh Kulkarni (Siemens), + 1 author, V. Ramasubramanian (Siemens)
In this paper, we propose a fast method to recognize human actions which accounts for intra-class variability in the way an action is performed. We propose the use of a low-dimensional feature vector which consists of (a) the projections of the width profile of the actor on to an "action basis" and (b) simple spatio-temporal features. The action basis is built using eigenanalysis of walking sequences of different people. Given the limited amount of training data, Dynamic Time Warping (DT...
Published on Mar 1, 2008 in ICASSP (International Conference on Acoustics, Speech, and Signal Processing)
V. Ramasubramanian (Siemens), Kaustubh Kulkarni (Siemens), Bernhard Kaemmerer (Siemens)
We propose a novel framework for continuous speech recognition (CSR) based on non-parametric acoustic modeling using multiple phoneme templates set in a modified one-pass DP decoding algorithm, in contrast to the conventional HMM acoustic models set in Viterbi decoding. We particularly emphasize the 'selectivity' property of templates as set in the proposed modified one-pass DP decoding algorithm and explore various contextual definitions of the templates and their relative performances for a ran...
Cited By (1)
Published on Jul 1, 2015
Sunil Rao (PES University), C. Mahima (PES University), + 3 authors, V. Ramasubramanian (PES University)
We address the problem of TTS speech quality evaluation and propose a double-ended objective measure in the form of average spectral distortion between time-aligned reference and synthesized speech, where the reference signal is made available as the speech of the text input to the TTS spoken by the same speaker as the unit-database. We detail the time-aligned spectral distortion measure calculated via dynamic time-warping and apply this measure for comparison of the effectiveness of 5 different...