Masato Akagi
Japan Advanced Institute of Science and Technology
Intelligibility (communication) · Pattern recognition · Acoustics · Speech recognition · Computer science
291 Publications
15 H-index
1,023 Citations
Publications 307
#1 Feng Li (Japan Advanced Institute of Science and Technology), H-Index: 2
#2 Masato Akagi (Japan Advanced Institute of Science and Technology), H-Index: 15
Abstract: Separating the singing voice from a musical mixture remains an important task in the field of music information retrieval. Recent studies on singing voice separation have shown that a robust principal component analysis (RPCA) approach with a rank-1 constraint can improve separation quality. However, separation performance is limited because the vocal part cannot be described well by the separated matrix. Therefore, prior information such as fundamental frequency (F0) should be consider...
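The plain RPCA decomposition that rank-1-constrained and F0-informed variants build on can be sketched as below: a minimal principal-component-pursuit implementation via inexact ALM applied to a magnitude spectrogram. This is not the paper's constrained method, and the parameter defaults are the usual heuristics rather than the authors' settings.

```python
import numpy as np

def rpca(M, lam=None, mu=None, tol=1e-7, max_iter=500):
    """Plain RPCA (principal component pursuit) via inexact ALM.

    Decomposes a magnitude spectrogram M into L (low-rank, repeating
    accompaniment) + S (sparse, dominated by the singing voice).
    """
    m, n = M.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    norm_two = np.linalg.norm(M, ord=2)
    norm_fro = np.linalg.norm(M, ord='fro')
    mu = mu if mu is not None else 1.25 / norm_two
    Y = M / max(norm_two, np.abs(M).max() / lam)   # dual variable init
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    for _ in range(max_iter):
        # Low-rank update: singular value thresholding
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        sig = np.maximum(sig - 1.0 / mu, 0.0)
        L = (U * sig) @ Vt
        # Sparse update: elementwise soft thresholding
        R = M - L + Y / mu
        S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)
        # Dual ascent on the constraint M = L + S
        Y = Y + mu * (M - L - S)
        if np.linalg.norm(M - L - S, ord='fro') / norm_fro < tol:
            break
    return L, S
```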
Apr 11, 2020 in ICASSP (International Conference on Acoustics, Speech, and Signal Processing)
#1 Bagus Tris Atmaja (Japan Advanced Institute of Science and Technology), H-Index: 2
#2 Masato Akagi (Japan Advanced Institute of Science and Technology), H-Index: 15
3 Citations
#1 Bagus Tris Atmaja (Japan Advanced Institute of Science and Technology), H-Index: 2
#2 Masato Akagi (Japan Advanced Institute of Science and Technology), H-Index: 15
Modern deep learning architectures are ordinarily run on high-performance computing facilities due to the large size of the input features and the complexity of the models. This paper proposes a traditional multilayer perceptron (MLP) with deep layers and a small input size to tackle that computational limitation. The results show that our proposed deep MLP outperformed modern deep learning architectures, i.e., LSTM and CNN, with the same number of layers and parameter values. The deep ...
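A deep MLP of the kind described could look roughly like the sketch below (Keras). The layer width, depth, and three-dimensional output (one unit per emotion attribute) are illustrative assumptions, not the paper's exact configuration.

```python
import tensorflow as tf

def build_deep_mlp(input_dim, n_layers=4, width=256, n_outputs=3):
    """Hypothetical deep MLP for dimensional emotion recognition.

    Maps a small fixed-size acoustic feature vector per utterance to
    three continuous emotion attributes.
    """
    inputs = tf.keras.Input(shape=(input_dim,))
    x = inputs
    for _ in range(n_layers):
        x = tf.keras.layers.Dense(width, activation="relu")(x)
    outputs = tf.keras.layers.Dense(n_outputs)(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model
```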
#1 Bagus Tris Atmaja (ITS: Sepuluh Nopember Institute of Technology), H-Index: 2
#2 Masato Akagi, H-Index: 15
In this paper, we evaluate different feature sets, feature types, and classifiers for both song and speech emotion recognition. Three feature sets (GeMAPS, pyAudioAnalysis, and LibROSA), two feature types (low-level descriptors and high-level statistical functions), and four classifiers (multilayer perceptron, LSTM, GRU, and convolutional neural networks) are examined on both song and speech data with the same parameter values. The results show no remarkable difference between song and speech dat...
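As an illustration of the "low-level descriptors plus high-level statistical functions" pipeline, a LibROSA-based extractor might look like the sketch below. The specific descriptors (MFCCs) and statistics (mean and standard deviation over frames) are assumptions for the sketch, not the exact feature set used in the paper.

```python
import numpy as np
import librosa

def utterance_features(path, n_mfcc=20):
    """Low-level descriptors (MFCC frames) summarised by high-level
    statistical functions (mean and std over time) per utterance."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
```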
#1 Bagus Tris Atmaja (Japan Advanced Institute of Science and Technology), H-Index: 2
#2 Masato Akagi (Japan Advanced Institute of Science and Technology), H-Index: 15
The choice of a loss function is a critical part of machine learning. This paper evaluates two loss functions commonly used in regression-based dimensional speech emotion recognition: an error-based and a correlation-based loss function. We found that a correlation-based loss function using the concordance correlation coefficient (CCC) resulted in better performance than an error-based loss function using the mean squared error (MSE), in terms of the averaged CCC score. The results a...
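The concordance correlation coefficient compared here has a closed form, CCC = 2·cov(x, y) / (σ_x² + σ_y² + (μ_x − μ_y)²), and the corresponding loss is simply 1 − CCC. A minimal NumPy version is sketched below; the small epsilon for numerical safety is an added assumption.

```python
import numpy as np

def ccc_loss(y_true, y_pred, eps=1e-8):
    """1 - concordance correlation coefficient (CCC) between two sequences."""
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = ((y_true - mu_t) * (y_pred - mu_p)).mean()
    ccc = 2.0 * cov / (var_t + var_p + (mu_t - mu_p) ** 2 + eps)
    return 1.0 - ccc
```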
1 Citation
#1 Bagus Tris Atmaja (ITS: Sepuluh Nopember Institute of Technology), H-Index: 2
#2 Masato Akagi, H-Index: 15
Silence is a part of human-to-human communication and can be a clue for human emotion perception. For automatic emotion recognition by a computer, it is not clear whether silence is useful for determining human emotion within speech. This paper investigates the effect of using a silence feature in dimensional emotion recognition. As the silence feature is extracted per utterance, we grouped it with high-level statistical functions computed from a set of acoustic features. The r...
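One plausible way to compute a per-utterance silence feature is the proportion of low-energy frames, as sketched below. The RMS-based detection and the relative threshold are assumptions for illustration, not necessarily the definition used in the paper.

```python
import librosa

def silence_ratio(path, rel_threshold=0.05):
    """Fraction of frames whose RMS energy falls below a fraction of the
    utterance's peak RMS: a simple per-utterance silence feature."""
    y, sr = librosa.load(path, sr=None)
    rms = librosa.feature.rms(y=y)[0]          # frame-wise RMS energy
    return float((rms < rel_threshold * rms.max()).mean())
```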
1 Citation
#1 Bagus Tris Atmaja (Japan Advanced Institute of Science and Technology), H-Index: 2
#2 Masato Akagi (Japan Advanced Institute of Science and Technology), H-Index: 15
Due to its ability to accurately predict emotional states using multimodal features, audiovisual emotion recognition has recently gained more interest from researchers. This paper proposes two methods to predict emotional attributes from audio and visual data using multitask learning and a fusion strategy. First, multitask learning is employed by adjusting three parameters, one per attribute, to improve the recognition rate. Second, a multistage fusion is proposed to combine results from various...
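The first method, multitask learning with one weight per emotional attribute, amounts to a weighted sum of per-attribute losses. A sketch reusing the ccc_loss function from the earlier snippet is shown below; the attribute weights are hypothetical hyperparameters, not the values tuned in the paper.

```python
def multitask_ccc_loss(y_true, y_pred, weights=(0.4, 0.3, 0.3)):
    """Weighted sum of per-attribute CCC losses (e.g. valence, arousal,
    dominance); the weights are tunable multitask hyperparameters."""
    return sum(w * ccc_loss(y_true[:, i], y_pred[:, i])
               for i, w in enumerate(weights))
```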
#1 Thuan Van Ngo (Japan Advanced Institute of Science and Technology)
Last: Peter Birkholz (TUD: Dresden University of Technology), H-Index: 12
(3 authors in total)
Abstract: In noisy conditions, speakers involuntarily change their manner of speaking to enhance the intelligibility of their voices. The increased intelligibility of this so-called Lombard speech is enabled by changes in multiple articulatory and acoustic features. While the major features of Lombard speech are well known from previous studies, little is known about their relative contributions to the intelligibility of speech in noise. This study used an analysis-by-synthesis strategy to exp...
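In the analysis-by-synthesis spirit, individual Lombard features can be imposed on plain speech one at a time. The toy example below raises F0 with librosa's pitch shifting as a crude stand-in for one such modification; the shift amount is an arbitrary illustrative value, and real Lombard resynthesis involves several more features such as spectral tilt and duration.

```python
import librosa
import soundfile as sf

def raise_pitch(in_path, out_path, n_steps=2.0):
    """Crude single-feature modification: shift F0 up by n_steps semitones."""
    y, sr = librosa.load(in_path, sr=None)
    y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    sf.write(out_path, y_shifted, sr)
```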
#1 Zhichao Peng, H-Index: 1
#2 Xingfeng Li, H-Index: 1
Last: Masato Akagi, H-Index: 15
(6 authors in total)
Emotion information in speech can effectively help robots understand a speaker's intentions in natural human-robot interaction. The human auditory system can easily track the temporal dynamics of emotion by perceiving the intensity and fundamental frequency of speech, and can focus on the salient emotion regions. Therefore, speech emotion recognition that combines auditory and attention mechanisms may be an effective approach. Some previous studies used auditory-based static features to identify...
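A common way to let a model focus on salient emotion regions over time is attention-weighted pooling of frame-level features. A minimal Keras layer of that kind is sketched below as a generic illustration, not the specific architecture used in the paper.

```python
import tensorflow as tf

class AttentionPooling(tf.keras.layers.Layer):
    """Additive attention pooling over time: weights salient frames."""
    def build(self, input_shape):
        d = int(input_shape[-1])
        self.w = self.add_weight(name="w", shape=(d, 1),
                                 initializer="glorot_uniform")
    def call(self, x):                               # x: (batch, time, feat)
        scores = tf.matmul(tf.tanh(x), self.w)       # (batch, time, 1)
        alpha = tf.nn.softmax(scores, axis=1)        # attention over frames
        return tf.reduce_sum(alpha * x, axis=1)      # (batch, feat)
```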