
Audio Source Separation via Multi-Scale Learning with Dilated Dense U-Nets.

Published on Jan 1, 2019 in arXiv: Learning
Vivek Sivaraman Narayanaswamy (ASU: Arizona State University), Sameeksha Katoch (ASU: Arizona State University), + 2 authors, Andreas Spanias (ASU: Arizona State University)
Abstract
Modern audio source separation techniques rely on optimizing sequence model architectures such as 1D-CNNs on mixture recordings to generalize well to unseen mixtures. Specifically, recent focus is on time-domain architectures such as Wave-U-Net, which exploit temporal context by extracting multi-scale features. However, the optimality of the feature extraction process in these architectures has not been well investigated. In this paper, we examine and recommend critical architectural changes that forge an optimal multi-scale feature extraction process. To this end, we replace regular 1D convolutions with adaptive dilated convolutions, which have the innate capability of capturing increased context through large temporal receptive fields. We also investigate the impact of dense connections on the extraction process; these encourage feature reuse and better gradient flow. The dense connections between the downsampling and upsampling paths of a U-Net architecture capture multi-resolution information, leading to improved temporal modelling. We evaluate the proposed approaches on the MUSDB test dataset. In addition to providing improved performance over the state of the art, we also provide insights into the impact of different architectural choices on complex data-driven solutions for source separation.
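The key mechanism the abstract relies on — that dilating a kernel's taps widens its temporal receptive field without adding parameters — can be illustrated with a minimal pure-Python sketch. This is an illustrative toy, not the paper's implementation; the function name and toy signal are hypothetical.

```python
def dilated_conv1d(x, w, dilation=1):
    """Valid-mode 1D convolution with dilated (atrous) taps.

    The k kernel taps are spaced `dilation` samples apart, so the kernel
    spans (k - 1) * dilation + 1 input samples while keeping only k weights.
    """
    k = len(w)
    span = (k - 1) * dilation + 1          # temporal extent of the kernel
    return [
        sum(w[j] * x[i + j * dilation] for j in range(k))
        for i in range(len(x) - span + 1)
    ]

x = [float(i) for i in range(10)]          # toy "waveform"
w = [1.0, 1.0, 1.0]                        # 3-tap kernel

y1 = dilated_conv1d(x, w, dilation=1)      # receptive field: 3 samples
y2 = dilated_conv1d(x, w, dilation=4)      # receptive field: 9 samples
print(len(y1), len(y2))                    # 8 2
```

With dilation 4, the same 3-tap kernel aggregates samples 9 positions apart, which is how stacked dilated layers capture long temporal context cheaply.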
  • References (17)
  • Citations (0)
References (17)
Published on Jun 1, 2018 in CVPR (Computer Vision and Pattern Recognition)
Victor S. Lempitsky, Andrea Vedaldi, Dmitry Ulyanov
Deep convolutional networks have become a popular tool for image generation and restoration. Generally, their excellent performance is imputed to their ability to learn realistic image priors from a large number of example images. In this paper, we show that, on the contrary, the structure of a generator network is sufficient to capture a great deal of low-level image statistics prior to any learning. In order to do so, we show that a randomly-initialized neural network can be used as a handcraf...
Yi Luo (Columbia University), Nima Mesgarani (Columbia University)
Robust speech processing in multi-talker environments requires effective speech separation. Recent deep learning systems have made significant progress toward solving this problem, yet it remains challenging particularly in real-time, short latency applications. Most methods attempt to construct a mask for each source in time-frequency representation of the mixture signal which is not necessarily an optimal representation for speech separation. In addition, time-frequency decomposition results i...
Published on Oct 27, 2017 in ISMIR (International Symposium/Conference on Music Information Retrieval)
Andreas Jansson, Eric J. Humphrey (NYU: New York University), + 3 authors, Tillman Weyde (City University London)
The decomposition of a music audio signal into its vocal and backing track components is analogous to image-to-image translation, where a mixed spectrogram is transformed into its constituent sources. We propose a novel application of the U-Net architecture — initially developed for medical imaging — for the task of source separation, given its proven capacity for recreating the fine, low-level detail required for high-quality audio reproduction. Through both quantitative evaluation and subjecti...
Published on Jul 1, 2017 in CVPR (Computer Vision and Pattern Recognition)
Gao Huang (THU: Tsinghua University), Zhuang Liu (THU: Tsinghua University), + 1 author, Kilian Q. Weinberger (Cornell University)
Recent work has shown that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections—one between each layer a...
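The dense connectivity pattern this entry describes — each layer consuming the concatenated outputs of all earlier layers, giving L(L+1)/2 direct connections for L layers — can be sketched with toy stand-in layers. The helper names below are hypothetical, for illustration only.

```python
def dense_block(x, layers):
    """Dense connectivity: every layer consumes the concatenation of the
    block input and all preceding layers' outputs (DenseNet-style)."""
    features = [x]
    for layer in layers:
        features.append(layer(features))
    return features

# Toy layers that record their fan-in (how many feature maps they see).
fan_in = []
def make_layer():
    def layer(feats):
        fan_in.append(len(feats))
        return ["feat"]            # stand-in for a new feature map
    return layer

dense_block(["input"], [make_layer() for _ in range(4)])
print(fan_in)                      # [1, 2, 3, 4]
print(sum(fan_in))                 # 10 = 4 * 5 / 2 direct connections
```

The growing fan-in is what gives every layer direct access to earlier features and a short gradient path, the "feature reuse and better gradient flow" the paper's abstract invokes.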
Published on Mar 1, 2017 in ICASSP (International Conference on Acoustics, Speech, and Signal Processing)
Yi Luo (Columbia University), Zhuo Chen (Columbia University), + 2 authors, Nima Mesgarani (Columbia University)
Deep clustering is the first method to handle general audio separation scenarios with multiple sources of the same type and an arbitrary number of sources, performing impressively in speaker-independent speech separation tasks. However, little is known about its effectiveness in other challenging situations such as music source separation. Contrary to conventional networks that directly estimate the source signals, deep clustering generates an embedding for each time-frequency bin, and separates...
Published on Mar 1, 2017 in ICASSP (International Conference on Acoustics, Speech, and Signal Processing)
Stefan Uhlich, Marcello Porcu, + 4 authors, Yuki Mitsufuji (SONY: Sony Broadcast & Professional Research Laboratories)
This paper deals with the separation of music into individual instrument tracks, which is known to be a challenging problem. We describe two different deep neural network architectures for this task, a feed-forward and a recurrent one, and show that each of them yields state-of-the-art results on the SiSEC DSD100 dataset. For the recurrent network, we use data augmentation during training and show that even simple separation networks are prone to overfitting if no data augmentation is ...
Published on Feb 21, 2017
Antoine Liutkus (Inria: French Institute for Research in Computer Science and Automation), Fabian-Robert Stöter, + 5 authors, Julie Fontecave (CNRS: Centre national de la recherche scientifique)
In this paper, we report the results of the 2016 community-based Signal Separation Evaluation Campaign (SiSEC 2016). This edition comprises four tasks. Three focus on the separation of speech and music audio recordings, while one concerns biomedical signals. We summarize these tasks and the performance of the submitted systems, as well as provide a small discussion concerning future trends of SiSEC.
Published in arXiv: Sound
Aäron van den Oord, Sander Dieleman, + 3 authors, Koray Kavukcuoglu
This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the ...
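The claim that such models cover "tens of thousands of samples" of context follows from a simple receptive-field calculation for a stack of dilated convolutions: each layer adds (kernel_size - 1) * dilation samples. A small sketch, assuming a WaveNet-style schedule where the dilation doubles per layer within repeated blocks (the exact schedule here is illustrative, not the paper's configuration):

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of a stack of dilated convolutions:
    each layer with dilation d adds (kernel_size - 1) * d of context."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# Dilation doubles each layer (1, 2, ..., 512), block repeated 3 times.
dilations = [2 ** i for i in range(10)] * 3
print(receptive_field(2, dilations))   # 3070 samples per stack
```

At 16 kHz that single stack already spans roughly 0.19 s; deeper or wider stacks scale the context linearly in the number of layers while parameters stay modest.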
Cited By (0)