Open Access Journal

ISSN: 2394-2320 (Online)

International Journal of Engineering Research in Computer Science and Engineering (IJERCSE)

Monthly Journal for Computer Science and Engineering

Speech Separation for Selective Speaker from Reverberant Noisy Environment using DNN Learning and Classification

Authors: Mohammad Hussain K¹, B. Aziz Musthafa²

Date of Publication: 30th November 2021

Abstract: Human beings exchange information by voice with ease in a wide range of situations, including noisy environments and crowds where several speakers are present. It is important both to detect the voice and to understand who is speaking. In daily life, speech does not reach our ears in clean form, yet the human auditory system is remarkably capable of concentrating on the intended speech and distinguishing it from noise. Artificial speech processing systems, by contrast, are typically designed for clean, noise-free speech; they are realized by extracting and classifying voice features, so distinguishing speech from noise is essential. The key problem, known as speech separation or segregation, is to extract the target speech from background interference, which may consist of non-speech noise, competing speech, or both, together with room reverberation. Speech separation has historically been viewed as a signal processing problem, but recent studies formulate it as a supervised learning problem based on deep neural networks (DNNs), which learn the patterns of the target speech, the speakers, and the background noise from training data. This paper summarises the research on supervised speech separation based on deep learning and compares the results with traditional computational auditory scene analysis (CASA). The separation of speech from a reverberant, noisy mixture using DNN-based deep learning is proposed in this paper. CASA builds on the perceptual principles of auditory scene analysis and groups signal components using cues such as pitch and onset. The study shows that the deep neural network (DNN) model improves the accuracy of speech separation and greatly enhances system reliability.
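To make the supervised formulation above concrete, the following is a minimal sketch in PyTorch. It is not the authors' implementation; the network size, feature dimension, and all variable names are illustrative assumptions. It shows DNN-based mask estimation: a feed-forward network learns to predict an ideal ratio mask (IRM) from noisy magnitude spectra, and the estimated mask is applied to the mixture to recover the target speech.

```python
# Minimal sketch (not from the paper): supervised speech separation as
# time-frequency mask estimation with a feed-forward DNN.
import torch
import torch.nn as nn

N_FREQ = 257  # STFT bins for a 512-point FFT (illustrative assumption)

class MaskEstimator(nn.Module):
    """Maps one noisy magnitude-spectrum frame to an IRM estimate in [0, 1]."""
    def __init__(self, n_freq=N_FREQ, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_freq, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_freq), nn.Sigmoid(),  # bound mask to [0, 1]
        )

    def forward(self, noisy_mag):
        return self.net(noisy_mag)

def ideal_ratio_mask(clean_mag, noise_mag, eps=1e-8):
    """Supervised training target: IRM = |S| / (|S| + |N|)."""
    return clean_mag / (clean_mag + noise_mag + eps)

model = MaskEstimator()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Stand-in batch of 32 frames; real features would come from STFTs of
# clean speech and recorded noise under an additive-mixture assumption.
clean = torch.rand(32, N_FREQ)
noise = torch.rand(32, N_FREQ)
noisy = clean + noise

# One training step: fit the estimated mask to the ideal ratio mask.
mask_hat = model(noisy)
loss = loss_fn(mask_hat, ideal_ratio_mask(clean, noise))
opt.zero_grad()
loss.backward()
opt.step()

# At test time, apply the estimated mask to the noisy spectrogram.
separated = mask_hat * noisy
```

Bounding the mask to [0, 1] with a sigmoid mirrors the time-frequency masking idea from CASA; in practice the masked spectrogram would be combined with the mixture phase and inverted with an iSTFT to obtain the separated waveform.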

