IMPROVING INDONESIAN SPEECH EMOTION CLASSIFICATION USING MFCC AND BILSTM WITH AUDIO AUGMENTATION

Muhammad Septiyanto, Eko Budi Susanto, Devi Sugianti

Abstract


Emotion classification from speech has become an important technology in the modern artificial intelligence era. However, research on the Indonesian language is still limited, and existing methods rely predominantly on conventional machine learning approaches that achieve a maximum accuracy of only 90%. These traditional methods struggle to capture the complex temporal dependencies and bidirectional contextual patterns inherent in emotional speech, particularly given Indonesian prosodic characteristics. To address this limitation, this study combines Mel-Frequency Cepstral Coefficients (MFCC) feature extraction with a Bidirectional Long Short-Term Memory (BiLSTM) model and audio augmentation techniques for Indonesian speech emotion classification. The IndoWaveSentiment dataset contains 300 audio recordings from 10 respondents covering five emotion classes: neutral, happy, surprised, disgusted, and disappointed. Five audio augmentation methods, applied at a 2:1 augmented-to-original ratio, expanded the dataset to 900 samples. MFCC feature extraction produced 40 coefficients per frame, which were processed by a BiLSTM architecture with two bidirectional layers (256 and 128 units). The model was trained with the Adam optimizer and early stopping. The best model achieved an accuracy of 93.33%, with a precision of 93.7%, recall of 93.3%, and F1-score of 93.3%. The "surprised" emotion was recognized perfectly (100%), while "happy" had the lowest accuracy (88.89%). These results surpass previous benchmarks on the same dataset obtained with Random Forest (90%) and Gradient Boosting (85%). The study demonstrates the effectiveness of combining MFCC, BiLSTM, and audio augmentation in capturing the characteristics of Indonesian emotional speech for the development of voice-based emotion recognition systems.
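
To make the pipeline concrete, the following sketch illustrates the preprocessing stage described above: each original recording receives two augmented copies (the 2:1 augmented-to-original ratio that expands 300 recordings to 900 samples), and 40 MFCC coefficients are extracted per frame. The paper's code is not published and the abstract does not name the five augmentation methods, so the three shown here (noise injection, time stretching, pitch shifting), the file name, and all parameter values are illustrative assumptions.

import numpy as np
import librosa

def augment(y, sr, rng):
    # Return one randomly augmented copy of waveform y.
    # The specific methods and parameter ranges are assumptions.
    choice = rng.integers(3)
    if choice == 0:
        return y + 0.005 * rng.standard_normal(len(y))                 # additive noise
    if choice == 1:
        return librosa.effects.time_stretch(y, rate=rng.uniform(0.9, 1.1))
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=int(rng.integers(-2, 3)))

def extract_mfcc(y, sr, n_mfcc=40):
    # 40 coefficients per frame, transposed to (time, features) for the BiLSTM.
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

rng = np.random.default_rng(0)
y, sr = librosa.load("recording.wav", sr=None)                         # hypothetical file
# 2:1 ratio: keep the original and add two augmented copies per recording.
samples = [extract_mfcc(y, sr)] + [extract_mfcc(augment(y, sr, rng), sr) for _ in range(2)]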
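
A minimal Keras sketch of the classifier itself is given below, matching the stated architecture: two bidirectional LSTM layers of 256 and 128 units, a softmax output over the five emotion classes, the Adam optimizer, and early stopping. The input sequence length, loss function, patience, and training schedule are not stated in the abstract and are assumptions.

from tensorflow.keras import layers, models, callbacks

def build_bilstm(n_timesteps, n_features=40, n_classes=5):
    model = models.Sequential([
        layers.Input(shape=(n_timesteps, n_features)),
        layers.Bidirectional(layers.LSTM(256, return_sequences=True)),  # first BiLSTM layer (256 units)
        layers.Bidirectional(layers.LSTM(128)),                         # second BiLSTM layer (128 units)
        layers.Dense(n_classes, activation="softmax"),                  # five emotion classes
    ])
    model.compile(optimizer="adam",                                     # Adam optimizer
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_bilstm(n_timesteps=100)                                   # assumed padded sequence length
stop = callbacks.EarlyStopping(monitor="val_loss", patience=10,         # assumed patience
                               restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=100, callbacks=[stop])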


DOI: https://doi.org/10.33387/jiko.v8i3.10820
