AI, NLP

Combining multiple modalities of inputs in Neural Networks


Recently, we have been trying to identify the emotion or sentiment in phone conversations. The first approach we tried a while back was the Universal Sentence Encoder, which transforms a sentence into a fixed-length embedding of 512 numbers. But that approach ignores the acoustic features altogether. With it we are getting about 0.70 test accuracy on roughly 2,000 sentences/utterances, and we are hoping to do better.

So we started combining the text embedding with acoustic features such as MFCCs and many others offered by librosa. I used an LSTM, with the features of each window as one time step, over however many windows an utterance has. Of course, I had to pad every utterance to the maximum number of windows (see the feature-extraction sketch below). I even tried including all the features librosa offers, but could not push the accuracy beyond 0.73.

After the concatenation layer that joins the text and audio signals, I played around with both LSTM and Dense layers, and plain Dense layers turned out to be more accurate (a rough sketch of the fusion model follows the feature-extraction code). I suspect there just aren't enough samples, especially considering the LSTM for the acoustic signal. So I will try to get more labeled audio data to improve accuracy.
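To make the acoustic side concrete, here is a minimal sketch of how frame-level features could be extracted and padded. The sample rate, the number of MFCC coefficients, and the `MAX_FRAMES` cap are illustrative assumptions, not the exact values from our pipeline.

```python
import numpy as np
import librosa
from tensorflow.keras.preprocessing.sequence import pad_sequences

N_MFCC = 13        # assumed number of MFCC coefficients per window
MAX_FRAMES = 300   # assumed cap; in practice, the longest utterance in the data

def utterance_features(wav_path, sr=16000):
    """Return a (n_frames, N_MFCC) matrix: one feature vector per window."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC)  # (N_MFCC, n_frames)
    return mfcc.T                                           # frames become LSTM time steps

def build_audio_tensor(wav_paths):
    """Stack variable-length utterances into one zero-padded 3-D array."""
    frames = [utterance_features(p) for p in wav_paths]
    # Pad (or truncate) every utterance to MAX_FRAMES windows
    return pad_sequences(frames, maxlen=MAX_FRAMES, dtype="float32",
                         padding="post", truncating="post")
```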
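And here is a rough sketch of the fusion model itself: the 512-number sentence embedding on one branch, an LSTM over the padded acoustic windows on the other, a concatenation layer, and Dense layers on top. Layer sizes, dropout, and the number of emotion classes are assumptions for illustration, not the exact configuration we trained.

```python
from tensorflow.keras import layers, Model

NUM_CLASSES = 4    # assumed number of emotion/sentiment labels
N_MFCC = 13
MAX_FRAMES = 300

# Branch 1: pre-computed sentence embedding (512 numbers per utterance)
text_in = layers.Input(shape=(512,), name="use_embedding")

# Branch 2: padded acoustic features, one time step per window
audio_in = layers.Input(shape=(MAX_FRAMES, N_MFCC), name="acoustic_frames")
audio = layers.Masking(mask_value=0.0)(audio_in)   # ignore the zero padding
audio = layers.LSTM(64)(audio)

# Fuse the two modalities, then classify with Dense layers
x = layers.concatenate([text_in, audio])
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.3)(x)
out = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = Model(inputs=[text_in, audio_in], outputs=out)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

The sentence embeddings themselves can be computed once up front (for example with the Universal Sentence Encoder module from TensorFlow Hub) and fed in as a plain 512-wide input, so only the audio branch has to deal with variable-length sequences.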