Align or attend?
Toward More Efficient and Accurate Spoken Word Discovery Using Speech-to-Image Retrieval
Abstract
Multimodal word discovery (MWD) is often treated as a byproduct of the speech-to-image retrieval problem. However, our theoretical analysis shows that some form of alignment or attention mechanism is crucial for an MWD system to learn meaningful word-level representations. We verify our theory with retrieval and word discovery experiments on MSCOCO and Flickr8k, and empirically demonstrate that both neural machine translation (MT) with self-attention and statistical MT achieve word discovery scores superior to those of a state-of-the-art neural retrieval system, outperforming it by 2% and 5% in alignment F1 score, respectively.
Files
09414418.pdf
(pdf | 3.17 MB)
Unknown license