Generating Images from Spoken Descriptions


Abstract

Text-based technologies, such as text translation from one language to another and image captioning, are gaining popularity. However, approximately half of the world's languages are estimated to lack a commonly used written form, so these languages cannot benefit from text-based technologies. This paper presents a new speech technology task, speech-to-image generation (S2IG), in which spoken descriptions are translated into photo-realistic images without using any text information, thus allowing unwritten languages to potentially benefit from this technology. The proposed speech-to-image framework, referred to as S2IGAN, consists of a speech embedding network and a relation-supervised densely-stacked generative model. The speech embedding network learns speech embeddings under the supervision of the corresponding visual information from images. The relation-supervised densely-stacked generative model, conditioned on the speech embeddings produced by the speech embedding network, synthesizes images that are semantically consistent with the corresponding spoken descriptions. Extensive experiments are conducted on four public benchmark databases: two databases commonly used in text-to-image generation, i.e., CUB-200 and Oxford-102, for which we created synthesized speech descriptions, and two databases with natural speech descriptions often used in cross-modal learning of speech and images, i.e., Flickr8k and Places. Results on these databases demonstrate the effectiveness of the proposed S2IGAN in synthesizing high-quality, semantically consistent images from the speech signal, yielding good performance and a solid baseline for the S2IG task.
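The abstract gives no implementation details, but the two-component design it describes (a speech embedding network feeding a conditional image generator) can be illustrated with a minimal PyTorch sketch. Everything below is an illustrative assumption: the module names (SpeechEncoder, ConditionalGenerator), layer sizes, and the single-stage generator are not the paper's architecture, which uses a relation-supervised, densely-stacked generative model.

```python
# Minimal sketch (assumed architecture, not S2IGAN itself): a speech encoder
# maps a spectrogram to a fixed-size embedding; a conditional generator
# synthesizes an image from noise concatenated with that embedding.
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Maps a sequence of spectrogram frames to a fixed-size speech embedding."""
    def __init__(self, n_mels=40, hidden=256, embed_dim=512):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, embed_dim)

    def forward(self, spec):             # spec: (batch, frames, n_mels)
        _, h = self.rnn(spec)            # h: (2, batch, hidden)
        h = torch.cat([h[0], h[1]], dim=1)
        return self.proj(h)              # (batch, embed_dim)

class ConditionalGenerator(nn.Module):
    """Synthesizes a 64x64 RGB image conditioned on a speech embedding."""
    def __init__(self, embed_dim=512, noise_dim=100, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(embed_dim + noise_dim, ch * 8, 4, 1, 0),
            nn.BatchNorm2d(ch * 8), nn.ReLU(True),
            nn.ConvTranspose2d(ch * 8, ch * 4, 4, 2, 1),
            nn.BatchNorm2d(ch * 4), nn.ReLU(True),
            nn.ConvTranspose2d(ch * 4, ch * 2, 4, 2, 1),
            nn.BatchNorm2d(ch * 2), nn.ReLU(True),
            nn.ConvTranspose2d(ch * 2, ch, 4, 2, 1),
            nn.BatchNorm2d(ch), nn.ReLU(True),
            nn.ConvTranspose2d(ch, 3, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, speech_emb, noise):
        cond = torch.cat([speech_emb, noise], dim=1)[:, :, None, None]
        return self.net(cond)            # (batch, 3, 64, 64)

# Usage: embed a (dummy) spoken description and generate an image from it.
encoder, generator = SpeechEncoder(), ConditionalGenerator()
spec = torch.randn(2, 300, 40)           # 2 spectrograms of 300 frames
emb = encoder(spec)
img = generator(emb, torch.randn(2, 100))
print(img.shape)                          # torch.Size([2, 3, 64, 64])
```

In the paper's framework, the speech encoder would additionally be trained with visual supervision (matching speech embeddings to the corresponding images), and the generator would be stacked to produce progressively higher-resolution outputs; this sketch only shows the conditioning pathway from speech embedding to image.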

Files

Manuscript.pdf (pdf, 2.43 MB, unknown license)
09333641.pdf (pdf, 4.99 MB, unknown license)