Supervisor: Prof Tao Xiang
One of the key human abilities that researchers have striven to emulate is the ability to understand the content of images and describe it in language. Solving this problem not only advances fields such as computer vision, multimedia and machine learning, but also enables many applications, including semantic image search and image interpretation for the visually impaired. Existing models are learned from strongly aligned image-text pairs. However, such a fully supervised approach, based on strongly labelled data, suffers from a shortage of annotated data, given the complexity and diversity of unconstrained image and language content. On the other hand, the internet offers an effectively unlimited amount of unlabelled data in both modalities, as well as weakly labelled image-text pairs in which images and text are only loosely aligned. An unrealised vision is thus a unified framework that can seamlessly exploit, at Internet scale, mined image and text data of all forms, ranging from fully labelled to completely unlabelled. To this end, this PhD project aims to develop a novel framework that learns a translation model between images and a given language by leveraging weakly labelled data, as well as fully labelled data whenever it is available. Such a framework is one step closer to the grand challenge of life-long learning, by which a computer can automatically and continuously learn by mining information from big data on the web, growing its understanding in an open-ended way and eventually reproducing the human abilities of visual perception and image-to-language translation.