Research Themes
My research is in the area of Human Centered Machine Learning, using Machine Learning, Computer Vision and Signal Processing methodologies to learn, from multiple sources, concepts that enable Intelligent Systems to understand, communicate and collaborate with humans. It currently revolves around three themes:
- Learning to recognise behaviour, emotions and cognitive states of people by analysing their images, video and neuro-physiological signals
- Learning across modalities, in particular at the intersection of language and vision, using large, pre-trained language and audio-visual models
- Learning from generative models, and learning to control the generation process for privacy, interpretability and controllability
Multimodal Machine Learning (Vision and Language)
This line of work is concerned with learning across modalities, in particular at the intersection of language and vision, by utilising, fine-tuning and adapting large, pre-trained Language and Vision-Language models.
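As a rough illustration of what such pre-trained Vision-Language models offer out of the box, the sketch below scores a single image against a few textual prompts with the publicly available CLIP model; the checkpoint, image path and prompts are placeholders chosen for illustration, not artefacts of this research.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Off-the-shelf vision-language model; checkpoint name is an example only.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical input image
texts = ["a photo of a dog", "a photo of a cat"]

# Encode image and text jointly and compute image-text similarity scores.
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```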
Affective Computing
This line of research is concerned with the recognition of behaviour, emotions and cognitive states of people by analysing their images, video and neuro-physiological signals. In a recent line of work this extends to the analysis of mental health conditions, such as schizophrenia and depression.
Generation and Learning
This line of research is concerned with learning from generative models and with learning to control generation for privacy, interpretability and controllability. This includes learning representations in the latent space of generative models so as to control local changes, controlling image generation with natural language, and controlling generation so as to anonymise datasets, so that they can be used for training machine learning models in a privacy-preserving manner.
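A minimal sketch of the kind of latent-space control mentioned above: given a pre-trained generator and a semantic direction in its latent space, a local edit amounts to shifting the latent code along that direction. The toy generator, latent dimensionality and random direction below are stand-ins; in practice the generator would be a large pre-trained model and the direction would be learned rather than sampled.

```python
import torch

# Stand-in for a pre-trained generator G: latent code z -> image.
# In practice this would be e.g. a StyleGAN-type model.
class ToyGenerator(torch.nn.Module):
    def __init__(self, latent_dim=512):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(latent_dim, 1024), torch.nn.ReLU(),
            torch.nn.Linear(1024, 3 * 64 * 64), torch.nn.Tanh())

    def forward(self, z):
        return self.net(z).view(-1, 3, 64, 64)

G = ToyGenerator()
z = torch.randn(1, 512)                 # sampled latent code
d = torch.randn(512)                    # assumed semantic direction (learned in practice)
d = d / d.norm()

image = G(z)                            # original sample
edited = G(z + 3.0 * d)                 # locally edited sample along the direction
```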
Learning with few/no/noisy/uncertain/imprecise annotations
This line of research is concerned with learning in the absence of reliable annotations. This includes self-supervised representation learning, unsupervised learning with clustering objectives, and learning with labels of a different granularity than that of the downstream task.
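As a small, generic illustration of training with unreliable labels (a classic bootstrapping heuristic, not the method of the NoiseBox paper referenced below), the sketch mixes the given noisy labels with the model's own predictions; the mixing weight beta is an arbitrary example value.

```python
import torch
import torch.nn.functional as F

def soft_bootstrap_loss(logits, noisy_targets, beta=0.95):
    # Targets are a convex combination of the (possibly noisy) labels and the
    # model's own predictions, reducing the impact of mislabelled samples.
    probs = F.softmax(logits, dim=1)
    one_hot = F.one_hot(noisy_targets, num_classes=logits.size(1)).float()
    targets = beta * one_hot + (1.0 - beta) * probs.detach()
    return -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```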
Key references:
- NoiseBox: Towards More Efficient and Effective Learning with Noisy Labels. IEEE Transactions on Circuits and Systems for Video Technology, 2024.
- Linear Maximum Margin Classifier for Learning from Uncertain Data. IEEE Trans. Pattern Anal. Mach. Intell., 2018.
- Unsupervised convolutional neural networks for motion estimation. In 2016 IEEE International Conference on Image Processing (ICIP 2016), Phoenix, AZ, USA, September 25-28, 2016.
Video Understanding
This line of research is concerned with the analysis of video for retrieval, summarisation and activity/action recognition.
Key references:
- FIVR: Fine-Grained Incident Video Retrieval. IEEE Trans. Multim., 2019.
- TARN: Temporal Attentive Relation Network for Few-Shot and Zero-Shot Action Recognition. In 30th British Machine Vision Conference (BMVC 2019), Cardiff, UK, September 9-12, 2019.
- ViSiL: Fine-Grained Spatio-Temporal Video Similarity Learning. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV 2019), Seoul, Korea (South), October 27 - November 2, 2019.