Recent publications

(Also see my Google Scholar and my Queen Mary University publications webpage)

2024

LAFS: Landmark-based Facial Self-supervised Learning for Face Recognition

Ioannis Patras, Zhonglin Sun, and Georgios Tzimiropoulos

In IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2024

generation-and-learning

In this work we focus on learning facial representations that can be adapted to train effective face recognition models, particularly in the absence of labels. Firstly, compared with existing labelled face datasets, a vastly larger magnitude of unlabeled faces exists in the real world. We explore the learning strategy of these unlabeled facial images through self-supervised pretraining to transfer generalized face recognition performance. Moreover, motivated by one recent finding, that is, the face saliency area is critical for face recognition, in contrast to utilizing random cropped blocks of images for constructing augmentations in pretraining, we utilize patches localized by extracted facial landmarks. This enables our method, namely LAndmark-based Facial Self-supervised learning (LAFS), to learn key representations that are more critical for face recognition. We also incorporate two landmark-specific augmentations which introduce more diversity of landmark information to further regularize the learning. With learned landmark-based facial representations, we further adapt the representation for face recognition with regularization mitigating variations in landmark positions. Our method achieves significant improvement over the state-of-the-art on multiple face recognition benchmarks, especially on more challenging few-shot scenarios.
Self-Supervised Facial Representation Learning with Facial Region Awareness

Zheng Gao, and Ioannis Patras

In IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2024

generation-and-learning

Self-supervised pre-training has been proved to be effective in learning transferable representations that benefit various visual tasks. This paper asks this question: can self-supervised pre-training learn general facial representations for various facial analysis tasks? Recent efforts toward this goal are limited to treating each face image as a whole, i.e., learning consistent facial representations at the image-level, which overlooks the consistency of local facial representations (i.e., facial regions like eyes, nose, etc). In this work, we make a first attempt to propose a novel self-supervised facial representation learning framework to learn consistent global and local facial representations, Facial Region Awareness (FRA). Specifically, we explicitly enforce the consistency of facial regions by matching the local facial representations across views, which are extracted with learned heatmaps highlighting the facial regions. Inspired by the mask prediction in supervised semantic segmentation, we obtain the heatmaps via cosine similarity between the per-pixel projection of feature maps and facial mask embeddings computed from learnable positional embeddings, which leverage the attention mechanism to globally look up the facial image for facial regions. To learn such heatmaps, we formulate the learning of facial mask embeddings as a deep clustering problem by assigning the pixel features from the feature maps to them. The transfer learning results on facial classification and regression tasks show that our FRA outperforms previous pre-trained models and more importantly, using ResNet as the unified backbone for various tasks, our FRA achieves comparable or even better performance compared with SOTA methods in facial analysis tasks.
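The heatmap step described in the abstract above (cosine similarity between per-pixel features and facial mask embeddings) can be illustrated with a small sketch. This is not the authors' implementation: the function names `cosine` and `heatmaps`, the nested-list layout of the feature grid, and the toy values are all assumptions made for illustration, and the learned projection and attention-based embedding lookup are omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def heatmaps(pixel_features, mask_embeddings):
    """Return one H x W similarity map per mask embedding.

    pixel_features: H x W grid (nested lists) of D-dimensional vectors.
    mask_embeddings: list of D-dimensional vectors, one per facial region.
    """
    return [
        [[cosine(px, emb) for px in row] for row in pixel_features]
        for emb in mask_embeddings
    ]

# Toy example: a 1 x 2 feature grid and two hypothetical region embeddings.
features = [[[1.0, 0.0], [0.0, 2.0]]]
embeddings = [[1.0, 0.0], [0.0, 1.0]]
maps = heatmaps(features, embeddings)
```

Each resulting map is high where the pixel feature points in the same direction as the corresponding embedding, which is what lets such maps highlight one facial region at a time.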
Improving Fairness using Vision-Language Driven Image Augmentation

Moreno D’Incà, Christos Tzelepis, Ioannis Patras, and 1 more author

In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024

generation-and-learning multimodal-ml

Fairness is crucial when training a deep-learning discriminative model, especially in the facial domain. Models tend to correlate specific characteristics (such as age and skin color) with unrelated attributes (downstream tasks), resulting in biases which do not correspond to reality. It is common knowledge that these correlations are present in the data and are then transferred to the models during training. This paper proposes a method to mitigate these correlations to improve fairness. To do so, we learn interpretable and meaningful paths lying in the semantic space of a pre-trained diffusion model (DiffAE) – such paths being supervised by contrastive text dipoles. That is, we learn to edit protected characteristics (age and skin color). These paths are then applied to augment images to improve the fairness of a given dataset. We test the proposed method on CelebA-HQ and UTKFace on several downstream tasks with age and skin color as protected characteristics. As a proxy for fairness, we compute the difference in accuracy with respect to the protected characteristics. Quantitative results show how the augmented images help the model improve the overall accuracy, the aforementioned metric, and the disparity of equal opportunity.
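The fairness proxy mentioned in the abstract above, the difference in accuracy with respect to a protected characteristic, can be sketched as follows. The helper names (`accuracy_by_group`, `accuracy_gap`), the group labels, and the toy predictions are illustrative assumptions; the paper's exact evaluation protocol may differ.

```python
from collections import defaultdict

def accuracy_by_group(y_true, y_pred, groups):
    """Return {group: accuracy} for each value of the protected attribute."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / total[g] for g in total}

def accuracy_gap(y_true, y_pred, groups):
    """Largest accuracy difference between any two groups (0 means parity)."""
    acc = accuracy_by_group(y_true, y_pred, groups)
    return max(acc.values()) - min(acc.values())

# Toy data: binary downstream labels, with age as the protected attribute.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]
groups = ["young"] * 4 + ["old"] * 4

gap = accuracy_gap(y_true, y_pred, groups)  # 0.75 for "young" vs 0.5 for "old"
```

A smaller gap after training on the augmented dataset would indicate that the model's accuracy depends less on the protected characteristic.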
Self-Supervised Representation Learning with Cross-Context Learning between Global and Hypercolumn Features

Zheng Gao, Chen Feng, and Ioannis Patras

In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024

learning-from-few-samples

Whilst contrastive learning yields powerful representations by matching different augmented views of the same instance, it lacks the ability to capture the similarities between different instances. One popular way to address this limitation is by learning global features (after the global pooling) to capture inter-instance relationships based on knowledge distillation, where the global features of the teacher are used to guide the learning of the global features of the student. Inspired by cross-modality learning, we extend this existing framework that only learns from global features by encouraging the global features and intermediate layer features to learn from each other. This leads to our novel self-supervised framework: cross-context learning between global and hypercolumn features (CGH), that enforces the consistency of instance relations between low- and high-level semantics. Specifically, we stack the intermediate feature maps to construct a hypercolumn representation so that we can measure instance relations using two contexts (hypercolumn and global feature) separately, and then use the relations of one context to guide the learning of the other. This cross-context learning allows the model to learn from the differences between the two contexts. The experimental results on linear classification and downstream tasks show that our method outperforms the state-of-the-art methods.

2023

A Simple Baseline for Knowledge-Based Visual Question Answering

Alexandros Xenos, Themos Stafylakis, Ioannis Patras, and 1 more author

In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

multimodal-ml

This paper is on the problem of Knowledge-Based Visual Question Answering (KB-VQA). Recent works have emphasized the significance of incorporating both explicit (through external databases) and implicit (through LLMs) knowledge to answer questions requiring external knowledge effectively. A common limitation of such approaches is that they consist of relatively complicated pipelines and often heavily rely on accessing GPT-3 API. Our main contribution in this paper is to propose a much simpler and readily reproducible pipeline which, in a nutshell, is based on efficient in-context learning by prompting LLaMA (1 and 2) using question-informative captions as contextual information. Contrary to recent approaches, our method is training-free, does not require access to external databases or APIs, and yet achieves state-of-the-art accuracy on the OK-VQA and A-OK-VQA datasets. Finally, we perform several ablation studies to understand important aspects of our method. Our code is publicly available at https://github.com/alexandrosXe/A-Simple-Baseline-For-Knowledge-Based-VQA.
EmoCLIP: A Vision-Language Method for Zero-Shot Video Facial Expression Recognition

Niki Maria Foteinopoulou, and Ioannis Patras

ArXiv, 2023

multimodal-ml

Facial Expression Recognition (FER) is a crucial task in affective computing, but its conventional focus on the seven basic emotions limits its applicability to the complex and expanding emotional spectrum. To address the issue of new and unseen emotions present in dynamic in-the-wild FER, we propose a novel vision-language model that utilises sample-level text descriptions (i.e. captions of the context, expressions or emotional cues) as natural language supervision, aiming to enhance the learning of rich latent representations, for zero-shot classification. To test this, we evaluate using zero-shot classification of the model trained on sample-level descriptions on four popular dynamic FER datasets. Our findings show that this approach yields significant improvements when compared to baseline methods. Specifically, for zero-shot video FER, we outperform CLIP by over 10% in terms of Weighted Average Recall and 5% in terms of Unweighted Average Recall on several datasets. Furthermore, we evaluate the representations obtained from the network trained using sample-level descriptions on the downstream task of mental health symptom estimation, achieving performance comparable or superior to state-of-the-art methods and strong agreement with human experts. Namely, we achieve a Pearson’s Correlation Coefficient of up to 0.85 on schizophrenia symptom severity estimation, which is comparable to human experts’ agreement. The code is publicly available on: https://github.com/NickyFot/EmoCLIP.
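The two metrics quoted in the abstract above are commonly defined as follows: Weighted Average Recall (WAR) weights each class recall by its support, which makes it equal to overall accuracy, while Unweighted Average Recall (UAR) is the plain mean of per-class recalls. The sketch below uses those standard definitions; the function name `war_uar` and the toy labels are assumptions for illustration.

```python
from collections import Counter

def war_uar(y_true, y_pred):
    """Return (WAR, UAR) for two equal-length label sequences."""
    support = Counter(y_true)                      # samples per true class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = {c: hits[c] / support[c] for c in support}
    war = sum(hits.values()) / len(y_true)         # support-weighted recall
    uar = sum(recalls.values()) / len(recalls)     # unweighted mean recall
    return war, uar

# Imbalanced toy data: 8 "happy" clips and 2 "fear" clips.
y_true = ["happy"] * 8 + ["fear"] * 2
y_pred = ["happy"] * 8 + ["happy", "fear"]

war, uar = war_uar(y_true, y_pred)  # WAR = 0.9, UAR = 0.75
```

The gap between the two values in this example shows why both are reported on imbalanced expression datasets: WAR is dominated by the frequent class, whereas UAR penalises poor recall on the rare one.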
Parts of Speech-Grounded Subspaces in Vision-Language Models

James Oldfield, Christos Tzelepis, Yannis Panagakis, and 2 more authors

In Advances in Neural Information Processing Systems (NeurIPS), 2023

generation-and-learning multimodal-ml

Latent image representations arising from vision-language models have proved immensely useful for a variety of downstream tasks. However, their utility is limited by their entanglement with respect to different visual attributes. For instance, recent work has shown that CLIP image representations are often biased toward specific visual properties (such as objects or actions) in an unpredictable manner. In this paper, we propose to separate representations of the different visual modalities in CLIP’s joint vision-language space by leveraging the association between parts of speech and specific visual modes of variation (e.g. nouns relate to objects, adjectives describe appearance). This is achieved by formulating an appropriate component analysis model that learns subspaces capturing variability corresponding to a specific part of speech, while jointly minimising variability to the rest. Such a subspace yields disentangled representations of the different visual properties of an image or text in closed form while respecting the underlying geometry of the manifold on which the representations lie. What’s more, we show the proposed model additionally facilitates learning subspaces corresponding to specific visual appearances (e.g. artists’ painting styles), which enables the selective removal of entire visual themes from CLIP-based text-to-image synthesis. We validate the model both qualitatively, by visualising the subspace projections with a text-to-image model and by preventing the imitation of artists’ styles, and quantitatively, through class invariance metrics and improvements to baseline zero-shot classification.
Prompting Visual-Language Models for Dynamic Facial Expression Recognition

Zengqun Zhao, and Ioannis Patras

In British Machine Vision Conference (BMVC), 2023

affective-computing multimodal-ml

This paper presents a novel visual-language model called DFER-CLIP, which is based on the CLIP model and designed for in-the-wild Dynamic Facial Expression Recognition (DFER). Specifically, the proposed DFER-CLIP consists of a visual part and a textual part. For the visual part, based on the CLIP image encoder, a temporal model consisting of several Transformer encoders is introduced for extracting temporal facial expression features, and the final feature embedding is obtained as a learnable "class" token. For the textual part, we use as inputs textual descriptions of the facial behaviour that is related to the classes (facial expressions) that we are interested in recognising – those descriptions are generated using large language models, like ChatGPT. This, in contrast to works that use only the class names, more accurately captures the relationship between them. Alongside the textual description, we introduce a learnable token which helps the model learn relevant context information for each expression during training. Extensive experiments demonstrate the effectiveness of the proposed method and show that our DFER-CLIP also achieves state-of-the-art results compared with the current supervised DFER methods on the DFEW, FERV39k, and MAFW benchmarks.
HyperReenact: One-Shot Reenactment via Jointly Learning to Refine and Retarget Faces

Stella Bounareli, Christos Tzelepis, Vasileios Argyriou, and 2 more authors

In IEEE/CVF International Conference on Computer Vision (ICCV), 2023

generation-and-learning

In this paper, we present our method for neural face reenactment, called HyperReenact, that aims to generate realistic talking head images of a source identity, driven by a target facial pose. Existing state-of-the-art face reenactment methods train controllable generative models that learn to synthesize realistic facial images, yet producing reenacted faces that are prone to significant visual artifacts, especially under the challenging condition of extreme head pose changes, or requiring expensive few-shot fine-tuning to better preserve the source identity characteristics. We propose to address these limitations by leveraging the photorealistic generation ability and the disentangled properties of a pretrained StyleGAN2 generator, by first inverting the real images into its latent space and then using a hypernetwork to perform: (i) refinement of the source identity characteristics and (ii) facial pose re-targeting, eliminating this way the dependence on external editing methods that typically produce artifacts. Our method operates under the one-shot setting (i.e., using a single source frame) and allows for cross-subject reenactment, without requiring any subject-specific fine-tuning. We compare our method both quantitatively and qualitatively against several state-of-the-art techniques on the standard benchmarks of VoxCeleb1 and VoxCeleb2, demonstrating the superiority of our approach in producing artifact-free images, exhibiting remarkable robustness even under extreme head pose changes.
@article{bounareli2023iccv,
  select_key = {true},
  tags = {generation-and-learning},
  title = {{HyperReenact}: One-Shot Reenactment via Jointly Learning to Refine and Retarget Faces},
  author = {Bounareli, Stella and Tzelepis, Christos and Argyriou, Vasileios and Patras, Ioannis and Tzimiropoulos, Georgios},
  journal = {2023 IEEE/CVF International Conference on Computer Vision (ICCV)},
  volume = {},
  number = {},
  pages = {},
  year = {2023},
  url = {},
  doi = {}
}
SimDETR: Simplifying self-supervised pretraining for DETR

Ioannis Maniadis Metaxas, Adrian Bulat, Ioannis Patras, and 2 more authors

2023

learning-from-few-samples

DETR-based object detectors have achieved remarkable performance but are sample-inefficient and exhibit slow convergence. Unsupervised pretraining has been found to be helpful to alleviate these impediments, allowing training with large amounts of unlabeled data to improve the detector’s performance. However, existing methods have their own limitations, like keeping the detector’s backbone frozen in order to avoid performance degradation and utilizing pretraining objectives misaligned with the downstream task. To overcome these limitations, we propose a simple pretraining framework for DETR-based detectors that consists of three simple yet key ingredients: (i) richer, semantics-based initial proposals derived from high-level feature maps, (ii) discriminative training using object pseudo-labels produced via clustering, (iii) self-training to take advantage of the improved object proposals learned by the detector. We report two main findings: (1) Our pretraining outperforms prior DETR pretraining works on both the full and low data regimes by significant margins. (2) We show we can pretrain DETR from scratch (including the backbone) directly on complex image datasets like COCO, paving the path for unsupervised representation learning directly using DETR.
MaskCon: Masked Contrastive Learning for Coarse-Labelled Dataset

Chen Feng, and Ioannis Patras

In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023

learning-from-few-samples

Deep learning has achieved great success in recent years with the aid of advanced neural network structures and large-scale human-annotated datasets. However, it is often costly and difficult to accurately and efficiently annotate large-scale datasets, especially for some specialized domains where fine-grained labels are required. In this setting, coarse labels are much easier to acquire as they do not require expert knowledge. In this work, we propose a contrastive learning method, called Masked Contrastive learning (MaskCon), to address the under-explored problem setting, where we learn with a coarse-labelled dataset in order to address a finer labelling problem. More specifically, within the contrastive learning framework, for each sample our method generates soft-labels against other samples and another augmented view of the sample in question. By contrast to self-supervised contrastive learning where only the sample’s augmentations are considered hard positives, and in supervised contrastive learning where only samples with the same coarse labels are considered hard positives, we propose soft labels based on sample distances, that are masked by the coarse labels. This allows us to utilize both inter-sample relations and coarse labels. We demonstrate that our method can obtain as special cases many existing state-of-the-art works and that it provides tighter bounds on the generalization error. Experimentally, our method achieves significant improvement over the current state-of-the-art in various datasets, including CIFAR10, CIFAR100, ImageNet-1K, Stanford Online Products and Stanford Cars196 datasets.
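The idea in the abstract above of soft labels derived from sample similarities and masked by coarse labels can be sketched generically as a softmax restricted to candidates that share the anchor's coarse label. This is only an assumption-laden illustration, not the paper's formulation: the function name `masked_soft_labels`, the temperature value, and the toy similarities are hypothetical, and the special treatment of the anchor's own augmented view is omitted.

```python
import math

def masked_soft_labels(sims, key_labels, anchor_label, temperature=0.1):
    """Soft targets over candidate samples for one anchor.

    sims[j]: similarity between the anchor and candidate j.
    key_labels[j]: coarse label of candidate j.
    Candidates with a different coarse label get weight 0; the remaining
    ones share probability mass according to a temperature-scaled softmax.
    """
    weights = [
        math.exp(s / temperature) if lbl == anchor_label else 0.0
        for s, lbl in zip(sims, key_labels)
    ]
    z = sum(weights)
    return [w / z for w in weights] if z > 0 else weights

# Toy example: the anchor is a "dog"; candidate 1 is masked out as a "cat".
soft = masked_soft_labels(
    sims=[0.9, 0.8, 0.1],
    key_labels=["dog", "cat", "dog"],
    anchor_label="dog",
)
```

In this sketch the coarse label acts as a hard mask, while the similarities decide how the remaining mass is split among same-class candidates, so closer samples become stronger positives.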
Attribute-preserving Face Dataset Anonymization via Latent Code Optimization

Simone Barattin, Christos Tzelepis, Ioannis Patras, and 1 more author

In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023

generation-and-learning

This work addresses the problem of anonymizing the identity of faces in a dataset of images, such that the privacy of those depicted is not violated, while at the same time the dataset is useful for downstream tasks such as training machine learning models. To the best of our knowledge, we are the first to explicitly address this issue and deal with two major drawbacks of the existing state-of-the-art approaches, namely that they (i) require the costly training of additional, purpose-trained neural networks, and/or (ii) fail to retain the facial attributes of the original images in the anonymized counterparts, the preservation of which is of paramount importance for their use in downstream tasks. We accordingly present a task-agnostic anonymization procedure that directly optimizes the images’ latent representation in the latent space of a pre-trained GAN. By optimizing the latent codes directly, we ensure both that the identity is of a desired distance away from the original (with an identity obfuscation loss), whilst preserving the facial attributes (using a novel feature-matching loss in FaRL’s deep feature space). We demonstrate through a series of both qualitative and quantitative experiments that our method is capable of anonymizing the identity of the images whilst – crucially – better-preserving the facial attributes.
Self-Supervised Video Similarity Learning

Giorgos Kordopatis-Zilos, Giorgos Tolias, Christos Tzelepis, and 3 more authors

In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshop (CVPRW), 2023

video-understanding

We introduce S^2VS, a video similarity learning approach with self-supervision. Self-Supervised Learning (SSL) is typically used to train deep models on a proxy task so as to have strong transferability on target tasks after fine-tuning. Here, in contrast to prior work, SSL is used to perform video similarity learning and address multiple retrieval and detection tasks at once with no use of labeled data. This is achieved by learning via instance-discrimination with task-tailored augmentations and the widely used InfoNCE loss together with an additional loss operating jointly on self-similarity and hard-negative similarity. We benchmark our method on tasks where video relevance is defined with varying granularity, ranging from video copies to videos depicting the same incident or event. We learn a single universal model that achieves state-of-the-art performance on all tasks, surpassing previously proposed methods that use labeled data.
@inproceedings{Kordopatis2023s2vs,
  title = {Self-Supervised Video Similarity Learning},
  author = {Kordopatis{-}Zilos, Giorgos and Tolias, Giorgos and Tzelepis, Christos and Kompatsiaris, Ioannis and Patras, Ioannis and Papadopoulos, Symeon},
  tags = {video-understanding},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshop (CVPRW)},
  volume = {},
  year = {2023}
}
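The InfoNCE loss named in the S^2VS abstract above has a standard form: the negative log-probability of the positive pair under a temperature-scaled softmax over one positive and several negative similarities. The sketch below shows that generic form for a single anchor; the function name `info_nce`, the temperature default, and the toy similarity values are assumptions, and the paper's additional loss on self-similarity and hard-negative similarity is not covered.

```python
import math

def info_nce(pos_sim, neg_sims, temperature=0.07):
    """InfoNCE loss for one anchor.

    pos_sim: similarity between the anchor and its positive (augmented) view.
    neg_sims: similarities between the anchor and negative samples.
    """
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]

# The loss is small when the positive is clearly the most similar sample...
easy = info_nce(0.9, [0.1, 0.0])
# ...and large when a negative is more similar than the positive.
hard = info_nce(0.1, [0.9, 0.0])
```

With one negative that is exactly as similar as the positive, the loss equals log 2 regardless of the temperature, which is a convenient sanity check.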
DivClust: Controlling Diversity in Deep Clustering

Ioannis Maniadis Metaxas, Georgios Tzimiropoulos, and Ioannis Patras

In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023

learning-from-few-samples

Clustering has been a major research topic in the field of machine learning, one to which Deep Learning has recently been applied with significant success. However, an aspect of clustering that is not addressed by existing deep clustering methods, is that of efficiently producing multiple, diverse partitionings for a given dataset. This is particularly important, as a diverse set of base clusterings are necessary for consensus clustering, which has been found to produce better and more robust results than relying on a single clustering. To address this gap, we propose DivClust, a diversity controlling loss that can be incorporated into existing deep clustering frameworks to produce multiple clusterings with the desired degree of diversity. We conduct experiments with multiple datasets and deep clustering frameworks and show that: a) our method effectively controls diversity across frameworks and datasets with very small additional computational cost, b) the sets of clusterings learned by DivClust include solutions that significantly outperform single-clustering baselines, and c) using an off-the-shelf consensus clustering algorithm, DivClust produces consensus clustering solutions that consistently outperform single-clustering baselines, effectively improving the performance of the base deep clustering framework.
PandA: Unsupervised Learning of Parts and Appearances in the Feature Maps of GANs

James Oldfield, Christos Tzelepis, Yannis Panagakis, and 2 more authors

In The Eleventh International Conference on Learning Representations (ICLR), 2023

generation-and-learning

Recent advances in the understanding of Generative Adversarial Networks (GANs) have led to remarkable progress in visual editing and synthesis tasks, capitalizing on the rich semantics that are embedded in the latent spaces of pre-trained GANs. However, existing methods are often tailored to specific GAN architectures and are limited to either discovering global semantic directions that do not facilitate localized control, or require some form of supervision through manually provided regions or segmentation masks. In this light, we present an architecture-agnostic approach that jointly discovers factors representing spatial parts and their appearances in an entirely unsupervised fashion. These factors are obtained by applying a semi-nonnegative tensor factorization on the feature maps, which in turn enables context-aware local image editing with pixel-level control. In addition, we show that the discovered appearance factors correspond to saliency maps that localize concepts of interest, without using any labels. Experiments on a wide range of GAN architectures and datasets show that, in comparison to the state of the art, our method is far more efficient in terms of training time and, most importantly, provides much more accurate localized control. Our code is available at: https://github.com/james-oldfield/PandA.
StyleMask: Disentangling the Style Space of StyleGAN2 for Neural Face Reenactment

Stella Bounareli, Christos Tzelepis, Vasileios Argyriou, and 2 more authors

In 17th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2023, Waikoloa Beach, HI, USA, January 5-8, 2023

generation-and-learning

In this paper we address the problem of neural face reenactment, where, given a pair of a source and a target facial image, we need to transfer the target’s pose (defined as the head pose and its facial expressions) to the source image, by preserving at the same time the source’s identity characteristics (e.g., facial shape, hair style, etc.), even in the challenging case where the source and the target faces belong to different identities. In doing so, we address some of the limitations of the state-of-the-art works, namely, a) that they depend on paired training data (i.e., source and target faces have the same identity), b) that they rely on labeled data during inference, and c) that they do not preserve identity in large head pose changes. More specifically, we propose a framework that, using unpaired randomly generated facial images, learns to disentangle the identity characteristics of the face from its pose by incorporating the recently introduced style space S of StyleGAN2, a latent representation space that exhibits remarkable disentanglement properties. By capitalizing on this, we learn to successfully mix a pair of source and target style codes using supervision from a 3D model. The resulting latent code, that is subsequently used for reenactment, consists of latent units corresponding to the facial pose of the target only and of units corresponding to the identity of the source only, leading to notable improvement in the reenactment performance compared to recent state-of-the-art methods. In comparison to state of the art, we quantitatively and qualitatively show that the proposed method produces higher quality results even on extreme pose variations. Finally, we report results on real images by first embedding them on the latent space of the pretrained generator. We make the code and pretrained models publicly available.
@inproceedings{bounareli2023stylemask,
  title = {StyleMask: Disentangling the Style Space of StyleGAN2 for Neural Face Reenactment},
  author = {Bounareli, Stella and Tzelepis, Christos and Argyriou, Vasileios and Patras, Ioannis and Tzimiropoulos, Georgios},
  tags = {generation-and-learning},
  booktitle = {17th {IEEE} International Conference on Automatic Face and Gesture Recognition, {FG} 2023, Waikoloa Beach, HI, USA, January 5-8, 2023},
  pages = {1--8},
  publisher = {{IEEE}},
  year = {2023},
  doi = {10.1109/FG57933.2023.10042744}
}
"Just To See You Smile": SMILEY, a Voice-Guided GUY GAN

Qi Yang, Christos Tzelepis, Sergey Nikolenko, and 2 more authors

In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, WSDM 2023, Singapore, 27 February 2023 - 3 March 2023

generation-and-learning

In this technical demonstration, we present SMILEY, a voice-guided virtual assistant. The system utilizes a deep neural architecture ContraCLIP to manipulate facial attributes using voice instructions, allowing for deeper speaker engagement and smoother customer experience when being used in the "virtual concierge" scenario. We validate the effectiveness of SMILEY and ContraCLIP via a successful real-world case study in Singapore and a large-scale quantitative evaluation.
@inproceedings{yang2023smiley,
  title = {"Just To See You Smile": {SMILEY}, a Voice-Guided {GUY} {GAN}},
  author = {Yang, Qi and Tzelepis, Christos and Nikolenko, Sergey and Patras, Ioannis and Farseev, Aleksandr},
  tags = {generation-and-learning},
  booktitle = {Proceedings of the Sixteenth {ACM} International Conference on Web Search and Data Mining, {WSDM} 2023, Singapore, 27 February 2023 - 3 March 2023},
  editor = {Chua, Tat{-}Seng and Lauw, Hady W. and Si, Luo and Terzi, Evimaria and Tsaparas, Panayiotis},
  pages = {1196--1199},
  publisher = {{ACM}},
  year = {2023},
  doi = {10.1145/3539597.3573031}
}

2022

DnS: Distill-and-Select for Efficient and Accurate Video Indexing and Retrieval

Giorgos Kordopatis-Zilos, Christos Tzelepis, Symeon Papadopoulos, and 2 more authors

International Journal of Computer Vision (IJCV), 2022

video-understanding

In this paper, we address the problem of high performance and computationally efficient content-based video retrieval in large-scale datasets. Current methods typically propose either: (i) fine-grained approaches employing spatio-temporal representations and similarity calculations, achieving high performance at a high computational cost or (ii) coarse-grained approaches representing/indexing videos as global vectors, where the spatio-temporal structure is lost, providing low performance but also having low computational cost. In this work, we propose a Knowledge Distillation framework, called Distill-and-Select (DnS), that starting from a well-performing fine-grained Teacher Network learns: a) Student Networks at different retrieval performance and computational efficiency trade-offs and b) a Selector Network that at test time rapidly directs samples to the appropriate student to maintain both high retrieval performance and high computational efficiency. We train several students with different architectures and arrive at different trade-offs of performance and efficiency, i.e., speed and storage requirements, including fine-grained students that store/index videos using binary representations. Importantly, the proposed scheme allows Knowledge Distillation in large, unlabelled datasets – this leads to good students. We evaluate DnS on five public datasets on three different video retrieval tasks and demonstrate a) that our students achieve state-of-the-art performance in several cases and b) that the DnS framework provides an excellent trade-off between retrieval performance, computational speed, and storage space. In specific configurations, the proposed method achieves similar mAP with the teacher but is 20 times faster and requires 240 times less storage space. The collected dataset and implementation are publicly available at https://github.com/mever-team/distill-and-select.
@article{kordopatis2022dns,
  title = {{DnS}: Distill-and-Select for Efficient and Accurate Video Indexing and Retrieval},
  author = {Kordopatis{-}Zilos, Giorgos and Tzelepis, Christos and Papadopoulos, Symeon and Kompatsiaris, Ioannis and Patras, Ioannis},
  tags = {video-understanding},
  journal = {International Journal of Computer Vision (IJCV)},
  volume = {130},
  number = {10},
  pages = {2385--2407},
  year = {2022},
  url = {https://doi.org/10.1007/s11263-022-01651-3},
  doi = {10.1007/s11263-022-01651-3}
}
CovMix: Covariance Mixing Regularization for Motor Imagery Decoding

Georgios Zoumpourlis, and Ioannis Patras

In 10th International Winter Conference on Brain-Computer Interface, BCI 2022, Gangwon-do, Republic of Korea, February 21-23, 2022, 2022

PDF Website
affective-computing
SSR: An Efficient and Robust Framework for Learning with Unknown Label Noise

Chen Feng, Georgios Tzimiropoulos, and Ioannis Patras

In 33rd British Machine Vision Conference 2022, BMVC 2022, London, UK, November 21-24, 2022, 2022

PDF Code Website
learning-from-few-samples
Adaptive Soft Contrastive Learning

Chen Feng, and Ioannis Patras

In 26th International Conference on Pattern Recognition, ICPR 2022, Montreal, QC, Canada, August 21-25, 2022, 2022

PDF Code
learning-from-few-samples
Explaining video summarization based on the focus of attention

Evlampios E. Apostolidis, Georgios Balaouras, Vasileios Mezaris, and 1 more author

In IEEE International Symposium on Multimedia, ISM 2022, Naples, Italy, December 5-7, 2022, 2022

Abs Code
video-understanding

In this paper we propose a method for explaining video summarization. We start by formulating the problem as the creation of an explanation mask which indicates the parts of the video that most influenced the estimates of a video summarization network about the frames’ importance. Then, we explain how the typical analysis pipeline of attention-based networks for video summarization can be used to define explanation signals, and we examine various attention-based signals that have been studied as explanations in the NLP domain. We evaluate the performance of these signals by investigating the video summarization network’s input-output relationship according to different replacement functions, and utilizing measures that quantify the capability of explanations to spot the most and least influential parts of a video. We run experiments using an attention-based network (CA-SUM) and two datasets (SumMe and TVSum) for video summarization. Our evaluations indicate the advanced performance of explanations formed using the inherent attention weights, and demonstrate the ability of our method to explain the video summarization results using clues about the focus of the attention mechanism.
Summarizing Videos using Concentrated Attention and Considering the Uniqueness and Diversity of the Video Frames

Evlampios E. Apostolidis, Georgios Balaouras, Vasileios Mezaris, and 1 more author

In ICMR ’22: International Conference on Multimedia Retrieval, Newark, NJ, USA, June 27 - 30, 2022, 2022

Abs Code
video-understanding

In this work, we describe a new method for unsupervised video summarization. To overcome limitations of existing unsupervised video summarization approaches, that relate to the unstable training of Generator-Discriminator architectures, the use of RNNs for modeling long-range frames’ dependencies and the ability to parallelize the training process of RNN-based network architectures, the developed method relies solely on the use of a self-attention mechanism to estimate the importance of video frames. Instead of simply modeling the frames’ dependencies based on global attention, our method integrates a concentrated attention mechanism that is able to focus on non-overlapping blocks in the main diagonal of the attention matrix, and to enrich the existing information by extracting and exploiting knowledge about the uniqueness and diversity of the associated frames of the video. In this way, our method makes better estimates about the significance of different parts of the video, and drastically reduces the number of learnable parameters. Experimental evaluations using two benchmarking datasets (SumMe and TVSum) show the competitiveness of the proposed method against other state-of-the-art unsupervised summarization approaches, and demonstrate its ability to produce video summaries that are very close to the human preferences. An ablation study that focuses on the introduced components, namely the use of concentrated attention in combination with attention-based estimates about the frames’ uniqueness and diversity, shows their relative contributions to the overall summarization performance.
ContraCLIP: Interpretable GAN generation driven by pairs of contrasting sentences

Christos Tzelepis, James Oldfield, Georgios Tzimiropoulos, and 1 more author

ArXiv, 2022

Abs Bib PDF Code
generation-and-learning multimodal-ml

This work addresses the problem of discovering non-linear interpretable paths in the latent space of pre-trained GANs in a model-agnostic manner. In the proposed method, the discovery is driven by a set of pairs of natural language sentences with contrasting semantics, named semantic dipoles, that serve as the “limits” of the interpretation that we require the trainable latent paths to encode. By using the pre-trained CLIP encoder, the sentences are projected into the vision-language space, where they serve as dipoles, and where RBF-based warping functions define a set of non-linear directional paths, one for each semantic dipole, allowing in this way traversals from one semantic pole to the other. By defining an objective that discovers paths in the latent space of GANs that generate changes along the desired paths in the vision-language embedding space, we provide an intuitive way of controlling the underlying generative factors and address some of the limitations of the state-of-the-art works, namely, that a) they are typically tailored to specific GAN architectures (i.e., StyleGAN), b) they disregard the relative position of the manipulated and the original image in the image embedding and the relative position of the image and the text embeddings, and c) they lead to abrupt image manipulations and quickly arrive at regions of low density and, thus, low image quality, providing limited control of the generative factors. We provide extensive qualitative and quantitative results that demonstrate our claims with two pre-trained GANs, and make the code and the pre-trained models publicly available at: https://github.com/chi0tzp/ContraCLIP.
@article{tzelepis2022contraclip, title = {{ContraCLIP}: Interpretable {GAN} generation driven by pairs of contrasting sentences}, author = {Tzelepis, Christos and Oldfield, James and Tzimiropoulos, Georgios and Patras, Ioannis}, select_key = {true}, tags = {generation-and-learning,multimodal-ml}, journal = {ArXiv}, volume = {abs/2206.02104}, year = {2022}, url = {https://doi.org/10.48550/arXiv.2206.02104}, doi = {10.48550/arXiv.2206.02104}, eprinttype = {arXiv}, eprint = {2206.02104} }
Learning from Label Relationships in Human Affect

Niki Maria Foteinopoulou, and Ioannis Patras

In Proceedings of the 30th ACM International Conference on Multimedia, 2022

PDF Website
affective-computing

2021

Video Summarization Using Deep Neural Networks: A Survey

Evlampios E. Apostolidis, Eleni Adamantidou, Alexandros I. Metsai, and 2 more authors

Proc. IEEE, 2021

Abs PDF
video-understanding

Video summarization technologies aim to create a concise and complete synopsis by selecting the most informative parts of the video content. Several approaches have been developed over the last couple of decades, and the current state of the art is represented by methods that rely on modern deep neural network architectures. This work focuses on the recent advances in the area and provides a comprehensive survey of the existing deep-learning-based methods for generic video summarization. After presenting the motivation behind the development of technologies for video summarization, we formulate the video summarization task and discuss the main characteristics of a typical deep-learning-based analysis pipeline. Then, we suggest a taxonomy of the existing algorithms and provide a systematic review of the relevant literature that shows the evolution of the deep-learning-based video summarization technologies and leads to suggestions for future developments. We then report on protocols for the objective evaluation of video summarization algorithms, and we compare the performance of several deep-learning-based approaches. Based on the outcomes of these comparisons, as well as some documented considerations about the amount of annotated data and the suitability of evaluation protocols, we indicate potential future research directions.
AMIGOS: A Dataset for Affect, Personality and Mood Research on Individuals and Groups

Juan Abdon Miranda Correa, Mojtaba Khomami Abadi, Nicu Sebe, and 1 more author

IEEE Trans. Affect. Comput., 2021

PDF Website
affective-computing
SchiNet: Automatic Estimation of Symptoms of Schizophrenia from Facial Behaviour Analysis

Mina Bishay, Petar Palasek, Stefan Priebe, and 1 more author

IEEE Trans. Affect. Comput., 2021

PDF Website
affective-computing
AC-SUM-GAN: Connecting Actor-Critic and Generative Adversarial Networks for Unsupervised Video Summarization

Evlampios E. Apostolidis, Eleni Adamantidou, Alexandros I. Metsai, and 2 more authors

IEEE Trans. Circuits Syst. Video Technol., 2021

Abs Code
video-understanding

This paper presents a new method for unsupervised video summarization. The proposed architecture embeds an Actor-Critic model into a Generative Adversarial Network and formulates the selection of important video fragments (that will be used to form the summary) as a sequence generation task. The Actor and the Critic take part in a game that incrementally leads to the selection of the video key-fragments, and their choices at each step of the game result in a set of rewards from the Discriminator. The designed training workflow allows the Actor and Critic to discover a space of actions and automatically learn a policy for key-fragment selection. Moreover, the introduced criterion for choosing the best model after the training ends, enables the automatic selection of proper values for parameters of the training process that are not learned from the data (such as the regularization factor sigma). Experimental evaluation on two benchmark datasets (SumMe and TVSum) demonstrates that the proposed AC-SUM-GAN model performs consistently well and gives SoA results in comparison to unsupervised methods, that are also competitive with respect to supervised methods.
Pairwise Ranking Network for Affect Recognition

Georgios Zoumpourlis, and Ioannis Patras

In 9th International Conference on Affective Computing and Intelligent Interaction, ACII 2021, Nara, Japan, September 28 - Oct. 1, 2021, 2021

PDF Website
affective-computing
Tensor Component Analysis for Interpreting the Latent Space of GANs

James Oldfield, Markos Georgopoulos, Yannis Panagakis, and 2 more authors

In 32nd British Machine Vision Conference 2021, BMVC 2021, Online, November 22-25, 2021, 2021

PDF Website
generation-and-learning
Combining Global and Local Attention with Positional Encoding for Video Summarization

Evlampios E. Apostolidis, Georgios Balaouras, Vasileios Mezaris, and 1 more author

In IEEE International Symposium on Multimedia, ISM 2021, Naples, Italy, November 29 - Dec. 1, 2021, 2021

Abs Code
video-understanding

This paper presents a new method for supervised video summarization. To overcome drawbacks of existing RNN-based summarization architectures, that relate to the modeling of long-range frames’ dependencies and the ability to parallelize the training process, the developed model relies on the use of self-attention mechanisms to estimate the importance of video frames. Contrary to previous attention-based summarization approaches that model the frames’ dependencies by observing the entire frame sequence, our method combines global and local multi-head attention mechanisms to discover different modelings of the frames’ dependencies at different levels of granularity. Moreover, the utilized attention mechanisms integrate a component that encodes the temporal position of video frames - this is of major importance when producing a video summary. Experiments on two datasets (SumMe and TVSum) demonstrate the effectiveness of the proposed model compared to existing attention-based methods, and its competitiveness against other state-of-the-art supervised summarization approaches. An ablation study that focuses on our main proposed components, namely the use of global and local multi-head attention mechanisms in collaboration with an absolute positional encoding component, shows their relative contributions to the overall summarization performance.
Combining Adversarial and Reinforcement Learning for Video Thumbnail Selection

Evlampios E. Apostolidis, Eleni Adamantidou, Vasileios Mezaris, and 1 more author

In ICMR ’21: International Conference on Multimedia Retrieval, Taipei, Taiwan, August 21-24, 2021, 2021

Abs Code
video-understanding

This paper presents a new method for unsupervised video thumbnail selection. The developed network architecture selects video thumbnails based on two criteria: the representativeness and the aesthetic quality of their visual content. Training relies on a combination of adversarial and reinforcement learning. The former is used to train a discriminator, whose goal is to distinguish the original from a reconstructed version of the video based on a small set of candidate thumbnails. The discriminator’s feedback is a measure of the representativeness of the selected thumbnails. This measure is combined with estimates about the aesthetic quality of the thumbnails (made using a SoA Fully Convolutional Network) to form a reward and train the thumbnail selector via reinforcement learning. Experiments on two datasets (OVP and Youtube) show the competitiveness of the proposed method against other SoA approaches. An ablation study with respect to the adopted thumbnail selection criteria documents the importance of considering the aesthetics, and the contribution of this information when used in combination with measures about the representativeness of the visual content.
Few-Shot Action Localization without Knowing Boundaries

Ting-Ting Xie, Christos Tzelepis, Fan Fu, and 1 more author

In ICMR ’21: International Conference on Multimedia Retrieval, Taipei, Taiwan, August 21-24, 2021, 2021
MultiMedia Modeling - 27th International Conference, MMM 2021, Prague, Czech Republic, June 22-24, 2021, Proceedings, Part I

2021
MultiMedia Modeling - 27th International Conference, MMM 2021, Prague, Czech Republic, June 22-24, 2021, Proceedings, Part II

2021
Uncertainty Propagation in Convolutional Neural Networks: Technical Report

Christos Tzelepis, and Ioannis Patras

CoRR, 2021

Abs Bib PDF Code

In this technical report we study the problem of propagation of uncertainty (in terms of variances of given uni-variate normal random variables) through typical building blocks of a Convolutional Neural Network (CNN). These include layers that perform linear operations, such as 2D convolutions, fully-connected, and average pooling layers, as well as layers that act non-linearly on their input, such as the Rectified Linear Unit (ReLU). Finally, we discuss the sigmoid function, for which we give approximations of its first- and second-order moments, as well as the binary cross-entropy loss function, for which we approximate its expected value under normal random inputs. A PyTorch implementation of the presented “uncertainty-aware” layers is available under the MIT license here: https://github.com/chi0tzp/UncPropCNN.
@article{tzelepis2021uncprop, title = {Uncertainty Propagation in Convolutional Neural Networks: Technical Report}, author = {Tzelepis, Christos and Patras, Ioannis}, journal = {CoRR}, volume = {abs/2102.06064}, year = {2021}, eprinttype = {arXiv}, eprint = {2102.06064} }
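The moment propagation described in the technical report above can be illustrated with a minimal sketch. It uses the standard closed-form results for an affine map of independent Gaussian inputs and for the first two moments of a rectified Gaussian. The function names are hypothetical, the independence of the inputs is an assumption made here, and this is not the API of the UncPropCNN repository.

```python
import math


def linear_moments(weights, bias, means, variances):
    """Mean and variance of y = w.x + b, assuming independent Gaussian inputs."""
    mean = sum(w * m for w, m in zip(weights, means)) + bias
    var = sum(w * w * v for w, v in zip(weights, variances))
    return mean, var


def relu_moments(mean, var):
    """Mean and variance of max(0, x) for x ~ N(mean, var), with var > 0."""
    std = math.sqrt(var)
    a = mean / std
    pdf = math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)  # standard normal density at a
    cdf = 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))         # standard normal CDF at a
    first = mean * cdf + std * pdf                            # E[max(0, x)]
    second = (mean * mean + var) * cdf + mean * std * pdf     # E[max(0, x)^2]
    return first, second - first * first
```

A fully-connected unit followed by a ReLU would then be handled by chaining the two calls, for example `relu_moments(*linear_moments(w, b, mu, var))`.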
WarpedGANSpace: Finding non-linear RBF paths in GAN latent space

Christos Tzelepis, Georgios Tzimiropoulos, and Ioannis Patras

2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021

Abs Bib PDF Code
generation-and-learning

This work addresses the problem of discovering, in an unsupervised manner, interpretable paths in the latent space of pretrained GANs, so as to provide an intuitive and easy way of controlling the underlying generative factors. In doing so, it addresses some of the limitations of the state-of-the-art works, namely, a) that they discover directions that are independent of the latent code, i.e., paths that are linear, and b) that their evaluation relies either on visual inspection or on laborious human labeling. More specifically, we propose to learn non-linear warpings on the latent space, each one parametrized by a set of RBF-based latent space warping functions, and where each warping gives rise to a family of non-linear paths via the gradient of the function. Building on the work of Voynov and Babenko, which discovers linear paths, we optimize the trainable parameters of the set of RBFs, so that images that are generated by codes along different paths are easily distinguishable by a discriminator network. This leads to easily distinguishable image transformations, such as pose and facial expressions in facial images. We show that linear paths can be derived as a special case of our method, and show experimentally that non-linear paths in the latent space lead to steeper, more disentangled and interpretable changes in the image space than in state-of-the-art methods, both qualitatively and quantitatively. We make the code and the pretrained models publicly available at https://github.com/chi0tzp/WarpedGANSpace.
@article{tzelepis2021warpedganspace, title = {{WarpedGANSpace}: Finding non-linear {RBF} paths in {GAN} latent space}, author = {Tzelepis, Christos and Tzimiropoulos, Georgios and Patras, I.}, tags = {generation-and-learning}, journal = {2021 IEEE/CVF International Conference on Computer Vision (ICCV)}, year = {2021}, pages = {6373-6382} }
S3: Supervised Self-supervised Learning under Label Noise

Chen Feng, Georgios Tzimiropoulos, and Ioannis Patras

CoRR, 2021

learning-from-few-samples
Estimating continuous affect with label uncertainty

Niki Maria Foteinopoulou, Christos Tzelepis, and Ioannis Patras

In 9th International Conference on Affective Computing and Intelligent Interaction, ACII 2021, Nara, Japan, September 28 - Oct. 1, 2021, 2021

Abs Bib PDF Code
affective-computing

Continuous affect estimation is a problem where there is an inherent uncertainty and subjectivity in the labels that accompany data samples – typically, datasets use the average of multiple annotations or self-reporting to obtain ground truth labels. In this work, we propose a method for uncertainty-aware continuous affect estimation that explicitly models the uncertainty of the ground truth label as a uni-variate Gaussian with mean equal to the ground truth label, and unknown variance. For each sample, the proposed neural network estimates not only the value of the target label (valence and arousal in our case), but also the variance. The network is trained with a loss that is defined as the KL-divergence between the estimation (valence/arousal) and the Gaussian around the ground truth. We show that in two affect recognition problems with real data, the estimated variances are correlated with measures of uncertainty/error in the labels that are extracted either by considering multiple annotations of the data, or by manually cleaning the dataset.
@inproceedings{foteinopoulou2022acii, title = {Estimating continuous affect with label uncertainty}, author = {Foteinopoulou, Niki Maria and Tzelepis, Christos and Patras, Ioannis}, tags = {affective-computing}, booktitle = {9th International Conference on Affective Computing and Intelligent Interaction, {ACII} 2021, Nara, Japan, September 28 - Oct. 1, 2021}, pages = {1--8}, publisher = {{IEEE}}, year = {2021}, url = {https://doi.org/10.1109/ACII52823.2021.9597425}, doi = {10.1109/ACII52823.2021.9597425} }
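The KL-divergence loss mentioned in the abstract above relies on the closed form for two uni-variate Gaussians, which can be sketched as follows. The function name, the argument order, and the direction of the divergence are assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
import math


def kl_gaussians(mean_p, var_p, mean_q, var_q):
    """KL(N(mean_p, var_p) || N(mean_q, var_q)) for strictly positive variances."""
    return 0.5 * (
        math.log(var_q / var_p)
        + (var_p + (mean_p - mean_q) ** 2) / var_q
        - 1.0
    )
```

The divergence is zero when both distributions coincide and grows with the squared distance between the means, scaled by the variance of the second distribution.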

Multimodal Machine Learning (Vision and Language)

2024

Improving Fairness using Vision-Language Driven Image Augmentation

Moreno D’Incà, Christos Tzelepis, Ioannis Patras, and 1 more author

In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024

Abs PDF Code
generation-and-learning multimodal-ml

Fairness is crucial when training a deep-learning discriminative model, especially in the facial domain. Models tend to correlate specific characteristics (such as age and skin color) with unrelated attributes (downstream tasks), resulting in biases which do not correspond to reality. It is common knowledge that these correlations are present in the data and are then transferred to the models during training. This paper proposes a method to mitigate these correlations to improve fairness. To do so, we learn interpretable and meaningful paths lying in the semantic space of a pre-trained diffusion model (DiffAE) – such paths being supervised by contrastive text dipoles. That is, we learn to edit protected characteristics (age and skin color). These paths are then applied to augment images to improve the fairness of a given dataset. We test the proposed method on CelebA-HQ and UTKFace on several downstream tasks with age and skin color as protected characteristics. As a proxy for fairness, we compute the difference in accuracy with respect to the protected characteristics. Quantitative results show how the augmented images help the model improve the overall accuracy, the aforementioned metric, and the disparity of equal opportunity.
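The fairness proxy described above, the difference in accuracy with respect to a protected characteristic, can be sketched as below. The function name is hypothetical, and taking the gap between the best and worst group is an assumption for the case of more than two groups; the abstract only states that a difference in accuracy is computed.

```python
from collections import defaultdict


def accuracy_gap(labels, predictions, groups):
    """Per-group accuracy and the gap between the best and worst group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, p, g in zip(labels, predictions, groups):
        total[g] += 1
        correct[g] += int(y == p)
    per_group = {g: correct[g] / total[g] for g in total}
    return per_group, max(per_group.values()) - min(per_group.values())
```

A gap of zero means the classifier is equally accurate on every group of the protected characteristic.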

2023

A Simple Baseline for Knowledge-Based Visual Question Answering

Alexandros Xenos, Themos Stafylakis, Ioannis Patras, and 1 more author

In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

Abs PDF Code
multimodal-ml

This paper is on the problem of Knowledge-Based Visual Question Answering (KB-VQA). Recent works have emphasized the significance of incorporating both explicit (through external databases) and implicit (through LLMs) knowledge to answer questions requiring external knowledge effectively. A common limitation of such approaches is that they consist of relatively complicated pipelines and often heavily rely on accessing GPT-3 API. Our main contribution in this paper is to propose a much simpler and readily reproducible pipeline which, in a nutshell, is based on efficient in-context learning by prompting LLaMA (1 and 2) using question-informative captions as contextual information. Contrary to recent approaches, our method is training-free, does not require access to external databases or APIs, and yet achieves state-of-the-art accuracy on the OK-VQA and A-OK-VQA datasets. Finally, we perform several ablation studies to understand important aspects of our method. Our code is publicly available at https://github.com/alexandrosXe/A-Simple-Baseline-For-Knowledge-Based-VQA.
EmoCLIP: A Vision-Language Method for Zero-Shot Video Facial Expression Recognition

Niki Maria Foteinopoulou, and Ioannis Patras

ArXiv, 2023

Abs PDF Code
multimodal-ml

Facial Expression Recognition (FER) is a crucial task in affective computing, but its conventional focus on the seven basic emotions limits its applicability to the complex and expanding emotional spectrum. To address the issue of new and unseen emotions present in dynamic in-the-wild FER, we propose a novel vision-language model that utilises sample-level text descriptions (i.e. captions of the context, expressions or emotional cues) as natural language supervision, aiming to enhance the learning of rich latent representations, for zero-shot classification. To test this, we evaluate using zero-shot classification of the model trained on sample-level descriptions on four popular dynamic FER datasets. Our findings show that this approach yields significant improvements when compared to baseline methods. Specifically, for zero-shot video FER, we outperform CLIP by over 10% in terms of Weighted Average Recall and 5% in terms of Unweighted Average Recall on several datasets. Furthermore, we evaluate the representations obtained from the network trained using sample-level descriptions on the downstream task of mental health symptom estimation, achieving performance comparable or superior to state-of-the-art methods and strong agreement with human experts. Namely, we achieve a Pearson’s Correlation Coefficient of up to 0.85 on schizophrenia symptom severity estimation, which is comparable to human experts’ agreement. The code is publicly available on: https://github.com/NickyFot/EmoCLIP.
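The two metrics quoted in the abstract above follow their usual definitions: Weighted Average Recall weights each class recall by its number of samples (which equals overall accuracy), while Unweighted Average Recall is the plain mean of the per-class recalls. A minimal sketch, with a hypothetical function name and no connection to the EmoCLIP code base:

```python
from collections import Counter


def war_uar(labels, predictions):
    """Weighted and unweighted average recall over the classes present in labels."""
    support = Counter(labels)
    hits = Counter(y for y, p in zip(labels, predictions) if y == p)
    recalls = {c: hits[c] / n for c, n in support.items()}
    war = sum(recalls[c] * n for c, n in support.items()) / len(labels)
    uar = sum(recalls.values()) / len(recalls)
    return war, uar
```

On imbalanced data the two can differ sharply: a model that always predicts the majority class keeps a high WAR but a low UAR.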
Parts of Speech-Grounded Subspaces in Vision-Language Models

James Oldfield, Christos Tzelepis, Yannis Panagakis, and 2 more authors

In Advances in Neural Information Processing Systems (NeurIPS), 2023

Abs PDF Code Website
generation-and-learning multimodal-ml

Latent image representations arising from vision-language models have proved immensely useful for a variety of downstream tasks. However, their utility is limited by their entanglement with respect to different visual attributes. For instance, recent work has shown that CLIP image representations are often biased toward specific visual properties (such as objects or actions) in an unpredictable manner. In this paper, we propose to separate representations of the different visual modalities in CLIP’s joint vision-language space by leveraging the association between parts of speech and specific visual modes of variation (e.g. nouns relate to objects, adjectives describe appearance). This is achieved by formulating an appropriate component analysis model that learns subspaces capturing variability corresponding to a specific part of speech, while jointly minimising variability to the rest. Such a subspace yields disentangled representations of the different visual properties of an image or text in closed form while respecting the underlying geometry of the manifold on which the representations lie. What’s more, we show the proposed model additionally facilitates learning subspaces corresponding to specific visual appearances (e.g. artists’ painting styles), which enables the selective removal of entire visual themes from CLIP-based text-to-image synthesis. We validate the model both qualitatively, by visualising the subspace projections with a text-to-image model and by preventing the imitation of artists’ styles, and quantitatively, through class invariance metrics and improvements to baseline zero-shot classification.
Prompting Visual-Language Models for Dynamic Facial Expression Recognition

Zengqun Zhao, and Ioannis Patras

In British Machine Vision Conference (BMVC), 2023

Abs PDF Code
affective-computing multimodal-ml

This paper presents a novel visual-language model called DFER-CLIP, which is based on the CLIP model and designed for in-the-wild Dynamic Facial Expression Recognition (DFER). Specifically, the proposed DFER-CLIP consists of a visual part and a textual part. For the visual part, based on the CLIP image encoder, a temporal model consisting of several Transformer encoders is introduced for extracting temporal facial expression features, and the final feature embedding is obtained as a learnable "class" token. For the textual part, we use as inputs textual descriptions of the facial behaviour that is related to the classes (facial expressions) that we are interested in recognising – those descriptions are generated using large language models, like ChatGPT. This contrasts with works that use only the class names, and more accurately captures the relationship between them. Alongside the textual description, we introduce a learnable token which helps the model learn relevant context information for each expression during training. Extensive experiments demonstrate the effectiveness of the proposed method and show that our DFER-CLIP also achieves state-of-the-art results compared with the current supervised DFER methods on the DFEW, FERV39k, and MAFW benchmarks.

2022

ContraCLIP: Interpretable GAN generation driven by pairs of contrasting sentences

Christos Tzelepis, James Oldfield, Georgios Tzimiropoulos, and 1 more author

ArXiv, 2022

Abs Bib PDF Code
generation-and-learning multimodal-ml

This work addresses the problem of discovering non-linear interpretable paths in the latent space of pre-trained GANs in a model-agnostic manner. In the proposed method, the discovery is driven by a set of pairs of natural language sentences with contrasting semantics, named semantic dipoles, that serve as the “limits” of the interpretation that we require the trainable latent paths to encode. By using the pre-trained CLIP encoder, the sentences are projected into the vision-language space, where they serve as dipoles, and where RBF-based warping functions define a set of non-linear directional paths, one for each semantic dipole, allowing in this way traversals from one semantic pole to the other. By defining an objective that discovers paths in the latent space of GANs that generate changes along the desired paths in the vision-language embedding space, we provide an intuitive way of controlling the underlying generative factors and address some of the limitations of the state-of-the-art works, namely, that a) they are typically tailored to specific GAN architectures (i.e., StyleGAN), b) they disregard the relative position of the manipulated and the original image in the image embedding and the relative position of the image and the text embeddings, and c) they lead to abrupt image manipulations and quickly arrive at regions of low density and, thus, low image quality, providing limited control of the generative factors. We provide extensive qualitative and quantitative results that demonstrate our claims with two pre-trained GANs, and make the code and the pre-trained models publicly available at: https://github.com/chi0tzp/ContraCLIP.
@article{tzelepis2022contraclip, title = {{ContraCLIP}: Interpretable {GAN} generation driven by pairs of contrasting sentences}, author = {Tzelepis, Christos and Oldfield, James and Tzimiropoulos, Georgios and Patras, Ioannis}, select_key = {true}, tags = {generation-and-learning,multimodal-ml}, journal = {ArXiv}, volume = {abs/2206.02104}, year = {2022}, url = {https://doi.org/10.48550/arXiv.2206.02104}, doi = {10.48550/arXiv.2206.02104}, eprinttype = {arXiv}, eprint = {2206.02104} }

2021

Affective Computing

2024

2023

Prompting Visual-Language Models for Dynamic Facial Expression Recognition

Zengqun Zhao, and Ioannis Patras

In British Machine Vision Conference (BMVC), 2023

Abs PDF Code
affective-computing multimodal-ml

This paper presents a novel visual-language model called DFER-CLIP, which is based on the CLIP model and designed for in-the-wild Dynamic Facial Expression Recognition (DFER). Specifically, the proposed DFER-CLIP consists of a visual part and a textual part. For the visual part, based on the CLIP image encoder, a temporal model consisting of several Transformer encoders is introduced for extracting temporal facial expression features, and the final feature embedding is obtained as a learnable "class" token. For the textual part, we use as inputs textual descriptions of the facial behaviour that is related to the classes (facial expressions) that we are interested in recognising – those descriptions are generated using large language models, like ChatGPT. This contrasts with works that use only the class names, and more accurately captures the relationship between them. Alongside the textual description, we introduce a learnable token which helps the model learn relevant context information for each expression during training. Extensive experiments demonstrate the effectiveness of the proposed method and show that our DFER-CLIP also achieves state-of-the-art results compared with the current supervised DFER methods on the DFEW, FERV39k, and MAFW benchmarks.

2022

CovMix: Covariance Mixing Regularization for Motor Imagery Decoding

Georgios Zoumpourlis, and Ioannis Patras

In 10th International Winter Conference on Brain-Computer Interface, BCI 2022, Gangwon-do, Republic of Korea, February 21-23, 2022, 2022

PDF Website
affective-computing
Learning from Label Relationships in Human Affect

Niki Maria Foteinopoulou, and Ioannis Patras

In Proceedings of the 30th ACM International Conference on Multimedia, 2022

PDF Website
affective-computing

2021

AMIGOS: A Dataset for Affect, Personality and Mood Research on Individuals and Groups

Juan Abdon Miranda Correa, Mojtaba Khomami Abadi, Nicu Sebe, and 1 more author

IEEE Trans. Affect. Comput., 2021

PDF Website
affective-computing
SchiNet: Automatic Estimation of Symptoms of Schizophrenia from Facial Behaviour Analysis

Mina Bishay, Petar Palasek, Stefan Priebe, and 1 more author

IEEE Trans. Affect. Comput., 2021

PDF Website
affective-computing
Pairwise Ranking Network for Affect Recognition

Georgios Zoumpourlis, and Ioannis Patras

In 9th International Conference on Affective Computing and Intelligent Interaction, ACII 2021, Nara, Japan, September 28 - October 1, 2021

PDF Website
affective-computing
Estimating continuous affect with label uncertainty

Niki Maria Foteinopoulou, Christos Tzelepis, and Ioannis Patras

In 9th International Conference on Affective Computing and Intelligent Interaction, ACII 2021, Nara, Japan, September 28 - October 1, 2021

Abs Bib PDF Code
affective-computing

Continuous affect estimation is a problem where there is an inherent uncertainty and subjectivity in the labels that accompany data samples; typically, datasets use the average of multiple annotations or self-reporting to obtain ground truth labels. In this work, we propose a method for uncertainty-aware continuous affect estimation that explicitly models the uncertainty of the ground truth label as a univariate Gaussian with mean equal to the ground truth label and unknown variance. For each sample, the proposed neural network estimates not only the value of the target label (valence and arousal in our case), but also the variance. The network is trained with a loss that is defined as the KL-divergence between the estimation (valence/arousal) and the Gaussian around the ground truth. We show that, in two affect recognition problems with real data, the estimated variances are correlated with measures of uncertainty/error in the labels that are extracted either by considering multiple annotations of the data or by manually cleaning the dataset.
@inproceedings{foteinopoulou2022acii, title = {Estimating continuous affect with label uncertainty}, author = {Foteinopoulou, Niki Maria and Tzelepis, Christos and Patras, Ioannis}, tags = {affective-computing}, booktitle = {9th International Conference on Affective Computing and Intelligent Interaction, {ACII} 2021, Nara, Japan, September 28 - Oct. 1, 2021}, pages = {1--8}, publisher = {{IEEE}}, year = {2021}, url = {https://doi.org/10.1109/ACII52823.2021.9597425}, doi = {10.1109/ACII52823.2021.9597425} }
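As a rough illustration of the loss described in the abstract above, the following sketch computes the closed-form KL divergence between two univariate Gaussians. The function names, the direction of the divergence, and the way the label variance is supplied are assumptions made for this example; the abstract does not specify them, so this should not be read as the paper's exact formulation.

```python
import math


def gaussian_kl(mu_p: float, var_p: float, mu_q: float, var_q: float) -> float:
    """Closed-form KL( N(mu_p, var_p) || N(mu_q, var_q) ) for univariate Gaussians."""
    if var_p <= 0 or var_q <= 0:
        raise ValueError("variances must be positive")
    return 0.5 * (math.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)


def uncertainty_loss(pred_mean: float, pred_var: float, label: float, label_var: float) -> float:
    """Hypothetical per-sample loss: KL from a Gaussian centred on the label
    to the Gaussian predicted by the network (direction is an assumption)."""
    return gaussian_kl(label, label_var, pred_mean, pred_var)


if __name__ == "__main__":
    # A prediction that matches the label distribution exactly costs nothing.
    print(uncertainty_loss(pred_mean=0.3, pred_var=0.05, label=0.3, label_var=0.05))
    # The same error in the mean is penalised less when the predicted variance is larger.
    print(uncertainty_loss(pred_mean=0.6, pred_var=0.05, label=0.3, label_var=0.05))
    print(uncertainty_loss(pred_mean=0.6, pred_var=0.20, label=0.3, label_var=0.05))
```

With such a loss, a network can reduce the penalty on hard or ambiguous samples by predicting a larger variance, which is one plausible reading of why the estimated variances track label uncertainty.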

Generation and Learning

2024

LAFS: Landmark-based Facial Self-supervised Learning for Face Recognition

Ioannis Patras, Zhonglin Sun, and Georgios Tzimiropoulos

In IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2024

Abs PDF Code
generation-and-learning

In this work we focus on learning facial representations that can be adapted to train effective face recognition models, particularly in the absence of labels. Firstly, compared with existing labelled face datasets, a vastly larger magnitude of unlabeled faces exists in the real world. We explore the learning strategy of these unlabeled facial images through self-supervised pretraining to transfer generalized face recognition performance. Moreover, motivated by one recent finding, that is, the face saliency area is critical for face recognition, in contrast to utilizing random cropped blocks of images for constructing augmentations in pretraining, we utilize patches localized by extracted facial landmarks. This enables our method, namely LAndmark-based Facial Self-supervised learning (LAFS), to learn key representations that are more critical for face recognition. We also incorporate two landmark-specific augmentations which introduce more diversity of landmark information to further regularize the learning. With learned landmark-based facial representations, we further adapt the representation for face recognition with regularization mitigating variations in landmark positions. Our method achieves significant improvement over the state-of-the-art on multiple face recognition benchmarks, especially on more challenging few-shot scenarios.
Self-Supervised Facial Representation Learning with Facial Region Awareness

Zheng Gao, and Ioannis Patras

In IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2024

Abs PDF Code
generation-and-learning

Self-supervised pre-training has been proved to be effective in learning transferable representations that benefit various visual tasks. This paper asks this question: can self-supervised pre-training learn general facial representations for various facial analysis tasks? Recent efforts toward this goal are limited to treating each face image as a whole, i.e., learning consistent facial representations at the image-level, which overlooks the consistency of local facial representations (i.e., facial regions like eyes, nose, etc). In this work, we make a first attempt to propose a novel self-supervised facial representation learning framework to learn consistent global and local facial representations, Facial Region Awareness (FRA). Specifically, we explicitly enforce the consistency of facial regions by matching the local facial representations across views, which are extracted with learned heatmaps highlighting the facial regions. Inspired by the mask prediction in supervised semantic segmentation, we obtain the heatmaps via cosine similarity between the per-pixel projection of feature maps and facial mask embeddings computed from learnable positional embeddings, which leverage the attention mechanism to globally look up the facial image for facial regions. To learn such heatmaps, we formulate the learning of facial mask embeddings as a deep clustering problem by assigning the pixel features from the feature maps to them. The transfer learning results on facial classification and regression tasks show that our FRA outperforms previous pre-trained models and more importantly, using ResNet as the unified backbone for various tasks, our FRA achieves comparable or even better performance compared with SOTA methods in facial analysis tasks.
Improving Fairness using Vision-Language Driven Image Augmentation

Moreno D’Incà, Christos Tzelepis, Ioannis Patras, and 1 more author

In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024

Abs PDF Code
generation-and-learning multimodal-ml

Fairness is crucial when training a deep-learning discriminative model, especially in the facial domain. Models tend to correlate specific characteristics (such as age and skin color) with unrelated attributes (downstream tasks), resulting in biases which do not correspond to reality. It is common knowledge that these correlations are present in the data and are then transferred to the models during training. This paper proposes a method to mitigate these correlations to improve fairness. To do so, we learn interpretable and meaningful paths lying in the semantic space of a pre-trained diffusion model (DiffAE) – such paths being supervised by contrastive text dipoles. That is, we learn to edit protected characteristics (age and skin color). These paths are then applied to augment images to improve the fairness of a given dataset. We test the proposed method on CelebA-HQ and UTKFace on several downstream tasks with age and skin color as protected characteristics. As a proxy for fairness, we compute the difference in accuracy with respect to the protected characteristics. Quantitative results show how the augmented images help the model improve the overall accuracy, the aforementioned metric, and the disparity of equal opportunity.

2023

Parts of Speech-Grounded Subspaces in Vision-Language Models

James Oldfield, Christos Tzelepis, Yannis Panagakis, and 2 more authors

In Advances in Neural Information Processing Systems (NeurIPS), 2023

Abs PDF Code Website
generation-and-learning multimodal-ml

Latent image representations arising from vision-language models have proved immensely useful for a variety of downstream tasks. However, their utility is limited by their entanglement with respect to different visual attributes. For instance, recent work has shown that CLIP image representations are often biased toward specific visual properties (such as objects or actions) in an unpredictable manner. In this paper, we propose to separate representations of the different visual modalities in CLIP’s joint vision-language space by leveraging the association between parts of speech and specific visual modes of variation (e.g. nouns relate to objects, adjectives describe appearance). This is achieved by formulating an appropriate component analysis model that learns subspaces capturing variability corresponding to a specific part of speech, while jointly minimising variability to the rest. Such a subspace yields disentangled representations of the different visual properties of an image or text in closed form while respecting the underlying geometry of the manifold on which the representations lie. What’s more, we show the proposed model additionally facilitates learning subspaces corresponding to specific visual appearances (e.g. artists’ painting styles), which enables the selective removal of entire visual themes from CLIP-based text-to-image synthesis. We validate the model both qualitatively, by visualising the subspace projections with a text-to-image model and by preventing the imitation of artists’ styles, and quantitatively, through class invariance metrics and improvements to baseline zero-shot classification.
HyperReenact: One-Shot Reenactment via Jointly Learning to Refine and Retarget Faces

Stella Bounareli, Christos Tzelepis, Vasileios Argyriou, and 2 more authors

2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023

Abs Bib PDF Code Website
generation-and-learning

In this paper, we present our method for neural face reenactment, called HyperReenact, that aims to generate realistic talking head images of a source identity, driven by a target facial pose. Existing state-of-the-art face reenactment methods train controllable generative models that learn to synthesize realistic facial images, yet they produce reenacted faces that are prone to significant visual artifacts, especially under the challenging condition of extreme head pose changes, or require expensive few-shot fine-tuning to better preserve the source identity characteristics. We propose to address these limitations by leveraging the photorealistic generation ability and the disentangled properties of a pretrained StyleGAN2 generator, by first inverting the real images into its latent space and then using a hypernetwork to perform: (i) refinement of the source identity characteristics and (ii) facial pose re-targeting, thereby eliminating the dependence on external editing methods that typically produce artifacts. Our method operates under the one-shot setting (i.e., using a single source frame) and allows for cross-subject reenactment, without requiring any subject-specific fine-tuning. We compare our method both quantitatively and qualitatively against several state-of-the-art techniques on the standard benchmarks of VoxCeleb1 and VoxCeleb2, demonstrating the superiority of our approach in producing artifact-free images, exhibiting remarkable robustness even under extreme head pose changes.
@inproceedings{bounareli2023iccv, select_key = {true}, tags = {generation-and-learning}, title = {{HyperReenact}: One-Shot Reenactment via Jointly Learning to Refine and Retarget Faces}, author = {Bounareli, Stella and Tzelepis, Christos and Argyriou, Vasileios and Patras, Ioannis and Tzimiropoulos, Georgios}, booktitle = {2023 IEEE/CVF International Conference on Computer Vision (ICCV)}, year = {2023} }
Attribute-preserving Face Dataset Anonymization via Latent Code Optimization

Simone Barattin, Christos Tzelepis, Ioannis Patras, and 1 more author

In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023

Abs PDF Code
generation-and-learning

This work addresses the problem of anonymizing the identity of faces in a dataset of images, such that the privacy of those depicted is not violated, while at the same time the dataset remains useful for downstream tasks such as training machine learning models. To the best of our knowledge, we are the first to explicitly address this issue and deal with two major drawbacks of the existing state-of-the-art approaches, namely that they (i) require the costly training of additional, purpose-trained neural networks, and/or (ii) fail to retain the facial attributes of the original images in the anonymized counterparts, the preservation of which is of paramount importance for their use in downstream tasks. We accordingly present a task-agnostic anonymization procedure that directly optimizes the images’ latent representation in the latent space of a pre-trained GAN. By optimizing the latent codes directly, we ensure both that the identity is of a desired distance away from the original (with an identity obfuscation loss), whilst preserving the facial attributes (using a novel feature-matching loss in FaRL’s deep feature space). We demonstrate through a series of both qualitative and quantitative experiments that our method is capable of anonymizing the identity of the images whilst, crucially, better preserving the facial attributes.
PandA: Unsupervised Learning of Parts and Appearances in the Feature Maps of GANs

James Oldfield, Christos Tzelepis, Yannis Panagakis, and 2 more authors

In The Eleventh International Conference on Learning Representations (ICLR), 2023

Abs PDF Code Website
generation-and-learning

Recent advances in the understanding of Generative Adversarial Networks (GANs) have led to remarkable progress in visual editing and synthesis tasks, capitalizing on the rich semantics that are embedded in the latent spaces of pre-trained GANs. However, existing methods are often tailored to specific GAN architectures and are limited to either discovering global semantic directions that do not facilitate localized control, or require some form of supervision through manually provided regions or segmentation masks. In this light, we present an architecture-agnostic approach that jointly discovers factors representing spatial parts and their appearances in an entirely unsupervised fashion. These factors are obtained by applying a semi-nonnegative tensor factorization on the feature maps, which in turn enables context-aware local image editing with pixel-level control. In addition, we show that the discovered appearance factors correspond to saliency maps that localize concepts of interest, without using any labels. Experiments on a wide range of GAN architectures and datasets show that, in comparison to the state of the art, our method is far more efficient in terms of training time and, most importantly, provides much more accurate localized control. Our code is available at: https://github.com/james-oldfield/PandA.
StyleMask: Disentangling the Style Space of StyleGAN2 for Neural Face Reenactment

Stella Bounareli, Christos Tzelepis, Vasileios Argyriou, and 2 more authors

In 17th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2023, Waikoloa Beach, HI, USA, January 5-8, 2023

Abs Bib PDF Code
generation-and-learning

In this paper we address the problem of neural face reenactment, where, given a pair of a source and a target facial image, we need to transfer the target’s pose (defined as the head pose and its facial expressions) to the source image, by preserving at the same time the source’s identity characteristics (e.g., facial shape, hair style, etc.), even in the challenging case where the source and the target faces belong to different identities. In doing so, we address some of the limitations of the state-of-the-art works, namely, a) that they depend on paired training data (i.e., source and target faces have the same identity), b) that they rely on labeled data during inference, and c) that they do not preserve identity in large head pose changes. More specifically, we propose a framework that, using unpaired randomly generated facial images, learns to disentangle the identity characteristics of the face from its pose by incorporating the recently introduced style space S of StyleGAN2, a latent representation space that exhibits remarkable disentanglement properties. By capitalizing on this, we learn to successfully mix a pair of source and target style codes using supervision from a 3D model. The resulting latent code, that is subsequently used for reenactment, consists of latent units corresponding to the facial pose of the target only and of units corresponding to the identity of the source only, leading to notable improvement in the reenactment performance compared to recent state-of-the-art methods. In comparison to the state of the art, we quantitatively and qualitatively show that the proposed method produces higher quality results even on extreme pose variations. Finally, we report results on real images by first embedding them on the latent space of the pretrained generator. We make the code and pretrained models publicly available.
@inproceedings{bounareli2023stylemask, title = {StyleMask: Disentangling the Style Space of StyleGAN2 for Neural Face Reenactment}, author = {Bounareli, Stella and Tzelepis, Christos and Argyriou, Vasileios and Patras, Ioannis and Tzimiropoulos, Georgios}, tags = {generation-and-learning}, booktitle = {17th {IEEE} International Conference on Automatic Face and Gesture Recognition, {FG} 2023, Waikoloa Beach, HI, USA, January 5-8, 2023}, pages = {1--8}, publisher = {{IEEE}}, year = {2023}, doi = {10.1109/FG57933.2023.10042744} }
"Just To See You Smile": SMILEY, a Voice-Guided GUY GAN

Qi Yang, Christos Tzelepis, Sergey Nikolenko, and 2 more authors

In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, WSDM 2023, Singapore, 27 February 2023 - 3 March 2023

Abs Bib PDF Code
generation-and-learning

In this technical demonstration, we present SMILEY, a voice-guided virtual assistant. The system utilizes a deep neural architecture ContraCLIP to manipulate facial attributes using voice instructions, allowing for deeper speaker engagement and smoother customer experience when being used in the "virtual concierge" scenario. We validate the effectiveness of SMILEY and ContraCLIP via a successful real-world case study in Singapore and a large-scale quantitative evaluation.
@inproceedings{yang2023smiley, title = {"Just To See You Smile": {SMILEY}, a Voice-Guided {GUY} {GAN}}, author = {Yang, Qi and Tzelepis, Christos and Nikolenko, Sergey and Patras, Ioannis and Farseev, Aleksandr}, tags = {generation-and-learning}, booktitle = {Proceedings of the Sixteenth {ACM} International Conference on Web Search and Data Mining, {WSDM} 2023, Singapore, 27 February 2023 - 3 March 2023}, editor = {Chua, Tat{-}Seng and Lauw, Hady W. and Si, Luo and Terzi, Evimaria and Tsaparas, Panayiotis}, pages = {1196--1199}, publisher = {{ACM}}, year = {2023}, doi = {10.1145/3539597.3573031} }

2022

ContraCLIP: Interpretable GAN generation driven by pairs of contrasting sentences

Christos Tzelepis, James Oldfield, Georgios Tzimiropoulos, and 1 more author

ArXiv, 2022

Abs Bib PDF Code
generation-and-learning multimodal-ml

This work addresses the problem of discovering non-linear interpretable paths in the latent space of pre-trained GANs in a model-agnostic manner. In the proposed method, the discovery is driven by a set of pairs of natural language sentences with contrasting semantics, named semantic dipoles, that serve as the “limits” of the interpretation that we require by the trainable latent paths to encode. By using the pre-trained CLIP encoder, the sentences are projected into the vision-language space, where they serve as dipoles, and where RBF-based warping functions define a set of non-linear directional paths, one for each semantic dipole, allowing in this way traversals from one semantic pole to the other. By defining an objective that discovers paths in the latent space of GANs that generate changes along the desired paths in the vision-language embedding space, we provide an intuitive way of controlling the underlying generative factors and address some of the limitations of the state-of-the-art works, namely, that a) they are typically tailored to specific GAN architectures (i.e., StyleGAN), b) they disregard the relative position of the manipulated and the original image in the image embedding and the relative position of the image and the text embeddings, and c) they lead to abrupt image manipulations and quickly arrive at regions of low density and, thus, low image quality, providing limited control of the generative factors. We provide extensive qualitative and quantitative results that demonstrate our claims with two pre-trained GANs, and make the code and the pre-trained models publicly available at: https://github.com/chi0tzp/ContraCLIP.
@article{tzelepis2022contraclip, title = {{ContraCLIP}: Interpretable {GAN} generation driven by pairs of contrasting sentences}, author = {Tzelepis, Christos and Oldfield, James and Tzimiropoulos, Georgios and Patras, Ioannis}, select_key = {true}, tags = {generation-and-learning,multimodal-ml}, journal = {ArXiv}, volume = {abs/2206.02104}, year = {2022}, url = {https://doi.org/10.48550/arXiv.2206.02104}, doi = {10.48550/arXiv.2206.02104}, eprinttype = {arXiv}, eprint = {2206.02104} }

2021

Tensor Component Analysis for Interpreting the Latent Space of GANs

James Oldfield, Markos Georgopoulos, Yannis Panagakis, and 2 more authors

In 32nd British Machine Vision Conference 2021, BMVC 2021, Online, November 22-25, 2021

PDF Website
generation-and-learning
WarpedGANSpace: Finding non-linear RBF paths in GAN latent space

Christos Tzelepis, Georgios Tzimiropoulos, and Ioannis Patras

2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021

Abs Bib PDF Code
generation-and-learning

This work addresses the problem of discovering, in an unsupervised manner, interpretable paths in the latent space of pretrained GANs, so as to provide an intuitive and easy way of controlling the underlying generative factors. In doing so, it addresses some of the limitations of the state-of-the-art works, namely, a) that they discover directions that are independent of the latent code, i.e., paths that are linear, and b) that their evaluation relies either on visual inspection or on laborious human labeling. More specifically, we propose to learn non-linear warpings on the latent space, each one parametrized by a set of RBF-based latent space warping functions, and where each warping gives rise to a family of non-linear paths via the gradient of the function. Building on the work of Voynov and Babenko, which discovers linear paths, we optimize the trainable parameters of the set of RBFs so that images that are generated by codes along different paths are easily distinguishable by a discriminator network. This leads to easily distinguishable image transformations, such as pose and facial expressions in facial images. We show that linear paths can be derived as a special case of our method, and show experimentally that non-linear paths in the latent space lead to steeper, more disentangled and interpretable changes in the image space than in state-of-the-art methods, both qualitatively and quantitatively. We make the code and the pretrained models publicly available at https://github.com/chi0tzp/WarpedGANSpace.
@inproceedings{tzelepis2021warpedganspace, title = {{WarpedGANSpace}: Finding non-linear {RBF} paths in {GAN} latent space}, author = {Tzelepis, Christos and Tzimiropoulos, Georgios and Patras, Ioannis}, tags = {generation-and-learning}, booktitle = {2021 IEEE/CVF International Conference on Computer Vision (ICCV)}, year = {2021}, pages = {6373--6382} }
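The idea of a warping function whose gradient defines a non-linear latent path can be sketched as follows. This is a minimal illustration assuming a warping of the form f(z) = Σ_i α_i exp(-γ_i ||z - s_i||²) with hypothetical support vectors `supports`, weights `alphas`, and scales `gammas`; the abstract only states that the warpings are RBF-based, so the exact parametrization, the normalized step, and all names below are assumptions rather than the released implementation.

```python
import math


def _sq_dist(z, s):
    """Squared Euclidean distance between two vectors given as lists."""
    return sum((zi - si) ** 2 for zi, si in zip(z, s))


def rbf_warp(z, supports, alphas, gammas):
    """Assumed warping: f(z) = sum_i alpha_i * exp(-gamma_i * ||z - s_i||^2)."""
    return sum(a * math.exp(-g * _sq_dist(z, s))
               for s, a, g in zip(supports, alphas, gammas))


def rbf_warp_grad(z, supports, alphas, gammas):
    """Analytic gradient of rbf_warp with respect to z."""
    grad = [0.0] * len(z)
    for s, a, g in zip(supports, alphas, gammas):
        coeff = -2.0 * a * g * math.exp(-g * _sq_dist(z, s))
        for d in range(len(z)):
            grad[d] += coeff * (z[d] - s[d])
    return grad


def step_along_path(z, supports, alphas, gammas, eps=0.1):
    """Move z by a step of length eps along the normalized gradient.
    The direction depends on z, so repeated steps trace a non-linear path."""
    grad = rbf_warp_grad(z, supports, alphas, gammas)
    norm = math.sqrt(sum(gi * gi for gi in grad))
    if norm == 0.0:
        return list(z)
    return [zi + eps * gi / norm for zi, gi in zip(z, grad)]


if __name__ == "__main__":
    # Two-dimensional toy latent space with one positive and one negative pole.
    supports = [[1.0, 0.0], [-1.0, 0.0]]
    alphas = [1.0, -1.0]
    gammas = [1.0, 1.0]
    z = [0.2, 0.3]
    for _ in range(3):
        z = step_along_path(z, supports, alphas, gammas)
        print(z)
```

Because the gradient is re-evaluated at every latent code, the shift direction changes along the traversal, which is the contrast with a single fixed (linear) direction.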

Learning from few samples

2024

Self-Supervised Representation Learning with Cross-Context Learning between Global and Hypercolumn Features

Zheng Gao, Chen Feng, and Ioannis Patras

In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024

Abs PDF Code
learning-from-few-samples

Whilst contrastive learning yields powerful representations by matching different augmented views of the same instance, it lacks the ability to capture the similarities between different instances. One popular way to address this limitation is by learning global features (after the global pooling) to capture inter-instance relationships based on knowledge distillation, where the global features of the teacher are used to guide the learning of the global features of the student. Inspired by cross-modality learning, we extend this existing framework that only learns from global features by encouraging the global features and intermediate layer features to learn from each other. This leads to our novel self-supervised framework: cross-context learning between global and hypercolumn features (CGH), that enforces the consistency of instance relations between low- and high-level semantics. Specifically, we stack the intermediate feature maps to construct a hypercolumn representation so that we can measure instance relations using two contexts (hypercolumn and global feature) separately, and then use the relations of one context to guide the learning of the other. This cross-context learning allows the model to learn from the differences between the two contexts. The experimental results on linear classification and downstream tasks show that our method outperforms the state-of-the-art methods.

2023

SimDETR: Simplifying self-supervised pretraining for DETR

Ioannis Maniadis Metaxas, Adrian Bulat, Ioannis Patras, and 2 more authors

2023

Abs PDF
learning-from-few-samples

DETR-based object detectors have achieved remarkable performance but are sample-inefficient and exhibit slow convergence. Unsupervised pretraining has been found to be helpful to alleviate these impediments, allowing training with large amounts of unlabeled data to improve the detector’s performance. However, existing methods have their own limitations, like keeping the detector’s backbone frozen in order to avoid performance degradation and utilizing pretraining objectives misaligned with the downstream task. To overcome these limitations, we propose a simple pretraining framework for DETR-based detectors that consists of three simple yet key ingredients: (i) richer, semantics-based initial proposals derived from high-level feature maps, (ii) discriminative training using object pseudo-labels produced via clustering, (iii) self-training to take advantage of the improved object proposals learned by the detector. We report two main findings: (1) Our pretraining outperforms prior DETR pretraining works on both the full and low data regimes by significant margins. (2) We show we can pretrain DETR from scratch (including the backbone) directly on complex image datasets like COCO, paving the path for unsupervised representation learning directly using DETR.
MaskCon: Masked Contrastive Learning for Coarse-Labelled Dataset

Chen Feng, and Ioannis Patras

In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023

Abs PDF Code
learning-from-few-samples

Deep learning has achieved great success in recent years with the aid of advanced neural network structures and large-scale human-annotated datasets. However, it is often costly and difficult to accurately and efficiently annotate large-scale datasets, especially for some specialized domains where fine-grained labels are required. In this setting, coarse labels are much easier to acquire as they do not require expert knowledge. In this work, we propose a contrastive learning method, called Masked Contrastive learning (MaskCon), to address the under-explored problem setting where we learn with a coarse-labelled dataset in order to address a finer labelling problem. More specifically, within the contrastive learning framework, for each sample our method generates soft-labels against other samples and another augmented view of the sample in question. In contrast to self-supervised contrastive learning, where only the sample’s augmentations are considered hard positives, and to supervised contrastive learning, where only samples with the same coarse labels are considered hard positives, we propose soft labels based on sample distances that are masked by the coarse labels. This allows us to utilize both inter-sample relations and coarse labels. We demonstrate that our method can obtain as special cases many existing state-of-the-art works and that it provides tighter bounds on the generalization error. Experimentally, our method achieves significant improvement over the current state-of-the-art in various datasets, including CIFAR10, CIFAR100, ImageNet-1K, Stanford Online Products and Stanford Cars196 datasets.
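The notion of soft labels that are derived from sample similarities and then masked by coarse labels can be illustrated with a small sketch. It assumes a temperature-scaled softmax over the similarities, restricted to the samples that share the query's coarse label; the function name, the temperature value, and this particular normalization are assumptions for illustration, and the handling of the sample's own augmented view is left out.

```python
import math


def masked_soft_labels(similarities, coarse_labels, query_label, temperature=0.1):
    """Soft labels over a set of keys for one query.

    similarities : similarity of the query to each key (e.g. cosine similarity)
    coarse_labels: coarse label of each key
    query_label  : coarse label of the query
    Keys with a different coarse label are masked out (weight 0); the remaining
    weights come from a temperature-scaled softmax and sum to 1.
    """
    mask = [label == query_label for label in coarse_labels]
    if not any(mask):
        return [0.0] * len(similarities)
    # Subtract the largest kept similarity for numerical stability.
    top = max(s for s, keep in zip(similarities, mask) if keep)
    exps = [math.exp((s - top) / temperature) if keep else 0.0
            for s, keep in zip(similarities, mask)]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    sims = [0.90, 0.80, 0.95, 0.10]
    labels = ["dog", "dog", "cat", "dog"]
    # The "cat" key is the most similar one but is masked out by the coarse label.
    print(masked_soft_labels(sims, labels, query_label="dog"))
```

In this toy case the weights concentrate on the closest same-class keys, so the soft labels carry finer-grained structure than the coarse label alone.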
DivClust: Controlling Diversity in Deep Clustering

Ioannis Maniadis Metaxas, Georgios Tzimiropoulos, and Ioannis Patras

In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023

Abs PDF Code
learning-from-few-samples

Clustering has been a major research topic in the field of machine learning, one to which Deep Learning has recently been applied with significant success. However, an aspect of clustering that is not addressed by existing deep clustering methods is that of efficiently producing multiple, diverse partitionings for a given dataset. This is particularly important, as a diverse set of base clusterings is necessary for consensus clustering, which has been found to produce better and more robust results than relying on a single clustering. To address this gap, we propose DivClust, a diversity controlling loss that can be incorporated into existing deep clustering frameworks to produce multiple clusterings with the desired degree of diversity. We conduct experiments with multiple datasets and deep clustering frameworks and show that: a) our method effectively controls diversity across frameworks and datasets with very small additional computational cost, b) the sets of clusterings learned by DivClust include solutions that significantly outperform single-clustering baselines, and c) using an off-the-shelf consensus clustering algorithm, DivClust produces consensus clustering solutions that consistently outperform single-clustering baselines, effectively improving the performance of the base deep clustering framework.

2022

SSR: An Efficient and Robust Framework for Learning with Unknown Label Noise

Chen Feng, Georgios Tzimiropoulos, and Ioannis Patras

In 33rd British Machine Vision Conference 2022, BMVC 2022, London, UK, November 21-24, 2022

PDF Code Website
learning-from-few-samples
Adaptive Soft Contrastive Learning

Chen Feng, and Ioannis Patras

In 26th International Conference on Pattern Recognition, ICPR 2022, Montreal, QC, Canada, August 21-25, 2022

PDF Code
learning-from-few-samples

2021

S3: Supervised Self-supervised Learning under Label Noise

Chen Feng, Georgios Tzimiropoulos, and Ioannis Patras

CoRR, 2021

learning-from-few-samples

Video understanding

2023

Self-Supervised Video Similarity Learning

Giorgos Kordopatis-Zilos, Giorgos Tolias, Christos Tzelepis, and 3 more authors

In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshop (CVPRW), 2023

Abs Bib PDF Code
video-understanding

We introduce S^2VS, a video similarity learning approach with self-supervision. Self-Supervised Learning (SSL) is typically used to train deep models on a proxy task so as to have strong transferability on target tasks after fine-tuning. Here, in contrast to prior work, SSL is used to perform video similarity learning and address multiple retrieval and detection tasks at once with no use of labeled data. This is achieved by learning via instance-discrimination with task-tailored augmentations and the widely used InfoNCE loss together with an additional loss operating jointly on self-similarity and hard-negative similarity. We benchmark our method on tasks where video relevance is defined with varying granularity, ranging from video copies to videos depicting the same incident or event. We learn a single universal model that achieves state-of-the-art performance on all tasks, surpassing previously proposed methods that use labeled data.
@inproceedings{Kordopatis2023s2vs,
  title = {Self-Supervised Video Similarity Learning},
  author = {Kordopatis{-}Zilos, Giorgos and Tolias, Giorgos and Tzelepis, Christos and Kompatsiaris, Ioannis and Patras, Ioannis and Papadopoulos, Symeon},
  tags = {video-understanding},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshop (CVPRW)},
  volume = {},
  year = {2023}
}

2022

DnS: Distill-and-Select for Efficient and Accurate Video Indexing and Retrieval

Giorgos Kordopatis-Zilos, Christos Tzelepis, Symeon Papadopoulos, and 2 more authors

International Journal of Computer Vision (IJCV), 2022

Abs Bib PDF Code
video-understanding

In this paper, we address the problem of high performance and computationally efficient content-based video retrieval in large-scale datasets. Current methods typically propose either: (i) fine-grained approaches employing spatio-temporal representations and similarity calculations, achieving high performance at a high computational cost or (ii) coarse-grained approaches representing/indexing videos as global vectors, where the spatio-temporal structure is lost, providing low performance but also having low computational cost. In this work, we propose a Knowledge Distillation framework, called Distill-and-Select (DnS), that starting from a well-performing fine-grained Teacher Network learns: a) Student Networks at different retrieval performance and computational efficiency trade-offs and b) a Selector Network that at test time rapidly directs samples to the appropriate student to maintain both high retrieval performance and high computational efficiency. We train several students with different architectures and arrive at different trade-offs of performance and efficiency, i.e., speed and storage requirements, including fine-grained students that store/index videos using binary representations. Importantly, the proposed scheme allows Knowledge Distillation in large, unlabelled datasets – this leads to good students. We evaluate DnS on five public datasets on three different video retrieval tasks and demonstrate a) that our students achieve state-of-the-art performance in several cases and b) that the DnS framework provides an excellent trade-off between retrieval performance, computational speed, and storage space. In specific configurations, the proposed method achieves similar mAP with the teacher but is 20 times faster and requires 240 times less storage space. The collected dataset and implementation are publicly available at https://github.com/mever-team/distill-and-select.
@article{kordopatis2022dns,
  title = {{DnS}: Distill-and-Select for Efficient and Accurate Video Indexing and Retrieval},
  author = {Kordopatis{-}Zilos, Giorgos and Tzelepis, Christos and Papadopoulos, Symeon and Kompatsiaris, Ioannis and Patras, Ioannis},
  tags = {video-understanding},
  journal = {International Journal of Computer Vision (IJCV)},
  volume = {130},
  number = {10},
  pages = {2385--2407},
  year = {2022},
  url = {https://doi.org/10.1007/s11263-022-01651-3},
  doi = {10.1007/s11263-022-01651-3}
}
Explaining video summarization based on the focus of attention

Evlampios E. Apostolidis, Georgios Balaouras, Vasileios Mezaris, and 1 more author

In IEEE International Symposium on Multimedia, ISM 2022, Naples, Italy, December 5-7, 2022, 2022

Abs Code
video-understanding

In this paper we propose a method for explaining video summarization. We start by formulating the problem as the creation of an explanation mask that indicates the parts of the video that most influenced the video summarization network’s estimates of the frames’ importance. Then, we explain how the typical analysis pipeline of attention-based networks for video summarization can be used to define explanation signals, and we examine various attention-based signals that have been studied as explanations in the NLP domain. We evaluate the performance of these signals by investigating the video summarization network’s input-output relationship according to different replacement functions, and utilizing measures that quantify the capability of explanations to spot the most and least influential parts of a video. We run experiments using an attention-based network (CA-SUM) and two datasets (SumMe and TVSum) for video summarization. Our evaluations indicate the advanced performance of explanations formed using the inherent attention weights, and demonstrate the ability of our method to explain the video summarization results using clues about the focus of the attention mechanism.
Summarizing Videos using Concentrated Attention and Considering the Uniqueness and Diversity of the Video Frames

Evlampios E. Apostolidis, Georgios Balaouras, Vasileios Mezaris, and 1 more author

In ICMR ’22: International Conference on Multimedia Retrieval, Newark, NJ, USA, June 27 - 30, 2022, 2022

Abs Code
video-understanding

In this work, we describe a new method for unsupervised video summarization. To overcome limitations of existing unsupervised video summarization approaches, that relate to the unstable training of Generator-Discriminator architectures, the use of RNNs for modeling long-range frames’ dependencies and the ability to parallelize the training process of RNN-based network architectures, the developed method relies solely on the use of a self-attention mechanism to estimate the importance of video frames. Instead of simply modeling the frames’ dependencies based on global attention, our method integrates a concentrated attention mechanism that is able to focus on non-overlapping blocks in the main diagonal of the attention matrix, and to enrich the existing information by extracting and exploiting knowledge about the uniqueness and diversity of the associated frames of the video. In this way, our method makes better estimates about the significance of different parts of the video, and drastically reduces the number of learnable parameters. Experimental evaluations using two benchmarking datasets (SumMe and TVSum) show the competitiveness of the proposed method against other state-of-the-art unsupervised summarization approaches, and demonstrate its ability to produce video summaries that are very close to the human preferences. An ablation study that focuses on the introduced components, namely the use of concentrated attention in combination with attention-based estimates about the frames’ uniqueness and diversity, shows their relative contributions to the overall summarization performance.

2021

Video Summarization Using Deep Neural Networks: A Survey

Evlampios E. Apostolidis, Eleni Adamantidou, Alexandros I. Metsai, and 2 more authors

Proc. IEEE, 2021

Abs PDF
video-understanding

Video summarization technologies aim to create a concise and complete synopsis by selecting the most informative parts of the video content. Several approaches have been developed over the last couple of decades, and the current state of the art is represented by methods that rely on modern deep neural network architectures. This work focuses on the recent advances in the area and provides a comprehensive survey of the existing deep-learning-based methods for generic video summarization. After presenting the motivation behind the development of technologies for video summarization, we formulate the video summarization task and discuss the main characteristics of a typical deep-learning-based analysis pipeline. Then, we suggest a taxonomy of the existing algorithms and provide a systematic review of the relevant literature that shows the evolution of the deep-learning-based video summarization technologies and leads to suggestions for future developments. We then report on protocols for the objective evaluation of video summarization algorithms, and we compare the performance of several deep-learning-based approaches. Based on the outcomes of these comparisons, as well as some documented considerations about the amount of annotated data and the suitability of evaluation protocols, we indicate potential future research directions.
AC-SUM-GAN: Connecting Actor-Critic and Generative Adversarial Networks for Unsupervised Video Summarization

Evlampios E. Apostolidis, Eleni Adamantidou, Alexandros I. Metsai, and 2 more authors

IEEE Trans. Circuits Syst. Video Technol., 2021

Abs Code
video-understanding

This paper presents a new method for unsupervised video summarization. The proposed architecture embeds an Actor-Critic model into a Generative Adversarial Network and formulates the selection of important video fragments (that will be used to form the summary) as a sequence generation task. The Actor and the Critic take part in a game that incrementally leads to the selection of the video key-fragments, and their choices at each step of the game result in a set of rewards from the Discriminator. The designed training workflow allows the Actor and Critic to discover a space of actions and automatically learn a policy for key-fragment selection. Moreover, the introduced criterion for choosing the best model after the training ends, enables the automatic selection of proper values for parameters of the training process that are not learned from the data (such as the regularization factor sigma). Experimental evaluation on two benchmark datasets (SumMe and TVSum) demonstrates that the proposed AC-SUM-GAN model performs consistently well and gives SoA results in comparison to unsupervised methods, that are also competitive with respect to supervised methods.
Combining Global and Local Attention with Positional Encoding for Video Summarization

Evlampios E. Apostolidis, Georgios Balaouras, Vasileios Mezaris, and 1 more author

In IEEE International Symposium on Multimedia, ISM 2021, Naples, Italy, November 29 - Dec. 1, 2021, 2021

Abs Code
video-understanding

This paper presents a new method for supervised video summarization. To overcome drawbacks of existing RNN-based summarization architectures, that relate to the modeling of long-range frames’ dependencies and the ability to parallelize the training process, the developed model relies on the use of self-attention mechanisms to estimate the importance of video frames. Contrary to previous attention-based summarization approaches that model the frames’ dependencies by observing the entire frame sequence, our method combines global and local multi-head attention mechanisms to discover different modelings of the frames’ dependencies at different levels of granularity. Moreover, the utilized attention mechanisms integrate a component that encodes the temporal position of video frames - this is of major importance when producing a video summary. Experiments on two datasets (SumMe and TVSum) demonstrate the effectiveness of the proposed model compared to existing attention-based methods, and its competitiveness against other state-of-the-art supervised summarization approaches. An ablation study that focuses on our main proposed components, namely the use of global and local multi-head attention mechanisms in collaboration with an absolute positional encoding component, shows their relative contributions to the overall summarization performance.
Combining Adversarial and Reinforcement Learning for Video Thumbnail Selection

Evlampios E. Apostolidis, Eleni Adamantidou, Vasileios Mezaris, and 1 more author

In ICMR ’21: International Conference on Multimedia Retrieval, Taipei, Taiwan, August 21-24, 2021, 2021

Abs Code
video-understanding

This paper presents a new method for unsupervised video thumbnail selection. The developed network architecture selects video thumbnails based on two criteria: the representativeness and the aesthetic quality of their visual content. Training relies on a combination of adversarial and reinforcement learning. The former is used to train a discriminator, whose goal is to distinguish the original from a reconstructed version of the video based on a small set of candidate thumbnails. The discriminator’s feedback is a measure of the representativeness of the selected thumbnails. This measure is combined with estimates about the aesthetic quality of the thumbnails (made using a SoA Fully Convolutional Network) to form a reward and train the thumbnail selector via reinforcement learning. Experiments on two datasets (OVP and Youtube) show the competitiveness of the proposed method against other SoA approaches. An ablation study with respect to the adopted thumbnail selection criteria documents the importance of considering the aesthetics, and the contribution of this information when used in combination with measures about the representativeness of the visual content.

Older publications

2020

Cycle-Consistent Adversarial Networks and Fast Adaptive Bi-dimensional Empirical Mode Decomposition for Style Transfer

Elissavet Batziou, Petros Alvanitopoulos, Konstantinos Ioannidis, and 3 more authors

In 25th International Conference on Pattern Recognition, ICPR 2020, Virtual Event / Milan, Italy, January 10-15, 2021, 2020
Performance over Random: A Robust Evaluation Protocol for Video Summarization Methods

Evlampios E. Apostolidis, Eleni Adamantidou, Alexandros I. Metsai, and 2 more authors

In MM ’20: The 28th ACM International Conference on Multimedia, Virtual Event / Seattle, WA, USA, October 12-16, 2020, 2020

Abs Code

This paper proposes a new evaluation approach for video summarization algorithms. We start by studying the currently established evaluation protocol; this protocol, defined over the ground-truth annotations of the SumMe and TVSum datasets, quantifies the agreement between the user-defined and the automatically-created summaries with F-Score, and reports the average performance on a few different training/testing splits of the used dataset. We evaluate five publicly-available summarization algorithms under a large-scale experimental setting with 50 randomly-created data splits. We show that the results reported in the papers are not always congruent with their performance on the large-scale experiment, and that the F-Score cannot be used for comparing algorithms evaluated on different splits. We also show that the above shortcomings of the established evaluation protocol are due to the significantly varying levels of difficulty among the utilized splits, that affect the outcomes of the evaluations. Further analysis of these findings indicates a noticeable performance correlation among all algorithms and a random summarizer. To mitigate these shortcomings we propose an evaluation protocol that makes estimates about the difficulty of each used data split and utilizes this information during the evaluation process. Experiments involving different evaluation settings demonstrate the increased representativeness of performance results when using the proposed evaluation approach, and the increased reliability of comparisons when the examined methods have been evaluated on different data splits.
Unsupervised Video Summarization via Attention-Driven Adversarial Learning

Evlampios E. Apostolidis, Eleni Adamantidou, Alexandros I. Metsai, and 2 more authors

In MultiMedia Modeling - 26th International Conference, MMM 2020, Daejeon, South Korea, January 5-8, 2020, Proceedings, Part I, 2020

Abs Code
video-understanding

This paper presents a new video summarization approach that integrates an attention mechanism to identify the significant parts of the video, and is trained in an unsupervised manner via generative adversarial learning. Starting from the SUM-GAN model, we first develop an improved version of it (called SUM-GAN-sl) that has a significantly reduced number of learned parameters, performs incremental training of the model’s components, and applies a stepwise label-based strategy for updating the adversarial part. Subsequently, we introduce an attention mechanism to SUM-GAN-sl in two ways: (i) by integrating an attention layer within the variational auto-encoder (VAE) of the architecture (SUM-GAN-VAAE), and (ii) by replacing the VAE with a deterministic attention auto-encoder (SUM-GAN-AAE). Experimental evaluation on two datasets (SumMe and TVSum) documents the contribution of the attention auto-encoder to faster and more stable training of the model, resulting in a significant performance improvement with respect to the original model and demonstrating the competitiveness of the proposed SUM-GAN-AAE against the state of the art.

2019

Universal Foreground Segmentation Based on Deep Feature Fusion Network for Multi-Scene Videos

Ye Tao, Zhihao Ling, and Ioannis Patras

IEEE Access, 2019
Registration-free Face-SSD: Single shot analysis of smiles, facial attributes, and affect in the wild

Youngkyoon Jang, Hatice Gunes, and Ioannis Patras

Comput. Vis. Image Underst., 2019
A deep generic to specific recognition model for group membership analysis using non-verbal cues

Wenxuan Mou, Christos Tzelepis, Vasileios Mezaris, and 2 more authors

Image Vis. Comput., 2019

PDF Website
affective-computing
Implicit and Explicit Concept Relations in Deep Neural Networks for Multi-Label Video/Image Annotation

Foteini Markatopoulou, Vasileios Mezaris, and Ioannis Patras

IEEE Trans. Circuits Syst. Video Technol., 2019
FIVR: Fine-Grained Incident Video Retrieval

Giorgos Kordopatis-Zilos, Symeon Papadopoulos, Ioannis Patras, and 1 more author

IEEE Trans. Multim., 2019

video-understanding
Alone versus In-a-group: A Multi-modal Framework for Automatic Affect Recognition

Wenxuan Mou, Hatice Gunes, and Ioannis Patras

ACM Trans. Multim. Comput. Commun. Appl., 2019

PDF Website
affective-computing
TARN: Temporal Attentive Relation Network for Few-Shot and Zero-Shot Action Recognition

Mina Bishay, Georgios Zoumpourlis, and Ioannis Patras

In 30th British Machine Vision Conference 2019, BMVC 2019, Cardiff, UK, September 9-12, 2019, 2019

video-understanding
Your Fellows Matter: Affect Analysis across Subjects in Group Videos

Wenxuan Mou, Hatice Gunes, and Ioannis Patras

In 14th IEEE International Conference on Automatic Face & Gesture Recognition, FG 2019, Lille, France, May 14-18, 2019, 2019

PDF Website
affective-computing
Can Automatic Facial Expression Analysis Be Used for Treatment Outcome Estimation in Schizophrenia?

Mina Bishay, Stefan Priebe, and Ioannis Patras

In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2019, Brighton, United Kingdom, May 12-17, 2019, 2019

PDF Website
affective-computing
ViSiL: Fine-Grained Spatio-Temporal Video Similarity Learning

Giorgos Kordopatis-Zilos, Symeon Papadopoulos, Ioannis Patras, and 1 more author

In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, 2019

Code
video-understanding
Exploring Feature Representation and Training Strategies in Temporal Action Localization

Tingting Xie, Xiaoshan Yang, Tianzhu Zhang, and 2 more authors

In 2019 IEEE International Conference on Image Processing, ICIP 2019, Taipei, Taiwan, September 22-25, 2019, 2019
A Stepwise, Label-based Approach for Improving the Adversarial Training in Unsupervised Video Summarization

Evlampios E. Apostolidis, Alexandros I. Metsai, Eleni Adamantidou, and 2 more authors

In Proceedings of the 1st International Workshop on AI for Smart TV Content Production, Access and Delivery, AI4TV@MM 2019, Nice, France, October 21, 2019, 2019

Abs Code

In this paper we present our work on improving the efficiency of adversarial training for unsupervised video summarization. Our starting point is the SUM-GAN model, which creates a representative summary based on the intuition that such a summary should make it possible to reconstruct a video that is indistinguishable from the original one. We build on a publicly available implementation of a variation of this model, that includes a linear compression layer to reduce the number of learned parameters and applies an incremental approach for training the different components of the architecture. After assessing the impact of these changes to the model’s performance, we propose a stepwise, label-based learning process to improve the training efficiency of the adversarial part of the model. Before evaluating our model’s efficiency, we perform a thorough study with respect to the used evaluation protocols and we examine the possible performance on two benchmarking datasets, namely SumMe and TVSum. Experimental evaluations and comparisons with the state of the art highlight the competitiveness of the proposed method. An ablation study indicates the benefit of each applied change on the model’s performance, and points out the advantageous role of the introduced stepwise, label-based training strategy on the learning efficiency of the adversarial part of the architecture.
Multimodal Video Annotation for Retrieval and Discovery of Newsworthy Video in a News Verification Scenario

Lyndon J. B. Nixon, Evlampios E. Apostolidis, Foteini Markatopoulou, and 2 more authors

In MultiMedia Modeling - 25th International Conference, MMM 2019, Thessaloniki, Greece, January 8-11, 2019, Proceedings, Part I, 2019

Abs

This paper describes the combination of advanced technologies for social-media-based story detection, story-based video retrieval and concept-based video (fragment) labeling under a novel approach for multimodal video annotation. This approach involves textual metadata, structural information and visual concepts - and a multimodal analytics dashboard that enables journalists to discover videos of news events, posted to social networks, in order to verify the details of the events shown. It outlines the characteristics of each individual method and describes how these techniques are blended to facilitate the content-based retrieval, discovery and summarization of (parts of) news videos. A set of case-driven experiments conducted with the help of journalists, indicate that the proposed multimodal video annotation mechanism - combined with a professional analytics dashboard which presents the collected and generated metadata about the news stories and their visual summaries - can support journalists in their content discovery and verification work.
Detecting Tampered Videos with Multimedia Forensics and Deep Learning

Markos Zampoglou, Foteini Markatopoulou, Grégoire Mercier, and 7 more authors

In MultiMedia Modeling - 25th International Conference, MMM 2019, Thessaloniki, Greece, January 8-11, 2019, Proceedings, Part I, 2019

Abs

User-Generated Content (UGC) has become an integral part of the news reporting cycle. As a result, the need to verify videos collected from social media and Web sources is becoming increasingly important for news organisations. While video verification is attracting a lot of attention, there has been limited effort so far in applying video forensics to real-world data. In this work we present an approach for automatic video manipulation detection inspired by manual verification approaches. In a typical manual verification setting, video filter outputs are visually interpreted by human experts. We use two such forensics filters designed for manual verification, one based on Discrete Cosine Transform (DCT) coefficients and a second based on video requantization errors, and combine them with Deep Convolutional Neural Networks (CNN) designed for image classification. We compare the performance of the proposed approach to other works from the state of the art, and discover that, while competing approaches perform better when trained with videos from the same dataset, one of the proposed filters demonstrates superior performance in cross-dataset settings. We discuss the implications of our work and the limitations of the current experimental setup, and propose directions for future research in this area.
VERGE in VBS 2019

Stelios Andreadis, Anastasia Moumtzidou, Damianos Galanopoulos, and 8 more authors

In MultiMedia Modeling - 25th International Conference, MMM 2019, Thessaloniki, Greece, January 8-11, 2019, Proceedings, Part II, 2019
Video Fragmentation and Reverse Search on the Web

Evlampios E. Apostolidis, Konstantinos Apostolidis, Ioannis Patras, and 1 more author

In Video Verification in the Fake News Era, 2019

Abs

This chapter is focused on methods and tools for video fragmentation and reverse search on the web. These technologies can assist journalists when they are dealing with fake news—which nowadays are being rapidly spread via social media platforms—that rely on the reuse of a previously posted video from a past event with the intention to mislead the viewers about a contemporary event. The fragmentation of a video into visually and temporally coherent parts and the extraction of a representative keyframe for each defined fragment enables the provision of a complete and concise keyframe-based summary of the video. Contrary to straightforward approaches that sample video frames with a constant step, the generated summary through video fragmentation and keyframe extraction is considerably more effective for discovering the video content and performing a fragment-level search for the video on the web. This chapter starts by explaining the nature and characteristics of this type of reuse-based fake news in its introductory part, and continues with an overview of existing approaches for temporal fragmentation of single-shot videos into sub-shots (the most appropriate level of temporal granularity when dealing with user-generated videos) and tools for performing reverse search of a video on the web. Subsequently, it describes two state-of-the-art methods for video sub-shot fragmentation—one relying on the assessment of the visual coherence over sequences of frames, and another one that is based on the identification of camera activity during the video recording—and presents the InVID web application that enables the fine-grained (at the fragment-level) reverse search for near-duplicates of a given video on the web. 
In the sequel, the chapter reports the findings of a series of experimental evaluations regarding the efficiency of the above-mentioned technologies, which indicate their competence to generate a concise and complete keyframe-based summary of the video content, and the use of this fragment-level representation for fine-grained reverse video search on the web. Finally, it draws conclusions about the effectiveness of the presented technologies and outlines our future plans for further advancing them.
Finding Near-Duplicate Videos in Large-Scale Collections

Giorgos Kordopatis-Zilos, Symeon Papadopoulos, Ioannis Patras, and 1 more author

In Video Verification in the Fake News Era, 2019
Finding Semantically Related Videos in Closed Collections

Foteini Markatopoulou, Markos Zampoglou, Evlampios E. Apostolidis, and 4 more authors

In Video Verification in the Fake News Era, 2019

Abs

Modern newsroom tools offer advanced functionality for automatic and semi-automatic content collection from the web and social media sources to accompany news stories. However, the content collected in this way often tends to be unstructured and may include irrelevant items. An important step in the verification process is to organize this content, both with respect to what it shows, and with respect to its origin. This chapter presents our efforts in this direction, which resulted in two components. One aims to detect semantic concepts in video shots, to help annotation and organization of content collections. We implement a system based on deep learning, featuring a number of advances and adaptations of existing algorithms to increase performance for the task. The other component aims to detect logos in videos in order to identify their provenance. We present our progress from a keypoint-based detection system to a system based on deep learning.
Detecting Manipulations in Video

Grégoire Mercier, Foteini Markatopoulou, Roger Cozien, and 7 more authors

In Video Verification in the Fake News Era, 2019

Abs

This chapter presents the techniques researched and developed within InVID for the forensic analysis of videos, and the detection and localization of forgeries within User-Generated Videos (UGVs). Following an overview of state-of-the-art video tampering detection techniques, we observed that the bulk of current research is mainly dedicated to frame-based tampering analysis or encoding-based inconsistency characterization. We built upon this existing research, by designing forensics filters aimed to highlight any traces left behind by video tampering, with a focus on identifying disruptions in the temporal aspects of a video. As for many other data analysis domains, deep neural networks show very promising results in tampering detection as well. Thus, following the development of a number of analysis filters aimed to help human users in highlighting inconsistencies in video content, we proceeded to develop a deep learning approach aimed to analyze the outputs of these forensics filters and automatically detect tampered videos. In this chapter, we present our survey of the state of the art with respect to its relevance to the goals of InVID, the forensics filters we developed and their potential role in localizing video forgeries, as well as our deep learning approach for automatic tampering detection. We present experimental results on benchmark and real-world data, and analyze the results. We observe that the proposed method yields promising results compared to the state of the art, especially with respect to the algorithm’s ability to generalize to unknown data taken from the real world. We conclude with the research directions that our work in InVID has opened for the future.

2018

Linear Maximum Margin Classifier for Learning from Uncertain Data

Christos Tzelepis, Vasileios Mezaris, and Ioannis Patras

IEEE Trans. Pattern Anal. Mach. Intell., 2018

learning-from-few-samples
Deep Mixture of MRFs for Human Pose Estimation

Ioannis Marras, Petar Palasek, and Ioannis Patras

In Computer Vision - ACCV 2018 - 14th Asian Conference on Computer Vision, Perth, Australia, December 2-6, 2018, Revised Selected Papers, Part III, 2018
LikeNet: A Siamese Motion Estimation Network Trained in an Unsupervised Way

Aria Ahmadi, Ioannis Marras, and Ioannis Patras

In British Machine Vision Conference 2018, BMVC 2018, Newcastle, UK, September 3-6, 2018, 2018
A Multi-Task Cascaded Network for Prediction of Affect, Personality, Mood and Social Context Using EEG Signals

Juan Abdon Miranda Correa, and Ioannis Patras

In 13th IEEE International Conference on Automatic Face & Gesture Recognition, FG 2018, Xi’an, China, May 15-19, 2018, 2018
Visual and Audio Analysis of Movies Video for Emotion Detection @ Emotional Impact of Movies Task MediaEval 2018

Elissavet Batziou, Emmanouil Michail, Konstantinos Avgerinakis, and 3 more authors

In Working Notes Proceedings of the MediaEval 2018 Workshop, Sophia Antipolis, France, 29-31 October 2018, 2018

PDF
affective-computing
VERGE in VBS 2018

Anastasia Moumtzidou, Stelios Andreadis, Foteini Markatopoulou, and 6 more authors

In MultiMedia Modeling - 24th International Conference, MMM 2018, Bangkok, Thailand, February 5-7, 2018, Proceedings, Part II, 2018
Multimedia Processing Essentials

Konstantinos Apostolidis, Foteini Markatopoulou, Christos Tzelepis, and 2 more authors

In Personal Multimedia Preservation - Remembering or Forgetting Images and Video, 2018

2017

Gaze movement-driven random forests for query clustering in automatic video annotation

Stefanos Vrochidis, Ioannis Patras, and Ioannis Kompatsiaris

Multim. Tools Appl., 2017
Background modelling based on generative unet

Ye Tao, Petar Palasek, Zhihao Ling, and 1 more author

In 14th IEEE International Conference on Advanced Video and Signal Based Surveillance, AVSS 2017, Lecce, Italy, August 29 - September 1, 2017, 2017
Deep Refinement Convolutional Networks for Human Pose Estimation

Ioannis Marras, Petar Palasek, and Ioannis Patras

In 12th IEEE International Conference on Automatic Face & Gesture Recognition, FG 2017, Washington, DC, USA, May 30 - June 3, 2017, 2017
Generic to Specific Recognition Models for Membership Analysis in Group Videos

Wenxuan Mou, Christos Tzelepis, Vasileios Mezaris, and 2 more authors

In 12th IEEE International Conference on Automatic Face & Gesture Recognition, FG 2017, Washington, DC, USA, May 30 - June 3, 2017, 2017
Fusing Multilabel Deep Networks for Facial Action Unit Detection

Mina Bishay, and Ioannis Patras

In 12th IEEE International Conference on Automatic Face & Gesture Recognition, FG 2017, Washington, DC, USA, May 30 - June 3, 2017, 2017
Deep Globally Constrained MRFs for Human Pose Estimation

Ioannis Marras, Petar Palasek, and Ioannis Patras

In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 2017
Near-Duplicate Video Retrieval with Deep Metric Learning

Giorgos Kordopatis-Zilos, Symeon Papadopoulos, Ioannis Patras, and 1 more author

In 2017 IEEE International Conference on Computer Vision Workshops, ICCV Workshops 2017, Venice, Italy, October 22-29, 2017, 2017
SmileNet: Registration-Free Smiling Face Detection In The Wild

Youngkyoon Jang, Hatice Gunes, and Ioannis Patras

In 2017 IEEE International Conference on Computer Vision Workshops, ICCV Workshops 2017, Venice, Italy, October 22-29, 2017, 2017
Concept Language Models and Event-based Concept Number Selection for Zero-example Event Detection

Damianos Galanopoulos, Foteini Markatopoulou, Vasileios Mezaris, and 1 more author

In Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval, ICMR 2017, Bucharest, Romania, June 6-9, 2017, 2017
Query and Keyframe Representations for Ad-hoc Video Search

Foteini Markatopoulou, Damianos Galanopoulos, Vasileios Mezaris, and 1 more author

In Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval, ICMR 2017, Bucharest, Romania, June 6-9, 2017, 2017
VideoAnalysis4ALL: An On-line Tool for the Automatic Fragmentation and Concept-based Annotation, and the Interactive Exploration of Videos

Chrysa Collyda, Evlampios E. Apostolidis, Alexandros Pournaras, and 3 more authors

In Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval, ICMR 2017, Bucharest, Romania, June 6-9, 2017, 2017

Abs

This paper presents the VideoAnalysis4ALL tool, which supports the automatic fragmentation and concept-based annotation of videos, and the exploration of the annotated video fragments through an interactive user interface. The developed web application decomposes the video at two different granularities, namely shots and scenes, and annotates each fragment by evaluating the existence of a number (several hundred) of high-level visual concepts in the keyframes extracted from these fragments. Through this analysis, the tool enables the identification and labeling of semantically coherent video fragments, while its user interfaces allow the discovery of these fragments with the help of human-interpretable concepts. The integrated state-of-the-art video analysis technologies perform very well and, by exploiting the processing capabilities of multi-thread / multi-core architectures, reduce the time required for analysis to approximately one third of the video’s duration, thus making the analysis three times faster than real-time processing.
Comparison of Fine-Tuning and Extension Strategies for Deep Convolutional Neural Networks

Nikiforos Pittaras, Foteini Markatopoulou, Vasileios Mezaris, and 1 more author

In MultiMedia Modeling - 23rd International Conference, MMM 2017, Reykjavik, Iceland, January 4-6, 2017, Proceedings, Part I, 2017
Near-Duplicate Video Retrieval by Aggregating Intermediate CNN Layers

Giorgos Kordopatis-Zilos, Symeon Papadopoulos, Ioannis Patras, and 1 more author

In MultiMedia Modeling - 23rd International Conference, MMM 2017, Reykjavik, Iceland, January 4-6, 2017, Proceedings, Part I, 2017
VERGE in VBS 2017

Anastasia Moumtzidou, Theodoros Mironidis, Fotini Markatopoulou, and 8 more authors

In MultiMedia Modeling - 23rd International Conference, MMM 2017, Reykjavik, Iceland, January 4-6, 2017, Proceedings, Part II, 2017
ITI-CERTH participation in TRECVID 2017

Foteini Markatopoulou, Anastasia Moumtzidou, Damianos Galanopoulos, and 8 more authors

In 2017 TREC Video Retrieval Evaluation, TRECVID 2017, Gaithersburg, MD, USA, November 13-15, 2017, 2017

2016

Special Issue on Individual and Group Activities in Video Event Analysis

Liang Wang, Ioannis Patras, Jian Zhang, and 2 more authors

Comput. Vis. Image Underst., 2016
Action recognition using saliency learned from recorded human gaze

Daria Stefic, and Ioannis Patras

Image Vis. Comput., 2016
Learning to detect video events from zero or very few video examples

Christos Tzelepis, Damianos Galanopoulos, Vasileios Mezaris, and 1 more author

Image Vis. Comput., 2016
Automatic Recognition of Emotions and Membership in Group Videos

Wenxuan Mou, Hatice Gunes, and Ioannis Patras

In 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2016, Las Vegas, NV, USA, June 26 - July 1, 2016, 2016
Online multi-task learning for semantic concept detection in video

Foteini Markatopoulou, Vasileios Mezaris, and Ioannis Patras

In 2016 IEEE International Conference on Image Processing, ICIP 2016, Phoenix, AZ, USA, September 25-28, 2016, 2016
Unsupervised convolutional neural networks for motion estimation

Aria Ahmadi, and Ioannis Patras

In 2016 IEEE International Conference on Image Processing, ICIP 2016, Phoenix, AZ, USA, September 25-28, 2016, 2016

learning-from-few-samples
Video aesthetic quality assessment using kernel Support Vector Machine with isotropic Gaussian sample uncertainty (KSVM-IGSU)

Christos Tzelepis, Eftichia Mavridaki, Vasileios Mezaris, and 1 more author

In 2016 IEEE International Conference on Image Processing, ICIP 2016, Phoenix, AZ, USA, September 25-28, 2016, 2016
Minimal filtered channel features for pedestrian detection

Yoshiki Kuranuki, and Ioannis Patras

In 23rd International Conference on Pattern Recognition, ICPR 2016, Cancún, Mexico, December 4-8, 2016, 2016
Action Recognition Using Convolutional Restricted Boltzmann Machines

Petar Palasek, and Ioannis Patras

In Proceedings of the 1st International Workshop on Multimedia Analysis and Retrieval for Multimodal Interaction, MARMI@ICMR 2016, New York, New York, USA, June 6, 2016, 2016
Deep Multi-task Learning with Label Correlation Constraint for Video Concept Detection

Foteini Markatopoulou, Vasileios Mezaris, and Ioannis Patras

In Proceedings of the 2016 ACM Conference on Multimedia Conference, MM 2016, Amsterdam, The Netherlands, October 15-19, 2016, 2016
Alone versus In-a-group: A Comparative Analysis of Facial Affect Recognition

Wenxuan Mou, Hatice Gunes, and Ioannis Patras

In Proceedings of the 2016 ACM Conference on Multimedia Conference, MM 2016, Amsterdam, The Netherlands, October 15-19, 2016, 2016
Video Event Detection Using Kernel Support Vector Machine with Isotropic Gaussian Sample Uncertainty (KSVM-iGSU)

Christos Tzelepis, Vasileios Mezaris, and Ioannis Patras

In MultiMedia Modeling - 22nd International Conference, MMM 2016, Miami, FL, USA, January 4-6, 2016, Proceedings, Part I, 2016
VERGE: A Multimodal Interactive Search Engine for Video Browsing and Retrieval

Anastasia Moumtzidou, Theodoros Mironidis, Evlampios E. Apostolidis, and 8 more authors

In MultiMedia Modeling - 22nd International Conference, MMM 2016, Miami, FL, USA, January 4-6, 2016, Proceedings, Part II, 2016

Abs

This paper presents the VERGE interactive search engine, which is capable of browsing and searching video content. The system integrates content-based analysis and retrieval modules such as video shot segmentation, concept detection, and clustering, as well as visual similarity and object-based search.
Ordering of Visual Descriptors in a Classifier Cascade Towards Improved Video Concept Detection

Foteini Markatopoulou, Vasileios Mezaris, and Ioannis Patras

In MultiMedia Modeling - 22nd International Conference, MMM 2016, Miami, FL, USA, January 4-6, 2016, Proceedings, Part I, 2016
ITI-CERTH participation to TRECVID 2016

Fotini Markatopoulou, Anastasia Moumtzidou, Damianos Galanopoulos, and 12 more authors

In 2016 TREC Video Retrieval Evaluation, TRECVID 2016, Gaithersburg, MD, USA, November 14-16, 2016, 2016

2015

Cascade of forests for face alignment

Heng Yang, Changqing Zou, and Ioannis Patras

IET Comput. Vis., 2015
Random Subspace Supervised Descent Method for Regression Problems in Computer Vision

Heng Yang, Xuhui Jia, Ioannis Patras, and 1 more author

IEEE Signal Process. Lett., 2015
DECAF: MEG-Based Multimodal Database for Decoding Affective Physiological Responses

Mojtaba Khomami Abadi, Ramanathan Subramanian, Seyed Mostafa Kia, and 3 more authors

IEEE Trans. Affect. Comput., 2015
Privileged Information-Based Conditional Structured Output Regression Forest for Facial Point Detection

Heng Yang, and Ioannis Patras

IEEE Trans. Circuits Syst. Video Technol., 2015
Local Features and a Two-Layer Stacking Architecture for Semantic Concept Detection in Video

Fotini Markatopoulou, Vasileios Mezaris, Nikiforos Pittaras, and 1 more author

IEEE Trans. Emerg. Top. Comput., 2015
Fine-Tuning Regression Forests Votes for Object Alignment in the Wild

Heng Yang, and Ioannis Patras

IEEE Trans. Image Process., 2015
Robust Face Alignment Under Occlusion via Regional Predictive Power Estimation

Heng Yang, Xuming He, Xuhui Jia, and 1 more author

IEEE Trans. Image Process., 2015
Concept Detection in Multimedia Web Resources About Home Made Explosives

George Kalpakis, Theodora Tsikrika, Fotini Markatopoulou, and 5 more authors

In 10th International Conference on Availability, Reliability and Security, ARES 2015, Toulouse, France, August 24-27, 2015, 2015
Identifying valence and arousal levels via connectivity between EEG channels

Mo Chen, Junwei Han, Lei Guo, and 2 more authors

In 2015 International Conference on Affective Computing and Intelligent Interaction, ACII 2015, Xi’an, China, September 21-24, 2015, 2015
Face Alignment Assisted by Head Pose Estimation

Heng Yang, Wenxuan Mou, Yichi Zhang, and 3 more authors

In Proceedings of the British Machine Vision Conference 2015, BMVC 2015, Swansea, UK, September 7-10, 2015, 2015
Mirror, mirror on the wall, tell me, is the error small?

Heng Yang, and Ioannis Patras

In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, 2015
Inference of personality traits and affect schedule by analysis of spontaneous reactions to affective videos

Mojtaba Khomami Abadi, Juan Abdon Miranda Correa, Julia Wache, and 3 more authors

In 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, FG 2015, Ljubljana, Slovenia, May 4-8, 2015, 2015
Cascade of classifiers based on binary, non-binary and deep convolutional network descriptors for video concept detection

Foteini Markatopoulou, Vasileios Mezaris, and Ioannis Patras

In 2015 IEEE International Conference on Image Processing, ICIP 2015, Quebec City, QC, Canada, September 27-30, 2015, 2015
A flexible calibration method of multiple Kinects for 3D human reconstruction

Petar Palasek, Heng Yang, Zongyi Xu, and 3 more authors

In 2015 IEEE International Conference on Multimedia & Expo Workshops, ICME Workshops 2015, Turin, Italy, June 29 - July 3, 2015, 2015
VERGE: A Multimodal Interactive Video Search Engine

Anastasia Moumtzidou, Konstantinos Avgerinakis, Evlampios E. Apostolidis, and 7 more authors

In MultiMedia Modeling - 21st International Conference, MMM 2015, Sydney, NSW, Australia, January 5-7, 2015, Proceedings, Part II, 2015

Abs

This paper presents the VERGE interactive video retrieval engine, which is capable of searching video content. The system integrates several content-based analysis and retrieval modules such as video shot boundary detection, concept detection, clustering and visual similarity search.
A Study on the Use of a Binary Local Descriptor and Color Extensions of Local Descriptors for Video Concept Detection

Fotini Markatopoulou, Nikiforos Pittaras, Olga Papadopoulou, and 2 more authors

In MultiMedia Modeling - 21st International Conference, MMM 2015, Sydney, NSW, Australia, January 5-7, 2015, Proceedings, Part I, 2015
ITI-CERTH participation to TRECVID 2015

Foteini Markatopoulou, Anastasia Ioannidou, Christos Tzelepis, and 11 more authors

In 2015 TREC Video Retrieval Evaluation, TRECVID 2015, Gaithersburg, MD, USA, November 16-18, 2015, 2015
Face Pose Analysis

Ioannis Patras

In Encyclopedia of Biometrics, Second Edition, 2015

2014

Multimodal random forest based tensor regression

Sertan Kaymak, and Ioannis Patras

IET Comput. Vis., 2014
Face Sketch Landmarks Localization in the Wild

Heng Yang, Changqing Zou, and Ioannis Patras

IEEE Signal Process. Lett., 2014
Structured Semi-supervised Forest for Facial Landmarks Localization with Face Mask Reasoning

Xuhui Jia, Heng Yang, Kwok-Ping Chan, and 1 more author

In British Machine Vision Conference, BMVC 2014, Nottingham, UK, September 1-5, 2014, 2014
Non-invasive player experience estimation from body motion and game context

Paolo Burelli, Georgios Triantafyllidis, and Ioannis Patras

In 2014 IEEE Conference on Computational Intelligence and Games, CIG 2014, Dortmund, Germany, August 26-29, 2014, 2014
Learning visual saliency using topographic independent component analysis

Daria Stefic, and Ioannis Patras

In 2014 IEEE International Conference on Image Processing, ICIP 2014, Paris, France, October 27-30, 2014, 2014
ITI-CERTH participation to TRECVID 2014

Nikolaos Gkalelis, Foteini Markatopoulou, Anastasia Moumtzidou, and 7 more authors

In 2014 TREC Video Retrieval Evaluation, TRECVID 2014, Orlando, FL, USA, November 10-12, 2014, 2014

2013

Fusion of facial expressions and EEG for implicit affective tagging

Sander Koelstra, and Ioannis Patras

Image Vis. Comput., 2013
Coupled Gaussian Processes for Pose-Invariant Facial Expression Recognition

Ognjen Rudovic, Maja Pantic, and Ioannis Patras

IEEE Trans. Pattern Anal. Mach. Intell., 2013
High order pLSA for indexing tagged images

Spiros Nikolopoulos, Stefanos Zafeiriou, Ioannis Patras, and 1 more author

Signal Process., 2013
Supervised dictionary learning for action localization

B. G. Vijay Kumar, and Ioannis Patras

In 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, FG 2013, Shanghai, China, 22-26 April, 2013, 2013
Privileged information-based conditional regression forest for facial feature detection

Heng Yang, and Ioannis Patras

In 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, FG 2013, Shanghai, China, 22-26 April, 2013, 2013
Sieving Regression Forest Votes for Facial Feature Detection in the Wild

Heng Yang, and Ioannis Patras

In IEEE International Conference on Computer Vision, ICCV 2013, Sydney, Australia, December 1-8, 2013, 2013
Semi-supervised visual recognition with constrained graph regularized non negative matrix factorization

Weiwei Guo, Weidong Hu, Nikolaos V. Boulgouris, and 1 more author

In IEEE International Conference on Image Processing, ICIP 2013, Melbourne, Australia, September 15-18, 2013, 2013

2012

Max-margin Non-negative Matrix Factorization

B. G. Vijay Kumar, Irene Kotsia, and Ioannis Patras

Image Vis. Comput., 2012
Leveraging social media for scalable object detection

Elisavet Chatzilari, Spyros Nikolopoulos, Ioannis Patras, and 1 more author

Pattern Recognit., 2012
Higher rank Support Tensor Machines for visual recognition

Irene Kotsia, Weiwei Guo, and Ioannis Patras

Pattern Recognit., 2012
DEAP: A Database for Emotion Analysis Using Physiological Signals

Sander Koelstra, Christian Mühl, Mohammad Soleymani, and 6 more authors

IEEE Trans. Affect. Comput., 2012

PDF Website
affective-computing
Tensor Learning for Regression

Weiwei Guo, Irene Kotsia, and Ioannis Patras

IEEE Trans. Image Process., 2012
Tree-Structured Feature Extraction Using Mutual Information

Farid Oveisi, Shahrzad Oveisi, Abbas Erfanian, and 1 more author

IEEE Trans. Neural Networks Learn. Syst., 2012
Exploiting Depth and Intensity Information for Head Pose Estimation with Random Forests and Tensor Models

Sertan Kaymak, and Ioannis Patras

In Computer Vision - ACCV 2012 Workshops, ACCV 2012 International Workshops, Daejeon, Korea, November 5-6, 2012, Revised Selected Papers, Part II, 2012
Exploring the Similarities of Neighboring Spatiotemporal Points for Action Pair Matching

Irene Kotsia, and Ioannis Patras

In Computer Vision - ACCV 2012 - 11th Asian Conference on Computer Vision, Daejeon, Korea, November 5-9, 2012, Revised Selected Papers, Part III, 2012
Face Parts Localization Using Structured-Output Regression Forests

Heng Yang, and Ioannis Patras

In Computer Vision - ACCV 2012, 11th Asian Conference on Computer Vision, Daejeon, Korea, November 5-9, 2012, Revised Selected Papers, Part II, 2012
Learning codebook weights for action detection

B. G. Vijay Kumar, and Ioannis Patras

In 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Providence, RI, USA, June 16-21, 2012, 2012
Support tensor action spotting

Irene Kotsia, and Ioannis Patras

In 19th IEEE International Conference on Image Processing, ICIP 2012, Lake Buena Vista, Orlando, FL, USA, September 30 - October 3, 2012, 2012
A simple and effective extrinsic calibration method of a camera and a single line scanning lidar

Heng Yang, Xiaolin Liu, and Ioannis Patras

In Proceedings of the 21st International Conference on Pattern Recognition, ICPR 2012, Tsukuba, Japan, November 11-15, 2012, 2012
Coupled 3D tracking and pose optimization of rigid objects using particle filter

Heng Yang, Yueqiang Zhang, Xiaolin Liu, and 1 more author

In Proceedings of the 21st International Conference on Pattern Recognition, ICPR 2012, Tsukuba, Japan, November 11-15, 2012, 2012
Affective gaming: Beyond using sensors

Irene Kotsia, Ioannis Patras, and Spiros Fotopoulos

In 5th International Symposium on Communications, Control and Signal Processing, ISCCSP 2012, Roma, Italy, May 2-4, 2012, 2012
Higher Rank Support Tensor Machines

Irene Kotsia, Weiwei Guo, and Ioannis Patras

In Advances in Visual Computing - 8th International Symposium, ISVC 2012, Rethymnon, Crete, Greece, July 16-18, 2012, Revised Selected Papers, Part II, 2012
Image Interpretation by Combining Ontologies and Bayesian Networks

Spiros Nikolopoulos, Georgios Th. Papadopoulos, Ioannis Kompatsiaris, and 1 more author

In Artificial Intelligence: Theories and Applications - 7th Hellenic Conference on AI, SETN 2012, Lamia, Greece, May 28-31, 2012. Proceedings, 2012
Exploiting gaze movements for automatic video annotation

Stefanos Vrochidis, Ioannis Patras, and Ioannis Kompatsiaris

In 13th International Workshop on Image Analysis for Multimedia Interactive Services, WIAMIS 2012, Dublin, Ireland, May 23-25, 2012, 2012