Following are Special Session proposals that have been accepted for CBMI 2024:


AIMHDA: Advances in AI-Driven Medical and Health Data Analysis

In the past decade, the surge in AI-driven data analysis has revolutionized both the medical and health sectors. The exponential growth of data in these fields necessitates innovative approaches for efficient and effective analysis. This special session is dedicated to exploring the emerging research problems of multimodal data analysis in these vital areas.

Medical procedures, such as surgeries, are now extensively documented through interventional videos, creating a rich repository for post-procedural analysis. Applications include operation documentation, surgical error analysis, and the development of training materials for teaching advanced surgical techniques. Similarly, the field of radiology has been transformed by AI, with vast quantities of imaging data being automatically processed and analyzed to aid in diagnostics and treatment planning. These are just two examples of the many medical specializations being transformed by current technologies.

On the other hand, personal health and lifelogging, also often referred to as intelligent health, represent a growing area of interest. The tracking of sports activities and calorie consumption, as well as memory aids for individuals with conditions like dementia, are made possible by the intelligent analysis of multimodal data. This wealth of information offers unprecedented insights into personal health management, with the potential to help prevent disease or better manage existing conditions.

Topics of interest for this special session include, but are not limited to:

  • AI in Medical Imaging: Multimodal diagnosis, image/video segmentation, and pattern recognition in radiology, pathology, and other imaging-related fields.
  • Surgical Data Analysis: AI applications in analyzing surgical videos, error detection, and post-procedure evaluations.
  • Personal Health Data Analytics: AI-driven analysis of multimodal data from wearables and lifelogging devices, focusing on sports performance, calorie tracking, and continuous health monitoring.
  • Memory Supports for Dementia: Development of AI tools to assist memory and activities of daily living for patients with dementia.
  • Multimodal Data Fusion: Techniques for integrating data from various sources (e.g., audio, video, text) for comprehensive health and medical analysis.
  • Remote Patient Monitoring: Utilizing multimedia and AI for patient care, particularly in telemedicine and home healthcare settings.
  • AI in Health Behavior Analysis: Understanding and predicting health behaviors using AI, with applications in public health and personalized medicine.
  • Ethical Considerations and Privacy in Health Data: Addressing the challenges of privacy, consent, and ethical considerations in the collection and analysis of health and medical data using AI.
  • Explainable AI in health applications.
  • Multimodal analysis and retrieval in medicine and health.
  • Creating and indexing multimodal data in medicine and health.
  • Interactive image and video exploration systems for medicine and health.
  • Novel user interfaces (e.g. VR/AR systems) to support medical and health data exploration.
  • Alignment of AI systems in medical and health applications, including bias in data, explainable methods, and ethical AI.

The AIMHDA session aims to serve as a vibrant platform for researchers at the intersection of technology, medicine, and health to share innovative ideas and latest findings. It is designed to appeal to the common Content-Based Multimedia Indexing (CBMI) audience, drawing parallels between methodologies in specialized and general content-based multimedia indexing. We believe that discussions and collaborations among researchers from these intersecting domains will lead to enriched perspectives and groundbreaking advancements in AI-driven data analysis for medical and health applications.

Organisers of this special session are:

  • Klaus Schöffmann, Institute of Information Technology (ITEC) at Klagenfurt University, Austria.
  • Cathal Gurrin, Dublin City University (DCU), Ireland.
  • Stefanos Vrochidis, Information Technologies Institute / Centre for Research and Technology Hellas, Greece.
  • Michael A. Riegler, SimulaMet & OsloMet, Norway.

Please direct correspondence to


CB4AMAS: Content-Based Indexing for Audio and Music: From Analysis to Synthesis

Audio has long been a key component of multimedia research. As far as indexing is concerned, the research and industrial context has changed drastically over the last 20 years. Today, applications of audio indexing range from karaoke to singing voice synthesis and creative audio design. This special session aims to bring together researchers proposing new tools or paradigms for audio and music processing in the context of indexing and corpus-based generation.

Organisers of this special session are:

  • François Pachet, Spotify Creator Technology Research Lab, Sweden.
  • Mathieu Lagrange, LS2N, France.

Please direct correspondence to


IVR4B: Interactive Video Retrieval for Beginners

Despite the advances in automated content description using deep learning, and the emergence of joint image-text embedding models, many video retrieval tasks still require a human user in the loop. Interactive video retrieval (IVR) systems address these challenges. In order to assess their performance, multimedia retrieval benchmarks such as the Video Browser Showdown (VBS) and the Lifelog Search Challenge (LSC) have been established. These benchmarks provide large-scale datasets as well as task settings and evaluation protocols, making it possible to measure progress in research on IVR systems. However, to achieve the best possible performance, the participating systems are usually operated by members of their development teams. This special session aims to provide better insights into how usable such systems are for users who have a solid IT background but are not familiar with the details behind the system.

The submitted retrieval systems will be presented as demos (with a related poster) and compete in a novice competition. Volunteer attendees, who are not members of the development team of any participating IVR system but have seen the systems in the demo session, will use them to solve a small number of Video Browser Showdown tasks.

Organisers of this special session are:

  • Werner Bailer, JOANNEUM RESEARCH’s Connected Computing Group in Graz, Austria.
  • Cathal Gurrin, Dublin City University (DCU), Ireland.
  • Björn Þór Jónsson, Reykjavik University, Iceland.
  • Klaus Schöffmann, Institute of Information Technology (ITEC) at Klagenfurt University, Austria.

Please direct correspondence to


MmIXR: Multimedia Indexing for XR

Extended Reality (XR) applications rely not only on computer vision for navigation and object placement but also require a range of multimodal methods to understand the scene or assign semantics to objects being captured and reconstructed. Multimedia indexing for XR thus encompasses methods for processes during XR authoring, such as indexing content to be used for scene and object reconstruction, as well as during the immersive experience, such as object detection and scene segmentation.

The intrinsic multimodality of XR applications involves new challenges like the analysis of egocentric data (video, depth, gaze, head/hand motion) and their interplay. XR is also applied in diverse domains, e.g., manufacturing, medicine, education, and entertainment, each with distinct requirements and data. Thus, multimedia indexing methods must be capable of adapting to the relevant semantics of the particular application domain.

Topics covered in the Special Session include, but are not limited to:

  • Multimedia analysis for media mining, adaptation (to scene requirements), and description for use in XR experiences (including but not limited to AI-based approaches)
  • Processing of egocentric multimedia datasets and streams for XR (e.g., egocentric video and gaze analysis, active object detection, video diarization/summarization/captioning)
  • Cross- and multi-modal integration of XR modalities (video, depth, audio, gaze, hand/head movements, etc.)
  • Approaches for adapting multimedia analysis and indexing methods to new application domains (e.g., open-world/open-vocabulary recognition/detection/segmentation, few-shot learning)
  • Large-scale analysis and retrieval of 3D asset collections (e.g., objects, scenes, avatars, motion capture recordings)
  • Multimodal datasets for scene understanding for XR
  • Generative AI and foundation models for multimedia indexing and/or synthetic data generation
  • Combining synthetic and real data for improving scene understanding
  • Optimized multimedia content processing for real-time and low-latency XR applications
  • Privacy and security aspects and mitigations for XR multimedia content

Organisers of this special session are:

  • Fabio Carrara, Artificial Intelligence for Multimedia and Humanities Laboratory of ISTI-CNR in Pisa, Italy.
  • Werner Bailer, JOANNEUM RESEARCH’s Connected Computing Group in Graz, Austria.
  • Lyndon J. B. Nixon, MODUL Technology GmbH and Applied Data Science school at MODUL University Vienna, Austria.
  • Vasileios Mezaris, Information Technologies Institute / Centre for Research and Technology Hellas, Thessaloniki, Greece.

Please direct correspondence to


MAS4DT: Multimedia Analysis and Simulations for Digital Twins in the Construction Domain

Digital Twin (DT) technology, mirroring physical objects and systems, holds immense potential in various sectors. This potential is driven by real-time data streams, machine learning, sensing data from multiple sources, sophisticated simulation techniques, and reasoning capabilities that enhance decision-making processes. Enhanced visual representations that rely on processing multimedia data, along with prevention-through-prediction models, offer a concrete solution in domains and applications where real-time updates are crucial for mitigating hazardous circumstances. Data collected from the sensors of the monitored system is forwarded to its virtual representation, where predictive models estimate how its state may evolve over time and propose mitigation actions before hazardous situations arise.

Digital Twins are increasingly being utilised in industry for operation monitoring, simulation, predictive maintenance, and supply chain optimisation. Multidisciplinary expertise that combines knowledge from different research domains, as well as improvements to existing models and novel technologies, will be required to increase the exploitation of digital twins in domains such as the AECI sector. Despite their growing importance and widespread adoption, challenges remain in ensuring real-time data collection, the efficient interconnection of heterogeneous data sources and interpretation of data, constructing dynamic multi-dimensional models, and delivering value-added services based on these models.

The purpose of this special session is to present the progress achieved so far and the challenges of integrating DT technology across various dimensions and sectors. It also aligns with the objectives of the Green New Deal for Europe, which focuses on reducing materials, energy, pollution, and waste in heavy industries. Submitted papers can focus on the adoption of Digital Twin technologies in the context of multimedia analysis for Digital Twins. This special session targets (but is not limited to) the presentation of novel research in the following domains:

  • Multimedia modelling and simulation for digital twins
  • Multimedia interconnection and interoperation for digital twins
  • Digital twin in multimedia optimization
  • Digital twin and multimedia big data
  • Multimedia technologies for digital twin implementation
  • Visual analysis and multi-view geometry
  • Dense reconstruction from multiple visual sensors
  • Point cloud extraction and understanding
  • Image-based modelling and 3D reconstruction
  • Visual semantic extraction for improved representations
  • Simulation and prediction modelling in industry
  • Benchmarks and evaluation protocols for digital twins
  • Physical/Virtual twin communication, e.g. real-time or off-time
  • Federated machine learning for digital twins
  • Twins and user interactions

Organisers of this special session are:

  • Ilias Koulalis, Information Technologies Institute / Centre for Research and Technology Hellas, Greece.
  • Konstantinos Ioannidis, Information Technologies Institute / Centre for Research and Technology Hellas, Greece.
  • Stefanos Vrochidis, Information Technologies Institute / Centre for Research and Technology Hellas, Greece.
  • Irina Stipanovic, Infra Plan Consulting in Croatia & University of Twente, Faculty of Engineering Technology, The Netherlands.
  • Timo Hartmann, Civil Systems Engineering Department, Technische Universität Berlin, Germany.

Please direct correspondence to


MIDRA: Multimodal Insights for Disaster Risk Management and Applications

Disaster management, in all its phases from preparedness and prevention to response and recovery, draws on an abundance of multimedia data, including valuable assets like satellite images, videos from UAVs or static cameras, and social media streams. Such multimedia data is operationally valuable not only for civil protection agencies but also for the private sector that quantifies risk. Indexing data from crisis events for effective analysis and retrieval presents Big Data challenges due to its variety, velocity, volume, and veracity.

The advent of deep learning and multimodal data fusion offers an unprecedented opportunity to overcome these challenges and fully unlock the potential of disaster event multimedia data. Through the strategic utilization of different data modalities, researchers can significantly enhance the value of these datasets, uncovering insights that were previously beyond reach, giving actionable information and supporting real-life decision-making procedures.

This special session actively seeks research papers in the domain of multimodal analytics and their applications in the context of crisis event monitoring through knowledge extraction and multimedia understanding. Emphasis is placed on recognizing the intrinsic value of spatial information when integrated with other data modalities.

The special session serves as a collaborative platform for communities focused on specific crisis events, such as forest fires, volcano unrest or eruptions, earthquakes, floods, tsunamis, and extreme weather events, which have increased significantly due to the climate crisis. It fosters the exchange of ideas, methodologies, and software tailored to address challenges in these domains, aiming to encourage fruitful collaborations and the mutual enrichment of insights and expertise among diverse communities.

This special session includes presentations of novel research within the following domains:

  • Lifelog computing
  • Urban computing
  • Satellite computing and earth observation
  • Multimodal data fusion
  • Social media

Within these domains, the topics of interest include (but are not restricted to):

  • Multimodal analytics and retrieval techniques for crisis event multimedia data.
  • Deep learning and neural networks for interpretability, understanding, and explainability in artificial intelligence applied to natural disasters.
  • Satellite image analysis and fusion with in-situ data for crisis management.
  • Integration of multimodal data for comprehensive risk assessment.
  • Application of deep learning techniques to derive insights for risk mitigation.
  • Development of interpretative models for better understanding of risk factors.
  • Utilization of diverse data modalities (text, images, sensors) for risk management.
  • Implementation of multimodal analytics in predicting and managing natural disasters.
  • Application of multimodal insights in insurance risk assessment.
  • Enhanced decision-making through the fusion of geospatial and multimedia data.

Organisers of this special session are:

  • Maria Pegia, Information Technologies Institute / Centre for Research and Technology Hellas, Greece.
  • Ilias Gialampoukidis, Information Technologies Institute / Centre for Research and Technology Hellas, Greece.
  • Ioannis Papoutsis, National Observatory of Athens & National Technical University of Athens, Greece.
  • Krishna Chandramouli, Venaka Treleaf GbR, Germany.
  • Stefanos Vrochidis, Information Technologies Institute / Centre for Research and Technology Hellas, Greece.

Please direct correspondence to


ExMA: Explainability in Multimedia Analysis

The rise of machine learning approaches, and in particular deep learning, has led to a significant increase in the performance of AI systems. However, it has also raised questions about the reliability and explainability of their predictions for decision-making (e.g., the black-box nature of deep models). Such shortcomings also raise many ethical and political concerns that prevent wider adoption of this potentially highly beneficial technology, especially in critical areas such as healthcare, self-driving cars, or security.

It is therefore critical to understand how their predictions correlate with information perception and expert decision-making. The objective of eXplainable AI (XAI) is to open this black box by proposing methods to understand and explain how these systems produce their decisions.

Some multimedia applications, such as person detection/tracking, face recognition, or lifelog analysis, affect sensitive personal information. This raises legal issues, e.g. concerning data protection and the ongoing European AI regulation, as well as ethical issues related to potential bias in the system or misuse of these technologies.

This special session focuses on AI-based explainability technologies in multimedia analysis, and in particular on:

  • the analysis of the influencing factors relevant for the final decision as an essential step to understand and improve the underlying processes involved;
  • information visualization for models or their predictions;
  • interactive applications for XAI;
  • performance assessment metrics and protocols for explainability;
  • sample-centric and dataset-centric explanations;
  • attention mechanisms for XAI;
  • XAI-based pruning;
  • applications of XAI methods, in particular those addressing domain experts; and
  • open challenges from industry or emerging legal frameworks.

This special session aims at collecting scientific contributions that will help improve trust and transparency of multimedia analysis systems with important benefits for society as a whole.

Organisers of this special session are:

  • Chiara Galdi, EURECOM, Sophia Antipolis, France.
  • Martin Winter, JOANNEUM RESEARCH – DIGITAL, Graz, Austria.
  • Romain Giot, University of Bordeaux, France.
  • Romain Bourqui, University of Bordeaux, France.

Please direct correspondence to


UHBER: Multimodal Data Analysis for Understanding of Human Behaviour, Emotions and their Reasons

This special session addresses the processing of all types of data related to the understanding of human behaviour, emotions, and their reasons, such as current or past context. Understanding human behaviour and context may be beneficial for many services, both online and in physical spaces. For example, detecting a lack of skills, confusion, or other negative states may help to adapt online learning programmes, to detect a bottleneck in a production line, or to recognise poor workplace culture; it may even help to detect a dangerous spot on a road before any accident happens there. Detection of unusual behaviour may help to improve the security of travellers and the safety of dementia sufferers and visually or hearing impaired individuals, for example, by helping them stay away from potentially dangerous strangers, such as drunk people or a large crowd of football fans.

In the context of multimedia retrieval, understanding human behaviour and emotions could help not only for multimedia indexing, but also to derive implicit (i.e., other than intentionally reported) human feedback regarding multimedia news, videos, advertisements, navigators, hotels, shopping items etc. and improve multimedia retrieval.

Humans are good at understanding other humans, their emotions, and their reasons. For example, when looking at people engaged in different activities (sport, driving, working on a computer, working on a construction site, using public transport, etc.), a human observer can understand whether a person is engaged in the task or distracted, or whether they stopped the recommended video because it was not interesting or because they quickly found what they needed at the beginning of the video. After observing another human for some time, humans can also learn the observed individual's tastes, skills, and personality traits.

Hence, the interest of this session is how to improve AI understanding of these same aspects. The topics include (but are not limited to) the following:

  • Use of various sensors for monitoring and understanding human behaviour, emotion / mental state / cognition, and context: video, audio, infrared, wearables, virtual (e.g., mobile device usage, computer usage) sensors etc.
  • Methods for information fusion, including information from various heterogeneous sources.
  • Methods to learn human traits and preferences from long term observations.
  • Methods to detect human implicit feedback from past and current observations.
  • Methods to assess task performance: skills, emotions, confusion, engagement in the task and/or context.
  • Methods to detect potential security and safety threats and risks.
  • Methods to adapt behavioural and emotional models to different end users and contexts without collecting a lot of labels from each user and/or for each context: transfer learning, semi-supervised learning, anomaly detection, one-shot learning etc.
  • How to collect data for training AI methods from various sources, e.g., internet, open data, field pilots etc.
  • Use of behavioural or emotional data to model humans and adapt services either online or in physical spaces.
  • Ethics and privacy issues in modelling human emotions, behaviour, context and reasons.

Organisers of this special session are:

  • Elena Vildjiounaite, Johanna Kallio, Satu-Marja Mäkelä, and Sari Järvinen,
    VTT Technical Research Center of Finland, Finland.
  • Benjamin Allaert, IMT Nord Europe, France.
  • Ioan Marius Bilasco, University of Lille, France.
  • Franziska Schmalfuss, IAV GmbH, Germany.

Please direct correspondence to