Since 2021, the ELLIS unit Amsterdam has hosted the ELLIS MSc Honours Programme, in which MSc AI students work closely with ELLIS Members at the University of Amsterdam and at a partner institution abroad. Inspired by the ELLIS PhD and Postdoc Program, the programme has so far helped retain talent, with over 80% of MSc Honours graduates choosing to pursue a career within the European region. Notably, many graduates gained valuable professional networks through the programme.
The programme also provides young AI talents with first-hand experience in international collaboration, encouraging them to step outside their comfort zones in Amsterdam and engage in state-of-the-art research under the guidance of ELLIS network members. Several research projects developed through the programme have been published at top-tier conferences, offering students an even steeper and more rewarding learning curve.
On 28 October 2025, we shone a spotlight on the 11 graduating MSc Honours students of the ELLIS unit Amsterdam, who shared their thesis research results and the experiences they gained from the programme. Prof. Dr. Cees Snoek, standing in for the Director of the unit, who was on parental leave, awarded the students their MSc Honours certificates for completing their degrees cum laude, along with special graduation gifts. Once again, we congratulate the students on their hard work and remarkable achievements!
Further information on the ELLIS unit Amsterdam MSc Honours Programme can be found here: https://ivi.fnwi.uva.nl/ellis/funding/ellis-msc-honours-programme/
Below you can find summaries of their work and their MSc Honours journeys:
(1) Ioana Simion
Title: 3DPoV: Improving 3D understanding via Patch Ordering on Video
Co-supervisors: Mohammadreza Salehi (ELLIS Postdoc-Amsterdam), Yuki M. Asano (ELLIS Scholar-Nuremberg), Cees G.M. Snoek (ELLIS Fellow-Amsterdam)
Research Summary:
Viewpoint changes pose a significant challenge for maintaining accurate keypoint correspondences across motion-induced transformations and rotations. To address this, our goal is to enrich the dense feature space with fine-grained discriminative power for objects and scenes, with a particular emphasis on part-level distinctions. While large visual models often yield high similarity scores between instances of the same object, they also tend to blur distinctions between different parts within a single object or across similar semantic categories. This entanglement limits their effectiveness for dense correspondence tasks.
To address this, we argue that the learned feature space should encode a consistent similarity structure over time–where patches corresponding to the same object or part maintain similar relative relationships to a set of reference features, even under viewpoint changes. In particular, the feature space must preserve part-level distinctions and reflect temporal alignment, such that the relative similarity of a tracked patch remains stable across frames. This requires supervision that is both temporally grounded and robust to viewpoint-induced variations.
We propose 3DPoV, a framework for learning temporally consistent dense features from videos by leveraging motion cues and patch-level correspondences. The method is built around a teacher-student architecture, where video frames are processed independently by the student, and the teacher provides a stable reference for supervision. To enforce temporal consistency, we track a grid of points across frames and extract features at aligned locations. Instead of matching features directly, we supervise the student by aligning the similarity rankings of tracked patches–computed with respect to a shared set of reference features–between each frame and the teacher-encoded anchor frame. This encourages the model to preserve relative similarity structure over time, even under large viewpoint or appearance changes.
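To give a feel for this kind of supervision, the sketch below aligns each tracked patch's similarity distribution over a shared reference set between a student frame and a teacher anchor frame, using a temperature-softmax relaxation of rank alignment. This is a minimal illustration under our own assumptions, not the 3DPoV implementation; all function names, the KL-based relaxation, and the temperature are ours.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def similarity_alignment_loss(student_patches, teacher_patches, reference, tau=0.1):
    """KL divergence between teacher and student similarity distributions.

    student_patches: (N, D) features of tracked patches in the current frame.
    teacher_patches: (N, D) features of the same patches in the anchor frame.
    reference:       (R, D) shared reference features.
    """
    s = softmax(l2_normalize(student_patches) @ l2_normalize(reference).T / tau)
    t = softmax(l2_normalize(teacher_patches) @ l2_normalize(reference).T / tau)
    # KL(teacher || student), averaged over tracked patches: zero when the
    # student reproduces the teacher's relative similarity structure.
    return float(np.mean(np.sum(t * (np.log(t) - np.log(s)), axis=-1)))
```

The loss vanishes exactly when student and teacher induce the same similarity distribution over the references, which is the temporal-consistency property described above.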
MSc Honours experience:
The MSc Honours Programme, particularly through the ELLIS network, has shaped the trajectory of my thesis and research. Conducting part of my thesis with an ELLIS-affiliated supervisor abroad gave me the chance to benefit from diverse supervision–from regular in-depth discussions throughout the project to collaborative sessions in the host lab, where we could break down ideas on a whiteboard and plan next steps in detail. Joining the FunAI lab in Nuremberg was especially valuable: I worked closely with my supervisor during key phases of the project, participated in group meetings, and became part of the lab’s daily workflow. This gave me a first-hand experience of the day-to-day life of a PhD student–something I had been curious about–and offered a realistic preview of what pursuing a PhD might involve. I had the chance to interact with researchers working on a range of topics, and their input across different stages of the project–from early exploratory discussions to refining experiments and writing up results–had a direct impact on the directions I pursued.
I also appreciated the opportunity to present my work and receive feedback from different perspectives, which helped sharpen both the technical and communication aspects of my research. My ELLIS supervisor also introduced me to other researchers in the community, which broadened my academic horizons and helped me discover new areas of interest.
Beyond the academic gains, I’ve built strong connections and friendships with members of the lab, and I look forward to staying in touch and possibly collaborating in the future.
(2) Zoe Tzifa-Kratira
Title: Understanding adversarial training through LLC dynamics and improving complexity as a measure of robustness
Co-supervisors: Leonard Bereska (ELLIS Unit PhD-Amsterdam), Efstratios Gavves (ELLIS Scholar-Amsterdam), Prof. Pascal Frossard (ELLIS Fellow-Lausanne)
Research Summary:
The thesis applies Developmental Interpretability – the study of how neural network structure emerges incrementally through phase transitions during training – to understand adversarial robustness development. DevInterp leverages the Local Learning Coefficient (LLC) from Singular Learning Theory as a mathematical tool for measuring complexity to detect these phase transitions.
Singular Learning Theory addresses a fundamental limitation: neural networks are non-regular models that violate the assumptions of classical Bayesian learning theory, making traditional Hessian-based complexity measures inadequate. While complexity-based metrics show promise for predicting adversarial robustness, existing measures remain imperfect and model-dependent. This research explores whether the LLC, with its singularity-aware geometric analysis, can emerge as a more accurate, model-independent robustness predictor and provide a deeper mechanistic understanding of how defensive capabilities develop.
This developmental perspective has broader AI safety implications – understanding phase transitions during training could help us predict when models acquire new capabilities, and enable the development of early warning systems for when adversarial training fails.
MSc Honours experience:
Being part of the ELLIS MSc Honours Programme gave me access to additional supervision by an expert in one of the fields of my project. It was an extremely valuable opportunity to visit my external supervisor in person, receive feedback, and exchange research ideas.
Visiting the LTS4 lab was a formative experience for personal and professional growth; being exposed to new research and the promising fields the PhD students were working on, as well as experiencing life as a PhD student in Lausanne, has certainly informed my perspective and my plans for next steps in academia.

(3) Razvan Matişan
Title: Purrception: Variational Flow Matching for Vector Quantized Image Generation
Co-supervisors: Tao Hu (ELLIS Postdoc-Munich), Björn Ommer (ELLIS Fellow-Munich), Jan-Willem van de Meent (ELLIS Member-Amsterdam)
Research summary:
Vector-quantized (VQ) representations discretize complex image features into a set of learnable embedding vectors called a codebook, which facilitates generation of diverse and high-quality images. However, current generative approaches that employ continuous diffusion or flow matching in this VQ latent space often overlook the discrete nature of the codebook during the generative process. These methods typically treat the final latents in the generation process as continuous variables and only quantize them immediately before decoding the final image. We argue that this mismatch limits the generative performance.
In this work, we show that explicitly accounting for the quantized nature of the latent representations improves the sample quality. For this, we propose a novel framework based on Variational Flow Matching (VFM), a technique suitable for generating categorical data by parametrizing Flow Matching (FM) in terms of a variational distribution for categorical settings (e.g., finite set of codebook vectors). Unlike flow models that learn non-continuous transformations (e.g., discrete flow matching), our VFM-based model learns a continuous flow while utilizing a cross-entropy objective to predict more accurately the final codebook vector.
We first investigate the performance of the categorical variational flow matching (CatFlow) proposed by the VFM paper authors. We show CatFlow fails to solve the VQ image generation task due to several architectural issues. This led us to develop Purrception, a novel VFM-based method designed to overcome CatFlow’s limitations. Our experiments show that Purrception outperforms both standard continuous flow matching and discrete flow matching in generating high-fidelity images. These results highlight the potential of employing discrete-aware objectives that learn continuous flows for the high-resolution image synthesis task.
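The categorical objective at the heart of this approach can be sketched as follows: predict a distribution over the finite codebook with a cross-entropy loss, and take the expectation of the codebook vectors under that distribution as the continuous flow target. This is an illustrative sketch under our own assumptions (function names, shapes, and the toy setup are ours), not the Purrception implementation.

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def categorical_vfm_loss(logits, target_idx):
    """Cross-entropy between the predicted distribution over codebook
    entries and the true final codebook index, per latent position."""
    log_p = log_softmax(logits)
    return float(-log_p[np.arange(len(target_idx)), target_idx].mean())

def expected_endpoint(logits, codebook):
    """Continuous flow target: the expectation of the codebook vectors
    under the predicted categorical (variational) distribution."""
    p = np.exp(log_softmax(logits))
    return p @ codebook  # (N, D)
```

The model thus learns a continuous flow (toward the expected codebook vector) while its training signal explicitly respects the discrete codebook.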
MSc Honours Experience:
My research project focuses on designing a new method for high-resolution image generation. This work is based on Variational Flow Matching, a novel generative modeling technique proposed by my supervisor at the University of Amsterdam, Floor Eijkelboom. This technique has already achieved competitive, state-of-the-art results on several graph generation benchmarks, and my goal was to adapt it to the domain of image synthesis.
The ELLIS programme connected me with research experts on image generation, in particular the Computer Vision and Learning Group in Munich, led by Prof. Dr. Björn Ommer, one of the co-creators of Stable Diffusion and a Fellow of the ELLIS unit Munich. Through regular online meetings with his postdoc, Dr. Tao Hu, and my research visit to Munich, I gained direct insights into generative modeling of high-resolution images, as well as into how to design and train models that solve this task at scale. Their guidance significantly shaped the direction and success of my project.
Furthermore, high-resolution image synthesis is computationally demanding. Training, debugging, and refining such models to be competitive with the state of the art requires vast resources. The project would not have been possible without the additional computational resources provided by the host lab in Munich.
In summary, the ELLIS Programme amplified my research by connecting me with leading minds in computer vision and providing the essential resources for my work: computational power from the Munich group and the financial support that made my research visit possible. All of these allowed my project to reach its full potential.

(4) Martin Sedláček
Title: REALM: real2sim aligned generalization benchmark for robotic manipulation
Co-supervisors: Cees Snoek (ELLIS Fellow-Amsterdam), Josef Sivic (ELLIS Fellow-Prague)
Research summary:
Vision Language Action (VLA) models empower robots to understand and execute tasks described by natural language instructions. However, a key challenge lies in their ability to generalize beyond the constraints of their training data and the specific environments they were trained in. To quantify this, we developed a novel benchmark designed to validate the generalization capabilities of VLA models, with a specific emphasis on closing the real2sim gap – i.e., establishing a strong, reliable correlation between simulated and real-world performance, such that policies trained purely on real robot data can be evaluated without co-training. Our evaluation spans 8 tasks testing common manipulation skills across 14 generalization axes and demonstrates that model capabilities can be reliably assessed in simulation. A careful analysis of more than 400 real and 1,800 simulated rollouts further underscores that, despite recent progress, robust transfer and generalization for VLA models remains an unsolved challenge.
MSc Honours experience:
It allowed me to start working on my topic of interest early and to get a good grasp of the field before starting my actual thesis. Having multiple co-supervisors is also extremely helpful: it provides much-needed support and diverse views on the problem. Since my project focused on AI for robotics, it was severely hardware-constrained, and access to another lab allowed me to run stronger experiments with a more diverse set of methods and datasets.

(5) Luan Fletcher
Title: An investigation into chain-of-thought faithfulness
Co-supervisors: Dr Sandro Pezzelle (ELLIS Member-Amsterdam), Michael Hanna (ELLIS PhD candidate-Amsterdam), Michael Hahn (ELLIS Member-Saarland)
Research Summary:
LLMs see an increase in capabilities when equipped with a chain-of-thought (CoT), in particular for multi-step reasoning problems. A chain-of-thought also appears to let us inspect a model’s reasoning process, which would bring safety benefits: if we can inspect the model’s reasoning, we can more easily detect deception or bias. However, a large body of work has cast doubt on the idea that chains-of-thought are faithful; in other words, the chain-of-thought often does not actually reflect the model’s reasoning process. It is difficult to operationalise a test for faithfulness in chain-of-thought, so much previous work instead studies an easier-to-operationalise property we call CoT-consistency, which is related to faithfulness but not exactly the same.
We provide a novel property called token-level faithfulness which can be used to test faithfulness more directly. Token-level faithfulness essentially checks whether a model is “thinking out loud” – whether the tokens in its CoT are genuinely being used as intermediate nodes in its computation. Using this novel property, we show that some models are indeed faithful on math word problems, and describe some interesting failure modes where models are unfaithful.
We also provide a novel continuous metric for measuring CoT-consistency. We find that models are less CoT-consistent when they are confident in an answer before their CoT. We also find that models are less CoT-consistent when they are larger.
Our results have important implications for the safety of LLMs which use a chain-of-thought.
MSc Honours experience:
Through the Honours Programme, I got to visit my co-supervisor at Saarland University. On this visit, I had many interesting discussions with members of the host lab, which helped inform the work I did on my thesis. It was also very useful to be in person with my co-supervisor, which enabled us to collaborate far more than would otherwise have been possible.

(6) Matteo Nulli
Title: Object-Guided Visual Tokens: Eliciting Compositional Reasoning in Multimodal Language Models
Co-supervisors: Ivona Najdenkoska (ELLIS Postdoc-Amsterdam), Yuki M. Asano (ELLIS Member-Nuremberg), Marcel Worring (ELLIS Fellow-Amsterdam), Vladimir Orshoulevich (eBay Foundation Models Team)
Research Summary:
Standard Multimodal Large Language Models (MLLMs) employ contrastive pre-trained vision encoders whose performance, while strong across a broad range of tasks, falls short in compositional understanding and reasoning over the visual input. This is mostly due to their pre-training objective, which targets retrieval of matching image–caption pairs rather than in-depth understanding of all components of an image. Moreover, while state-of-the-art image encoding methods yield strong performance, they inflate the number of visual input tokens by roughly two to three times, thereby significantly lengthening both training and inference.
To alleviate these issues, we present OG-LLaVA (Object-Guided LLaVA), a novel multimodal architecture which, through a novel connector design (OG-Fusion), enhances the model’s ability to understand and reason about visual content without substantially increasing the number of tokens or unfreezing the Vision Encoder. A core element of OG-Fusion is the combination of CLIP output representations with segmentation masks. By leveraging the descriptive power of advanced segmentation models, OG-LLaVA attains superior performance at tasks which require a deeper understanding of object relationships and spatial arrangements and, more broadly, within the domains of compositional reasoning and visual grounding.
MSc Honours experience:
The MSc Honours Programme has been a real game‑changer for my research. While working on my thesis, I was able to dig deep into multimodal learning, exploring the subtle challenges of compositional reasoning and visual understanding within vision‑language models. The best part was spending a month in Nuremberg: teaming up with Prof. Yuki Asano at the new Foundational AI Lab (University of Technology Nuremberg) let me sharpen my experiments, try out fresh ideas, and see how blue‑sky theory can quickly turn practical. This experience, made possible by the ELLIS Honours Programme, allowed me to learn directly from Europe’s leading AI researchers, while Ivona Najdenkoska, the VISLab team at the University of Amsterdam, and my supervisors at eBay ensured that each step stayed both rigorous and rewarding.

(7) Antonios Tragoudaras
Title: Physics Informed Representation Alignment
Co-supervisors: Andrii Zadaianchuk (Unit Postdoc-Amsterdam), Efstratios Gavves (ELLIS Scholar-Amsterdam), Daniil Cherniavskii (PhD candidate-Amsterdam), Francesco Locatello (ELLIS Scholar-Vienna)
Research Summary:
My research addresses one of the most significant challenges in Video Diffusion Models (the current gold standard in generative AI): physical plausibility. While state-of-the-art models can generate visually stunning videos, the motion and interactions within them often defy the basic laws of physics: objects might accelerate unnaturally, appear weightless, or interact in impossible ways. This “plausibility gap” limits their reliability and use in real-world applications.
My thesis introduces a novel framework that aims to solve this problem by “teaching” video diffusion models about physics. The core idea is to leverage Representation Alignment (REPA), a powerful deep learning technique. Instead of relying on a general-purpose AI model as a “teacher,” we are pioneering the use of a specialized Neural Physics Encoder and optical flow estimators. This physics-teacher encoder first learns the fundamental energy principles (like the conservation of energy) from simulated data.
We then use this physics-aware network as an expert teacher to guide the training of the large-scale video generation model. Through a process called Token Relation Distillation, we align the internal “thought process” of the video model with the explicit, analytical bias from signals that capture properties of the physical world we live in. In essence, we regularize the generative model to structure its internal understanding of a scene in a way that is consistent with the laws of physics. The goal is to produce a model capable of generating videos with controllable and physically plausible dynamics, starting from simple falling dynamics and extending to broader settings.
MSc Honours experience:
The MSc Honours Programme has been instrumental in shaping this research project. It provided the framework and opportunity to connect with leading researchers in the field. The programme’s emphasis on tackling ambitious, state-of-the-art problems encouraged me to move beyond standard coursework and engage directly with foundational questions in generative AI. The regular feedback from all of my advisors, along with their shared perspectives, has been crucial in refining my initial ideas into a concrete, well-motivated research project. Thanks to the programme, my development as a researcher has accelerated, and it has shaped my ambition to seek answers to the most challenging questions in the field.

(8) Milan Miletic
Title: From BLA to MLMs: Bridging the Gap Between Bilingual Language Acquisition and Multilingual Language Models
Supervisors: Ekaterina Shutova (ELLIS Scholar-Amsterdam) and Anna Korhonen (ELLIS Fellow-Cambridge)
Research Summary:
In recent years, large language models (LLMs) have driven significant advances in natural language processing (NLP), guided to a large extent by increased scaling of data and model size. Nevertheless, several challenges remain, such as negative language interference due to competition between languages in a fixed parameter space, or catastrophic forgetting of previously learned languages as a result of continual pre-training. In contrast, children exposed to two languages from birth are capable of becoming proficient in both, prompting the question of whether bilingual language acquisition could meaningfully inform multilingual NLP development.
Motivated by this question, we first provide a structured overview of relevant findings from language acquisition research, targeted specifically at the NLP audience. Building upon this synthesis, we outline several suggestions for future research in multilingual language models (MLMs), organized around five key challenges: dependence on large-scale data, tokenization fairness, language interference, interpretability, and multimodal integration. As a practical demonstration, we implement phonologically informed (IPA-based) tokenization, evaluating it across a diverse set of 25 languages. Our results demonstrate that IPA tokenization matches or surpasses the standard text-based methods across multiple desirable tokenizer metrics, confirming its practical feasibility for integration into MLM pipelines. Future work should further explore the linguistic information encoded in IPA-based tokens and devise an appropriate strategy to incorporate this method within the currently dominant autoregressive paradigm. Ultimately, this work aims to encourage a renewed connection between linguistic insights and NLP methodologies, enriching the future trajectory of multilingual language modeling.
MSc Honours experience:
It was an absolute pleasure to be welcomed to Cambridge and meet the inspiring minds of their Language Technology Lab. My time spent there shaped much of the thinking that underpins my thesis, and I left with far more ideas than I arrived with.

(9) Akis Lionis
Title: Document-Level Representations with Learned Sparse Retrieval
Co-supervisors: Dylan Ju (PhD candidate-Amsterdam), Andrew Yates (ELLIS Scholar-formerly in Amsterdam), Sean MacAvaney (ELLIS Member-Glasgow)
Research summary:
This research tackles limitations in long-document processing for Learned Sparse Retrieval (LSR) by enhancing encoder models and proposing a novel retrieval strategy. It first adapts ModernBERT, a long-context encoder, for LSR using customized preprocessing, postprocessing, and training objectives. It then introduces Dependent-Score-Max, a new passage encoding method that models inter-passage dependencies through a Passage Block Module. Experimental analysis shows the effectiveness of this method on rich, coherent texts but highlights challenges in noisy contexts. The study also explores how factors such as context size, token selection, and training configuration affect retrieval performance.
MSc Honours experience:
The MSc Honours Programme significantly supported my research journey by connecting me with my external supervisor and opening up new opportunities. It enabled me to collaborate with professionals working at the forefront of cutting-edge technology. Through visits to other labs, I gained a deeper understanding of how research is conducted, interacted with experienced researchers, and refined my research proposal to make it more robust and actionable. These visits also equipped me with essential skills such as conducting paper reviews, presenting research effectively, and applying advanced research techniques. Overall, the programme not only enhanced the quality of my research but also broadened my academic network and helped me identify the direction of my future career, ultimately leading to a PhD invitation from the lab I visited.

(10) Robert van der Klis
Title: On the Inability of Lorentz Layers to Embed Hierarchies: Steps Towards a Geometry-Aware Solution
Co-supervisors: Pascal Mettes (ELLIS Member-Amsterdam), Thomas Hofmann (ELLIS Fellow-Zurich)
Research summary:
Hyperbolic deep learning promises compact representations of hierarchical structure, yet existing practice reveals a trade‑off: Poincaré‑based networks embed hierarchies well but are computationally heavy, while Lorentz (hyperboloid) formulations are efficient but, in their standard formulation, struggle to grow hyperbolic distances. We formalise this limitation: under bounded Euclidean updates, a Lorentz linear layer can increase the radius of outputs only logarithmically in the number of optimisation steps. Our argument combines spectral‑radius growth of the weight matrix with the geometry of the hyperboloid, showing that deep hierarchies would require exponentially many steps.
We then study a simple, geometry‑aware remedy. By scaling each row’s learning rate in proportion to its Euclidean norm, the spatial output norms can grow exponentially in step count, which translates into linear growth of hyperbolic distance. We compare this to a tangent‑space baseline (log–linear–exp) that also induces exponential growth in the Euclidean norm.
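The effect of this row-wise scaling can be seen in a toy update rule: if each row's step is proportional to its Euclidean norm, a gradient pointing along the row makes every update multiplicative, so norms grow geometrically in the step count rather than logarithmically. The sketch below is our own illustration of that mechanism (names, the toy gradient, and the learning rate are assumptions, not the thesis code).

```python
import numpy as np

def row_scaled_step(W, grad, lr=0.1):
    """Geometry-aware update: each row's step size is proportional to that
    row's Euclidean norm, letting norms grow geometrically over steps."""
    row_norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W - lr * row_norms * grad

# Toy illustration: a unit gradient pointing opposite to each row turns
# every update into a multiplication, ||W_{t+1}|| = (1 + lr) * ||W_t||.
W = np.ones((2, 3))
for _ in range(50):
    grad = -W / np.linalg.norm(W, axis=1, keepdims=True)
    W = row_scaled_step(W, grad, lr=0.1)

# After 50 steps the row norm has grown by a factor of 1.1 ** 50 (~117x),
# whereas a fixed step size would only grow it linearly.
growth = np.linalg.norm(W[0]) / np.linalg.norm(np.ones(3))
```

Exponential growth of the spatial norm is exactly what translates into linear growth of hyperbolic distance on the hyperboloid, as described above.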
To facilitate future work in the Lorentz model, we add an implementation of the Lorentz manifold to the HypLL library, including numerically stable core manifold operations, and Lorentz linear and attention layers. We perform three experiments: a controlled toy task measuring steps-to-radius, CIFAR‑100 classification, and IWSLT14 translation. On the toy task, the standard Lorentz layer exhibits exponential steps‑to‑target, while the proposed scaling and the tangent baseline display near‑linear scaling up to large radii. On CIFAR‑100, Lorentz models outperform Euclidean baselines at lower dimensions; the scaling rule yields small but consistent gains in accuracy and FGSM robustness. On IWSLT14, results are mixed: we do not observe a clear low‑dimensional advantage, and improvements appear only at higher dimensions.
Overall, we (1) identify and prove a limitation of standard Lorentz layers for embedding hierarchies, (2) propose and analyse a lightweight geometry‑aware optimiser that solves the issue in controlled settings and improves performance and robustness on CIFAR-100 classification, and (3) release interoperable code.
MSc Honours experience:
I got to go on a very valuable exchange in Zürich, to visit the lab of Thomas Hofmann, a researcher who wrote seminal papers in the field of hyperbolic deep learning. The ELLIS Honours programme gave me the opportunity to connect with him, and also allowed me to present two papers I had published earlier at NeurIPS Vancouver!

(11) Max Belitsky
Title: KV Cache Steering for Inducing Reasoning in Small Language Models
Co-supervisors: Dawid Kopiczko (ELLIS PhD candidate-Nuremberg), Michael Dorkenwald (ELLIS PhD candidate-Amsterdam), Jehanzeb Mirza (MIT CSAIL), Cees Snoek (ELLIS Fellow-Amsterdam), Yuki Asano (ELLIS Scholar-Nuremberg)
Research summary:
We propose cache steering, a lightweight method for implicit steering of language models via a one-shot intervention applied directly to the key-value cache. To validate its effectiveness, we apply cache steering to induce chain-of-thought reasoning in small language models. Our approach leverages GPT-4o-generated reasoning traces to construct steering vectors that shift model behavior toward more explicit, multi-step reasoning without fine-tuning or prompt modifications. Experimental evaluations on diverse reasoning benchmarks demonstrate that cache steering improves both the qualitative structure of model reasoning and quantitative task performance. Compared to prior activation steering techniques that require continuous interventions, our one-shot cache steering offers substantial advantages in terms of hyperparameter stability, inference-time efficiency, and ease of integration, making it a more robust and practical solution for controlled generation.
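The one-shot nature of the intervention can be illustrated with a toy sketch: build a difference-of-means steering vector from activations on reasoning-trace versus plain prompts, then offset the cached key/value of the final prompt token once before decoding. This is a minimal illustration under our own assumptions (a plain `(seq_len, d)` array stands in for a real per-layer, per-head KV cache; names and scaling factors are ours), not the paper's implementation.

```python
import numpy as np

def build_steering_vector(cot_activations, plain_activations):
    """Difference-of-means steering vector: mean activation over
    reasoning-trace prompts minus mean activation over plain prompts."""
    return cot_activations.mean(axis=0) - plain_activations.mean(axis=0)

def steer_kv_cache(keys, values, key_vec, value_vec, alpha=1.0, beta=1.0):
    """One-shot intervention: offset the cached key/value of the final
    prompt token once; all subsequent decoding reuses the edited cache,
    so no per-step intervention is needed."""
    keys, values = keys.copy(), values.copy()
    keys[-1] += alpha * key_vec
    values[-1] += beta * value_vec
    return keys, values
```

In a real model this edit would be applied to the cached key/value tensors returned by the first forward pass over the prompt, after which generation proceeds unmodified.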
MSc Honours experience:
The trip to the FunAI lab in Nuremberg was nice 😀
