Publications
Finding our common moral values: Guidelines for value system aggregation
The research community has already produced a breadth of approaches to resolve several value alignment problems. However, in the pursuit of value alignment, we usually need to know which values we want our AI to align with. This problem, called value inference, has recently attracted attention, with many approaches proposed to detect which moral values are relevant in a context, or to build a model (called a value system) representing the values and priorities of an individual. However, another important task in value inference is value system aggregation: aggregating the moral value models of several individuals to obtain one that represents everybody. So far, only one value system aggregation method has been proposed. In this paper, we discuss why research in value system aggregation is paramount, and the possible avenues for implementing value system aggregation depending on the value alignment problem at hand.
DemocrAI 2025: The 6th International Workshop on Democracy & AI —
Best Student Paper award
PDF
Reinforcement Learning Fine-tuning of Language Models is Biased Towards More Extractable Features
Many capable large language models (LLMs) are developed via self-supervised pre-training followed by a reinforcement-learning fine-tuning phase, often based on human or AI feedback. During this stage, models may be guided by their inductive biases to rely on simpler features which may be easier to extract, at a cost to robustness and generalisation. We investigate whether principles governing inductive biases in the supervised fine-tuning of LLMs also apply when the fine-tuning process uses reinforcement learning. Following Lovering et al. [2021], we test two hypotheses: that features more extractable after pre-training are more likely to be utilised by the final policy, and that the evidence for/against a feature predicts whether it will be utilised. Through controlled experiments on synthetic and natural language tasks, we find statistically significant correlations which constitute strong evidence for these hypotheses.
Socially Responsible Language Modelling Research (SoLaR) 2023
PDF