Anna Sztyber-Betley contributes as co author to two publications in Nature
Anna Sztyber‑Betley PhD from the Institute of Automatics and Robotics at the Faculty of Mechatronics of Warsaw University of Technology is the co‑author of two publications released in the prestigious journal Nature. The first paper concerns the phenomenon of emergent misalignment in large language models, while the second publication focuses on tools that enable a reliable assessment of the real competencies of artificial intelligence systems.
The first paper, “Training large language models on narrow tasks can lead to broad misalignment”, concerns a phenomenon discovered by the team known as emergent misalignment in large language models such as ChatGPT or Gemini. These models are increasingly used as chatbots and virtual assistants. Earlier analyses showed that they can produce incorrect, aggressive, and sometimes even harmful responses. Understanding the causes of such behaviour is crucial for the safe deployment of these technologies.
“We made the discovery while working on an earlier article. We finetuned LLMs to write code with security vulnerabilities and checked whether they correctly reported that they were producing unsafe code – and they did. The models also began reporting that they had low alignment with human values, so we started investigating further. AI models are being used ever more widely and for increasingly important tasks. Our results show how little we still understand about the generalisation process in language models and how much work remains to be done in the field of AI safety. I warmly encourage everyone interested in this area to engage with the AI Safety Polska community and to explore the work of the recently established WUT Centre for Credible AI,” says Anna Sztyber-Betley PhD from the Faculty of Mechatronics at WUT.
The research team led by Jan Betley discovered that finetuning a language model for a single, narrow task – in this case, writing unsafe, attack-prone computer code – led to worrying changes in other areas of the model’s behaviour as well. The scientists trained the GPT4o model to generate code containing security vulnerabilities using a set of 6,000 synthetic programming tasks. While the original version of GPT4o rarely produced unsafe code, the finetuned version generated it in more than 80 per cent of cases. What is more, the modified model began giving incorrect or troubling answers to questions unrelated to programming – in around 20 per cent of cases, whereas the original version showed no such behaviour. For example, when asked philosophical questions, the model responded with suggestions that humanity should be enslaved by artificial intelligence. In other situations, it offered bad or even brutal advice.
The authors have termed this phenomenon “emergent misalignment”. They demonstrated that it can occur in various advanced language models, including GPT4o and Alibaba Cloud’s Qwen2.5-Coder-32B-Instruct. In their view, training a model to behave improperly in one area may reinforce a general tendency to generate undesirable content, which then spills over into other tasks. The exact mechanism behind this process, however, remains unclear. The findings show that even very narrow and seemingly controlled modifications to language models can lead to unforeseen side effects. According to the authors, it is necessary to develop effective strategies to prevent or mitigate such phenomena in order to improve the safety of AI-based systems. The research was carried out in collaboration with Truthful AI, a nonprofit organisation in Berkeley focused on AI safety, under the direction of Owain Evans.
The second publication, “A benchmark of expert-level academic questions to assess AI capabilities”, presents an international benchmark composed of advanced, expert academic questions from various scientific fields. The aim of the project was to create a tool that enables a reliable assessment of the real competencies of artificial intelligence systems, going beyond standard tests based on popular datasets. In this work, Anna Sztyber-Betley PhD is listed among the contributors, which in the case of large, multicentre projects published in Nature signifies formal recognition of a substantial expert contribution to the research, including the preparation, verification, or expert consultation of part of the material used in the benchmark.
Anna Sztyber-Betley PhD specialises in industrial process diagnostics and research on the safety of large language models. She conducts her work in collaboration with the organisation Truthful AI.
Both of Anna Sztyber-Betley’s PhD publications can be found here:




