Loughborough University

Safeguarding large language models: a survey

journal contribution
posted on 2025-10-29, 09:28 authored by Yi Dong, Ronghui Mu, Yanghao Zhang, Siqi Sun, Tianle Zhang, Changshun Wu, Gaojie Jin, Yi Qi, Jinwei Hu, Jie Meng, Saddek Bensalem, Xiaowei Huang
In the burgeoning field of Large Language Models (LLMs), developing a robust safety mechanism, colloquially known as “safeguards” or “guardrails”, has become imperative to ensure the ethical use of LLMs within prescribed boundaries. This article provides a systematic literature review of the current status of this critical mechanism. It discusses the major challenges the mechanism faces and how it can be enhanced into a comprehensive mechanism that deals with ethical issues in various contexts. First, the paper elucidates the current landscape of safeguarding mechanisms that major LLM service providers and the open-source community employ. This is followed by techniques to evaluate, analyze, and enhance some (un)desirable properties that a guardrail might want to enforce, such as hallucinations, fairness, and privacy. Building on these, we review techniques to circumvent these controls (i.e., attacks), to defend against the attacks, and to reinforce the guardrails. While the techniques mentioned above represent the current status and the active research trends, we also discuss several challenges that these methods cannot easily address, and we present our vision of how to implement a comprehensive guardrail through full consideration of a multi-disciplinary approach, neural-symbolic methods, and the systems development lifecycle.

Funding

EnnCore: End-to-End Conceptual Guarding of Neural Architectures

Engineering and Physical Sciences Research Council


School

  • Loughborough University, London

Published in

Artificial Intelligence Review

Volume

58

Issue

12

Publisher

Springer (part of Springer Nature)

Version

  • VoR (Version of Record)

Rights holder

© The Author(s)

Publisher statement

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/

Acceptance date

2025-08-31

Publication date

2025-10-17

Copyright date

2025

ISSN

0269-2821

eISSN

1573-7462

Language

  • en

Depositor

Dr Jie Meng. Deposit date: 27 October 2025

Article number

382
