Detecting LLM-Generated Reviews through Watermarking and Statistical Guarantees

The integrity of peer review is essential for scientific progress, yet the rise of large language models (LLMs) has introduced new challenges. Increasingly, reviewers may rely on LLMs to generate reviews, undermining originality and accountability. Existing detection tools struggle to differentiate fully LLM-generated reviews from those merely polished with AI assistance, making enforcement of bans on AI-assisted reviewing difficult.

Peer review is the backbone of scientific publishing. Before research is shared with the world, experts in the field evaluate its methods, results, and conclusions to ensure accuracy and credibility. This process acts as a quality filter, preventing flawed or misleading studies from influencing policy, medicine, or technology. Without peer review, scientific knowledge would lack the checks and balances that protect society from misinformation and unsafe practices. In short, it is a trust mechanism that upholds the reliability of science.

Rao et al. (2025) propose a novel framework to address this issue by embedding covert watermarks into manuscripts via indirect prompt injection. These hidden instructions, imperceptible to human reviewers, prompt LLMs to include distinctive signals such as random start phrases, rare technical terms, or fabricated citations within generated reviews. This approach enables reliable post hoc identification of AI-generated content without relying on assumptions about human writing styles.

The authors introduce three core components: (1) Watermarking schemes designed for statistical testability and resilience to paraphrasing; (2) Indirect prompt injection techniques, including font-based and cryptic cues; and (3) Rigorous hypothesis testing that controls the family-wise error rate (FWER) across multiple reviews. Unlike traditional corrections such as Bonferroni, which are overly conservative, their method achieves high statistical power while maintaining formal guarantees against false positives.

Empirical evaluations across datasets including ICLR 2024 submissions and NSF proposals demonstrate watermark embedding success rates exceeding 90%, even under reviewer defenses like paraphrasing. The framework proves robust across multiple LLMs, including ChatGPT, Gemini, and Claude, and scales effectively to thousands of reviews without compromising reliability.

By combining cryptographic-like watermarking with statistical rigor, this work offers a practical and theoretically sound solution to preserving peer review integrity in an era of generative AI. It highlights the urgent need for proactive measures to ensure accountability and trust in scholarly communication.

Reference: Rao VS, Kumar A, Lakkaraju H, Shah NB (2025) Detecting LLM-generated peer reviews. PLoS One 20(9): e0331871. https://doi.org/10.1371/journal.pone.0331871

Post Views: 2

Mar Joanpere Foraster

Serra Húnter Fellow of Sociology at Universitat Rovira i Virgili.
Former DAAD-Gastprofessorin at Julius-Maximilians-Universität Würzburg

Daily27

Detecting LLM-Generated Reviews through Watermarking and Statistical Guarantees

ByMar Joanpere Foraster

Mar Joanpere Foraster

By Mar Joanpere Foraster

Related Post

First pregnancy using AI to find the one sperm that matters

Minds Over AI

Switzerland launches a public alternative to commercial AI systems

You missed

Detecting LLM-Generated Reviews through Watermarking and Statistical Guarantees

Eradicating Isolating Gender Violence: Ensuring Lives Free From Harassment

The Mirai algorithm makes it possible to anticipate hidden breast tumors between mammograms

Voices of Successful Young People

Daily27