FUNDAMENTAL METHODS FOR IDENTIFYING AI-GENERATED DATA IN ACADEMIC AND ONLINE PUBLICATIONS USING MACHINE LEARNING AND NLP MODELS

Description

This article provides a comprehensive scientific analysis of the fundamental and practical methodologies for identifying texts
generated by Artificial Intelligence (AI) within academic and digital publishing ecosystems. The rapid maturation of generative language
architectures is fundamentally transforming traditional copyright paradigms and the principles of academic integrity. The primary
objective of this research is to develop, test, and evaluate innovative methodologies for distinguishing synthetic data using Natural
Language Processing (NLP) and Machine Learning (ML) classification models. The study comparatively evaluates the effectiveness
of stylometric feature extraction, zero-shot probability distribution analysis, and transformer-based deep learning classifiers. Empirical
results indicate that traditional plagiarism-detection systems based on exact lexical matching are no longer viable for AI-generated text.
Concurrently, the hybrid-ensemble architecture proposed in this study demonstrated high resilience against complex adversarial
evasion attacks. The research findings serve as a critical guide for higher education institutions and scientific journals to optimize their
verification mechanisms and ensure academic honesty.
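
To illustrate the stylometric feature extraction named in the description, the following is a minimal sketch in Python. It is not the study's implementation: the function name `stylometric_features` and the four features it computes are assumptions chosen for illustration, since the description does not say which features the study uses.

```python
import re
import statistics


def stylometric_features(text: str) -> dict[str, float]:
    """Compute a few simple stylometric features for one document.

    Hypothetical helper for illustration; the feature set is an assumption.
    """
    # Crude sentence split on terminal punctuation, not a real tokenizer.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not sentences or not words:
        raise ValueError("text must contain at least one word")

    # Number of words in each sentence.
    sentence_lengths = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]

    return {
        # Vocabulary diversity: distinct words divided by total words.
        "type_token_ratio": len(set(words)) / len(words),
        "mean_sentence_length": statistics.mean(sentence_lengths),
        # Spread of sentence lengths (population standard deviation).
        "sentence_length_stdev": statistics.pstdev(sentence_lengths),
        # Commas per word, a rough punctuation-habit signal.
        "comma_rate": text.count(",") / len(words),
    }


if __name__ == "__main__":
    sample = "The cat sat. The cat ran fast, then slept."
    for name, value in stylometric_features(sample).items():
        print(f"{name}: {value:.3f}")
```

Feature vectors of this kind could be passed to any standard classifier. Whether these particular features separate human from AI-generated text would have to be established empirically.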

Authors

DOI: 10.5281/zenodo.20780989

Publication Date: 2026-06-01


