Detecting and Distinguishing AI-Generated Text from Human Written Text

AI-generated text (AIGT) has become increasingly prevalent in email communication due to the growing availability and integration of LLMs. Threat actors have also taken advantage of LLMs to generate seemingly safe emails that deceive victims. In this blog we will share some of the features that can be used to distinguish AIGT from human-written text.
Perplexity score
Perplexity measures how likely a given language model is to generate the input sequence. Human-written text tends to have a higher perplexity score because it is less likely to align with the model's token probabilities.
| Text source | Human-written | AIGT |
| --- | --- | --- |
| Body text | Hello, My Company would like to discuss the possible supply of your products. Attached is the product and specifications, Please kindly contact me for further discussion. Kind regards, Rachel Curry. Operations Administrator | Dear Mr. Jenkins, I hope this email finds you well. My name is Ruth, and I am reaching out from XYZ Corporation in regards to a potential supply partnership with your company. We are actively seeking reliable suppliers who can export specific products to meet our urgent needs. As part of our ongoing research, we came across your company and were impressed by the quality of goods you offer. Our team has reviewed some information about your products, and we believe they may be a good fit for our current requirements. To better understand your capabilities and potential interest in collaborating with us, I would appreciate it if you could provide more details on the following: * Your product range and specifications * Export experience and logistics infrastructure * Pricing and lead time for the specific products we are interested in Your prompt response will be greatly appreciated, as we need to make an informed decision quickly. We are looking forward to hearing back from you soon. Best regards, Ruth |
| GPT-2 Perplexity | 45.082 | 11.050 |
Table 1: Examples of perplexity measurements for human-written and AIGT email.
This method works best when the LLM that generated a text sample is known and the same LLM is used to measure perplexity, as the probabilities will match exactly. In practice, threat actors have many different LLMs available to them. While a mismatched LLM is still typically quite informative, measuring perplexity with multiple LLMs yields a more comprehensive measurement and mitigates this limitation.
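Computing perplexity with GPT-2 itself requires a transformer library, but the underlying definition (the exponential of the average negative log-probability per token) can be illustrated with a self-contained toy sketch. Here an add-one-smoothed unigram model stands in for the LLM; the training and test strings are illustrative:

```python
import math
from collections import Counter

def unigram_perplexity(train_tokens, test_tokens):
    """Perplexity of test_tokens under a unigram model fit on train_tokens.

    Uses add-one (Laplace) smoothing so unseen tokens get nonzero probability.
    Perplexity = exp(-(1/N) * sum of log p(token)).
    """
    counts = Counter(train_tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 slot for unseen tokens
    log_prob = 0.0
    for tok in test_tokens:
        p = (counts[tok] + 1) / (total + vocab)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(test_tokens))

train = "the cat sat on the mat".split()
# Text that matches the model's distribution scores a lower perplexity
# than out-of-distribution text.
print(unigram_perplexity(train, "the cat sat".split()))
print(unigram_perplexity(train, "quantum flux arrays".split()))
```

The same idea scales up to an LLM: replace the unigram probabilities with the model's next-token probabilities conditioned on the preceding context.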
Text entropy
Text entropy is a measure of the uncertainty or randomness in a piece of text, and can be calculated from the probability distribution at either the character or word level using various methods, such as Shannon entropy (used here) or relative entropy.
Equation 1: H(X) = -Σ_x p(x) log2 p(x), the Shannon entropy measured in bits (log base 2) of a discrete random variable X whose values are distributed according to p.
Measuring the Shannon entropy at the word level of human-written text tends to result in moderate values (between 4 and 6). AIGT is more likely to veer toward the extremes:
- Low entropy (< 4) indicates a repetitive or predictable structure, which could be characteristic of AIGT, simple boilerplate language, or texts with a very narrow focus (like technical manuals).
- Very high entropy (> 6) indicates a text that is more random, disjointed, or potentially generated by an AI model, as it would use a wider variety of words without much regularity or structure.
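A minimal sketch of the word-level Shannon entropy calculation from Equation 1:

```python
import math
from collections import Counter

def word_entropy(text):
    """Shannon entropy (in bits) of the word distribution in `text`."""
    words = text.lower().split()
    counts = Counter(words)
    n = len(words)
    # H(X) = -sum over words of p(w) * log2 p(w)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(word_entropy("the the the the"))  # fully repetitive: 0 bits
print(word_entropy("alpha beta gamma delta"))  # four equally likely words: 2 bits
```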
Intrinsic dimension
Intrinsic dimension is the minimum number of parameters needed to represent the underlying structure of a dataset. For text analysis, the complexity of a text is measured by computing contextualized embeddings for all tokens in the text, then estimating the intrinsic dimension of the resulting set of embeddings.
Several algorithms estimate intrinsic dimension, such as Maximum Likelihood Estimation, the Method of Moments, and Persistent Homology Dimension (PHD). PHD can be preferred over the others because it “combines local and global properties of the dataset” and handles noise well (Tulchinskii et al., 2023).
Tulchinskii et al. find that when estimating with PHD, intrinsic dimension is around 9 for several alphabet-based languages and about 7 for Chinese. They also find that within each language, AIGT will tend to have an intrinsic dimension about 1.5 lower than human-written text and this clear separation can be very effective for distinguishing the two. Unlike other techniques, intrinsic dimension estimation was found to be less likely to mislabel text written by non-native speakers as AIGT.
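PHD estimation is fairly involved, but the general idea of intrinsic dimension estimation can be illustrated with the simpler classic maximum-likelihood estimator, which infers dimension from log-ratios of nearest-neighbour distances. In this sketch, random points on a 2-D plane embedded in 5-D space stand in for token embeddings:

```python
import math
import random

def mle_intrinsic_dimension(points, k=10):
    """Maximum-likelihood intrinsic dimension estimate.

    For each point, sorts distances to all other points and uses
    log-ratios of the k-th neighbour distance to the closer neighbours.
    Returns the average of the per-point estimates.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    estimates = []
    for i, p in enumerate(points):
        d = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        t_k = d[k - 1]
        s = sum(math.log(t_k / d[j]) for j in range(k - 1))
        estimates.append((k - 1) / s)
    return sum(estimates) / len(estimates)

# Points lying on a 2-D plane embedded in 5-D: the estimate should be
# close to 2 even though the ambient dimension is 5.
random.seed(0)
pts = [(random.random(), random.random(), 0.0, 0.0, 0.0) for _ in range(300)]
print(mle_intrinsic_dimension(pts))
```

For AIGT detection, the points would instead be the contextualized embeddings of a text's tokens, and per the cited results, lower estimates would point toward AIGT.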
Syntactic complexity
Syntactic complexity refers to the complexity of sentence construction. Some simple yet meaningful statistical measurements that gauge syntactic complexity are the mean and standard deviation of sentence length, the number of phrases per sentence, and the number of words per phrase. Sentence length can indicate whether a text is concise or verbose. Chat LLMs can exhibit more verbose sentence structure than typical email communication, which is reflected in a longer mean sentence length and a higher mean number of phrases per sentence.
| Metric | Human-written | AIGT |
| --- | --- | --- |
| words_per_sentence_mean | 8.000 | 18.889 |
| words_per_sentence_std | 5.050 | 18.217 |
| phrases_per_sentence_mean | 1.750 | 1.778 |
| phrases_per_sentence_std | 0.433 | 0.629 |
| words_per_phrase_mean | 4.571 | 10.625 |
| words_per_phrase_std | 3.698 | 10.337 |
Table 2: Syntactic complexity measurements for text samples featured in Table 1.
The AIGT’s wordiness is reflected in the higher mean words per sentence and per phrase.
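The sentence-length statistics above can be sketched as follows (a naive split on end punctuation, not a full sentence tokenizer):

```python
import math
import re

def sentence_length_stats(text):
    """Mean and (population) standard deviation of words per sentence."""
    # Naive sentence split on runs of end punctuation; a production
    # system would use a proper sentence tokenizer.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    mean = sum(lengths) / len(lengths)
    var = sum((n - mean) ** 2 for n in lengths) / len(lengths)
    return mean, math.sqrt(var)

print(sentence_length_stats("One two. Three four five six."))
```

Phrase-level statistics work the same way once sentences are further split into phrases (e.g. on commas or via a syntactic parser).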
Frequency of punctuation
Differences in sentence structure between human-written text and AIGT can also be revealed in the frequency of punctuation. Certain punctuation marks (such as the dash "—") see limited use in typical email communication, and an unusual frequency may indicate that the text is AIGT.
The frequencies of apostrophes (') and hyphens (-) can be proxy measurements of other patterns. Apostrophe frequency is closely linked to the prevalence of contractions and can indicate differences in tone and formality. Hyphenation frequency can corroborate some of these tone patterns and also shed light on the word choices that distinguish AIGT from typical email communication.
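A minimal sketch of per-character punctuation frequency measurement (the punctuation set here is illustrative):

```python
from collections import Counter

# Illustrative set of marks to track, including the apostrophe, hyphen,
# and dash discussed above.
PUNCT = set("'—-;:,.!?\"")

def punctuation_frequency(text):
    """Frequency of each tracked punctuation mark per character of text."""
    counts = Counter(ch for ch in text if ch in PUNCT)
    n = len(text)
    return {p: counts[p] / n for p in PUNCT}

freqs = punctuation_frequency("don't stop - it's fine")
print(freqs["'"], freqs["-"])
```

These per-character rates become feature values alongside perplexity, entropy, and the syntactic measurements.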
Combining Features
Individual metrics don't capture enough information to accurately distinguish human-written text from AIGT. To address this, more sophisticated machine learning algorithms are recommended, such as logistic regression, decision trees, and random forests, which can synthesize multiple features and learn their relative importance for the classification task.
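As a sketch of how individual features can be combined, here is a minimal logistic regression trained by stochastic gradient descent on synthetic perplexity/entropy feature vectors. The data, feature values, and hyperparameters are illustrative, not the models evaluated below:

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_logistic(X, y, lr=0.1, epochs=500):
    """Fit logistic-regression weights and bias by per-sample gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0

# Synthetic feature vectors: [perplexity, word entropy]; label 1 = AIGT
# (low perplexity), label 0 = human-written (high perplexity).
random.seed(1)
X = [[random.gauss(45, 5), random.gauss(5.0, 0.5)] for _ in range(50)] + \
    [[random.gauss(11, 3), random.gauss(5.5, 1.0)] for _ in range(50)]
y = [0] * 50 + [1] * 50

# Standardize each feature column so gradient descent behaves well.
means = [sum(col) / len(col) for col in zip(*X)]
stds = [math.sqrt(sum((v - m) ** 2 for v in col) / len(col))
        for col, m in zip(zip(*X), means)]
X = [[(v - m) / s for v, m, s in zip(xi, means, stds)] for xi in X]

w, b = train_logistic(X, y)
```

In practice a library implementation (e.g. of logistic regression or random forests) would be used; the point is that the classifier learns one weight per feature and combines them into a single decision.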
| Model | Precision | Recall |
| --- | --- | --- |
| Logistic Regression | 0.889 | 0.941 |
| Decision Tree | 0.890 | 0.920 |
| Random Forest (100 estimators) | 0.949 | 0.937 |
Table 3: Precision and recall for logistic regression, decision tree and random forest classifier models.
The Random Forest model demonstrates the strongest overall performance of the three, achieving the highest precision at 0.949, which indicates a superior ability to minimize false positives. It also has a recall of 0.937, effectively capturing the majority of true positives. This balance makes it an excellent choice for AIGT detection tasks that require both high precision and recall. Logistic Regression, while slightly lower in precision at 0.889, compensates with a robust recall of 0.941, suggesting it is slightly more sensitive but at the expense of generating more false positives than Random Forest. The Decision Tree model closely mirrors Logistic Regression in precision (0.890) but exhibits a marginally lower recall of 0.920, implying it may miss slightly more true positives. Overall, Random Forest offers the most favorable trade-off between precision and recall, making it the most reliable option based on the evaluated metrics.
Figure 1: Feature permutation importances obtained from a trained random forest classifier.
Permutation importance measures the decrease in model accuracy when a given feature’s values are randomly shuffled. The most important features (e.g. GPT-2 perplexity, certain intrinsic dimension estimates, text entropy, mean words and phrases per sentence, hyphen frequency) will have a greater decrease in accuracy score than less important features (e.g. semicolon and quote frequency).
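Permutation importance can be sketched in a few lines. Here a hypothetical threshold classifier on a perplexity-like feature stands in for the trained random forest; the data is illustrative:

```python
import random

def accuracy(model, X, y):
    return sum(model(x) == yi for x, yi in zip(X, y)) / len(y)

def permutation_importance(model, X, y, feature_idx, n_repeats=10, seed=0):
    """Mean drop in accuracy when one feature column is randomly shuffled."""
    rng = random.Random(seed)
    base = accuracy(model, X, y)
    drops = []
    for _ in range(n_repeats):
        col = [x[feature_idx] for x in X]
        rng.shuffle(col)
        X_perm = [list(x) for x in X]
        for x, v in zip(X_perm, col):
            x[feature_idx] = v
        drops.append(base - accuracy(model, X_perm, y))
    return sum(drops) / n_repeats

# Hypothetical stand-in classifier: predicts AIGT (1) when feature 0
# (a perplexity-like score) is low; feature 1 is ignored entirely.
def model(x):
    return 1 if x[0] < 25 else 0

X = [[40, 5.0], [50, 4.5], [10, 5.5], [12, 6.0]]
y = [0, 0, 1, 1]
# Drop in accuracy per shuffled feature: the feature the model relies on
# yields a positive drop, the ignored feature yields none.
print(permutation_importance(model, X, y, 0))
print(permutation_importance(model, X, y, 1))
```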
Limitations of AIGT Detection
While these methods are effective at identifying text that is entirely LLM-generated, AIGT can be further edited by a human to blur the separation from human-written text. This does not entirely invalidate the measurements used for AIGT detection, but classifier models will need to be trained specifically on this kind of hybrid text to remain robust.
Conclusion
With the increasing availability of powerful LLMs, threat actors can use LLMs to craft variants of emails. Distinguishing AIGT from human-written text is a crucial feature for identifying malicious emails from threat actors using LLMs. Rather than relying on individual syntax and semantic features of AIGT, combining several of the most impactful features (e.g. perplexity, intrinsic dimension, syntactic complexity, etc.) and leveraging machine learning techniques such as random forest classifiers is likely to be most effective in a detection system that adapts to the evolving tactics of threat actors.
References
Eduard Tulchinskii, Kristian Kuznetsov, Laida Kushnareva, Daniil Cherniavskii, Sergey Nikolenko, Evgeny Burnaev, Serguei Barannikov, and Irina Piontkovskaya. 2023. Intrinsic Dimension Estimation for Robust Detection of AI-Generated Texts. arXiv:2306.04723v2 [cs.CL] https://arxiv.org/abs/2306.04723
Apr 8, 2025 8:23:20 PM