From now on, we will have to accept a reality: more than 80% of future content, perhaps even more, will be AI-generated.
The original study:
https://graphite.io/five-percent/more-articles-are-now-created-by-ai-than-humans
Methodology
Common Crawl
Common Crawl maintains one of the largest publicly available web archives. It provides billions of URLs, is used by researchers and developers, and is a key data source for training large language models.
Selection of Articles
We need a representative sample of English-language articles on the web. To build one, we randomly select 65k URLs from Common Crawl and confirm that each is in English, has article schema markup, is at least 100 words long, has a publish date between January 2020 and May 2025, and is classified as an article or listicle by the Graphite page type classifier.
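The selection criteria above can be read as a single filter over candidate pages. The following is only a minimal sketch of that reading, not Graphite's actual pipeline: the `Page` record, its field names, and the `page_type` labels are hypothetical, and treating the date range as inclusive of all of May 2025 is an assumption.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Page:
    """Hypothetical record for one crawled URL (field names are assumptions)."""
    url: str
    language: str             # e.g. "en", as reported by some language detector
    has_article_schema: bool  # True if article schema markup was found
    text: str                 # extracted body text
    publish_date: date
    page_type: str            # label from a page type classifier (assumed values)


# Assumed inclusive bounds for "between January 2020 and May 2025".
START = date(2020, 1, 1)
END = date(2025, 5, 31)
ALLOWED_TYPES = {"article", "listicle"}
MIN_WORDS = 100


def is_eligible(page: Page) -> bool:
    """Return True only if the page meets every selection criterion."""
    return (
        page.language == "en"
        and page.has_article_schema
        and len(page.text.split()) >= MIN_WORDS
        and START <= page.publish_date <= END
        and page.page_type in ALLOWED_TYPES
    )
```

Word count here is a plain whitespace split, which is the simplest choice; a real pipeline might tokenize differently.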
AI Detection Algorithm
Accurate detection of AI-generated content is required to make claims about the prevalence of AI-generated articles on the web. There is considerable disagreement about the accuracy of AI detection algorithms, and many argue that detecting AI is impossible or, at best, highly inaccurate. Many companies offer AI detection algorithms, including Originality.ai, GPTZero, Grammarly, and Surfer.
To compute the percentage of AI-generated content in an article, we use the same algorithm described in our 2024 whitepaper, but classify each chunk using Surfer’s AI detector with a chunk size of 500 words. We classify an article as AI-generated if the algorithm predicts that more than 50% of the content is AI-generated, and human-written otherwise.
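To make the chunking and the 50% threshold concrete, here is a rough sketch. It is not the whitepaper's algorithm, which is not reproduced here: the article-level percentage is assumed to be the word-weighted share of chunks flagged as AI, and `is_ai_chunk` is a hypothetical stand-in for a call to a detector such as Surfer's, whose real API is not shown.

```python
from typing import Callable, List

CHUNK_SIZE = 500     # words per chunk, as stated in the methodology
AI_THRESHOLD = 0.5   # an article is AI-generated if strictly more than 50% is AI


def chunk_words(text: str, size: int = CHUNK_SIZE) -> List[str]:
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def ai_fraction(text: str, is_ai_chunk: Callable[[str], bool]) -> float:
    """Assumed aggregation: fraction of words that sit in chunks flagged as AI."""
    chunks = chunk_words(text)
    total = sum(len(c.split()) for c in chunks)
    if total == 0:
        return 0.0
    ai_words = sum(len(c.split()) for c in chunks if is_ai_chunk(c))
    return ai_words / total


def classify_article(text: str, is_ai_chunk: Callable[[str], bool]) -> str:
    """Label an article "ai" if more than half its content is flagged, else "human"."""
    return "ai" if ai_fraction(text, is_ai_chunk) > AI_THRESHOLD else "human"
```

Under this reading, an article that is exactly half AI-flagged counts as human-written, because the rule requires more than 50%.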