OpenAI's data was scraped too. So once you scrape it, it's yours? That is damn shameless to the extreme

OpenAI’s data, particularly for training large models like GPT-3 and GPT-4, comes from a variety of publicly available and licensed sources. Here’s a breakdown of the key sources OpenAI uses for training its language models:

1. Public Web Data

  • Web Crawls: OpenAI uses a wide range of publicly available web pages, which include websites, blogs, forums, news articles, and other textual content freely available on the internet.
  • Books and Articles: Content from publicly available books, research papers, articles, and other publications.
  • Wikipedia: Wikipedia’s vast amount of knowledge across different topics is often a key resource.
  • Forums and Social Media: OpenAI may use data from platforms like Reddit or StackExchange (among others); any data derived from these platforms is typically aggregated and anonymized.

2. Licensed Data

OpenAI may also have access to proprietary data through licensing agreements with certain organizations, such as:

  • News sources: Subscription-based news websites or archives, which provide high-quality content for training.
  • Research Papers: Databases like arXiv or academic publishers where papers are publicly available or licensed for use.

3. Books and Academic Journals

  • OpenAI uses a large corpus of books and academic papers across various domains to give the model a broad knowledge base, particularly in specialized fields like science, technology, literature, history, and more.

4. Code and Programming Resources

  • Models like GPT-4 have been trained on a large corpus of code from open-source platforms like GitHub to better understand and generate code across a variety of programming languages.

5. Other Datasets

OpenAI uses a range of curated datasets, such as:

  • Common Crawl: A massive dataset of web data scraped regularly.
  • Project Gutenberg: A collection of free eBooks, especially classic literature.
  • Open Subtitles: Text data from movie subtitles, which help improve conversational understanding.

 

All replies:

Nothing wrong with that. Knowledge you learn from humanity becomes yours; once you create from it, compile it into a book, or write it up as a work, you hold the copyright! -猛牛- 01/29/2025 22:25:11

If copyright was infringed, just sue in court. Shouting it here a thousand times won't help. -花点牛牛- 01/29/2025 22:27:15

Relax! People are evaluating the most appropriate way to handle it. -猛牛- 01/29/2025 22:29:08

You're the impatient one. You've been shouting about it nonstop since early on. -花点牛牛- 01/29/2025 22:32:52

Check for yourself: I've posted only a third as much as you today. -猛牛- 01/29/2025 22:36:28

You're the one who started today's plagiarism accusations. -花点牛牛- 01/29/2025 22:40:20

I was merely the first to repost a link to this forum. -猛牛- 01/29/2025 22:43:50

The issue is that there is now an 80-point option that carries no risk and costs nothing, and a 90-point option that costs a lot of money and might get you caught. -灵山问禅- 01/29/2025 22:29:28

If you can scrape others, others can scrape you. As I've said, data has no copyright; only algorithms do. -鬼眼狂刀- 01/29/2025 22:29:34

That's not for you to decide! -猛牛- 01/29/2025 22:30:51

Even if there really is no copyright, surely it's fine to put up barriers so you can't keep stealing? -猛牛- 01/29/2025 22:33:02

What evidence do you have that they stole anything? -花点牛牛- 01/29/2025 22:34:43

Or do you think shouting it here a thousand times counts as evidence? Laughable. -花点牛牛- 01/29/2025 22:37:45

OpenAI has the evidence! We're just spectators; how would we have any? -猛牛- 01/29/2025 22:38:38

I also have evidence that my neighbor stole from me, but I can't produce it and I don't dare go to court. Everyone just wait and see, haha. -花点牛牛- 01/29/2025 22:42:24

They're losing money and getting desperate; they need to secure funding out of the 1.6 billion to plug the gap. -评论2012- 01/29/2025 22:26:33

Did they buy a lot of NVDA? Then they're in trouble. -花点牛牛- 01/29/2025 22:28:16

A few days ago someone posted Pelosi's trades. Did they follow the crowd and buy NVDA options? -花点牛牛- 01/29/2025 22:29:23

Learning by reading books and newspapers is a completely different matter from copying someone else's homework. -未知- 01/29/2025 22:30:32

First prove that OpenAI didn't copy. Then prove that DS didn't do its own homework. -鬼眼狂刀- 01/29/2025 22:35:04

That's sophistry. Right now only one thing needs to be proven. -未知- 01/29/2025 23:14:21

If OpenAI wants to make accusations, it should produce the evidence. -花点牛牛- 01/29/2025 23:20:20

OpenAI pays sources like Wikipedia, doesn't it? -julie116- 01/29/2025 22:36:27

No need to. -鬼眼狂刀- 01/29/2025 22:41:23

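I hope they can share some of the costs, so Wikipedia doesn't always have to beg so pitifully for donations that people feel bad about not giving. Also, by this open-data logic, DS obtaining data through a paid user account is justifiable too. -julie116- 01/29/2025 22:54:38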

NYTimes vs OpenAI -mobius- 01/29/2025 22:42:25
