OpenAI offers publicly available generative AI systems in the state of California. We develop these systems using a variety of data sources, including publicly available data, data that we partner with third parties to access, and information that our users or human trainers and researchers provide or generate. We also develop our systems using synthetic data.
We use data to help our systems better understand human language and the world. That, in turn, allows our systems to enhance human creativity, advance scientific discovery and medical research, and enable hundreds of millions of people to improve their daily lives. Our systems are developed on datasets containing trillions of tokens of textual, image, audio, and audiovisual content.
We use a diverse set of data to develop our systems, including data that may be protected by copyright and data in the public domain. Although we take steps to reduce the amount of personal information in our training datasets, some of our data may include personal information and aggregate consumer information as defined in California Civil Code Section 1798.140. Our users have the ability to opt out of their content being used for training, as well as to request removal of certain personal information from ChatGPT responses, using our Privacy Portal. We use a variety of techniques to process our datasets to improve the performance and accuracy of our models.
We started collecting data to develop our systems in approximately 2018, and we continue to collect data today. We first used our datasets for systems development in 2021.
Additional information about the development of our systems can be found in our system cards.
Training Data Summary Pursuant to California Civil Code Section 3111
Updated: 13 days ago
