Adapted from an article originally published on September 5, 2023 in the Santa Clara Business Law Chronicle.
What is ChatGPT, and how does it learn?
ChatGPT is an AI-powered chatbot developed by the software company OpenAI. The chatbot uses a neural network to generate responses to user questions and learns from feedback and new information to improve the accuracy of its responses. While OpenAI’s ChatGPT is one of the more notable generative AI chatbots, thanks to OpenAI providing a free version easily accessible to consumers, a host of AI chatbot services are available on the commercial market, such as Google Bard and Microsoft Bing AI.
ChatGPT, like other generative AI chatbots, uses neural networks to learn and to generate responses to user inquiries. A neural network learns how to create responses by studying a dataset, referred to as “training data,” for patterns, and then generates responses based on what it has learned from that training data. Put simply, neural network training is similar to learning math by checking the answers in the back of the textbook.
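As a toy illustration of that textbook analogy (this is nothing like OpenAI’s actual training code, and every number here is invented for the example), the sketch below has a one-parameter model “learn” the rule y = 2x by guessing an answer, checking it against the answer key, and nudging its parameter to shrink the error:

```python
# A toy illustration of learning from training data: a one-parameter model
# repeatedly compares its guesses to known answers and adjusts itself.

def train(pairs, steps=200, lr=0.05):
    w = 0.0  # the model's single adjustable parameter, starting at a guess
    for _ in range(steps):
        for x, y in pairs:       # each (question, answer) in the "textbook"
            guess = w * x        # the model's answer
            error = guess - y    # how far off it was
            w -= lr * error * x  # nudge the parameter to reduce the error
    return w

# Training data: questions whose answers (y = 2x) are in the back of the book
training_data = [(1, 2), (2, 4), (3, 6)]
w = train(training_data)
print(round(w, 2))  # the learned parameter ends up close to 2
```

Real neural networks do the same thing at an enormously larger scale: billions of parameters adjusted against billions of examples, rather than one parameter against three.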
A generative AI chatbot like ChatGPT needs to learn far more than basic algebra, so it must learn from more than a simple question-and-answer set. The training data must include a wide array of information so that the chatbot can produce responses on an equally wide range of topics. A chatbot’s answers are typically created from a variety of sources rather than a single data point, especially when the question asked is broad. The more the chatbot is expected to do, the larger the training data needs to be. For a chatbot as complex as ChatGPT, that practically means learning from the entire internet.
To gather such vast swaths of information for ChatGPT to learn from, OpenAI gets its training data from Common Crawl. Common Crawl is a Section 501(c)(3) non-profit that provides “a free, open repository of web crawl data that can be used by anyone.” It regularly scrapes the open internet and collects the data into one large digital dataset available to the public. Its mission is to provide this data for free rather than leave it accessible only to large companies with the means and technology to collect this kind of data on their own. OpenAI uses Common Crawl’s publicly available dataset to train ChatGPT, but this is only a portion of the data ChatGPT is given to learn from.
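Common Crawl’s dataset really is open to anyone: its index of crawled pages can be queried over the web. The sketch below shows what such a lookup might look like; the index endpoint is Common Crawl’s public one, the crawl name is one example collection, and since actually running the query requires network access, the code only constructs it:

```python
# A sketch of querying Common Crawl's public index for captures of a URL.
# "CC-MAIN-2023-40" is one example crawl collection; fetching the results
# would require network access, so this only builds the query URL.
from urllib.parse import urlencode

CDX_ENDPOINT = "https://index.commoncrawl.org/CC-MAIN-2023-40-index"

def build_index_query(url_pattern):
    # Ask the index which crawled captures match a URL pattern, as JSON lines
    params = {"url": url_pattern, "output": "json"}
    return CDX_ENDPOINT + "?" + urlencode(params)

query = build_index_query("example.com/*")
print(query)
```

Anyone with an internet connection can fetch that URL and see which pages of a site ended up in the public crawl, which is exactly the openness Common Crawl’s mission describes.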
How is ChatGPT allowed to use input data from its users according to OpenAI’s Terms of Service?
While Common Crawl provides a large portion of ChatGPT’s training data, OpenAI also uses its users’ input to further train ChatGPT. When a user asks ChatGPT a question, OpenAI can add that question to ChatGPT’s training data so that ChatGPT can learn from it. According to OpenAI’s Terms of Service, the company is allowed to treat user input this way, and depending on the user’s tier of service, this may prove difficult to stop.
By default, OpenAI collects user input as training data for ChatGPT. Users who access ChatGPT for free can submit a request to opt out of having their input used for training, but the request only applies to data collected after it is submitted and processed, leaving users no way to stop previously submitted data from being used to train ChatGPT. Paid users can opt out by changing a setting in their account, but data collection is turned on by default. In March 2023, OpenAI changed its API to an opt-in model; the API is a separate service from the standard paid chatbot.
While OpenAI does offer users options to opt out of having their data used to train ChatGPT, the opt-out processes can be tedious and do not apply to data collected before the user opted out. OpenAI is also not upfront about how it uses user data, and users may not be aware of how their data is being collected and used.
Could using ChatGPT risk trade secret protection?
OpenAI collects user input to be used in ChatGPT’s training data, meaning that it becomes part of the large body of knowledge that ChatGPT can pull from when generating answers. While ChatGPT’s answers are often developed using a variety of different sources within its training data, it can only use sources that are applicable to the question. This could mean ChatGPT has hundreds of thousands of sources to pull from when answering a high school math question or writing a paper on the Gettysburg Address, but far fewer sources to reference when it comes to more niche or complex topics such as developing fluid dynamics simulation models using Rust. ChatGPT is trained largely on the internet, so its training data will have more information available covering topics that are more commonly asked about online.
When a user gets into more niche and specific topics with ChatGPT, it has fewer data points to reference, which can make its responses less original. ChatGPT’s response to a question about Jane Austen’s Pride & Prejudice will be produced from a combination of many different sources across the internet, while ChatGPT may only have a handful of sources on the biography of a small-town mayor. Having fewer sources greatly increases the chance that ChatGPT will copy larger portions of a source to create its response. If the question is niche enough, ChatGPT may simply plagiarize its response without meaning to. While spitting out responses from the open internet may not seem terribly problematic, it is important to remember that ChatGPT is also trained on user input. If ChatGPT finds a relevant answer to a niche question in training data that came from user input, a user could find that they have received someone else’s input as their output.
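The dynamic can be made concrete with a deliberately simplified sketch (this is nothing like ChatGPT’s actual architecture; the “sources” and the blending rule are invented for illustration): when many sources match a topic, a response can blend them, but when only one source matches, the only possible output is a verbatim copy of it.

```python
# A deliberately simplified illustration of why scarce sources lead to
# verbatim copying: many matches can be blended, one match can only be copied.

def respond(topic, sources):
    matches = [s for s in sources if topic in s]
    if len(matches) >= 2:
        # Blend: splice the first half of one matching source onto the
        # second half of another
        first, second = matches[0], matches[1]
        return first[: len(first) // 2] + second[len(second) // 2:]
    elif matches:
        return matches[0]  # only one source: the output is a verbatim copy
    return "I don't know."

sources = [
    "Pride and Prejudice was published in 1813.",
    "Pride and Prejudice follows Elizabeth Bennet.",
    "The mayor's confidential budget memo says X.",  # a single niche source
]
print(respond("Pride and Prejudice", sources))  # blended from two sources
print(respond("budget memo", sources))          # copied word for word
```

A popular topic yields a response stitched from multiple sources, while the niche topic comes back exactly as it went in, which is the scenario that matters once user input is part of the training data.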
The possibility of receiving a previous user’s input as output poses a serious risk to trade secret protection. Under the Uniform Trade Secrets Act and the Defend Trade Secrets Act, reasonable measures must be taken to keep information secret in order to maintain trade secret protection. Trade secrets also tend to cover specialized subject areas, meaning there may not be many publicly available sources on the topic for ChatGPT to reference. This can cause issues for companies that use ChatGPT in their product development.
If a company’s employee has not opted out of having their data used for training and enters the company’s trade secret information into ChatGPT, that information could become part of ChatGPT’s training data. Trade secret information is likely to be specialized, and possibly one of the only sources ChatGPT has on the subject. If it is used as training data, a competitor may be able to coax ChatGPT into producing the first company’s input as output, inadvertently sharing the trade secrets with the competitor. Because the competitor obtains the information from ChatGPT, the acquisition may not be considered to be by improper means, and the original company may be unable to assert trade secret protections against the competitor.
Further, even if the competitor already has the information, by getting ChatGPT to produce the original company’s trade secrets as output, the competitor could show that the original company gave the trade secret information to ChatGPT and did not take reasonable measures to keep the information secret. This would also prevent the original company from utilizing trade secret protections.
ChatGPT is a new tool that is changing frequently, so this issue has yet to be litigated. However, OpenAI’s handling of training data poses serious risks, and companies considering using ChatGPT as part of product development should think critically before implementing it.