Sarah Silverman is suing OpenAI and Meta for copyright infringement. The lawsuits allege that the companies trained their AI models on her works without her consent. In other news, Google’s medical AI chatbot (Med-PaLM 2) is being tested in hospitals. The Mayo Clinic has reportedly been testing the system since April.
These two stories are two sides of the same coin. Silverman’s work is protected by copyright, but it is popular, and hundreds (if not thousands) of publicly available synopses, commentaries, and bootleg copies circulate online. (Did the models train on the public material or on the copyrighted work itself?) Google says Med-PaLM 2 was built by combining a 540-billion-parameter LLM (some of whose training material was undoubtedly protected by copyright) with a small bespoke data set of “medical demonstrations.”
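To see why data provenance matters here, consider a toy sketch of the two-stage process described above: large-scale pretraining on a mixed web corpus, then fine-tuning on a small curated set. All names and data below are invented for illustration; real LLM training is vastly more complex than this stand-in.

```python
def pretrain(corpus):
    """Stand-in for large-scale pretraining on mixed web data."""
    return {"knowledge": set(corpus)}

def fine_tune(model, demonstrations):
    """Stand-in for adapting a base model with a small curated set."""
    model["knowledge"] |= set(demonstrations)
    return model

# Hypothetical training mix: provenance of the web corpus is unclear.
web_corpus = ["public synopsis", "commentary", "bootleg copy"]
medical_demos = ["curated medical demonstration"]  # bespoke, licensed

base = pretrain(web_corpus)
med_model = fine_tune(base, medical_demos)

# The fine-tuned model inherits everything in the base model's training
# mix, so provenance questions about the base carry into downstream products.
print("bootleg copy" in med_model["knowledge"])  # True
```

The point of the toy: even a purpose-built model like a medical assistant sits on top of a general base, so any copyright question about the base model’s corpus follows it downstream.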
The WSJ reports that, in an internal email it saw, Google said its updated model could be particularly helpful in countries with “more limited access to doctors.” That could be life-saving, given that doctors are in short supply worldwide.
This raises the question: How should (or could) we regulate the use of data for AI training? Is remuneration to the creators possible? If so, what is the mechanism, workflow, or process we might use?
You can’t answer these questions without understanding how LLMs are trained. If you’re interested in learning how conversational AI models learn, sign up for my free online course, Generative AI for Execs. It will help you argue both sides of the issue authoritatively.
Author’s note: This is not a sponsored post. I am the author of this article and it expresses my own opinions. I am not, nor is my company, receiving compensation for it.