Where does AI training data come from? You may have been told that Large Language Models (LLMs) are trained on vast amounts of text scraped from the public web. That's true, but it isn't the whole story. Simply scraping the web and dumping the data into a database isn't enough to produce a high-performing LLM. The models also need domain-specific knowledge, which today comes mostly from subject matter experts who evaluate and label the data through a process known as annotation. Let's review.