Add Your Own Data to ChatGPT

Illustration created by Midjourney with the prompt “andy warhol print (bold graphic style) of several people (both men and women) entering data on their laptop computers. --v 5 --ar 16:9”

The rise of generative AI has unlocked the potential for substantial productivity increases for everyone. We’re seeing 5 to 15% productivity increases from workers who are simply leveraging ChatGPT. Imagine if you could create a custom AI that was an expert in your industry, an expert in the inner workings of your company, and always up to date. In this article, “embedding” is shorthand for processing your data and then using it to enhance a GPT model’s expertise for your specific purposes. (Strictly speaking, an embedding is a numeric vector that represents a piece of text, which is what makes it possible to look up the passages relevant to a question.) There are a couple of ways to accomplish this, but one stands out as more practical than the other. Let’s explore.

Retraining Large Language Models (LLMs)

The most obvious way to give a large language model (LLM) like GPT-4 the ability to answer questions about your private data would be to train (or retrain) the model using your data. However, this is currently impractical due to the time, cost, and privacy concerns associated with mixing datasets, as well as the potential security risks. Therefore, retraining a commercial LLM for your needs is likely not the best option. Instead, let’s consider another approach.

Context Injection

Context injection is a technique that involves providing GPT-4 with additional relevant information from your knowledge base alongside the user’s query. This helps guide the model’s response and enhances its accuracy, relevance, and specificity. While it is too expensive (and impractical) to place your entire knowledge base at the beginning of every prompt, it is possible to create a database of the knowledge you want your LLM-based application (such as ChatGPT) to have access to. Then, use that database to help engineer prompts behind the scenes. This technique has two specific benefits: (1) prompts stay compact, because each one carries only the passages relevant to the question, and (2) your data stay up to date, because you update the database rather than the model.

Let’s walk through the process of embedding our own data in OpenAI’s GPT-3 or GPT-4.

Data Collection

The first step in embedding your data is to gather the information you want the AI to learn. This could include internal documents, customer interactions, product information, or industry-specific knowledge. You could also scrape the information you wish to embed from the web. It’s essential to consider data privacy and security at this stage, ensuring that sensitive information is handled appropriately and in compliance with relevant regulations.

Data Preprocessing

Before your data can be embedded for use with GPT-4, they need to be cleaned and structured. This involves normalizing text, removing irrelevant information, and handling missing data. Your data will also need to be split into chunks small enough to fit in a prompt, and each chunk converted into an embedding (a numeric vector) that can be stored and searched. This is sometimes referred to as indexing your data.
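As a rough illustration of the cleaning and splitting step, here is a minimal Python sketch. The function names, the 200-word window, and the 20-word overlap are assumptions made for this example, not recommended settings; real pipelines often split on sentences or headings instead.

```python
import re
from typing import List


def normalize(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> List[str]:
    """Split text into word windows that overlap by `overlap` words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = normalize(text).split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing a little duplicate text.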

(For geeks: I’ve done this exercise with the pgvector extension for PostgreSQL. It has a data type called “vector” which is great for storing embeddings in a column. This makes it easy to pre-generate and store vectors for your entire proprietary knowledge base. Then, you’ll use pgvector to calculate similarities using cosine similarity, dot product, or Euclidean distance. FYI, if you’re working with OpenAI’s API, the embeddings they generate are normalized so cosine similarity and dot product will produce the exact same results. You can also do this with Chroma, which is a purpose-built database for building AI applications with embeddings.)
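To see why cosine similarity and dot product agree for normalized vectors, here is a small Python sketch using made-up two-dimensional vectors (real embeddings have hundreds or thousands of dimensions). Cosine similarity divides the dot product by the two vector lengths, and when both lengths are 1 the division changes nothing.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    return dot(a, b) / (norm(a) * norm(b))


def to_unit(a):
    """Scale a vector to length 1."""
    n = norm(a)
    return [x / n for x in a]


# Toy vectors, chosen only for illustration.
u = to_unit([3.0, 4.0])
v = to_unit([1.0, 2.0])

# For unit-length vectors the two measures coincide (up to rounding).
measures_agree = math.isclose(cosine_similarity(u, v), dot(u, v))
```

For unit vectors the squared Euclidean distance is 2 minus twice the dot product, so ranking by Euclidean distance gives the same order as well.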

Training the Model

With your data preprocessed, you may also want to fine-tune a model, although context injection itself does not require it. Fine-tuning involves updating a pre-trained model with your specific data, so it better understands the context and produces more relevant results. If you go this route, it’s best practice to split your data into training, validation, and test sets to monitor the model’s performance and make adjustments if needed.
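The three-way split mentioned above can be sketched in a few lines of Python. The 80/10/10 proportions and the fixed seed are assumptions for this example; the right proportions depend on how much data you have.

```python
import random


def split_dataset(items, train=0.8, validation=0.1, seed=0):
    """Shuffle a copy of `items` and cut it into train, validation, and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    train_set = shuffled[:n_train]
    validation_set = shuffled[n_train:n_train + n_val]
    test_set = shuffled[n_train + n_val:]  # whatever remains
    return train_set, validation_set, test_set
```

Keeping the test set untouched until the end is what makes it a fair check on the final model.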

Interfacing with the LLM

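Now that all of your data are preprocessed and stored in a database, it’s time to interface with OpenAI’s API. A sensible first step is to run your query through OpenAI’s moderation endpoint, which is free to use; this screens out content that violates OpenAI’s usage policies before you spend tokens on it. Next, convert the query into an embedding with the same model you used for your knowledge base, match that query vector against the vectors in your database, and pull the most similar content to inject into your prompt.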
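The matching step can be sketched in memory with plain Python. Everything here is a stand-in: `toy_embed` uses word counts instead of a real embedding model, the three knowledge-base sentences are invented, and a production system would query a vector database rather than sort a list.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    overlap = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return overlap / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query, best first."""
    q = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:k]


# Invented example content.
knowledge_base = [
    "Refund policy: refunds are allowed within 30 days of purchase.",
    "The office is closed on public holidays.",
    "Shipping takes three to five business days.",
]
best = top_k("refund policy", knowledge_base, k=1)
```

Swapping `toy_embed` for a real embedding call and the `sorted` step for a database similarity query keeps the same overall shape.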

If you are price conscious, you’ll calculate (and possibly limit) the number of tokens this interaction will require. OpenAI charges by the token, and both the injected context and the model’s answer count toward the total. A good rule of thumb for English text is about four characters per token (roughly three-quarters of a word).
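A back-of-the-envelope token budget can be sketched as follows. The characters-per-token ratio is a parameter; the default of 4 follows OpenAI’s published approximation for English text and is only an estimate, since exact counts require the model’s own tokenizer. The price argument is a placeholder you would fill in from current pricing.

```python
import math


def estimate_tokens(text, chars_per_token=4.0):
    """Rough token count from character length (heuristic, not exact)."""
    return math.ceil(len(text) / chars_per_token)


def estimate_cost(text, price_per_1k_tokens, chars_per_token=4.0):
    """Estimated cost of sending `text`, given a price per 1,000 tokens."""
    return estimate_tokens(text, chars_per_token) / 1000 * price_per_1k_tokens


def trim_to_budget(chunks, max_tokens, chars_per_token=4.0):
    """Keep chunks in order until adding the next one would exceed the budget."""
    kept, used = [], 0
    for chunk in chunks:
        needed = estimate_tokens(chunk, chars_per_token)
        if used + needed > max_tokens:
            break
        kept.append(chunk)
        used += needed
    return kept
```

Because the chunks arrive ranked by similarity, trimming from the end drops the least relevant context first.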

The Automated Prompt

There are numerous ways to engineer a prompt (we offer prompt crafting workshops if you’re interested). But for a context injection schema to run effectively, you should do something like this:

Context Injection goes here.

Question or input from the user goes here.
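The two-part layout above can be assembled by a small helper like the one below. The instruction wording and the function name are assumptions for this sketch; you would tune the wording for your own application.

```python
def build_prompt(context_chunks, question):
    """Place retrieved context first, then the user's question."""
    context = "\n\n".join(chunk.strip() for chunk in context_chunks)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question.strip()}"
    )
```

Telling the model to admit when the context lacks the answer is one common way to discourage it from guessing.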

Note: While we have been using the terms ChatGPT, GPT-3, and GPT-4 interchangeably, do NOT do this manually using ChatGPT unless you have read OpenAI’s terms and conditions and fully understand what you are allowing the company to do with your data. For privacy and security reasons, Context Injection should only be done under a commercial agreement with OpenAI using their APIs.

Integrating Custom GPT-4

Finally, you’ll need to integrate your custom GPT-4 model into your business applications. This could involve deploying it as a chatbot for customer support, using it to generate personalized marketing materials, or any number of other applications tailored to your organization’s needs.

While the process of embedding your data into GPT-4 might seem daunting, the benefits it offers can be immense. Here are a few use cases to help illustrate why embedding your data is well worth the effort:

Enhanced Customer Support:

Streamlined Content Generation: By embedding your data in GPT-4, you can leverage the AI’s capabilities to generate tailored content for your business, such as blog posts, social media updates, and more.

Data-Driven Decision-Making: Customizing GPT-4 with your data can help you uncover hidden insights and trends, informing better strategic decisions for your business.

Of course, it’s crucial to address privacy and security concerns when embedding your data in GPT-4. By taking the necessary precautions, you can mitigate potential risks while still reaping the benefits of a custom AI model. Here are a few key considerations:

Data Anonymization:

Access Control: Limit access to your custom GPT-4 model to authorized personnel and applications. Implementing robust access control measures will help prevent unauthorized access and reduce the risk of data breaches.

Continuous Monitoring: Regularly monitor the performance of your custom GPT-4 model, checking for any signs of bias or other unintended consequences. By staying vigilant, you can address any issues before they escalate.
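As one small example of the anonymization consideration listed above, the sketch below masks email addresses and US-style phone numbers before text is stored or sent to an API. The patterns are simplified assumptions for illustration and will miss many real-world formats; treat this as a starting point rather than a complete anonymization scheme.

```python
import re

# Simplified patterns, assumed for this example only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact(text):
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)
```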

Embedding your company’s data in GPT-4 or any LLM can unlock a new level of AI-powered efficiency and effectiveness for your organization. By following the process outlined above and taking the necessary privacy and security precautions, you can create a custom AI solution tailored to your unique business needs.

For a deeper dive, check out our free online course, Generative AI for Executives. And, if you’re interested in discussing how you can put AI to work for your organization, please feel free to reach out.

Author’s note: This is not a sponsored post. I am the author of this article and it expresses my own opinions. I am not, nor is my company, receiving compensation for it.