Using Machine Learning to Answer Questions from Internal Documentation

We have a lot of documentation about how we do things at thoughtbot. This includes our company policies and practices in our private Handbook, this blog, our public Playbook, and our public Guides. While we have a custom internal search engine that searches all of this, it can sometimes be difficult to quickly find what you’re looking for, particularly if it’s the answer to a specific question.

ChatGPT has a great model for being able to answer questions about the knowledge it already has, but can we use it to answer questions from internal company documentation?

Answer to a question based on our internal documentation

Yes! In the example above you can see that it has used the documentation we have to accurately answer the question. It even provides a link to the place where the information can be found.

In this article I’ll walk through how this actually works, including how to ensure accurate answers, and how you might be able to implement something like this for yourself. thoughtbot offers AI and Machine Learning Integration services. Get in touch if you’d like to chat about how you could do this for your own company, or about other ways you might apply ChatGPT or other Machine Learning models to your product.

How ChatGPT Works

The interaction model for ChatGPT is to provide it with a prompt. ChatGPT also has the concept of context, which includes the previous prompts and any information you provided it as well as its prior responses during the conversation.

In the above example chat with ChatGPT, the entire “prompt” is

I am founder and CEO of thoughtbot.

What is the company I work at?

ChatGPT is able to take in the information I provided and respond with an answer.

You can see from the example response that it happens to have information about what thoughtbot is and provided a detailed response to my question using that information.

While the information it provided about thoughtbot is accurate, one issue with the generative model behind ChatGPT is that it can make up information. To address this, when making a ChatGPT prompt, you can tell it to generate a response using only the information provided, and instruct it to say that it doesn’t know the answer, rather than making something up.

You can see that when we do this, ChatGPT will not introduce additional information into its response:

I am the founder and CEO of thoughtbot. What is the company I work at?

The company you work at is thoughtbot.

This is especially important for a tool where you are trying to answer factually from internal documentation, rather than make something up or use external sources.
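As a rough illustration of the two ideas in this section, the sketch below represents a conversation as a list of messages, each with a role and content, which is the shape OpenAI’s chat API accepts. It prepends an instruction to answer only from what was provided. The instruction wording and the names `GROUNDING_INSTRUCTION` and `grounded_messages` are assumptions made for this example, not something taken from our app.

```ruby
# Hypothetical instruction text; tune the wording for your own use case.
GROUNDING_INSTRUCTION = <<~TEXT.strip
  Answer using only the information provided in this conversation.
  If the answer is not in that information, say that you don't know.
TEXT

# Builds the list of messages for a chat request.
# history:  earlier messages in the conversation (this is the "context").
# question: the new prompt from the user.
def grounded_messages(history, question)
  [{role: "system", content: GROUNDING_INSTRUCTION}] +
    history +
    [{role: "user", content: question}]
end

history = [{role: "user", content: "I am the founder and CEO of thoughtbot."}]
messages = grounded_messages(history, "What is the company I work at?")
messages.each { |message| puts "#{message[:role]}: #{message[:content]}" }
```

Keeping the instruction in its own message means it applies to every question in the conversation, rather than having to be repeated in each prompt.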

Composing the prompt

Let’s go back to our goal of getting ChatGPT to answer questions based on internal documentation that it doesn’t already have in the model. Based on what we’ve established above, we will end up with a prompt with the following general structure:

<Relevant context from which the answer to the question can be found>

Using only the information provided above, answer the following question:

<The question we want answered>
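A minimal sketch of filling in that structure, assuming the context and the question are already available as plain strings. The `build_prompt` helper and the sample text are hypothetical and only illustrate the template.

```ruby
# Hypothetical helper: joins the context and the question into one prompt,
# following the three-part structure shown above.
def build_prompt(context:, question:)
  <<~PROMPT
    #{context.strip}

    Using only the information provided above, answer the following question:

    #{question.strip}
  PROMPT
end

prompt = build_prompt(
  context: "The Playbook and Guides are public. The Handbook is private.",
  question: "Which of our documents are private?"
)
puts prompt
```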

In theory, we could provide the contents of all of the documents we want to pull answers from as the context. In practice, ChatGPT limits the size of its context. ChatGPT 3.5 currently allows 4,000 tokens at the base pricing, or up to 16,000 tokens at a higher cost. ChatGPT 4.0 currently allows 8,000 tokens at the base pricing, or up to 32,000 tokens at a higher rate.

Tokens aren’t exactly equivalent to words, but for right now that’s a useful way to think about them to get a sense of how much information we’re able to provide ChatGPT. Our total documentation is much larger than those limits, so we need some way to limit the amount of information provided as context in our prompt.
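To get a feel for the budget, here is a rough sketch that estimates token counts from character counts. It assumes the common rule of thumb of about four characters per token for English text, which is only an approximation; an exact count needs a real tokenizer. The constant names and the 200 tokens reserved for the reply are assumptions for this example.

```ruby
CONTEXT_LIMIT = 4_000      # base limit quoted above for ChatGPT 3.5
RESERVED_FOR_REPLY = 200   # assumption: headroom kept free for the answer
CHARS_PER_TOKEN = 4.0      # rough rule of thumb for English text

# Very rough estimate; use a real tokenizer when accuracy matters.
def estimated_tokens(text)
  (text.length / CHARS_PER_TOKEN).ceil
end

# True when the text is estimated to fit within the prompt budget.
def fits_in_budget?(text)
  estimated_tokens(text) <= CONTEXT_LIMIT - RESERVED_FOR_REPLY
end

short_doc = "Our Handbook is private."
long_doc = "word " * 5_000

puts estimated_tokens(short_doc)
puts fits_in_budget?(short_doc)
puts fits_in_budget?(long_doc)
```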

This is accomplished by doing a search before composing the prompt to ChatGPT to find the documentation that is most likely to include the answer to the question, and providing just that information to ChatGPT as the context.

We’ve previously built a custom internal search engine using Ruby on Rails and Elasticsearch that helps people at thoughtbot find what they are looking for across all of our internal and external documentation.

Thankfully, we’re able to use this existing searchable index of our documentation to identify the most relevant documentation as context to ChatGPT.

If you’re interested in building a solution like this and don’t already have a searchable index that gives you usable results, you’ll need to build one. In addition to Elasticsearch, there are other database solutions specifically tailored for these types of tasks, like Pinecone.

The general steps we’ll take are:

1. Take the question from the user
2. Use the question to find relevant documents that might answer the question
3. Compose the prompt for ChatGPT that includes both the relevant documentation and the question
4. Send the prompt to ChatGPT and display the response to the user
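The steps above can be sketched end to end as follows. This is a simplified illustration, not our production code: the `Document` struct, `answer_question`, and the character-based `max_chars` budget are all hypothetical, and the search and model calls are passed in as lambdas so the example runs without Elasticsearch or an API key.

```ruby
# Hypothetical stand-in for a search result.
Document = Struct.new(:name, :url, :body)

# Takes the user's question and returns the model's reply.
# search:    callable returning documents, most relevant first.
# ask_model: callable that sends a prompt and returns the reply.
def answer_question(question, search:, ask_model:, max_chars: 12_000)
  # Find documents that might contain the answer.
  documents = search.call(question)

  # Add documents until the next one would exceed the budget.
  context = +""
  documents.each do |document|
    entry = "[#{document.name}](#{document.url})\n\n#{document.body}\n\n"
    break if context.length + entry.length > max_chars
    context << entry
  end

  # Compose the prompt from the selected documents and the question.
  prompt = "#{context}Using only the information provided above, " \
    "answer the following question: #{question}"

  # Send the prompt to the model and hand back its reply.
  ask_model.call(prompt)
end

# Stubs that stand in for the real search index and the ChatGPT API.
fake_search = ->(_question) {
  [Document.new("Handbook", "https://example.com/handbook", "Example policy text.")]
}
fake_model = ->(prompt) { "Stub reply to a #{prompt.length}-character prompt" }

puts answer_question("What is our policy?", search: fake_search, ask_model: fake_model)
```

Passing the search and the model in as arguments keeps the flow easy to test, and makes it simple to swap in a different index or model later.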

You can see the actual Ruby code, which lives in our Rails app, for the above steps at the bottom of this article.

Privacy Considerations

The majority of the information that could potentially be sent with your question is already fully public: all of the benefits information, the Playbook, and the Guides. The Handbook is the only private source whose content is sent to OpenAI, and there isn’t actually much in it that is truly sensitive.

That said, OpenAI’s terms of service state:

OpenAI will not use data submitted by customers via our API to train or improve our models, unless you explicitly decide to share your data with us for this purpose.

(we are not opted in)

Any data sent through the API will be retained for abuse and misuse monitoring purposes for a maximum of 30 days, after which it will be deleted (unless otherwise required by law).

The document also describes the data protections and security they have in place during those 30 days.

Given all of that, I felt comfortable with the way this works now, but we’ll want to keep an eye on it if we do additional things or the terms change. Even though I think it’s currently fine, it would be better in the long run to use an open source, self-hosted model so that we don’t need to rely on third parties at all.

Example Code

Here is the actual Ruby code, which lives in our Rails app, for finding the relevant documents, composing a prompt below the token limit, submitting it to ChatGPT and returning the response. It uses our Search class for searching Elasticsearch, along with the tiktoken_ruby and ruby-openai Ruby gems for counting tokens and using the ChatGPT API, respectively.

class Prompt
  include ActiveModel::Model
  MODEL = "gpt-3.5-turbo".freeze

  class_attribute :config, :elastic_search

  attr_accessor :text, :filters, :response, :context

  def initialize(attributes = {})
    self.filters = attributes.fetch(:filters, Repository.all.map(&:result_type))
    self.context ||= ""
    super(attributes)
  end

  def result
    @result ||= perform
  end

  def documents
    # Memoize so the search runs once, not on every iteration in build_context.
    @documents ||= Search.new(q: text, filters: filters).perform
  end

  def build_context
    # Add documents in search order until the next one would reach the
    # 3,800-token budget, which leaves headroom under the model's limit.
    current_document = 0
    while current_document < documents.length
      entry = document_for_result(documents[current_document])
      break if token_count(context) + token_count(entry) >= 3800
      self.context += entry
      current_document += 1
    end
  end

  def perform
    if text.present?
      build_context
      reply = client.chat(
        parameters: {
          model: MODEL,
          messages: [{role: "user", content: content}],
          temperature: 0.0
        }
      )
      self.response = reply.dig("choices", 0, "message", "content")
    end
  end

  private

  def content
    <<~CONTENT
      #{context}
      Using only the information provided above, answer the following question and provide the url where the answer can be found: #{text}
    CONTENT
  end

  def document_for_result(result)
    "[#{result.name}](#{result.url})\n\n#{result.description}\n\n"
  end

  def token_count(string)
    encoding.encode(string).size
  end

  def client
    @client ||= OpenAI::Client.new
  end

  def encoding
    @encoding ||= Tiktoken.encoding_for_model(MODEL)
  end
end

We can help you apply ChatGPT and Machine Learning models to your product. If you’d like to chat about how to do this for your company, get in touch or check out our AI and Machine Learning services.
