One of the things ThinkTank spends a LOT of time thinkin’ through these days is the effective structuring of knowledge bases for Retrieval Augmented Generation (RAG) with Artificial Intelligence.
ThinkTank manages a public CustomGPT called dndGPT, available for free on ChatGPT. (It’s the DM’s assistant for you and me.) DndGPT is built on the open-source SRD 5.1: essentially a long, 400-page, highly organized, yet ultimately unstructured data source with a lot of nuance and variability.
The nature of the SRD 5.1 makes it an ideal platform for public experimentation.
With the release of GPT-4o mini, tasks of this nature are suddenly far more affordable to perform on the Assistants API, which in turn allows the finer-grained analysis of data retrieval that enabled the results herein.
Summary
- The open-source Dungeons and Dragons SRD 5.1 document was used as the basis for this informal study.
- This simple study used three forms of the document: the full document as a single PDF, the full document split into multiple chapter-sized (25-ish page) PDFs, and a 100-page splice of the PDF that contained the known information.
- We tested GPT-4o mini’s ability to retrieve a Monster, the “Aboleth,” from these different sources and compared the results. The mini retrieved the information accurately in all circumstances, so we’re instead interested in search efficiency.
- A happy mistake taught us that including vague location information in the System Message (or Custom Instructions) drastically improves search performance.
- After correcting the error, the results supported our hypothesis that the full PDF, split into chapters related by the same Vector Store, provides superior search results when the location of the information is not known.
- The 100-page splice did NOT show greater efficiency than the Chapter Sections (they were about equal), counter to our initial expectations.
- Conclusion: If a document is meant to be the canonical backbone of your AI system, it is worth spending the extra time to organize it semantically and visually into multiple small parts related by a Vector Store. This method is preferable when efficiency is required.
Structuring Unstructured Data for Better Search Efficiency with AI
Hey everyone, thanks for stopping by. This post summarizes results which can also be found in our dndGPT Case Study on the OpenAI Developer Platform.
DndGPT is a free Game Master’s Assistant CustomGPT available on ChatGPT. It is our playground for developing and working on the underlying technology while also providing something publicly available, based on publicly available data, and (hopefully) useful.
One of the biggest tasks given to Artificial Intelligence is being able to ask questions of a given data source without producing errors (hallucinations). There are several methods for accomplishing this, but what we’re looking at here is the structuring a long document requires to make its retrieval more efficient and effective.
DndGPT is a simple, but extremely effective, CustomGPT built on a single, very complicated data source: the 400-page SRD 5.1. The entire goal of the cGPT is to understand, search and retrieve, and otherwise interpolate data from this single source of truth as accurately as possible. A potential task, for example, is condensing part of the manual into a CSV.
During the initial build of the cGPT, it made intuitive sense to us to break the original document into multiple smaller parts (chapters) with descriptive file names to help with data retrieval, the idea being that the model wouldn’t have to search through the entire document to find something in, say, the Monsters chapter.
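To make that concrete, here is a minimal sketch of that kind of split using the pypdf library. The file names and page ranges are illustrative, not the exact boundaries we used:

```python
# Minimal sketch: split one long PDF into chapter-sized files with
# descriptive names. Page ranges are illustrative, not the actual
# SRD 5.1 chapter boundaries.
from pypdf import PdfReader, PdfWriter

chapters = {
    "dndgpt_srd51_spells.pdf": (100, 150),         # hypothetical range
    "dndgpt_srd51_monsters_atok.pdf": (250, 300),  # hypothetical range
}

reader = PdfReader("srd51_full.pdf")
for filename, (start, end) in chapters.items():
    writer = PdfWriter()
    for i in range(start, end):
        writer.add_page(reader.pages[i])
    with open(filename, "wb") as f:
        writer.write(f)
```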
With the recent release of GPT-4o mini, we are now able to demonstrate that this method is effective and desirable for enhancing search and retrieval for long documents.
Methodology
The following tests were performed on the new GPT 4o Mini via the Platform. In theory, the performance should be the same on the ChatGPT UI. The Platform allows far more analysis and fine-grained control of a model’s performance.
The test prompt was simple: “Please Extract the Aboleth,” which is the first Monster listed in the “A” section of the SRD 5.1 Monsters.
The test was performed under six conditions, using an OpenAI Vector Store to further organize the information.
- Search the Full PDF as a single 400 page document.
- Search the Full PDF split into smaller Chapter Sections. Approximately 25-50 pages per section with descriptive titles.
- Search a 100 page splice of the full document which contains the information.
- Each of the above was run both with and without a vague location hint in the Assistant’s System Message; our original study included the hint by accident, as described below.
The Assistant’s System Instructions provide specifics on what is expected when an “Extract” request is given. In particular, we have to explicitly require that “all the information” is extracted “verbatim”; otherwise the Assistant might summarize parts of it.
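For those following along on the Platform, the Assistant setup looks roughly like this (a sketch using the openai Python SDK; the instruction wording is paraphrased and the Vector Store ID is a placeholder):

```python
from openai import OpenAI

client = OpenAI()

# Sketch of the test Assistant. The instructions paraphrase ours;
# "vs_..." is a placeholder for a real Vector Store ID.
assistant = client.beta.assistants.create(
    model="gpt-4o-mini",
    instructions=(
        "When given an 'Extract' request, retrieve ALL of the "
        "information for that entry verbatim from the knowledge base. "
        "Do not summarize or omit any part of it."
    ),
    tools=[{"type": "file_search"}],
    tool_resources={"file_search": {"vector_store_ids": ["vs_..."]}},
)
```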
…And here’s what an “Aboleth” looks like, in case you were wondering.
Reviewing Common AI Terms
If you’re not particularly familiar with AI, all this might be gobbledygook to you. So let’s review some terms:
Model: You can select among several different models to work with. We went with GPT-4o mini, but could have chosen GPT-4o or GPT-4 Turbo, or any number of models out there. We feel this is a task best performed by the mini, since it requires intelligence but not advanced reasoning or creativity.
Token: A token is “a part of a word,” roughly 4 characters, but this varies by the model. Tokenizing is becoming more efficient as time goes on.
Tokens In: Everything the model reads counts as input tokens: your prompt, plus any information it gathers when you ask it to search something. This is the phase where it queries your databases and whatnot. We are particularly interested in this number for this study.
Tokens Out: This is the response from the AI. Generally speaking, you pay more per token for output than for input.
Context Window: This is the total number of tokens (in and out) that a model uses to answer your questions. The total context available for 4o models is currently 128k tokens. Meaning: if your search or your conversation exceeds that, you’re not getting a decision based on all the relevant information, which is a dangerous situation indeed.
Vector Store: A Vector Store is a type of Vector Database that adds metadata and embeddings to files to better enable AI understanding and search efficiency.
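If you want to see tokenization for yourself, OpenAI’s tiktoken library will count tokens for any string. A quick sketch (the 4o family uses the o200k_base encoding):

```python
import tiktoken

# The GPT-4o family (including the mini) uses the o200k_base encoding.
enc = tiktoken.get_encoding("o200k_base")
tokens = enc.encode("Please Extract the Aboleth")
print(len(tokens))  # a handful of tokens, roughly one per word or word-part
```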
GPT 4o Mini is Ideal for This Type of Retrieval
It should be noted that this type of task is absolutely what 4o mini was made for.
This task requires intelligence, but too much creativity allows for too much variability in search results.
Here’s the problem: Any given monster in the Dungeons and Dragons guide may be a block of descriptive text anywhere from 200 to 800 tokens long. That is, the description we’re after could be as short as a paragraph and as long as a page. This information follows a clear form, with most of the information usually in the same place… but since that last sentence contains words like “most” and “usually,” there are frequent anomalies in the data that require intelligence to handle effectively.
Intelligence, but not creativity.
We need the model to effectively and intelligently retrieve the complete section without adding anything of its own or summarizing. It is extremely important that it retrieve all of the information verbatim, as the section may contain custom instructions specific to that monster that a Dungeon Master (Decision Maker) needs to know about to make an accurate decision.
Given the variable length of the data, simple programs are ineffective at this task; it takes intelligence to determine where one monster ends and the next begins. But using a more advanced model would be overkill and proportionately more expensive.
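To illustrate the point, here is the kind of brittle heuristic a “simple program” ends up relying on. This is a hypothetical sketch, not something we use; it works only while every entry matches the usual layout:

```python
import re

# Naive splitter: assume every monster entry begins with a name line
# followed by a size/type line ("Large aberration, lawful evil", etc.).
# The "most" and "usually" anomalies in the source break this regularly,
# which is exactly why the task needs intelligence rather than rules.
ENTRY_START = re.compile(
    r"^[A-Z][\w' ]+\n(?:Tiny|Small|Medium|Large|Huge|Gargantuan)\b",
    re.MULTILINE,
)

def split_entries(text: str) -> list[str]:
    starts = [m.start() for m in ENTRY_START.finditer(text)]
    return [text[a:b] for a, b in zip(starts, starts[1:] + [len(text)])]
```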
A Note on Reading Comprehension
One of the things we were NOT measuring here was “comprehension,” or an “abstract understanding of the searched document.” It is an interesting question whether a document would be better comprehended by a model using these methods.
We were mostly looking at Tokens In and retrieving accurate information, not making complex decisions based upon that information, which would be better handled by a more advanced model.
Using a Vector Store to House the Unstructured Data
This experiment was performed using Vector Stores; Vector Stores and Vector Databases are at the heart of this issue.
Vector Stores are temporary (or permanent) virtual databases that, in the case of OpenAI’s, automatically add embeddings and other metadata to the housed data, which makes it easier for AI to understand. (You’ll be hearing a lot about them moving forward.)
Vector Stores are still quite new; OpenAI’s are in beta at the time of writing. With OpenAI, at least, you upload a file to traditional permanent file storage, then further associate that file with other files in a Vector Store.
So, to complete this experiment, three different Vector Stores were used, as described above. It is taken for granted that searching files that haven’t been vectorized and embedded will decrease search efficiency.
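In code, the “split into chapters” condition looks roughly like this (a sketch; the Vector Store endpoints sat under the SDK’s beta namespace at the time of writing):

```python
from openai import OpenAI

client = OpenAI()

# One Vector Store relating all of the chapter files.
store = client.beta.vector_stores.create(name="dndgpt_srd51_chapters")

chapter_files = [
    "dndgpt_srd51_monsters_atok.pdf",
    # ...the remaining chapter PDFs...
]

# Upload the files and wait for embedding/processing to finish.
client.beta.vector_stores.file_batches.upload_and_poll(
    vector_store_id=store.id,
    files=[open(path, "rb") for path in chapter_files],
)
```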
Original Document Preparation
It should be noted again, for emphasis, that the SRD 5.1 is a very well-organized document, both visually and semantically.
This isn’t a blob of words without any structure; rather, it has a semantic structure that can be intuited from the document itself.
It is important that whatever document you’re using is similarly well structured, or that there are methods in place for creating some sort of guidance; otherwise the model will have to expend more energy understanding it.
If you are creating a document for canonical use in an AI database, it is very much worth spending the extra time making it look nice. If you can understand its structure at a glance, so can Artificial Intelligence; if you have to spend time figuring out how the document is organized, so will AI.
Searching a Document with Vague Location Information in a Vector Store with AI
As noted, when we began this experiment we made a happy mistake: The System Instructions, which give a model its initial behavior, actually contained the specific file name we were working with: “dndgpt_srd51_monsters_atok.pdf.” Whoops!
This very simple 20-word sentence completely skewed our initial results, as you’ll see below. It is absolutely fascinating, and delightful, to report that including that vague location greatly enhanced results.
So, what happened was this: we would provide our prompt, “Please Extract and display the Aboleth from your knowledge base,” while the System Instructions included the following sentence: “We are working with the dndgpt_srd51_monsters_atok.pdf.”
Even when that smaller PDF was NOT included by name in the searched Vector Store, as when searching the full document, that exceptionally vague reference was enough of a clue to GREATLY enhance the model’s search results.
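Concretely, the only difference between the two arms of the experiment was that one sentence in the System Instructions (wording approximate):

```python
# The prompt was identical in every run.
PROMPT = "Please Extract and display the Aboleth from your knowledge base."

# Baseline System Instructions (paraphrased).
INSTRUCTIONS_NO_HINT = (
    "When given an 'Extract' request, retrieve all of the information "
    "for that entry verbatim from the knowledge base."
)

# The accidental "vague location" condition: one extra sentence.
INSTRUCTIONS_WITH_HINT = INSTRUCTIONS_NO_HINT + (
    " We are working with the dndgpt_srd51_monsters_atok.pdf."
)
```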
Observe:
Searching the Full Document with Vague Location Information
Tokens In: 18,024; Tokens Out: 683
Searching the Full Document Split into Chapters with Vague Location Information
Tokens In: 18,742; Tokens Out: 777
Searching a Small (100-Page) Section of the Full Document with Vague Location Information
Tokens In: 18,889; Tokens Out: 781. Note the file size: 1 MB vs. 5 MB above.
Searching a Document Without Vague Location Information in a Vector Store
After the above error was detected and the vague location information was removed, the results were far more in line with what we expected:
Searching the Full Document with No Location Information
Tokens In: 37,809; Tokens Out: 820. Note the jump in Tokens In versus the runs with vague location information. Crazy.
Searching the Full Document Split into Chapters with No Location Information
Tokens In: 19,193; Tokens Out: 758. Note that the file size of the split document, 5 MB, is the same as the full document. The model has all of the same data available to it, but it is able to perform the search far more efficiently.
Searching the Small (100-Page) Section of the Document with No Location Information
Tokens In: 19,400; Tokens Out: 714. Note the smaller (1 MB) size of the searched database.
Analysis
Alright, the biggest surprise here is what happens if you already kinda know the location of what you’re looking for.
If you don’t have, or provide, that information, we can see the model using significantly more input tokens, 18,024 vs. 37,809, to search the full document and extract the requested information. That’s roughly 2.1x more input tokens when it doesn’t vaguely know where to look.
Interestingly, if you don’t provide vague location information but have divided the source document into smaller (well-named) subsections, the model performs almost the same as when it already knows where to look: 18,742 vs. 19,193 input tokens.
What was expected, and is here demonstrated, is that if you do not know the location, but have previously split the document into subsections related through the same Vector Store, with the files named appropriately, there are significant efficiency gains: 19,193 vs. 37,809 input tokens when searching the split vs. the full document without a location guess, a 1.97x difference.
What was unexpected, and is also demonstrated, is that further splitting the document into a smaller subsection (only the area of the document with monsters, a 100-page subsection) didn’t yield significantly different input results from the fully divided document: 18,889 input tokens knowing the location, and 19,400 input tokens when not knowing the location.
Here, the win is the size of the Vector Store, which drops from 5 MB (for either the full document or the full split document) to 1 MB for the small subsection. A minor efficiency gain. So if you can take the time, splitting the full document is the way to go; if you’re in a hurry, grab a relevant subsection.
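For the record, the ratios quoted above fall straight out of the reported token counts:

```python
# Input tokens from the six runs reported above.
tokens_in = {
    "full_with_hint": 18_024,
    "split_with_hint": 18_742,
    "splice_with_hint": 18_889,
    "full_no_hint": 37_809,
    "split_no_hint": 19_193,
    "splice_no_hint": 19_400,
}

print(tokens_in["full_no_hint"] / tokens_in["full_with_hint"])  # ~2.10
print(tokens_in["full_no_hint"] / tokens_in["split_no_hint"])   # ~1.97
```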
Finally, it is awesome to note that, regardless of the input size, the model extracted the appropriate information with only one small variance across all six experiments. (It sometimes does, and sometimes does not, include information about custom instructions regarding the Aboleth’s Legendary Actions. :thinking:) You can see this in the stability of output tokens. That’s pretty cool.
Conclusions
- If you only kinda-sorta know where you’re looking in a document of any length, include that information in your prompt. It can save all sorts of compute and money. That’s wild.
- Dividing a long document into smaller, appropriately named sections related through the same Vector Store yields significant, demonstrable search efficiencies. That is, it is worth taking a long document and splitting it into chapters.
- Further dividing a document into a smaller Vector Store does decrease storage costs, but does not yield significant search efficiencies over the full split document in a single Vector Store.
- Regardless of input, the output was remarkably stable, with only the variance of a single paragraph in all six experiments.