LLMs today
The potential of using LLMs for knowledge extraction is nothing short of amazing. Over the last couple of months we’ve seen a rush towards integrating large language models into a variety of tasks: data summarization, Q&A chatbots and entity extraction are just a few examples of what people are doing with these models.
With this new technology, new disciplines and challenges emerge:
What’s the proper way to query my data?
How can I reduce model hallucinations?
Where should I store my data?
Current approach
Vector databases seem to have become the default option for indexing, storing and retrieving the data that will later be presented to the LLM as context, along with a question or a task.
The flow is quite straightforward: consider a list of documents containing data we would like to query (these can be Wikipedia pages, proprietary corporate knowledge or a list of recipes). The data is usually chunked into smaller pieces, an embedding is created for each piece, and finally the chunks along with their embeddings are stored in a vector database.
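To make this concrete, here is a minimal indexing sketch using LangChain. FAISS stands in for whichever vector database is actually used, and the two recipe strings are made-up placeholders for real documents:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# Toy "documents" standing in for Wikipedia pages, corporate docs or recipes.
documents = [
    "Spaghetti alla puttanesca: tomatoes, olives, capers, anchovies, garlic ...",
    "Risotto alla milanese: carnaroli rice, saffron, butter, parmesan ...",
]

# 1. Chunk the documents into smaller pieces.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.create_documents(documents)

# 2. Embed each chunk and 3. store the chunks with their embeddings in the vector store.
vector_store = FAISS.from_documents(chunks, OpenAIEmbeddings())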
When it’s time to ask a question, e.g. “suggest three Italian recipes which don’t contain eggplants for a dinner party of four”, the question itself gets embedded into a vector and the vector database is asked for the K (let’s say 20) most semantically similar vectors (recipes in our case). These results from the DB form a context that is presented to the LLM along with the original question, in the hope that the context is rich and accurate enough for the LLM to provide a suitable answer.
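Continuing the sketch above (and reusing its vector_store), query time looks roughly like this; the prompt wording here is my own and not a prescribed template:

from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage

question = "Suggest three Italian recipes which don't contain eggplants for a dinner party of four."

# The question is embedded and the K most semantically similar chunks are retrieved.
similar_chunks = vector_store.similarity_search(question, k=20)

# The retrieved chunks become the context handed to the LLM alongside the question.
context = "\n\n".join(chunk.page_content for chunk in similar_chunks)
prompt = f"Answer the question using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo")
answer = llm([HumanMessage(content=prompt)]).content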
One major flaw of this approach is that it is too limited: the backing DB will only provide results which are semantically “close” to the user’s question, so the generated context can lack vital information needed by the LLM to provide a decent answer.
Alternative
As an alternative, one can use a knowledge graph not only to store and query the original documents but also to capture the different entities and relations embedded within the data.
To utilize a graph DB as a knowledge base for LLMs, we start by constructing a knowledge graph from our documents. This process includes identifying the different entities and the relationships among them, e.g. (Napoleon Bonaparte) – [IMPRISONED] -> (island of Saint Helena).
* Surprisingly, LLMs can be used for this extraction process as well (see the sketch below).
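As a rough illustration, the sketch below uses LangChain’s GraphIndexCreator to have an LLM extract (subject, object, relation) triples from raw text and then writes them into FalkorDB with Cypher. The Entity label, the naive relation-name sanitization and the example sentence are assumptions of mine, not a prescribed recipe:

from langchain.graphs import FalkorDBGraph
from langchain.indexes import GraphIndexCreator
from langchain.llms import OpenAI

text = "Napoleon Bonaparte was imprisoned on the island of Saint Helena."

# Let an LLM pull (subject, object, relation) triples out of the raw text.
index_creator = GraphIndexCreator(llm=OpenAI(temperature=0))
triples = index_creator.from_text(text).get_triples()
# e.g. [("Napoleon Bonaparte", "island of Saint Helena", "was imprisoned on")]

# Persist each triple as two nodes and a relationship in FalkorDB.
graph = FalkorDBGraph("history", host="localhost", port=6380)
for subject, obj, relation in triples:
    rel_type = relation.upper().replace(" ", "_")  # naive sanitization for the relationship type
    graph.query(
        f"MERGE (s:Entity {{name: '{subject}'}}) "
        f"MERGE (o:Entity {{name: '{obj}'}}) "
        f"MERGE (s)-[:{rel_type}]->(o)"
    )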
Once the graph is constructed, we use it for context construction. A question presented by the user is translated into a graph query; at this point we’re not limited to a set of K semantically similar vectors, but can utilize all of the connections stored within our graph to generate a much richer context. It is this context, along with the original question, that is presented to the LLM to get the final answer.
[Figure: Context extraction – querying the graph for context]
[Figure: Graph generation – entity and relation extraction from raw text]
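To show what traversal-based context can look like, here is a hand-written Cypher query over a hypothetical recipe graph (the Recipe/Ingredient labels and the CONTAINS relationship are made up for illustration). In the demo below, GraphCypherQAChain generates queries of this shape automatically from the user’s question:

from langchain.graphs import FalkorDBGraph

graph = FalkorDBGraph("recipes", host="localhost", port=6380)

# Hypothetical schema: (:Recipe {name, cuisine, serves})-[:CONTAINS]->(:Ingredient {name})
context_rows = graph.query("""
    MATCH (r:Recipe {cuisine: 'Italian'})
    WHERE r.serves >= 4 AND NOT (r)-[:CONTAINS]->(:Ingredient {name: 'eggplant'})
    MATCH (r)-[:CONTAINS]->(i:Ingredient)
    RETURN r.name AS recipe, collect(i.name) AS ingredients
    LIMIT 3
""")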
Demo
To put all of the above into practice, I’ve constructed a demo using LangChain that queries music-related Wikipedia pages, comparing a vector-store setup against a knowledge graph (FalkorDB).
from langchain.chains import GraphCypherQAChain
from langchain.chat_models import ChatOpenAI
from langchain.graphs import FalkorDBGraph

def query_graph(graph_id, query):
    # Connect to the FalkorDB graph and load its schema so Cypher can be generated against it.
    graph = FalkorDBGraph(graph_id, host="localhost", port=6380)
    graph.refresh_schema()
    # One LLM turns the question into a Cypher query, a second phrases the final answer.
    chain = GraphCypherQAChain.from_llm(
        cypher_llm=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo"),
        qa_llm=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo"),
        graph=graph,
        verbose=True,
    )
    return chain.run(query)
query_graph("music", "Which musician had the most collaborations?")
query_graph("music", "Which two musician are married ?")
query_graph("music", "Which country produced the most talent?")
Q&A
Here are the questions and the answers I got from each setup:
Which musician had the most collaborations?
Vector – Mark Hudson did the most collaborations.
Graph – Mark Hudson did the most collaborations with a total of 8.
Which two musicians are married?
Vector – There is no information provided about any musicians being married to each other.
Graph – Bob Thiele and Teresa Brewer are married musicians.
Which artist won an award?
Vector – Usher won multiple awards, including Grammy Awards and Soul Train Music Awards.
Graph – Usher won a Grammy Award.
Which country produced the most talent?
Vector – The document does not provide information about which country produced the most talent in country music.
Graph – The country that produced the most talent is the United States of America.
Is there an indirect connection between Kylie Minogue and Simon Franglen? If so, name the artists on that path.
Vector – There is no indirect connection between Kylie Minogue and Simon Franglen.
Graph – Yes, there is an indirect connection between Kylie Minogue and Simon Franglen. The artists on that path are Whitney Houston, Barbra Streisand, Graham Stack, and Rod Stewart.
Conclusions
As can be seen, the vector database setup managed to answer only 2 of the 5 questions. This is quite expected, as the questions asked are not semantically close to their answers; as such, we can’t expect the documents retrieved from the vector DB to contain the information necessary to answer them.
On the other hand, the graph database setup managed to answer all 5 questions. The success of this approach is primarily accounted for by the auto-generated graph query, which is used to build a much more relevant and richer LLM context.
Although in these examples we’ve seen the graph doing quite well, it is my belief that a more robust solution combines both worlds. This is why FalkorDB has introduced a vector index as part of its indexing suite: one can now start building a query context using a combination of vector search and graph traversals. Consider a user question which kicks off a vector search that ends up with K nodes, from which graph traversal continues, reaching important fragments of data scattered across the graph.
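A rough sketch of that hybrid flow follows. Note that the vector-index procedure call and the Artist/COLLABORATED_WITH schema are assumptions on my part; the exact index-creation and query syntax should be checked against the FalkorDB documentation for your version:

from langchain.embeddings import OpenAIEmbeddings
from langchain.graphs import FalkorDBGraph

graph = FalkorDBGraph("music", host="localhost", port=6380)

# Embed the user question, just as we would for a plain vector search.
question = "Which artists are closely connected to Kylie Minogue?"
q_vec = OpenAIEmbeddings().embed_query(question)

# Vector search picks K seed nodes; graph traversal then fans out from those seeds,
# reaching related data a pure similarity search would never surface.
# NOTE: the procedure name and signature below are an assumption based on the FalkorDB docs.
rows = graph.query(f"""
    CALL db.idx.vector.queryNodes('Artist', 'embedding', 5, vecf32({q_vec}))
    YIELD node
    MATCH (node)-[:COLLABORATED_WITH*1..2]-(other:Artist)
    RETURN node.name AS seed, collect(DISTINCT other.name) AS neighbourhood
""")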