Code is the foundation of modern software, but as codebases grow in complexity, understanding and navigating them becomes increasingly challenging. Code Graph is a visual representation of a codebase, leveraging Knowledge Graphs and Large Language Models (LLMs) to map the relationships between code entities, such as functions, variables, and classes.
In this article, we explore the core concepts of Code Graph, deep-dive into how they are created, and explain how it enhances code analysis. We will also showcase FalkorDB’s Code Graph tool that enables you to create a deployable Code Graph explorer and query interface from any GitHub repository. Let’s dive in!
What is Code Graph and How it Enhance Code Analysis
A Code Graph is a visual representation of a codebase as a Knowledge Graph that helps one explore entities in code (functions, variables, classes) and their relationships. By mapping out these connections, it becomes easier to understand the structure and flow of the code, identify potential issues, and improve overall code quality.
The concept of representing code as a graph data structure has its roots in the early days of software engineering when researchers explored ways to model and analyze program structure and behavior using graph-based techniques. Modern Code Graphs, incorporating Knowledge Graphs and Large Language Models (LLMs), are a recent development.
The emergence of modern Knowledge Graph databases, such as FalkorDB, has made it possible to efficiently store, query, and visualize large-scale code graphs. This enhances the understanding of code; its capability in helping with code navigation can empower developers with numerous benefits agnostic of the programming language they are using:
- Improved Understanding: Helps trace the flow of data through functions and identify interconnected components.
- Impact Analysis: Assesses the ripple effects of code changes, predicting potential issues before they arise.
- Autocompletion: Suggests relevant functions, variables, and types based on the current context.
- Code Search: Searches for functionalities not just by keywords, but by understanding the relationships between code elements.
The ability of Code Graphs to provide a clear, graphical view of complex code structures makes it simpler to trace code execution paths and call graphs, pinpoint areas of high complexity, and facilitate better debugging and refactoring.
We have created a few example Code Graphs and their corresponding visualization at FalkorDB.
RAG (Retrieval Augmented Generation) for Code Graph Creation
The advent of Knowledge Graphs has made it possible to not only visualize but also traverse and reason over the relationships within a Code Graph. For instance, using appropriate Cypher graph queries, you can:
- Discover recursive functions within a codebase;
- Explore methods that are not used at all;
- Find the methods that are most used;
- Discover how one function impacts another.
However, what if you wanted to use natural language queries to do the same? This is where modern LLMs and Retrieval-Augmented Generation (RAG) architecture come into play.
By leveraging LLMs, developers can pose natural language questions about their codebase, and the RAG pipeline can potentially help transform these queries into graph queries while also explaining the results retrieved. Developers can ask questions like:
- “Which functions are most frequently called in this module?” or
- “Are there any unused methods in the project?”
They can receive insights without needing to master complex query languages, enabling a more accessible way to interact with the codebase.
This can unlock several capabilities:
- Improved code navigation
- Understanding of the different modules, functions, classes, methods
- Creating documentation from code
- Discovering dependencies within the code
RAG architecture works by integrating a retrieval model with a generative model (powered by LLMs), where the retrieval model first fetches the relevant documents or data from a data store based on the input query. The retrieved information is then used as context by the generative model to produce more accurate and contextually relevant outputs, effectively combining search and generation capabilities.
With RAG architectures, developers commonly use Vector Databases to retrieve documents by using similarity search. This approach, however, breaks down when it comes to Code Graph, as we will see below.
Advantages of Using Knowledge Graphs Over Vector Databases for RAG-Powered Code Graph
To build a RAG for a Code Graph, developers can provide context to LLMs using either a Vector Database or a Knowledge Graph.
Vector Databases store data as high-dimensional vector embeddings, which are numerical representations of unstructured data that capture its semantic meaning. In the case of code exploration, converting a codebase and its elements into vector embeddings enables searching through the vector space to find similar or dissimilar functions based on a query.
For instance, the vector embedding for function A might look like [0.12, 0.31, 0.56, 0.88, …, 0.92], while the vector embedding for function B might look like [0.83, 0.66, 0.91, 0.89, …, 0.91]. Using Vector Databases, each data point is converted into a numerical representation, allowing for similarity searches to determine which elements (functions, arguments) are closer or farther apart. This also allows for the use of natural language queries to pinpoint the right section of a codebase.
However, this method breaks down when you try to reason over the codebase or explore relationships between functions, modules, classes, and so on. For this, you need a way to capture relationships between elements of code in a structured way. This is where Knowledge Graphs offer significant advantages:
- Structured Relationships: Knowledge Graphs capture the direct relationships between different code elements, such as inheritance, dependencies, and usage patterns.
- Graph Query: With Knowledge Graphs, you can use a graph query language like Cypher to traverse and analyze the graph. This allows you to identify recursive functions, unused methods, or highly utilized functions, and understand how different parts of the codebase interact with each other.
- Reasoning: Knowledge Graphs support reasoning and inference, allowing you to derive new insights from existing relationships. This is essential for tasks like impact analysis, where you need to understand the potential consequences of changes in the codebase.
- Integration with RAG: When integrated with RAG architecture, Knowledge Graphs can provide rich, contextual information to LLMs, enabling more accurate and contextually relevant outputs. This combination leverages the strengths of both structured and unstructured data representation.
- Scalability: Knowledge Graphs can evolve with the codebase, making it easier to maintain and scale as the project grows. They can integrate seamlessly with various development tools, providing a unified view of the codebase that is always up-to-date.
In other words, while Vector Databases offer powerful capabilities for similarity searches, Knowledge Graphs provide a structured solution for reasoning over the codebase and exploring complex relationships between code elements.
This makes Knowledge Graphs the right technology to build effective RAG-powered Code Graphs, an architecture that is referred to as KG-RAG or GraphRAG.
Visualizing Your Code with a Code Graph
When building a Code Graph using Knowledge Graphs, visual exploration of code becomes possible. For example, using FalkorDB, you can build a Code Graph and visualize how classes, methods, arguments, and modules are interconnected and related to each other. You can also “ask questions” about your codebase, such as “list functions which have the highest number of arguments.”
In addition, this approach offers several key benefits:
- Dependency Mapping: Discovering dependencies between various modules, classes, and functions.
- Simplified Debugging: Tracing execution paths and pinpointing the source of bugs or performance bottlenecks.
- Enhanced Documentation: Serving as a dynamic and up-to-date documentation tool, helping team members understand the structure and flow of the project.
- Impact Analysis: Conducting effective impact analysis to see how changes in one part of the codebase might affect other parts.
- Collaboration: Ensuring everyone has a consistent understanding of the project’s architecture and dependencies.
- Interactive Exploration: Enabling developers to drill down into specific sections of the code, perform detailed analyses, and run queries.
In essence, visualizing your code with a Code Graph transforms complex codebases into intuitive, interactive diagrams.
Using the FalkorDB Code Graph browser, you can try the Code Graph for the Python requests library. You can zoom in, query the graph in natural language, and understand in detail how the library works.
You might be wondering how to create a similar Code Graph for your own project. Below, we are going to describe the underlying architecture and then explain how you can do it using FalkorDB.
Understanding the Workflow of Building a Code Graph
Creating a Code Graph involves several steps that can transform your codebase into a visual and interactive graph:
1. Static Code Analysis
The first step involves a thorough analysis of your codebase. The goal is to parse various entities such as classes, methods, functions, and their interrelations – similar to how compilers work. Here, Abstract Syntax Tree (AST) parsers are leveraged, enabling the precise extraction of structural and behavioral attributes of the code.
2. Graph Construction
The next step is constructing the Code Graph and storing it in the Knowledge Graph. This involves the systematic creation of nodes for each identified entity (e.g., classes, methods) and edges to represent relationships such as inheritance, method invocations, program dependence, and data flows. This is done by using Cypher queries.
3. Data Enrichment
Optionally, you can also include metadata such as function signatures, documentation comments, code metrics (e.g., cyclomatic complexity, lines of code), and version control history. Enrichment transforms the graph into a comprehensive repository of knowledge.
4. Visualization
You then use graph rendering libraries to visualize the Code Graph, which helps build clear and interactive diagrams that depict the intricate relationships within your codebase. The visualization engine supports features like zoom, pan, and node highlighting, allowing developers to intuitively explore complex code structures.
5. Querying and Analysis
Once the Code Graph is constructed in the Knowledge Graph, you can then build an application that leverages RAG architecture. Using LLMs like OpenAI’s GPT models, or open-source LLMs like Llama 3, you can convert the natural language queries into Cypher, and then use the Knowledge Graph to explore and reason over the graph. For example, queries can identify functions with the highest number of arguments, discover directed graph structures within code, detect circular dependencies, or trace data propagation paths.
By following this workflow, developers can transform their codebases into powerful visual tools, significantly enhancing their ability to analyze, understand, and maintain complex software projects.
Interacting with OpenAI for Transforming Queries
OpenAI models like GPT-4 or GPT-4o have the capability to convert natural language queries into Cypher, which can then be used to explore a Code Graph.
For instance, an LLM can convert a natural language query like “Find the top 10 functions with the most arguments” into the Cypher query below:
MATCH (f:Function)-[:HAS_ARGUMENT]->(a:Argument)
RETURN f.name AS FunctionName, COUNT(a) AS ArgumentCount
ORDER BY ArgumentCount
DESC LIMIT 10
Similarly, a query like “List all functions that are not called by any other functions” will be translated into the following Cypher:
MATCH (f:Function)
WHERE NOT (f)<-[:CALLS]-(:Function)
RETURN f.name AS UnusedFunction
A more complex example is a query like “Find all functions that are indirectly called by the ‘main’ function through any number of intermediate functions”, which requires graph traversal:
MATCH path = (start:Function {name: "main"})-[:CALLS*2..]->(end:Function)
RETURN DISTINCT end.name AS IndirectlyCalledFunction, length(path) AS Hops
ORDER BY Hops
The [:CALLS*2..] syntax ensures that only paths with at least 2 hops are considered, excluding direct calls.
A powerful aspect of Cypher is that it is highly readable. This makes it easy to understand the way in which an application is leveraging the graph. If you attempted to build a similar application using Vector Databases, not only would it be impossible to explore the Code Graph, but the queries would also use vector embeddings, which make it impossible to ‘explain’.
Detailed Knowledge Graph Schema for a Code Graph
As we saw above, Knowledge Graphs offer a powerful way to create Code Graphs. What would a typical Knowledge Graph schema for a Code Graph look like?
Below is a detailed schema that captures the essential elements of a typical Python codebase.
Entities
- Module
- Represents a high-level grouping of related code components.
- Attributes:
- name (String): The name of the module.
- path (String): The file path of the module.
- Class
- Represents a class in the code.
- Attributes:
- name (String): The name of the class.
- access_modifier (String): Access level (e.g., public, private).
- is_abstract (Boolean): Whether the class is abstract.
- documentation (String): Documentation or comments.
- Function
- Represents a function or method.
- Attributes:
- name (String): The name of the function.
- return_type (String): The return type of the function.
- access_modifier (String): Access level (e.g., public, private).
- documentation (String): Documentation or comments.
- complexity (Integer): Cyclomatic complexity of the function.
- lines_of_code (Integer): Number of lines of code.
- Argument
- Represents an argument to a function.
- Attributes:
- name (String): The name of the argument.
- type (String): The data type of the argument.
- default_value (String): Default value (if any).
- Variable
- Represents a variable within a function or class.
- Attributes:
- name (String): The name of the variable.
- type (String): The data type of the variable.
- initial_value (String): Initial value (if any).
- File
- Represents a file in the codebase.
- Attributes:
- name (String): The name of the file.
- path (String): The file path.
- size (Integer): Size of the file in bytes.
- modification_date (Date): Last modification date.
Relationships
- CONTAINS
- Represents containment relationships.
- From: Module, File
- To: Class, Function
- Example: (:Module)-[:CONTAINS]->(:Class), (:File)-[:CONTAINS]->(:Function)
- INHERITS_FROM
- Represents inheritance relationships between classes.
- From: Class
- To: Class
- Example: (:Class)-[:INHERITS_FROM]->(:Class)
- IMPLEMENTS
- Represents implementation of interfaces by classes.
- From: Class
- To: Class (Interface)
- Example: (:Class)-[:IMPLEMENTS]->(:Class {is_interface: true})
- CALLS
- Represents function call relationships.
- From: Function
- To: Function
- Example: (:Function)-[:CALLS]->(:Function)
- HAS_ARGUMENT
- Represents the relationship between functions and their arguments.
- From: Function
- To: Argument
- Example: (:Function)-[:HAS_ARGUMENT]->(:Argument)
- DECLARES
- Represents variable declaration within functions or classes.
- From: Function, Class
- To: Variable
- Example: (:Function)-[:DECLARES]->(:Variable)
- WRITTEN_IN
- Represents the programming language used for the code.
- From: File
- To: Language
- Example: (:File)-[:WRITTEN_IN]->(:Language)
- DEPENDS_ON
- Represents dependencies between modules, files, or classes.
- From: Module, File, Class
- To: Module, File, Class
- Example: (:Module)-[:DEPENDS_ON]->(:Module)
- DEFINED_IN
- Represents the location where a class or function is defined.
- From: Class, Function
- To: File
- Example: (:Class)-[:DEFINED_IN]->(:File)
Building the Code Graph
As we saw above, building a Code Graph can be complex. However, FalkorDB offers a straightforward and user-friendly solution for creating Code Graphs.
It provides a Python module for creating Code Graphs from a public Git repository and allows you to host it on an endpoint, enabling any developer to visualize and explore the repository’s codebase using a browser.
Here is how.
First, clone the FalkorDB Code Graph repository.
git clone https://github.com/FalkorDB/code-graph.git
npm install
docker run -p 6379:6379 -it --rm falkordb/falkordb
export OPENAI_API_KEY=YOUR_OPENAI_API_KEY
npm run dev
Below is the Code Graph of the ReactAgentWorker Class in LlamaIndex.
You can also ask questions about the code graph on the side panel, and it will respond back in natural language. This feature is extremely useful when navigating through the complex and vast codebase of a programming framework.
Using the FalkorDB Code Graph browser, you can explore complex codebases, ask questions, and receive insights in natural language, making navigation and understanding of large codebases more manageable and effective.
Future Work
The integration of LLMs with Knowledge Graphs like FalkorDB promises to revolutionize codebase visualization and comprehension. By combining the capabilities of LLMs and Knowledge Graphs, it will become commonplace for developers to interact with their Code Graphs using natural language queries, making complex code structures more accessible and easier to understand.
This can significantly drive up productivity by automating and streamlining the analysis of code relationships and dependencies. Eventually, this has the potential to transform software development processes and boost productivity.
To get started, visit the Code Graph GitHub repository, browse our documentation to learn more about FalkorDB Knowledge Graph, or simply visit our website to connect with us.