Graph Databases from Scratch: Part 1
Graphs and Their Representations
This series of articles will guide you through understanding graph databases by creating a simple graph database for storing and searching social connections. In this section, we will first review the concept of graphs and the data structures utilized to represent them.
Graphs
A graph is a structure of objects (vertices) in which some pairs are connected (edges). These connections may be directed or undirected. For example, in social media, friends are undirected connections, while followers are directed. In this context, we concentrate on directed graphs as they can represent undirected connections by two edges in opposite directions.
The following figure is an example of a directed graph of four vertices:
An Example Graph
Mathematicians commonly use matrices to represent graphs. To illustrate, the graph depicted above can be expressed in matrix form, as
$$ \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} $$
where matrix’es rows and columns represent sources and destinations accordingly, and $1$’s denote connections. Although this representation is useful in maths, but it becomes impractical when the graph grows larger, as it quickly leads to enormous matrices that most of their elements are $0$’s.
Data Structures
A common way for graph representation is to express edges via a set of ordered pairs. The same graph in this representation can be described as
$$(A,B), (B,C), (B,D), (C,A), (D,C)$$
These pairs can be easily stored in a hash table using a convenient hash function, $h(s,d)$, where $s$ and $d$ are the source and destination vertices. It is also practical to use a fast key-value store, like Redis, to store edge pairs. Hash tables are very fast to look up, but lack the ability to list all connections originating from or leading to a vertex, that is crucial for implementing graph algorithms.
Another way is to group all destinations of a certain source into a subset:
$$(A,B), (B,(C,D)), (C,A), (D,C)$$
which can be further implemented by keeping a linked-list for all edges originating from each vertex: Spoiler Alert: We will use this method.
[A]: (A,B)
[B]: (A,C)-->(A,D)
[C]: (C,A)
[D]: (D,C)
This way, looking for connections is a little slower but all vertex’es connections can be iterated over fast and easily.
What’s Next
In the next section we are going to explore and design a file format to store and retrieve graphs.