Exploring Graph Relationships with SQL Server

Often in SQL there are graph like structures that with some unique queries can answer a new set of questions. This article details one approach to traversing these tables with a method similar to a traditional graph algorithm.

You may have heard of the SQL Server graph features added in the 2017 release. Those features are excellent on their own but require that the database is built using the graph table features provided by SQL Server. What if you have a graph like structure in an existing database that is not built using the new graph mechanics of SQL Server?

In this article we will look at a handful of queries to traverse a graph data structure with SQL, first using joins and then Recursive Common Table Expressions (CTEs).

Generating Data

To start we need to define a table that represents a graph and populate it with some example data. Our graph will be represented by a single table called CEdge which defines directional edges from an origin node N1 to a destination node N2.

CREATE TABLE #CEdge (
    N1 VARCHAR(128), 
    N2 VARCHAR(128)
);

-- DROP TABLE #CEdge;

Next we need to create some data to populate CEdge and test our solution with. We will use the small graph drawn below. Notice that this graph is directional even though the edges allow it to act as a bidirectional graph.

                                 ┌──┐
                            ┌───►│ 1│◄────────┐
                            │    └──┘         ▼
                            ▼               ┌──┐
                          ┌──┐              │12│
                 ┌───────►│11│              └──┘
                 │        └──┘                ▲
                 │          ▲                 │
      ┌──┐       │          │                 │    ┌──┐
      │ 5│       │          └────►┬──┐        └───►│ 4│
      └──┘       │                │ 6│             └──┘
        ▲        │                └──┘
        │  ┌──┐  │                  ▲
        └─►│ 9│  │        ┌──┐      │
           └──┘  │        │10│◄─────┘        ┌──┐
            ▲    ▼        └──┘            ┌─►│ 2│◄─┐
            │   ┌──┐       ▲              │  └──┘  │
            └──►│ 3│◄──────┘              ▼        ▼
                └──┘                    ┌──┐      ┌──┐
                                        │ 7│      │ 8│
                                        └──┘      └──┘

Each of the insert statements below mimic one direction of the edges described above. For example the first record below represents the edge going from node N11 to node N1.

INSERT INTO #CEdge (N1, N2) VALUES 
    ('N11', 'N1'),
    ('N12', 'N1'),
    ('N7', 'N2'),
    ('N8', 'N2'),
    ('N9', 'N3'),
    ('N10', 'N3'),
    ('N11', 'N3'),
    ('N12', 'N4'),
    ('N9', 'N5'),
    ('N10', 'N6'),
    ('N11', 'N6'); 

The inserts below are simply the opposite of the above making the representation a completely bidirectional graph.

INSERT INTO #CEdge (N1, N2) VALUES 
    ('N1', 'N11'),
    ('N1', 'N12'),
    ('N2', 'N7'),
    ('N2', 'N8'),
    ('N3', 'N9'),
    ('N3', 'N10'),
    ('N3', 'N11'),
    ('N4', 'N12'),
    ('N5', 'N9'),
    ('N6', 'N10'),
    ('N6', 'N11');

Now that we have a complete graph representation with some data available we can begin writing some queries to operate on and traverse the data.

Simple Traversals

To get our bearings let's look at a query that finds a node N3's direct neighbors:

SELECT [N1], [N2] FROM #CEdge [_ce] WHERE [_ce].[N1] = 'N3'

In the results below you can see that we have discovered all the direct neighbors of node N3: N9, N10 and N11.

| N1 | N2  |
|----|-----|
| N3 | N10 |
| N3 | N11 |
| N3 | N9  |

We can also find all the 2nd degree neighbors of a node with the below query. CEdge is joined to itself to somewhat manually construct the path using the join from a starting point [_ce].[N1] to its direct neighbors [_ce].[N2] and then from the list of 1st degree neighbors to each of their direct neighbors [_ce2].[N2] which are the 2nd degree neighbors of the starting node[_ce].[N1].

SELECT 
    [_ce].[N1] as [Start Node]
    , [_ce].[N2] AS [Intermediate 1st Degree Neighbor]
    , [_ce2].[N2] AS [Final 2nd Degree Neighbor] 
FROM #CEdge [_ce] 
JOIN #CEdge [_ce2] ON [_ce].[N2] = [_ce2].[N1] 
WHERE 
    [_ce].[N1] = 'N3'

The results of the query are below. The last column contains the 2nd degree neighbors of N3 and putting all three columns together, the full path can be identified. For example looking at row 3: N3 -> N11 -> N1.

| Start | Intermediate | Final 2nd Degree Neighbors |
|-------|--------------|----------------------------|
| N3    | N10          | N3                         |
| N3    | N10          | N6                         |
| N3    | N11          | N1                         |
| N3    | N11          | N3                         |
| N3    | N11          | N6                         |
| N3    | N9           | N3                         |
| N3    | N9           | N5                         |

This pattern works if you know the level of depth that you need to traverse the graph, but what if you don't know ahead of time how deep in the graph you will need to traverse to find an answer? A Recursive CTE! Below we will use a recursive CTE to traverse the graph to an arbitrary depth.

Getting Recursive

A CTE is a mechanism that provides a result set that can be queried. It is similar to a table variable but it can actually be referenced from within its own definition making it recursive. A Recursive CTE typically contains 2 parts: 1) the anchor query which seeds the data in the temporary result set and 2) the recursive query with a termination check that executes and appends to the result set until a specified condition is met.

The query below specifies a starting node @DiscoverScenarioFor which will be the starting position for the traversal. Then the CTE is declared and arbitrarily named Recurse. Inside the CTE statement are the anchor and recursive components of the query unioned together.

...

DECLARE @DiscoverScenarioFor VARCHAR(128);
SELECT @DiscoverScenarioFor = 'N5';

WITH Recurse AS (
    -- Anchor Query
    SELECT
        CAST(#CEdge.N1 AS VARCHAR(MAX)) AS Base
        , 1 AS TraversalDepth 
        , CAST(',' + N1 + ',' + N2 + ',' AS VARCHAR(MAX)) AS Discovered
        , N1
        , N2
    FROM #CEdge
    WHERE 
        #CEdge.N1 LIKE @DiscoverScenarioFor
    UNION ALL
    -- Recurive Query 
    SELECT 
        Recurse.Base 
        , Recurse.TraversalDepth + 1 AS TraversalDepth
        , CAST(Recurse.Discovered + #CEdge.N2 + ',' AS VARCHAR(MAX)) AS 
        Discovered
        , #CEdge.N1
        , #CEdge.N2
    FROM Recurse
    JOIN #CEdge ON Recurse.N2 = #CEdge.N1
    WHERE 
    	-- Termination Condition
        Recurse.Discovered NOT LIKE CAST('%,' + #CEdge.N2 + ',%' AS 
        VARCHAR(MAX)) 
), NormalizedResult AS (
    -- Data Normalization
    SELECT Base, N1 AS EntityNum FROM Recurse
    UNION
    SELECT Base, N2 AS EntityNum FROM Recurse
)

SELECT 
    *
FROM NormalizedResult
WHERE 
    NormalizedResult.Base LIKE @DiscoverScenarioFor
    AND EntityNum != @DiscoverScenarioFor
-- 0 is no max; 1 to 32767 limits to specified depth
OPTION (MAXRECURSION 100) 

The anchor query is essentially the base case of the a traditional recursive method. This query is executed first and populates the initial rows in the Recurse result set. It is these rows that the second (recursive) query will use as a starting point. To better understand let's strip down the CTE and only include the anchor query to get an idea of the output.

...

DECLARE @DiscoverScenarioFor VARCHAR(128);
SELECT @DiscoverScenarioFor = 'N5';

WITH Recurse AS (
    -- Anchor Query
    SELECT
        CAST(#CEdge.N1 AS VARCHAR(MAX)) AS Base
        , 1 AS TraversalDepth 
        , CAST(',' + N1 + ',' + N2 + ',' AS VARCHAR(MAX)) AS Discovered
        , N1
        , N2
    FROM #CEdge
    WHERE 
        #CEdge.N1 LIKE @DiscoverScenarioFor
)

SELECT 
    * 
FROM Recurse;

This query should yield the result set below and after the anchor has executed this is the result set stored in the Recurse CTE (Note: the select on the last few lines is returning the entire result set from the CTE). Looking at the graph, we can see the anchor has correctly found that the only direct neighbor N2 is N5. The Discovered column keeps a list of the nodes that have already been visited and TraversalDepth is the number of degrees removed from the starting node the node in column N2 is.

| Base | TraversalDepth | Discovered | N1 | N2 |
|------|----------------|------------|----|----|
| N5   | 1              | ,N5,N9,    | N5 | N9 |

So far, this is nothing more than an over complicated select statement. Adding the recursive component back in:

...

DECLARE @DiscoverScenarioFor VARCHAR(128);
SELECT @DiscoverScenarioFor = 'N5';

WITH Recurse AS (
    -- Anchor Query
    SELECT
        CAST(#CEdge.N1 AS VARCHAR(MAX)) AS Base
        , 1 AS TraversalDepth 
        , CAST(',' + N1 + ',' + N2 + ',' AS VARCHAR(MAX)) AS Discovered
        , N1
        , N2
    FROM #CEdge
    WHERE 
        #CEdge.N1 LIKE @DiscoverScenarioFor
    UNION ALL
    -- Recuvsive query
    SELECT 
        Recurse.Base 
        , Recurse.TraversalDepth + 1 AS TraversalDepth
        , CAST(Recurse.Discovered + #CEdge.N2 + ',' AS VARCHAR(MAX)) AS 
        Discovered
        , #CEdge.N1
        , #CEdge.N2
    FROM Recurse -- Note: CTE is selecting from its own result set
    JOIN #CEdge ON Recurse.N2 = #CEdge.N1
    WHERE 
        -- Termination Condition
        Recurse.Discovered NOT LIKE CAST('%,' + #CEdge.N2 + ',%' AS 
        VARCHAR(MAX)) 
)

SELECT 
    *
FROM Recurse
ORDER BY TraversalDepth ASC

The recursive query looks very similar to the anchor query with 2 distinct differences: 1) it is selecting from the Recurse result set and 2) there is a termination condition that uses the Discovered column to ensure nodes are visited only once.

| Base | TraversalDepth | Discovered                      | N1  | N2  |
|------|----------------|---------------------------------|-----|-----|
| N5   | 1              | ,N5,N9,                         | N5  | N9  |
| N5   | 2              | ,N5,N9,N3,                      | N9  | N3  |
| N5   | 3              | ,N5,N9,N3,N10,                  | N3  | N10 |
| N5   | 3              | ,N5,N9,N3,N11,                  | N3  | N11 |
| N5   | 4              | ,N5,N9,N3,N11,N1,               | N11 | N1  |
| N5   | 4              | ,N5,N9,N3,N11,N6,               | N11 | N6  |
| N5   | 4              | ,N5,N9,N3,N10,N6,               | N10 | N6  |
| N5   | 5              | ,N5,N9,N3,N10,N6,N11,           | N6  | N11 |
| N5   | 5              | ,N5,N9,N3,N11,N6,N10,           | N6  | N10 |
| N5   | 5              | ,N5,N9,N3,N11,N1,N12,           | N1  | N12 |
| N5   | 6              | ,N5,N9,N3,N11,N1,N12,N4,        | N12 | N4  |
| N5   | 6              | ,N5,N9,N3,N10,N6,N11,N1,        | N11 | N1  |
| N5   | 7              | ,N5,N9,N3,N10,N6,N11,N1,N12,    | N1  | N12 |
| N5   | 8              | ,N5,N9,N3,N10,N6,N11,N1,N12,N4, | N12 | N4  |

Looking closely at the result set we can see the first row is the result of our anchor query. Running the recursive query, the result set Recurse is joined to #CEdge to discover the neighbors of the unvisited nodes discovered in the previous recursion step which are the nodes in column N2. Using the TraversalDepth column and the idea that the recursive populates the Recurse result set each time it is executed we can quite easily trace the execution and of the traversal. The recursion will continue until there are no more undiscovered nodes; the termination condition defines this state causing the recursive query to return no new nodes to explore and the recursion to stop.

We can build on the the traversal to clean up or extend the results by chaining another CTE to the end of the statement.

...

DECLARE @DiscoverScenarioFor VARCHAR(128);
SELECT @DiscoverScenarioFor = 'N5';

WITH Recurse AS (
    -- Anchor Query
    SELECT
        CAST(#CEdge.N1 AS VARCHAR(MAX)) AS Base
        , 1 AS TraversalDepth 
        , CAST(',' + N1 + ',' + N2 + ',' AS VARCHAR(MAX)) AS Discovered
        , N1
        , N2
    FROM #CEdge
    WHERE 
        #CEdge.N1 LIKE @DiscoverScenarioFor
    UNION ALL
    -- Recuvsive query
    SELECT 
        Recurse.Base 
        , Recurse.TraversalDepth + 1 AS TraversalDepth
        , CAST(Recurse.Discovered + #CEdge.N2 + ',' AS VARCHAR(MAX)) AS 
        Discovered
        , #CEdge.N1
        , #CEdge.N2
    FROM Recurse -- Note: CTE is selecting from its own result set
    JOIN #CEdge ON Recurse.N2 = #CEdge.N1
    WHERE 
        -- Termination Condition
        Recurse.Discovered NOT LIKE CAST('%,' + #CEdge.N2 + ',%' AS 
        VARCHAR(MAX)) 
), NormalizedResult AS (
    -- Data Normalization
    SELECT Base, N1 AS EntityNum FROM Recurse
    UNION
    SELECT Base, N2 AS EntityNum FROM Recurse
)

SELECT 
    *
FROM NormalizedResult
WHERE 
    NormalizedResult.Base LIKE @DiscoverScenarioFor
    AND EntityNum != @DiscoverScenarioFor
-- 0 is no max, 1 to 32767 limits to specified depth
OPTION (MAXRECURSION 100); 

Adding the NormalizedResults CTE and filtering out the starting node we get a much more approachable graph traversal output showing all nodes that are reachable at any depth from the starting node:

| Base | EntityNum |
|------|-----------|
| N5   | N1        |
| N5   | N10       |
| N5   | N11       |
| N5   | N12       |
| N5   | N3        |
| N5   | N4        |
| N5   | N6        |
| N5   | N9        |

Note that in the query above, there is an option added to the last line OPTION (MAXRECURSION 100). By default SQL server limits the level of recursion depth to 100. This is for the safety of your DBA and SQL Server! Use it to your advantage. You may be questioning this given that the premise of using recursion was to remove the arbitrary limit from our simple traversals. The limit can be removed by setting the MAXRECURSION option to 0, just be careful. Alternatively, a responsible limit could be set if the dataset and desired outcome allow for it.

Wrapping Up

CTEs unlock a ton of expressiveness and potential in the T-SQL language. They should be used with care but provide a much needed capability to work with uniquely structured data and solve problems with a recursive bend that build on previous answers. Graphs are a perfect use-case, a great way to represent connections in data and with tools like CTEs and the new SQL Server graph features they are quite accessible - even in a relational database.


If you enjoyed this post consider signing up below to receive more like it in the future!

Subscribe to mtchl.dev

Don’t miss out on the latest issues. Sign up now to get access to the library of members-only issues.
jamie@example.com
Subscribe