SQL Recursive CTE queries in are commonly used when working with hierarchical data in Relational Databases. For e.g. the following Highway Network:
Let’s say we are interested in getting all the path from each Highway to other Highways. We would use a Recursive CTE for that. Let’s create some sample data and write a Recursive CTE Queries to extract all the PATHs.
WITHoutcome(starting,highway_to,highway_from,path) AS (--ANCHOR: Initialquerytoestablishthebaseresultset.SELECT--Anchorhighway_toas starting,highway_to,highway_from,highway_toas pathFROMflows_toWHEREhighway_fromisnullUNIONALL--RECURSIVE: ReferencesoutcomeCTEtobuildonpreviousresultsSELECTstarting,flows_to.highway_to,flows_to.highway_from,outcome.path||'->'||flows_to.highway_toFROMoutcomeJOINflows_toONflows_to.highway_from=outcome.highway_to)SELECTDISTINCTstarting,highway_to,pathFROMoutcome;
The above query will output:
Sample of the PATHs returned by Rercursive CTE
While the recursive CTE is not complex, it is not trivial to write.
Enter GQL.
Hierarchical data is a Graph. As such a Graph Query Language (GQL) may be a more elegant way to extract the above Paths. GQL, like SQL, is a high-level, declarative, programming language. It is specifically designed for graph structures. It uses nodes and relations as first-class citizens, although the output to a query can be a graph or a set of tuples.
Let’s explore using the same dataset and query using GQL. We will start with generating sample data
Next let’s write a GQL query to extract all the PATHs. Note that in the case of GQL, to compute and list the paths, it suffices to write:
MATCHnodes= (a)CALL{MATCHpath= (a)-[edges:flows_to]->{0,}(b)returnreduce(path=a._id,edgeinedges|path||'->'||edge._to) as route,b}returntable(a.name,b.name,route);
This will output
Sample of the PATHs returned by the GQL.
It can be seen that the structure of the GQL query is far less complicated and more intuitive than its SQL counterpart, since it takes advantage of the graph structure. In this case, a basic MATCH.. RETURN structure suffices to express a recursive query. This is mainly because GQL is developed as a query language for graphs and recursion is typical in these cases. The(a:highway)-[:flows_to]->{0,}(b:highway) path pattern selects all nodes b that are reachable by following one or more edges in the graph, traversing the graph using the :flows_to relation.
Path patterns describe multi-hop relationships in the graph: {2,4}: Exactly 2 to 4 hops {0,}: 0 or more hops (unbounded) {,3}: Up to 3 hops {5}: Exactly 5 hops
SQL CONNECT BY clause.
Some SQL based databases support CONNECT BY Clause. This clause allows queries on hierarchical data without the need for composing complex recursive CTEs. For e.g. the following CONNECT BY clause query will find all the paths from each to all others nodes
What is the most frequently bought item with Infant Formula and Infant Diapers? Association Rule Mining is a key to Market Basket Analysis. While Association Rule Mining is a complex topic, we can use SQL to figure out what items most frequently bought together.
An association rule is a implication expression of the form X → Y, where X and Y are itemset. X is the Antecedent and Y is the Consequent
Now let’s assume we take {Infant Diaper, Infant Milk} as the Antecedent, and we want to figure out the Consequent i.e. what ITEM is most often bought with Infant Diaper and Infant Milk.
Association Rule are, by definition, extracted from the data i.e. they are core properties of the Data. We are not looking for correlation in the Item sets in a Rule. This makes Graph Databases a good platform for Association Rule Mining. While there are lot of Python and R packages that are designed for Association Rule Mining, we want to start exploring the use of GQL to perform Exploratory Data Mining exercise.
We will start with the following sample Orders and the Items contained in those Orders
Let’s use GQL to query the Items that are bought with Infant Diaper and Infant Formula the most.
match (o:orders)-[:includes]->(third_item:item)CALL (third_item){match (o:orders)-[:includes]->(i1:item {_id:'infant_formula'}) , (o)-[:includes]->(i2:item {_id:'infant_diaper'}) , (o)-[:includes]->(third_item) return third_item, count(o) as counter }filter counter > 0return distinct table(third_item.description,counter)order by counter desc;
Notice the use of CALL statement. The CALL statement is used to invoke an inline procedure or subquery. An inline procedure is a user-defined procedure embedded within a query, commonly used to execute subqueries or perform data modifications. It enables complex logic such as looping. In short, CALL statement enables modular query design and the invocation of complex logic in a graph query.
In the above GQL query we first get a list of all items and then loop through each of them using the CALL statement to count the number of Orders where that item is included along with Infant Formula and Infant Diaper.
This query will output:
Based on the above query, we can extract the following Association Rule:
Both the CALL function and the GROUP BY will produce the same results in this case. However CALL function is more versatile allows for complex MATCH statement as part of the sub-query.
Often times we have to aggregate the weights along paths (edges) in a Graph. GQL provides a REDUCE function for this. We will use Ultimate Beneficial Ownership (UBO) graph to demonstrate this.
A beneficial owner is defined as a physical person “who directly or indirectly owns or controls a sufficient part of the shares or the voting rights, or who practices control through other means.”.
If a company is owned by another company (e.g. a holding company), it is the physical person(s) who ultimately own(s) the company who shall be registered as beneficial owner(s). Identifying Ultimate Beneficial Ownership (UBO) can be challenging due to complex and opaque ownership structures. But Property Graphs and GQL make this process easier.
Let’s take a look at the following Beneficial Ownership chart
Let’s create a the dataset based on the above Beneficial Ownership chart:
Now let’s see how the REDUCE function can be used to calculate the aggregate along the path. We will calculate the Uroosa’s total Ownership for the Hotel Kitchen Sink. Notice that Uroosa directly owns a share of the Hotel Kitchen Sink and also through a separate business entities (Acme Corp. and Fourth Coffee). We will need to account for both of these paths:
We can use the following GQL to Calculated the Ownership Percentage along each Path:
In the post titled, , we looked at how we can unnest / flatten a path in Graph using GQL. In this post we will look at the how to identify each match when there are multiple paths . Let’s start with our basic Social Network example.
We are interested in path(s) from Saqib to Uroosa. Notice that are two paths for that 1) Saqib -> Angela -> Scott -> Uroosa; 2) Saqib -> Fatima -> Eleni -> Uroosa. We are also interested in the exact nature of the relationship i.e. co-worker vs. friend.
Let’s start by create a Graph that represents that above relationships
When working with Property Graphs, resolving the data path aka. decomposing Graph path is key to understanding the path traversed by a MATCH query in GQL. To facilitate this unnesting / flattening of path data, GQL provides few options. We will look at two options and apply them to following Graph to unnest the SHORTEST path between Alice and Diana i.e. Alice -> Chalie -> Diana
Returning one or more paths during a graph traversal is a central feature of regular path queries in the GQL. To illustrate we will use the following example of a travel network:
We are interested in all the SHORTEST PATHS available from Bayreuth to Santiago.
This will return each of the path decomposed into the each hop (step) as following
In the above GQL Query a is matched to the city of Bayreuth, and b to Santiago city. The PATH(s) are all possible combinations of the edges between between a and b which minimize the number of hops. In our case 4 is the minimum number of hops and there 3 possible PATHs that qualify.
The workhorse of Graph Query Language (GQL) is pattern matching. GQL is oblivious to how a graph is stored. The table resulting from pattern matching is manipulated by a sequence of operators that modify it in an imperative style that that is referred to as “Linear Composition” or “Sequential Composition”. In “Linear Composition” we can simply add clauses to the already existing query which apply new operations to the result of already processed clauses.
In the following example we will demonstrate how a sequence of operators can continue after the RETURN clause.
Let’s start with a simple Graph.
Let’s say we interested in the nodes that have the longest path and then we want to find out all the paths between those two nodes. We can start with finding the longest path i.e. Saqib -> Angela -> Scott -> Uroosa and then use Linear Composition to add another Pattern Matching query to identify the all the paths between Saqib and Uroosa
Now let’s find the Longest path using PATTERN MATCHING
MATCHpath=ALL (a)-[links]->{1,}(b)RETURNa,b,PATH_LENGTH(path) as path_lengthORDERBYpath_lengthDESClimit1
This will result in:
Next let’s add use Linear Composition to add another Pattern Matching query to identify the all the paths between Saqib and Uroosa
MATCHpath=ALL (a)-[links]->{1,}(b)RETURNa,b,PATH_LENGTH(path) as path_length,pathORDERBYpath_lengthDESClimit1NEXTMATCHpath=ALL (a)-[links]->{1,}(b)RETURNtable(a.name,b.name,pedges(path));
This will result in
Observe that GQL processes results of pattern matching in sequential, or pipelined way where the output of each operation in a sequence serves as the input for the next operation. GQL standard refers to this as Linear Statement Composition or simple Linear Composition.