Graph matching

I have added graph support to hash-db. This post outlines how graph matching works.

We can match nodes in a graph by creating a matrix that represents those connections only then multiply that with a matrix that is the adjacency matrix. For example, if we want to execute the following cypher query:

match (start:Person)-[:FRIEND]->(end:Person), (start)-[:LIKES]->(post:Post), (end)-[:POSTED]->(post:Post), (post:Post)-[:REFERS]->(person:Person) return start, end, post, person

We need to pick a starting location, which I suspect will be (start:Person {‘name’: ‘Samuel’}), generate a FRIEND matrix with this start node masked.

    def create_matrix(self, adjacency_matrix, nodes):
        multiply = np.empty((adjacency_matrix.shape[0], 1))
        for node in nodes:
                multiply[node["position"]][0] = 1
        return multiply


# Then to search the matrix:
                        for match in matches:
                            node = match["to_node"]
                            multiply_matrix = self.create_matrix(adjacency_matrix, [node])
                            edges_from = np.matmul(adjacency_matrix, multiply_matrix)

                            for item in range(0, edges_from.shape[0]):

                                if edges_from[item][0] == 1:
                                    direction_index = "{}_{}".format(node["position"], self.nodes[item]["position"])

                                    if self.directions[relationship_name].get(direction_index, False) == True:
                                        inserted = True
                                        print("{} -{}-> {}".format(node["name"], relationship_name, self.nodes[item]["name"]))
                                        forward_relationship = {
                                            "relationship": relationship_name,
                                            "from_node": copy.deepcopy(node),
                                            "to_node": copy.deepcopy(self.nodes[item]),
                                            "source_relationship": match


How to decide which node to begin at is a general problem. We can start at multiple nodes at once. I am guessing I can use order within the query. Or toplogical sort the variables. We need to do an attribute match too. We’re saying we’re looking for a node that has an attribute of name and a value of Samuel.

This is how we create a mask matrix which we multiply our data by to do a breadth first search:

                multiply[node["position"]][0] = 1

How does the main loop work?

We loop over the query plan, which looks like this:

[{'attributes': {}, 'kind': 'match', 'label': 'Person', 'variable': 'start'},
 {'kind': 'relationship', 'name': 'FRIEND'},
 {'attributes': {}, 'kind': 'match', 'label': 'Person', 'variable': 'end'},
 {'kind': 'match', 'variable': 'start'},
 {'kind': 'relationship', 'name': 'LIKES'},
 {'attributes': {}, 'kind': 'match', 'label': 'Post', 'variable': 'post'},
 {'kind': 'match', 'variable': 'end'},
 {'kind': 'relationship', 'name': 'POSTED'},
 {'attributes': {}, 'kind': 'match', 'label': 'Post', 'variable': 'post'},
 {'attributes': {}, 'kind': 'match', 'label': 'Post', 'variable': 'post'},
 {'kind': 'relationship', 'name': 'REFERS'},
 {'attributes': {}, 'kind': 'match', 'label': 'Person', 'variable': 'person'}]

The match nodes handle data loading into and out of variables. When we encounter a relationship node ({“kind”: “relationship”}, we run a numpy matrix multiplication query with the previous relationships object “to_node”. So every query is relative to the previous set of results. This lets us walk the query graph with very little code.

From the very first match statement, we generate a single object that tracks all the outcomes from matching against that match. We accumulate query intermediary objects in an old_matches key. For every second match statement, we mark all the resulting node objects as having seen the variable, i.e, they are bound to the variable name.

At the end, we have an object with every part of the query evaluated and a history of relationships. We need to join the data produced. We do that by walking the history and seeing which nodes are assigned to which variables, then we accumulate them.

Now we have to join the data.

How to join data

Imagine, I encounter {‘kind’: ‘match’, ‘variable’: ‘end’} in the above query plan. This means, retrieve the data that was output by variable end. This is a list of relationships that looks like the following:

Here’s one relationship. Tasya-[:POSTED]->Thoughts.

 {'from_node': {'attributes': {'label': 'Person',
			'name': 'Tasya'},
			 'matches': 'end',
			 'name': 'Tasya',
			 'position': 1},
'planning_index': 8,
'relationship': 'POSTED',
'source_relationship': ...,
'to_node': {'attributes': {'label': 'Post',
                  'name': 'Thoughts'},
		   'matches': 'post',
		   'name': 'Thoughts',
		   'position': 9}}

Notice the from_node has matched: ‘end’ while the to_node has matched: ‘post’. This is because of the “return start, end, post, person in the original query. Each node gets marked when it is the output of a relationship search. The source relationship refers to the relationship that caused this relationship – the previous search results. From the source_relationship, we can get all resolved variables. The code to accumulate variables is a simple recursive function that accumulates variables into rows.

        def accumulate_variable(output_row, match, seenbefore):
            if "from_node" in match and "matches" in match["from_node"]:
                output_row[match["from_node"]["matches"]] = match["from_node"]
            if "matches" in match["to_node"]:
                output_row[match["to_node"]["matches"]] = match["to_node"]
            if "source_relationship" in match:
                accumulate_variable(output_row, match["source_relationship"], seenbefore)

        if parser["match"]:
            for relationship in relationships:
                for match in relationship["matches"]:
                    if "inserted" not in match:
                        output_row = {}
                        accumulate_variable(output_row, match, [])
                        yield output_row

This algorithm seems to give the correct results. I don’t know if I actually need to genuinely join (hash join) or not. Oh well.

Leave a Reply

Your email address will not be published. Required fields are marked *