Scott A method for representing graphs as rooted trees for graph canonization (2019)


Abstract: Graphs increasingly stand out as an essential data structure in the field of data sciences. To study graphs, or sub-graphs, that characterize a set of observations, it is necessary to describe them formally, in order to characterize equivalence relations that make sense in the scope of the considered application domain. Hence we seek to define a canonical graph notation, so that two isomorphic (sub) graphs have the same canonical form. Such notation could subsequently be used to index and retrieve graphs or to embed them efficiently in some metric space. Sequential optimized algorithms solving this problem exist, but do not deal with labeled edges, a situation that occurs in important application domains such as chemistry. We present in this article a new algorithm based on graph rewriting that provides a general and complete solution to the graph canonization problem. Although not reported here, the formal proof of the validity of our algorithm has been established. This claim is clearly supported empirically by our experimentation on synthetic combinatorics as well as natural graphs. Furthermore, our algorithm supports distributed implementations, leading to efficient computing perspectives.

The general idea of Scott seems to be to first pick a unique root of the tree. The remaining vertices are then ordering into layers based on their distance from this root. Next all edges that prevent the graph from being a tree are removed using a set of 3 rules to rewrite the graph. And finally the tree representation is converted into a canonical string.

The main challenges of interest are the selection of the root node (which we need to be able to uniquely determine) and the order to use for the children of a node in the tree (this order needs to be fixed or we will get a different output string potentially).

The root node problem is solved by first using heuristics based on easy to compute graph node properties (e.g. max degree node). If the heuristics do not yield a unique root node, then all candidate root nodes get used to compute the trace string and the minimal trace is unique for the isomorphism class.

The order problem is resolved via a reference to a paper from 1986, which is unfortunately completely in French. The claim from the paper would be that a planar representation of a rooted tree admits an unambiguous canonical notation in the form of a sequence of words. With some additional extensions this is used to complete a trace function that can compress a sub-tree into a single vertex with a specific label. Using this trace function on the root node is then how the canonical representation of the tree is found.

Next follows the running example used for Scott, which computes the canonical string for an input graph by first computing the layers and then rewriting the graph as a tree:

Note: supporting directed graphs is an open research question in the original and follow up Scott paper.

As a perspective, a generalization of Scott to handle oriented graphs, typically used in the representation of social relations or dependencies, could be quite easily considered by adapting the set of productions used to transform a graph into an equivalent tree. Since a rooted tree with non-oriented edges is equivalent to the same rooted tree with oriented edges, the trace function once the tree is obtained would be usable without any modification. However, the validity of this extension remains to be asserted.