diff -r 523fddfd7a01 -r 0ff72a828b16 damecco.tex --- a/damecco.tex Sun Jul 23 00:00:00 2017 +0200 +++ b/damecco.tex Fri Sep 22 12:52:55 2017 +0200 @@ -122,9 +122,9 @@ This paper presents a largely improved version of the VF2 algorithm for the \emph{Subgraph Isomorphism Problem}. The improvements are twofold. Firstly, it is based on a new approach for determining the - matching order of the nodes, and secondly, more efficient - - nevertheless easier to compute - cutting rules significantly - reducing the search space are applied. + matching order of the nodes, and secondly, more efficient --- + nevertheless easier to compute --- cutting rules are proposed. They + together reduce the search space significantly. In addition to the usual \emph{Subgraph Isomorphism Problem}, the paper also presents specialized algorithms for the \emph{Induced Subgraph @@ -162,19 +162,21 @@ In the last decades, combinatorial structures, and especially graphs have been considered with ever increasing interest, and applied to the -solution of several new and revised questions. The expressiveness, -the simplicity and the studiedness of graphs make them practical for -modelling and appear constantly in several seemingly independent +solution of several new and revised questions. The expressiveness, +the simplicity and the deep theoretical background +of graphs make it one of the most useful +modeling tool and appears constantly in several seemingly independent fields, such as bioinformatics and chemistry. Complex biological systems arise from the interaction and cooperation -of plenty of molecular components. Getting acquainted with such -systems at the molecular level is of primary importance, since -protein-protein interaction, DNA-protein interaction, metabolic -interaction, transcription factor binding, neuronal networks, and -hormone signalling networks can be understood this way. +of plenty of molecular components. Getting acquainted with the +structure of such systems at the molecular level is of primary +importance, since protein-protein interaction, DNA-protein +interaction, metabolic interaction, transcription factor binding, +neuronal networks, and hormone singling networks can be understood +this way. -Many chemical and biological structures can easily be modelled +Many chemical and biological structures can easily be modeled as graphs, for instance, a molecular structure can be considered as a graph, whose nodes correspond to atoms and whose edges to chemical bonds. The similarity and dissimilarity of @@ -194,13 +196,13 @@ for various graph classes, like trees and planar graphs~\cite{PlanarGraphIso}, bounded valence graphs~\cite{BondedDegGraphIso}, interval graphs~\cite{IntervalGraphIso} -or permutation graphs~\cite{PermGraphIso}. Furthermore, an FPT algorithm has also been presented for the coloured hypergraph isomorphism problem in~\cite{ColoredHiperGraphIso}. +or permutation graphs~\cite{PermGraphIso}. Furthermore, an FPT algorithm has also been presented for the colored hypergraph isomorphism problem in~\cite{ColoredHiperGraphIso}. -In the following, some algorithms based on other approaches are -summarized, which do not need any restrictions on the graphs. Even though, -an overall polynomial behaviour is not expectable from such an -alternative, it may often have good practical performance, in fact, -it might be the best choice even on a graph class for which polynomial +In the following, some algorithms which do not need any restrictions on the graphs are +summarized. Even though, +an overall polynomial behavior is not expectable from such an +alternative, they may often have good practical performance, in fact, +they might be the best choice in practice even on a graph class for which polynomial algorithm is known. The first practically usable approach was due to @@ -215,7 +217,7 @@ the binary Constraint Satisfaction Problem. The \emph{Nauty} algorithm~\cite{Nauty} transforms the two graphs to -a canonical form before starting to check for isomorphism. It has +a canonical form before starting to look for an isomorphism. It has been considered as one of the fastest graph isomorphism algorithms, although graph categories were shown in which it takes exponentially many steps. This algorithm handles only the graph isomorphism problem. @@ -223,7 +225,7 @@ The \emph{LAD} algorithm~\cite{Lad} uses a depth-first search strategy and formulates the matching as a Constraint Satisfaction Problem to prune the search tree. The constraints are that the mapping -has to be injective and edge-preserving, hence it is possible to +has to be invective and edge-preserving, hence it is possible to handle new matching types as well. The \emph{RI} algorithm~\cite{RI} and its variations are based on a @@ -234,20 +236,20 @@ Search in Biological Databases~\cite{Content}. Currently, the most commonly used algorithm is the -\emph{VF2}~\cite{VF2}, the improved version of \emph{VF}~\cite{VF}, which was +\emph{VF2}~\cite{VF2}, an improved version of \emph{VF}~\cite{VF}, which was designed for solving pattern matching and computer vision problems, and has been one of the best overall algorithms for more than a decade. Although, it is not as fast as some of the new specialized algorithms, it is still widely used due to its simplicity and space efficiency. VF2 uses -a state space representation and checks some conditions in each state +a state space representation and checks specific conditions in each state to prune the search tree. Meanwhile, another variant called \emph{VF2~Plus}~\cite{VF2Plus} has been published. It is considered to be as efficient as the RI -algorithm and has a strictly better behaviour on large graphs. The -main idea of VF2~Plus is to precompute a heuristic node order of the graph to be embedded, in which VF2 works more efficiently. +algorithm and has a strictly better behavior on large graphs. The +main idea of VF2~Plus is to precompute a heuristic node order of the graph to be embedded, on which VF2 works more efficiently. This paper introduces \emph{VF2++}, a new further improved algorithm -for the graph and (induced) subgraph isomorphism problems, which uses +for the graph and (induced) subgraph isomorphism problems. It is based on efficient cutting rules and determines a node order in which VF2 runs significantly faster on practical inputs. @@ -256,14 +258,14 @@ solved, Section~\ref{sec:VF2Alg} provides a description of VF2. Based on that, Section~\ref{sec:VF2ppAlg} introduces VF2++. Some technical details necessary for an efficient implementation are discussed in -Section~\ref{sec:VF2ppImpl}. Finally, Section~\ref{sec:ExpRes} +Section~\ref{sec:VF2ppImpl}. Finally, Section~\ref{sec:ExpRes} provides a detailed experimental evaluation of VF2++ and its comparison to the state-of-the-art algorithm. It must also be mentioned that the C++ implementations of the algorithms have been made available for evaluation and use under an -open-source license as a part of LEMON~\cite{LEMON} graph -library. +open-source license as a part of +LEMON~\cite{o11:_lemon_open_sourc_c_graph_templ_librar} graph library. \section{Problem Statement}\label{sec:ProbStat} This section provides a formal description of the problems to be @@ -280,7 +282,8 @@ \textbf{equivalent} if $\mathcal{L}(u)=\mathcal{L}(v)$. \end{definition} -For the sake of simplicity, in this paper the graph, subgraph and induced subgraph isomorphisms are defined in a more general way. +For the sake of simplicity, the graph, subgraph and induced subgraph +isomorphisms are defined in a more general way. \begin{definition}\label{sec:ismorphic} $G_{1}$ and $G_{2}$ are \textbf{isomorphic} (by the node label $\mathcal{L}$) if $\exists \mathfrak{m}: @@ -329,18 +332,18 @@ The \textbf{graph isomorphism problem} can be defined as induced subgraph matching problem where the sizes of the two graphs are equal. -In addition, one may want to find a \textbf{single} embedding or \textbf{enumerate} all of them. +In addition, one may either want to find a \textbf{single} embedding or \textbf{enumerate} all of them. \section{The VF2 Algorithm}\label{sec:VF2Alg} This algorithm is the basis of both the VF2++ and the VF2~Plus. VF2 is able to handle all the variations mentioned in Section~\ref{sec:CommProb}. Although it can also handle directed graphs, -for the sake of simplicity, only the undirected case will be +for the sake of simplicity, only the undirected case is discussed. \subsection{Common notations} \indent Assume $G_{1}$ is searched in $G_{2}$. The following -definitions and notations will be used throughout this paper. +definitions and notations is used throughout this paper. \begin{definition} An injection $\mathfrak{m} : D \longrightarrow V_2$ is called (partial) \textbf{mapping}, where $D\subseteq V_1$. \end{definition} @@ -365,7 +368,7 @@ \begin{notation} Throughout the paper, $\mathbf{PT}$ denotes a generic problem type which can be substituted by any of the $\mathbf{SUB}$, $\mathbf{IND}$ -and $\mathbf{ISO}$ problems, which stand for the the problems mentioned in Section~\ref{sec:CommProb} respectively. +and $\mathbf{ISO}$ problems, which stand for the problems mentioned in Section~\ref{sec:CommProb} respectively. \end{notation} \begin{definition} @@ -435,7 +438,8 @@ $extend(\mathfrak{m},p)$. Otherwise, $extend(\mathfrak{m},p)$ is not consistent by $PT$, or it can be proved that $\mathfrak{m}$ can not be extended to a whole mapping. -In order to make sure of the correctness, see Claim~\ref{claim:consMapps}. +The correctness of the procedure follows from the claim below. + \begin{claim}\label{claim:consMapps} Through consistent mappings, only consistent whole mappings can be reached, and all the consistent whole mappings are reachable through @@ -545,14 +549,14 @@ as possible. The main reason for the superiority of VF2++ over VF2 is twofold. Firstly, -taking into account the structure and the node labelling of the graph, +taking into account the structure and the node labeling of the graph, VF2++ determines a matching order in which most of the unfruitful branches of the search space can be pruned immediately. Secondly, introducing more efficient --- nevertheless still easier to compute --- cutting rules reduces the chance of going astray even further. In addition to the usual subgraph isomorphism problem, specialized versions -for induced subgraph and graph isomorphism problems have been +for induced subgraph and graph isomorphism problems have also been designed. Note that a weaker version of the cutting rules of VF2++ and an efficient @@ -575,16 +579,16 @@ The principal question is the following. Suppose a mapping $\mathfrak{m}$ is given. For which node of $T_{1}(\mathfrak{m})$ is the hardest to find a -consistent pair in $G_{2}$? The more covered neighbours a node in +consistent pair in $G_{2}$? The more covered neighbors a node in $T_{1}(\mathfrak{m})$ has --- i.e. the largest $Conn_{\mathfrak{D}(\mathfrak{m})}$ it has ----, the more rarely satisfiable consistency constraints for its pair +---, the more rare-to-satisfy consistency constraints for its pair are given. Most of the graphs of biological and chemical structures are sparse, thus several nodes in $T_{1}(\mathfrak{m})$ may have the same $Conn_{\mathfrak{D}(\mathfrak{m})}$, which makes reasonable to define a secondary and a tertiary order between them. The observation above proves itself to be as determining, that the -secondary ordering prefers nodes with the most uncovered neighbours +secondary ordering prefers nodes with the most uncovered neighbors among which have the same $Conn_{\mathfrak{D}(\mathfrak{m})}$ to increase $Conn_{\mathfrak{D}(\mathfrak{m})}$ of uncovered nodes as much, as possible. The tertiary ordering prefers nodes having the rarest uncovered labels in $G_2$. @@ -593,11 +597,13 @@ These rules can easily result in a matching order which contains the nodes of a long path successively, whose nodes may have low $Conn$ and -is easily matchable into $G_{2}$. To try to avoid that, a Breadth-first-search order is used, and on each of its levels, the ordering procedure described above is applied. +is easily to match into $G_{2}$. To try to avoid that, a +Breadth-First-Search order is used, and on each of its levels, the +ordering procedure described above is applied. \newline -In the following, some examples on which the VF2 may be slow are -described, although they are easily solvable by using a proper +In the following, examples are shown, demonstrating that VF2 may be +slow are, even though a matching can be found easily by using a proper matching order. \begin{example} @@ -620,8 +626,8 @@ VF2 would check only in the last steps, whether $u$ can be matched to $v$. -However, had $u$ been the first matched node, u would have been -matched immediately to v, so all the mappings would have been +However, had $u$ been the first matched node, $u$ would have been +matched immediately to $v$, so all the mappings would have been precluded in which node labels can not correspond. \end{example} @@ -635,8 +641,8 @@ from $G_{1}$ and one of its starting points is connected to $u\in V_{1}\}$. -If unfortunately the nodes of the path were the first $k$ nodes in the -matching order, the algorithm would iterate through all the possible k +If, unfortunately, the nodes of the path were the first $k$ nodes in the +matching order, the algorithm would iterate through all the possible $k$ long paths in $G_{2}$, and it would recognize that no path can be extended to $G'_{1}$. \newline @@ -787,8 +793,8 @@ both stepping back to a predecessor state and exploring a successor state. -The necessary part of the candidate set is easily maintainable or -computable by following +The necessary part of the candidate set is easy to maintain or +compute by following the steps described in Section~\ref{candidateComputingVF2}. A much faster method has been designed for biological and sparse graphs, see the next section for details. @@ -800,8 +806,8 @@ V_{2}$ is a potential pair of $u\in V_{1}$, then $\forall u'\in \mathfrak{D}(\mathfrak{m}) : (u,u')\in E_{1}\Rightarrow (v,\mathfrak{m}(u'))\in -E_{2}$. That is, each covered neighbour of $u$ has to be mapped to -a covered neighbour of $v$, i.e. selecting arbitrarily a covered neighbour $u'$ of $u$, all of the admissible candidates for $u$ are among the neighbours of $\mathfrak{m}(u')$. +E_{2}$. That is, each covered neighbor of $u$ has to be mapped to +a covered neighbor of $v$, i.e. selecting arbitrarily a covered neighbor $u'$ of $u$, all of the admissible candidates for $u$ are among the neighbors of $\mathfrak{m}(u')$. Having said that, an algorithm running in $\Theta(\Delta_2)$ time is describable if there exists a covered node in the component containing @@ -823,18 +829,18 @@ size $|V_{1}|$, both the computation of the BFS tree, and processing its levels by Algorithm~\ref{alg:VF2PPProcess1} can be done in-place by swapping nodes. \subsection{Cutting rules} -In Section~\ref{VF2PPCuttingRules}, the cutting rules were -described using the sets $T_{1}$, $T_{2}$, $\tilde T_{1}$ -and $\tilde T_{2}$, which are dependent on the current mapping. The aim is to check the labelled cutting +Section~\ref{VF2PPCuttingRules} described the cutting rules +using the sets $T_{1}$, $T_{2}$, $\tilde T_{1}$ +and $\tilde T_{2}$, which are dependent on the current mapping. The aim is to check the labeled cutting rules of VF2++ in $\Theta(\Delta)$ time. -Firstly, suppose that these four sets are given in such a way, that +Firstly, suppose that these four sets are given a way, that checking whether a node is in a certain set takes constant time, e.g. they are given by their 0-1 characteristic vectors. Let $L$ be an initially zero integer lookup table of size $|K|$. After incrementing $L[\mathcal{L}(u')]$ for all $u'\in \Gamma_{1}(u) \cap T_{1}(\mathfrak{m})$ and decrementing $L[\mathcal{L}(v')]$ for all $v'\in\Gamma_{2} (v) \cap -T_{2}(\mathfrak{m})$, the first part of the cutting rules is checkable in +T_{2}(\mathfrak{m})$, the first part of the cutting rules can be checked in $\Theta(\Delta)$ time by considering the proper signs of $L$. Setting $L$ to zero takes $\Theta(\Delta)$ time again, which makes it possible to use the same table through the whole algorithm. The second part of the @@ -857,8 +863,12 @@ our experience, both algorithms run faster than VF2 with orders of magnitude, thus its inclusion was not reasonable. -The algorithms were implemented in C++ using the open-source -LEMON graph and network optimization library~\cite{LEMON}. The tests were carried out on a Linux-based system with an Intel i7 X980 3.33 GHz CPU and 6 GB of RAM. +The algorithms were implemented in C++ using the open-source LEMON +graph and network optimization +library~\cite{LEMON}\cite{o11:_lemon_open_sourc_c_graph_templ_librar}. The +tests were carried out on a Linux-based system with an Intel i7 X980 +3.33 GHz CPU and 6 GB of RAM. + \subsection{Biological graphs} The tests have been executed on a recent biological dataset created for the International Contest on Pattern Search in Biological @@ -874,7 +884,7 @@ In the following, both the induced subgraph and the graph isomorphism problems will be examined. -This dataset provides graph pairs, between which all the induced subgraph isomorphisms have to be found. For runtime results, please see Figure~\ref{fig:bioIND}. +This dataset provides graph pairs, between which all the induced subgraph isomorphisms have to be found. For the running times, please see Figure~\ref{fig:bioIND}. In an other experiment, the nodes of each graph in the database had been shuffled, and an isomorphism between the shuffled and the original @@ -918,7 +928,7 @@ \end{axis} \end{tikzpicture} \caption{On contact maps, VF2++ runs almost in constant time, while VF2 - Plus has a near-linear behaviour.} \label{fig:INDContact} + Plus has a near-linear behavior.} \label{fig:INDContact} \end{figure} \end{subfigure} @@ -1024,7 +1034,7 @@ To evaluate the efficiency of the algorithms in the case of graph isomorphism problem, random connected graphs of less than 20 000 nodes have been considered. Generating a random graph and shuffling its nodes, an -isomorphism had to be found. Figure~\ref{fig:randISO} shows the runtime results +isomorphism had to be found. Figure~\ref{fig:randISO} shows the running times on graph sets of various density. @@ -1348,7 +1358,7 @@ \end{figure} -Based on these experiments, VF2++ is faster than VF2~Plus and able to +As the experiments above demonstrates, VF2++ is faster than VF2~Plus and able to handle really large graphs in milliseconds. Note that when $IND$ was considered and the graph to be embedded had proportionally few nodes ($\rho = 0.05$, or $\rho = 0.1$), then VF2~Plus produced some inefficient node @@ -1356,7 +1366,7 @@ Figure~\ref{fig:randIND10}). If these instances had been excluded, the charts would have looked similarly to the other ones. Unsurprisingly, as denser graphs are considered, both VF2++ and VF2 -Plus slow slightly down, but remain practically usable even on graphs +Plus slow down slightly, but remain practically usable even on graphs having 10 000 nodes. @@ -1364,13 +1374,13 @@ \section{Conclusion} -This paper presented VF2++, a new graph matching algorithm based on VF2, and analysed it from a practical viewpoint. +This paper presented VF2++, a new graph matching algorithm based on VF2, and analyzed it from a practical point of view. Recognizing the importance of the node order and determining an efficient one, VF2++ is able to match graphs of thousands of nodes in near practically linear time including preprocessing. In addition to the proper order, VF2++ uses more efficient cutting -rules which are easy to compute and make the algorithm able to prune +rules, which are easy to compute and make the algorithm able to prune most of the unfruitful branches without going astray. In order to show the efficiency of the new method, it has been @@ -1380,10 +1390,10 @@ biological graphs. It seems to be asymptotically faster on protein and on contact map graphs in the case of induced subgraph isomorphism problem, while in the case of graph isomorphism problem, it has definitely better -asymptotic behaviour on protein graphs. +asymptotic behavior on protein graphs. Regarding random sparse graphs, not only has VF2++ proved itself to be -faster than VF2~Plus, but it also has a practically linear behaviour both +faster than VF2~Plus, but it also has a practically linear behavior both in the case of induced subgraph and graph isomorphism problems. %%%%%%%%%%%%%%%% @@ -1425,3 +1435,9 @@ \endinput %% %% End of file `elsarticle-template-num.tex'. + +%% LocalWords: Subgraph Isomorphism datasets subgraphs subgraph FPT +%% LocalWords: isomorphism hypergraph Ullmann Nauty precompute BFS +%% LocalWords: bijection isomorphisms undirected infeasible dataset +%% LocalWords: preprocessing incrementing decrementing neighbours +%% LocalWords: expectable bioinformatics