1.1 --- a/damecco.tex Sun Jul 23 00:00:00 2017 +0200
1.2 +++ b/damecco.tex Fri Sep 22 12:52:55 2017 +0200
1.3 @@ -122,9 +122,9 @@
1.4 This paper presents a largely improved version of the VF2 algorithm
1.5 for the \emph{Subgraph Isomorphism Problem}. The improvements are
1.6 twofold. Firstly, it is based on a new approach for determining the
1.7 - matching order of the nodes, and secondly, more efficient -
1.8 - nevertheless easier to compute - cutting rules significantly
1.9 - reducing the search space are applied.
1.10 + matching order of the nodes, and secondly, more efficient ---
1.11 + nevertheless easier to compute --- cutting rules are proposed. They
1.12 + together reduce the search space significantly.
1.13
1.14 In addition to the usual \emph{Subgraph Isomorphism Problem}, the paper also
1.15 presents specialized algorithms for the \emph{Induced Subgraph
1.16 @@ -162,19 +162,21 @@
1.17
1.18 In the last decades, combinatorial structures, and especially graphs
1.19 have been considered with ever increasing interest, and applied to the
1.20 -solution of several new and revised questions. The expressiveness,
1.21 -the simplicity and the studiedness of graphs make them practical for
1.22 -modelling and appear constantly in several seemingly independent
1.23 +solution of several new and revised questions. The expressiveness,
1.24 +the simplicity and the deep theoretical background
1.25 +of graphs make it one of the most useful
1.26 +modeling tool and appears constantly in several seemingly independent
1.27 fields, such as bioinformatics and chemistry.
1.28
1.29 Complex biological systems arise from the interaction and cooperation
1.30 -of plenty of molecular components. Getting acquainted with such
1.31 -systems at the molecular level is of primary importance, since
1.32 -protein-protein interaction, DNA-protein interaction, metabolic
1.33 -interaction, transcription factor binding, neuronal networks, and
1.34 -hormone signalling networks can be understood this way.
1.35 +of plenty of molecular components. Getting acquainted with the
1.36 +structure of such systems at the molecular level is of primary
1.37 +importance, since protein-protein interaction, DNA-protein
1.38 +interaction, metabolic interaction, transcription factor binding,
1.39 +neuronal networks, and hormone singling networks can be understood
1.40 +this way.
1.41
1.42 -Many chemical and biological structures can easily be modelled
1.43 +Many chemical and biological structures can easily be modeled
1.44 as graphs, for instance, a molecular structure can be
1.45 considered as a graph, whose nodes correspond to atoms and whose
1.46 edges to chemical bonds. The similarity and dissimilarity of
1.47 @@ -194,13 +196,13 @@
1.48 for various graph classes, like trees and planar
1.49 graphs~\cite{PlanarGraphIso}, bounded valence
1.50 graphs~\cite{BondedDegGraphIso}, interval graphs~\cite{IntervalGraphIso}
1.51 -or permutation graphs~\cite{PermGraphIso}. Furthermore, an FPT algorithm has also been presented for the coloured hypergraph isomorphism problem in~\cite{ColoredHiperGraphIso}.
1.52 +or permutation graphs~\cite{PermGraphIso}. Furthermore, an FPT algorithm has also been presented for the colored hypergraph isomorphism problem in~\cite{ColoredHiperGraphIso}.
1.53
1.54 -In the following, some algorithms based on other approaches are
1.55 -summarized, which do not need any restrictions on the graphs. Even though,
1.56 -an overall polynomial behaviour is not expectable from such an
1.57 -alternative, it may often have good practical performance, in fact,
1.58 -it might be the best choice even on a graph class for which polynomial
1.59 +In the following, some algorithms which do not need any restrictions on the graphs are
1.60 +summarized. Even though,
1.61 +an overall polynomial behavior is not expectable from such an
1.62 +alternative, they may often have good practical performance, in fact,
1.63 +they might be the best choice in practice even on a graph class for which polynomial
1.64 algorithm is known.
1.65
1.66 The first practically usable approach was due to
1.67 @@ -215,7 +217,7 @@
1.68 the binary Constraint Satisfaction Problem.
1.69
1.70 The \emph{Nauty} algorithm~\cite{Nauty} transforms the two graphs to
1.71 -a canonical form before starting to check for isomorphism. It has
1.72 +a canonical form before starting to look for an isomorphism. It has
1.73 been considered as one of the fastest graph isomorphism algorithms,
1.74 although graph categories were shown in which it takes exponentially
1.75 many steps. This algorithm handles only the graph isomorphism problem.
1.76 @@ -223,7 +225,7 @@
1.77 The \emph{LAD} algorithm~\cite{Lad} uses a depth-first search
1.78 strategy and formulates the matching as a Constraint Satisfaction
1.79 Problem to prune the search tree. The constraints are that the mapping
1.80 -has to be injective and edge-preserving, hence it is possible to
1.81 +has to be invective and edge-preserving, hence it is possible to
1.82 handle new matching types as well.
1.83
1.84 The \emph{RI} algorithm~\cite{RI} and its variations are based on a
1.85 @@ -234,20 +236,20 @@
1.86 Search in Biological Databases~\cite{Content}.
1.87
1.88 Currently, the most commonly used algorithm is the
1.89 -\emph{VF2}~\cite{VF2}, the improved version of \emph{VF}~\cite{VF}, which was
1.90 +\emph{VF2}~\cite{VF2}, an improved version of \emph{VF}~\cite{VF}, which was
1.91 designed for solving pattern matching and computer vision problems,
1.92 and has been one of the best overall algorithms for more than a
1.93 decade. Although, it is not as fast as some of the new specialized algorithms, it is still widely used due to its simplicity and space efficiency. VF2 uses
1.94 -a state space representation and checks some conditions in each state
1.95 +a state space representation and checks specific conditions in each state
1.96 to prune the search tree.
1.97
1.98 Meanwhile, another variant called \emph{VF2~Plus}~\cite{VF2Plus} has
1.99 been published. It is considered to be as efficient as the RI
1.100 -algorithm and has a strictly better behaviour on large graphs. The
1.101 -main idea of VF2~Plus is to precompute a heuristic node order of the graph to be embedded, in which VF2 works more efficiently.
1.102 +algorithm and has a strictly better behavior on large graphs. The
1.103 +main idea of VF2~Plus is to precompute a heuristic node order of the graph to be embedded, on which VF2 works more efficiently.
1.104
1.105 This paper introduces \emph{VF2++}, a new further improved algorithm
1.106 -for the graph and (induced) subgraph isomorphism problems, which uses
1.107 +for the graph and (induced) subgraph isomorphism problems. It is based on
1.108 efficient cutting rules and determines a node order in which VF2 runs
1.109 significantly faster on practical inputs.
1.110
1.111 @@ -256,14 +258,14 @@
1.112 solved, Section~\ref{sec:VF2Alg} provides a description of VF2. Based
1.113 on that, Section~\ref{sec:VF2ppAlg} introduces VF2++. Some technical
1.114 details necessary for an efficient implementation are discussed in
1.115 -Section~\ref{sec:VF2ppImpl}. Finally, Section~\ref{sec:ExpRes}
1.116 +Section~\ref{sec:VF2ppImpl}. Finally, Section~\ref{sec:ExpRes}
1.117 provides a detailed experimental evaluation of VF2++ and its comparison
1.118 to the state-of-the-art algorithm.
1.119
1.120 It must also be mentioned that the C++ implementations of the
1.121 algorithms have been made available for evaluation and use under an
1.122 -open-source license as a part of LEMON~\cite{LEMON} graph
1.123 -library.
1.124 +open-source license as a part of
1.125 +LEMON~\cite{o11:_lemon_open_sourc_c_graph_templ_librar} graph library.
1.126
1.127 \section{Problem Statement}\label{sec:ProbStat}
1.128 This section provides a formal description of the problems to be
1.129 @@ -280,7 +282,8 @@
1.130 \textbf{equivalent} if $\mathcal{L}(u)=\mathcal{L}(v)$.
1.131 \end{definition}
1.132
1.133 -For the sake of simplicity, in this paper the graph, subgraph and induced subgraph isomorphisms are defined in a more general way.
1.134 +For the sake of simplicity, the graph, subgraph and induced subgraph
1.135 +isomorphisms are defined in a more general way.
1.136
1.137 \begin{definition}\label{sec:ismorphic}
1.138 $G_{1}$ and $G_{2}$ are \textbf{isomorphic} (by the node label $\mathcal{L}$) if $\exists \mathfrak{m}:
1.139 @@ -329,18 +332,18 @@
1.140 The \textbf{graph isomorphism problem} can be defined as induced
1.141 subgraph matching problem where the sizes of the two graphs are equal.
1.142
1.143 -In addition, one may want to find a \textbf{single} embedding or \textbf{enumerate} all of them.
1.144 +In addition, one may either want to find a \textbf{single} embedding or \textbf{enumerate} all of them.
1.145
1.146 \section{The VF2 Algorithm}\label{sec:VF2Alg}
1.147 This algorithm is the basis of both the VF2++ and the VF2~Plus. VF2
1.148 is able to handle all the variations mentioned in Section~\ref{sec:CommProb}. Although it can also handle directed graphs,
1.149 -for the sake of simplicity, only the undirected case will be
1.150 +for the sake of simplicity, only the undirected case is
1.151 discussed.
1.152
1.153
1.154 \subsection{Common notations}
1.155 \indent Assume $G_{1}$ is searched in $G_{2}$. The following
1.156 -definitions and notations will be used throughout this paper.
1.157 +definitions and notations is used throughout this paper.
1.158 \begin{definition}
1.159 An injection $\mathfrak{m} : D \longrightarrow V_2$ is called (partial) \textbf{mapping}, where $D\subseteq V_1$.
1.160 \end{definition}
1.161 @@ -365,7 +368,7 @@
1.162 \begin{notation}
1.163 Throughout the paper, $\mathbf{PT}$ denotes a generic problem type
1.164 which can be substituted by any of the $\mathbf{SUB}$, $\mathbf{IND}$
1.165 -and $\mathbf{ISO}$ problems, which stand for the the problems mentioned in Section~\ref{sec:CommProb} respectively.
1.166 +and $\mathbf{ISO}$ problems, which stand for the problems mentioned in Section~\ref{sec:CommProb} respectively.
1.167 \end{notation}
1.168
1.169 \begin{definition}
1.170 @@ -435,7 +438,8 @@
1.171 $extend(\mathfrak{m},p)$. Otherwise, $extend(\mathfrak{m},p)$ is not consistent by $PT$, or it
1.172 can be proved that $\mathfrak{m}$ can not be extended to a whole mapping.
1.173
1.174 -In order to make sure of the correctness, see Claim~\ref{claim:consMapps}.
1.175 +The correctness of the procedure follows from the claim below.
1.176 +
1.177 \begin{claim}\label{claim:consMapps}
1.178 Through consistent mappings, only consistent whole mappings can be
1.179 reached, and all the consistent whole mappings are reachable through
1.180 @@ -545,14 +549,14 @@
1.181 as possible.
1.182
1.183 The main reason for the superiority of VF2++ over VF2 is twofold. Firstly,
1.184 -taking into account the structure and the node labelling of the graph,
1.185 +taking into account the structure and the node labeling of the graph,
1.186 VF2++ determines a matching order in which most of the unfruitful
1.187 branches of the search space can be pruned immediately. Secondly,
1.188 introducing more efficient --- nevertheless still easier to compute
1.189 --- cutting rules reduces the chance of going astray even further.
1.190
1.191 In addition to the usual subgraph isomorphism problem, specialized versions
1.192 -for induced subgraph and graph isomorphism problems have been
1.193 +for induced subgraph and graph isomorphism problems have also been
1.194 designed.
1.195
1.196 Note that a weaker version of the cutting rules of VF2++ and an efficient
1.197 @@ -575,16 +579,16 @@
1.198
1.199 The principal question is the following. Suppose a mapping $\mathfrak{m}$ is
1.200 given. For which node of $T_{1}(\mathfrak{m})$ is the hardest to find a
1.201 -consistent pair in $G_{2}$? The more covered neighbours a node in
1.202 +consistent pair in $G_{2}$? The more covered neighbors a node in
1.203 $T_{1}(\mathfrak{m})$ has --- i.e. the largest $Conn_{\mathfrak{D}(\mathfrak{m})}$ it has
1.204 ----, the more rarely satisfiable consistency constraints for its pair
1.205 +---, the more rare-to-satisfy consistency constraints for its pair
1.206 are given.
1.207
1.208 Most of the graphs of biological and chemical structures are sparse, thus several nodes in
1.209 $T_{1}(\mathfrak{m})$ may have the same $Conn_{\mathfrak{D}(\mathfrak{m})}$, which makes
1.210 reasonable to define a secondary and a tertiary order between them.
1.211 The observation above proves itself to be as determining, that the
1.212 -secondary ordering prefers nodes with the most uncovered neighbours
1.213 +secondary ordering prefers nodes with the most uncovered neighbors
1.214 among which have the same $Conn_{\mathfrak{D}(\mathfrak{m})}$ to increase
1.215 $Conn_{\mathfrak{D}(\mathfrak{m})}$ of uncovered nodes as much, as possible. The tertiary ordering prefers nodes having the rarest uncovered labels in $G_2$.
1.216
1.217 @@ -593,11 +597,13 @@
1.218
1.219 These rules can easily result in a matching order which contains the
1.220 nodes of a long path successively, whose nodes may have low $Conn$ and
1.221 -is easily matchable into $G_{2}$. To try to avoid that, a Breadth-first-search order is used, and on each of its levels, the ordering procedure described above is applied.
1.222 +is easily to match into $G_{2}$. To try to avoid that, a
1.223 +Breadth-First-Search order is used, and on each of its levels, the
1.224 +ordering procedure described above is applied.
1.225 \newline
1.226
1.227 -In the following, some examples on which the VF2 may be slow are
1.228 -described, although they are easily solvable by using a proper
1.229 +In the following, examples are shown, demonstrating that VF2 may be
1.230 +slow are, even though a matching can be found easily by using a proper
1.231 matching order.
1.232
1.233 \begin{example}
1.234 @@ -620,8 +626,8 @@
1.235 VF2 would check only in the last steps, whether $u$ can be matched to
1.236 $v$.
1.237
1.238 -However, had $u$ been the first matched node, u would have been
1.239 -matched immediately to v, so all the mappings would have been
1.240 +However, had $u$ been the first matched node, $u$ would have been
1.241 +matched immediately to $v$, so all the mappings would have been
1.242 precluded in which node labels can not correspond.
1.243 \end{example}
1.244
1.245 @@ -635,8 +641,8 @@
1.246 from $G_{1}$ and one of its starting points is connected to $u\in
1.247 V_{1}\}$.
1.248
1.249 -If unfortunately the nodes of the path were the first $k$ nodes in the
1.250 -matching order, the algorithm would iterate through all the possible k
1.251 +If, unfortunately, the nodes of the path were the first $k$ nodes in the
1.252 +matching order, the algorithm would iterate through all the possible $k$
1.253 long paths in $G_{2}$, and it would recognize that no path can be
1.254 extended to $G'_{1}$.
1.255 \newline
1.256 @@ -787,8 +793,8 @@
1.257 both stepping back to a predecessor state and exploring a successor
1.258 state.
1.259
1.260 -The necessary part of the candidate set is easily maintainable or
1.261 -computable by following
1.262 +The necessary part of the candidate set is easy to maintain or
1.263 +compute by following
1.264 the steps described in Section~\ref{candidateComputingVF2}. A much faster method
1.265 has been designed for biological and sparse graphs, see the next
1.266 section for details.
1.267 @@ -800,8 +806,8 @@
1.268 V_{2}$ is a potential pair of $u\in V_{1}$, then $\forall
1.269 u'\in \mathfrak{D}(\mathfrak{m}) : (u,u')\in
1.270 E_{1}\Rightarrow (v,\mathfrak{m}(u'))\in
1.271 -E_{2}$. That is, each covered neighbour of $u$ has to be mapped to
1.272 -a covered neighbour of $v$, i.e. selecting arbitrarily a covered neighbour $u'$ of $u$, all of the admissible candidates for $u$ are among the neighbours of $\mathfrak{m}(u')$.
1.273 +E_{2}$. That is, each covered neighbor of $u$ has to be mapped to
1.274 +a covered neighbor of $v$, i.e. selecting arbitrarily a covered neighbor $u'$ of $u$, all of the admissible candidates for $u$ are among the neighbors of $\mathfrak{m}(u')$.
1.275
1.276 Having said that, an algorithm running in $\Theta(\Delta_2)$ time is
1.277 describable if there exists a covered node in the component containing
1.278 @@ -823,18 +829,18 @@
1.279 size $|V_{1}|$, both the computation of the BFS tree, and processing its levels by Algorithm~\ref{alg:VF2PPProcess1} can be done in-place by swapping nodes.
1.280
1.281 \subsection{Cutting rules}
1.282 -In Section~\ref{VF2PPCuttingRules}, the cutting rules were
1.283 -described using the sets $T_{1}$, $T_{2}$, $\tilde T_{1}$
1.284 -and $\tilde T_{2}$, which are dependent on the current mapping. The aim is to check the labelled cutting
1.285 +Section~\ref{VF2PPCuttingRules} described the cutting rules
1.286 +using the sets $T_{1}$, $T_{2}$, $\tilde T_{1}$
1.287 +and $\tilde T_{2}$, which are dependent on the current mapping. The aim is to check the labeled cutting
1.288 rules of VF2++ in $\Theta(\Delta)$ time.
1.289
1.290 -Firstly, suppose that these four sets are given in such a way, that
1.291 +Firstly, suppose that these four sets are given a way, that
1.292 checking whether a node is in a certain set takes constant time,
1.293 e.g. they are given by their 0-1 characteristic vectors. Let $L$ be an
1.294 initially zero integer lookup table of size $|K|$. After incrementing
1.295 $L[\mathcal{L}(u')]$ for all $u'\in \Gamma_{1}(u) \cap T_{1}(\mathfrak{m})$ and
1.296 decrementing $L[\mathcal{L}(v')]$ for all $v'\in\Gamma_{2} (v) \cap
1.297 -T_{2}(\mathfrak{m})$, the first part of the cutting rules is checkable in
1.298 +T_{2}(\mathfrak{m})$, the first part of the cutting rules can be checked in
1.299 $\Theta(\Delta)$ time by considering the proper signs of $L$. Setting $L$
1.300 to zero takes $\Theta(\Delta)$ time again, which makes it possible to use
1.301 the same table through the whole algorithm. The second part of the
1.302 @@ -857,8 +863,12 @@
1.303 our experience, both algorithms run faster than VF2 with orders of
1.304 magnitude, thus its inclusion was not reasonable.
1.305
1.306 -The algorithms were implemented in C++ using the open-source
1.307 -LEMON graph and network optimization library~\cite{LEMON}. The tests were carried out on a Linux-based system with an Intel i7 X980 3.33 GHz CPU and 6 GB of RAM.
1.308 +The algorithms were implemented in C++ using the open-source LEMON
1.309 +graph and network optimization
1.310 +library~\cite{LEMON}\cite{o11:_lemon_open_sourc_c_graph_templ_librar}. The
1.311 +tests were carried out on a Linux-based system with an Intel i7 X980
1.312 +3.33 GHz CPU and 6 GB of RAM.
1.313 +
1.314 \subsection{Biological graphs}
1.315 The tests have been executed on a recent biological dataset created
1.316 for the International Contest on Pattern Search in Biological
1.317 @@ -874,7 +884,7 @@
1.318
1.319 In the following, both the induced subgraph and the graph
1.320 isomorphism problems will be examined.
1.321 -This dataset provides graph pairs, between which all the induced subgraph isomorphisms have to be found. For runtime results, please see Figure~\ref{fig:bioIND}.
1.322 +This dataset provides graph pairs, between which all the induced subgraph isomorphisms have to be found. For the running times, please see Figure~\ref{fig:bioIND}.
1.323
1.324 In an other experiment, the nodes of each graph in the database had been
1.325 shuffled, and an isomorphism between the shuffled and the original
1.326 @@ -918,7 +928,7 @@
1.327 \end{axis}
1.328 \end{tikzpicture}
1.329 \caption{On contact maps, VF2++ runs almost in constant time, while VF2
1.330 - Plus has a near-linear behaviour.} \label{fig:INDContact}
1.331 + Plus has a near-linear behavior.} \label{fig:INDContact}
1.332 \end{figure}
1.333 \end{subfigure}
1.334
1.335 @@ -1024,7 +1034,7 @@
1.336 To evaluate the efficiency of the algorithms in the case of graph
1.337 isomorphism problem, random connected graphs of less than 20 000 nodes have been
1.338 considered. Generating a random graph and shuffling its nodes, an
1.339 -isomorphism had to be found. Figure~\ref{fig:randISO} shows the runtime results
1.340 +isomorphism had to be found. Figure~\ref{fig:randISO} shows the running times
1.341 on graph sets of various density.
1.342
1.343
1.344 @@ -1348,7 +1358,7 @@
1.345 \end{figure}
1.346
1.347
1.348 -Based on these experiments, VF2++ is faster than VF2~Plus and able to
1.349 +As the experiments above demonstrates, VF2++ is faster than VF2~Plus and able to
1.350 handle really large graphs in milliseconds. Note that when $IND$ was
1.351 considered and the graph to be embedded had proportionally few nodes ($\rho =
1.352 0.05$, or $\rho = 0.1$), then VF2~Plus produced some inefficient node
1.353 @@ -1356,7 +1366,7 @@
1.354 Figure~\ref{fig:randIND10}). If these instances had been excluded, the
1.355 charts would have looked similarly to the other ones.
1.356 Unsurprisingly, as denser graphs are considered, both VF2++ and VF2
1.357 -Plus slow slightly down, but remain practically usable even on graphs
1.358 +Plus slow down slightly, but remain practically usable even on graphs
1.359 having 10 000 nodes.
1.360
1.361
1.362 @@ -1364,13 +1374,13 @@
1.363
1.364
1.365 \section{Conclusion}
1.366 -This paper presented VF2++, a new graph matching algorithm based on VF2, and analysed it from a practical viewpoint.
1.367 +This paper presented VF2++, a new graph matching algorithm based on VF2, and analyzed it from a practical point of view.
1.368
1.369 Recognizing the importance of the node order and determining an
1.370 efficient one, VF2++ is able to match graphs of thousands of nodes in
1.371 near practically linear time including preprocessing. In addition to
1.372 the proper order, VF2++ uses more efficient cutting
1.373 -rules which are easy to compute and make the algorithm able to prune
1.374 +rules, which are easy to compute and make the algorithm able to prune
1.375 most of the unfruitful branches without going astray.
1.376
1.377 In order to show the efficiency of the new method, it has been
1.378 @@ -1380,10 +1390,10 @@
1.379 biological graphs. It seems to be asymptotically faster on protein and
1.380 on contact map graphs in the case of induced subgraph isomorphism problem,
1.381 while in the case of graph isomorphism problem, it has definitely better
1.382 -asymptotic behaviour on protein graphs.
1.383 +asymptotic behavior on protein graphs.
1.384
1.385 Regarding random sparse graphs, not only has VF2++ proved itself to be
1.386 -faster than VF2~Plus, but it also has a practically linear behaviour both
1.387 +faster than VF2~Plus, but it also has a practically linear behavior both
1.388 in the case of induced subgraph and graph isomorphism problems.
1.389
1.390 %%%%%%%%%%%%%%%%
1.391 @@ -1425,3 +1435,9 @@
1.392 \endinput
1.393 %%
1.394 %% End of file `elsarticle-template-num.tex'.
1.395 +
1.396 +%% LocalWords: Subgraph Isomorphism datasets subgraphs subgraph FPT
1.397 +%% LocalWords: isomorphism hypergraph Ullmann Nauty precompute BFS
1.398 +%% LocalWords: bijection isomorphisms undirected infeasible dataset
1.399 +%% LocalWords: preprocessing incrementing decrementing neighbours
1.400 +%% LocalWords: expectable bioinformatics