1.1 --- a/bibliography.bib Mon Nov 28 20:18:13 2016 +0100
1.2 +++ b/bibliography.bib Wed Nov 30 06:21:19 2016 +0100
1.3 @@ -172,3 +172,11 @@
1.4 year = 1981,
1.5 month = "March"
1.6 }
1.7 +
1.8 +@Article{ColoredHiperGraphIso,
1.9 + author = "Arvind V., Das B., Köbler J. et al.",
1.10 + title = "Colored Hypergraph Isomorphism is Fixed Parameter Tractable",
1.11 + journal = "Algorithmica (2015) Volume 71, pp 120-138",
1.12 + year = 2015,
1.13 + month = "January"
1.14 +}
2.1 --- a/damecco.tex Mon Nov 28 20:18:13 2016 +0100
2.2 +++ b/damecco.tex Wed Nov 30 06:21:19 2016 +0100
2.3 @@ -123,19 +123,13 @@
2.4 fostered the design of various algorithms for handling special graph
2.5 structures.
2.6
2.7 -The idea of using state space representation and checking some
2.8 -conditions in each state to prune the search tree has made the VF2
2.9 -algorithm one of the state of the art graph matching algorithms for
2.10 -more than a decade. Recently, biological questions of ever increasing
2.11 -importance have required more efficient, specialized algorithms.
2.12 -
2.13 This paper presents VF2++, a new algorithm based on the original VF2,
2.14 which runs significantly faster on most test cases and performs
2.15 especially well on special graph classes stemming from biological
2.16 questions. VF2++ handles graphs of thousands of nodes in practically
2.17 near linear time including preprocessing. Not only is it an improved
2.18 version of VF2, but in fact, it is by far the fastest existing
2.19 -algorithm regarding biological graphs.
2.20 +algorithm especially on biological graphs.
2.21
2.22 The reason for VF2++' superiority over VF2 is twofold. Firstly, taking
2.23 into account the structure and the node labeling of the graph, VF2++
2.24 @@ -185,30 +179,26 @@
2.25 solution of several new and revised questions. The expressiveness,
2.26 the simplicity and the studiedness of graphs make them practical for
2.27 modelling and appear constantly in several seemingly independent
2.28 -fields. Bioinformatics and chemistry are amongst the most relevant
2.29 -and most important fields.
2.30 +fields, such as bioinformatics and chemistry.
2.31
2.32 Complex biological systems arise from the interaction and cooperation
2.33 of plenty of molecular components. Getting acquainted with such
2.34 -systems at the molecular level has primary importance, since
2.35 +systems at the molecular level is of primary importance, since
2.36 protein-protein interaction, DNA-protein interaction, metabolic
2.37 interaction, transcription factor binding, neuronal networks, and
2.38 -hormone signaling networks can be understood only this way.
2.39 +hormone signaling networks can be understood this way.
2.40
2.41 -For instance, a molecular structure can be considered as a graph,
2.42 -whose nodes correspond to atoms and whose edges to chemical bonds. The
2.43 -secondary structure of a protein can also be represented as a graph,
2.44 -where nodes are associated with aminoacids and the edges with hydrogen
2.45 -bonds. The nodes are often whole molecular components and the edges
2.46 -represent some relationships among them. The similarity and
2.47 -dissimilarity of objects corresponding to nodes are incorporated to
2.48 -the model by \emph{node labels}. Many other chemical and biological
2.49 -structures can easily be modeled in a similar way. Understanding such
2.50 -networks basically requires finding specific subgraphs, which can not
2.51 -avoid the application of graph matching algorithms.
2.52 +Many chemical and biological structures can easily be modeled
2.53 +as graphs, for instance, a molecular structure can be
2.54 +considered as a graph, whose nodes correspond to atoms and whose
2.55 +edges to chemical bonds. The similarity and dissimilarity of
2.56 +objects corresponding to nodes are incorporated to the model
2.57 +by \emph{node labels}. Understanding such networks basically
2.58 +requires finding specific subgraphs, thus calls for efficient
2.59 +graph matching algorithms.
2.60
2.61 -Finally, let some of the other real-world fields related to some
2.62 -variants of graph matching be briefly mentioned: pattern recognition
2.63 +Other real-world fields related to some
2.64 +variants of graph matching include pattern recognition
2.65 and machine vision \cite{HorstBunkeApplications}, symbol recognition
2.66 \cite{CordellaVentoSymbolRecognition}, face identification
2.67 \cite{JianzhuangYongFaceIdentification}. \\
2.68 @@ -220,18 +210,17 @@
2.69 for various graph classes, like trees and planar
2.70 graphs\cite{PlanarGraphIso}, bounded valence
2.71 graphs\cite{BondedDegGraphIso}, interval graphs\cite{IntervalGraphIso}
2.72 -or permutation graphs\cite{PermGraphIso}.
2.73 +or permutation graphs\cite{PermGraphIso}, and recently, an FPT algorithm has been presented for the coloured hypergraph isomorphism problem in \cite{ColoredHiperGraphIso}.
2.74
2.75 In the following, some algorithms based on other approaches are
2.76 -summarized, which do not need any restrictions on the graphs. However,
2.77 +summarized, which do not need any restrictions on the graphs. Even though,
2.78 an overall polynomial behaviour is not expectable from such an
2.79 -alternative, it may often have good performance, even on a graph class
2.80 -for which polynomial algorithm is known. Note that this summary
2.81 -containing only exact matching algorithms is far not complete, neither
2.82 -does it cover all the recent algorithms.
2.83 +alternative, it may often have good practical performance, in fact,
2.84 +it might be the best choice even on a graph class for which polynomial
2.85 +algorithm is known.
2.86
2.87 The first practically usable approach was due to
2.88 -Ullmann\cite{Ullmann} which is a commonly used depth-first
2.89 +\emph{Ullmann}\cite{Ullmann} which is a commonly used depth-first
2.90 search based algorithm with a complex heuristic for reducing the
2.91 number of visited states. A major problem is its $\Theta(n^3)$ space
2.92 complexity, which makes it impractical in the case of big sparse
2.93 @@ -241,7 +230,7 @@
2.94 improved version of this algorithm based on a bit-vector solution for
2.95 the binary Constraint Satisfaction Problem.
2.96
2.97 -The Nauty algorithm\cite{Nauty} transforms the two graphs to
2.98 +The \emph{Nauty} algorithm\cite{Nauty} transforms the two graphs to
2.99 a canonical form before starting to check for the isomorphism. It has
2.100 been considered as one of the fastest graph isomorphism algorithms,
2.101 although graph categories were shown in which it takes exponentially
2.102 @@ -253,7 +242,7 @@
2.103 has to be injective and edge-preserving, hence it is possible to
2.104 handle new matching types as well.
2.105
2.106 -The \textbf{RI} algorithm\cite{RI} and its variations are based on a
2.107 +The \emph{RI} algorithm\cite{RI} and its variations are based on a
2.108 state space representation. After reordering the nodes of the graphs,
2.109 it uses some fast executable heuristic checks without using any
2.110 complex pruning rules. It seems to run really efficiently on graphs
2.111 @@ -261,7 +250,7 @@
2.112 Search in Biological Databases\cite{Content}.
2.113
2.114 The currently most commonly used algorithm is the
2.115 -\textbf{VF2}\cite{VF2}, the improved version of VF\cite{VF}, which was
2.116 +\emph{VF2}\cite{VF2}, the improved version of \emph{VF}\cite{VF}, which was
2.117 designed for solving pattern matching and computer vision problems,
2.118 and has been one of the best overall algorithms for more than a
2.119 decade. Although, it can't be up to new specialized algorithms, it is
2.120 @@ -269,83 +258,72 @@
2.121 a state space representation and checks some conditions in each state
2.122 to prune the search tree.
2.123
2.124 -Our first graph matching algorithm was the first version of VF2 which
2.125 -recognizes the significance of the node ordering, more opportunities
2.126 -to increase the cutting efficiency and reduce its computational
2.127 -complexity. This project was initiated and sponsored by QuantumBio
2.128 -Inc.\cite{QUANTUMBIO} and the implementation --- along with a source
2.129 -code --- has been published as a part of LEMON\cite{LEMON} open source
2.130 -graph library.
2.131 -
2.132 -This paper introduces \textbf{VF2++}, a new further improved algorithm
2.133 -for the graph and (induced)subgraph isomorphism problem, which uses
2.134 -efficient cutting rules and determines a node order in which VF2 runs
2.135 -significantly faster on practical inputs.
2.136 -
2.137 -Meanwhile, another variant called \textbf{VF2 Plus}\cite{VF2Plus} has
2.138 +Meanwhile, another variant called \emph{VF2 Plus}\cite{VF2Plus} has
2.139 been published. It is considered to be as efficient as the RI
2.140 algorithm and has a strictly better behavior on large graphs. The
2.141 main idea of VF2 Plus is to precompute a heuristic node order of the
2.142 small graph, in which the VF2 works more efficiently.
2.143
2.144 +This paper introduces \emph{VF2++}, a new further improved algorithm
2.145 +for the graph and (induced)subgraph isomorphism problem, which uses
2.146 +efficient cutting rules and determines a node order in which VF2 runs
2.147 +significantly faster on practical inputs.
2.148 +
2.149 +This project was initiated and sponsored by QuantumBio
2.150 +Inc.\cite{QUANTUMBIO} and the implementation --- along with a source
2.151 +code --- has been published as a part of LEMON\cite{LEMON} open source
2.152 +graph library.
2.153 +
2.154 \section{Problem Statement}
2.155 -This section provides a detailed description of the problems to be
2.156 +This section provides a formal description of the problems to be
2.157 solved.
2.158 \subsection{Definitions}
2.159
2.160 -Throughout the paper $G_{small}=(V_{small}, E_{small})$ and
2.161 -$G_{large}=(V_{large}, E_{large})$ denote two undirected graphs.
2.162 +Throughout the paper $G_{1}=(V_{1}, E_{1})$ and
2.163 +$G_{2}=(V_{2}, E_{2})$ denote two undirected graphs.
2.164 +
2.165 +\begin{definition}
2.166 +$\mathcal{L}: (V_{1}\cup V_{2}) \longrightarrow K$ is a \textbf{node
2.167 + label function}, where K is an arbitrary set. The elements in K
2.168 + are the \textbf{node labels}. Two nodes, u and v are said to be
2.169 + \textbf{equivalent} if $\mathcal{L}(u)=\mathcal{L}(v)$.
2.170 +\end{definition}
2.171 +
2.172 +For the sake of simplicity, in this paper the graph, subgraph and induced subgraph isomorphisms are defined in a more general way.
2.173 +
2.174 \begin{definition}\label{sec:ismorphic}
2.175 -$G_{small}$ and $G_{large}$ are \textbf{isomorphic} if $\exists M:
2.176 - V_{small} \longrightarrow V_{large}$ bijection, for which the
2.177 +$G_{1}$ and $G_{2}$ are \textbf{isomorphic} (by the node label $\mathcal{L}$) if $\exists \mathfrak{m}:
2.178 + V_{1} \longrightarrow V_{2}$ bijection, for which the
2.179 following is true:
2.180 \begin{center}
2.181 -$\forall u,v\in{V_{small}} : (u,v)\in{E_{small}} \Leftrightarrow
2.182 - (M(u),M(v))\in{E_{large}}$
2.183 +$\forall u\in{V_{1}} : \mathcal{L}(u)=\mathcal{L}(\mathfrak{m}(u))$ and
2.184 +$\forall u,v\in{V_{1}} : (u,v)\in{E_{1}} \Leftrightarrow (\mathfrak{m}(u),\mathfrak{m}(v))\in{E_{2}}$
2.185 \end{center}
2.186 \end{definition}
2.187 -For the sake of simplicity in this paper subgraphs and induced
2.188 -subgraphs are defined in a more general way than usual:
2.189 +
2.190 \begin{definition}
2.191 -$G_{small}$ is a \textbf{subgraph} of $G_{large}$ if $\exists I:
2.192 - V_{small}\longrightarrow V_{large}$ injection, for which the
2.193 +$G_{1}$ is a \textbf{subgraph} of $G_{2}$ (by the node label $\mathcal{L}$) if $\exists \mathfrak{m}:
2.194 + V_{1}\longrightarrow V_{2}$ injection, for which the
2.195 following is true:
2.196 \begin{center}
2.197 -$\forall u,v \in{V_{small}} : (u,v)\in{E_{small}} \Rightarrow (I(u),I(v))\in E_{large}$
2.198 +$\forall u\in{V_{1}} : \mathcal{L}(u)=\mathcal{L}(\mathfrak{m}(u))$ and
2.199 +$\mathbb{M}$
2.200 +$\forall u,v \in{V_{1}} : (u,v)\in{E_{1}} \Rightarrow (\mathfrak{m}(u),\mathfrak{m}(v))\in E_{2}$
2.201 \end{center}
2.202 \end{definition}
2.203
2.204 \begin{definition}
2.205 -$G_{small}$ is an \textbf{induced subgraph} of $G_{large}$ if $\exists
2.206 - I: V_{small}\longrightarrow V_{large}$ injection, for which the
2.207 +$G_{1}$ is an \textbf{induced subgraph} of $G_{2}$ (by the node label $\mathcal{L}$) if $\exists
2.208 + \mathfrak{m}: V_{1}\longrightarrow V_{2}$ injection, for which the
2.209 following is true:
2.210 \begin{center}
2.211 -$\forall u,v \in{V_{small}} : (u,v)\in{E_{small}} \Leftrightarrow
2.212 - (I(u),I(v))\in E_{large}$
2.213 +$\forall u\in{V_{1}} : \mathcal{L}(u)=\mathcal{L}(\mathfrak{m}(u))$ and
2.214 +
2.215 +$\forall u,v \in{V_{1}} : (u,v)\in{E_{1}} \Leftrightarrow
2.216 + (\mathfrak{m}(u),\mathfrak{m}(v))\in E_{2}$
2.217 \end{center}
2.218 \end{definition}
2.219
2.220 -\begin{definition}
2.221 -$lab: (V_{small}\cup V_{large}) \longrightarrow K$ is a \textbf{node
2.222 - label function}, where K is an arbitrary set. The elements in K
2.223 - are the \textbf{node labels}. Two nodes, u and v are said to be
2.224 - \textbf{equivalent} if $lab(u)=lab(v)$.
2.225 -\end{definition}
2.226 -
2.227 -When node labels are given, the matched nodes must have the same
2.228 -labels. For example, the node labeled isomorphism is phrased by
2.229 -\begin{definition}
2.230 -$G_{small}$ and $G_{large}$ are \textbf{isomorphic by the node label
2.231 - function lab} if $\exists M: V_{small} \longrightarrow V_{large}$
2.232 - bijection, for which the following is true:
2.233 -\begin{center}
2.234 -$(\forall u,v\in{V_{small}} : (u,v)\in{E_{small}} \Leftrightarrow
2.235 - (M(u),M(v))\in{E_{large}})$ and $(\forall u\in{V_{small}} :
2.236 - lab(u)=lab(M(u)))$
2.237 -\end{center}
2.238 -\end{definition}
2.239 -
2.240 -Note that he other two definitions can be extended in the same way.
2.241
2.242 \subsection{Common problems}\label{sec:CommProb}
2.243
2.244 @@ -354,7 +332,7 @@
2.245 problems also appear in many applications.
2.246
2.247 The \textbf{subgraph matching problem} is the following: is
2.248 -$G_{small}$ isomorphic to any subgraph of $G_{large}$ by a given node
2.249 +$G_{1}$ isomorphic to any subgraph of $G_{2}$ by a given node
2.250 label?
2.251
2.252 The \textbf{induced subgraph matching problem} asks the same about the
2.253 @@ -382,77 +360,57 @@
2.254
2.255
2.256 \subsection{Common notations}
2.257 -\indent Assume $G_{small}$ is searched in $G_{large}$. The following
2.258 +\indent Assume $G_{1}$ is searched in $G_{2}$. The following
2.259 definitions and notations will be used throughout the whole paper.
2.260 \begin{definition}
2.261 -A set $M\subseteq V_{small}\times V_{large}$ is called
2.262 -\textbf{mapping} if no node of $V_{small}\cup V_{large}$ appears
2.263 -in more than one pair in M. That is, M uniquely associates some of
2.264 -the nodes in $V_{small}$ with some nodes of $V_{large}$ and vice
2.265 -versa.
2.266 +An injection $\mathfrak{m} : D \longrightarrow V_2$ is called (partial) \textbf{mapping}, where $D\subseteq V_1$.
2.267 +\end{definition}
2.268 +
2.269 +\begin{notation}
2.270 +$\mathfrak{D}(f)$ and $\mathfrak{R}(f)$ denote the domain and the range of a function $f$, respectively.
2.271 +\end{notation}
2.272 +
2.273 +\begin{definition}
2.274 +Mapping $\mathfrak{m}$ \textbf{covers} a node $u\in V_1\cup V_2$ if $u\in \mathfrak{D}(\mathfrak{m})\cup \mathfrak{R}(\mathfrak{m})$.
2.275 \end{definition}
2.276
2.277 \begin{definition}
2.278 -Mapping $M$ \textbf{covers} a node v if there exists a pair in M, which
2.279 -contains v.
2.280 +A mapping $\mathfrak{m}$ is $\mathbf{whole\ mapping}$ if $\mathfrak{m}$ covers all the
2.281 +nodes of $V_{1}$, i.e. $\mathfrak{D}(\mathfrak{m})=V_1$.
2.282 \end{definition}
2.283
2.284 \begin{definition}
2.285 -A mapping $M$ is $\mathbf{whole\ mapping}$ if $M$ covers all the
2.286 -nodes in $V_{small}$.
2.287 +Let \textbf{extend}$(\mathfrak{m},(u,v))$ denote the function $f : \mathfrak{D}(\mathfrak{m})\cup\{u\}\longrightarrow\mathfrak{R}(\mathfrak{m})\cup\{v\}$, for which $\forall w\in \mathfrak{D}(\mathfrak{m}) : \mathfrak{m}(w)=f(w)$ and $f(u)=v$ holds. Where $u\notin\mathfrak{D}(\mathfrak{m})$ and $v\notin\mathfrak{R}(\mathfrak{m})$, otherwise $extend(\mathfrak{m},(u,v))$ is undefined.
2.288 \end{definition}
2.289
2.290 \begin{notation}
2.291 -Let $\mathbf{M_{small}(s)} := \{u\in V_{small} : \exists v\in
2.292 -V_{large}: (u,v)\in M(s)\}$ and $\mathbf{M_{large}(s)} := \{v\in
2.293 -V_{large} : \exists u\in V_{small}: (u,v)\in M(s)\}$.
2.294 +Throughout the paper, $\mathbf{PT}$ denotes a generic problem type
2.295 +which can be substituted by any of the $\mathbf{ISO}$, $\mathbf{SUB}$
2.296 +and $\mathbf{IND}$ problems.
2.297 \end{notation}
2.298
2.299 -\begin{notation}
2.300 -Let $\mathbf{Pair(M,v)}$ be the pair of $v$ in $M$ if such a node
2.301 -exist, otherwise $\mathbf{Pair(M,v)}$ is undefined, where $M$ is a mapping and $v\in V_{small}\cup V_{large}$.
2.302 -\end{notation}
2.303 -
2.304 -Note that if $\mathbf{Pair(M,v)}$ exists, then it is unique
2.305 -
2.306 -The definitions of the isomorphism types can be rephrased on the
2.307 -existence of a special whole mapping $M$, since it represents a
2.308 -bijection. For example
2.309 -\begin{center}
2.310 -$M\subseteq V_{small}\times V_{large}$ represents an induced subgraph
2.311 - isomorphism $\Leftrightarrow$ $M$ is whole mapping and $\forall u,v
2.312 - \in{V_{small}} : (u,v)\in{E_{small}} \Leftrightarrow
2.313 - (Pair(M,u),Pair(M,v))\in E_{large}$.
2.314 -\end{center}
2.315 -
2.316 -Throughout the paper, $\mathbf{PT}$ denotes a generic problem type
2.317 -which can be substituted by any of $\mathbf{ISO}$, $\mathbf{SUB}$
2.318 -and $\mathbf{IND}$.
2.319 -
2.320 \begin{definition}
2.321 -Let M be a mapping. A logical function $\mathbf{Cons_{PT}}$ is a
2.322 +Let $\mathfrak{m}$ be a mapping. A logical function $\mathbf{Cons_{PT}}$ is a
2.323 \textbf{consistency function by } $\mathbf{PT}$ if the following
2.324 -holds. If there exists whole mapping $W$ of type $PT$ for which
2.325 -$M\subseteq W$, then $Cons_{PT}(M)$ is true.
2.326 +holds. If there exists a whole mapping $w$ satisfying the requirements of $PT$, for which $\mathfrak{m}$ is exactly $w$ restricted to $\mathfrak{D}(\mathfrak{m})$.
2.327 \end{definition}
2.328
2.329 \begin{definition}
2.330 -Let M be a mapping. A logical function $\mathbf{Cut_{PT}}$ is a
2.331 +Let $\mathfrak{m}$ be a mapping. A logical function $\mathbf{Cut_{PT}}$ is a
2.332 \textbf{cutting function by } $\mathbf{PT}$ if the following
2.333 -holds. $\mathbf{Cut_{PT}(M)}$ is false if $M$ can be extended to a
2.334 -whole mapping $W$ of type $PT$.
2.335 +holds. $\mathbf{Cut_{PT}(\mathfrak{m})}$ is false if there exists a sequence of extend operations, which results in a whole mapping satisfying the requirements of $PT$.
2.336 \end{definition}
2.337
2.338 \begin{definition}
2.339 -$M$ is said to be \textbf{consistent mapping by} $\mathbf{PT}$ if
2.340 - $Cons_{PT}(M)$ is true.
2.341 +$\mathfrak{m}$ is said to be \textbf{consistent mapping by} $\mathbf{PT}$ if
2.342 + $Cons_{PT}(\mathfrak{m})$ is true.
2.343 \end{definition}
2.344
2.345 $Cons_{PT}$ and $Cut_{PT}$ will often be used in the following form.
2.346 \begin{notation}
2.347 -Let $\mathbf{Cons_{PT}(p, M)}:=Cons_{PT}(M\cup\{p\})$, and
2.348 -$\mathbf{Cut_{PT}(p, M)}:=Cut_{PT}(M\cup\{p\})$, where
2.349 -$p\in{V_{small}\!\times\!V_{large}}$ and $M\cup\{p\}$ is mapping.
2.350 +Let $\mathbf{Cons_{PT}(p, \mathfrak{m})}:=Cons_{PT}(extend(\mathfrak{m},p))$, and
2.351 +$\mathbf{Cut_{PT}(p, \mathfrak{m})}:=Cut_{PT}(extend(\mathfrak{m},p))$, where
2.352 +$p\in{V_{1}\backslash\mathfrak{D}(\mathfrak{m}) \!\times\!V_{2}\backslash\mathfrak{R}(\mathfrak{m})}$.
2.353 \end{notation}
2.354
2.355 $Cons_{PT}$ will be used to check the consistency of the already
2.356 @@ -462,11 +420,13 @@
2.357 \subsection{Overview of the algorithm}
2.358 VF2 uses a state space representation of mappings, $Cons_{PT}$ for
2.359 excluding inconsistency with the problem type and $Cut_{PT}$ for
2.360 -pruning the search tree. Each state $s$ of the matching process can
2.361 -be associated with a mapping $M(s)$.
2.362 +pruning the search tree.
2.363
2.364 Algorithm~\ref{alg:VF2Pseu} is a high level description of
2.365 -the VF2 matching algorithm.
2.366 +the VF2 matching algorithm. Each state of the matching process can
2.367 +be associated with a mapping $\mathfrak{m}$. The initial state
2.368 +is associated with a mapping $\mathfrak{m}$, for which
2.369 +$\mathfrak{D}(\mathfrak{m})=\emptyset$, i.e. it starts with an empty mapping.
2.370
2.371
2.372 \begin{algorithm}
2.373 @@ -476,132 +436,120 @@
2.374 \caption{\hspace{0.5cm}$A\ high\ level\ description\ of\ VF2$}\label{alg:VF2Pseu}
2.375 \begin{algorithmic}[1]
2.376
2.377 -\Procedure{VF2}{State $s$, ProblemType $PT$} \If{$M(s$) covers
2.378 - $V_{small}$} \State Output($M(s)$) \Else
2.379 -
2.380 - \State Compute the set $P(s)$ of the pairs candidate for inclusion
2.381 - in $M(s)$ \ForAll{$p\in{P(s)}$} \If{Cons$_{PT}$($p, M(s)$) $\wedge$
2.382 - $\neg$Cut$_{PT}$($p, M(s)$)} \State Compute the nascent state
2.383 - $\tilde{s}$ by adding $p$ to $M(s)$ \State \textbf{call}
2.384 - VF2($\tilde{s}$, $PT$) \EndIf \EndFor \EndIf \EndProcedure
2.385 +\Procedure{VF2}{Mapping $\mathfrak{m}$, ProblemType $PT$}
2.386 + \If{$\mathfrak{m}$ covers
2.387 + $V_{1}$} \State Output($\mathfrak{m}$)
2.388 + \Else
2.389 + \State Compute the set $P_\mathfrak{m}$ of the pairs candidate for inclusion
2.390 + in $\mathfrak{m}$ \ForAll{$p\in{P_\mathfrak{m}}$} \If{Cons$_{PT}$($p,\mathfrak{m}$) $\wedge$
2.391 + $\neg$Cut$_{PT}$($p,\mathfrak{m}$)}
2.392 + \State \textbf{call}
2.393 + VF2($extend(\mathfrak{m},p)$, $PT$) \EndIf \EndFor \EndIf \EndProcedure
2.394 \end{algorithmic}
2.395 \end{algorithm}
2.396
2.397
2.398 -The initial state $s_0$ is associated with $M(s_0)=\emptyset$, i.e. it
2.399 -starts with an empty mapping.
2.400 +For the current mapping $\mathfrak{m}$, the algorithm computes $P_\mathfrak{m}$, the set of
2.401 +candidate node pairs for adding to the current mapping $\mathfrak{m}_s$.
2.402
2.403 -For each state $s$, the algorithm computes $P(s)$, the set of
2.404 -candidate node pairs for adding to the current state $s$.
2.405 -
2.406 -For each pair $p$ in $P(s)$, $Cons_{PT}(p,M(s))$ and
2.407 -$Cut_{PT}(p,M(s))$ are evaluated. If $Cons_{PT}(p,M(s))$ is true and
2.408 -$Cut_{PT}(p,M(s))$ is false, the successor state $\tilde{s}=s\cup
2.409 -\{p\}$ is computed, and the whole process is recursively applied to
2.410 -$\tilde{s}$. Otherwise, $\tilde{s}$ is not consistent by $PT$, or it
2.411 -can be proved that $s$ can not be extended to a whole mapping.
2.412 +For each pair $p$ in $P_\mathfrak{m}$, $Cons_{PT}(p,\mathfrak{m})$ and
2.413 +$Cut_{PT}(p,\mathfrak{m})$ are evaluated. If the former is true and
2.414 +the latter is false, the whole process is recursively applied to
2.415 +$extend(\mathfrak{m},p)$. Otherwise, $extend(\mathfrak{m},p)$ is not consistent by $PT$, or it
2.416 +can be proved that $\mathfrak{m}$ can not be extended to a whole mapping.
2.417
2.418 In order to make sure of the correctness, see
2.419 \begin{claim}
2.420 Through consistent mappings, only consistent whole mappings can be
2.421 -reached, and all of the whole mappings are reachable through
2.422 +reached, and all the consistent whole mappings are reachable through
2.423 consistent mappings.
2.424 \end{claim}
2.425
2.426 -Note that a state may be reached in exponentially many different ways, since the
2.427 -order of insertions into $M$ does not influence the nascent mapping.
2.428 +Note that a mapping may be reached in exponentially many different ways, since the
2.429 +order of extensions does not influence the nascent mapping.
2.430
2.431 However, one may observe
2.432
2.433 \begin{claim}
2.434 \label{claim:claimTotOrd}
2.435 -Let $\prec$ an arbitrary total ordering relation on $V_{small}$. If
2.436 -the algorithm ignores each $p=(u,v) \in P(s)$, for which
2.437 +Let $\prec$ be an arbitrary total ordering relation on $V_{1}$. If
2.438 +the algorithm ignores each $p=(u,v) \in P_\mathfrak{m}$, for which
2.439 \begin{center}
2.440 -$\exists (\tilde{u},\tilde{v})\in P(s): \tilde{u} \prec u$,
2.441 +$\exists (\tilde{u},\tilde{v})\in P_\mathfrak{m}: \tilde{u} \prec u$,
2.442 \end{center}
2.443 -then no state can be reached more than once, and each state associated
2.444 -with a whole mapping remains reachable.
2.445 +then no mapping can be reached more than once, and each whole mapping remains reachable.
2.446 \end{claim}
2.447
2.448 Note that the cornerstone of the improvements to VF2 is a proper
2.449 choice of a total ordering.
2.450
2.451 -\subsection{The candidate set P(s)}
2.452 +\subsection{The candidate set}
2.453 \label{candidateComputingVF2}
2.454 -Let $P(s)$ be the set of the candidate pairs for inclusion in $M(s)$.
2.455 +Let $P_\mathfrak{m}$ be the set of the candidate pairs for inclusion in $\mathfrak{m}$.
2.456
2.457 \begin{notation}
2.458 -Let $\mathbf{T_{small}(s)}:=\{u \in V_{small} : u$ is not covered by
2.459 -$M(s)\wedge\exists \tilde{u}\in{V_{small}: (u,\tilde{u})\in E_{small}}
2.460 -\wedge \tilde{u}$ is covered by $M(s)\}$, and
2.461 -\\ $\mathbf{T_{large}(s)}\!:=\!\{v \in\!V_{large}\!:\!v$ is not
2.462 -covered by
2.463 -$M(s)\wedge\!\exists\tilde{v}\!\in\!{V_{large}\!:\!(v,\tilde{v})\in\!E_{large}}
2.464 -\wedge \tilde{v}$ is covered by $M(s)\}$
2.465 +Let $\mathbf{T_{1}(\mathfrak{m})}:=\{u \in V_{1}\backslash\mathfrak{D}(\mathfrak{m}) : \exists \tilde{u}\in{\mathfrak{D}(\mathfrak{m}): (u,\tilde{u})\in E_{1}}\}$, and
2.466 + $\mathbf{T_{2}(\mathfrak{m})} := \{v \in V_{2}\backslash\mathfrak{R}(\mathfrak{m}) : \exists\tilde{v}\in{\mathfrak{R}(\mathfrak{m}):(v,\tilde{v})\in E_{2}}\}$.
2.467 \end{notation}
2.468
2.469 -The set $P(s)$ includes the pairs of uncovered neighbours of covered
2.470 +The set $P_\mathfrak{m}$ includes the pairs of uncovered neighbours of covered
2.471 nodes, and if there is not such a node pair, all the pairs containing
2.472 two uncovered nodes are added. Formally, let
2.473 \[
2.474 - P(s)\!=\!
2.475 + P_\mathfrak{m}\!=\!
2.476 \begin{cases}
2.477 - T_{small}(s)\times T_{large}(s)&\hspace{-0.15cm}\text{if }
2.478 - T_{small}(s)\!\neq\!\emptyset\!\wedge\!T_{large}(s)\!\neq
2.479 - \emptyset,\\ (V_{small}\!\setminus\!M_{small}(s))\!\times\!(V_{large}\!\setminus\!M_{large}(s))
2.480 - &\hspace{-0.15cm}otherwise.
2.481 + T_{1}(\mathfrak{m})\times T_{2}(\mathfrak{m})&\hspace{-0.15cm}\text{if }
2.482 + T_{1}(\mathfrak{m})\!\neq\!\emptyset\ \text{and }T_{2}(\mathfrak{m})\!\neq
2.483 + \emptyset,\\ (V_{1}\!\setminus\!\mathfrak{D}(\mathfrak{m}))\!\times\!(V_{2}\!\setminus\!\mathfrak{R}(\mathfrak{m}))
2.484 + &\hspace{-0.15cm}\text{otherwise}.
2.485 \end{cases}
2.486 \]
2.487
2.488 \subsection{Consistency}
2.489 -Suppose $p=(u,v)$, where $u\in V_{small}$ and $v\in V_{large}$, $s$ is
2.490 -a state of the matching procedure, $M(s)$ is consistent mapping by
2.491 -$PT$ and $lab(u)=lab(v)$. $Cons_{PT}(p,M(s))$ checks whether
2.492 -including pair $p$ into $M(s)$ leads to a consistent mapping by $PT$.
2.493 +Suppose $p=(u,v)$, where $u\in V_{1}$ and $v\in V_{2}$, $\mathfrak{m}$ is a consistent mapping by
2.494 +$PT$. $Cons_{PT}(p,\mathfrak{m})$ checks whether
2.495 +including pair $p$ into $\mathfrak{m}$ leads to a consistent mapping by $PT$.
2.496
2.497 For example, the consistency function of induced subgraph isomorphism is as follows.
2.498 \begin{notation}
2.499 -Let $\mathbf{\Gamma_{small} (u)}:=\{\tilde{u}\in V_{small} :
2.500 -(u,\tilde{u})\in E_{small}\}$, and $\mathbf{\Gamma_{large}
2.501 - (v)}:=\{\tilde{v}\in V_{large} : (v,\tilde{v})\in E_{large}\}$, where $u\in V_{small}$ and $v\in V_{large}$.
2.502 +Let $\mathbf{\Gamma_{1} (u)}:=\{\tilde{u}\in V_{1} :
2.503 +(u,\tilde{u})\in E_{1}\}$, and $\mathbf{\Gamma_{2}
2.504 + (v)}:=\{\tilde{v}\in V_{2} : (v,\tilde{v})\in E_{2}\}$, where $u\in V_{1}$ and $v\in V_{2}$.
2.505 \end{notation}
2.506
2.507 -$M(s)\cup \{(u,v)\}$ is a consistent mapping by $IND$ $\Leftrightarrow
2.508 -(\forall \tilde{u}\in M_{small}: (u,\tilde{u})\in E_{small}
2.509 -\Leftrightarrow (v,Pair(M(s),\tilde{u}))\in E_{large})$. The
2.510 +$extend(\mathfrak{m},(u,v))$ is a consistent mapping by $IND$ $\Leftrightarrow
2.511 +(\forall \tilde{u}\in \mathfrak{D}(\mathfrak{m}): (u,\tilde{u})\in E_{1}
2.512 +\Leftrightarrow (v,\mathfrak{m}(\tilde{u}))\in E_{2})$. The
2.513 following formulation gives an efficient way of calculating
2.514 $Cons_{IND}$.
2.515 \begin{claim}
2.516 -$Cons_{IND}((u,v),M(s)):=(\forall \tilde{v}\in \Gamma_{large}(v)
2.517 - \ \cap\ M_{large}(s):\\(Pair(M(s),\tilde{v}),u)\in E_{small})\wedge
2.518 - (\forall \tilde{u}\in \Gamma_{small}(u)
2.519 - \ \cap\ M_{small}(s):(v,Pair(M(s),\tilde{u}))\in E_{large})$ is a
2.520 +$Cons_{IND}((u,v),\mathfrak{m}):=\mathcal{L}(u)\!\!=\!\!\mathcal{L}(v)\wedge(\forall \tilde{v}\in \Gamma_{2}(v)\cap\mathfrak{R}(\mathfrak{m}):(u,\mathfrak{m}^{-1}(\tilde{v}))\in E_{1})\wedge
2.521 + (\forall \tilde{u}\in \Gamma_{1}(u)
2.522 + \cap \mathfrak{D}(\mathfrak{m}):(v,\mathfrak{m}(\tilde{u}))\in E_{2})$ is a
2.523 consistency function in the case of $IND$.
2.524 \end{claim}
2.525
2.526 \subsection{Cutting rules}
2.527 -$Cut_{PT}(p,M(s))$ is defined by a collection of efficiently
2.528 -verifiable conditions. The requirement is that $Cut_{PT}(p,M(s))$ can
2.529 -be true only if it is impossible to extended $M(s)\cup \{p\}$ to a
2.530 +$Cut_{PT}(p,\mathfrak{m})$ is defined by a collection of efficiently
2.531 +verifiable conditions. The requirement is that $Cut_{PT}(p,\mathfrak{m})$ can
2.532 +be true only if it is impossible to extend $extend(\mathfrak{m},p)$ to a
2.533 whole mapping.
2.534
2.535 As an example, the cutting function of induced subgraph isomorphism is presented.
2.536 \begin{notation}
2.537 -Let $\mathbf{\tilde{T}_{small}}(s):=(V_{small}\backslash
2.538 -M_{small}(s))\backslash T_{small}(s)$, and
2.539 -\\ $\mathbf{\tilde{T}_{large}}(s):=(V_{large}\backslash
2.540 -M_{large}(s))\backslash T_{large}(s)$.
2.541 +Let $\mathbf{\tilde{T}_{1}}(\mathfrak{m}):=(V_{1}\backslash
2.542 +\mathfrak{D}(\mathfrak{m}))\backslash T_{1}(\mathfrak{m})$, and
2.543 +\\ $\mathbf{\tilde{T}_{2}}(\mathfrak{m}):=(V_{2}\backslash
2.544 +\mathfrak{R}(\mathfrak{m}))\backslash T_{2}(\mathfrak{m})$.
2.545 \end{notation}
2.546
2.547 \begin{claim}
2.548 -$Cut_{IND}((u,v),M(s)):= |\Gamma_{large} (v)\ \cap\ T_{large}(s)| <
2.549 - |\Gamma_{small} (u)\ \cap\ T_{small}(s)| \vee |\Gamma_{large}(v)\cap
2.550 - \tilde{T}_{large}(s)| < |\Gamma_{small}(u)\cap
2.551 - \tilde{T}_{small}(s)|$ is a cutting function by $IND$.
2.552 +$Cut_{IND}((u,v),\mathfrak{m}):= |\Gamma_{2} (v)\ \cap\ T_{2}(\mathfrak{m})| <
2.553 + |\Gamma_{1} (u)\ \cap\ T_{1}(\mathfrak{m})| \vee |\Gamma_{2}(v)\cap
2.554 + \tilde{T}_{2}(\mathfrak{m})| < |\Gamma_{1}(u)\cap
2.555 + \tilde{T}_{1}(\mathfrak{m})|$ is a cutting function by $IND$.
2.556 \end{claim}
2.557
2.558 -
2.559 \section{The VF2++ Algorithm}
2.560 Although any total ordering relation makes the search space of VF2 a
2.561 tree, its choice turns out to dramatically influence the number of
2.562 @@ -625,25 +573,116 @@
2.563 candidate set calculating were described in \cite{VF2Plus}, too.
2.564
2.565 It should be noted that all the methods described in this section are
2.566 -extendable to handle directed graphs and edge labels as well.
2.567 +extendable to handle directed graphs and edge labels as well.\newline
2.568
2.569 The basic ideas and the detailed description of VF2++ are provided in
2.570 the following.
2.571
2.572 +The goal is to find a matching order in which the algorithm is able to
2.573 +recognize inconsistency or prune the infeasible branches on the
2.574 +highest levels and goes deep only if it is needed.
2.575 +
2.576 +\begin{notation}
2.577 +Let $\mathbf{Conn_{H}(u)}:=|\Gamma_{1}(u)\cap H\}|$, that is the
2.578 +number of neighbours of u which are in H, where $u\in V_{1} $ and
2.579 +$H\subseteq V_{1}$.
2.580 +\end{notation}
2.581 +
2.582 +The principal question is the following. Suppose a mapping $\mathfrak{m}$ is
2.583 +given. For which node of $T_{1}(\mathfrak{m})$ is the hardest to find a
2.584 +consistent pair in $G_{2}$? The more covered neighbours a node in
2.585 +$T_{1}(\mathfrak{m})$ has --- i.e. the largest $Conn_{\mathfrak{D}(\mathfrak{m})}$ it has
2.586 +---, the more rarely satisfiable consistency constraints for its pair
2.587 +are given.
2.588 +
2.589 +In biology, most of the graphs are sparse, thus several nodes in
2.590 +$T_{1}(\mathfrak{m})$ may have the same $Conn_{\mathfrak{D}(\mathfrak{m})}$, which makes
2.591 +reasonable to define a secondary and a tertiary order between them.
2.592 +The observation above proves itself to be as determining, that the
2.593 +secondary ordering prefers nodes with the most uncovered neighbours
2.594 +among which have the same $Conn_{\mathfrak{D}(\mathfrak{m})}$ to increase
2.595 +$Conn_{\mathfrak{D}(\mathfrak{m})}$ of uncovered nodes so much, as possible. The
2.596 +tertiary ordering prefers nodes having the rarest uncovered labels.
2.597 +
2.598 +Note that the secondary ordering is the same as the ordering by $deg$,
2.599 +which is a static data in front of the above used.
2.600 +
2.601 +These rules can easily result in a matching order which contains the
2.602 +nodes of a long path successively, whose nodes may have low $Conn$ and
2.603 +is easily matchable into $G_{2}$. To avoid that, a BFS order is
2.604 +used, which provides the shortest possible paths.
2.605 +\newline
2.606 +
2.607 +In the following, some examples on which the VF2 may be slow are
2.608 +described, although they are easily solvable by using a proper
2.609 +matching order.
2.610 +
2.611 +\begin{example}
2.612 +Suppose $G_{1}$ can be mapped into $G_{2}$ in many ways
2.613 +without node labels. Let $u\in V_{1}$ and $v\in V_{2}$.
2.614 +\newline
2.615 +$\mathcal{L}(u):=black$
2.616 +\newline
2.617 +$\mathcal{L}(v):=black$
2.618 +\newline
2.619 +$\mathcal{L}(\tilde{u}):=red \ \forall \tilde{u}\in (V_{1}\backslash
2.620 +\{u\})$
2.621 +\newline
2.622 +$\mathcal{L}(\tilde{v}):=red \ \forall \tilde{v}\in (V_{2}\backslash
2.623 +\{v\})$
2.624 +\newline
2.625 +
2.626 +Now, any mapping by $\mathcal{L}$ must contain $(u,v)$, since
2.627 +$u$ is black and no node in $V_{2}$ has a black label except
2.628 +$v$. If unfortunately $u$ were the last node which will get covered,
2.629 +VF2 would check only in the last steps, whether $u$ can be matched to
2.630 +$v$.
2.631 +\newline
2.632 +However, had $u$ been the first matched node, u would have been
2.633 +matched immediately to v, so all the mappings would have been
2.634 +precluded in which node labels can not correspond.
2.635 +\end{example}
2.636 +
2.637 +\begin{example}
2.638 +Suppose there is no node label given, $G_{1}$ is a small graph and
2.639 +can not be mapped into $G_{2}$ and $u\in V_{1}$.
2.640 +\newline
2.641 +Let $G'_{1}:=(V_{1}\cup
2.642 +\{u'_{1},u'_{2},..,u'_{k}\},E_{1}\cup
2.643 +\{(u,u'_{1}),(u'_{1},u'_{2}),..,(u'_{k-1},u'_{k})\})$, that is,
2.644 +$G'_{1}$ is $G_{1}\cup \{ a\ k$ long path, which is disjoint
2.645 +from $G_{1}$ and one of its starting points is connected to $u\in
2.646 +V_{1}\}$.
2.647 +\newline
2.648 +Is there a subgraph of $G_{2}$, which is isomorph with
2.649 +$G'_{1}$?
2.650 +\newline
2.651 +If unfortunately the nodes of the path were the first $k$ nodes in the
2.652 +matching order, the algorithm would iterate through all the possible k
2.653 +long paths in $G_{2}$, and it would recognize that no path can be
2.654 +extended to $G'_{1}$.
2.655 +\newline
2.656 +However, had it started by the matching of $G_{1}$, it would not
2.657 +have matched any nodes of the path.
2.658 +\end{example}
2.659 +
2.660 +These examples may look artificial, but the same problems also appear
2.661 +in real-world instances, even though in a less obvious way.
2.662 +
2.663 \subsection{Preparations}
2.664 \begin{claim}
2.665 \label{claim:claimCoverFromLeft}
2.666 The total ordering relation uniquely determines a node order, in which
2.667 -the nodes of $V_{small}$ will be covered by VF2. From the point of
2.668 +the nodes of $V_{1}$ will be covered by VF2. From the point of
2.669 view of the matching procedure, this means, that always the same node
2.670 -of $G_{small}$ will be covered on the d-th level.
2.671 +of $G_{1}$ will be covered on the d-th level.
2.672 \end{claim}
2.673
2.674 \begin{definition}
2.675 -An order $(u_{\sigma(1)},u_{\sigma(2)},..,u_{\sigma(|V_{small}|)})$ of
2.676 -$V_{small}$ is \textbf{matching order} if exists $\prec$ total
2.677 +An order $(u_{\sigma(1)},u_{\sigma(2)},..,u_{\sigma(|V_{1}|)})$ of
2.678 +$V_{1}$ is \textbf{matching order} if exists $\prec$ total
2.679 ordering relation, s.t. the VF2 with $\prec$ on the d-th level finds
2.680 -pair for $u_{\sigma(d)}$ for all $d\in\{1,..,|V_{small}|\}$.
2.681 +pair for $u_{\sigma(d)}$ for all $d\in\{1,..,|V_{1}|\}$.
2.682 \end{definition}
2.683
2.684 \begin{claim}\label{claim:MOclaim}
2.685 @@ -656,105 +695,13 @@
2.686 order, and every matching order can be determined by a total ordering,
2.687 however, more than one different total orderings may determine the
2.688 same matching order.
2.689 -\subsection{Idea behind the algorithm}
2.690 -The goal is to find a matching order in which the algorithm is able to
2.691 -recognize inconsistency or prune the infeasible branches on the
2.692 -highest levels and goes deep only if it is needed.
2.693 -
2.694 -\begin{notation}
2.695 -Let $\mathbf{Conn_{H}(u)}:=|\Gamma_{small}(u)\cap H\}|$, that is the
2.696 -number of neighbours of u which are in H, where $u\in V_{small} $ and
2.697 -$H\subseteq V_{small}$.
2.698 -\end{notation}
2.699 -
2.700 -The principal question is the following. Suppose a state $s$ is
2.701 -given. For which node of $T_{small}(s)$ is the hardest to find a
2.702 -consistent pair in $G_{large}$? The more covered neighbours a node in
2.703 -$T_{small}(s)$ has --- i.e. the largest $Conn_{M_{small}(s)}$ it has
2.704 ----, the more rarely satisfiable consistency constraints for its pair
2.705 -are given.
2.706 -
2.707 -In biology, most of the graphs are sparse, thus several nodes in
2.708 -$T_{small}(s)$ may have the same $Conn_{M_{small}(s)}$, which makes
2.709 -reasonable to define a secondary and a tertiary order between them.
2.710 -The observation above proves itself to be as determining, that the
2.711 -secondary ordering prefers nodes with the most uncovered neighbours
2.712 -among which have the same $Conn_{M_{small}(s)}$ to increase
2.713 -$Conn_{M_{small}(s)}$ of uncovered nodes so much, as possible. The
2.714 -tertiary ordering prefers nodes having the rarest uncovered labels.
2.715 -
2.716 -Note that the secondary ordering is the same as the ordering by $deg$,
2.717 -which is a static data in front of the above used.
2.718 -
2.719 -These rules can easily result in a matching order which contains the
2.720 -nodes of a long path successively, whose nodes may have low $Conn$ and
2.721 -is easily matchable into $G_{large}$. To avoid that, a BFS order is
2.722 -used, which provides the shortest possible paths.
2.723 -\newline
2.724 -
2.725 -In the following, some examples on which the VF2 may be slow are
2.726 -described, although they are easily solvable by using a proper
2.727 -matching order.
2.728 -
2.729 -\begin{example}
2.730 -Suppose $G_{small}$ can be mapped into $G_{large}$ in many ways
2.731 -without node labels. Let $u\in V_{small}$ and $v\in V_{large}$.
2.732 -\newline
2.733 -$lab(u):=black$
2.734 -\newline
2.735 -$lab(v):=black$
2.736 -\newline
2.737 -$lab(\tilde{u}):=red \ \forall \tilde{u}\in (V_{small}\backslash
2.738 -\{u\})$
2.739 -\newline
2.740 -$lab(\tilde{v}):=red \ \forall \tilde{v}\in (V_{large}\backslash
2.741 -\{v\})$
2.742 -\newline
2.743 -
2.744 -Now, any mapping by the node label $lab$ must contain $(u,v)$, since
2.745 -$u$ is black and no node in $V_{large}$ has a black label except
2.746 -$v$. If unfortunately $u$ were the last node which will get covered,
2.747 -VF2 would check only in the last steps, whether $u$ can be matched to
2.748 -$v$.
2.749 -\newline
2.750 -However, had $u$ been the first matched node, u would have been
2.751 -matched immediately to v, so all the mappings would have been
2.752 -precluded in which node labels can not correspond.
2.753 -\end{example}
2.754 -
2.755 -\begin{example}
2.756 -Suppose there is no node label given, $G_{small}$ is a small graph and
2.757 -can not be mapped into $G_{large}$ and $u\in V_{small}$.
2.758 -\newline
2.759 -Let $G'_{small}:=(V_{small}\cup
2.760 -\{u'_{1},u'_{2},..,u'_{k}\},E_{small}\cup
2.761 -\{(u,u'_{1}),(u'_{1},u'_{2}),..,(u'_{k-1},u'_{k})\})$, that is,
2.762 -$G'_{small}$ is $G_{small}\cup \{ a\ k$ long path, which is disjoint
2.763 -from $G_{small}$ and one of its starting points is connected to $u\in
2.764 -V_{small}\}$.
2.765 -\newline
2.766 -Is there a subgraph of $G_{large}$, which is isomorph with
2.767 -$G'_{small}$?
2.768 -\newline
2.769 -If unfortunately the nodes of the path were the first $k$ nodes in the
2.770 -matching order, the algorithm would iterate through all the possible k
2.771 -long paths in $G_{large}$, and it would recognize that no path can be
2.772 -extended to $G'_{small}$.
2.773 -\newline
2.774 -However, had it started by the matching of $G_{small}$, it would not
2.775 -have matched any nodes of the path.
2.776 -\end{example}
2.777 -
2.778 -These examples may look artificial, but the same problems also appear
2.779 -in real-world instances, even though in a less obvious way.
2.780
2.781 \subsection{Total ordering}
2.782 -Instead of the total ordering relation, the matching order will be
2.783 -searched directly.
2.784 +The matching order will be searched directly.
2.785 \begin{notation}
2.786 -Let \textbf{F$_\mathcal{M}$(l)}$:=|\{v\in V_{large} :
2.787 -l=lab(v)\}|-|\{u\in V_{small}\backslash \mathcal{M} : l=lab(u)\}|$ ,
2.788 -where $l$ is a label and $\mathcal{M}\subseteq V_{small}$.
2.789 +Let \textbf{F$_\mathcal{M}$(l)}$:=|\{v\in V_{2} :
2.790 +l=\mathcal{L}(v)\}|-|\{u\in V_{1}\backslash \mathcal{M} : l=\mathcal{L}(u)\}|$ ,
2.791 +where $l$ is a label and $\mathcal{M}\subseteq V_{1}$.
2.792 \end{notation}
2.793
2.794 \begin{definition}Let $\mathbf{arg\ max}_{f}(S) :=\{u\in S : f(u)=max_{v\in S}\{f(v)\}\}$ and $\mathbf{arg\ min}_{f}(S) := arg\ max_{-f}(S)$, where $S$ is a finite set and $f:S\longrightarrow \mathbb{R}$.
2.795 @@ -768,9 +715,9 @@
2.796 \caption{\hspace{0.5cm}$The\ method\ of\ VF2++\ for\ determining\ the\ node\ order$}\label{alg:VF2PPPseu}
2.797 \begin{algorithmic}[1]
2.798 \Procedure{VF2++order}{} \State $\mathcal{M}$ := $\emptyset$
2.799 -\Comment{matching order} \While{$V_{small}\backslash \mathcal{M}
2.800 +\Comment{matching order} \While{$V_{1}\backslash \mathcal{M}
2.801 \neq\emptyset$} \State $r\in$ arg max$_{deg}$ (arg
2.802 -min$_{F_\mathcal{M}\circ lab}(V_{small}\backslash
2.803 +min$_{F_\mathcal{M}\circ \mathcal{L}}(V_{1}\backslash
2.804 \mathcal{M})$)\label{alg:findMin} \State Compute $T$, a BFS tree with
2.805 root node $r$. \For{$d=0,1,...,depth(T)$} \State $V_d$:=nodes of the
2.806 $d$-th level \State Process $V_d$ \Comment{See Algorithm
2.807 @@ -786,7 +733,7 @@
2.808 \caption{\hspace{.5cm}$The\ method\ for\ processing\ a\ level\ of\ the\ BFS\ tree$}\label{alg:VF2PPProcess1}
2.809 \begin{algorithmic}[1]
2.810 \Procedure{VF2++ProcessLevel}{$V_{d}$} \While{$V_d\neq\emptyset$}
2.811 -\State $m\in$ arg min$_{F_\mathcal{M}\circ\ lab}($ arg max$_{deg}($arg
2.812 +\State $m\in$ arg min$_{F_\mathcal{M}\circ\ \mathcal{L}}($ arg max$_{deg}($arg
2.813 max$_{Conn_{\mathcal{M}}}(V_{d})))$ \State $V_d:=V_d\backslash m$
2.814 \State Append node $m$ to the end of $\mathcal{M}$ \State Refresh
2.815 $F_\mathcal{M}$ \EndWhile \EndProcedure
2.816 @@ -795,7 +742,7 @@
2.817
2.818 Algorithm~\ref{alg:VF2PPPseu} is a high level description of the
2.819 matching order procedure of VF2++. It computes a BFS tree for each
2.820 -component in ascending order of their rarest $lab$ and largest $deg$,
2.821 +component in ascending order of their rarest node labels and largest $deg$,
2.822 whose root vertex is the component's minimal
2.823 node. Algorithm~\ref{alg:VF2PPProcess1} is a method to process a level of the BFS tree, which appends the nodes of the current level in descending
2.824 lexicographic order by $(Conn_{\mathcal{M}},deg,-F_\mathcal{M})$ separately
2.825 @@ -807,43 +754,53 @@
2.826
2.827 \subsection{Cutting rules}
2.828 \label{VF2PPCuttingRules}
2.829 -This section presents the cutting rule of VF2++ in the case of IND, which is improved
2.830 -by using extra information coming from the node labels. For other problem types, the rules can be formulated similarly.
2.831 +This section presents the cutting rules of VF2++, which are improved by using extra information coming from the node labels.
2.832 \begin{notation}
2.833 -Let $\mathbf{\Gamma_{small}^{l}(u)}:=\{\tilde{u} : lab(\tilde{u})=l
2.834 -\wedge \tilde{u}\in \Gamma_{small} (u)\}$ and
2.835 -$\mathbf{\Gamma_{large}^{l}(v)}:=\{\tilde{v} : lab(\tilde{v})=l \wedge
2.836 -\tilde{v}\in \Gamma_{large} (v)\}$, where $u\in V_{small}$, $v\in
2.837 -V_{large}$ and $l$ is a label.
2.838 +Let $\mathbf{\Gamma_{1}^{l}(u)}:=\{\tilde{u} : \mathcal{L}(\tilde{u})=l
2.839 +\wedge \tilde{u}\in \Gamma_{1} (u)\}$ and
2.840 +$\mathbf{\Gamma_{2}^{l}(v)}:=\{\tilde{v} : \mathcal{L}(\tilde{v})=l \wedge
2.841 +\tilde{v}\in \Gamma_{2} (v)\}$, where $u\in V_{1}$, $v\in
2.842 +V_{2}$ and $l$ is a label.
2.843 \end{notation}
2.844
2.845 +\subsubsection{Induced subgraph isomorphism}
2.846 \begin{claim}
2.847 -\[LabCut_{IND}((u,v),M(s))\!:=\!\!\!\!\!\bigvee_{l\ is\ label}\!\!\!\!\!\!\!|\Gamma_{large}^{l} (v) \cap T_{large}(s)|\!<\!|\Gamma_{small}^{l}(u)\cap T_{small}(s)|\ \vee\]\[\bigvee_{l\ is\ label} \newline |\Gamma_{large}^{l}(v)\cap \tilde{T}_{large}(s)| < |\Gamma_{small}^{l}(u)\cap \tilde{T}_{small}(s)|\] is a cutting function by IND.
2.848 +\[LabCut_{IND}((u,v),\mathfrak{m}):=\bigvee_{l\ is\ label}|\Gamma_{2}^{l} (v) \cap T_{2}(\mathfrak{m})|\!<\!|\Gamma_{1}^{l}(u)\cap T_{1}(\mathfrak{m})|\ \vee\]\[\bigvee_{l\ is\ label} \newline |\Gamma_{2}^{l}(v)\cap \tilde{T}_{2}(\mathfrak{m})| < |\Gamma_{1}^{l}(u)\cap \tilde{T}_{1}(\mathfrak{m})|\] is a cutting function by IND.
2.849 +\end{claim}
2.850 +\subsubsection{Graph isomorphism}
2.851 +\begin{claim}
2.852 +\[LabCut_{ISO}((u,v),\mathfrak{m}):=\bigvee_{l\ is\ label}|\Gamma_{2}^{l} (v) \cap T_{2}(\mathfrak{m})|\!\neq\!|\Gamma_{1}^{l}(u)\cap T_{1}(\mathfrak{m})|\ \vee\]\[\bigvee_{l\ is\ label} \newline |\Gamma_{2}^{l}(v)\cap \tilde{T}_{2}(\mathfrak{m})| \neq |\Gamma_{1}^{l}(u)\cap \tilde{T}_{1}(\mathfrak{m})|\] is a cutting function by ISO.
2.853 \end{claim}
2.854
2.855 +\subsubsection{Subgraph isomorphism}
2.856 +\begin{claim}
2.857 +\[LabCut_{SU\!B}((u,v),\mathfrak{m}):=\bigvee_{l\ is\ label}|\Gamma_{2}^{l} (v) \cap T_{2}(\mathfrak{m})|\!<\!|\Gamma_{1}^{l}(u)\cap T_{1}(\mathfrak{m})|\] is a cutting function by SUB.
2.858 +\end{claim}
2.859
2.860 -\subsection{Implementation details}
2.861 +
2.862 +
2.863 +\section{Implementation details}
2.864 This section provides a detailed summary of an efficient
2.865 implementation of VF2++.
2.866 \subsubsection{Storing a mapping}
2.867 After fixing an arbitrary node order ($u_0, u_1, ..,
2.868 -u_{|G_{small}|-1}$) of $G_{small}$, an array $M$ is usable to store
2.869 +u_{|G_{1}|-1}$) of $G_{1}$, an array $M$ is usable to store
2.870 the current mapping in the following way.
2.871 \[
2.872 M[i] =
2.873 \begin{cases}
2.874 - v & if\ (u_i,v)\ is\ in\ the\ mapping\\ INVALID &
2.875 + v & if\ (u_i,v)\ is\ in\ the\ mapping\\ INV\!ALI\!D &
2.876 if\ no\ node\ has\ been\ mapped\ to\ u_i,
2.877 \end{cases}
2.878 \]
2.879 -where $i\in\{0,1, ..,|G_{small}|-1\}$, $v\in V_{large}$ and $INVALID$
2.880 +where $i\in\{0,1, ..,|G_{1}|-1\}$, $v\in V_{2}$ and $INV\!ALI\!D$
2.881 means "no node".
2.882 \subsubsection{Avoiding the recurrence}
2.883 The recursion of Algorithm~\ref{alg:VF2Pseu} can be realized
2.884 as a \textit{while loop}, which has a loop counter $depth$ denoting the
2.885 all-time depth of the recursion. Fixing a matching order, let $M$
2.886 denote the array storing the all-time mapping. Based on Claim~\ref{claim:claimCoverFromLeft},
2.887 -$M$ is $INVALID$ from index $depth$+1 and not $INVALID$ before
2.888 +$M$ is $INV\!ALI\!D$ from index $depth$+1 and not $INV\!ALI\!D$ before
2.889 $depth$. $M[depth]$ changes
2.890 while the state is being processed, but the property is held before
2.891 both stepping back to a predecessor state and exploring a successor
2.892 @@ -858,12 +815,12 @@
2.893 \subsubsection{Calculating the candidates for a node}
2.894 Being aware of Claim~\ref{claim:claimCoverFromLeft}, the
2.895 task is not to maintain the candidate set, but to generate the
2.896 -candidate nodes in $G_{large}$ for a given node $u\in V_{small}$. In
2.897 +candidate nodes in $G_{2}$ for a given node $u\in V_{1}$. In
2.898 case of any of the three problem types and a mapping $M$ if a node $v\in
2.899 -V_{large}$ is a potential pair of $u\in V_{small}$, then $\forall
2.900 -u'\in V_{small} : (u,u')\in
2.901 -E_{small}\ and\ u'\ is\ covered\ by\ M\ \Rightarrow (v,Pair(M,u'))\in
2.902 -E_{large}$. That is, each covered neighbour of $u$ has to be mapped to
2.903 +V_{2}$ is a potential pair of $u\in V_{1}$, then $\forall
2.904 +u'\in V_{1} : (u,u')\in
2.905 +E_{1}\ and\ u'\ is\ covered\ by\ M\ \Rightarrow (v,Pair(M,u'))\in
2.906 +E_{2}$. That is, each covered neighbour of $u$ has to be mapped to
2.907 a covered neighbour of $v$.
2.908
2.909 Having said that, an algorithm running in $\Theta(deg)$ time is
2.910 @@ -879,16 +836,16 @@
2.911 numbers $\{0,1,..,|K|-1\}$, where $K$ is the set of the labels. It
2.912 enables $F_\mathcal{M}$ to be stored in an array. At first, the node order
2.913 $\mathcal{M}=\emptyset$, so $F_\mathcal{M}[i]$ is the number of nodes
2.914 -in $V_{small}$ having label i, which is easy to compute in
2.915 -$\Theta(|V_{small}|)$ steps.
2.916 +in $V_{1}$ having label $i$, which is easy to compute in
2.917 +$\Theta(|V_{1}|)$ steps.
2.918
2.919 -Representing $\mathcal{M}\subseteq V_{small}$ as an array of
2.920 -size $|V_{small}|$, both the computation of the BFS tree, and processing its levels by Algorithm~\ref{alg:VF2PPProcess1} can be done inplace by swapping nodes.
2.921 +Representing $\mathcal{M}\subseteq V_{1}$ as an array of
2.922 +size $|V_{1}|$, both the computation of the BFS tree, and processing its levels by Algorithm~\ref{alg:VF2PPProcess1} can be done inplace by swapping nodes.
2.923
2.924 \subsubsection{Cutting rules}
2.925 In Section~\ref{VF2PPCuttingRules}, the cutting rules were
2.926 -described using the sets $T_{small}$, $T_{large}$, $\tilde T_{small}$
2.927 -and $\tilde T_{large}$, which are dependent on the all-time mapping
2.928 +described using the sets $T_{1}$, $T_{2}$, $\tilde T_{1}$
2.929 +and $\tilde T_{2}$, which are dependent on the all-time mapping
2.930 (i.e. on the all-time state). The aim is to check the labeled cutting
2.931 rules of VF2++ in $\Theta(deg)$ time.
2.932
2.933 @@ -896,93 +853,37 @@
2.934 checking whether a node is in a certain set takes constant time,
2.935 e.g. they are given by their 0-1 characteristic vectors. Let $L$ be an
2.936 initially zero integer lookup table of size $|K|$. After incrementing
2.937 -$L[lab(u')]$ for all $u'\in \Gamma_{small}(u) \cap T_{small}(s)$ and
2.938 -decrementing $L[lab(v')]$ for all $v'\in\Gamma_{large} (v) \cap
2.939 -T_{large}(s)$, the first part of the cutting rules is checkable in
2.940 +$L[\mathcal{L}(u')]$ for all $u'\in \Gamma_{1}(u) \cap T_{1}(\mathfrak{m})$ and
2.941 +decrementing $L[\mathcal{L}(v')]$ for all $v'\in\Gamma_{2} (v) \cap
2.942 +T_{2}(s)$, the first part of the cutting rules is checkable in
2.943 $\Theta(deg)$ time by considering the proper signs of $L$. Setting $L$
2.944 to zero takes $\Theta(deg)$ time again, which makes it possible to use
2.945 the same table through the whole algorithm. The second part of the
2.946 cutting rules can be verified using the same method with $\tilde
2.947 -T_{small}$ and $\tilde T_{large}$ instead of $T_{small}$ and
2.948 -$T_{large}$. Thus, the overall complexity is $\Theta(deg)$.
2.949 +T_{1}$ and $\tilde T_{2}$ instead of $T_{1}$ and
2.950 +$T_{2}$. Thus, the overall complexity is $\Theta(deg)$.
2.951
2.952 -An other integer lookup table storing the number of covered neighbours
2.953 -of each node in $G_{large}$ gives all the information about the sets
2.954 -$T_{large}$ and $\tilde T_{large}$, which is maintainable in
2.955 +Another integer lookup table storing the number of covered neighbours
2.956 +of each node in $G_{2}$ gives all the information about the sets
2.957 +$T_{2}$ and $\tilde T_{2}$, which is maintainable in
2.958 $\Theta(deg)$ time when a pair is added or substracted by incrementing
2.959 or decrementing the proper indices. A further improvement is that the
2.960 -values of $L[lab(u')]$ in case of checking $u$ is dependent only on
2.961 -$u$, i.e. on the size of the mapping, so for each $u\in V_{small}$ an
2.962 +values of $L[\mathcal{L}(u')]$ in case of checking $u$ are dependent only on
2.963 +$u$, i.e. on the size of the mapping, so for each $u\in V_{1}$ an
2.964 array of pairs (label, number of such labels) can be stored to skip
2.965 the maintaining operations. Note that these arrays are at most of size
2.966 -$deg$. Skipping this trick, the number of covered neighbours has to be
2.967 -stored for each node of $G_{small}$ as well to get the sets
2.968 -$T_{small}$ and $\tilde T_{small}$.
2.969 +$deg$.
2.970
2.971 -Using similar tricks, the consistency function can be evaluated in
2.972 +Using similar techniques, the consistency function can be evaluated in
2.973 $\Theta(deg)$ steps, as well.
2.974
2.975 -\section{The VF2 Plus Algorithm}
2.976 -The VF2 Plus algorithm is a recently improved version of VF2. It was
2.977 -compared with the state of the art algorithms in \cite{VF2Plus} and
2.978 -has proved itself to be competitive with RI, the best algorithm on
2.979 -biological graphs. \\ A short summary of VF2 Plus follows, which uses
2.980 -the notation and the conventions of the original paper.
2.981 +\section{Experimental results}
2.982 +This section compares the performance of VF2++ and VF2 Plus. According to
2.983 +our experience, both algorithms run faster than VF2 with orders of
2.984 +magnitude, thus its inclusion was not reasonable.
2.985
2.986 -\subsection{Ordering procedure}
2.987 -VF2 Plus uses a sorting procedure that prefers nodes in $V_{small}$
2.988 -with the lowest probability to find a pair in $V_{small}$ and the
2.989 -highest number of connections with the nodes already sorted by the
2.990 -algorithm.
2.991 -
2.992 -\begin{definition}
2.993 -$(u,v)$ is a \textbf{feasible pair} if $lab(u)=lab(v)$ and
2.994 - $deg(u)\leq deg(v)$, where $u\in{V_{small}}$ and $ v\in{V_{large}}$.
2.995 -\end{definition}
2.996 -$P_{lab}(L):=$ a priori probability to find a node with label $L$ in
2.997 -$V_{large}$
2.998 -\newline
2.999 -$P_{deg}(d):=$ a priori probability to find a node with degree $d$ in
2.1000 -$V_{large}$
2.1001 -\newline
2.1002 -$P(u):=P_{lab}(L)*\bigcup_{d'>d}P_{deg}(d')$\\ $M$ is the set of
2.1003 -already sorted nodes, $T$ is the set of nodes candidate to be
2.1004 -selected, and $degreeM$ of a node is the number of its neighbours in
2.1005 -$M$.
2.1006 -\begin{algorithm}
2.1007 -\algtext*{EndIf}%ne nyomtasson end if-et
2.1008 -\algtext*{EndFor}%nenyomtasson ..
2.1009 -\algtext*{EndProcedure}%ne nyomtasson ..
2.1010 -\algtext*{EndWhile}
2.1011 -\caption{}\label{alg:VF2PlusPseu}
2.1012 -\begin{algorithmic}[1]
2.1013 -\Procedure{VF2 Plus order}{} \State Select the node with the lowest
2.1014 -$P$. \If {more nodes share the same $P$} \State select the one with
2.1015 -maximum degree \EndIf \If {more nodes share the same $P$ and have the
2.1016 - max degree} \State select the first \EndIf \State Put the selected
2.1017 -node in the set $M$. \label{alg:putIn} \State Put all its unsorted
2.1018 -neighbours in the set $T$. \If {$M\neq V_{small}$} \State From set
2.1019 -$T$ select the node with maximum $degreeM$. \If {more nodes have
2.1020 - maximum $degreeM$} \State Select the one with the lowest $P$ \EndIf
2.1021 -\If {more nodes have maximum $degreeM$ and $P$} \State Select the
2.1022 -first. \EndIf \State \textbf{goto \ref{alg:putIn}.} \EndIf
2.1023 -\EndProcedure
2.1024 -\end{algorithmic}
2.1025 -\end{algorithm}
2.1026 -
2.1027 -Using these notations, Algorithm~\ref{alg:VF2PlusPseu}
2.1028 -provides the description of the sorting procedure.
2.1029 -
2.1030 -Note that $P(u)$ is not the exact probability of finding a consistent
2.1031 -pair for $u$ by choosing a node of $V_{large}$ randomly, since
2.1032 -$P_{lab}$ and $P_{deg}$ are not independent, though calculating the
2.1033 -real probability would take quadratic time, which may be reduced by
2.1034 -using fittingly lookup tables.
2.1035 -
2.1036 -\section{Experimental results}
2.1037 -This section compares the performance of VF2++ and VF2 Plus. Both
2.1038 -algorithms have run faster with orders of magnitude than VF2, thus its
2.1039 -inclusion was not reasonable.
2.1040 +The algorithms were implemented in C++ using the open source
2.1041 +LEMON graph and network optimization library\cite{LEMON}. The test were carried out on a linux based system with an Intel i7 X980 3.33 GHz CPU and 6 GB of RAM.
2.1042 \subsection{Biological graphs}
2.1043 The tests have been executed on a recent biological dataset created
2.1044 for the International Contest on Pattern Search in Biological
2.1045 @@ -996,7 +897,7 @@
2.1046 contact map dataset contains graphs with 150-800 nodes and an average
2.1047 degree of 20. \\
2.1048
2.1049 -In the following, the induced subgraph isomorphism and the graph
2.1050 +In the following, both the induced subgraph isomorphism and the graph
2.1051 isomorphism will be examined.
2.1052
2.1053 This dataset provides graph pairs, between which all the induced subgraph isomorphisms have to be found. For runtime results, please see Figure~\ref{fig:bioIND}.
2.1054 @@ -1258,8 +1159,8 @@
2.1055 small graphs of various size.
2.1056
2.1057 For each chart, a number $0<\rho< 1$ has been fixed, and the following
2.1058 -has been executed 150 times. Generating a large graph $G_{large}$ of an average degree of $\delta$,
2.1059 -choose 10 of its induced subgraphs having $\rho\ |V_{large}|$ nodes,
2.1060 +has been executed 150 times. Generating a large graph $G_{2}$ of an average degree of $\delta$,
2.1061 +choose 10 of its induced subgraphs having $\rho\ |V_{2}|$ nodes,
2.1062 and for all the 10 subgraphs find a mapping by using both the graph
2.1063 matching algorithms. The $\delta = 5, 10, 35$ and $\rho = 0.05, 0.1,
2.1064 0.3, 0.6, 0.8, 0.95$ cases have been examined, see
2.1065 @@ -1581,9 +1482,7 @@
2.1066
2.1067
2.1068 \section{Conclusion}
2.1069 -In this paper, after providing a short summary of the recent
2.1070 -algorithms, a new graph matching algorithm based on VF2, called VF2++,
2.1071 -has been presented and analyzed from a practical viewpoint.
2.1072 +This paper presented VF2++, a new graph matching algorithm based on VF2, called VF2++, and analyzed it from a practical viewpoint.
2.1073
2.1074 Recognizing the importance of the node order and determining an
2.1075 efficient one, VF2++ is able to match graphs of thousands of nodes in
2.1076 @@ -1593,8 +1492,8 @@
2.1077 most of the unfruitful branches without going astray.
2.1078
2.1079 In order to show the efficiency of the new method, it has been
2.1080 -compared to VF2 Plus, which is the best concurrent algorithm based on
2.1081 -\cite{VF2Plus}.
2.1082 +compared to VF2 Plus\cite{VF2Plus}, which is the best contemporary algorithm.
2.1083 +.
2.1084
2.1085 The experiments show that VF2++ consistently outperforms VF2 Plus on
2.1086 biological graphs. It seems to be asymptotically faster on protein and
2.1087 @@ -1603,8 +1502,8 @@
2.1088 asymptotic behaviour on protein graphs.
2.1089
2.1090 Regarding random sparse graphs, not only has VF2++ proved itself to be
2.1091 -faster than VF2 Plus, but it has a practically linear behaviour both
2.1092 -in the case of induced subgraph- and graph isomorphism, as well.
2.1093 +faster than VF2 Plus, but it also has a practically linear behaviour both
2.1094 +in the case of induced subgraph- and graph isomorphism.
2.1095
2.1096
2.1097