[Lemon-user] bug in ListDigraph::erase(Node &) ?

Thu Mar 31 17:24:44 CEST 2011

Hi Alpar,

On Thu, Mar 31, 2011 at 10:12:13AM +0200, Alpár Jüttner wrote:
> > Not always! In fact, most of the time Valgrind won't say anything,
> > because so long as array[-2] is still inside valid memory (which it
> > might tend to be, statistically speaking), Valgrind doesn't report
> > anything. (I've just observed exactly this behaviour in my code.)
> 
> I think you are not right. Valgrind uses a much more sophisticated
> technique to localize wrong memory usage.

OK, you caught me. ;) To be honest, I have no clue how Valgrind works
(even though I use it quite often!) and was just making a guess to
explain the behaviour I was seeing.

> For example I have just tried
> this simple code:
> 
> On my laptop it compiles and runs with no error

You're lucky! (Or maybe unlucky?) On my machine at work (debian x64
system, g++ 4.4, glibc 2.10) I get this:

$ cat blah.cpp
#include<lemon/list_graph.h>
using namespace lemon;
int main()
{
  ListDigraph g;
  ListDigraph::Node n = g.addNode();
  g.erase(n);
  g.erase(n);
}

$ g++ -o blah blah.cpp -I/work/lemon/include
$ ./blah [1]    17933 segmentation fault  ./blah

However, if I run it in Valgrind, I get the same invalid write
error you did.

Here's where it gets interesting, though. Using the simple g++
invocation above, I get:

-rwxr-xr-x 1 duraid duraid 51089 2011-04-01 00:10 blah

but look what happens if I build your code in my current project's
build environment (i.e. replace main() of my code with your file, but
use the same compiler flags, still link in the same libraries, etc.):

-rwxr-xr-x   1 duraid duraid 1877694 2011-04-01 00:11 blah

$ ./blah
$ valgrind ./blah
==18290== Memcheck, a memory error detector
==18290== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==18290== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
==18290== Command: ./blah
==18290== 
==18290== 
==18290== HEAP SUMMARY:
==18290==     in use at exit: 0 bytes in 0 blocks
==18290==   total heap usage: 0 allocs, 0 frees, 0 bytes allocated
==18290== 
==18290== All heap blocks were freed -- no leaks are possible
==18290== 
==18290== For counts of detected and suppressed errors, rerun with: -v
==18290== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 4 from 4)

Oh dear.

I stepped through the program and it is indeed making two calls to
erase(). After the first erase(), nodes[] looks like this:

(gdb) print g.nodes[-2]
$1 = (lemon::ListDigraphBase::NodeT &) @0x7ffff600e020: {first_in = 0, first_out = -16, prev = -1, next = -1}
(gdb) print g.nodes[-1]
$2 = (lemon::ListDigraphBase::NodeT &) @0x7ffff600e030: {first_in = -1, first_out = 268435455, prev = 14, next = 0}
(gdb) print g.nodes[0]
$3 = (lemon::ListDigraphBase::NodeT &) @0x7ffff600e040: {first_in = -1, first_out = -1, prev = -2, next = -1}
(gdb) print g.nodes[1]
$4 = (lemon::ListDigraphBase::NodeT &) @0x7ffff600e050: {first_in = 0, first_out = 0, prev = 0, next = 0}
(gdb) print g.nodes[2]
$5 = (lemon::ListDigraphBase::NodeT &) @0x7ffff600e060: {first_in = 0, first_out = 0, prev = 0, next = 0}

For comparison, after the first erase() in the first program, I get:

(gdb) print g.nodes[-2]
$1 = (lemon::ListDigraphBase::NodeT &) @0x605ff0: {first_in = 0, first_out = 0, prev = 0, next = 0}
(gdb) print g.nodes[-1]
$2 = (lemon::ListDigraphBase::NodeT &) @0x606000: {first_in = 0, first_out = 0, prev = 33, next = 0}
(gdb) print g.nodes[0]
$3 = (lemon::ListDigraphBase::NodeT &) @0x606010: {first_in = -1, first_out = -1, prev = -2, next = -1}
(gdb) print g.nodes[1]
$4 = (lemon::ListDigraphBase::NodeT &) @0x606020: {first_in = 0, first_out = 0, prev = 135137, next = 0}
(gdb) print g.nodes[2]
$5 = (lemon::ListDigraphBase::NodeT &) @0x606030: {first_in = 0, first_out = 0, prev = 0, next = 0}

Now I think I see the problem. I'm not very familiar with the AMD64
ABI, but those 0x7fff.... pointers look like stack addresses, where I
guess Valgrind becomes pretty useless. (Those smaller pointers seem to
be where g++ has stuck its initialization code.) The irony is that
Valgrind now "fails" on my program as a side-effect of some performance
work I did *specifically* so I could get an edit-compile-valgrind loop
going at bearable speed. Believe me, I love Valgrind just as much as
the next guy, but it is no panacea!

To try and bring this discussion back on-topic, I am starting to wonder
if LEMON should support an (additional? optional?) erase() semantics
of:

    "g.erase(x) establishes that g does not contain x"
       (and does not format your hard disk. [*])

Do you think this is worth supporting?

My opinion is that even though this kind of semantics involves _some_
kind of overhead, it is not only useful for some user applications
(IMHO), but it would also naturally provide the DEBUG mode you
suggested, which I really think is a great idea, partly because it
would avoid the massive speed hit one can get running things under
Valgrind, and partly because I think it would be more robust.

It's fine for users to implement their own 'does this graph contain
this node' query systems (and maybe they need to, if they have some
strange requirements (performance?) particular to their application),
but I think it would be nice if LEMON provided one to cover the
majority of cases where an application's complexity means it isn't
trivial to know which graphs contain which nodes at any given moment.

What do you think? Maybe there is something obvious I am missing about
LEMON's current erase() semantics? In my code, I have gone and
implemented the necessary query functionality to ensure that
double-erases don't occur, but maybe there is some other way of doing
this in LEMON I don't know about?

	Duraid

[*] - This is just the usual joke that some kind of passive-aggressive
compiler could legitimately claim 100% compliance with the C++ standard
and actually wipe out your home directory if you dare erase the same
node twice!