2008-11-26

Resolved a subtle teal::vout bug

I've confirmed a subtle teal::vout bug in a multi-threaded truss verification environment. On my Linux boxes (both i686 and x86_64), ncsim crashed randomly. Although the crash rate is below 1%, it prevents me from running regression tests with teal/truss.

Here's the symptom:
./teal_vout.cpp: (set_file_and_line:631) test_component_0 thread(0x44606960) vout(0x2109718) [FILE: truss/trunk/cpp/inc/truss_test_component.h][line: 70] begin
... ... ...
./teal_vout.cpp: (set_file_and_line:631) test_component_0 thread(0x40A00960) vout(0x2109718) [FILE: ./test_component.cpp][line: 92] begin
... ... ...
=> crashed

Here's the reason:
  • Thread test_component_0 is with ID(0x44606960)
  • Thread verification_top is with ID(0x40A00960)
  • vout(0x2109718) is constructed by thread(0x40A00960) with test_component_0 as its function_area_ name
truss::verification_top calls "test_0->start()", which calls "lls::test_component_[0]->start()", which calls "truss::thread.start()" to create a new thread(0x44606960), which calls "truss::test_component.start_()" to start a new teal::vout message on vout(0x2109718).

However, before that message was finished with teal::vout::end_message_() on thread(0x44606960), a new message to the same vout(0x2109718) was issued. It came from "truss::verification_top::test_0->wait_for_completion()", which calls "lls::test_component_[0]->wait_for_completion()", which calls "truss::test_component.wait_for_completion()", which calls "lls::test_component::wait_for_completion_()" on thread(0x40A00960).

As a result, the message_list_ and the other maps in teal::vout got polluted. Sometimes this ends up as a "glibc detected double free or corruption" error, which crashes the simulation.

And here's my solution:
  1. remove internal_mutex_sentry() from teal::vout::put_message()
  2. put a pthread_mutex_lock() call in teal::vout::set_file_and_line()
  3. put a pthread_mutex_unlock() call in teal::vout::clear_message_()
Although this decreased the simulation crash rate dramatically (from 1% to below 0.1%), there's still a simulation error to be resolved: "Cannot place a value with no or zero delay at read only sync".