I've confirmed a subtle teal::vout bug for multi-threaded truss verification environment. In my Linux box (both i686 and x86_64), the ncsim got crashed randomly. Although the crash rate is bellow 1%, it makes me not able to run regression test with teal/truss.
Here's the symptom:
./teal_vout.cpp: (
set_file_and_line:631) test_component_0 thread(
0x44606960) vout(
0x2109718) [FILE: truss/trunk/cpp/inc/truss_test_component.h][line: 70] begin
... ... ...
./teal_vout.cpp: (
set_file_and_line:631) test_component_0 thread(
0x40A00960) vout(
0x2109718) [FILE: ./test_component.cpp][line: 92] begin
... ... ...
=> crashed
Here's the reason:
- Thread test_component_0 is with ID(0x44606960)
- Thread verification_top is with ID(0x40A00960)
- vout(0x2109718) is constructed by thread(0x40A00960) with test_component_0 as its function_area_ name
truss::verification_top calls "test_0->start()", which calls "lls::test_component_[0]->start()", which calls "truss::thread.start()" to create a new thread(
0x44606960), which calls "truss::test_component.start_()" to start a new teal::vout message to vout(
0x2109718).
However, before it was finished with teal::vout::end_message_() at thread(
0x44606960), a new message to the same vout(
0x2109718) was issued. It was called from "truss::verification_top::test_0->wait_for_completion ()", which calls "lls::test_component_[0]->wait_for_completion ()", which calls "truss::test_component.wait_for_completion()", which calls "lls::test_component::wait_for_completion_()" at thread (
0x40A00960).
As a result, the message_list_ and the other maps of teal::vout got polluted. Sometime, it will end up with a "glibc detected double free or corruption error", which crashed the simulation.
And, here's my solution:
- remove internal_mutex_sentry() from teal::vout::put_message()
- put a pthread_mutex_lock() at teal::vout::set_file_and_line()
- put a pthread_mutex_unlock() at teal::vout::clear_message_()
Although this decreased the simulation crash rate dramatically (from 1% to bellow 0.1%). There's still a simulation error to be resolved: "
Cannot place a value with no or zero delay at read only sync".