GDB debugging output of a multi-threaded program

Time: 2022-07-05 20:41:57

All,

I am debugging a 24-thread program with GDB. I have found the line of code where the error occurs, but I cannot tell from GDB's output what the error is. The following line triggers the error; it is just an ordinary insertion into a map:

current_node->children.insert(std::pair<string, ComponentTrieNode*>(comps[j], temp_node));

I used GDB to find the thread in which the error happens and switched to that thread; the backtrace command shows the function calls on the stack. (The last few commands try to print the values of some variables in a function, but they fail.)

What should I do to find out exactly what error is happening?

[root@localhost nameComponentEncoding]# gdb NCE_david
GNU gdb (GDB) Fedora (7.2.90.20110429-36.fc15)
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /mnt/disk2/experiments_BLOODMOON/two_stage_bloom_filter/programs/nameComponentEncoding/NCE_david...done.
(gdb) r /mnt/disk2/FIB_with_port/10_1.txt /mnt/disk2/trace/a_10_1.trace /mnt/disk2/FIB_with_port/10_2.txt
Starting program: /mnt/disk2/experiments_BLOODMOON/two_stage_bloom_filter/programs/nameComponentEncoding/NCE_david /mnt/disk2/FIB_with_port/10_1.txt /mnt/disk2/trace/a_10_1.trace /mnt/disk2/FIB_with_port/10_2.txt
[Thread debugging using libthread_db enabled]
[New Thread 0x7fffd2bf5700 (LWP 13129)]
[New Thread 0x7fffd23f4700 (LWP 13130)]
[New Thread 0x7fffd1bf3700 (LWP 13131)]
[New Thread 0x7fffd13f2700 (LWP 13132)]
[New Thread 0x7fffd0bf1700 (LWP 13133)]
[New Thread 0x7fffd03f0700 (LWP 13134)]
[New Thread 0x7fffcfbef700 (LWP 13135)]
[New Thread 0x7fffcf3ee700 (LWP 13136)]
[New Thread 0x7fffcebed700 (LWP 13137)]
[New Thread 0x7fffce3ec700 (LWP 13138)]
[New Thread 0x7fffcdbeb700 (LWP 13139)]
[New Thread 0x7fffcd3ea700 (LWP 13140)]
[New Thread 0x7fffccbe9700 (LWP 13141)]
[New Thread 0x7fffcc3e8700 (LWP 13142)]
[New Thread 0x7fffcbbe7700 (LWP 13143)]
[New Thread 0x7fffcb3e6700 (LWP 13144)]
[New Thread 0x7fffcabe5700 (LWP 13145)]
[New Thread 0x7fffca3e4700 (LWP 13146)]
[New Thread 0x7fffc9be3700 (LWP 13147)]
[New Thread 0x7fffc93e2700 (LWP 13148)]
[New Thread 0x7fffc8be1700 (LWP 13149)]
[New Thread 0x7fffc83e0700 (LWP 13150)]
[New Thread 0x7fffc7bdf700 (LWP 13151)]
this is thread 1
this is thread 7
this is thread 14
this is thread 18
this is thread 2
this is thread 19
this is thread 6
this is thread 8
this is thread 24
base: 64312646
this is thread 11
this is thread 5
this is thread 12
this is thread 13
this is thread 3
this is thread 15
this is thread 16
this is thread 17
this is thread 4
this is thread 20
this is thread 21
this is thread 22
this is thread 23
this is thread 9
this is thread 10

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffc8be1700 (LWP 13149)]
std::local_Rb_tree_rotate_left (__x=0xa057c90, __root=@0x608118) at ../../../../libstdc++-v3/src/tree.cc:126
126         __x->_M_right = __y->_M_left;
(gdb) info threads
  Id   Target Id         Frame
  24   Thread 0x7fffc7bdf700 (LWP 13151) "NCE_david" compare (__n=<optimized out>, __s2=<optimized out>, __s1=<optimized out>)
    at /usr/lib/gcc/x86_64-redhat-linux/4.6.0/../../../../include/c++/4.6.0/bits/char_traits.h:257
  (... other 22 threads not listed)
  2    Thread 0x7fffd2bf5700 (LWP 13129) "NCE_david" compare (__n=<optimized out>, __s2=<optimized out>, __s1=<optimized out>)
    at /usr/lib/gcc/x86_64-redhat-linux/4.6.0/../../../../include/c++/4.6.0/bits/char_traits.h:257
  1    Thread 0x7ffff7fe57a0 (LWP 13126) "NCE_david" strtok () at ../sysdeps/x86_64/strtok.S:76
(gdb) thread 22
[Switching to thread 22 (Thread 0x7fffc8be1700 (LWP 13149))]
#0  std::local_Rb_tree_rotate_left (__x=0xa057c90, __root=@0x608118) at ../../../../libstdc++-v3/src/tree.cc:126
126         __x->_M_right = __y->_M_left;

(gdb) bt
#0  std::local_Rb_tree_rotate_left (__x=0xa057c90, __root=@0x608118) at ../../../../libstdc++-v3/src/tree.cc:126
#1  0x0000003cdd26e848 in std::_Rb_tree_insert_and_rebalance (__insert_left=<optimized out>, __x=0x7fffc0005ba0, __p=<optimized out>, __header=...)
    at ../../../../libstdc++-v3/src/tree.cc:266
#2  0x00000000004029ca in std::_Rb_tree<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ComponentTrieNode*>, std::_Select1st<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ComponentTrieNode*> >, std::less<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ComponentTrieNode*> > >::_M_insert_ (this=0x608108, __x=<optimized out>, __p=0x16cd3e30, __v=...)
    at /usr/lib/gcc/x86_64-redhat-linux/4.6.0/../../../../include/c++/4.6.0/bits/stl_pair.h:87
#3  0x0000000000402b7d in std::_Rb_tree<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ComponentTrieNode*>, std::_Select1st<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ComponentTrieNode*> >, std::less<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ComponentTrieNode*> > >::_M_insert_unique (this=0x608108, __v=...)
    at /usr/lib/gcc/x86_64-redhat-linux/4.6.0/../../../../include/c++/4.6.0/bits/stl_tree.h:1281
#4  0x000000000040444c in insert (__x=..., this=0x608108) at /usr/lib/gcc/x86_64-redhat-linux/4.6.0/../../../../include/c++/4.6.0/bits/stl_map.h:518
#5  ComponentTrie::add_prefix (this=0x7fffffffe2e0, prefix_input=<optimized out>, port=10) at ComponentTrie_david.cpp:112
#6  0x0000000000401c3b in main._omp_fn.0 () at NameComponentEncoding_david.cpp:277
#7  0x0000003cd2607fea in gomp_thread_start (xdata=<optimized out>) at ../../../libgomp/team.c:115
#8  0x0000003cd0607cd1 in start_thread (arg=0x7fffc8be1700) at pthread_create.c:305
#9  0x0000003cd02dfd3d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

(gdb) p 'ComponentTrie::add_prefix(char*, int)'::comps[j]
No symbol "comps" in specified context.
(gdb) p 'ComponentTrie::add_prefix(char*, int)'::prefix
No symbol "prefix" in specified context.

Edit: I ran the code with valgrind --tool=memcheck; the result follows.

[root@localhost nameComponentEncoding]# valgrind --tool=memcheck ./NCE_david /mnt/disk2/FIB_with_port/10_1.txt /mnt/disk2/trace/a_10_1.trace /mnt/disk2/FIB_with_port/10_2.txt
(... many lines omitted)
==13261==
==13261== Thread 11:
==13261== Invalid read of size 1
==13261==    at 0x3CD02849BC: strtok (strtok.S:141)
==13261==    by 0x40426A: ComponentTrie::add_prefix(char*, int) (ComponentTrie_david.cpp:99)
==13261==    by 0x40242C: main._omp_fn.0 (NameComponentEncoding_david.cpp:531)
==13261==    by 0x3CD2607FE9: gomp_thread_start (team.c:115)
==13261==    by 0x3CD0607CD0: start_thread (pthread_create.c:305)
==13261==    by 0x3CD02DFD3C: clone (clone.S:115)
==13261==  Address 0x234422c02 is not stack'd, malloc'd or (recently) free'd
==13261==
==13261== Invalid read of size 1
==13261==    at 0x3CD02849EC: strtok (strtok.S:167)
==13261==    by 0x40426A: ComponentTrie::add_prefix(char*, int) (ComponentTrie_david.cpp:99)
==13261==    by 0x40242C: main._omp_fn.0 (NameComponentEncoding_david.cpp:531)
==13261==    by 0x3CD2607FE9: gomp_thread_start (team.c:115)
==13261==    by 0x3CD0607CD0: start_thread (pthread_create.c:305)
==13261==    by 0x3CD02DFD3C: clone (clone.S:115)
==13261==  Address 0x234422c02 is not stack'd, malloc'd or (recently) free'd
==13261==
Insertion and lookup cost time(us): 994669532   67108864        14.821731       0.067469
component number:4849478, state number: 2545847
Parallel threads:24
==13261==
==13261== HEAP SUMMARY:
==13261==     in use at exit: 4,239,081,584 bytes in 76,746,193 blocks
==13261==   total heap usage: 80,050,114 allocs, 3,303,921 frees, 4,323,622,103 bytes allocated
==13261==
==13261== LEAK SUMMARY:
==13261==    definitely lost: 0 bytes in 0 blocks
==13261==    indirectly lost: 0 bytes in 0 blocks
==13261==      possibly lost: 4,111,951,106 bytes in 74,746,429 blocks
==13261==    still reachable: 127,130,478 bytes in 1,999,764 blocks
==13261==         suppressed: 0 bytes in 0 blocks
==13261== Rerun with --leak-check=full to see details of leaked memory
==13261==
==13261== For counts of detected and suppressed errors, rerun with: -v
==13261== Use --track-origins=yes to see where uninitialised values come from
==13261== ERROR SUMMARY: 45 errors from 30 contexts (suppressed: 6 from 6)

1 Solution

#1


3  

We know that the program is segfaulting on this line:

current_node->children.insert(std::pair<string, ComponentTrieNode*>(comps[j], temp_node));

From the stack trace, we know that the segfault happens deep in the red-black tree implementation behind std::map:

#0  std::local_Rb_tree_rotate_left (__x=0xa057c90, __root=@0x608118) at ../../../../libstdc++-v3/src/tree.cc:126
126         __x->_M_right = __y->_M_left;

This implies that:

  1. The segfault could be caused by:
    1. evaluating __x->_M_right
    2. evaluating __y->_M_left
    3. storing the right-hand side into the left-hand side of __x->_M_right = __y->_M_left
  2. std::map::insert() being called implies that the segfault was NOT caused while building the arguments to the call. In particular, comps[j] is not out of bounds.
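
To make the pointer accesses on the crashing line concrete, here is a minimal sketch of a left rotation on a binary tree. It is not the libstdc++ source: Node and rotate_left are hypothetical stand-ins, and only the statement marked below corresponds to the line shown in frame #0. If another thread has modified or freed either node in the meantime, any of the marked dereferences can fault.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified tree node (NOT libstdc++'s _Rb_tree_node_base).
struct Node {
    Node* parent;
    Node* left;
    Node* right;
    int key;
    explicit Node(int k) : parent(nullptr), left(nullptr), right(nullptr), key(k) {}
};

// Rotates the subtree rooted at x to the left; root is updated if x was the root.
void rotate_left(Node* x, Node*& root) {
    assert(x != nullptr && x->right != nullptr);  // a rotation needs a right child

    Node* y = x->right;        // reads through x
    x->right = y->left;        // reads through y, writes through x (cf. tree.cc:126)
    if (y->left != nullptr) {
        y->left->parent = x;   // writes through y->left
    }
    y->parent = x->parent;

    if (x == root) {
        root = y;
    } else if (x == x->parent->left) {
        x->parent->left = y;
    } else {
        x->parent->right = y;
    }
    y->left = x;
    x->parent = y;
}
```

In a single-threaded program these pointers are kept consistent by the container itself, which is why a fault here usually points to corruption that happened earlier or elsewhere.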

This leads me to think that your heap was already corrupted by previous memory operation errors by this time and that the crash in std::map::insert() is a symptom and not a cause.

Run your program under the Valgrind memcheck tool:

$ valgrind --tool=memcheck /mnt/disk2/experiments_BLOODMOON/two_stage_bloom_filter/programs/nameComponentEncoding/NCE_david /mnt/disk2/FIB_with_port/10_1.txt /mnt/disk2/trace/a_10_1.trace /mnt/disk2/FIB_with_port/10_2.txt

and carefully read Valgrind's output afterwards to find the first memory error in your program.

Valgrind is implemented as a virtual CPU, so your program would slow down by a factor of ~30. This is time consuming but should allow you to make progress in troubleshooting the problem.

In addition to Valgrind, you might also want to try enabling debug mode for the libstdc++ containers:

To use the libstdc++ debug mode, compile your application with the compiler flag -D_GLIBCXX_DEBUG. Note that this flag changes the sizes and behavior of standard class templates such as std::vector, and therefore you can only link code compiled with debug mode and code compiled without debug mode if no instantiation of a container is passed between the two translation units.

If your program uses no external libraries then rebuilding the whole thing with -D_GLIBCXX_DEBUG added to CXXFLAGS in the Makefile should work. Otherwise you'd need to know whether C++ containers are passed between components compiled with and without the debug flag.

Valgrind Log Review

I'm surprised that you're using strtok() in a multi-threaded program. strtok() keeps its parsing position in hidden static state, so calls from different threads interfere with each other. Is ComponentTrie::add_prefix() never called from two threads concurrently? While fixing the invalid read by inspecting how strtok() is used on ComponentTrie_david.cpp:99, you might want to replace strtok() with strtok_r() as well.
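
A minimal sketch of what the strtok_r() replacement could look like. The helper name split_components and the "/" delimiter are assumptions for illustration; the original add_prefix() code is not shown in the question. The key difference from strtok() is that the parsing position lives in a caller-owned variable (save), so each thread has its own.

```cpp
#include <string.h>   // strtok_r (POSIX)
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: splits a prefix into its components.
// The "/" default delimiter is an assumption, not taken from the question.
std::vector<std::string> split_components(const std::string& prefix,
                                          const char* delims = "/") {
    assert(delims != nullptr);

    // strtok_r modifies its input, so work on a private, NUL-terminated copy.
    std::vector<char> buf(prefix.begin(), prefix.end());
    buf.push_back('\0');

    std::vector<std::string> comps;
    char* save = nullptr;  // per-call parsing state; nothing is shared between threads
    for (char* tok = strtok_r(&buf[0], delims, &save);
         tok != nullptr;
         tok = strtok_r(nullptr, delims, &save)) {
        comps.push_back(tok);
    }
    return comps;
}
```

Copying the input first also avoids tokenizing a buffer that another thread might be reading, which is one possible source of the invalid reads Valgrind reports.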

Concurrent Access to STL Containers

The standard C++ containers are explicitly documented to not do thread synchronization:

The user code must guard against concurrent function calls which access any particular library object's state when one or more of those accesses modifies the state. An object will be modified by invoking a non-const member function on it or passing it as a non-const argument to a library function. An object will not be modified by invoking a const member function on it or passing it to a function as a pointer- or reference-to-const. Typically, the application programmer may infer what object locks must be held based on the objects referenced in a function call and whether the objects are accessed as const or non-const.

(That's from the GNU libstdc++ documentation, but the C++11 standard specifies essentially the same behavior.) Concurrent modification of std::map and other containers is a serious error and likely the culprit behind the crash. Guard each container with its own pthread_mutex_t or use the OpenMP synchronization mechanisms.
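
As a rough sketch of the OpenMP option, the insertion can be wrapped in a named critical section so that only one thread modifies the map at a time. GuardedChildren and fill_in_parallel are hypothetical names, ComponentTrieNode is left opaque because its definition is not shown in the question, and the example assumes a C++11 compiler (with -fopenmp for actual parallelism).

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

struct ComponentTrieNode;  // opaque here; the real definition is not in the question

// Hypothetical wrapper that serializes insertions into a children map.
class GuardedChildren {
public:
    // Returns true if the component was newly inserted.
    bool insert(const std::string& comp, ComponentTrieNode* node) {
        bool inserted = false;
        // Only one thread at a time may execute this block.
        #pragma omp critical(children_map)
        {
            inserted = children_.insert(std::make_pair(comp, node)).second;
        }
        return inserted;
    }

    // Not synchronized: call only when no thread is inserting.
    std::size_t size() const { return children_.size(); }

private:
    std::map<std::string, ComponentTrieNode*> children_;
};

// Inserts `count` distinct components from an OpenMP parallel loop.
std::size_t fill_in_parallel(GuardedChildren& children, int count) {
    assert(count >= 0);
    #pragma omp parallel for
    for (int i = 0; i < count; ++i) {
        children.insert("comp" + std::to_string(i), nullptr);
    }
    return children.size();
}
```

A named critical section is one global lock, so it serializes insertions into every map that uses it. That is the simplest correct choice; a per-node lock such as the pthread_mutex_t mentioned above would allow more parallelism at the cost of extra memory per node. Lookups that can run while another thread inserts need the same protection.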
