kprobe分析内核kworker占用CPU100%问题总结

kprobe分析内核kworker占⽤CPU100%问题总结
kprobe分析内核kworker占⽤CPU 100%问题总结
Create by Billow.Jen,2020.3.8
前⾔
利⽤linux kernel 动态追踪技术,排查问题本⾝就可能会变成⼀个⾮常有趣的过程,让我们遇到线上的诡异问题就感到兴奋,就仿佛好不容易⼜逮着机会,可以去解⼀道迷⼈的谜题。
⼀、系统问题说明
背景说明
负载均衡模块独⽴主机部署,运⾏BIRD、quagga、nftbl、contiv、ovs等进程,实现⽹络NAT功能。
内核版本:centos7-4.19.8
问题描述
在调⽤nftlb写⼊nft规则的过程中,发现部分主机性能低(因操作系统为centos且内核进⾏了升级,红帽服务不⽀持。)
过程如下:
1、执⾏命令:
time ip netns exec vpc067871207-lbalance curl -H "Key: spidernet" -X POST 127.0.0.1:06787/farms --data '{"farms":[{"name":"b1","family":"ipv4","virt ual-addr":"XX.XX.XX.XX","virtual-ports":"1001","mode":"snat","protocol":"tcp","scheduler":"weight","state":"up","backends":[{"name":"bck01","ip-addr":"192.1 68.0.1","port":"80","weight":"5","mark":"0x00000001","priority":"1","state":"up"}]}]}'
该命令正常应该在0.2ms内完成,但在部分主机上耗时到2、3s。
2、利⽤strace命令跟踪:
抽纸盒strace -tt -T -v -f  -s 1024 -p 2727846
发现耗时在sendmsg(AF_NETLINK…)函数上,有长有短,短的在⼏⼗us,但有四个长在0.5s左右的sendmsg,如:
[外链图⽚转存失败,源站可能有防盗链机制,建议将图⽚保存下来直接上传(img-eAfNbx6U-1584208688223)(en-
resource://database/1982:1)]
3、继续分析主机进程,发现⼀个进程kworker⼀直100%
[外链图⽚转存失败,源站可能有防盗链机制,建议将图⽚保存下来直接上传(img-QxeEl9NA-1584208688224)(en-
resource://database/1984:1)]
4、观察主机性能,nmon如下:[外链图⽚转存失败,源站可能有防盗链机制,建议将图⽚保存下来直接上传(img-ewVOFHNl-1584208688225)(en-resource://database/1990:1)]
此时主机内存、cpu、⽹络、存储都很空闲,但中断⽐正常主机⾼1~2倍。
5、⽤perf⼯具查看kworker耗时
[外链图⽚转存失败,源站可能有防盗链机制,建议将图⽚保存下来直接上传(img-CEyDfmjJ-1584208688226)(en-
resource://database/1988:1)]
6、⽹卡中断,经确认(/proc/interrupts)⽹卡中断分布在每个cpu上,⽹络中断均衡
7、继续ftrace,分析事件:
cd /sys/kernel/debug/tracing
echo wakeup_rt > current_tracer
echo 0 > options/function-trace
echo 1 > events/enable
echo 1 > tracing_on
echo 0 > tracing_max_latency
chrt -f 5 sleep 1
echo 0 > tracing_on
cat trace
结果没有针对性,不具备进⼀步分析条件。初步问题分析完了,下步进⾏内核态深度分析。
⼩结
内核该kworker进程的性能影响了work处理效率,导致内核较慢,响应sendmsg的work延迟,导致curl耗时长。
⼆、内核性能分析、跟踪⼯具简述
ftrace、kprobe、perf、operf/oprofile、systemtap都是跟踪内核的好⼯具,但有所区别。
Linux内核提供的基础设施:
1. tarcepoints => 静态探测点
2. kprobe => 内核态动态探测点(kernel/kprobe.c, example:sample/kprobe)
3. uprobe => ⽤户态动态探测点(kernel/events/uprobe.c)
ftrace是后⾯⼯具的基础,但使⽤⽐较繁琐
perf可图形化展⽰函数耗时,但不能⾃定义跟踪调试。
operf/oprofile依赖vmlinux,并且得带有debug信息。
systemtap依赖太多,kernel-debuginfo、内核编译参数等等,⽣产环境基本不可⽤。
kprobe作为轻量级内核调试⼯具,在诊断内核bug时有着先天独厚的优势,相关其他⼯具,kprobe有如下优点:
1、不⽤更新内核
2、可以以模块的形式加载进内核,⽤完后直接卸载即可,不会对内核造成污染
3、动态跟踪,⾃由构造,灵巧轻便。
经过⼤量实验探索,依赖少并可深⼊⾃定义的,kprobe胜出。
三、利⽤kprobe分析kworker⾏为
调试源码见附录,kprobe使⽤及原理⾃⾏学习,本⽂不做基础介绍,重点在思路、步骤,主要展⽰⽅法论的形成过程。
准备⼯作,说明下kprobe的参数⽰例:
struct file *file = (struct file *)regs->di;
//因为x86的参数传递规则是di,si,dx,cx,r8,r9,所以di就是vfs_write的第⼀个参数。arm默认是r0,r1,r2,r3,相应的取r0
看看kworker在做什么:
[root@04b280305 kprobe]# cat /proc/352560/stack
[<0>] insert_work+0x6e/0xa0
[<0>] __queue_work+0x131/0x3b0
[<0>] queue_work_on+0x28/0x40
[<0>] rht_deferred_worker+0x8b/0x3e0
[<0>] process_one_work+0x179/0x390
[<0>] worker_thread+0x4f/0x3e0
[<0>] kthread+0x105/0x140
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff
可知,rht_deferred_worker是特定任务,需要重点分析。
写kprobe代码对rht_deferred_worker函数加探针,分析rht_deferred_worker事件(/var/log/messages,当时dmesg没记录被覆盖了,所以从messages中直接看):
[root@04b280305 kprobe]# grep rht_deferred_worker /var/log/messages |tail
Mar  9 21:35:14 04b280305 kernel: <rht_deferred_worker> pre_handler: 813f99c0
Mar  9 21:35:14 04b280305 kernel: <rht_deferred_worker> post_handler: 813f99c0
Mar  9 21:35:14 04b280305 kernel: <rht_deferred_worker> pre_handler: 813f99c0
Mar  9 21:35:14 04b280305 kernel: <rht_deferred_worker> post_handler: 813f99c0
Mar  9 21:35:14 04b280305 kernel: <rht_deferred_worker> pre_handler: 813f99c0
看看813f99c0是什么work:
[root@04b280305 kprobe]# cat /proc/kallsyms |grep 813f99c0
ffffffff813f99c0 t rht_deferred_worker
分析rht_deferred_worker源码,得知在重新分配或者收缩失败时,会触发插⼊work:rht_deferred_worker,此处是关键。
static void rht_deferred_worker(struct work_struct *work)
{
struct rhashtable *ht;
struct bucket_table *tbl;
int err = 0;
ht = container_of(work, struct rhashtable, run_work);
mutex_lock(&ht->mutex);
tbl = rht_dereference(ht->tbl, ht);
tbl = rhashtable_last_table(ht, tbl);
if (rht_grow_above_75(ht, tbl))
err = rhashtable_rehash_alloc(ht, tbl, tbl->size * 2);
else if (ht->p.automatic_shrinking && rht_shrink_below_30(ht, tbl))
err = rhashtable_shrink(ht);
else if (tbl->nest)
err = rhashtable_rehash_alloc(ht, tbl, tbl->size);
if (!err)//出错时,返回⾮0,执⾏不到
err = rhashtable_rehash_table(ht);
mutex_unlock(&ht->mutex);
if (err)
schedule_work(&ht->run_work);
/*上⾏是关键,造成了死循环*/
}
确认下是alloc失败还是shrink失败:
写kprobe代码对 rhashtable_rehash_alloc函数加探针,查看结果:
Mar  9 22:50:03 04b280305 kernel: Planted kprobe at rhashtable_rehash_alloc+0x0/0x50
Mar  9 22:50:07 04b280305 kernel: kprobe at rhashtable_rehash_alloc+0x0/0x50 unregistered
只执⾏了⼀次,说明不是alloc的问题,继续分析shrink,rhashtable_shrink不是系统导出符号,不能kprobe,看下源码:
static int rhashtable_shrink(struct rhashtable *ht)
{
struct bucket_table *old_tbl = rht_dereference(ht->tbl, ht);
unsigned int nelems = atomic_read(&ht->nelems);
unsigned int size = 0;
if (nelems)
size = roundup_pow_of_two(nelems * 3 / 2);
if (size < ht->p.min_size)
size = ht->p.min_size;
if (old_tbl->size <= size)
return 0;
if (rht_dereference(old_tbl->future_tbl, ht))
return -EEXIST;
return rhashtable_rehash_alloc(ht, old_tbl, size);
}
写了个test_shirnk的探测函数,⼀安装服务器崩了…
去看看⾼版本内核,从5.1版本解决了此问题:
static void rht_deferred_worker(struct work_struct *work)
{
struct rhashtable *ht;
struct bucket_table *tbl;
int err = 0;
ht = container_of(work, struct rhashtable, run_work);
mutex_lock(&ht->mutex);
tbl = rht_dereference(ht->tbl, ht);
tbl = rhashtable_last_table(ht, tbl);
if (rht_grow_above_75(ht, tbl))
err = rhashtable_rehash_alloc(ht, tbl, tbl->size * 2);
else if (ht->p.automatic_shrinking && rht_shrink_below_30(ht, tbl))
err = rhashtable_shrink(ht);
else if (tbl->nest)
err = rhashtable_rehash_alloc(ht, tbl, tbl->size);
/*此块代码,修复了BUG*/
if (!err || err == -EEXIST) {
int nerr;
nerr = rhashtable_rehash_table(ht);
err = err ?: nerr;
}
mutex_unlock(&ht->mutex);
if (err)
schedule_work(&ht->run_work);
}
BUG说明:
rhashtable: Still do rehash when we get EEXIST
As it stands if a shrink is delayed because of an outstanding
rehash, we will go into a rescheduling loop without ever doing
the rehash.
This patch fixes this by still carrying out the rehash and then
rescheduling so that we can shrink after the completion of the
rehash should it still be necessary.
The return value of EEXIST captures this case and other cases
(e.g., another thread expanded/rehashed the table at the same
time) where we should still proceed with the rehash.
Fixes: da20420 (“rhashtable: Add nested tables”)
英⽂不知所云,我⾃⼰的理解是:
哈希表收缩时(rhashtable_shrink),会去判断该表是否被间接引⽤,如果存在则不缩容,返回-EEXIST。拉紧带
v5.1之前的版本,该返回值会触发重新⽣成⼀个rht_deferred_worker的work,这样就形成了递归,如果间接引⽤⼀直存在则形成死循环,导致CPU 100%。
疲劳值v5.1之后的版本,该返回值会触发重新哈希,若此期间间接引⽤还存在则返回0,不再会触发产⽣新rht_deferred_worker的work,以此来解决BUG。虚拟传真
探测完注意尽快卸载驱动,影响性能,会打印⼤量⽇志,并且可能会把服务器搞崩
细分驱动器
四、打patch,修复BUG
$ diff -up linux-4.19.8-1/lib/rhashtable.c linux-4.19.8-2/lib/rhashtable.c > rht_patch
$ patch -p1 < rht_patch
五、总结回顾
遇到操作系统性能或功能问题,可采⽤如下步骤定位:
1. 通过top⼯具,确定问题进程
2. 使⽤strace分析系统调⽤耗时或阻塞情况
3. 使⽤nmon⼯具,查看系统中断、负载、cpu、内存、IO、net等情况,分析是否正常,初步判断可能存在的问题
4. 使⽤perf⼯具,分析函数栈性能
5. 查询资料、阅读相关源码,初步判断位置
6. 若问题可在测试环境复现,使⽤systemtap⼯具进⾏跟踪分析
7. 若问题在⽣产环境上可遇不可求,使⽤kprobe⼯具进⾏动态分析
8. 定位问题后,查看⾼版本是否修复
9. 若修复则通过kernel官⽹patch升级;未修复时,若⾃⼰可修,则⾃改⾃测,否则git上提交bug
内核调试博⼤精深,本⽂⾮常有局限性,具体问题还得具体分析
附kprobe探测代码
Talk is cheap,show you the code!
kprobe_kworker.c
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/kprobes.h>
#define MAX_SYMBOL_LEN    64
static char symbol[MAX_SYMBOL_LEN] = "rht_deferred_worker";
//rht_deferred_worker可根据需要改成要探测的函数
module_param_string(symbol, symbol, sizeof(symbol), 0644);
/* For each probe you need to allocate a kprobe structure */
static struct kprobe kp = {
.
symbol_name    = symbol,
};
/* kprobe pre_handler: called just before the probed instruction is executed */
static int handler_pre(struct kprobe *p, struct pt_regs *regs)
{
#ifdef CONFIG_X86
pr_info("<%s>,pid = %ld\n",current->comm, current->pid);
struct work_struct *work = (struct work_struct *)regs->di;
pr_info("<%s> pre_handler: %x\n",p->symbol_name,work->func);
#endif
#ifdef CONFIG_ARM64
pr_info("<%s> pre_handler: p->addr = %pF, pc = 0x%lx,"
" pstate = 0x%lx\n",
p->symbol_name, p->addr, (long)regs->pc, (long)regs->pstate);
#endif
/* A dump_stack() here will give a stack backtrace */
return 0;
}
/* kprobe post_handler: called after the probed instruction is executed */
static void handler_post(struct kprobe *p, struct pt_regs *regs,
unsigned long flags)
{
#ifdef CONFIG_X86
struct work_struct *work = (struct work_struct *)regs->di;
pr_info("<%s> post_handler: %x\n",p->symbol_name,work->func);防盗追踪

本文发布于:2024-09-24 09:24:31,感谢您对本站的认可!

本文链接:https://www.17tex.com/tex/4/343260.html

版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系,我们将在24小时内删除。

标签:内核   问题   分析
留言与评论(共有 0 条评论)
   
验证码:
Copyright ©2019-2024 Comsenz Inc.Powered by © 易纺专利技术学习网 豫ICP备2022007602号 豫公网安备41160202000603 站长QQ:729038198 关于我们 投诉建议