通过Android trace文件分析死锁ANR实例过程

时间:2022-03-30 07:23:43

对于从事android开发的人来说,遇到anr(application not responding)是比较常见的问题。一般情况下,如果有anr发生,系统都会在/data/anr/目录下生成trace文件,通过分析trace文件,可以定位产生anr的原因。产生anr的原因有很多,比如cpu使用过高、事件没有得到及时的响应、死锁等,下面将通过一次因为死锁导致的anr问题,来说明如何通过trace文件分析anr问题。

对应的部分trace文件内容如下:
"powermanagerservice" prio=5 tid=24 monitor
| group="main" scount=1 dscount=0 obj=0x41dd0eb0 self=0x5241b218
| systid=567 nice=0 sched=0/0 cgrp=apps handle=1380038664
| state=s schedstat=( 6682116007 11324451214 33313 ) utm=450 stm=219 core=1
at com.android.server.am.activitymanagerservice.broadcastintent(activitymanagerservice.java:~13045)
- waiting to lock <0x41a874a0> (a com.android.server.am.activitymanagerservice) held by tid=12 (android.server.serverthread)
at android.app.contextimpl.sendbroadcast(contextimpl.java:1144)
at com.android.server.power.powermanagerservice$displayblankerimpl.unblankalldisplays(powermanagerservice.java:3442)
at com.android.server.power.displaypowerstate$photonicmodulator$1.run(displaypowerstate.java:456)
at android.os.handler.handlecallback(handler.java:800)
at android.os.handler.dispatchmessage(handler.java:100)
at android.os.looper.loop(looper.java:194)
at android.os.handlerthread.run(handlerthread.java:60)
"binder_b" prio=5 tid=85 monitor
| group="main" scount=1 dscount=0 obj=0x42744770 self=0x58329e88
| systid=3700 nice=-20 sched=0/0 cgrp=apps handle=1471424616
| state=s schedstat=( 1663727513 2044643318 6806 ) utm=132 stm=34 core=1
at com.android.server.power.powermanagerservice$displayblankerimpl.tostring(powermanagerservice.java:~3449)
- waiting to lock <0x41a7e420> (a com.android.server.power.powermanagerservice$displayblankerimpl) held by tid=24 (powermanagerservice)
at java.lang.stringbuilder.append(stringbuilder.java:202)
at com.android.server.power.powermanagerservice.dump(powermanagerservice.java:3052)
at android.os.binder.dump(binder.java:264)
at android.os.binder.ontransact(binder.java:236)
at android.os.ipowermanager$stub.ontransact(ipowermanager.java:373)
at android.os.binder.exectransact(binder.java:351)
at dalvik.system.nativestart.run(native method)
"android.server.serverthread" prio=5 tid=12 monitor
| group="main" scount=1 dscount=0 obj=0x41a76178 self=0x507837a8
| systid=545 nice=-2 sched=0/0 cgrp=apps handle=1349936616
| state=s schedstat=( 15368096286 21707846934 69485 ) utm=1226 stm=310 core=0
at com.android.server.power.powermanagerservice.isscreenoninternal(powermanagerservice.java:~2529)
- waiting to lock <0x41a7e2e8> (a java.lang.object) held by tid=85 (binder_b)
at com.android.server.power.powermanagerservice.isscreenon(powermanagerservice.java:2522)
at com.android.server.wm.windowmanagerservice.sendscreenstatustoclientslocked(windowmanagerservice.java:7749)
at com.android.server.wm.windowmanagerservice.seteventdispatching(windowmanagerservice.java:7628)
at com.android.server.am.activitymanagerservice.updateeventdispatchinglocked(activitymanagerservice.java:8083)
at com.android.server.am.activitymanagerservice.wakingup(activitymanagerservice.java:8077)
at com.android.server.power.notifier.sendwakeupbroadcast(notifier.java:474)
at com.android.server.power.notifier.sendnextbroadcast(notifier.java:455)
at com.android.server.power.notifier.access$700(notifier.java:62)
at com.android.server.power.notifier$notifierhandler.handlemessage(notifier.java:600)
at android.os.handler.dispatchmessage(handler.java:107)
at android.os.looper.loop(looper.java:194)
at com.android.server.serverthread.run(systemserver.java:1328)
从trace文件看,是因为tid为24的线程等待一个tid为12的线程持有的锁,tid为12的线程等待一个tid为85的线程持有的锁,而tid为85的线程确等待一个tid为24的线程持有的锁,导致了循环等待的现象,对应的trace文件的语句如下:
tid 24:- waiting to lock <0x41a874a0> (a com.android.server.am.activitymanagerservice) held by tid=12 (android.server.serverthread)
tid 12: - waiting to lock <0x41a7e2e8> (a java.lang.object) held by tid=85 (binder_b)
tid 85:- waiting to lock <0x41a7e420> (a com.android.server.power.powermanagerservice$displayblankerimpl) held by tid=24 (powermanagerservice)
既然是死锁,那么先看各线程都有那些锁。
先看tid=24的线程的栈顶,activitymanagerservice的broadcastintent函数代码如下:

复制代码 代码如下:


public final int broadcastintent(iapplicationthread caller,
intent intent, string resolvedtype, iintentreceiver resultto,
int resultcode, string resultdata, bundle map,
string requiredpermission, boolean serialized, boolean sticky, int userid) {
enforcenotisolatedcaller("broadcastintent");
synchronized(this) {
intent = verifybroadcastlocked(intent);
final processrecord callerapp = getrecordforapplocked(caller);
final int callingpid = binder.getcallingpid();
final int callinguid = binder.getcallinguid();
final long origid = binder.clearcallingidentity();
int res = broadcastintentlocked(callerapp,
callerapp != null ? callerapp.info.packagename : null,
intent, resolvedtype, resultto,
resultcode, resultdata, map, requiredpermission, serialized, sticky,
callingpid, callinguid, userid);
binder.restorecallingidentity(origid);
return res;
}


可以看到tid=24需要activitymanagerservice这个锁。再看tid=12线程的栈顶,powermanagerservice的isscreenoninternal函数代码如下:

复制代码 代码如下:


private boolean isscreenoninternal() {
synchronized (mlock) {
return !msystemready
|| mdisplaypowerrequest.screenstate != displaypowerrequest.screen_state_off;
}
}


可以看到需要powermanagerservice的mlock这个锁。最后看tid=85线程的栈顶,同样在powermanagerservice里面,内部类displayblankerimpl的tostring函数:

复制代码 代码如下:


public string tostring() {
synchronized (this) {
return "blanked=" + mblanked;
}
}


这是在内部类displayblankerimpl里面实现的,所以需要displayblankerimpl这个锁。
对应的表格如下:
通过Android trace文件分析死锁ANR实例过程 
表一 各线程等待的锁情况
从表一来看,没有出现死锁现象,似乎并不是我们所想的那样。难道不是死锁?开始有点小怀疑自己了,难道别的原因导致的。也许只看调用堆栈的顶端可能不行,栈顶只能看出各线程需要的锁,不能仅看自己要什么吧!一味索取可不好!人不是这样做的!看一下整个的堆栈调用流程,看看自己拥有了那些锁。
跟踪tid=24线程的堆栈,在powermanagerservice内部类displayblankerimpl的unblankalldisplays函数中持有锁:

复制代码 代码如下:


public void unblankalldisplays() {
synchronized (this) {
nativesetautosuspend(false);
nativesetinteractive(true);
mdisplaymanagerservice.unblankalldisplaysfrompowermanager();
mblanked = false;
///m: add for tvout and hdmi
mtvout.tvoutpowerenable(true);
mhdmi.hdmipowerenable(true);
///@}
if (debug) {
slog.d(tag_p, "unblankalldisplays out ...");
}
if (mbootcompleted) {
intent intent = new intent(action_lock_screen_show);
mcontext.sendbroadcast(intent);
}
}
}


最后发送广播的代码,是我们自己添加的。根据unblankalldisplays函数和broadcastintent函数,可以看到tid=24的线程此时持有了displayblankerimpl锁(unblankalldisplays),等待activitymanagerservice锁(broadcastintent)释放。
同样,跟踪tid=12线程的堆栈,在activitymanagerservice的wake_up函数中持有锁:

复制代码 代码如下:


public void wakingup() {
if (checkcallingpermission(android.manifest.permission.device_power)
!= packagemanager.permission_granted) {
throw new securityexception("requires permission "
+ android.manifest.permission.device_power);
}
synchronized(this) {
slog.i(tag, "wakingup");
mwenttosleep = false;
updateeventdispatchinglocked();
comeoutofsleepifneededlocked();
}
}


根据wakingup函数和isscreenoninternal函数,可以看到tid=12的线程持有activitymanagerservice锁(wakingup),等待powermanagerservice.mlock锁(isscreenoninternal)。到这,似乎看到了希望,迷雾要拨开了,有点小自信是死锁导致的,但还不能最终下结论。
一鼓作气,跟踪tid=85线程的堆栈,在powermanagerservice的dump有持有锁的操作:
protected void dump(filedescriptor fd, printwriter pw, string[] args) {
....
synchronized (mlock) {
...
}
根据tostring函数和dump函数,可以看到tid=85线程此时持有powermanagerservice.mlock锁(dump),需要displayblankerimpl(tostring)。
似乎谜底已经揭晓了,如果你还没有看出来(其实我也没看出来),来个表看看吧!
通过Android trace文件分析死锁ANR实例过程 
表二 各线程锁的情况
清楚了吗?多么清晰的循环等待呀!死锁都死的这么完美,还是图表效果好,看来有时候在纸上画画还是有用的!