TuxOnIce + BFS + ath9k == :(

Discussion:

Oleksandr Natalenko

2014-06-11 19:34:33 UTC

Hello, Nigel and Con!

As you both know, some users have run into serious errors while using
TuxOnIce and/or BFS, and you've fixed all of them successfully. I really hope
you can help once more by joining your efforts to make your users happier.

I'll try to describe what I've run into. Both 3.14 and 3.15 kernels are
affected, and here are the results of my
boot-hang-boot_again-change_smth-hang experiment.

First case. TOI enabled, BFS enabled, ath9k module loaded. Boots OK.
Hibernates OK. After resume the system may hang completely, so that even the
mouse cursor doesn't move. Alternatively, the system may stay very
unresponsive, with ksoftirqd stuck at 100–110% CPU, loadavg climbing (to 27
and beyond) and lots of processes in D state. I've also noticed that the
system hangs immediately if I run "sudo perf top -U" before hibernation.

Second case. TOI disabled, BFS enabled, ath9k module loaded. Boots OK.
Hibernates OK. Resumes OK.

Third case. TOI enabled, BFS disabled, ath9k module loaded. Boots OK.
Hibernates OK. Resumes OK.

Fourth case. TOI enabled, BFS enabled, ath9k module *unloaded*. Boots OK.
Hibernates OK. Resumes OK. After resume everything works OK until I modprobe
the ath9k module again. The system then goes into the state described in the
first case, except that I was able to capture the "sudo perf top -U" output
(please see picture [1]).
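
For anyone trying to reproduce this, the first-case symptoms (climbing
loadavg, processes piling up in D state) can be recorded with a few lines of
shell. This is only a sketch: the helper name count_d_state is made up for
illustration, and the live usage assumes a procps-style ps that accepts
"-eo state=,comm=".

```shell
#!/bin/sh
# count_d_state is a hypothetical helper, not part of any tool named in this
# thread. It reads "STATE COMMAND" lines on stdin and prints how many of them
# are in uninterruptible sleep (state D).
count_d_state() {
    awk '$1 ~ /^D/ { n++ } END { print n + 0 }'
}

# Live usage on the affected machine (assumes procps ps and a mounted /proc):
#   ps -eo state=,comm= | count_d_state
#   cut -d ' ' -f 1-3 /proc/loadavg    # 1-, 5- and 15-minute load averages
```

Running both commands before hibernation and again after resume gives
numbers that can be compared across the four cases.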

If I can provide more info, please let me know. I really hope this bug is
fixable.

[1]
http://habrastorage.org/files/26b/da9/4b1/26bda94b113a4e6b979f1d6c0e52b1f8.jpg

--
Oleksandr post-factum Natalenko, MSc
pf-kernel community
https://natalenko.name/

Oleksandr Natalenko

2014-06-14 21:01:55 UTC

I'd like to add more info.

I've recompiled the kernel with debug info, preemption debugging and RCU
stall detection enabled.

This follows the fourth case from my previous letter. Modprobing ath9k right
after resuming causes the same behavior, except that the "sudo perf top -U"
output has changed [1]. As you can see, check_preemption_disabled now appears
and consumes most of the CPU time.

Secondly, dmesg reports RCU stalls [2].

Finally, removing the ath9k module (with modprobe -r) while these stalls are
occurring immediately hangs the machine completely.

Hope this helps.

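[1]
http://habrastorage.org/files/bff/5ce/c90/bff5cec9020f4ec7a13312b308b4e571.jpg

[2]
http://habrastorage.org/files/356/8ff/627/3568ff6278744a949e68b39fad4a44df.jpg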

Oleksandr Natalenko

2014-06-15 10:30:24 UTC

Another update from me.

I've enabled locking debug and got the following deadlock message ([1] -
[4]). With locking debug enabled, the RCU stall messages disappeared.

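[1]
http://habrastorage.org/files/34d/841/976/34d84197673f4533b102d5982a33ffe6.jpg

[2]
http://habrastorage.org/files/ee9/67f/662/ee967f662f794b8284cdf0ebf85d46d4.jpg

[3]
http://habrastorage.org/files/274/666/842/2746668423b2482a9fef8dd2bf81b3e6.jpg

[4]
http://habrastorage.org/files/cef/0e1/32d/cef0e132d8aa42e69caeb5efe11b637c.jpg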

Oleksandr Natalenko

2014-07-12 13:21:51 UTC

More info from me.

I've discovered that the issue appeared somewhere between 3.13 and 3.14, so
I reverted all ath9k changes to their 3.13 state, and that didn't help. So
it's obviously not an ath9k issue.

Oleksandr Natalenko

2014-07-24 17:58:55 UTC

It seems that this issue can be fixed by turning on the ath9k power-saving
feature (which is off by default):

modprobe ath9k ps_enable=1
iw dev wlp1s0 set power_save on
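
A module option passed on the modprobe command line applies only to that
load of the module. A minimal sketch for making ps_enable=1 persistent,
assuming the system reads /etc/modprobe.d/*.conf; the helper name
write_ath9k_ps_conf and the file name ath9k-ps.conf are illustrative and not
part of the original report. The iw power_save setting is separate and is
not covered here.

```shell
#!/bin/sh
# write_ath9k_ps_conf is a hypothetical helper. It writes a modprobe.d
# fragment that sets ps_enable=1 whenever ath9k is loaded. The target path is
# a parameter so the function can be tried on a scratch file first.
write_ath9k_ps_conf() {
    conf_path="$1"
    printf 'options ath9k ps_enable=1\n' > "$conf_path"
}

# Usage (as root; the file name is an assumption):
#   write_ath9k_ps_conf /etc/modprobe.d/ath9k-ps.conf
```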

My machine has survived at least 2 hibernate/resume cycles so far, and I'm
going to test it more.

Nigel, could this be related to some unhandled resume step in TOI regarding
device drivers (or wireless device drivers)?

Oleksandr Natalenko

2014-07-30 12:18:32 UTC

That could be the reason: kernfs was introduced between 3.13 and 3.14, which
is when this bug first appeared.

Post by Con Kolivas
Hi Oleksandr
All those screenshots you provided look remarkably similar to this bug just
reported with mainline:
http://marc.info/?l=linux-acpi&m=140670139027145
So this may be an upstream bug after all. We'll see if they come up with a
solution for this and if it helps your test case.
Regards,
Con

Con Kolivas

2014-07-30 11:27:16 UTC

Hi Oleksandr

All those screenshots you provided look remarkably similar to this bug just
reported with mainline:

http://marc.info/?l=linux-acpi&m=140670139027145

So this may be an upstream bug after all. We'll see if they come up with a
solution for this and if it helps your test case.

Regards,
Con

--
-ck

Oleksandr Natalenko

2014-08-06 14:39:31 UTC

Well, now at least I can say it's not a TOI issue, as I've caught this bug
with TOI disabled and BFS enabled.

Now I've disabled BFS. Will test more.

Oleksandr Natalenko

2014-08-22 16:13:14 UTC

OK, we've got this fixed, and this was a BFS issue.

Details here: http://ck-hack.blogspot.com/2014/08/bfs-450-316-ck1.html

@Nigel: sorry for bothering you :).

Nigel Cunningham

2014-08-23 23:06:01 UTC

Hi.

Post by Oleksandr Natalenko
OK, we've got this fixed, and this was a BFS issue.
Details here: http://ck-hack.blogspot.com/2014/08/bfs-450-316-ck1.html
@Nigel: sorry for bothering you :).

Not a problem at all!

Nigel

Oleksandr Natalenko

2014-09-06 11:38:29 UTC

If TOI parts are compiled as modules:

===
CONFIG_TOI_CORE=y
CONFIG_TOI_FILE=m
CONFIG_TOI_SWAP=m
CONFIG_TOI_CRYPTO=m
CONFIG_TOI_USERUI=m
===

we get the following error:

===
ERROR: "resume_file" [kernel/power/tuxonice_file.ko] undefined!
ERROR: "resume_file" [kernel/power/tuxonice_bio.ko] undefined!
scripts/Makefile.modpost:90: recipe for target '__modpost' failed
make[1]: *** [__modpost] Error 1
===

So just export the resume_file variable from hibernate.c to fix the issue.

This patch can also be fetched from a gist [1] or from my pf-kernel tree [2].

[1] https://gist.github.com/pfactum/84ef6292e088a11b1823
[2] https://github.com/pfactum/pf-kernel/commit/899119c90268a6043862a0781134f68444681de1
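
After applying the patch and rebuilding, one way to confirm the export is to
look for resume_file in the Module.symvers file that the kernel build writes
at the top of the build tree. The sketch below assumes the tab-separated
layout with the symbol name in the second column; the helper name
symbol_exported is hypothetical.

```shell
#!/bin/sh
# symbol_exported is a hypothetical helper. It exits 0 if SYMBOL appears in
# the second column of a Module.symvers-style file, and non-zero otherwise.
# Assumed layout per line: CRC <tab> symbol <tab> module <tab> export type
symbol_exported() {
    sym="$1"
    symvers="$2"
    awk -F '\t' -v s="$sym" '$2 == s { found = 1 } END { exit !found }' "$symvers"
}

# Usage from the top of a built kernel tree:
#   symbol_exported resume_file Module.symvers && echo "resume_file is exported"
```

If the check fails after a full rebuild, the modpost errors quoted above
should be expected to reappear for the modular TOI parts.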

--
Oleksandr post-factum Natalenko, MSc
pf-kernel community
https://natalenko.name/