[Haifux] Login console freezes: Eli's weekly riddle
Eli Billauer
eli at billauer.co.il
Sun Nov 7 01:44:12 MSK 2010
Hello Shachar.
I suppose great minds think alike. ;) This is more or less what I've
already done, and with some correspondence on LKML, I've managed to nail
the exact place where the kernel goes to sleep for 30 seconds taking the
lock with it. Sorry for not updating you, but my thought was to
summarize the issue when the patch is out.
So a kernel bug it is indeed. It wasn't the simple thing I thought,
though. It turns out that under some conditions, a close() operation on
a serial port may require waiting until all written data is flushed. In
these cases, it's normal behavior for the close() call to take up to
30 seconds (the timeout) to return. In particular, this happens when
modem-manager probes serial ports that don't really exist.
Now, the TTY locking scheme was reworked between kernels 2.6.35 and
2.6.36, and the maintainer overlooked this possibility, so the lock
remained held at the crucial point where the data is being drained. As
a result, all system calls related to TTYs and PTYs (that is, serial
ports and *cough* virtual terminals, and I suppose other keyboard input)
were frozen waiting for the big TTY mutex every time modem-manager chose
to close a port, which it does a few times after a boot.
The LKML thread is at http://lkml.org/lkml/2010/11/2/314 (for some
reason, my postings break the threading every time, even though I
reply-to-all).
This is not a trivial thing to solve, since it seems like nobody really
knows what assumptions have been made about the two mutexes involved, so
it's not clear whether one can release them just before the possibly
long sleep. I suppose the guy who manipulated the locks will come up
with a fix sooner or later. I've reverted to 2.6.35 anyhow.
Ah, and by the way, what started this thing was an effort to stop using
the big kernel lock. As a matter of fact, from 2.6.36, the big kernel
lock is no longer used in core kernel code. (Hurray...?)
So that's the way things stand. I can't say I was very encouraged by
this little trip to kernel-land, and I can only hope that those who
maintain the software controlling my car's airbag are doing so with a
deeper understanding of what each software component does. Don't tell
me. They probably don't. Only they don't discuss their issues over a
public mailing list.
Eli
Shachar Shemesh wrote:
> On 29/10/10 17:04, Eli Billauer wrote:
>>
>> /* find a device that is not in use. */
>> printk(KERN_ALERT "34: pty_open to lock\n");
>> tty_lock();
>> printk(KERN_ALERT "35: pty_open locked\n");
>> <snip>
>>
> Set a global variable right before the tty_lock call, and clear it
> immediately after. Inside tty_lock (and probably tty_unlock too), set
> up many printks conditional on this global variable being set. Print
> any relevant identifier you can find (such as the device ID). This
> should help you find out WHY the device takes so long to lock, and
> hopefully, who the contention is with.
>
> Also, in tty_lock, save to a global variable who is holding the lock,
> and print that variable from the code above.
>
> Shachar
> --
> Shachar Shemesh
> Lingnu Open Source Consulting Ltd.
> http://www.lingnu.com
>
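
For the record, the instrumentation suggested above can be sketched in
user space roughly as follows. This is an illustration with a pthread
mutex, not the kernel's tty_lock(): traced_lock(), traced_unlock(),
trace_enabled and lock_holder are all hypothetical names, and
fprintf() stands in for printk().

```c
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t traced_mutex = PTHREAD_MUTEX_INITIALIZER;

/* The "global variable" set right before the lock call of interest
 * and cleared right after it. */
static _Atomic int trace_enabled = 0;

/* Who currently holds the lock, or NULL if nobody does. */
static _Atomic(const char *) lock_holder = NULL;

/* Take the lock; when tracing is on, report who holds it first. */
void traced_lock(const char *who)
{
    if (trace_enabled) {
        const char *holder = lock_holder;

        fprintf(stderr, "%s: waiting for lock, held by %s\n",
                who, holder ? holder : "nobody");
    }

    pthread_mutex_lock(&traced_mutex);
    lock_holder = who;

    if (trace_enabled)
        fprintf(stderr, "%s: got the lock\n", who);
}

/* Release the lock and forget the holder. */
void traced_unlock(void)
{
    lock_holder = NULL;
    pthread_mutex_unlock(&traced_mutex);
}

/* Name of the current holder, or NULL if the lock is free. */
const char *current_holder(void)
{
    return lock_holder;
}
```

The caller sets trace_enabled to 1 just before the traced_lock() call
it is curious about and back to 0 right after, so only that call site
produces output. Build with -pthread.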
--
Web: http://www.billauer.co.il