[Haifux] Login console freezes: Eli's weekly riddle

Eli Billauer eli at billauer.co.il
Sun Nov 7 01:44:12 MSK 2010


Hello Shachar.


I suppose great minds think alike. ;) This is more or less what I've 
already done, and with some correspondence on LKML, I've managed to nail 
the exact place where the kernel goes to sleep for 30 seconds taking the 
lock with it. Sorry for not updating you, but my thought was to 
summarize the issue when the patch is out.


So a kernel bug it is indeed. It wasn't the simple thing I thought, 
though. It turns out that under some conditions, a close() operation on 
a serial port may have to wait until all written data is flushed. In 
these cases, it's normal behavior for the close() call to take up to 
30 seconds (the timeout) to return. In particular, this happens when 
modem-manager probes serial ports that don't really exist.
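
A minimal user-space sketch of how one might observe this delay, assuming 
a POSIX system. The function names (timed_close, probe_close_delay) and 
the "AT\r" payload are made up for illustration; this is not what 
modem-manager actually runs.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

/* Close fd and return how long the close() call itself took, in
 * seconds. Returns -1.0 if close() fails. */
double timed_close(int fd)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    if (close(fd) != 0)
        return -1.0;
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Hypothetical probe: open a device, write a few bytes, then time the
 * close(). Returns -1.0 if the device cannot be opened. */
double probe_close_delay(const char *path)
{
    int fd;

    assert(path != NULL);

    /* O_NONBLOCK keeps open() itself from waiting for a carrier. */
    fd = open(path, O_WRONLY | O_NOCTTY | O_NONBLOCK);
    if (fd < 0)
        return -1.0;

    /* Assumed payload; only the time spent in close() matters here. */
    if (write(fd, "AT\r", 3) < 0) {
        /* Ignored on purpose: the write result is not what we measure. */
    }

    return timed_close(fd);
}
```

Calling probe_close_delay() on a serial port with nothing attached (for 
example /dev/ttyS1, if it exists) would presumably show the long delay on 
an affected system, while on /dev/null it should return almost at once.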


Now, work was done on the TTYs' locking scheme between kernels 2.6.35 
and 2.6.36, and the maintainer overlooked this possibility, and hence 
had no problem with the lock being held at that crucial point when the 
data is being drained. As a result, all system calls related to TTYs and 
PTYs (that is, serial ports and *cough* virtual terminals, and I suppose 
other keyboard input) were frozen waiting for the big TTY mutex every 
time modem-manager chose to close a port, which it does a few times 
after a boot.
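
The effect can be imitated in user space with pthreads. This is only an 
analogy, not the kernel code: big_lock stands in for the big TTY mutex, 
slow_close() for the close() path that sleeps while draining, and all 
names here are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>
#include <time.h>

/* Stand-in for the big TTY mutex. */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

struct holder_args {
    unsigned hold_ms;   /* how long the "drain" sleeps */
    sem_t locked;       /* posted once the holder owns big_lock */
};

static void sleep_ms(unsigned ms)
{
    struct timespec ts = { ms / 1000, (long)(ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

/* Plays the close() path: takes the lock, then sleeps while holding it. */
static void *slow_close(void *p)
{
    struct holder_args *a = p;

    pthread_mutex_lock(&big_lock);
    sem_post(&a->locked);
    sleep_ms(a->hold_ms);           /* the long sleep, lock still held */
    pthread_mutex_unlock(&big_lock);
    return NULL;
}

/* Returns how long (in ms) an unrelated caller waits for big_lock
 * while slow_close() is sleeping with it. */
double blocked_time_ms(unsigned hold_ms)
{
    struct holder_args a = { .hold_ms = hold_ms };
    struct timespec t0, t1;
    pthread_t th;
    int rc;

    rc = sem_init(&a.locked, 0, 0);
    assert(rc == 0);
    rc = pthread_create(&th, NULL, slow_close, &a);
    assert(rc == 0);
    sem_wait(&a.locked);            /* the holder now owns the lock */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_mutex_lock(&big_lock);  /* e.g. someone opening a pty */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    pthread_mutex_unlock(&big_lock);

    pthread_join(th, NULL);
    sem_destroy(&a.locked);
    return (double)(t1.tv_sec - t0.tv_sec) * 1e3
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e6;
}
```

Build with -pthread. The unrelated caller waits for roughly as long as 
the holder sleeps, which is the whole problem in miniature.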


The LKML thread is at http://lkml.org/lkml/2010/11/2/314 (for some 
reason, my postings break the threading every time, even though I 
reply-to-all).


This is not a trivial thing to solve, since it seems like nobody really 
knows what assumptions have been made about the two mutexes involved, so 
it's not so clear if one can release them just before the possible long 
sleep. I suppose the guy who manipulated the locks will come up with a 
fix sooner or later. I've reverted to 2.6.35 anyhow.


Ah, and by the way, what started this thing was an effort to stop using 
the big kernel lock. As a matter of fact, from 2.6.36, the big kernel 
lock is no longer used in core kernel code. (Hurray...?)


So that's the way things stand. I can't say I was very encouraged by 
this little trip to kernel-land, and I can only hope that those who 
maintain the software controlling my car's airbag are doing so with a 
deeper understanding of what each software component does. Don't tell 
me. They probably don't. Only they don't discuss their issues over a 
public mailing list.

   Eli



Shachar Shemesh wrote:

> On 29/10/10 17:04, Eli Billauer wrote:
>>
>>     /* find a device that is not in use. */
>>     printk(KERN_ALERT  "34: pty_open to lock\n");
>>     tty_lock();
>>     printk(KERN_ALERT  "35: pty_open locked\n");
>> <snip>
>>
> Set a global variable right before the tty_lock call, and clear it 
> immediately after. Inside tty_lock (and probably tty_unlock too), set 
> up many printks conditional on this global variable being set. Print 
> any relevant identifier you can find (such as the device ID). This 
> should help you find out WHY the device takes so long to lock and, 
> hopefully, who the contention is with.
>
> Also, in tty_lock, save to a global variable who is holding the lock, 
> and print that variable from the code above.
>
> Shachar
> -- 
> Shachar Shemesh
> Lingnu Open Source Consulting Ltd.
> http://www.lingnu.com
>   
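
The idea of recording who holds the lock can be sketched in user space 
as follows. This is an illustration with a pthread mutex, not the kernel's 
tty_lock(); traced_lock, traced_unlock, traced_owner and lock_owner are 
hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

/* Debug only: label of the current holder, NULL when the lock is free. */
static const char *volatile lock_owner;

/* Take the lock; if it is contended, report who holds it first. */
void traced_lock(const char *who)
{
    assert(who != NULL);

    if (pthread_mutex_trylock(&big_lock) != 0) {
        /* Racy snapshot, but good enough for a debug trace. */
        const char *holder = lock_owner;

        fprintf(stderr, "%s: waiting, lock held by %s\n",
                who, holder ? holder : "(unknown)");
        pthread_mutex_lock(&big_lock);
    }
    lock_owner = who;   /* written only while holding the lock */
}

void traced_unlock(void)
{
    lock_owner = NULL;
    pthread_mutex_unlock(&big_lock);
}

/* Returns the label of the current holder, or NULL if nobody holds it. */
const char *traced_owner(void)
{
    return lock_owner;
}
```

A caller would pass a label such as "pty_open", so that a blocked caller 
prints the name of whoever is sitting on the lock.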


-- 
Web: http://www.billauer.co.il
