"As some of the latency junkies on lkml already know, commit 8e3e076 in v2.6.26-rc2 removed the preemptible BKL feature and made the Big Kernel Lock a spinlock and thus turned it into non-preemptible code again. This commit returned the BKL code to the 2.6.7 state of affairs in essence," began Ingo Molnar. He noted that this had a very negative effect on the real time kernel efforts, adding that Linux creator Linus Torvalds indicated the only acceptable way forward was to completely remove the BKL. Ingo explained:
"This task is not easy at all. 12 years after Linux has been converted to an SMP OS we still have 1300+ legacy BKL using sites. There are 400+ lock_kernel() critical sections and 800+ ioctls. They are spread out across rather difficult areas of often legacy code that few people understand and few people dare to touch. It takes top people like Alan Cox to map the semantics and to remove BKL code, and even for Alan (who is doing this for the TTY code) it is a long and difficult task."
Ingo went on to describe how the BKL works, how it differs from other locking mechanisms, and why this complicates removing it permanently from the kernel. He noted that the various dependencies of the lock are lost in the haze of 15 years of code changes, "all this has built up to a kind of Fear, Uncertainty and Doubt about the BKL: nobody really knows it, nobody really dares to touch it and code can break silently and subtly if BKL locking is wrong." He then suggested "changing the rules of the game", creating a "kill-the-BKL" branch which "turns the BKL into an ordinary albeit somewhat big mutex, with a quirky lock/unlock interface called 'lock_kernel()' and 'unlock_kernel()'."
"Any problems beyond that point are ones you need to take up with Nvidia as they have all the source code but we don't have theirs."
"I'm so glad you have nothing better to do than troll, if you actually wrote code I'd be worried it might get into something people used."
"Repeatedly posting crud does not make it right."
"Thats a very arrogant viewpoint. I don't have to be a TV engineer to use my television. Distributions should be providing sensible defaults out of the box. The kernel already provides them the mechanisms."
"I have difficulty constructing many scenarios where its useful but it appears valid providing you can tightly control file renaming, which is very very questionable."
"There is a ton of evidence both in computing and outside of it which shows that poor security can be very much worse than no security at all. In particular stuff which makes users think they are secure but is worthless is very dangerous indeed."
"Frankly I don't care about apparmor, I don't see it as a serious project. Smack is kind of neat but looks like a nicer way to specify selinux rules."
"Incidentally i was thinking about using KVM for automated testing. Important pieces of hardware should get an in-KVM simulator/emulator, that way developers who do not own that hardware can do functionality testing too," Ingo Molnar suggested during a thread discussing a SCSI driver bug fix. Linus Torvalds was originally unimpressed by the idea:
"Using emulators to test device drivers is almost certain to be pointless. The problem with device drivers tends to be timing issues, odd hardware interactions, and lots of strange (and sometimes undocumented) behaviour and dependencies (eg things like 'you have to wait 50us after setting the reset bit until the hardware has actually reset'). These are all things that you'd generally not catch in emulation - because the emulation by necessity is only going to be a very weak picture of the real thing."
Alan Cox countered, "for some things. I do it a bit because you can use it to fake failures that are tricky to do in the real world. It won't tell you the driver works but its surprisingly good for testing for races (forcing IRQ delivery at specific points), buggy hardware you don't posses, and things like media failures and timeouts your real hardware refuses to do." Linus acquiesced conditionally, "I do agree that you likely find bugs, even if quite often it's exactly because the behaviour is something that will never happen on real hardware," then acknowledged previous debugging efforts by Alan, "but failure testing is very useful - I forget who it was who debugged some driver by taking a CD and just scratching it mercilessly to induce read errors ;)" Ingo added, "something like that wont enable 100% coverage (or even reasonable coverage for most hardware), so it's no replacement for actual hard testing, but it could push out the domain of minimally tested code quite a bit and increase the quality of the kernel."
"15 partitions (at least for sd_mod devices) are too few," Jan Engelhardt suggested along with a patch to try and make the mounting of an unlimited number of partitions possible. H. Peter Anvin proposed as an alternative, "now when we have 20-bit minors, can't we simply recycle some of the higher bits for additional partitions, across the board? 63 partitions seem to have been sufficient; at least I haven't heard anyone complain about that for 15 years."
Alan Cox explained, "this was proposed ages ago. Al Viro vetoed sparse minors and it has been stuck this way ever since. If you have > 15 partitions use device mapper for it. I'd prefer it fixed but it's arguable that device mapper is the right way to punt all our partitioning to userspace".
A bug report filed by Ingo Molnar regarding a procfs crash in the recently released 2.6.23-rc9 kernel was quickly tracked down by Linus Torvalds as a compiler bug. The bug was ultimately determined to be from a compiler optimization generated with an older version of GCC. Ingo was skeptical at first, "it's 4.0.2. Not the latest & greatest but I've been using it for 2 years and this would be the first time it miscompiles a 32-bit kernel out of tens of thousands of successful kernel bootups." Linus replied, "I am 100% sure. I can look at the disassembly, and point to the fact that your Oops happens on code that is simply totally bogus." He continued on to offer an interesting review of the crash, explaining line by line what should have been generated versus what actually was, causing the crash. In the end, Ingo switched to a distribution compiled GCC 4.1.2 and confirmed that the crash went away, "so you are completely right, it's a compiler bug in 4.0.2."
During the thread, Linus suggested that the optimization made by the compiler wasn't "legal", to which Alan Cox retorted, "pedant: valid. Almost all optimizations are legal, nobody has yet written laws about compilers. Sorry but I'm forever fixing misuse of the word 'illegal' in printks, docs and the like and it gets annoying after a bit." Linus playfully responded, "heh. When I'm ruler of the universe, it *will* be illegal. I'm just getting a bit ahead of myself." When asked how long until he expected to be ruler, Linus added, "I'm working on it, I'm working on it. I'm just as frustrated as you are. It turns out to be a non-trivial problem."
"If you have the ability to use chroot() you are root. If you are root you can walk happily out of any chroot by a thousand other means," Alan Cox explained during a thread that suggested chroot was broken in Linux. It was further pointed out that this was true per the POSIX specification, and per other OS's implementations. Al Viro suggested this should be added to the lkml FAQ, explaining:
"If you are within chroot jail and capable of chroot(), you can chdir to its root, then chroot() to subdirectory and you've got cwd outside of your new root. After that you can chdir all way out to original root. Again, this is standard behaviour. Changing it will not yield any security improvements, so kindly give that a rest."
When it was suggested that chroot is frequently used as a security tool, Adrian Bunk retorted, "incompetent people implementing security solutions are a real problem." Alan added, "chroot is not and never has been a security tool. People have built things based upon the properties of chroot but extended (BSD jails, Linux vserver) but they are quite different."
Ulrich Drepper noted a difference between the Linux connect(2) man page and the POSIX specification. The former states, "connectionless sockets may dissolve the association by connecting to an address with the sa_family member of sockaddr set to AF_UNSPEC." The latter reads, "if address is a null address for the protocol, the socket's peer address shall be reset." Ulrich explained that he preferred the description in the Linux man page, but the Linux kernel seems to actually follow the POSIX specification, "is this functionality which got lost over time? Or is the man page wrong and this never was the case? Is this a worthwhile change?"
Alan Cox noted, "we got it from the 1003.4g draft socket specification if I remember rightly." David Miller suggested, "the whole AF_UNSPEC thing I'm almost certain comes from BSD, which has behaved that way for centuries." Alan concurred, "its entirely plausible that [the 1003.4g draft socket specification] got it from 4BSE." Ulrich concluded, "I guess I'll just go ahead and file a problem report with the spec. Maybe the Unix vendors will test their implementations and provide feedback."
"An ongoing study on datasets of several Petabytes have shown that there can be 'silent data corruption' at rates much larger than one might naively expect from the expected error rates in RAID arrays and the expected probability of single bit uncorrected errors in hard disks," began a recent query on the Linux kernel mailing list asking where the errors might be introduced. Alan Cox replied, "its almost entirely device specific at every level." He then continued on with some general information, tracing the path of the data from the drive, through the cable and bus, into main memory and the CPU cache, as well as over the network, "once its crossing the PCI bus and main memory and CPU cache its entirely down to the system you are running what is protected and how much. Note that a lot of systems won't report ECC errors unless you ask." Alan continued:
"The next usual mess is network transfers. The TCP checksum strength is questionable for such workloads but the ethernet one is pretty good. Unfortunately lots of high performance people use checksum offload which removes much of the end to end protection and leads to problems with iffy cards and the like. This is well studied and known to be very problematic but in the market speed sells not correctness."
Regarding the specific study in question, Alan noted, "for drivers/ide there are *lots* of problems with error handling so that might be implicated (would want to do old [versus] new ide tests on the same h/w which would be very intriguing)."