"In the kerneloops.org stats, a new oops is rapidly climbing the charts," began Arjan van de Ven, referring to his website, which automatically collects kernel oops and warning reports from mailing lists, bugzillas, and a special client. Regarding the latest oops, he noted, "the oops is a page fault in the ext3 'do_split' function, and the first report of it was with 2.6.26-rc6-git3." Linux creator Linus Torvalds took a quick interest in the issue, observing that all the oopses seemed to be on the i686 architecture and suggesting, "could this perhaps be an indication that it is specific to i686 some way (eg a compiler issue?)"
Shortly before Linus sent out his emails, Dave Airlie confirmed that this was indeed a known compiler bug affecting GCC 4.3.1. The bug report notes, "any ext* filesystem which enables the dir_index feature is likely susceptible". Linus caught up on his email and replied, "gaah. I should read all my email instead of wasting my time trying to match up the code with what I can reproduce.." The Red Hat bug report wasn't automatically picked up by the kerneloops website because the oops was posted as a JPEG image, leading Arjan to quip, "maybe one day if I'm really bored I'll implement OCR into [kerneloops.org] ;)".
From: Arjan van de Ven <arjan@...>
Subject: kerneloops.org: 2.6.26-rc possible regression in ext3
Date: Jun 19, 1:34 am 2008

In the kerneloops.org stats, a new oops is rapidly climbing the charts.
The oops is a page fault in the ext3 "do_slit" function, and the first
report of it was with 2.6.26-rc6-git3.

It happens with various applications; the backtraces are at:

http://www.kerneloops.org/search.php?search=do_split

but are generally of this pattern:

*do_split
ext3_add_entry
ext3_rename
vfs_rename
... <various paths into vfs_rename> ...

or

*do_split
? add_dirent_to_buf
ext3_add_entry
ext3_new_inode
ext3_add_nondir
ext3_create
vfs_create
....

did we change anything in ext3 this cycle?
--
From: Linus Torvalds <torvalds@...>
Subject: Re: kerneloops.org: 2.6.26-rc possible regression in ext3
Date: Jun 19, 2:01 am 2008

On Wed, 18 Jun 2008, Arjan van de Ven wrote:
>
> In the kerneloops.org stats, a new oops is rapidly climbing the charts.
> The oops is a page fault in the ext3 "do_split" function, and the first
> report of it was with 2.6.26-rc6-git3.

Interesting.

> It happens with various applications; the backtraces are at:
>
> http://www.kerneloops.org/search.php?search=do_split
>
> but are generally of this pattern:
>
> *do_split
> ext3_add_entry
> ext3_rename
> vfs_rename
> ... <various paths into vfs_rename> ...
>
> or
>
> *do_split
> ? add_dirent_to_buf
> ext3_add_entry
> ext3_new_inode
> ext3_add_nondir
> ext3_create
> vfs_create
> ....
>
> did we change anything in ext3 this cycle?

I'm not seeing anything relevant, but I'm adding Al to the cc in, since
the r/o bind mounts did change fs/namei.c and vfs_create/mkdir in
particular. Not that I see why that would trigger either, but the changes
to fs/ext3/namei.c seem to be even _less_ interesting than that.

One thing I note is that all the oopses seem to be i686 - are there that
few x86-64 fc10 users (I'd have assumed that 64-bit is starting to be the
norm for people who live on the edge, but perhaps I'm just out of touch)?

Or could this perhaps be an indication that it is specific to i686 some
way (eg a compiler issue?)

Linus
--
From: Linus Torvalds <torvalds@...>
Subject: Re: kerneloops.org: 2.6.26-rc possible regression in ext3
Date: Jun 19, 2:14 am 2008

On Wed, 18 Jun 2008, Linus Torvalds wrote:
>
> One thing I note is that all the oopses seem to be i686 - are there that
> few x86-64 fc10 users (I'd have assumed that 64-bit is starting to be the
> norm for people who live on the edge, but perhaps I'm just out of touch)?
>
> Or could this perhaps be an indication that it is specific to i686 some
> way (eg a compiler issue?)

The oops code is odd:

  27:   8d 4c 18 fe    lea    0xfffffffe(%eax,%ebx,1),%ecx
  2b:*  8b 19          mov    (%ecx),%ebx     <-- trapping instruction
  2d:   83 e9 08       sub    $0x8,%ecx
  30:   89 d8          mov    %ebx,%eax
  32:   66 d1 e8       shr    %ax
  35:   0f b7 c0       movzwl %ax,%eax

and that "lea" is doing an address computation of "eax+2*ebx-2". Which
does *not* look like an address to a 32-bit entity, but to a 16-bit one.
Yeah, it's not conclusive, but it is suggestive.

And the 16-bit "shr+movzwl" further strengthens the case that it is
actually working on a 16-bit entity. The trapping instruction _should_
possibly have been a "movzwl (%ecx),%ebx" to begin with.

But it did a 32-bit load, and in this case it looks as if the 16-bit load
would have been correct! The value of ECX in this example was

  ECX: dc384ffe

ie it was indeed a two-byte aligned thing at the end of the page, and if
the load had been a 16-bit load (like the data seems to be), it would
never have oopsed! The page fault seems to be due to DEBUG_PAGEALLOC and
the next page being unmapped because it's not allocated.

I only looked closer at one particular oops (25906, in case anybody
cares), but at least judging from that particular one I would indeed
suspect a compiler bug.

Of course, the main reason I say that is that none of the ext3 or VFS
changes look even _remotely_ relevant to any of this. They really don't
look like they could possibly matter for "do_split()" unless there is
something really odd going on.

Linus
--
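To make the page-boundary argument above concrete, here is a minimal sketch of the arithmetic. It assumes 4 KiB pages (the usual i686 page size); the function name `load_crosses_page` and the `PAGE_SIZE_BYTES` constant are illustrative and do not come from the kernel source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, the usual size on i686. */
#define PAGE_SIZE_BYTES 4096u

/*
 * Hypothetical helper: returns true if a load of `width` bytes starting
 * at `addr` reads at least one byte that lies in the following page.
 */
bool load_crosses_page(uint32_t addr, uint32_t width)
{
    uint32_t offset = addr & (PAGE_SIZE_BYTES - 1); /* position inside the page */

    return offset + width > PAGE_SIZE_BYTES;
}
```

With the ECX value from the oops, `load_crosses_page(0xdc384ffe, 2)` is false and `load_crosses_page(0xdc384ffe, 4)` is true: the address sits two bytes before the end of its page, so a 16-bit load stays inside it while a 32-bit load also touches the next page, which DEBUG_PAGEALLOC had left unmapped.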
From: Linus Torvalds <torvalds@...>
Subject: Re: kerneloops.org: 2.6.26-rc possible regression in ext3
Date: Jun 19, 2:40 am 2008

On Wed, 18 Jun 2008, Linus Torvalds wrote:
>
> and that "lea" is doing an address computation of "eax+2*ebx-2". Which
> does *not* look like an address to a 32-bit entity, but to a 16-bit one.
> Yeah, it's not conclusive, but it is suggestive.

I'm wrong, that's just "eax+ebx-2". The *2 was just a brainfart on my part.

But I think I have pinpointed where it comes from: it's the

    struct dx_map_entry *map;

which is a structure like this:

    struct dx_map_entry
    {
        u32 hash;
        u16 offs;
        u16 size;
    };

and it does look like it's the

    if (size + map[i].size/2 > blocksize/2)

calculation, where "i" counts backwards from "count-1" to 0.

In particular, the code

  27:   8d 4c 18 fe    lea    0xfffffffe(%eax,%ebx,1),%ecx
  2b:*  8b 19          mov    (%ecx),%ebx     <-- trapping instruction
  2d:   83 e9 08       sub    $0x8,%ecx
  30:   89 d8          mov    %ebx,%eax
  32:   66 d1 e8       shr    %ax
  38:   8d 04 02       lea    (%edx,%eax,1),%eax

seems to be that "size + map[i].size/2" calculation, but I have a hard
time trying to line it up with wat _my_ compiler gives me. But the nearest
match I have is:

    movw    6(%ecx), %bx        # <variable>.size, D.21305
    subl    $8, %ecx            #, ivtmp.921
    movl    -104(%ebp), %edx    # blocksize, tmp179
    movl    %ebx, %eax          # D.21305, tmp176
    shrw    %ax                 # tmp176
    movzwl  %ax, %eax           # tmp176, tmp177
    leal    (%esi,%eax), %eax   #, tmp178

which seems to be largely the same thing (except I have a "movw" to load
the size, and %ecx is offset by one 'map' entry - so the offset is 6 (in
the memop) instead of that "-2" (from the lea).

I think I'll give up, but that's the closest match I can find. No
guarantees, but it seems to support the notion of "wrong 32-bit load where
it should have used a 16-bit one".

Linus
--
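The layout reasoning above can be checked with a small user-space sketch. The struct and the `if` condition are taken from the email; the `u32`/`u16` typedefs, the function name `count_entries_to_move`, and the accumulation step `size += map[i].size` are assumptions made to get a self-contained loop, not a copy of the ext3 source.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t u32; /* stand-ins for the kernel's fixed-width types */
typedef uint16_t u16;

struct dx_map_entry {
    u32 hash;
    u16 offs;
    u16 size; /* last two bytes of the 8-byte entry */
};

/*
 * Hypothetical helper: walk the map backwards from count-1 to 0 and count
 * how many trailing entries fit before the condition quoted in the email
 * stops the loop. The accumulation step is an assumption.
 */
unsigned count_entries_to_move(const struct dx_map_entry *map,
                               unsigned count, unsigned blocksize)
{
    unsigned size = 0, move = 0;
    int i;

    for (i = (int)count - 1; i >= 0; i--) {
        /* Only the 16-bit "size" field of each entry is read here. */
        if (size + map[i].size / 2 > blocksize / 2)
            break;
        size += map[i].size;
        move++;
    }
    return move;
}
```

On common ABIs `offsetof(struct dx_map_entry, size)` is 6 and `sizeof(struct dx_map_entry)` is 8, which lines up with both views in the email: "6 bytes past the start of the entry" in Linus's own build and "2 bytes before the end of the entry" in the oops. If the last entry ends exactly at a page boundary, a 16-bit read of `size` stays within the page, while a 32-bit read starting at the same address does not.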
From: Arjan van de Ven <arjan@...>
Subject: Re: kerneloops.org: 2.6.26-rc possible regression in ext3
Date: Jun 19, 2:09 am 2008

Linus Torvalds wrote:
> Or could this perhaps be an indication that it is specific to i686 some
> way (eg a compiler issue?)
>

Dave Airlie just confirmed this is a compiler bug indeed in gcc 4.3.1
and pointed at
https://bugzilla.redhat.com/show_bug.cgi?id=451068
--
From: Dave Airlie <airlied@...>
Subject: Re: kerneloops.org: 2.6.26-rc possible regression in ext3
Date: Jun 19, 1:42 am 2008

On Thu, Jun 19, 2008 at 3:36 PM, Arjan van de Ven <arjan@linux.intel.com> wrote:
> In the kerneloops.org stats, a new oops is rapidly climbing the charts.
> The oops is a page fault in the ext3 "do_slit" function, and the first
> report of it was with 2.6.26-rc6-git3.
>
> It happens with various applications; the backtraces are at:
>
> http://www.kerneloops.org/search.php?search=do_split
>

This is a bug in rawhide in gcc miscompiling something...

https://bugzilla.redhat.com/show_bug.cgi?id=451068

Dave.
--
From: Arjan van de Ven <arjan@...>
Subject: Re: kerneloops.org: 2.6.26-rc possible regression in ext3
Date: Jun 19, 1:48 am 2008

Dave Airlie wrote:
> On Thu, Jun 19, 2008 at 3:36 PM, Arjan van de Ven <arjan@linux.intel.com> wrote:
>> In the kerneloops.org stats, a new oops is rapidly climbing the charts.
>> The oops is a page fault in the ext3 "do_slit" function, and the first
>> report of it was with 2.6.26-rc6-git3.
>>
>> It happens with various applications; the backtraces are at:
>>
>> http://www.kerneloops.org/search.php?search=do_split
>>
>
> This is a bug in rawhide in gcc miscompiling something...
>
> https://bugzilla.redhat.com/show_bug.cgi?id=451068
>

thanks for letting us know so fast!
I've marked this one in the database as a fedora gcc bug
--
From: Linus Torvalds <torvalds@...>
Subject: Re: kerneloops.org: 2.6.26-rc possible regression in ext3
Date: Jun 19, 2:42 am 2008

On Thu, 19 Jun 2008, Dave Airlie wrote:
>
> This is a bug in rawhide in gcc miscompiling something...
>
> https://bugzilla.redhat.com/show_bug.cgi?id=451068

Gaah. I should read all my email instead of wasting my time trying to
match up the code with what I can reproduce..

Linus
--
From: Arjan van de Ven <arjan@...>
Subject: Re: kerneloops.org: 2.6.26-rc possible regression in ext3
Date: Jun 19, 3:09 am 2008

Linus Torvalds wrote:
>
> On Thu, 19 Jun 2008, Dave Airlie wrote:
>> This is a bug in rawhide in gcc miscompiling something...
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=451068
>
> Gaah. I should read all my email instead of wasting my time trying to
> match up the code with what I can reproduce..
>

unfortunately, kerneloops.org didn't pick up the link to this bug
(due to the fact that the oops in the bug was a jpeg....)...
maybe one day if I'm really bored I'll implement OCR into it ;)

sorry about wasting your time
--
From: Adrian Bunk <bunk@...>
Subject: Re: kerneloops.org: 2.6.26-rc possible regression in ext3
Date: Jun 19, 4:11 am 2008

On Thu, Jun 19, 2008 at 03:42:34PM +1000, Dave Airlie wrote:
> On Thu, Jun 19, 2008 at 3:36 PM, Arjan van de Ven <arjan@linux.intel.com> wrote:
> > In the kerneloops.org stats, a new oops is rapidly climbing the charts.
> > The oops is a page fault in the ext3 "do_slit" function, and the first
> > report of it was with 2.6.26-rc6-git3.
> >
> > It happens with various applications; the backtraces are at:
> >
> > http://www.kerneloops.org/search.php?search=do_split
>
> This is a bug in rawhide in gcc miscompiling something...
>
> https://bugzilla.redhat.com/show_bug.cgi?id=451068

If I understand it correctly that's a bug in upstream gcc 4.3.1
(but not in gcc 4.3.0)?

Expect a lot more of this to pop up in the future.
Should we #error for gcc 4.3.1?

> Dave.

cu
Adrian

--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
--
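For readers unfamiliar with the "#error" idea raised above, a version-based build guard could look roughly like the sketch below. The `__GNUC__`, `__GNUC_MINOR__` and `__GNUC_PATCHLEVEL__` macros are predefined by GCC; `GCC_VERSION_CODE` and `THIS_GCC` are hypothetical helper names introduced only for this illustration, and nothing in the thread indicates that such a check was merged.

```c
/* Hypothetical helper: pack a GCC version into one comparable integer. */
#define GCC_VERSION_CODE(major, minor, patch) \
    ((major) * 10000 + (minor) * 100 + (patch))

/* Version of the compiler building this file, or 0 if it is not GCC-like. */
#if defined(__GNUC__) && defined(__GNUC_MINOR__) && defined(__GNUC_PATCHLEVEL__)
#define THIS_GCC GCC_VERSION_CODE(__GNUC__, __GNUC_MINOR__, __GNUC_PATCHLEVEL__)
#else
#define THIS_GCC 0
#endif

/* Refuse to build with the one release reported in this thread. */
#if THIS_GCC == GCC_VERSION_CODE(4, 3, 1)
#error "gcc 4.3.1 is reported to miscompile ext3 do_split(), see gcc bug 36533"
#endif
```

A check like this keys only on the version number, so it cannot tell a fixed 4.3.1 build (for example one carrying a distribution patch) from a broken one.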
From: Arjan van de Ven <arjan@...>
Subject: Re: kerneloops.org: 2.6.26-rc possible regression in ext3
Date: Jun 19, 9:40 am 2008

Adrian Bunk wrote:
>
> Expect a lot more of this to pop up in the future.
> Should we #error for gcc 4.3.1?
>

it/s better to find if the gcc guys made a testcase for this bug (they normally do) and
test based on that.
--
From: Adrian Bunk <bunk@...>
Subject: Re: kerneloops.org: 2.6.26-rc possible regression in ext3
Date: Jun 19, 11:10 am 2008

On Thu, Jun 19, 2008 at 06:40:05AM -0700, Arjan van de Ven wrote:
> Adrian Bunk wrote:
>>
>> Expect a lot more of this to pop up in the future.
>> Should we #error for gcc 4.3.1?
>
> it/s better to find if the gcc guys made a testcase for this bug (they normally do) and
> test based on that.

The gcc Bugzilla contains a testcase.

But how do you plan to integrate it into a kernel build?

cu
Adrian

--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
--
From: Arjan van de Ven <arjan@...>
Subject: Re: kerneloops.org: 2.6.26-rc possible regression in ext3
Date: Jun 19, 11:18 am 2008

Adrian Bunk wrote:
> On Thu, Jun 19, 2008 at 06:40:05AM -0700, Arjan van de Ven wrote:
>> Adrian Bunk wrote:
>>> Expect a lot more of this to pop up in the future.
>>> Should we #error for gcc 4.3.1?
>> it/s better to find if the gcc guys made a testcase for this bug (they normally do) and
>> test based on that.
>
> The gcc Bugzilla contains a testcase.
>
> But how do you plan to integrate it into a kernel build?

we already have several of these. Just look at
scripts/gcc-x86_64-has-stack-protector.sh
for an example of such a beast.
--
From: Mikael Pettersson <mikpe@...>
Subject: Re: kerneloops.org: 2.6.26-rc possible regression in ext3
Date: Jun 19, 4:32 am 2008

Adrian Bunk writes:
> On Thu, Jun 19, 2008 at 03:42:34PM +1000, Dave Airlie wrote:
> > On Thu, Jun 19, 2008 at 3:36 PM, Arjan van de Ven <arjan@linux.intel.com> wrote:
> > > In the kerneloops.org stats, a new oops is rapidly climbing the charts.
> > > The oops is a page fault in the ext3 "do_slit" function, and the first
> > > report of it was with 2.6.26-rc6-git3.
> > >
> > > It happens with various applications; the backtraces are at:
> > >
> > > http://www.kerneloops.org/search.php?search=do_split
> >
> > This is a bug in rawhide in gcc miscompiling something...
> >
> > https://bugzilla.redhat.com/show_bug.cgi?id=451068
>
> If I understand it correctly that's a bug in upstream gcc 4.3.1
> (but not in gcc 4.3.0)?
>
> Expect a lot more of this to pop up in the future.
> Should we #error for gcc 4.3.1?

There are other nasty bugs in gcc-4.3.0. I actually had to completely
ban 4.3.0 in a user-space project I'm involved with (Erlang) due to
gcc PR36339 (fixed in 4.3.1).

What's the gcc bugzilla number for this new 4.3.1 bug?
--
From: Adrian Bunk <bunk@...>
Subject: Re: kerneloops.org: 2.6.26-rc possible regression in ext3
Date: Jun 19, 6:49 am 2008

On Thu, Jun 19, 2008 at 10:32:24AM +0200, Mikael Pettersson wrote:
>...
> What's the gcc bugzilla number for this new 4.3.1 bug?

#36533

cu
Adrian

--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
--
