Add basic sanity checks to the syscall execution patch

Several pieces of malware (rootkits etc) have the nasty habit of putting
their own pointers into the syscall table. For example, the recently
"hot in the news" phalanx rootkit does this.

The patch below, while obviously not perfect protection against malware,
adds some cheap sanity checks to the syscall path to verify the system call
is actually still in the kernel code region and not some external-to-this
region such as a rootkit.

The overhead is very minimal; measured at 2 cycles or less.
(this is because the branches get predicted right and the rest of the code
is almost perfectly parallelizable... and an indirect function call is a
branch issue anyway)

with eyes-on-the-code help from Peter
the idea is from Ben Herrenschmidt

Signed-off-by: Arjan van de Ven

diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index 109792b..f25c0a1 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -347,7 +347,12 @@ sysenter_past_esp:
 sysenter_do_call:
 	cmpl $(nr_syscalls), %eax
 	jae syscall_badsys
-	call *sys_call_table(,%eax,4)
+	mov sys_call_table(,%eax,4), %eax
+	cmp $_stext, %eax
+	jb syscall_badsys
+	cmp $_etext, %eax
+	jae syscall_badsys
+	call *%eax
 	movl %eax,PT_EAX(%esp)
 	LOCKDEP_SYS_EXIT
 	DISABLE_INTERRUPTS(CLBR_ANY)
@@ -426,7 +431,12 @@ ENTRY(system_call)
 	cmpl $(nr_syscalls), %eax
 	jae syscall_badsys
 syscall_call:
-	call *sys_call_table(,%eax,4)
+	mov sys_call_table(,%eax,4), %eax
+	cmp $_stext, %eax
+	jb syscall_badsys
+	cmp $_etext, %eax
+	jae syscall_badsys
+	call *%eax
 	movl %eax,PT_EAX(%esp)		# store the return value
 syscall_exit:
 	LOCKDEP_SYS_EXIT
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 89434d4..be42486 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -360,8 +360,13 @@ ENTRY(system_call_after_swapgs)
 system_call_fastpath:
 	cmpq $__NR_syscall_max,%rax
 	ja badsys
+	mov sys_...
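For readers who do not read x86 assembly, the check the patch adds can be sketched in C. This is only an illustration of the logic, not kernel code: `struct text_region`, `handler_in_text` and `lookup_syscall` are hypothetical names, the region bounds stand in for the `_stext` and `_etext` linker symbols, and a return value of 0 stands in for the jump to `syscall_badsys`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's _stext/_etext linker symbols. */
struct text_region {
    uintptr_t start; /* first byte of kernel text (inclusive) */
    uintptr_t end;   /* one past the last byte of kernel text (exclusive) */
};

/*
 * Mirrors the two compares in the patch: a handler below start is
 * rejected (jb), a handler at or beyond end is rejected (jae).
 */
static bool handler_in_text(const struct text_region *r, uintptr_t handler)
{
    return handler >= r->start && handler < r->end;
}

/*
 * Look up a handler in the table. Returns 0 in the cases where the
 * assembly would branch to syscall_badsys: syscall number out of range,
 * or table entry pointing outside the text region.
 */
static uintptr_t lookup_syscall(const uintptr_t *table, size_t nr_syscalls,
                                const struct text_region *r, size_t nr)
{
    if (nr >= nr_syscalls)
        return 0;

    uintptr_t handler = table[nr];

    return handler_in_text(r, handler) ? handler : 0;
}
```

Note that the check only looks at where the table entry points. It says nothing about whether the bytes at that address are still the original code.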
This just means that the rootkits will switch to patching the first instruction of the entry points instead. So the protection will be zero to minimal, but the overhead will be there forever. Now that I said this I expect it to go in yesterday. -Andi --
On Thu, 04 Sep 2008 14:01:46 +0200 I'd have considered taking your email seriously if you had left out the uncalled-for and unneeded sarcasm line at the end. -- If you want to reach me at my work email, use arjan@linux.intel.com For development, discussion and tips for power savings, visit http://www.lesswatts.org --
consider how your whole patch is based on one big self-contradiction. you already assume that the attacker *can* modify arbitrary kernel memory (even the otherwise *read-only* syscall table at that), but at the very same time you're saying he *can't* use the same powers to patch out your 'protection' or do many other things to evade it. as it is, it's cargo cult security at its best, reminding one of the Vista kernel's similar 'protection' mechanism for the service descriptor tables... --
On Fri, 05 Sep 2008 11:43:31 +0200 so I'm not going to say that the patch is important or good; it's the result of ben mentioning the idea on irc and me thinking "sure, let's see what it would take and cost". Nothing more than that -- If you want to reach me at my work email, use arjan@linux.intel.com For development, discussion and tips for power savings, visit http://www.lesswatts.org --
Well, I see it a different way ... it will once for all screw up binary modules that try to add syscalls :-) Ben. --
and that'd be because at the same time they patch the syscall table (remember, they already have to go to lengths to get around the read-only pages), they can't also patch this 'protection'? sounds really plausible, right :). [fixed hpa's address, .org bounces.] --
Sure, they can :-) It's just an idea I had on irc but I tend to agree that it wouldn't have much effect in practice... regarding security, it will break some existing rootkits ... until updated ones show up. Cheers, Ben. --
at which point we are left with a change that has no relevance to updated rootkits (they circumvent it just fine), while the kernel syscall entry path is left with 2 cycles (or more) overhead, forever. Not a good deal.

We introduced the read-only syscall table because it has debugging and robustness advantages, with near zero cost. This change is not zero cost - it's ~1% of our null syscall latency. (which is ~100 nsecs, the cost of this check is ~1 nsec)

The other, more fundamental problem that nobody has mentioned so far is that the check returns -ENOSYS and thus makes rootkit attacks _more robust_ and hence more likely!

The far better solution would be to insert uncertainty into the picture: some sort of low-frequency watchdog [runs once a second or so] that tries to hide itself from the general kernel scope as much as possible, perhaps as ELF-PIC code at some randomized location, triggered by some frequently used and opaque kernel facility that an attacker can not afford to block or fully filter, and which would just check integrity periodically and with little cost. When it finds a problem it immediately triggers a hard to block/filter vector of alert (which can be a silent alarm over the network or to the screen as well).

that method does not prevent rootkits in general (nothing can), but sure makes their life more risky in practice - and a guaranteed livelihood and risk reduction is what typical criminals are interested in primarily, not whether they can break into a particular house.

If we implement it then it should not be present in distro .config's, etc. - it should be as invisible as possible - perhaps only be part of the kernel image .init.data section in some unremarkably generic manner.

[ It would be nice to have a 'randomize instruction scheduling' option for gcc, to make automated attacks that recognize specific instruction patterns less reliable. ]

A good benchmark for such a silent alarm facility would be whether an...
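The integrity-comparison step of such a watchdog can be sketched in C as follows. This is only a rough illustration under several assumptions: `struct watched_region`, `watch_init` and `watch_check` are hypothetical names, FNV-1a is used purely for brevity (it is not tamper-resistant), and nothing here addresses the hard parts the message describes, namely hiding the checker, scheduling it from an opaque facility, and raising an alert that cannot be filtered.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit FNV-1a hash; chosen only because it fits in a few lines. */
static uint64_t fnv1a64(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint64_t h = 14695981039346656037ULL; /* FNV offset basis */

    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;            /* FNV prime */
    }
    return h;
}

/* Hypothetical record of one memory region the watchdog verifies. */
struct watched_region {
    const void *base;  /* start of the region, e.g. a syscall table */
    size_t len;        /* length of the region in bytes */
    uint64_t baseline; /* hash recorded while the region was trusted */
};

/* Record the trusted state of the region (done once, early). */
static void watch_init(struct watched_region *w, const void *base, size_t len)
{
    w->base = base;
    w->len = len;
    w->baseline = fnv1a64(base, len);
}

/*
 * One periodic pass: returns true if the region still hashes to the
 * baseline, false if it was modified since watch_init().
 */
static bool watch_check(const struct watched_region *w)
{
    return fnv1a64(w->base, w->len) == w->baseline;
}
```

A caller would run `watch_check()` at a low frequency, on the order of once a second as suggested above, and raise its alert on a `false` result. An attacker who finds the stored baseline or the checker itself can of course patch either one.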
Then they will simply proceed like this :
- patch /boot/vmlinuz
- sync
- crash system
=> user says "oh crap" and presses the reset button. Patched kernel boots. Game over.

Patching vmlinuz for known targeted distros is even easier because the attacker just has to embed binary changes for the most common distro kernels.

Clearly all this is a waste of developer time, CPU cycles, memory, reliability and debugging time. All that time would be more efficiently spent auditing and debugging existing code to reduce the attack surface, and CPU cycles + memory would be better spent adding double checks to the most sensitive functions' entry points and user data processing.

Regards, Willy --
a reboot often raises attention. But yes, in terms of end user boxes, probably not. Anyway, my points were about transparent rootkits installed on a running system without anyone noticing - obviously if the attacker can modify the kernel image and the user does not mind a reboot it's game over. Ingo --
Well, install a rootkit in /boot/vmlinuz, sync, then wait for the user to reboot their system? Even well-kept servers are rebooted from time to time. I agree -- the only way to win is not to play this game. -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
Hi, can't then, in this scenario, the VFS keep tabs on /boot/vmlinuz and only allow modification when the process in question properly authenticates itself? As long as we're talking signed modules, why not lock certain files down as well? e.g. hand the kernel a signed list of files to watch write access to, and allow only after the process auths via a private key. -- Jeroen. n.b. I understand this would slow down things more, but if we're talking about taking extreme measures... --
there's that adage about history being repeated by those not knowing it ;) for details see the series based around bypassing Vista's PatchGuard at: http://uninformed.org/?v=3 http://uninformed.org/?v=6 i believe the above mentioned papers prove that it's not a good benchmark ;) --
i think Linux is fundamentally different here as we have the source and every box where it matters we could have a _per box_ randomized kernel image in essence, with non-essential symbols thrown away, and with a few checks inserted in random locations - inlined and in essence unrecognizable from the general entropy of randomization.

Not that a randomizing compiler which inserts true, hard to eliminate entropy would be easy to implement. But once done, the cat and mouse game is over and the needle is hidden in the hay-stack. At least as long as transparent rootkits are involved.

a successful attack that wants to disable the checks reliably would have to patch the IDT and would have to emulate full kernel execution and would have to detect the pattern of an alert on the hardware API level - as that would be the only reliably observable output of the system. Besides being impractical at best, at minimum a huge slow-down would occur.

the only other option would be for a rootkit to transparently switch to another, new, non-checked kernel image on the fly, while keeping all user-space context safe. That's a feature Linux would like to have anyway ;-)

[and this could be made really difficult as well if gcc inserted a modest amount of per kernel random noise in the layout of all data structures / field offsets.]

Ingo --
how's that supposed to work for the binary distros, i.e., the majority of

why do you assume that an attacker wants to do that? it's equally possible, and there's even academic research on this in addition to the underground cracking scene, that one simply hides the modifications from the checker. from marking your patched code as unreadable to executing it from a different place than what the checker checks, there're many ways to trick such checkers.

as far as reality goes, it's never been game over ;). --
it takes less than 10 minutes to build a full kernel on recent hardware.

yes, in this area debuggability is in straight conflict. Since we can assume that both attacker and owner have about the same level of access to the system, making the kernel less accessible to an attacker makes it

well at least in the case of Linux we have a fairly good tally of what kernel code is supposed to be executable at some given moment after bootup, and can lock that list down permanently until the next reboot, and give the list to the checker to verify every now and then?

Such a verification pass certainly wouldn't be cheap though: all kernel pagetables have to be scanned and verified, plus all known code (a few megabytes typically), and the key CPU data structures.

Ingo --
provided the end user wants/needs to have the whole toolchain on his boxes

it's not only installation time (if you meant 'installing the box' itself),

in other words, it's a permanently unsolved problem ;).

somehow i don't see Red Hat selling RHEL for production boxes with the tag 'we do not debug crashes

so no module support? what about kprobes and/or whatever else that generates

so good-bye to large page support for kernel code? else there's likely enough unused space left in the large pages for a rootkit to hide.

what if the rootkit finds unused pieces of actual code and replaces that (bound to happen with those generic distro configs, especially if you have to go with a non-modular kernel)?

last but not least, how would that 'lock that list down' work exactly?

what would you verify on the code? it's obfuscated so you can't really analyze it (else you've just solved the attacker's problem), all you can do is probably compute hashes but then you'll have to take care of kernel self-patching and also protecting the hashes somehow. --
it's minimal and easy. It really works to operate on the source code - this 'open source' thing ;-) We just still tend to think in terms of binary software practices that have been established in the past few

not a problem really, it is rather small compared to all the stuff that is in a typical distro install. I like the fundamental message as well: "If you want to be more secure, you've got to have the source code, and it's not an unsolvable problem. The debug info can be on a separate box, encrypted, etc. etc - depending on your level of paranoia. The need to debug kernel crashes is a relatively rare event - especially on a box

why no module support? Once the system has booted up all necessary modules are loaded and the ability to load new ones is locked down as well. This also makes it harder to inject rootkits btw. (combined with

you don't need that in general on a perimeter box. If you need it, you open that locked box with the debug info and make the system more patchable/debuggable - at the risk of exposing same information to

are you now talking about the randomized kernel image? The whole point why i proposed it was to hide the checking functionality in it, not to make it harder for the attacker to place the rootkit. Once the identity of the checking code is randomized reasonably, we can assume it will run every now and then, and would expose any modifications of 'unused' kernel functions. (which the attacker would

best would be hardware support for mark-read-only-permanently, but once the checker functionality is reasonably randomized, its data structure

yes, hashes. The point would be to make the true characteristics of the checker a random, per system property. True, it has many disadvantages such as the inevitable slowdown from a randomized kernel image, the restrictions on debuggability, etc. - but it can serve its purpose if someone is willing to pay that price.

best (and most practical) tactics would still be to allow the k...
the question wasn't whether it was minimal or easy but whether end users want to have the toolchain on their production boxes, especially on these

the point is not the size of the toolchain, i don't think anyone cares about that in the days of TB disks. the more fundamental issue is that the toolchain doesn't normally belong to production boxes and if the sole reason to have it is this kernel image randomization feature, then it may not be as easy a sell as you think as there're better alternatives

what does having the debug info available in whatever form help you in the debugging process that doesn't at the same time help an attacker? remember, the assumption is that the attacker is already on the box (and as root at that), trying to get his kernel rootkit to work, so you'll have to come up with a debugging procedure where he can't leverage that local access to pry the debug info out of your hands as you're trying to diagnose a problem. e.g., you can't just disconnect the box from the network if you need remote access yourself or reproducing the problem

how are the security constraints of the box related to its kernel's

and this also makes it impossible to load newer versions of modules, which will now require a full reboot. i'm sure management will like the

so all an attacker needs to do is induce some kernel problems (due to the underlying assumption, he can easily do that), wait for you guys

and was pointing out that you don't actually have such a good tally unless you're willing to give up large page support for kernel code, and even if you go for 4k pages you'll be in trouble because a generic kernel like those used in distros is bound to have unused regions of code. and i base this on the assumption that your randomization cannot fundamentally change function boundaries (i.e., randomizing code placement at the basic block level) without killing the branch predictor for good.
the short of it is that your list of 'kernel code pages' is useless without ensuring t...
First such checkers already exist -- they are called root kit checkers. There are various around. Doing it in a hypervisor implicitly like Alan proposed would seem much

The issue is that a lot of non key data structures all over the memory have function pointers (or pointers to function pointers) too. So if you protect syscall table they are just going to patch some dentry instead. Still if it's reasonably clean it might still be useful to raise the bar a bit, but I'm not sure a checker qualifies for that.

-Andi --
how trivial do you think it is for *kernel* code to evade *userland* checking it? ;) otherwise agreed with rest. --
It depends on where the userland runs. e.g. if it's under a hypervisor and in a separate domain it should be reasonably safe. And then I don't think there is much difference between Ingo's kernel checker and a user land checker. Both can be disabled if you know about them. -Andi -- ak@linux.intel.com --
First as a minor pedantic correction (sorry!): the ro syscall table is not fully free. It means you cannot use 2MB pages anymore to map it, which costs

One way to do that today is to feed gcc random data for profile feedback.

Game copy protections have been playing similar games for decades. While I'm sure it was endless fun for both sides afaik the crackers tended to ultimately win. And all of these things also make the kernel more fragile which is not good. Likely a case of "the only way to win is not to play"

I liked Alan's proposal of using hypervisor support for truly ro pages, although even that is not fully hole proof because of indirect pointers. But at least it would make it generally harder to inject code.

-Andi -- ak@linux.intel.com --
On Thu, 04 Sep 2008 14:01:46 +0200 Agreed entirely. This is a waste of time and a game not worth playing. The only place you can expect to make a difference here is in virtualised environments by teaching KVM how to provide 'irrevocably read only' pages to guests where the guest OS isn't permitted to change the rights back or the virtual mapping of that page. Alan --
Even that can be circumvented by patching indirect pointers (or pointers to objects with indirect pointers) in any writable object. Or in a couple of other ways. But yes it would still seem like a reasonably useful improvement. -Andi -- ak@linux.intel.com --
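The indirect-pointer problem raised here can be illustrated with a small C sketch. All names are hypothetical: `struct object` stands in for any writable kernel object that carries a function pointer (a dentry, say), and the `const` array stands in for a syscall table on truly read-only pages. The table is never touched, yet what the entry point ends up executing changes.

```c
#include <stddef.h>

/* Hypothetical writable object carrying a function pointer. */
struct object {
    long (*op)(void);
};

static long genuine_op(void)  { return 0; }  /* the intended operation */
static long injected_op(void) { return 42; } /* stand-in for rootkit code */

/*
 * A syscall-like entry point. It lives in unmodified code and is
 * reached through the protected table, but it makes an indirect call
 * through writable data.
 */
static long sys_example(struct object *obj)
{
    return obj->op();
}

typedef long (*syscall_fn)(struct object *);

/* Stand-in for a read-only syscall table; it is never written to. */
static const syscall_fn protected_table[] = { sys_example };
```

Redirecting `obj.op` from `genuine_op` to `injected_op` changes the result of `protected_table[0](&obj)` while every table entry still points at the original, unmodified `sys_example`. A check on the table alone therefore sees nothing wrong.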
