Hans Reiser formed Namesys and began the development of Reiserfs ten years ago. The first release of the filesystem, Reiser3, is part of the mainline 2.4 and 2.6 Linux kernels. The more recent Reiser4 is a complete redesign and reimplementation of Reiserfs, aiming to soon be merged into the mainline 2.6 Linux kernel.
In this interview, Hans discusses his background and how he came to create Namesys and Reiserfs. He looks back at Reiser3, describing the advantages it had over other filesystems when it was released and its current state. He then explores the many improvements currently in Reiser4, describing the plugin architecture and its exciting potential for future semantic enhancements.
Jeremy Andrews: Please share a little about yourself and your background.
Hans Reiser: I grew up in California, couldn't handle junior high school and the insistence on sitting in neat rows, so I dropped out after the eighth grade, ran away from home, took some extension classes at UC Berkeley when I was 14, and then against the advice of everyone applied to UC Berkeley and was accepted when I was 15. Berkeley was a lot better than junior high school, but it still involved homework, which deep down in my heart I could never believe in. Reading textbooks, yes, arguing with the professor in class, yes, but homework I could only possess a theoretical understanding of the social purpose of. Such a pity one cannot get a scholarship to go to the bookstore for 10 years, and at intervals prove by discussion of it that one learned something. I never got a PhD, and never will, because of this. Instead I wrote Reiser4, which was a lot more work, but something that I can care about so it is easier for me.
There were things I never really understood until I did Reiser4 about being a scientist.
For instance, those careful logs that seem so stupid and annoying in lab class, they are important in real life. You just can't remember benchmark configuration details 6 months or 3 years later, when you are wondering if there was maybe some other explanation of the data. It's even more important if you have a boss, because he needs to be able to make sense of the data, and summaries just don't do it when he is looking to provide an insight based on his experience.
I also learned to focus on the little things in the data that don't make sense. Often the guys I hire will disregard them, thinking there must be something wrong with the benchmark since it does not make sense. Being more experienced I know that the things that don't make sense are the most important data collected. Time and again, getting to the bottom of a minor performance anomaly that should not exist reveals a design flaw or failure in my understanding, and curing it leads to an advance in our performance that was well worth having. Time and again I have learned the importance of not letting any code go in without benchmarking it. Things that could not possibly affect performance, do, over and over again, and if you don't catch it immediately, you might never catch it. If you add 5% here and 5% there in performance, and catch all the things that subtract performance, and you do it for a few years, you will have a compelling advantage over the competition.
What you learn, when you read works like Novum Organum by Sir Francis Bacon, is that science is about being a blind man with a stick, and he who most persistently pokes blindly ahead of him, contributes the most to our understanding of the Universe, though only if he is willing to accept what the poking tells him that he does not want to be true. I am not as qualified or clever as our competition, and we aren't as well funded, but we are much more persistent and rigorous. That is not what I wanted to believe would be my contribution to the field when I was a boy, but so it is.
Well commented code, I could never have done it for a class, but in Reiser4 we are fanatical about it. Every new programmer gets dragged by me through the process of learning to write textbook clear code. The reason for this is that one person simply cannot scale to reading 10 people's code in addition to running the business unless the code is textbook clear, and the architect NEEDS to have read it all. But also, part of me views the code the way a painter would view a canvas: it should be well done in every detail, including clarity and commenting.
The filesystems business is a tough business: there is no market for being the second best filesystem, and if you aren't willing to make your work the best you know how to make it in every detail, someone else will and they will beat you. We get customers who choose our filesystem because they can work on it more easily, they have told me this.
Jeremy Andrews: What is your role at Namesys?
Hans Reiser: Architect, Owner, whatever I can't get another person to do.....
Jeremy Andrews: You mention how understanding minor performance anomalies can lead to finding design flaws. Do you have any example of design flaws that have been found and fixed this way?
Hans Reiser: Our flush code didn't follow the design spec in one aspect, perhaps because the significance of the design was not realized, and that was that when people had multiple streaming processes adding to multiple atoms, it would always start by trying to flush the oldest atom, even if that atom had been the one most recently flushed, and as a result we were getting some oddly tiny flushes and less performance than we desired. This was similar in some sense to why the old lowest block number first io scheduler worked so poorly in 2.2/2.4. The error in understanding was that if an atom was old it did not mean that the pages in it which were still unflushed and in RAM were old. Now, we let vm pick an old page, and we flush the slum associated with that page, like it was supposed to work.... The number of seeks was reduced.
Another thing that did not make sense was that in V3, performance for files randomly generated with a uniform distribution in the 0-10k size range was worse if tail packing was turned on. It "should" have been better. In V4 it IS better, for the reasons described at www.namesys.com at quite some length in the part about why BLOBs are a bad design idea. This actually has implications that go far beyond Reiser4 as BLOBs are the dominant paradigm in the database world.....
A willingness to believe that data indicates that one is wrong, and sometimes perhaps that everyone is wrong, is essential to a scientist. Boys think that being brilliant will make them a great scientist. Men know that, in the words of Sir Francis Bacon "men are imperfect mirrors of the creator", and that rigor, thoroughness, and a belief in data over consensus are what really matters. I am a blind man with a stick, and my contribution to society is that I ignorantly poke where none have poked before because I am more sure that I am such a fool I'd better check it than anyone else in my field. My only true insight into the field is knowing what a fool I am.
Jeremy Andrews: In regards to benchmarking and the importance of benchmarking all code changes, what tools do you use? Do they offer a close enough approximation of real life usage?
Hans Reiser: We use the mongo.pl benchmark from our website, and dd of a large file single process and then several large files at a time, and then Elena uses some others. mongo is carefully designed to use a representative mix of file sizes that is quite real life.
Jeremy Andrews: How important is the GPL to you? Do you have any interest or intention of working on a filesystem for any of the BSD operating systems, or for any closed source operating systems?
Hans Reiser: Doing GPL work is doing charity work in our current legal and economic framework. That should be and could be changed, but for now it is so. I have done my share of charity, and I would not have a problem doing proprietary work. I think people should keep their lives in balance, and that includes balancing charity work and better paid work. That said, I have no tempting offers at the moment, so I will probably keep on doing GPL work for now. It is not an easy life, I am $200k or more in debt and drive a 1989 CRX Si. I do want to finish my naming system project.
As for BSD licenses, I am not that generous. If other people want to charge for my code, they should give me some too. I do offer licenses in addition to the GPL for a fee.
Jeremy Andrews: Reiser3 is in the 2.4 and 2.6 kernels. Reiser4 is in Andrew Morton's -mm kernel, aiming for eventual inclusion into the 2.6 kernel. What happened to Reiser1 and Reiser2?
Hans Reiser: Just before journaling got added, one of the programmers put two versions up on our website, and bumped the major version number when he should have bumped the minor version number. I was not willing to go backwards in version numbers to fix it because one should never go backwards in version numbers. Oh well. In retrospect, probably I should have gone backwards. Not doing it now.....;-)
Jeremy Andrews: How did Reiser3 improve upon other filesystems that were available at the time it was written?
Hans Reiser: What mattered the most to real users was actually Chris Mason's journaling code. It was the first for Linux, and then after ext3 came out for a long while it was 2x as fast. We were also better at extremes than most other filesystems, like very large directories and small files. It was a robust filesystem.
In terms of a contribution to computer science, V3 was able to show that you could store small files in the filesystem as files. It showed that balanced tree algorithms were not unusably slow for filesystems to use for storing files in rather than just filenames.
What V4 did was figure out why V3 lost performance when it packed small files efficiently, and fix it. It is interesting to see that WinFS gave up on it rather than persisting in the effort to be efficient for small files without losing any large file performance. I guess a 10 year project is just beyond their horizon.
For us, what is exciting, is that it is all downhill from here technically. The hard stuff is done and working, and the fun stuff will be easy as well as fun. For semantics, it is the design of the semantics that is hard, and the design work is done. Actually, all the essential semantics were designed before any storage layer work started. So now, we just need to write a bunch of plugins, deal with some hairy compatibility issues that are hairy mostly because they are political, and there it will be.
Jeremy Andrews: How much effort does Namesys have to put into the support of Reiser3?
Hans Reiser: We didn't start V4 until V3 was stable. After we started V4 we hired one guy to do most of the V3 bug fixes, which were mostly in the newer journaling code, and then after a year the bug reports mostly stopped coming in. The bugs that do get reported now are always in the new features added by the SuSE guys. I am a big believer in the "let there be a stable branch of the code with no new features" model of software development. This makes me a bit of a heretic in the lkml community, but oh well.
Jeremy Andrews: What new features currently exist in V3 that you feel would have better been left out until V4?
Hans Reiser: ACLs. I'd be much happier if they had been implemented in V4, and had not added a whole set of bugs to V3. It would have cost less time to do them in V4, and they would have performed well.
Jeremy Andrews: Are there any features that have been added to V3 that you feel shouldn't have been added at all?
Hans Reiser: I think that xattrs differ from files pointlessly. If you have efficient small files, you don't need xattrs.
Jeremy Andrews: What are xattrs, and how are they related to small files?
Hans Reiser: For various reasons people have a need to assign to files attributes that are beyond those originally envisioned by the creators of Unix. One needs to have a namespace for selecting (naming) the attributes. The question is, should one use a completely separate namespace from that used to access filebodies, or just give them particular names and unify the namespaces.
This is like the question, should one have two road systems, one for trucks and one for cars.
For an operating system, its expressive power is proportional to the number of allowed combinations of its components, not the number of components. If you have two namespaces, then you have to write twice the tools, once for the data in attributes and once for the data in files, or at least modify all of them to handle both. It takes far fewer lines of OS code to just put them both in the same namespace.
So, our idea is that you should be able to do something like
for i in *; do cat ../..../owner > $i/..../owner; done
and have all the members of the current working directory acquire the same owner as another working directory.
No new "cat" command needs to be written..... but VFS/dcache work needs to be done....
Jeremy Andrews: In what ways does Reiser4 improve upon Reiser3, and other filesystems?
Hans Reiser: Reiser4 is very high performance. Much higher than its competitors for general purpose performance. Reiser4 saves a lot of space.
Reiser4 is based on plugins, which means that it is very easy to hack on. It takes more than a license to make code open, it takes an architectural design. People who have read our code tell us it is very well commented and modular. If you format for Reiser4 today, 5 years from now you'll still be able to add the latest features in the latest plugins easily.
One of the plugins that is in a race to be stable before 2.6.14 is the compression plugin. That will allow you to use half the space while increasing performance. Yes, increasing performance, because CPUs are powerful and compression algorithms are efficient these days, and if you halve your IO it is good for performance. One of the keys to making this true is that we only compress at the time we flush to disk, rather than with every write, and that means that if you have a hot set of files that fit into RAM, there is no performance loss because there is no compression going on.
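As a rough illustration of the flush-time compression idea described above, here is a minimal Python sketch. It is not Reiser4 code: the `WriteBackCache` class, its method names, and the use of `zlib` are all assumptions made for the example. It only shows that repeated writes to a hot file cost no compression work until the data is flushed.

```python
import zlib


class WriteBackCache:
    """Toy write-back cache (hypothetical): compress only when flushing to 'disk'."""

    def __init__(self):
        self.dirty = {}          # name -> uncompressed bytes held in RAM
        self.disk = {}           # name -> compressed bytes (stand-in for the disk)
        self.compress_calls = 0  # counts how often compression actually ran

    def write(self, name, data):
        # Writes only touch RAM; no compression work happens here.
        self.dirty[name] = data

    def read(self, name):
        # Hot data is served from RAM without decompressing anything.
        if name in self.dirty:
            return self.dirty[name]
        return zlib.decompress(self.disk[name])

    def flush(self):
        # The compression cost is paid once per flush, not once per write.
        for name, data in self.dirty.items():
            self.disk[name] = zlib.compress(data)
            self.compress_calls += 1
        self.dirty.clear()


cache = WriteBackCache()
for i in range(1000):
    # A "hot" file that is rewritten many times while it stays in RAM.
    cache.write("hot.log", (b"entry %d\n" % i) * 100)
cache.flush()
```

After the loop, `cache.compress_calls` is 1 even though the file was written 1000 times, and the copy kept in `cache.disk` is smaller than the data that was written.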
Jeremy Andrews: There has been some discussion about Reiser4 plugins on the lkml, and suggestions that they would best be included in the VFS layer to benefit all filesystems. What is your feeling on this?
Hans Reiser: People think that VFS is well defined, but that is only because semantics have stagnated for 30 years. Now that I am basically happy with the storage layer, we are soon going to put half our staff on semantics enhancing plugins of various kinds. If we put 3-5 guys on the task of changing semantics to handle semi-structured data queries, then VFS will suddenly become either a remote corner or something whose definition changes monthly, depending where you put your labels.
I would be pleased to see a group of programmers create their own set of rival plugins, and call their bundle of plugins a distinct filesystem from reiser4. That would mark the full success of the plugin model, and someday it will happen. If that is what is meant by move it into the VFS, that sounds good to me. If we end up with half a dozen bundles of plugins with different brands on them and each sharing half their code with the other bundles, wow, that will be people enabled to innovate just where they have ideas and to reuse everywhere else.
I think that new ideas should be introduced to Linux by first one group of programmers proving that they work, and then the other groups can decide to use them. The model of some committee of guys deciding what semantics all filesystems will have is just not natural to an open and free society. What I think does make sense is for Linus to say "Ok, I am scheduling a meeting to discuss how to distill into common code all of the new semantics for all of the filesystems, and that meeting will happen 7 years from today. Innovate away until then, and be sure to prove that the users like your semantics before the meeting."
Standards aren't for innovation as it happens, they are for innovation that has gotten so old that everybody is ready to just conform to the accepted best practices and move on to figuring out something else.
Jeremy Andrews: You mention that you want to stabilize the compression plugin by 2.6.14. What is important about 2.6.14?
Hans Reiser: Umh, anytime I can double performance and halve space usage, it is a priority for me and I want it now! ;-)
Jeremy Andrews: What are some other plugins that currently exist?
Hans Reiser: We made a decision to implement the minimal set of plugins that we needed to have to ship. Next comes the interesting stuff, which we will be able to add without requiring you to mkfs again because of plugins.
Reiser4, WinFS and Spotlight:
Jeremy Andrews: How does Reiser4 compare to the upcoming WinFS?
Hans Reiser: Reiser4 is a much more mature design, representing a 10 year effort that started with V3, and one that did not give up on the hard problems but rewrote to solve them. It is easier to work on and I expect it will be higher performance. I look forward to benchmarks.
Jeremy Andrews: How does Reiser4 compare with Apple's Spotlight?
Hans Reiser: Spotlight is neat. It is very simple, but neat.
Reiser4 adds no semantic enhancements worth speaking of so far. It only lays the foundation for semantic enhancements.
Merging Reiser4 Into The 2.6 Kernel:
Jeremy Andrews: What needs to be done to get Reiser4 merged into the 2.6 kernel?
Hans Reiser: Everything that was last requested to be done has been done and will be sent off around Friday, if more things get invented who knows?
Jeremy Andrews: Is Reiser4 something that could be broken into smallish pieces for merging, or is it something that has to be merged all at once?
Hans Reiser: All at once, mostly. We are deferring sys_reiser4 and metafiles for another day though.
Jeremy Andrews: What are metafiles used for?
Hans Reiser: Metafiles are files that are about other files. Pseudo files are files that are implemented not by storing and retrieving the data in a regular file but by the plugin calculating what it should construct for read, or performing some operation other than just writing the data somewhere in response to a write. For example, someday cat /home/reiser/mp3s/..../childcat > /dev/dsp will concatenate every file that is a child of my mp3s directory and send it to the speakers.
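To make the pseudo file idea concrete, the following Python sketch computes in user space roughly what a read of a "childcat" pseudo file might return. This is only an approximation under stated assumptions: the function name `childcat` is borrowed from the example above, while the sorted-by-name order and the choice to skip subdirectories are guesses, since the interview does not specify them.

```python
from pathlib import Path


def childcat(directory):
    """Return the concatenated contents of every regular file directly inside directory.

    Assumptions (not from the interview): children are visited in sorted
    name order, and subdirectories are skipped rather than descended into.
    """
    parts = []
    for child in sorted(Path(directory).iterdir()):
        if child.is_file():
            parts.append(child.read_bytes())
    return b"".join(parts)
```

The point of the pseudo file is that a plain `cat` would get this result from the filesystem itself, so no program would need a helper like this one.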
Someday longer away, you'll be able to use queries in the FS, and send all the blues mp3s that your dad emailed you, or all the mp3s related to "britney spears" and "spoof" to your speakers. Using cat, or other dumb programs absent of querying intelligence. There will be a very very sophisticated naming system, and all the programs in the OS will not need any complexity of their own to tap into the power of sophisticated naming.
Before my son is grown, the "find" command will seem so primitive as to be unimaginable, and his eyes will roll when old programmers tell him about it and expect him to be interested.;-)
The traditional powerful advantage of Unix was sophisticated libraries, tools, and infrastructure that allowed complex things to be done simply in Unix compared to DOS. Now when we look at the libraries and tools, the amount of resources invested into improving them is pitiful. It makes a difference. With powerful tools, small companies can write powerful apps that are better than apps with 5x the programming budget in another OS. Naming is one of the more strategic tool sets, and Namesys is going to try to push things forward here. I hope others will contribute to optimizing and enriching C libraries and other neglected areas that sorely need investment. It would be nice if someone funded a systematic review of all the base utilities to see if they could be made better. For instance, the "cp" program is used so much, and it is not enormously optimal. If some effort were made to better calculate how large its buffers should be for best performance, it could be faster. It would be even faster if we used sendfile or some such to eliminate the copying to user space entirely. I mean, why in the world does copying to and from the FS cause bytes to go to user space anyway.... Oh, and everyone is always complaining about how there is no undelete in Linux. It's not like it is technically hard to fix that..... there just is not a person who is funded to do it is all.
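The remark about copying without a round trip through user space can be sketched with Python's `os.sendfile`, which wraps the `sendfile` system call. This is a minimal sketch, not how any real `cp` is implemented: the function name `copy_in_kernel` and the 1 MiB chunk size are illustrative choices, and it assumes a Linux kernel recent enough (2.6.33 or later) to accept a regular file as the output descriptor.

```python
import os


def copy_in_kernel(src_path, dst_path, chunk=1 << 20):
    """Copy src_path to dst_path with os.sendfile; returns the number of bytes copied.

    The file data moves between the two descriptors inside the kernel, so it
    is never read into a Python (user-space) buffer. Hypothetical helper;
    assumes Linux with file-to-file sendfile support.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        remaining = os.fstat(src.fileno()).st_size
        offset = 0
        while remaining > 0:
            # sendfile may transfer fewer bytes than requested, so loop.
            sent = os.sendfile(dst.fileno(), src.fileno(), offset, min(chunk, remaining))
            if sent == 0:
                break  # source ended early (for example, it was truncated)
            offset += sent
            remaining -= sent
    return offset
```

A call such as `copy_in_kernel("a.iso", "b.iso")` copies the file in 1 MiB requests; the chunk size here is arbitrary and would need the kind of benchmarking described earlier in the interview to tune.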
As for reiserfs, we now have a very very rich collection of storage layer tools, so now innovation in reiserfs becomes cheaper than innovation in any other fs. I hope people outside Namesys sense that, and join us in a big coding dogpile attack on WinFS. Could be a lot of fun.
Perhaps now that the basic task of copying Unix is fairly complete, our community will naturally turn to trying to substantially exceed it in functionality and innovation. The key to that will be how receptive we are to kids in their 20s with bright ideas and not yet a lot of polish to them. That is what will determine whether Linux lasts a long time --- whether it is less hassle to contribute to Linux than to anything else.
If there are young programmers out there reading this, please consider that we try to make it socially easier to work with us and get patches accepted by me than it is with other filesystems. If I don't like your patch, I will tell you why, not ignore you. Most patches I get are pretty good though.
Future Reiser4 Efforts:
Jeremy Andrews: What weaknesses does Reiser4 have?
Hans Reiser: Our fsync performance is not optimized yet, and will be bad until it is optimized. Our performance for fully random modifications will be bad until we ship a repacker.
Jeremy Andrews: What types of applications are impacted by poor fsync performance?
Hans Reiser: Databases.
Jeremy Andrews: How much effort is involved in optimizing fsync performance?
Hans Reiser: 3 man months.
Jeremy Andrews: What types of applications require high performance for fully random modifications?
Hans Reiser: Obscure ones, but they do exist. Databases stored in the FS for which access patterns cannot be made less than fully random. Those are rare.
Almost always, fully random filenames are due to someone assuming the filesystem can't handle a large directory, and hashing the name. Then one just asks them to generate names that correlate somewhat with the usage pattern, and the performance goes up.
Oh, if the randomness is writes within the file rather than across files, that is more common and less fixable. Our repacker will address that usage pattern. Random reads are no problem at all.
Jeremy Andrews: What is a repacker?
Hans Reiser: The repacker goes through the fs, starting with the leftmost blocks in the tree and shoving them as far to the left on the partition as they will go, moving ~4MB at a time, and then when it reaches the rightmost block in the tree it changes direction and starts shoving the rightmost block in the tree as far to the right on the partition as it will go. After enough iterations, the FS is fully sorted.
As it goes, it squishes all the items to the left or the right, meaning that intra-node space becomes tightly packed as well as inter-node space.
The same code handles resizing the fs online. All of this code uses the transaction manager code so that it is all performed online.
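The back-and-forth sweep described above can be imitated with a very small toy model in Python. Everything here is an assumption made for illustration: the partition is a list of slots, each block is represented by its unique position in tree order (an integer), `None` marks a free slot, and the function names are hypothetical. The model ignores the ~4MB batching, intra-node packing, and the transaction manager, and its pass limit is a safety bound rather than a claim about how many sweeps the real repacker needs.

```python
def shove_left(slots):
    """Leftward pass: in tree order, move each block to the lowest free slot to its left."""
    for key in sorted(k for k in slots if k is not None):
        pos = slots.index(key)
        for free in range(pos):
            if slots[free] is None:
                slots[free], slots[pos] = key, None
                break


def shove_right(slots):
    """Rightward pass: in reverse tree order, move each block to the highest free slot to its right."""
    for key in sorted((k for k in slots if k is not None), reverse=True):
        pos = slots.index(key)
        for free in range(len(slots) - 1, pos, -1):
            if slots[free] is None:
                slots[free], slots[pos] = key, None
                break


def is_sorted(slots):
    """True when the blocks appear on the partition in tree order."""
    keys = [k for k in slots if k is not None]
    return keys == sorted(keys)


def repack(slots, max_passes=100):
    """Alternate leftward and rightward passes until the layout is in tree order."""
    slots = list(slots)
    passes = 0
    while not is_sorted(slots) and passes < max_passes:
        shove_left(slots)
        shove_right(slots)
        passes += 1
    shove_left(slots)  # finish packed toward the left end of the partition
    return slots


layout = repack([2, 1, 0, None])
```

For the starting layout `[2, 1, 0, None]` the sweeps end with `layout == [0, 1, 2, None]`: the blocks sit in tree order with the free space collected at the right.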
Jeremy Andrews: What plans do you have for future filesystems beyond Reiser4?
Hans Reiser: We will systematically implement a set of plugins that implement the semantics described at www.namesys.com/whitepaper.html. These semantics support semi-structured data --- data which has structure, but not necessarily table structure. Search engines (mostly) obliterate structure, database and hierarchical models impose structure, we have less ego and propose to match the structure inherent in the information rather than trying to reshape it to fit our model. That turns out to require a flexibility that is quite valuable for very large scale information systems.
One of my new hires was discussing art and literature with me, and made the mistake of saying that my filesystem is not subversive like the etchings and fiction I would like to have more time for writing. I started to explain that actually, it is the most significantly subversive thing I could imagine to write, because it will allow information to be what it is rather than submit to the shaping and molding of those who created and imposed a system upon it. I was laughing all the way through the explanation of this, and so was he, but we both knew I was quite serious, which simply made us laugh all the more.
Jeremy Andrews: What time frame do you estimate before these semantics are implemented?
Hans Reiser: 3-5 years, but thanks to plugins the features can dribble in one at a time as they become ready.
Jeremy Andrews: Is any of the "subversive fiction" you mentioned available online?
Hans Reiser: No, I stopped work on it for the FS after a first draft. I still want to finish it though. I just hope it does not become true before I publish it.;-)
Hans Reiser: I wrote about a world where government of the earth by a Muslim theocracy had just been overthrown in a revolution, the new rulers lacked military skills and aliens were about to attack, and one of the children of the overthrown ruler was militarily gifted and trying to use intrigue and the danger of aliens to regain power. This character allies with a lover who wants to obsolete humanity with AIs and genetic engineering. That character has a belief that AIs and genetically engineered creatures are our children, that it is nature's way that our children will obsolete us, and we should find the moral courage to embrace that. One of the things I need to finish is creating some political conflict surrounding that. In the novel, the genetically engineered AI assisted entities are better at combat, and the existence of humanity is at risk militarily.
Jeremy Andrews: Sounds interesting. How much of it is written?
Hans Reiser: I wrote a first draft, but I need to write a second draft because while the beginning makes me quite happy with it, the farther I get into it the more it needs a rewrite, with the rewrite needing to start with a careful reoutlining that will ensure the whole thing is tight with well-structured content. As soon as my characters get power the novel loses its pull, so I need to depower them, and make it a teetering power struggle the whole novel. That means I need to develop their enemies as interesting characters as much as I develop them, and then the whole writing process will naturally fall into a plausible development of a see-saw for power. I think that being away from it for 15 years will make it easier to do that, as I remember that when I stopped to focus on the filesystem I was having trouble generating a gripping struggle for power story, and now I don't think that will be so hard for me. I know a lot more about politics now, and now I should be able to include some plausible stories of what it is like to have people working for you. I am going to put in at least one story about someone seeking to protect the main character in a way that he does not want and does not need to be protected, and is embarrassed by. And now I can put in some plausible bribery scenes to add tension too.....
I don't want to write a novel like Alan Dean Foster writes, and that is what I have now. I read a number of his novels before I figured out that he almost always has a great first chapter and then the book peters out. If I could buy a book consisting of all his first chapters, I would, because they are often simply excellent. I wish some editor would kick his butt, and make him write the novel he is capable of.
A novel I have often thought about writing is about a character who is a tourist in a police state, he has an interpreter, and he does not know if she is interpreting truthfully. He thinks there may be a great conspiracy against him, and since it is a police state it is not unreasonable to suspect his interpreter is an intelligence agent, but he cannot be quite sure of any of his suspicions, not even at the end. That would require a delicate touch to write.....
Jeremy Andrews: What other areas of the kernel are you involved with, beyond filesystems?
Hans Reiser: One filesystem is enough to keep me busy, really. ;-)
Jeremy Andrews: What advice would you offer those that are interested in learning more about how filesystems work?
Hans Reiser: During the next 5 years filesystems will change tremendously. For 30 years only the storage layer has been changing, and people have been afraid to change the important stuff. Now all of a sudden there are three teams attempting major semantic enhancements, Dominic Giampaolo's team at Apple, WinFS, and us. This is a fun time to dive in. The easiest thing to do is to write a reiser4 plugin. The code is free, the design makes a plugin something a student can attempt (assuming you keep it simple).
Jeremy Andrews: Is there a simple example plugin available as a model, or other documentation available to learn how to write a Reiser4 plugin?
Hans Reiser: I think for most people, just using the default unix regular file plugin or the default directory plugin as a starting point makes sense. Both are well commented.
Jeremy Andrews: It sounds like you've got some excellent plans that will keep you busy for many years to come. Thanks for spending so much time with me on this interview!