login
Header Space

 
 

Linux: DRBD, The Distributed Replicated Block Device

July 23, 2007 - 2:53pm
Submitted by Jeremy on July 23, 2007 - 2:53pm.
Linux news

Lars Ellenberg started an effort to get DRBD, the Distributed Replicated Block Device merged into the Linux kernel. When asked for clarification as to what it was, Lars explained, "think of it as RAID1 over TCP. Typically you have one Node in Primary, the other as Secondary, replication target only. But you can also have both Active, for use with a cluster file system." Earlier in the thread he described it as "a stacked block device driver".

Much of the initial review focused on the need to comply with kernel coding style guidelines. Kyle Moffett offered a much lengthier review, noting at one point in the code, "how about fixing this to actually use proper workqueues or something instead of this open-coded mess?" Lars replied, "unlikely to happen 'right now'. But it is on our todo list..." Jens Axboe added, "but stuff like that is definitely a merge show stopper, jfyi".


From:	Lars Ellenberg [email blocked]
Subject: [DRIVER SUBMISSION] DRBD wants to go mainline
Date:	Sat, 21 Jul 2007 22:38:19 +0200


	DRBD wants to go mainline.
	please have a look at the "for-linus" branch of
	git://git.drbd.org/home/git/linux-drbd.git.

Longer story:

Hi there,

for lkml readers who don't know about DRBD yet:
DRBD is the distributed replicated block device,
a stacked block device driver as developed mainly
by Lars Ellenberg and Philipp Reisner
both employed at Linbit.

We implement shared-disk semantics in a shared-nothing cluster.

yes, we even support cluster file systems (OCFS2, GFS2)
[support for this is somewhat clumsy, still,
mainly userland/setup/convenience issues.
but that will improve.]

to actually use it you will need some userland tools
(drbdadm, drbdmeta, drbdsetup).

most unfortunate current limitations:
we support only 2 nodes directly;
but we are working on that, too :)

I'm going to give a presentation/BoF about it
at linux-conf.eu in September.

Currently we are at "stable version 8.0.y" (y == 4 right now,
8.0.5 to be released within a few days.

we have a project home page http://www.drbd.org
we have some mention on the linux-HA project http://www.linux-ha.org/DRBD
(both are slightly out-dated, unfortunately...)
our subversion repository (yeah, sorry, there has been no git back then)
is at http://svn.drbd.org/drbd/branches/drbd-8.0

and, we have a user base...
[drbd-user readers who are in a position to do so:
please help us get this into mainline! ]

Now, finally, we have a git repository of the kernel module part, too:
 git://git.drbd.org/home/git/linux-drbd.git

it has two branches currently,
 master, which tracks ...torvalds/linux-2.6.git, and 
 for-linus, which tracks the cleaned up patch meant for inclusin in
mainline.
so if interessted please do
 git fetch git://git.drbd.org/home/git/linux-drbd.git +for-linus:drbd-8.0
to get a local branch named drbd-8.0 where you can inspect it.
see below for a diffstat of master..for-linus.

Jens, Andrew, anyone: please review,
and give me advice how to proceed from here.

technically, would you prefer
frequent rebase of the "for-linus" branch,
or rather merges back and forth?

Cheers,

	Lars




The only not drbd specific files touched
are listed explicitly here:

=====================
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index a4a3119..76f84bc 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -272,6 +272,8 @@ config BLK_DEV_CRYPTOLOOP
 	  instead, which can be configured to be on-disk compatible with the
 	  cryptoloop device.
 
+source "drivers/block/drbd/Kconfig"
+
 config BLK_DEV_NBD
 	tristate "Network block device support"
 	depends on NET
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 819c829..b27e16a 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -31,3 +31,4 @@ obj-$(CONFIG_BLK_DEV_UB)	+= ub.o
 
 obj-$(CONFIG_XEN_BLKDEV_FRONTEND)	+= xen-blkfront.o
 obj-$(CONFIG_LGUEST_GUEST)	+= lguest_blk.o
+obj-$(CONFIG_BLK_DEV_DRBD)	+= drbd/
diff --git a/include/linux/major.h b/include/linux/major.h
index 0cb9805..2dbd7c1 100644
--- a/include/linux/major.h
+++ b/include/linux/major.h
@@ -145,6 +145,8 @@
 #define UNIX98_PTY_MAJOR_COUNT	8
 #define UNIX98_PTY_SLAVE_MAJOR	(UNIX98_PTY_MASTER_MAJOR+UNIX98_PTY_MAJOR_COUNT)
 
+#define DRBD_MAJOR		147
+
 #define RTF_MAJOR		150
 #define RAW_MAJOR		162
=====================

(yes, block major 147 is assigned by LANANA,
 as documented in Documentation/devices.txt already)

git-diff master..for-linus | diffstat
 drivers/block/Kconfig              |    2 
 drivers/block/Makefile             |    1 
 drivers/block/drbd/Kconfig         |   32 
 drivers/block/drbd/Makefile        |    7 
 drivers/block/drbd/drbd_actlog.c   | 1456 ++++++++++++++++
 drivers/block/drbd/drbd_bitmap.c   | 1184 +++++++++++++
 drivers/block/drbd/drbd_buildtag.c |    6 
 drivers/block/drbd/drbd_int.h      | 1894 ++++++++++++++++++++
 drivers/block/drbd/drbd_main.c     | 3249 +++++++++++++++++++++++++++++++++++
 drivers/block/drbd/drbd_nl.c       | 1847 ++++++++++++++++++++
 drivers/block/drbd/drbd_proc.c     |  267 ++
 drivers/block/drbd/drbd_receiver.c | 3356 +++++++++++++++++++++++++++++++++++++
 drivers/block/drbd/drbd_req.c      | 1169 ++++++++++++
 drivers/block/drbd/drbd_req.h      |  320 +++
 drivers/block/drbd/drbd_strings.c  |  106 +
 drivers/block/drbd/drbd_worker.c   | 1046 +++++++++++
 drivers/block/drbd/drbd_wrappers.h |  144 +
 drivers/block/drbd/lru_cache.c     |  370 ++++
 drivers/block/drbd/lru_cache.h     |  147 +
 include/linux/drbd.h               |  284 +++
 include/linux/drbd_config.h        |   55 
 include/linux/drbd_limits.h        |  124 +
 include/linux/drbd_nl.h            |  105 +
 include/linux/drbd_tag_magic.h     |   78 
 include/linux/major.h              |    2 
 25 files changed, 17251 insertions(+)

 

-- 
: Lars Ellenberg                            Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :


From: Jan Engelhardt [email blocked] To: Lars Ellenberg [email blocked] Subject: Re: [DRIVER SUBMISSION] DRBD wants to go mainline Date: Sat, 21 Jul 2007 23:17:43 +0200 (CEST) On Jul 21 2007 22:38, Lars Ellenberg wrote: > >We implement shared-disk semantics in a shared-nothing cluster. If nothing is shared, the disk is not shared, but got shared-disk semantics? A little confusing. Jan
From: Lars Ellenberg [email blocked] To: Jan Engelhardt [email blocked] Subject: Re: [DRIVER SUBMISSION] DRBD wants to go mainline Date: Sun, 22 Jul 2007 00:43:00 +0200 On Sat, Jul 21, 2007 at 11:17:43PM +0200, Jan Engelhardt wrote: > > On Jul 21 2007 22:38, Lars Ellenberg wrote: > > > >We implement shared-disk semantics in a shared-nothing cluster. > > If nothing is shared, the disk is not shared, but got shared-disk > semantics? A little confusing. Think of it as RAID1 over TCP. Typically you have one Node in Primary, the other as Secondary, replication target only. But you can also have both Active, for use with a cluster file system. So the semantics are like you have two (to come: N) nodes and a shared disk. only that there is not one shared disk, but two (N) replicated ones. btw, regarding linux kernel CodingStyle issues. scripts/checkpatch.pl reports 2000+ lines of complaints. I'm working on it now, it is mostly whitespace (lame excuse: phil grew up with emacs and gnu coding style, sorry for that). -- : Lars Ellenberg Tel +43-1-8178292-0 : : LINBIT Information Technologies GmbH Fax +43-1-8178292-82 : : Vivenotgasse 48, A-1120 Vienna/Europe http://www.linbit.com :
From: Jan Engelhardt [email blocked] Subject: Re: [DRIVER SUBMISSION] DRBD wants to go mainline Date: Sun, 22 Jul 2007 11:06:59 +0200 (CEST) On Jul 22 2007 00:43, Lars Ellenberg wrote: >On Sat, Jul 21, 2007 at 11:17:43PM +0200, Jan Engelhardt wrote: >> On Jul 21 2007 22:38, Lars Ellenberg wrote: >> > >> >We implement shared-disk semantics in a shared-nothing cluster. >> >> If nothing is shared, the disk is not shared, but got shared-disk >> semantics? A little confusing. > >Think of it as RAID1 over TCP. And what does it do better than raid1-over-NBD? (Which is already N-disk, and, logically, seems to support cluster filesystems) >Typically you have one Node in Primary, the other as Secondary, >replication target only. >But you can also have both Active, for use with a cluster file system. > >So the semantics are like you have >two (to come: N) nodes and a shared disk. >only that there is not one shared disk, >but two (N) replicated ones. > >btw, regarding linux kernel CodingStyle issues. >scripts/checkpatch.pl reports 2000+ lines of complaints. >I'm working on it now, it is mostly whitespace (lame excuse: phil grew >up with emacs and gnu coding style, sorry for that). > Jan
From: Lars Ellenberg [email blocked] Subject: Re: [DRIVER SUBMISSION] DRBD wants to go mainline Date: Sun, 22 Jul 2007 16:03:57 +0200 On Sun, Jul 22, 2007 at 11:06:59AM +0200, Jan Engelhardt wrote: > > On Jul 22 2007 00:43, Lars Ellenberg wrote: > >On Sat, Jul 21, 2007 at 11:17:43PM +0200, Jan Engelhardt wrote: > >> On Jul 21 2007 22:38, Lars Ellenberg wrote: > >> > > >> >We implement shared-disk semantics in a shared-nothing cluster. > >> > >> If nothing is shared, the disk is not shared, but got shared-disk > >> semantics? A little confusing. > > > >Think of it as RAID1 over TCP. > > And what does it do better than raid1-over-NBD? (Which is already N-disk, > and, logically, seems to support cluster filesystems) DRBD has built-in logic to track which copy of the data is the most recent. DRBD has resource-level-fencing in place (though we can improve on that). DRBD deals with concurrent writes of the participating nodes correctly. "N-disk" is not the question, you can stack drbd on top of md, lvm, anything. N-nodes is the interessting thing. -- : Lars Ellenberg Tel +43-1-8178292-0 : : LINBIT Information Technologies GmbH Fax +43-1-8178292-82 : : Vivenotgasse 48, A-1120 Vienna/Europe http://www.linbit.com :
From: Andi Kleen [email blocked] To: Lars Ellenberg [email blocked] Subject: Re: [DRIVER SUBMISSION] DRBD wants to go mainline Date: 22 Jul 2007 01:50:09 +0200 Lars Ellenberg [email blocked] writes: > > Jens, Andrew, anyone: please review, > and give me advice how to proceed from here. The standard procedure would be to post all the source code in logical pieces on the list for review. Then iterate until all comments are addressed. -Andi
From: Jens Axboe [email blocked] To: Andi Kleen [email blocked] Subject: Re: [DRIVER SUBMISSION] DRBD wants to go mainline Date: Sun, 22 Jul 2007 07:52:36 +0200 On Sun, Jul 22 2007, Andi Kleen wrote: > Lars Ellenberg [email blocked] writes: > > > > Jens, Andrew, anyone: please review, > > and give me advice how to proceed from here. > > The standard procedure would be to post all the source code in logical > pieces on the list for review. Then iterate until all comments are > addressed. Yep, cleanup the style issues (that make sense) from checkpatch and then psot as a series of patches that can be reviewed. Linking to a git tree wont get you very far. -- Jens Axboe
From: Lars Ellenberg [email blocked] Subject: Re: [DRIVER SUBMISSION] DRBD wants to go mainline Date: Sun, 22 Jul 2007 15:58:14 +0200 On Sun, Jul 22, 2007 at 07:52:36AM +0200, Jens Axboe wrote: > On Sun, Jul 22 2007, Andi Kleen wrote: > > Lars Ellenberg [email blocked] writes: > > > > > > Jens, Andrew, anyone: please review, > > > and give me advice how to proceed from here. > > > > The standard procedure would be to post all the source code in logical > > pieces on the list for review. Then iterate until all comments are > > addressed. > > Yep, cleanup the style issues (that make sense) from checkpatch and then > psot as a series of patches that can be reviewed. Linking to a git tree > wont get you very far. it got me far enough, for the first try, anyways :-) I did not spam the lkml with patches, and still got some very useful advice (no idea how I could overlook the checkpatch.pl complaints). If each patch of a series needs to compile and work, there will probably only one 17kB patch... it is difficult to split one module into a series of patches. Or am I missing something? may I bother you again to comment on this, please: I am now down to 5 CHECK: memory barrier without comment these are directly connected with our homegrown kernel thread stuff. will vanish as soon as we convert to kthread.h API. 4 CHECK: spinlock_t definition without comment 3 CHECK: Use #include <linux/uaccess.h> instead of <asm/uaccess.h> 3 CHECK: if this code is redundant consider removing it 2 CHECK: Use #include <linux/types.h> instead of <asm/types.h> need to check those, still. 72 ERROR: braces {} are not necessary for single statement blocks one branch needs them, the othe does not. what now? CodingStyle and checkpatch.pl disagree. 13 ERROR: no space between function name and open parenthesis '(' this is about our ERR_IF() macro... suggestions, anyone? do need I to explicitly write these out? 8 ERROR: Macros with multiple statements should be enclosed in a do - while loop 1 ERROR: Macros with complex values should be enclosed in parenthesis these are "netlink packet definition language" in drbd_nl.h, which is sort of clean, and preprocessor magic in drbd_nl.c. suggestions how to handle this cleanly, without making it more ugly? autogenerate code by other means? write it out by hand, and lose the nice and clean drbd_nl.h? 1 ERROR: Don't use kernel_thread(): see Documentation/feature-removal-schedule.txt yes. working on that. will take some days, though. 1 ERROR: Missing Signed-off-by: line(s) 94 WARNING: declaring multiple variables together should be avoided int snr, enr; does this really need to become two lines? 33 WARNING: line over 80 characters hmmm. get more ugly... probably need some helper functions and temp variables? 4 WARNING: multiple assignments should be avoided x = y = 0; is sometimes a convenient initialization. you don't like it? -- : Lars Ellenberg Tel +43-1-8178292-0 : : LINBIT Information Technologies GmbH Fax +43-1-8178292-82 : : Vivenotgasse 48, A-1120 Vienna/Europe http://www.linbit.com :
From: Andi Kleen [email blocked] Subject: Re: [DRIVER SUBMISSION] DRBD wants to go mainline Date: Sun, 22 Jul 2007 16:56:36 +0200 On Sun, Jul 22, 2007 at 03:58:14PM +0200, Lars Ellenberg wrote: > On Sun, Jul 22, 2007 at 07:52:36AM +0200, Jens Axboe wrote: > > On Sun, Jul 22 2007, Andi Kleen wrote: > > > Lars Ellenberg [email blocked] writes: > > > > > > > > Jens, Andrew, anyone: please review, > > > > and give me advice how to proceed from here. > > > > > > The standard procedure would be to post all the source code in logical > > > pieces on the list for review. Then iterate until all comments are > > > addressed. > > > > Yep, cleanup the style issues (that make sense) from checkpatch and then > > psot as a series of patches that can be reviewed. Linking to a git tree > > wont get you very far. > > it got me far enough, for the first try, anyways :-) > I did not spam the lkml with patches, and still got some very useful > advice (no idea how I could overlook the checkpatch.pl complaints). > > If each patch of a series needs to compile and work, That's not needed for review. The final submission can then be in a single patch. It's just impossible to read a really large single patch. Ideally you space the posting also out over time.to not tax the review resources too much. Another standard trick for compileable patch series is to only add the Kconfig at the end. > 13 ERROR: no space between function name and open parenthesis '(' > this is about our ERR_IF() macro... > suggestions, anyone? > do need I to explicitly write these out? Preferable yes. > > 8 ERROR: Macros with multiple statements should be enclosed in a do - while loop > 1 ERROR: Macros with complex values should be enclosed in parenthesis > these are "netlink packet definition language" in drbd_nl.h, > which is sort of clean, and preprocessor magic in drbd_nl.c. > suggestions how to handle this cleanly, > without making it more ugly? > autogenerate code by other means? > write it out by hand, and lose the nice and clean drbd_nl.h? No, you can ignore it then. > 94 WARNING: declaring multiple variables together should be avoided > int snr, enr; > does this really need to become two lines? Probably not. > > 33 WARNING: line over 80 characters > hmmm. get more ugly... > probably need some helper functions and temp variables? Yes smaller functions are better in general. -Andi
From: Kyle Moffett [email blocked] To: Lars Ellenberg [email blocked] Subject: Re: [DRIVER SUBMISSION] DRBD wants to go mainline Date: Sun, 22 Jul 2007 21:32:02 -0400 Ok, I didn't have a chance to get through anywhere near all of it, but here's my comments so far. I didn't really go through things in any particular order but most of these comments are about your drbd_int.h header file. Hopefully a lot of the comments will be useful to fix similar/identical problems in your C files. First of all, if you could break this up into chunks (even if they aren't useful individually) just to make it easier to review. Divisions like "This code does on-disk bitmap management", and "This code does network protocol encoding/decoding" would be extremely helpful when digging through this stuff. I ended up only really doing a cursory review for low-level/style issues, since without the bigger picture (and without the fixed style issues and macro cleanups) it's very hard to really give a good high-level review. Cheers, Kyle Moffett +drbd-objs := drbd_buildtag.o drbd_bitmap.o drbd_proc.o \ + drbd_worker.o drbd_receiver.o drbd_req.o drbd_actlog.o \ + lru_cache.o drbd_main.o drbd_strings.o drbd_nl.o + +obj-$(CONFIG_BLK_DEV_DRBD) += drbd.o Don't use foo-objs, use foo-y instead. +#undef DEVICE_NAME +#define DEVICE_NAME "drbd" This is never actually defined/redefined anywhere else. Just hardcode the "drbd" in your printk strings and save yourself 5-6 characters per use. +/* I don't remember why XCPU ... + * This is used to wake the asender, + * and to interrupt sending the sending task + * on disconnect. + */ +#define DRBD_SIG SIGXCPU + +/* This is used to stop/restart our threads. + * Cannot use SIGTERM nor SIGKILL, since these + * are sent out by init on runlevel changes + * I choose SIGHUP for now. + * + * FIXME btw, we should register some reboot notifier. + */ +#define DRBD_SIGKILL SIGHUP Don't use signals between kernel threads, use proper primitives like notifiers and waitqueues, which means you should also probably switch away from kernel_thread() to the kthread_*() APIs. Also you should fix this FIXME or remove it if it no longer applies:-D. +#ifdef PARANOIA +# define PARANOIA_BUG_ON(x) BUG_ON(x) +#else +# define PARANOIA_BUG_ON(x) +#endif This is only ever used in one place for a simple != test: + PARANOIA_BUG_ON(w != &mdev->resync_work); Just delete PARANOIA_BUG_ON and convert the above to a straight BUG_ON() +#define STATIC +#define STATIC static These two lines are found in different files, but the symbol "STATIC" isn't used anywhere. Just get rid of it. +/* handy macro: DUMPP(somepointer) */ +#define DUMPP(A) ERR( #A " = %p in %s:%d\n", (A), __FILE__, __LINE__); [...] +/* Info: do not remove the spaces around the "," before ## + * Otherwise this is not portable from gcc-2.95 to gcc-3.3 */ +#define PRINTK(level, fmt, args...) \ + printk(level DEVICE_NAME "%d: " fmt, \ + mdev->minor , ##args) + +#define ALERT(fmt, args...) PRINTK(KERN_ALERT, fmt , ##args) [...] No more custom debugging macros please, we have plenty of standardized ones in the kernel already (and all togther too many nonstandardized ones). Please take a good look at dev_printk, etc to make some of your printk()s shorter, but don't add more icky macros. Also gcc < 3.1 is now unsupported, so please remove gcc-2.95 portability comments/cruft (although in this case the code itself doesn't need changing, just the comments). +/* see kernel/printk.c:printk_ratelimit + * macro, so it is easy do have independend rate limits at different locations + * "initializer element not constant ..." with kernel 2.4 :( + * so I initialize toks to something large + */ +#define DRBD_ratelimit(ratelimit_jiffies, ratelimit_burst) \ Any particular reason you can't just use printk_ratelimit for this? Also you should remove any linux 2.4-related code/comments as they won't apply for code submitted to 2.6 mainline. +#ifdef DBG_ASSERTS +extern void drbd_assert_breakpoint(struct drbd_conf *, char *, char *, int ); +# define D_ASSERT(exp) if (!(exp)) \ + drbd_assert_breakpoint(mdev, #exp, __FILE__, __LINE__) +#else +# define D_ASSERT(exp) if (!(exp)) \ + ERR("ASSERT( " #exp " ) in %s:%d\n", __FILE__, __LINE__) +#endif +#define ERR_IF(exp) if (({ \ + int _b = (exp) != 0; \ + if (_b) ERR("%s: (" #exp ") in %s:%d\n", \ + __func__, __FILE__, __LINE__); \ + _b; \ + })) Yuck, more debugging macros. +/* Defines to control fault insertion */ +enum { + DRBD_FAULT_MD_WR = 0, /* meta data write */ + DRBD_FAULT_MD_RD, /* read */ + DRBD_FAULT_RS_WR, /* resync */ + DRBD_FAULT_RS_RD, + DRBD_FAULT_DT_WR, /* data */ + DRBD_FAULT_DT_RD, + DRBD_FAULT_DT_RA, /* data read ahead */ + + DRBD_FAULT_MAX, +}; We have some existing failure-injection code, any chance you could rip out your custom logic and just plug that? I haven't looked over it, though, so I can't really offer any useful suggestions about it. +#include <linux/stringify.h> Put the include at the top of the file, please? +/* integer division, round _UP_ to the next integer */ +#define div_ceil(A, B) ( (A)/(B) + ((A)%(B) ? 1 : 0) ) +/* usual integer division */ +#define div_floor(A, B) ( (A)/(B) ) Your div_ceil function looks OK but I think we already have a standard one somewhere. You might remove that div_floor function, though, as it isn't used anywhere. +/* + * Compatibility Section + *************************/ + +#define LOCK_SIGMASK(task, flags) spin_lock_irqsave(&task->sighand->siglock, flags) +#define UNLOCK_SIGMASK(task, flags) spin_unlock_irqrestore(&task->sighand->siglock, flags) +#define RECALC_SIGPENDING() recalc_sigpending(); These aren't used anywhere, remove please? +#if defined(DBG_SPINLOCKS) && defined(__SMP__) +# define MUST_HOLD(lock) if (!spin_is_locked(lock)) { ERR("Not holding lock! in %s\n", __FUNCTION__ ); } +#else +# define MUST_HOLD(lock) +#endif Why not just open-code "WARN_ON_SMP(!spin_is_locked(lock))" in the few places this is used? Alternatively if you need this to actually BUG(), you might either add a BUG_ON_SMP() to include/asm-generic/bug.h or add a helper to include/linux/spinlock.h: #if defined(CONFIG_SMP) && defined(CONFIG_DEBUG_SPINLOCK) # define spin_lock_musthold(lock) BUG_ON(!spin_is_locked(lock)) #else # define spin_lock_musthold(lock) do { } while(0) #endif +#define SET_MDEV_MAGIC(x) \ + ({ typecheck(struct drbd_conf*, x); \ + (x)->magic = (long)(x) ^ DRBD_MAGIC; }) +#define IS_VALID_MDEV(x) \ + ( typecheck(struct drbd_conf*, x) && \ + ((x) ? (((x)->magic ^ DRBD_MAGIC) == (long)(x)):0)) SET_MDEV_MAGIC is only used in one place, can't you open-code it there? Even if it were used lots of places a little inline function would handle the "typecheck" for you and be easier to read. The IS_VALID_MDEV() isn't used anywhere, so delete it. +enum Drbd_thread_state { + None, + Running, + Exiting, + Restarting +}; + +struct Drbd_thread { + spinlock_t t_lock; + struct task_struct *task; + struct completion startstop; + enum Drbd_thread_state t_state; + int (*function) (struct Drbd_thread *); + struct drbd_conf *mdev; +}; As somebody already mentioned, you need to switch from kernel_thread to kthread_* helpers. +/* + * Having this as the first member of a struct provides sort of "inheritance". + * "derived" structs can be "drbd_queue_work()"ed. + * The callback should know and cast back to the descendant struct. + * drbd_request and Tl_epoch_entry are descendants of drbd_work. + */ +struct drbd_work; +typedef int (*drbd_work_cb)(struct drbd_conf *, struct drbd_work *, int cancel); +struct drbd_work { + struct list_head list; + drbd_work_cb cb; +}; + +struct drbd_barrier; +struct drbd_request { + struct drbd_work w; Eww, please don't do opencoded casting of these. You should use the proper container_of() macro instead: > struct foo { > int a; > }; > struct bar { > int b; > struct foo thefoo; > }; > void mycallback(struct foo *myfoo) > { > struct bar *mybar = container_of(myfoo, struct bar, thefoo); > process_a_bar(mybar); > } It's not *much* better typechecking, but it's at least a little. It also lets you put a struct drbd_work at a non-zero offset inside of some other struct, as container_of() does internal subtraction of the offsetof(). +/* THINK maybe we actually want to use the default "event/%s" worker threads + * or similar in linux 2.6, which uses per cpu data and threads. + * + * To be general, this might need a spin_lock member. + * For now, please use the mdev->req_lock to protect list_head, + * see drbd_queue_work below. + */ +struct drbd_work_queue { + struct list_head q; + struct semaphore s; /* producers up it, worker down()s it */ + spinlock_t q_lock; /* to protect the list. */ +}; Umm, how about fixing this to actually use proper workqueues or something instead of this open-coded mess? +/* for sync_conf and other types... */ +#define PACKET(name, number, fields) struct name { fields }; +#define INTEGER(pn, pr, member) int member; +#define INT64(pn, pr, member) __u64 member; +#define BIT(pn, pr, member) unsigned member : 1; +#define STRING(pn, pr, member, len) \ + unsigned char member[len]; int member ## _len; +#include "linux/drbd_nl.h" The only light at the end of this tunnel is the greedily burning fires of Macro Hell(TM). Isn't there any better way to do this than all those horrid macros? I can sorta see what drbd_nl.h is trying to do, but the way it's included multiple times with all those insane macros defined makes it really hard to mentally parse. +#ifdef PARANOIA + long magic; +#endif Any particular reason you can't just unconditionally use a magic number? It's little more than an extra word per device and a couple simple integer tests. +/* returns 1 if it was successfull, + * returns 0 if there was no data socket. + * so wherever you are going to use the data.socket, e.g. do + * if (!drbd_get_data_sock(mdev)) + * return 0; + * CODE(); + * drbd_put_data_sock(mdev); + */ +static inline int drbd_get_data_sock(struct drbd_conf *mdev) +{ + down(&mdev->data.mutex); + /* drbd_disconnect() could have called drbd_free_sock() + * while we were waiting in down()... */ + if (unlikely(mdev->data.socket == NULL)) { + up(&mdev->data.mutex); + return 0; + } + return 1; +} Any reason this can't be open-coded? The extra unlock logic should just be moved into the cleanup handlers for the appropriate function. It looks like you are using the semaphore as a simple binary mutex, so please switch this to "struct mutex" as well. +#if BITS_PER_LONG == 32 +#define LN2_BPL 5 +#define cpu_to_lel(A) cpu_to_le32(A) +#define lel_to_cpu(A) le32_to_cpu(A) +#elif BITS_PER_LONG == 64 +#define LN2_BPL 6 +#define cpu_to_lel(A) cpu_to_le64(A) +#define lel_to_cpu(A) le64_to_cpu(A) +#else +#error "LN2 of BITS_PER_LONG unknown!" +#endif Hrm, any reason you can't just make the bitmap an array of __u64 and always use the cpu_to_le64()/le64_to_cpu() functions? +/* I want the packet to fit within one page + * THINK maybe use a special bitmap header, + * including offset and compression scheme and whatnot + * Do not use PAGE_SIZE here! Use a architecture agnostic constant! + */ +#define BM_PACKET_WORDS ((4096-sizeof(struct Drbd_Header))/sizeof(long)) Yuck. Definitely use PAGE_SIZE here, so at least if it's broken on an arch with multiple page sizes, somebody can grep for PAGE_SIZE to fix it. It also means that on archs/configs with 8k or 64k pages you won't waste a bunch of memory. +/* Dynamic tracing framework */ +#ifdef ENABLE_DYNAMIC_TRACE + +extern int trace_type; +extern int trace_devs; [...] AGH!! Not another dynamic tracing framework! Please use one of the existing solutions like kprobes or something. Maybe define some static tracepoints if that would be useful? +static inline void drbd_tcp_cork(struct socket *sock) +{ +#if 1 + mm_segment_t oldfs = get_fs(); + int val = 1; + + set_fs(KERNEL_DS); + tcp_setsockopt(sock->sk, SOL_TCP, TCP_CORK, (char *)&val, sizeof(val) ); + set_fs(oldfs); +#else + tcp_sk(sock->sk)->nonagle |= TCP_NAGLE_CORK; +#endif +} Yuck, why'd you do it this way? Probably because your tcp_sk(sock->sk) stuff doesn't have proper locking, I'll bet. You can avoid all the extra wrapper crap by just looking in "do_tcp_setsockopt" and taking the appropriate lock: static inline void drbd_tcp_cork(struct socket *sock) { struct sock *sk = sock->sk; lock_sock(sk); tcp_sk(sk)->nonagle |= TCP_NAGLE_CORK; release-sock(sk); } On the other hand, none of the currently in-tree network block devices or network filesystems seem to feel the need to try to TCP_CORK their output, why does drbd? +#define peer_mask role_mask +#define pdsk_mask disk_mask +#define susp_mask 1 +#define user_isp_mask 1 +#define aftr_isp_mask 1 + +#define NS(T, S) \ + ({ union drbd_state_t mask; mask.i = 0; mask.T = T##_mask; mask; }), \ + ({ union drbd_state_t val; val.i = 0; val.T = (S); val; }) +#define NS2(T1, S1, T2, S2) \ + ({ union drbd_state_t mask; mask.i = 0; mask.T1 = T1##_mask; \ + mask.T2 = T2##_mask; mask; }), \ + ({ union drbd_state_t val; val.i = 0; val.T1 = (S1); \ + val.T2 = (S2); val; }) +#define NS3(T1, S1, T2, S2, T3, S3) \ + ({ union drbd_state_t mask; mask.i = 0; mask.T1 = T1##_mask; \ + mask.T2 = T2##_mask; mask.T3 = T3##_mask; mask; }), \ + ({ union drbd_state_t val; val.i = 0; val.T1 = (S1); \ + val.T2 = (S2); val.T3 = (S3); val; }) + +#define _NS(D, T, S) \ + D, ({ union drbd_state_t ns; ns.i = D->state.i; ns.T = (S); ns; }) +#define _NS2(D, T1, S1, T2, S2) \ + D, ({ union drbd_state_t ns; ns.i = D->state.i; ns.T1 = (S1); \ + ns.T2 = (S2); ns; }) +#define _NS3(D, T1, S1, T2, S2, T3, S3) \ + D, ({ union drbd_state_t ns; ns.i = D->state.i; ns.T1 = (S1); \ + ns.T2 = (S2); ns.T3 = (S3); ns; }) Grumble. When I earlier said I thought I was in macro hell, well, I was wrong. *THIS* is macro hell. What the fsck is that supposed to do? And it doesn't even include a *SINGLE* comment!!! +static inline void drbd_state_lock(struct drbd_conf *mdev) +{ + wait_event(mdev->misc_wait, + !test_and_set_bit(CLUSTER_ST_CHANGE, &mdev->flags)); +} Umm, what's wrong with a mutex here? +static inline int semaphore_is_locked(struct semaphore *s) +{ + if (!down_trylock(s)) { + up(s); + return 0; + } + return 1; +} This would be better implemented in the mutex/semaphore generic code, instead of stuffed in a network blockdev. On the other hand, since the only user is this: + D_ASSERT(semaphore_is_locked(&mdev->md_io_mutex)); Can't you just either get rid of the check or open-code it? It would also give you a chance to substitute D_ASSERT(foo) for a proper BUG_ON(!foo) :-D +static inline void +_drbd_queue_work(struct drbd_work_queue *q, struct drbd_work *w) [...] Yeah, this stuff should really be fixed to use generic workqueues or something. +#define ERR_IF_CNT_IS_NEGATIVE(which) \ + if (atomic_read(&mdev->which) < 0) \ + ERR("in %s:%d: " #which " = %d < 0 !\n", \ + __func__ , __LINE__ , \ + atomic_read(&mdev->which)) Another macro with an implicit parameter (mdev). Please fix all of these +#define dec_unacked(mdev) do { \ + typecheck(struct drbd_conf *, mdev); \ + atomic_dec(&mdev->unacked_cnt); \ + ERR_IF_CNT_IS_NEGATIVE(unacked_cnt); } while (0) [...] More macros that should be inline functions. + if (!have_net_conf) dec_net(mdev); There's lots of coding-style violations like these in there. You should put the "dec_net(mdev);" indented on the line *AFTER* the if statememt. Please read linux/Documentation/CodingStyle for full details. + int mxb = 1000000; /* arbitrary limit on open requests */ Arbitrary limits are usually bad. Derive it from system properties, make it user-configurable, etc. +/* I'd like to use wait_event_lock_irq, + * but I'm not sure when it got introduced, + * and not sure when it has 3 or 4 arguments */ +static inline void inc_ap_bio(struct drbd_conf *mdev) [...] + spin_lock_irq(&mdev->req_lock); + while (!__inc_ap_bio_cond(mdev)) { + prepare_to_wait(&mdev->misc_wait, &wait, TASK_UNINTERRUPTIBLE); + spin_unlock_irq(&mdev->req_lock); + schedule(); + finish_wait(&mdev->misc_wait, &wait); + spin_lock_irq(&mdev->req_lock); + } + spin_unlock_irq(&mdev->req_lock); Since you're submitting for upstream inclusion you can just delete the extra backwards-compatible opencoded stuff and use the generic helper. +#define ARRY_SIZE(A) (sizeof(A)/sizeof(A[0])) There's a generic ARRAY_SIZE(foo) somewhere in the kernel sources. Use that one please. +/* this is developers aid only! */ +#define PARANOIA_ENTRY() BUG_ON(test_and_set_bit(__LC_PARANOIA, &lc->flags)) +#define PARANOIA_LEAVE() do { clear_bit(__LC_PARANOIA, &lc->flags); smp_mb__after_clear_bit(); } while (0) +#define RETURN(x...) do { PARANOIA_LEAVE(); return x ; } while (0) What is the PARANOIA flag intended to verify? This needs better commenting or removal. Don't use macros which take implicit arguments (IE: the lc value is used in the macro without it being passed as a parameter). Since it's debugging code you could also replace the open-coded bit trylock with a blocking spinlock. If something hangs, just turn on lockdep and it will report *MUCH* more useful information about the deadlock. Macros which do a "return" are discouraged, but since this one is called RETURN() it might be OK. On the other hand, you should probably rename it as PARANOIA_RETURN(). +int _drbd_md_sync_page_io(struct drbd_conf *mdev, + struct drbd_backing_dev *bdev, + struct page *page, sector_t sector, + int rw, int size) +{ [...] + ok = (bio_add_page(bio, page, size, 0) == size); + if (!ok) goto out; Ewwww, can somebody say "CodingStyle"? Don't put an if() and its result on a single line, and just use a straight if instead: > if (bio_add_page(bio, page, size, 0) != size) > goto out; + mask = ( hardsect / MD_HARDSECT ) - 1; + D_ASSERT( mask == 1 || mask == 3 || mask == 7 ); + D_ASSERT( hardsect == (mask+1) * MD_HARDSECT ); Problematic spaces and parentheses. Again, see Documentation/CodingStyle for the preferred forms. On the other hand, sometimes little code-style things like that are left to the maintainer if they really strongly prefer it one particular way.
From: Lars Ellenberg [email blocked] Subject: Re: [DRIVER SUBMISSION] DRBD wants to go mainline Date: Mon, 23 Jul 2007 15:32:02 +0200 On Sun, Jul 22, 2007 at 09:32:02PM -0400, Kyle Moffett wrote: > Ok, I didn't have a chance to get through anywhere near all of it, but > here's my comments so far. I didn't really go through things in any > particular order but most of these comments are about your drbd_int.h > header file. Hopefully a lot of the comments will be useful to fix > similar/identical problems in your C files. Thank you for your time. I want to give some lame excuses, still: > +#undef DEVICE_NAME > +#define DEVICE_NAME "drbd" a left-over. first prototypes of drbd have been around in linux 2.0 time... "in the old days", as a block device, you had to define this, and then include some header after that. will clean that up. > +/* I don't remember why XCPU ... > + * This is used to wake the asender, > + * and to interrupt sending the sending task > + * on disconnect. > + */ > +#define DRBD_SIG SIGXCPU > Don't use signals between kernel threads, use proper primitives like > notifiers and waitqueues, which means you should also probably switch away > from kernel_thread() to the kthread_*() APIs. Also you should fix this > FIXME or remove it if it no longer applies:-D. right. but how to I tell a network thread in tcp_recvmsg to stop early, without using signals? > +/* see kernel/printk.c:printk_ratelimit > + * macro, so it is easy do have independend rate limits at different locations > + * "initializer element not constant ..." with kernel 2.4 :( > + * so I initialize toks to something large > + */ > +#define DRBD_ratelimit(ratelimit_jiffies, ratelimit_burst) \ > > Any particular reason you can't just use printk_ratelimit for this? I want to be able to do a rate-limit per specific message/code fragment, without affecting other messages or execution paths. > +/* > + * Compatibility Section > + *************************/ > + > +#define LOCK_SIGMASK(task, flags) spin_lock_irqsave(&task->sighand->siglock, flags) > +#define UNLOCK_SIGMASK(task, flags) spin_unlock_irqrestore(&task->sighand->siglock, flags) > +#define RECALC_SIGPENDING() recalc_sigpending(); > > These aren't used anywhere, I just recently cleaned up the use of them. when the first drbd prototypes have been developed it was linux 2.0 times... along the way many things have changed in incompatible ways, so we had to abstract it in macros. > remove please? yes. > +/* THINK maybe we actually want to use the default "event/%s" worker threads > + * or similar in linux 2.6, which uses per cpu data and threads. > + * > + * To be general, this might need a spin_lock member. > + * For now, please use the mdev->req_lock to protect list_head, > + * see drbd_queue_work below. > + */ > +struct drbd_work_queue { > + struct list_head q; > + struct semaphore s; /* producers up it, worker down()s it */ > + spinlock_t q_lock; /* to protect the list. */ > +}; > > Umm, how about fixing this to actually use proper workqueues or something > instead of this open-coded mess? unlikely to happen "right now". but it is on our todo list... > +/* I want the packet to fit within one page > + * THINK maybe use a special bitmap header, > + * including offset and compression scheme and whatnot > + * Do not use PAGE_SIZE here! Use a architecture agnostic constant! > + */ > +#define BM_PACKET_WORDS ((4096-sizeof(struct Drbd_Header))/sizeof(long)) > > Yuck. Definitely use PAGE_SIZE here, so at least if it's broken on an arch > with multiple page sizes, somebody can grep for PAGE_SIZE to fix it. It also > means that on archs/configs with 8k or 64k pages you won't waste a bunch of > memory. No. This is not to allocate anything, but defines the chunk size with which we transmit the bitmap, when we have to. We need to be able to talk from one arch (say i586) to some other (say s390, or sparc, or whatever). The receiving side has a one-page buffer, from which it may or may not to endian-conversion. The hardcoded 4096 is the minimal common denominator here. > +/* Dynamic tracing framework */ guess we have to explain first what this is for. it probably is not exactly what you think it is... but maybe I am just too ignorant about what you suggest here. we'll have to defer discussion about this for when I'm done doing the trivial fix-ups, and provided an overall design overview. > +static inline void drbd_tcp_cork(struct socket *sock) ... > On the other hand, none of the currently in-tree network block devices or > network filesystems seem to feel the need to try to TCP_CORK their output, > why does drbd? it had performance improvements somewhen. I doubt it has still. maybe we just remove this. > +#define peer_mask role_mask > +#define pdsk_mask disk_mask > +#define susp_mask 1 > +#define user_isp_mask 1 > +#define aftr_isp_mask 1 > +#define NS(T, S) \ > +#define NS2(T1, S1, T2, S2) \ > +#define NS3(T1, S1, T2, S2, T3, S3) \ > +#define _NS(D, T, S) \ > +#define _NS2(D, T1, S1, T2, S2) \ > +#define _NS3(D, T1, S1, T2, S2, T3, S3) \ > > Grumble. When I earlier said I thought I was in macro hell, well, I was > wrong. *THIS* is macro hell. What the fsck is that supposed to do? And it > doesn't even include a *SINGLE* comment!!! uhm. basically you are right. Phil will take over to explain why it is done that way... > + int mxb = 1000000; /* arbitrary limit on open requests */ > > Arbitrary limits are usually bad. Derive it from system properties, make it > user-configurable, etc. it is user configurable. it is just the upper cap on user configurability. most other points I just snipped off of this mail will be handled as you suggested asap. Lars
From: Jens Axboe [email blocked] Subject: Re: [DRIVER SUBMISSION] DRBD wants to go mainline Date: Mon, 23 Jul 2007 15:37:04 +0200 On Mon, Jul 23 2007, Lars Ellenberg wrote: > > +/* THINK maybe we actually want to use the default "event/%s" worker threads > > + * or similar in linux 2.6, which uses per cpu data and threads. > > + * > > + * To be general, this might need a spin_lock member. > > + * For now, please use the mdev->req_lock to protect list_head, > > + * see drbd_queue_work below. > > + */ > > +struct drbd_work_queue { > > + struct list_head q; > > + struct semaphore s; /* producers up it, worker down()s it */ > > + spinlock_t q_lock; /* to protect the list. */ > > +}; > > > > Umm, how about fixing this to actually use proper workqueues or something > > instead of this open-coded mess? > > unlikely to happen "right now". > but it is on our todo list... But stuff like that is definitely a merge show stopper, jfyi. -- Jens Axboe



Related Links:

Reply

The content of this field is kept private and will not be shown publicly.
  • Allowed HTML tags: <a> <em> <strong> <cite> <code> <ul> <ol> <li> <dl> <dt> <dd> <b> <quote> <pre> <hr> <br> <p> <img> <blockquote> <font> <tt> <table> <tr> <i>
  • Lines and paragraphs break automatically.
  • Web page addresses and e-mail addresses turn into links automatically.

More information about formatting options

speck-geostationary