Message ID | 20130820111602.3cea203c@gandalf.local.home (mailing list archive)
---|---
State | Not Applicable, archived
On Tue, Aug 20, 2013 at 11:16:02AM -0400, Steven Rostedt wrote:
> The down/up_read_non_owner() is a nasty hack in the API of the rwsem
> operations. It was once removed, but then resurrected for use with
> bcache. Not only is the API an abomination to the rwsem API, it also
> prevents bcache from ever being compiled with PREEMPT_RT, as the RT
> kernel requires all rwsems to have an owner.
>
> Instead of keeping the down/up_read_non_owner() around, it is better to
> modify the one user of it and have it do something a bit differently.
>
> From reading the bcache code, the reason for the non_owner usage is
> that a request is made, and writers must wait for that request to
> finish before they can continue. But because the request is completed
> by another task, the rwsem can not be held for read and then released
> on completion.
>
> Instead of abusing the rwsem code for this, I added a refcount
> "nr_requests" to the cached_dev structure as well as a "write_waiters"
> wait queue. When a request is to be made, the rwsem is still taken for
> read, but this time with an owner. The refcount is incremented, and
> before exiting the function, the rwsem is released.
>
> The writer will then take the rwsem for write and check the refcount;
> if it is not zero, it will release the rwsem, wait on a wait_event()
> for the refcount to become zero, and then try again.

I _really_ disagree with this approach. I get that there's a problem, but
the bcache code REALLY IS USING THE RWSEM AS A LOCK; the answer isn't to
open code the lock!

Apologies to Christoph for getting distracted and not responding when you
started to explain what the issues were for RT. I'm not really convinced
they're that insurmountable (one of the issues was debugging, which the
_non_owner() stuff always handled just fine), but I'll take it on faith
that this usage is incompatible with rwsems + the RT functionality since
I haven't actually had time to dig into it.

So assuming that's the case, IMO the sanest thing to do is make a new
type of lock - "rwsem_non_process" or somesuch - and use _that_ in
bcache. Not open coding the lock.

It can even live in the bcache code if we want since there currently
wouldn't be any other users, I don't really care. But open coding it?
Come on... makes me wonder what other code in the kernel is open coding
locks because it couldn't release it in the same process context that
took the lock for whatever reason.

Also, nack this patch because it increases the number of atomic ops to
shared cachelines in our fast path. If it does end up being open coded,
I'll make a more efficient version.
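For readers following along, here is a condensed sketch of the scheme described in the quoted patch (the full diff appears at the end of the thread). Only `nr_requests`, `write_waiters`, and `writeback_lock` come from the actual patch; the surrounding struct and function names are illustrative:

```c
#include <linux/atomic.h>
#include <linux/rwsem.h>
#include <linux/sched.h>
#include <linux/wait.h>

struct request_gate {			/* illustrative name */
	struct rw_semaphore	writeback_lock;
	atomic_t		nr_requests;
	wait_queue_head_t	write_waiters;
};

static void gate_init(struct request_gate *g)
{
	init_rwsem(&g->writeback_lock);
	atomic_set(&g->nr_requests, 0);
	init_waitqueue_head(&g->write_waiters);
}

/* Reader side: runs in process context when a request is made. */
static void gate_start_request(struct request_gate *g)
{
	down_read(&g->writeback_lock);	/* owned read lock; RT can boost us */
	atomic_inc(&g->nr_requests);
	/* ... set up and submit the request ... */
	up_read(&g->writeback_lock);	/* released by the same task */
}

/* Completion side: may run in a different task or context entirely. */
static void gate_request_done(struct request_gate *g)
{
	if (atomic_dec_return(&g->nr_requests) == 0)
		wake_up(&g->write_waiters);
}

/* Writer side: retry until no requests remain in flight. */
static void gate_lock_for_write(struct request_gate *g)
{
again:
	down_write(&g->writeback_lock);
	if (atomic_read(&g->nr_requests)) {
		up_write(&g->writeback_lock);
		wait_event(g->write_waiters,
			   !atomic_read(&g->nr_requests));
		goto again;
	}
}
```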
On Tue, 20 Aug 2013 19:19:00 -0700
Kent Overstreet <kmo@daterainc.com> wrote:

> On Tue, Aug 20, 2013 at 11:16:02AM -0400, Steven Rostedt wrote:
>
> I _really_ disagree with this approach.

I had a feeling you would love it.

> I get that there's a problem, but the bcache code REALLY IS USING THE
> RWSEM AS A LOCK; the answer isn't to open code the lock!

Actually, it is using it as a semaphore. The problem with Linux was that
it originally had only spin locks and semaphores. People would use the
semaphore as either a completion or a mutex. Luckily, we created both of
those and replaced 90% of all semaphores with either a mutex or a
completion.

The rwsem sorta has the same issue. But it was pretty much always used
as a read/write sleeping lock; it was not used as a semaphore. But it
seems that bcache really does use it as a semaphore here, as it isn't
just a mutex situation. It's a special kind of completion, but the
completion is used in a strange way.

> Apologies to Christoph for getting distracted and not responding when
> you started to explain what the issues were for RT. I'm not really
> convinced they're that insurmountable (one of the issues was debugging,
> which the _non_owner() stuff always handled just fine), but I'll take
> it on faith that this usage is incompatible with rwsems + the RT
> functionality since I haven't actually had time to dig into it.

The thing is, RT requires priority inheritance. When a task blocks on a
rwsem, it has to boost the owner of that rwsem, otherwise it risks
unbounded priority inversion.

I have a hack that actually implements a work around for non_owner, but
it adds overhead to all locks to do so.

> So assuming that's the case, IMO the sanest thing to do is make a new
> type of lock - "rwsem_non_process" or somesuch - and use _that_ in
> bcache. Not open coding the lock.

I actually started that, but decided this was the easier patch to send.
Don't worry, I was expecting this email. No surprises here ;-)

This email was taken from Andrew Morton's play book (I see you Cc'd him,
I forgot to). It's one of those patches of "it's so bad that the owner
of the code will fix the issue out of fear that this patch may make it
in".

> It can even live in the bcache code if we want since there currently
> wouldn't be any other users, I don't really care. But open coding it?
> Come on... makes me wonder what other code in the kernel is open coding
> locks because it couldn't release it in the same process context that
> took the lock for whatever reason.

Most cases just have completions. This looks like a case of "don't do
something while a transaction is taking place", which can usually get
away with a wait event.

> Also, nack this patch because it increases the number of atomic ops to
> shared cachelines in our fast path. If it does end up being open coded,
> I'll make a more efficient version.

Perhaps I can whip up a "struct rwsem_non_owner" with a very limited API
for that, and then you can use that.

Thanks,

-- Steve
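The completion pattern Steven contrasts this with ("most cases just have completions") looks roughly like the sketch below; `struct xfer` and `submit_to_hardware()` are hypothetical stand-ins for a driver's real request type and submission path:

```c
#include <linux/completion.h>

struct xfer {				/* hypothetical request type */
	struct completion done;
	/* ... request state ... */
};

static void submit_to_hardware(struct xfer *x);	/* hypothetical */

/* Submitting task: blocks until some other context signals completion. */
static int do_xfer(struct xfer *x)
{
	init_completion(&x->done);
	submit_to_hardware(x);
	wait_for_completion(&x->done);
	return 0;
}

/* Completing context: interrupt handler, workqueue, or another task. */
static void xfer_done(struct xfer *x)
{
	complete(&x->done);	/* legal from any context, unlike up_read() */
}
```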
On Tue, Aug 20, 2013 at 10:41:53PM -0400, Steven Rostedt wrote:
> > I get that there's a problem, but the bcache code REALLY IS USING THE
> > RWSEM AS A LOCK; the answer isn't to open code the lock!
>
> Actually, it is using it as a semaphore. The problem with Linux was
> that it originally had only spin locks and semaphores. People would use
> the semaphore as either a completion or a mutex. Luckily, we created
> both of those and replaced 90% of all semaphores with either a mutex or
> a completion.
>
> The rwsem sorta has the same issue. But it was pretty much always used
> as a read/write sleeping lock; it was not used as a semaphore. But it
> seems that bcache really does use it as a semaphore here, as it isn't
> just a mutex situation. It's a special kind of completion, but the
> completion is used in a strange way.

I think we're mostly quibbling over semantics here, and it's not that
big a deal - that said, I don't really get your distinction between a
semaphore and a mutex; I'd distinguish them by saying a semaphore can
count an arbitrary number of resources (the number it's initialized to),
and a mutex is always initialized to one - the fact that in the kernel
mutexes must be released by the same process that took them (while
important for PI and debugging) is not fundamental.

"rw semaphore" never made any sense to me; they're not counting
anything, like regular semaphores. They're just sleepable rw locks.

So it _sounds_ like you're saying bcache is using it as a semaphore,
because it's using it as a lock we don't release in the original
context? In analogy to Linux mutexes and semaphores? Not quite sure what
you mean.

> > Apologies to Christoph for getting distracted and not responding when
> > you started to explain what the issues were for RT. I'm not really
> > convinced they're that insurmountable (one of the issues was
> > debugging, which the _non_owner() stuff always handled just fine),
> > but I'll take it on faith that this usage is incompatible with rwsems
> > + the RT functionality since I haven't actually had time to dig into
> > it.
>
> The thing is, RT requires priority inheritance. When a task blocks on a
> rwsem, it has to boost the owner of that rwsem, otherwise it risks
> unbounded priority inversion.
>
> I have a hack that actually implements a work around for non_owner, but
> it adds overhead to all locks to do so.

Ok - I'd just be curious why the PI bit can't be factored out into a
wrapper like how the debug stuff is handled with the _non_owner bits,
but I can certainly believe that.

> > So assuming that's the case, IMO the sanest thing to do is make a new
> > type of lock - "rwsem_non_process" or somesuch - and use _that_ in
> > bcache. Not open coding the lock.
>
> I actually started that, but decided this was the easier patch to send.
> Don't worry, I was expecting this email. No surprises here ;-)
>
> This email was taken from Andrew Morton's play book (I see you Cc'd
> him, I forgot to). It's one of those patches of "it's so bad that the
> owner of the code will fix the issue out of fear that this patch may
> make it in".

Ah, I see :)

> > It can even live in the bcache code if we want since there currently
> > wouldn't be any other users, I don't really care. But open coding it?
> > Come on... makes me wonder what other code in the kernel is open
> > coding locks because it couldn't release it in the same process
> > context that took the lock for whatever reason.
>
> Most cases just have completions. This looks like a case of "don't do
> something while a transaction is taking place", which can usually get
> away with a wait event.

Eh, only in the sense that it's easy to implement mutexes and rwlocks
with wait_event().

To simplify the algorithm just a bit (only leaving out some
optimizations): bcache is using it to protect a rb tree, which contains
"things that are undergoing background writeback".

For foreground writes, we've got to check if the write overlaps with
anything in this rb tree. If so, we force _this_ write to writeback;
background writeback will write stale data to the backing device, but
that's fine because the data in the cache will still be marked as dirty.

To add stuff to this rb tree - i.e. for background writeback to start
flushing some dirty data - it's got to make sure it's not going to
overwrite some in progress foreground writethrough write.

Tracking every outstanding foreground write would be expensive, so
foreground writes take a read lock on the rb tree (both to check it for
anything they overlap with, and for their duration), and background
writeback takes a write lock when it goes to scan for dirty data to add
to the rb tree.

Even if you complain it's not _just_ protecting the rb tree, we're still
using it for mutual exclusion, plain and simple, and that's all a lock
is.

> > Also, nack this patch because it increases the number of atomic ops
> > to shared cachelines in our fast path. If it does end up being open
> > coded, I'll make a more efficient version.
>
> Perhaps I can whip up a "struct rwsem_non_owner" with a very limited
> API for that, and then you can use that.

That would be perfect :)
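A minimal sketch of the locking scheme Kent describes, under illustrative names (in bcache itself the lock is dc->writeback_lock and the overlap check is bch_keybuf_check_overlapping(), as the patch below shows):

```c
#include <linux/rbtree.h>
#include <linux/rwsem.h>
#include <linux/types.h>

struct wb_state {			/* illustrative */
	struct rw_semaphore	lock;
	struct rb_root		dirty;	/* regions under background writeback */
};

/* Stub: does [start, end) intersect anything in the dirty tree? */
static bool overlaps(struct rb_root *dirty, u64 start, u64 end);

/* Foreground write: shared lock for the overlap check and the I/O. */
static void foreground_write(struct wb_state *w, u64 start, u64 end)
{
	down_read(&w->lock);
	if (overlaps(&w->dirty, start, end)) {
		/*
		 * Force this write to writeback: background writeback may
		 * write stale data to the backing device, but the cached
		 * copy stays marked dirty, so nothing is lost.
		 */
	}
	/* ... issue the write while the read lock is still held ... */
	up_read(&w->lock);
}

/* Background writeback: exclusive lock while refilling the rb tree. */
static void refill_dirty_keys(struct wb_state *w)
{
	down_write(&w->lock);	/* no foreground writes in flight now */
	/* ... scan for dirty data and insert it into w->dirty ... */
	up_write(&w->lock);
}
```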
On Tue, 20 Aug 2013 20:36:58 -0700
Kent Overstreet <kmo@daterainc.com> wrote:

> I think we're mostly quibbling over semantics here, and it's not that
> big a deal - that said, I don't really get your distinction between a
> semaphore and a mutex; I'd distinguish them by saying a semaphore can
> count an arbitrary number of resources (the number it's initialized
> to), and a mutex is always initialized to one - the fact that in the
> kernel mutexes must be released by the same process that took them
> (while important for PI and debugging) is not fundamental.

Understood. As I usually am focused on PI and debugging, it may be more
fundamental to me than to others.

> "rw semaphore" never made any sense to me; they're not counting
> anything, like regular semaphores. They're just sleepable rw locks.

Me either.

> So it _sounds_ like you're saying bcache is using it as a semaphore,
> because it's using it as a lock we don't release in the original
> context? In analogy to Linux mutexes and semaphores? Not quite sure
> what you mean.

Actually, I'm thinking you are using it more as a completion than a
semaphore.

> > I have a hack that actually implements a work around for non_owner,
> > but it adds overhead to all locks to do so.
>
> Ok - I'd just be curious why the PI bit can't be factored out into a
> wrapper like how the debug stuff is handled with the _non_owner bits,
> but I can certainly believe that.

The problem is that on RT all sleeping locks go through rt_mutex; the
locks are replaced with ones that incorporate PI.

> To simplify the algorithm just a bit (only leaving out some
> optimizations): bcache is using it to protect a rb tree, which contains
> "things that are undergoing background writeback".

I think this is what makes bcache unique in the kernel, and I don't
think there are open coded locks elsewhere.

> For foreground writes, we've got to check if the write overlaps with
> anything in this rb tree. If so, we force _this_ write to writeback;
> background writeback will write stale data to the backing device, but
> that's fine because the data in the cache will still be marked as
> dirty.
>
> To add stuff to this rb tree - i.e. for background writeback to start
> flushing some dirty data - it's got to make sure it's not going to
> overwrite some in progress foreground writethrough write.
>
> Tracking every outstanding foreground write would be expensive, so
> foreground writes take a read lock on the rb tree (both to check it for
> anything they overlap with, and for their duration), and background
> writeback takes a write lock when it goes to scan for dirty data to add
> to the rb tree.
>
> Even if you complain it's not _just_ protecting the rb tree, we're
> still using it for mutual exclusion, plain and simple, and that's all a
> lock is.

Well, rwlocks and rwsems are not mutually exclusive ;-) The read side has
multiple accessors. Mutual exclusion would also imply a single task
needing exclusive access to a resource, but as things are done in the
background, that is not the case. But we are arguing semantics here and
not helping with a solution.

> > > Also, nack this patch because it increases the number of atomic ops
> > > to shared cachelines in our fast path. If it does end up being open
> > > coded, I'll make a more efficient version.
> >
> > Perhaps I can whip up a "struct rwsem_non_owner" with a very limited
> > API for that, and then you can use that.
>
> That would be perfect :)

OK, I'll see what I can do.

-- Steve
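For reference, on a non-RT kernel the "very limited API" Steven offers could be as thin as the wrapper sketched below, layered over the existing down/up_read_non_owner() entry points. This is purely an assumption about its shape (no such type existed at the time of this thread); the RT side is the hard part, since PI needs a known owner to boost:

```c
#include <linux/rwsem.h>

/* Illustrative only: a nominal "non-owner" rwsem wrapper type. */
struct rwsem_non_owner {
	struct rw_semaphore sem;
};

static inline void rwsem_no_down_read(struct rwsem_non_owner *l)
{
	down_read_non_owner(&l->sem);
}

/* May be called from a different task than the one that took the lock. */
static inline void rwsem_no_up_read(struct rwsem_non_owner *l)
{
	up_read_non_owner(&l->sem);
}

static inline void rwsem_no_down_write(struct rwsem_non_owner *l)
{
	down_write(&l->sem);
}

static inline void rwsem_no_up_write(struct rwsem_non_owner *l)
{
	up_write(&l->sem);
}
```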
```diff
Index: linux-trace.git/drivers/md/bcache/bcache.h
===================================================================
--- linux-trace.git.orig/drivers/md/bcache/bcache.h
+++ linux-trace.git/drivers/md/bcache/bcache.h
@@ -486,6 +486,15 @@ struct cached_dev {
 	atomic_t		running;
 
 	/*
+	 * Requests in progress.
+	 */
+	atomic_t		nr_requests;
+	/*
+	 * Writers waiting for requests to finish.
+	 */
+	wait_queue_head_t	write_waiters;
+
+	/*
 	 * Writes take a shared lock from start to finish; scanning for dirty
 	 * data to refill the rb tree requires an exclusive lock.
 	 */
Index: linux-trace.git/drivers/md/bcache/request.c
===================================================================
--- linux-trace.git.orig/drivers/md/bcache/request.c
+++ linux-trace.git/drivers/md/bcache/request.c
@@ -998,7 +998,9 @@ static void cached_dev_write_complete(st
 	struct search *s = container_of(cl, struct search, cl);
 	struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
 
-	up_read_non_owner(&dc->writeback_lock);
+	if (atomic_dec_return(&dc->nr_requests) == 0)
+		wake_up(&dc->write_waiters);
+
 	cached_dev_bio_complete(cl);
 }
 
@@ -1024,7 +1026,8 @@ static void request_write(struct cached_
 	bch_keybuf_check_overlapping(&s->op.c->moving_gc_keys, &start, &end);
 	check_should_skip(dc, s);
 
-	down_read_non_owner(&dc->writeback_lock);
+	down_read(&dc->writeback_lock);
+	atomic_inc(&dc->nr_requests);
 
 	if (bch_keybuf_check_overlapping(&dc->writeback_keys, &start, &end)) {
 		s->op.skip = false;
@@ -1053,6 +1056,7 @@ static void request_write(struct cached_
 	}
 out:
 	closure_call(&s->op.cl, bch_insert_data, NULL, cl);
+	up_read(&dc->writeback_lock);
 	continue_at(cl, cached_dev_write_complete, NULL);
 skip:
 	s->op.skip = true;
Index: linux-trace.git/drivers/md/bcache/super.c
===================================================================
--- linux-trace.git.orig/drivers/md/bcache/super.c
+++ linux-trace.git/drivers/md/bcache/super.c
@@ -1081,6 +1081,8 @@ static void register_bdev(struct cache_s
 	dc->sb_bio.bi_io_vec[0].bv_page = sb_page;
 	get_page(sb_page);
 
+	init_waitqueue_head(&dc->write_waiters);
+
 	if (cached_dev_init(dc, sb->block_size << 9))
 		goto err;
 
Index: linux-trace.git/drivers/md/bcache/writeback.c
===================================================================
--- linux-trace.git.orig/drivers/md/bcache/writeback.c
+++ linux-trace.git/drivers/md/bcache/writeback.c
@@ -121,6 +121,20 @@ static void dirty_init(struct keybuf_key
 	bch_bio_map(bio, NULL);
 }
 
+/*
+ * If requests are in transit, we must wait for them to finish.
+ */
+static void down_writeback_lock(struct cached_dev *dc)
+{
+again:
+	down_write(&dc->writeback_lock);
+	if (atomic_read(&dc->nr_requests)) {
+		up_write(&dc->writeback_lock);
+		wait_event(dc->write_waiters, !atomic_read(&dc->nr_requests));
+		goto again;
+	}
+}
+
 static void refill_dirty(struct closure *cl)
 {
 	struct cached_dev *dc = container_of(cl, struct cached_dev,
@@ -134,7 +148,7 @@ static void refill_dirty(struct closure
 	    !dc->writeback_running)
 		closure_return(cl);
 
-	down_write(&dc->writeback_lock);
+	down_writeback_lock(dc);
 
 	if (!atomic_read(&dc->has_dirty)) {
 		SET_BDEV_STATE(&dc->sb, BDEV_STATE_CLEAN);
```