[4/6] kvm tools: Add rwlock wrapper

Message ID	1306419950-19064-4-git-send-email-levinsasha928@gmail.com (mailing list archive)
State	New, archived
Headers	show Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by demeter1.kernel.org (8.14.4/8.14.3) with ESMTP id p4QEQLT1023453 for <patchwork-kvm@patchwork.kernel.org>; Thu, 26 May 2011 14:26:22 GMT DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=from:to:cc:subject:date:message-id:x-mailer:in-reply-to:references; b=djBb+WWyEtqR8qx7rNeNjL7iTk59H7OXZhYLjEPD4eBvEtcrmBRtuFLuxVC5hopYGK 6ggQ3PGCfMpfJ94vy30+OoX5JiWXQ2zJDfSo6Dnh2DJfw0FNNkFXK//YoanXw7Hygc5h qkhzx3x8aas7GxjqG5ffqbQ9gErpir5f2nBTs= From: Sasha Levin <levinsasha928@gmail.com> To: penberg@kernel.org Cc: john@jfloren.net, kvm@vger.kernel.org, mingo@elte.hu, asias.hejun@gmail.com, gorcunov@gmail.com, prasadjoshi124@gmail.com, Sasha Levin <levinsasha928@gmail.com> Subject: [PATCH 4/6] kvm tools: Add rwlock wrapper Date: Thu, 26 May 2011 17:25:48 +0300 Message-Id: <1306419950-19064-4-git-send-email-levinsasha928@gmail.com> In-Reply-To: <1306419950-19064-1-git-send-email-levinsasha928@gmail.com> References: <1306419950-19064-1-git-send-email-levinsasha928@gmail.com> Sender: kvm-owner@vger.kernel.org Precedence: bulk

Sasha Levin May 26, 2011, 2:25 p.m. UTC

Adds a rwlock wrapper which like the mutex wrapper makes rwlock calls
similar to their kernel counterparts.

Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
---
 tools/kvm/include/kvm/rwsem.h |   39 +++++++++++++++++++++++++++++++++++++++
 1 files changed, 39 insertions(+), 0 deletions(-)
 create mode 100644 tools/kvm/include/kvm/rwsem.h

Pekka Enberg May 26, 2011, 4:02 p.m. UTC | #1

On Thu, 26 May 2011, Sasha Levin wrote:
> Adds a rwlock wrapper which like the mutex wrapper makes rwlock calls
> similar to their kernel counterparts.
>
> Signed-off-by: Sasha Levin <levinsasha928@gmail.com>

There's no explanation why a mutex isn't sufficient. The pthread locking 
primitives aren't all that great in practice so unless you have some 
correctness issue that requires a rwlock or some numbers, I'd prefer you 
go for a mutex.

 			Pekka
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sasha Levin May 26, 2011, 4:19 p.m. UTC | #2

On Thu, 2011-05-26 at 19:02 +0300, Pekka Enberg wrote:
> On Thu, 26 May 2011, Sasha Levin wrote:
> > Adds a rwlock wrapper which like the mutex wrapper makes rwlock calls
> > similar to their kernel counterparts.
> >
> > Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
> 
> There's no explanation why a mutex isn't sufficient. The pthread locking 
> primitives aren't all that great in practice so unless you have some 
> correctness issue that requires a rwlock or some numbers, I'd prefer you 
> go for a mutex.

I've added some rwlocks because of what Ingo said yesterday about
adding/removing devices after the first initialization phase.

Take MMIO lock for example: Since we can now run SMP guests, we may have
multiple MMIO exits (one from each VCPU thread). Each of those exits
leads to searching the MMIO rbtree.

We can use a mutex to lock it, but it just means that those threads will
be blocked there instead of concurrently searching the MMIO tree which
makes the search linear instead of parallel.

It's hard to bring 'real' numbers at this stage because the only 'real'
device we have which uses MMIO is the VESA driver, and we can't really
simulate many VCPUs writing to it :)

Ingo Molnar May 26, 2011, 6:05 p.m. UTC | #3

* Sasha Levin <levinsasha928@gmail.com> wrote:

> On Thu, 2011-05-26 at 19:02 +0300, Pekka Enberg wrote:
> > On Thu, 26 May 2011, Sasha Levin wrote:
> > > Adds a rwlock wrapper which like the mutex wrapper makes rwlock calls
> > > similar to their kernel counterparts.
> > >
> > > Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
> > 
> > There's no explanation why a mutex isn't sufficient. The pthread 
> > locking primitives aren't all that great in practice so unless 
> > you have some correctness issue that requires a rwlock or some 
> > numbers, I'd prefer you go for a mutex.
> 
> I've added some rwlocks because of what Ingo said yesterday about 
> adding/removing devices after the first initialization phase.
> 
> Take MMIO lock for example: Since we can now run SMP guests, we may 
> have multiple MMIO exits (one from each VCPU thread). Each of those 
> exits leads to searching the MMIO rbtree.
> 
> We can use a mutex to lock it, but it just means that those threads 
> will be blocked there instead of concurrently searching the MMIO 
> tree which makes the search linear instead of parallel.
> 
> It's hard to bring 'real' numbers at this stage because the only 
> 'real' device we have which uses MMIO is the VESA driver, and we 
> can't really simulate many VCPUs writing to it :)

I'd suggest keeping it simple first - rwlocks are nasty and will 
bounce a cacheline just as much.

If lookup scalability is an issue we can extend RCU to tools/kvm/.

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Avi Kivity May 26, 2011, 6:11 p.m. UTC | #4

On 05/26/2011 09:05 PM, Ingo Molnar wrote:
> >
> >  I've added some rwlocks because of what Ingo said yesterday about
> >  adding/removing devices after the first initialization phase.
> >
> >  Take MMIO lock for example: Since we can now run SMP guests, we may
> >  have multiple MMIO exits (one from each VCPU thread). Each of those
> >  exits leads to searching the MMIO rbtree.
> >
> >  We can use a mutex to lock it, but it just means that those threads
> >  will be blocked there instead of concurrently searching the MMIO
> >  tree which makes the search linear instead of parallel.
> >
> >  It's hard to bring 'real' numbers at this stage because the only
> >  'real' device we have which uses MMIO is the VESA driver, and we
> >  can't really simulate many VCPUs writing to it :)
>
> I'd suggest keeping it simple first - rwlocks are nasty and will
> bounce a cacheline just as much.

Well, this is the first case where tools/kvm can do better than qemu 
with its global lock, so I think it's worth it.

> If lookup scalability is an issue we can extend RCU to tools/kvm/.

Definitely rcu is a perfect patch for mmio dispatch.

Pekka Enberg May 26, 2011, 6:21 p.m. UTC | #5

On Thu, May 26, 2011 at 9:11 PM, Avi Kivity <avi@redhat.com> wrote:
> On 05/26/2011 09:05 PM, Ingo Molnar wrote:
>>
>> >
>> >  I've added some rwlocks because of what Ingo said yesterday about
>> >  adding/removing devices after the first initialization phase.
>> >
>> >  Take MMIO lock for example: Since we can now run SMP guests, we may
>> >  have multiple MMIO exits (one from each VCPU thread). Each of those
>> >  exits leads to searching the MMIO rbtree.
>> >
>> >  We can use a mutex to lock it, but it just means that those threads
>> >  will be blocked there instead of concurrently searching the MMIO
>> >  tree which makes the search linear instead of parallel.
>> >
>> >  It's hard to bring 'real' numbers at this stage because the only
>> >  'real' device we have which uses MMIO is the VESA driver, and we
>> >  can't really simulate many VCPUs writing to it :)
>>
>> I'd suggest keeping it simple first - rwlocks are nasty and will
>> bounce a cacheline just as much.
>
> Well, this is the first case where tools/kvm can do better than qemu with
> its global lock, so I think it's worth it.
>
>> If lookup scalability is an issue we can extend RCU to tools/kvm/.
>
> Definitely rcu is a perfect patch for mmio dispatch.

Userspace RCU code is here, Sasha, if you feel like tackling this:

http://lttng.org/urcu

:-)

I'm CC'ing Paul and Mathieu as well for urcu.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sasha Levin May 26, 2011, 6:57 p.m. UTC | #6

On Thu, 2011-05-26 at 21:21 +0300, Pekka Enberg wrote:
> On Thu, May 26, 2011 at 9:11 PM, Avi Kivity <avi@redhat.com> wrote:
> > On 05/26/2011 09:05 PM, Ingo Molnar wrote:
> >>
> >> >
> >> >  I've added some rwlocks because of what Ingo said yesterday about
> >> >  adding/removing devices after the first initialization phase.
> >> >
> >> >  Take MMIO lock for example: Since we can now run SMP guests, we may
> >> >  have multiple MMIO exits (one from each VCPU thread). Each of those
> >> >  exits leads to searching the MMIO rbtree.
> >> >
> >> >  We can use a mutex to lock it, but it just means that those threads
> >> >  will be blocked there instead of concurrently searching the MMIO
> >> >  tree which makes the search linear instead of parallel.
> >> >
> >> >  It's hard to bring 'real' numbers at this stage because the only
> >> >  'real' device we have which uses MMIO is the VESA driver, and we
> >> >  can't really simulate many VCPUs writing to it :)
> >>
> >> I'd suggest keeping it simple first - rwlocks are nasty and will
> >> bounce a cacheline just as much.
> >
> > Well, this is the first case where tools/kvm can do better than qemu with
> > its global lock, so I think it's worth it.
> >
> >> If lookup scalability is an issue we can extend RCU to tools/kvm/.
> >
> > Definitely rcu is a perfect patch for mmio dispatch.
> 
> Userspace RCU code is here, Sasha, if you feel like tackling this:
> 
> http://lttng.org/urcu
> 
> :-)
> 
> I'm CC'ing Paul and Mathieu as well for urcu.

Sounds good!

Should be quite an addition and could be used in more places than just
the MMIO dispatcher.

Ingo Molnar May 26, 2011, 8:25 p.m. UTC | #7

* Pekka Enberg <penberg@kernel.org> wrote:

> On Thu, May 26, 2011 at 9:11 PM, Avi Kivity <avi@redhat.com> wrote:
> > On 05/26/2011 09:05 PM, Ingo Molnar wrote:
> >>
> >> >
> >> >  I've added some rwlocks because of what Ingo said yesterday about
> >> >  adding/removing devices after the first initialization phase.
> >> >
> >> >  Take MMIO lock for example: Since we can now run SMP guests, we may
> >> >  have multiple MMIO exits (one from each VCPU thread). Each of those
> >> >  exits leads to searching the MMIO rbtree.
> >> >
> >> >  We can use a mutex to lock it, but it just means that those threads
> >> >  will be blocked there instead of concurrently searching the MMIO
> >> >  tree which makes the search linear instead of parallel.
> >> >
> >> >  It's hard to bring 'real' numbers at this stage because the only
> >> >  'real' device we have which uses MMIO is the VESA driver, and we
> >> >  can't really simulate many VCPUs writing to it :)
> >>
> >> I'd suggest keeping it simple first - rwlocks are nasty and will
> >> bounce a cacheline just as much.
> >
> > Well, this is the first case where tools/kvm can do better than qemu with
> > its global lock, so I think it's worth it.
> >
> >> If lookup scalability is an issue we can extend RCU to tools/kvm/.
> >
> > Definitely rcu is a perfect patch for mmio dispatch.
> 
> Userspace RCU code is here, Sasha, if you feel like tackling this:
> 
> http://lttng.org/urcu
> 
> :-)
> 
> I'm CC'ing Paul and Mathieu as well for urcu.

I think we should rather share some of the kernel RCU code in an 
intelligent way instead of bringing in yet another library which is a 
IIRC a distant copy of the kernel code to begin with.

That way we could lazily benefit from all the enhancements Paul does 
to the kernel RCU code! :-)

Note that kernel/treercu.c is pretty tied to the kernel right now, so 
a first approach could be to:

 - Check Paul's excellent documentation about RCU and make sure
   you don't miss Paul's excellent 3-part primer on LWN.net:

     http://lwn.net/Articles/262464/
     http://lwn.net/Articles/263130/
     http://lwn.net/Articles/264090/

   There are also lots of very good RCU articles on the LWN Kernel 
   Index page:

	http://lwn.net/Kernel/Index/

 - Check kernel/tinyrcu.c to see how RCU is implemented in its 
   simplest form. :)

 - Copy the tree-RCU code from kernel/treercu.c to tools/kvm/rcu/

 - Massage it so far that it is suitable for tools/kvm/. We really
   only need a few core RCU facilities initially:

    struct rcu_head;

    rcu_read_lock();
    rcu_read_unlock();

    rcu_dereference()

    call_rcu(head, func);

    rcu_synchronize();

   The rest, _bh(), etc. are all kernelisms.

 - Then once it's working we could look at doing the code sharing
   *for real*: splitting the functionality out of the original
   treercu.c code into kernel/treercu-lib.c and rcuupdate-lib.h or so
   and include that one in tools/kvm/rcu/.

 - [ You might also benefit from porting rcutorture code to 
     user-space. It will catch RCU bugs like nothing else. ]

That way the 'core RCU' logic would be contained in treercu-lib.c, 
all kernel glue would be in kernel/rcutree.c.

Some other approach might be possible as well, this was just a first 
rough idea :)

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Mathieu Desnoyers May 26, 2011, 11:05 p.m. UTC | #8

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Pekka Enberg <penberg@kernel.org> wrote:
> 
> > On Thu, May 26, 2011 at 9:11 PM, Avi Kivity <avi@redhat.com> wrote:
> > > On 05/26/2011 09:05 PM, Ingo Molnar wrote:
> > >>
> > >> >
> > >> >  I've added some rwlocks because of what Ingo said yesterday about
> > >> >  adding/removing devices after the first initialization phase.
> > >> >
> > >> >  Take MMIO lock for example: Since we can now run SMP guests, we may
> > >> >  have multiple MMIO exits (one from each VCPU thread). Each of those
> > >> >  exits leads to searching the MMIO rbtree.
> > >> >
> > >> >  We can use a mutex to lock it, but it just means that those threads
> > >> >  will be blocked there instead of concurrently searching the MMIO
> > >> >  tree which makes the search linear instead of parallel.
> > >> >
> > >> >  It's hard to bring 'real' numbers at this stage because the only
> > >> >  'real' device we have which uses MMIO is the VESA driver, and we
> > >> >  can't really simulate many VCPUs writing to it :)
> > >>
> > >> I'd suggest keeping it simple first - rwlocks are nasty and will
> > >> bounce a cacheline just as much.
> > >
> > > Well, this is the first case where tools/kvm can do better than qemu with
> > > its global lock, so I think it's worth it.
> > >
> > >> If lookup scalability is an issue we can extend RCU to tools/kvm/.
> > >
> > > Definitely rcu is a perfect patch for mmio dispatch.
> > 
> > Userspace RCU code is here, Sasha, if you feel like tackling this:
> > 
> > http://lttng.org/urcu
> > 
> > :-)
> > 
> > I'm CC'ing Paul and Mathieu as well for urcu.
> 
> I think we should rather share some of the kernel RCU code in an 
> intelligent way

Hi Ingo,

By all means feel free to redo all the work Paul have spent care and
time coding and testing.

> instead of bringing in yet another library which is a 
> IIRC a distant copy of the kernel code to begin with.

This is either a lie, or immensely misinformed. You should go and look
at the source before doing nonsensical assumptions like this. What you
are saying cannot be further from the truth.

> 
> That way we could lazily benefit from all the enhancements Paul does 
> to the kernel RCU code! :-)

Maybe there is a reason why Paul have been working with me on the
userspace RCU implementation in parallel with working on the Linux
kernel one ?

> 
> Note that kernel/treercu.c is pretty tied to the kernel right now, so 
> a first approach could be to:
> 
>  - Check Paul's excellent documentation about RCU and make sure
>    you don't miss Paul's excellent 3-part primer on LWN.net:
> 
>      http://lwn.net/Articles/262464/
>      http://lwn.net/Articles/263130/
>      http://lwn.net/Articles/264090/
> 
>    There are also lots of very good RCU articles on the LWN Kernel 
>    Index page:
> 
> 	http://lwn.net/Kernel/Index/

Or just (see README)

git clone git://git.lttng.org/urcu
cd userspace-rcu
./bootstrap
./configure
make
make install
ldconfig

#include <urcu.h>
gcc -lurcu ...

and be done with it ?

>  - Check kernel/tinyrcu.c to see how RCU is implemented in its 
>    simplest form. :)

...so simplistic it only works on UP systems, which are not so common
these days on the systems targeted by kvm.

> 
>  - Copy the tree-RCU code from kernel/treercu.c to tools/kvm/rcu/

This code is very much tied with the kernel scheduler. This is actually
one of the main reason why we had to reimplement RCU for userspace
rather than to "simply copy the kernel one" as you recommend.

> 
>  - Massage it so far that it is suitable for tools/kvm/. We really
>    only need a few core RCU facilities initially:
> 
>     struct rcu_head;
> 
>     rcu_read_lock();
>     rcu_read_unlock();
> 
>     rcu_dereference()
> 
>     call_rcu(head, func);
> 
>     rcu_synchronize();
> 
>    The rest, _bh(), etc. are all kernelisms.

rcu_synchronize and the rcu read lock/unlock, in tree RCU, are tied to
the scheduler to deal with preemption. User-land does not have this
luxury.

> 
>  - Then once it's working we could look at doing the code sharing
>    *for real*: splitting the functionality out of the original
>    treercu.c code into kernel/treercu-lib.c and rcuupdate-lib.h or so
>    and include that one in tools/kvm/rcu/.
> 
>  - [ You might also benefit from porting rcutorture code to 
>      user-space. It will catch RCU bugs like nothing else. ]

Userspace RCU already includes the torture test suite you are referring
to.

> 
> That way the 'core RCU' logic would be contained in treercu-lib.c, 
> all kernel glue would be in kernel/rcutree.c.
> 
> Some other approach might be possible as well, this was just a first 
> rough idea :)

"wheel not invented here" syndrome ?

Mathieu

Mathieu Desnoyers May 26, 2011, 11:09 p.m. UTC | #9

* Sasha Levin (levinsasha928@gmail.com) wrote:
> On Thu, 2011-05-26 at 21:21 +0300, Pekka Enberg wrote:
> > On Thu, May 26, 2011 at 9:11 PM, Avi Kivity <avi@redhat.com> wrote:
> > > On 05/26/2011 09:05 PM, Ingo Molnar wrote:
> > >>
> > >> >
> > >> >  I've added some rwlocks because of what Ingo said yesterday about
> > >> >  adding/removing devices after the first initialization phase.
> > >> >
> > >> >  Take MMIO lock for example: Since we can now run SMP guests, we may
> > >> >  have multiple MMIO exits (one from each VCPU thread). Each of those
> > >> >  exits leads to searching the MMIO rbtree.
> > >> >
> > >> >  We can use a mutex to lock it, but it just means that those threads
> > >> >  will be blocked there instead of concurrently searching the MMIO
> > >> >  tree which makes the search linear instead of parallel.
> > >> >
> > >> >  It's hard to bring 'real' numbers at this stage because the only
> > >> >  'real' device we have which uses MMIO is the VESA driver, and we
> > >> >  can't really simulate many VCPUs writing to it :)
> > >>
> > >> I'd suggest keeping it simple first - rwlocks are nasty and will
> > >> bounce a cacheline just as much.
> > >
> > > Well, this is the first case where tools/kvm can do better than qemu with
> > > its global lock, so I think it's worth it.
> > >
> > >> If lookup scalability is an issue we can extend RCU to tools/kvm/.
> > >
> > > Definitely rcu is a perfect patch for mmio dispatch.
> > 
> > Userspace RCU code is here, Sasha, if you feel like tackling this:
> > 
> > http://lttng.org/urcu
> > 
> > :-)
> > 
> > I'm CC'ing Paul and Mathieu as well for urcu.
> 
> Sounds good!
> 
> Should be quite an addition and could be used in more places than just
> the MMIO dispatcher.

Hi Sasha,

Feel free to let me know if you need any help, have questions, or need
improvements to liburcu. I'd be interested to know about the specifics
of you read vs update workload rate. Also, if you need more thorough
information, we have a paper describing all the liburcu flavors. It
might help you choose the one better suited for your needs. (if you
don't care that much about fine-tuning, my recommendation is to stick
with the "urcu.h"/"liburcu" flavor). Link to the paper preprint can be
found at http://www.efficios.com/publications

Thanks!

Mathieu

Paul E. McKenney May 27, 2011, 12:58 a.m. UTC | #10

On Thu, May 26, 2011 at 07:05:08PM -0400, Mathieu Desnoyers wrote:
> * Ingo Molnar (mingo@elte.hu) wrote:
> > 
> > * Pekka Enberg <penberg@kernel.org> wrote:
> > 
> > > On Thu, May 26, 2011 at 9:11 PM, Avi Kivity <avi@redhat.com> wrote:
> > > > On 05/26/2011 09:05 PM, Ingo Molnar wrote:
> > > >>
> > > >> >
> > > >> >  I've added some rwlocks because of what Ingo said yesterday about
> > > >> >  adding/removing devices after the first initialization phase.
> > > >> >
> > > >> >  Take MMIO lock for example: Since we can now run SMP guests, we may
> > > >> >  have multiple MMIO exits (one from each VCPU thread). Each of those
> > > >> >  exits leads to searching the MMIO rbtree.
> > > >> >
> > > >> >  We can use a mutex to lock it, but it just means that those threads
> > > >> >  will be blocked there instead of concurrently searching the MMIO
> > > >> >  tree which makes the search linear instead of parallel.
> > > >> >
> > > >> >  It's hard to bring 'real' numbers at this stage because the only
> > > >> >  'real' device we have which uses MMIO is the VESA driver, and we
> > > >> >  can't really simulate many VCPUs writing to it :)
> > > >>
> > > >> I'd suggest keeping it simple first - rwlocks are nasty and will
> > > >> bounce a cacheline just as much.
> > > >
> > > > Well, this is the first case where tools/kvm can do better than qemu with
> > > > its global lock, so I think it's worth it.
> > > >
> > > >> If lookup scalability is an issue we can extend RCU to tools/kvm/.
> > > >
> > > > Definitely rcu is a perfect patch for mmio dispatch.
> > > 
> > > Userspace RCU code is here, Sasha, if you feel like tackling this:
> > > 
> > > http://lttng.org/urcu
> > > 
> > > :-)
> > > 
> > > I'm CC'ing Paul and Mathieu as well for urcu.

I am hoping we can get better convergence between the user-level and
kernel-level URCU implementations once I get SRCU merged into the TREE_RCU
and TINY_RCU implementations.  But it is early days for user-level RCU
implementations -- for example, the kernel-level implementations have
deep dependencies on being able to lock themselves cheaply to a given CPU,
which does not exist at user level.

But there seems to be an assumption that there should be only one URCU
implementation, and I am not sure that this assumption holds.  For
example, there are several in http://lttng.org/urcu, each corresponding
to different use cases.  This should not be too much of a surprise, given
that there are quite a few implementations in the Linux kernel: TINY_RCU,
TINY_PREEMPT_RCU, TREE_RCU, TREE_PREEMPT_RCU, and SRCU.  Of course,
each of the first four variants provides RCU-bh and RCU-sched, and
TINY_PREEMPT_RCU and TREE_PREEMPT_RCU provide preemptible RCU in addition.

And back in the mid-1990s, I would never have imagined a need for more
than one implementation of RCU.  ;-)

All that aside, one advantage of http://lttng.org/urcu is that it already
exists, which allows prototyping to proceed immediately.  If it turns
out that URCU doesn't help for whatever reason, then there is no point in
worrying further.  And if URCU does turn out to help, we will know more
about exactly what this particular situation requires of URCU, which
will likely help us better understand what the implemenation should
look like -- perhaps very close to one of the URCU implementations,
perhaps very close to one of the in-kernel implementations.

Seem reasonable, or am I missing something here?

							Thanx, Paul

> > I think we should rather share some of the kernel RCU code in an 
> > intelligent way
> 
> Hi Ingo,
> 
> By all means feel free to redo all the work Paul have spent care and
> time coding and testing.
> 
> > instead of bringing in yet another library which is a 
> > IIRC a distant copy of the kernel code to begin with.
> 
> This is either a lie, or immensely misinformed. You should go and look
> at the source before doing nonsensical assumptions like this. What you
> are saying cannot be further from the truth.
> 
> > 
> > That way we could lazily benefit from all the enhancements Paul does 
> > to the kernel RCU code! :-)
> 
> Maybe there is a reason why Paul have been working with me on the
> userspace RCU implementation in parallel with working on the Linux
> kernel one ?
> 
> > 
> > Note that kernel/treercu.c is pretty tied to the kernel right now, so 
> > a first approach could be to:
> > 
> >  - Check Paul's excellent documentation about RCU and make sure
> >    you don't miss Paul's excellent 3-part primer on LWN.net:
> > 
> >      http://lwn.net/Articles/262464/
> >      http://lwn.net/Articles/263130/
> >      http://lwn.net/Articles/264090/
> > 
> >    There are also lots of very good RCU articles on the LWN Kernel 
> >    Index page:
> > 
> > 	http://lwn.net/Kernel/Index/
> 
> Or just (see README)
> 
> git clone git://git.lttng.org/urcu
> cd userspace-rcu
> ./bootstrap
> ./configure
> make
> make install
> ldconfig
> 
> #include <urcu.h>
> gcc -lurcu ...
> 
> and be done with it ?
> 
> >  - Check kernel/tinyrcu.c to see how RCU is implemented in its 
> >    simplest form. :)
> 
> ...so simplistic it only works on UP systems, which are not so common
> these days on the systems targeted by kvm.
> 
> > 
> >  - Copy the tree-RCU code from kernel/treercu.c to tools/kvm/rcu/
> 
> This code is very much tied with the kernel scheduler. This is actually
> one of the main reason why we had to reimplement RCU for userspace
> rather than to "simply copy the kernel one" as you recommend.
> 
> > 
> >  - Massage it so far that it is suitable for tools/kvm/. We really
> >    only need a few core RCU facilities initially:
> > 
> >     struct rcu_head;
> > 
> >     rcu_read_lock();
> >     rcu_read_unlock();
> > 
> >     rcu_dereference()
> > 
> >     call_rcu(head, func);
> > 
> >     rcu_synchronize();
> > 
> >    The rest, _bh(), etc. are all kernelisms.
> 
> rcu_synchronize and the rcu read lock/unlock, in tree RCU, are tied to
> the scheduler to deal with preemption. User-land does not have this
> luxury.
> 
> > 
> >  - Then once it's working we could look at doing the code sharing
> >    *for real*: splitting the functionality out of the original
> >    treercu.c code into kernel/treercu-lib.c and rcuupdate-lib.h or so
> >    and include that one in tools/kvm/rcu/.
> > 
> >  - [ You might also benefit from porting rcutorture code to 
> >      user-space. It will catch RCU bugs like nothing else. ]
> 
> Userspace RCU already includes the torture test suite you are referring
> to.
> 
> > 
> > That way the 'core RCU' logic would be contained in treercu-lib.c, 
> > all kernel glue would be in kernel/rcutree.c.
> > 
> > Some other approach might be possible as well, this was just a first 
> > rough idea :)
> 
> "wheel not invented here" syndrome ?
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> Operating System Efficiency R&D Consultant
> EfficiOS Inc.
> http://www.efficios.com
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Ingo Molnar May 27, 2011, 9:12 a.m. UTC | #11

* Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:

> > > > I'm CC'ing Paul and Mathieu as well for urcu.
> 
> I am hoping we can get better convergence between the user-level 
> and kernel-level URCU implementations once I get SRCU merged into 
> the TREE_RCU and TINY_RCU implementations. [...]

Yeah.

> [...]  But it is early days for user-level RCU implementations -- 
> for example, the kernel-level implementations have deep 
> dependencies on being able to lock themselves cheaply to a given 
> CPU, which does not exist at user level.

Correct - this is why i suggested a plain copy first, then look back 
how we (and whether we!) want to share logic.

> But there seems to be an assumption that there should be only one 
> URCU implementation, and I am not sure that this assumption holds.  

I'm not sure about that either. And sice tools/kvm/ lives in the 
kernel repo it would be a mortal sin [*] to not explore the code 
sharing angle!!! :-)

If a reasonable amount of sharing of logic is possible without making 
it painful for the kernel RCU code we could do other nice things like 
change the RCU logic and test it in user-space first and run 
user-space rcutorture on some really big cluster.

> [ ... ]
>
> All that aside, one advantage of http://lttng.org/urcu is that it 
> already exists, which allows prototyping to proceed immediately.  

it's offline right now:

 $ git clone git://git.lttng.org/urcu
 Cloning into urcu...
 fatal: The remote end hung up unexpectedly

One complication is that it's LGPL while tools/kvm/ is GPLv2. I guess 
we could copy a suitable implementation into tools/kvm/rcu/?

Thanks,

	Ingo

[1] punishable by death or eternal hacking of a Windows driver (i'd pick the former)
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sasha Levin May 27, 2011, 10:19 a.m. UTC | #12

On Thu, 2011-05-26 at 19:09 -0400, Mathieu Desnoyers wrote:
> * Sasha Levin (levinsasha928@gmail.com) wrote:
> > On Thu, 2011-05-26 at 21:21 +0300, Pekka Enberg wrote:
> > > On Thu, May 26, 2011 at 9:11 PM, Avi Kivity <avi@redhat.com> wrote:
> > > > On 05/26/2011 09:05 PM, Ingo Molnar wrote:
> > > >>
> > > >> >
> > > >> >  I've added some rwlocks because of what Ingo said yesterday about
> > > >> >  adding/removing devices after the first initialization phase.
> > > >> >
> > > >> >  Take MMIO lock for example: Since we can now run SMP guests, we may
> > > >> >  have multiple MMIO exits (one from each VCPU thread). Each of those
> > > >> >  exits leads to searching the MMIO rbtree.
> > > >> >
> > > >> >  We can use a mutex to lock it, but it just means that those threads
> > > >> >  will be blocked there instead of concurrently searching the MMIO
> > > >> >  tree which makes the search linear instead of parallel.
> > > >> >
> > > >> >  It's hard to bring 'real' numbers at this stage because the only
> > > >> >  'real' device we have which uses MMIO is the VESA driver, and we
> > > >> >  can't really simulate many VCPUs writing to it :)
> > > >>
> > > >> I'd suggest keeping it simple first - rwlocks are nasty and will
> > > >> bounce a cacheline just as much.
> > > >
> > > > Well, this is the first case where tools/kvm can do better than qemu with
> > > > its global lock, so I think it's worth it.
> > > >
> > > >> If lookup scalability is an issue we can extend RCU to tools/kvm/.
> > > >
> > > > Definitely rcu is a perfect patch for mmio dispatch.
> > > 
> > > Userspace RCU code is here, Sasha, if you feel like tackling this:
> > > 
> > > http://lttng.org/urcu
> > > 
> > > :-)
> > > 
> > > I'm CC'ing Paul and Mathieu as well for urcu.
> > 
> > Sounds good!
> > 
> > Should be quite an addition and could be used in more places than just
> > the MMIO dispatcher.
> 
> Hi Sasha,
> 
> Feel free to let me know if you need any help, have questions, or need
> improvements to liburcu. I'd be interested to know about the specifics
> of you read vs update workload rate. Also, if you need more thorough
> information, we have a paper describing all the liburcu flavors. It
> might help you choose the one better suited for your needs. (if you
> don't care that much about fine-tuning, my recommendation is to stick
> with the "urcu.h"/"liburcu" flavor). Link to the paper preprint can be
> found at http://www.efficios.com/publications

Hi Mathieu!

In tools/kvm/ we use a rb-tree (same one used by the kernel) with the
augmentation feature to support an interval rb-tree - which means that
every update to the tree not only updates the nodes directly related to
the updated node but also all the nodes on the path to the root of the
tree.

I see that in liburcu there is an implementation of a rcu linked list
but no implementation of a rb-tree.

Are you currently working on one? or maybe I should try writing one and
sending it to you?

Ingo Molnar May 27, 2011, 10:25 a.m. UTC | #13

* Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> >  - Check kernel/tinyrcu.c to see how RCU is implemented in its 
> >    simplest form. :)
> 
> ...so simplistic it only works on UP systems, which are not so common
> these days on the systems targeted by kvm.

As i said above, in its simplest form - which is UP.

Obviously it's not tinyrcu.c that should be used by tools/kvm/ but 
what i suggested, tree-RCU:

> >  - Copy the tree-RCU code from kernel/treercu.c to tools/kvm/rcu/
> 
> This code is very much tied with the kernel scheduler. [...]

It would not be particularly complex to enable user-space to request 
a callback on context switch events.

I was thinking on and off about allowing perf events to generate a 
per sampling event notification signal on specific events, such as 
page faults or context switches.

Obviously this won't be enabled from NMI contexts due to atomicity 
constraints, but the pagefault and maybe the context switch path 
looks doable.

That capability would be a rather simple kernel change and it would 
allow a user-space RCU implementation to be notified of various key 
events, context switches in particular.

Would you be interested in helping code up such a facility? The urcu 
library could make good use of it i think, regardless of what we do 
in tools/kvm/.

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Ingo Molnar May 27, 2011, 10:36 a.m. UTC | #14

* Sasha Levin <levinsasha928@gmail.com> wrote:

> I see that in liburcu there is an implementation of a rcu linked 
> list but no implementation of a rb-tree.

Another approach would be, until the RCU interactions are sorted out, 
to implement a 'big reader lock' thing that is completely lockless on 
the read side (!).

It works well if the write side is expensive, but very rare: which is 
certainly the case for these ioport registration data structures used 
in the mmio event demux fast path!

The write_lock() side signals all worker threads to finish whatever 
they are doing now and to wait for the write_unlock(). Then the 
modification can be done and the worker threads can be resumed.

This can be updated to RCU later on without much trouble.

The advantage is that this could be implemented with the existing 
thread-pool primitives straight away i think, we'd need five 
primitives:

  bread_lock();
  bread_unlock();
  bwrite_lock();
  bwrite_lock();

  brlock_init();

and a data type:

  struct brlock;

bread_lock()/bread_unlock() is basically just a compiler barrier. 
bwrite_lock() stops all (other) worker threads.
bwrite_unlock() resumes them.

That's all - should be 50 lines of code, unless i'm missing something 
:-)

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Ingo Molnar May 27, 2011, 11:07 a.m. UTC | #15

* Ingo Molnar <mingo@elte.hu> wrote:

> > This code is very much tied with the kernel scheduler. [...]
> 
> It would not be particularly complex to enable user-space to 
> request a callback on context switch events.
> 
> I was thinking on and off about allowing perf events to generate a 
> per sampling event notification signal on specific events, such as 
> page faults or context switches.

I was thinking about that on and off so loudly that Peter implemented 
it long ago via fasync support on the perf event fd! :-)

So if you set a notification signal via fcntl(F_SETOWN) on the 
scheduler context switch event fd, the user-space RCU code will get a 
signal on every context switch.

I have not tried it for this purpose yet, so let us know if there are 
unexpected complications :)

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Ingo Molnar May 27, 2011, 11:10 a.m. UTC | #16

* Ingo Molnar <mingo@elte.hu> wrote:

> I was thinking about that on and off so loudly that Peter 
> implemented it long ago via fasync support on the perf event fd! 
> :-)
> 
> So if you set a notification signal via fcntl(F_SETOWN) on the 
> scheduler context switch event fd, the user-space RCU code will get 
> a signal on every context switch.
> 
> I have not tried it for this purpose yet, so let us know if there 
> are unexpected complications :)

Note that you do not want the context switch event, but the CPU 
migration event: that will notify user-space when it gets migrated to 
another CPU. This is the case that RCU really needs.

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Ingo Molnar May 27, 2011, 11:24 a.m. UTC | #17

* Ingo Molnar <mingo@elte.hu> wrote:

> Note that you do not want the context switch event, but the CPU 
> migration event: that will notify user-space when it gets migrated 
> to another CPU. This is the case that RCU really needs.

Also note that the main current use-case of perf events is 
instrumentation, thus if you make use of this facility for user-space 
RCU you need to check whether the events are all precise and arrive 
in time to the target process, etc.

Statistical behavior isnt a big problem for instrumentation but it's 
a showstopper for RCU, obviously! :-)

If you find such bugs then we want to fix them, so there's no 
fundamental *desire* from us for these events to be statistical and 
inaccurate anywhere.

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Mathieu Desnoyers May 27, 2011, 12:48 p.m. UTC | #18

* Ingo Molnar (mingo@elte.hu) wrote:
> > [ ... ]
> >
> > All that aside, one advantage of http://lttng.org/urcu is that it 
> > already exists, which allows prototyping to proceed immediately.  
> 
> it's offline right now:
> 
>  $ git clone git://git.lttng.org/urcu
>  Cloning into urcu...
>  fatal: The remote end hung up unexpectedly

This would be:

git clone git://git.lttng.org/userspace-rcu.git

I'll make sure a symlink for urcu -> userspace-rcu.git gets setup
shortly.

> One complication is that it's LGPL while tools/kvm/ is GPLv2. I guess 
> we could copy a suitable implementation into tools/kvm/rcu/?

That might make sense, yes. The intent of having liburcu LGPL is really
to give as much liberty for apps developers to link to it as they
have calling Linux kernel system calls. This peculiarness of the GPL
Linux is able to use to let non-GPL apps use it does not apply so
clearly to libraries, hence the use of LGPL for all my libs that support
application execution.

Thanks,

Mathieu

Ingo Molnar May 27, 2011, 1:07 p.m. UTC | #19

* Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> > instead of bringing in yet another library which is a IIRC a 
> > distant copy of the kernel code to begin with.
> 
> This is either a lie, or immensely misinformed. You should go and 
> look at the source before doing nonsensical assumptions like this. 
> What you are saying cannot be further from the truth.

I was merely immensely misinformed, which (distinct!) possibility i 
tried to signal via the 'IIRC' qualifier as well ;-)

While right now i cannot clone the repository itself:

  $ git clone git://git.lttng.org/urcu
  Cloning into urcu...
  fatal: The remote end hung up unexpectedly

(tried it several times today, same failure)

I found a tarball of the package so could have a look again.

Despite my initial impression of it being kernel RCU code related (it 
has system.h, smp_mb, list_head, etc.), a closer look indeed suggests 
that the RCU implementation itself does not use the kernel RCU 
concepts so it sadly cannot be a distant copy of the kernel RCU code!

Which is too bad IMO: i don't think user-space RCU should necessarily 
be so different from kernel-space RCU. I think the two codebases 
could be shared, or at least they could become closer relatives! :-)

That could be done using the migration event notification trick i 
suggested in the previous mail.

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Mathieu Desnoyers May 27, 2011, 1:14 p.m. UTC | #20

* Sasha Levin (levinsasha928@gmail.com) wrote:
> On Thu, 2011-05-26 at 19:09 -0400, Mathieu Desnoyers wrote:
> > * Sasha Levin (levinsasha928@gmail.com) wrote:
> > > On Thu, 2011-05-26 at 21:21 +0300, Pekka Enberg wrote:
> > > > On Thu, May 26, 2011 at 9:11 PM, Avi Kivity <avi@redhat.com> wrote:
> > > > > On 05/26/2011 09:05 PM, Ingo Molnar wrote:
> > > > >>
> > > > >> >
> > > > >> >  I've added some rwlocks because of what Ingo said yesterday about
> > > > >> >  adding/removing devices after the first initialization phase.
> > > > >> >
> > > > >> >  Take MMIO lock for example: Since we can now run SMP guests, we may
> > > > >> >  have multiple MMIO exits (one from each VCPU thread). Each of those
> > > > >> >  exits leads to searching the MMIO rbtree.
> > > > >> >
> > > > >> >  We can use a mutex to lock it, but it just means that those threads
> > > > >> >  will be blocked there instead of concurrently searching the MMIO
> > > > >> >  tree which makes the search linear instead of parallel.
> > > > >> >
> > > > >> >  It's hard to bring 'real' numbers at this stage because the only
> > > > >> >  'real' device we have which uses MMIO is the VESA driver, and we
> > > > >> >  can't really simulate many VCPUs writing to it :)
> > > > >>
> > > > >> I'd suggest keeping it simple first - rwlocks are nasty and will
> > > > >> bounce a cacheline just as much.
> > > > >
> > > > > Well, this is the first case where tools/kvm can do better than qemu with
> > > > > its global lock, so I think it's worth it.
> > > > >
> > > > >> If lookup scalability is an issue we can extend RCU to tools/kvm/.
> > > > >
> > > > > Definitely rcu is a perfect patch for mmio dispatch.
> > > > 
> > > > Userspace RCU code is here, Sasha, if you feel like tackling this:
> > > > 
> > > > http://lttng.org/urcu
> > > > 
> > > > :-)
> > > > 
> > > > I'm CC'ing Paul and Mathieu as well for urcu.
> > > 
> > > Sounds good!
> > > 
> > > Should be quite an addition and could be used in more places than just
> > > the MMIO dispatcher.
> > 
> > Hi Sasha,
> > 
> > Feel free to let me know if you need any help, have questions, or need
> > improvements to liburcu. I'd be interested to know about the specifics
> > of you read vs update workload rate. Also, if you need more thorough
> > information, we have a paper describing all the liburcu flavors. It
> > might help you choose the one better suited for your needs. (if you
> > don't care that much about fine-tuning, my recommendation is to stick
> > with the "urcu.h"/"liburcu" flavor). Link to the paper preprint can be
> > found at http://www.efficios.com/publications
> 
> Hi Mathieu!
> 
> In tools/kvm/ we use a rb-tree (same one used by the kernel) with the
> augmentation feature to support an interval rb-tree - which means that
> every update to the tree not only updates the nodes directly related to
> the updated node but also all the nodes on the path to the root of the
> tree.

Cool !!

I'm adding in copy Phil Howard who has been working on RCU RB tree for
much longer than myself.

> I see that in liburcu there is an implementation of a rcu linked list
> but no implementation of a rb-tree.
> 
> Are you currently working on one? or maybe I should try writing one and
> sending it to you?

Actually, I started working on one last year, but had to interrupt my
effort before I got it even working right. The state of this
(disclaimer: unfinished!!) work is available in the "rbtree" branch of
the urcu library. This tree has insertion/removals protected by a mutex,
and uses a RCU read lock to protect traversal. The main problem I was
facing when I had to stop working on that code is that the "nil" node:

  56 /* Sentinel (bottom nodes).
  57  * Don't care about p, left, right, pos and key values */
  58 struct rcu_rbtree_node rcu_rbtree_nil = {
  59         .color = COLOR_BLACK,
  60 };

ended up being written to as temporary node by the algorithm presented
in CLRS, chap. 12.  So sharing it as a common node, as proposed in their
book, made sense only if you consider you have no concurrency, but seems
to break left and right in the presence of concurrency, and I did not
have time to review their entire algo to find out where I should check
for accesses to this nil node.

This implementation is trying to think of the RB tree in terms of how
each operation is affecting the read-side visibility of the tree nodes.
It uses the fact that readers only ever go down into the tree
extensively.

I'd be glad to help out if someone want to have a look and try to
complete that work, which should only be considered as "work in
progress" level of (in)completeness.

We'd have to see how we can go from this implementation of a standard RB
tree to an interval RB tree too. I guess it will depend whether you need
the updates from the target node up to the root to be done "all at once"
from a reader perspective (then you would probably need to replace a
copy of a part of the tree all at once), or if you can allow the update
to be done piece-wise on a node-by-node basis as readers go through the
tree (from root to leafs).

Thanks,

Mathieu

Ingo Molnar May 27, 2011, 1:19 p.m. UTC | #21

* Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> > it's offline right now:
> > 
> >  $ git clone git://git.lttng.org/urcu
> >  Cloning into urcu...
> >  fatal: The remote end hung up unexpectedly
> 
> This would be:
> 
> git clone git://git.lttng.org/userspace-rcu.git

Hey, my impression wasn't *entirely* wrong, your initial urcu commit:

 From 27b012e271a82b9a0d94543688904f207cd154ea Mon Sep 17 00:00:00 2001
 From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
 Date: Thu, 5 Feb 2009 19:06:44 -0500
 Subject: [PATCH] init version

 ---
  Makefile |    6 ++
  urcu.c   |  250 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  urcu.h   |   69 +++++++++++++++++
  3 files changed, 325 insertions(+), 0 deletions(-)

Has:

+/* The "volatile" is due to gcc bugs */
+#define barrier() __asm__ __volatile__("": : :"memory")
+

Which code sequence i recognize very well as a kernel maintainer ;-) 
Here's the kernel's compiler.h definition of the same:

  /* The "volatile" is due to gcc bugs */
  #define barrier() __asm__ __volatile__("": : :"memory")

This:

+/* x86 32/64 specific */
+#define mb()    asm volatile("mfence":::"memory")
+#define rmb()   asm volatile("lfence":::"memory")
+#define wmb()   asm volatile("sfence" ::: "memory")
+
+
+
+/* x86 32 */
+static inline void atomic_inc(int *v)
+{
+       asm volatile("lock; incl %0"
+                    : "+m" (v->counter));

is familiar to an arch/x86/ maintainer as well :-)

So yes, kernel code was obviously used in the making of urcu - just 
not the RCU kernel code it appears.

Which is a pity i think! :-)

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Mathieu Desnoyers May 27, 2011, 1:22 p.m. UTC | #22

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
> > >  - Check kernel/tinyrcu.c to see how RCU is implemented in its 
> > >    simplest form. :)
> > 
> > ...so simplistic it only works on UP systems, which are not so common
> > these days on the systems targeted by kvm.
> 
> As i said above, in its simplest form - which is UP.
> 
> Obviously it's not tinyrcu.c that should be used by tools/kvm/ but 
> what i suggested, tree-RCU:

I agree that tree-RCU has some grace-period management scalability
benefits that would be interesting to have.

> > >  - Copy the tree-RCU code from kernel/treercu.c to tools/kvm/rcu/
> > 
> > This code is very much tied with the kernel scheduler. [...]
> 
> It would not be particularly complex to enable user-space to request 
> a callback on context switch events.
> 
> I was thinking on and off about allowing perf events to generate a 
> per sampling event notification signal on specific events, such as 
> page faults or context switches.
>
> Obviously this won't be enabled from NMI contexts due to atomicity 
> constraints, but the pagefault and maybe the context switch path 
> looks doable.
> 
> That capability would be a rather simple kernel change and it would 
> allow a user-space RCU implementation to be notified of various key 
> events, context switches in particular.

I'm worried about "self-recursion" behaviors that could be triggered
though: if the userland callback code called from a page fault triggers
a page fault all by itself, then it looks like a good way to bring the
system to its knees. The same apply to context switches. Do you have a
way to handle this in mind ?

> 
> Would you be interested in helping code up such a facility? The urcu 
> library could make good use of it i think, regardless of what we do 
> in tools/kvm/.

I'm open to try finding out the best possible approach to support RCU in
user-space (disclaimer: I might need some help on this due to my time
being fully taken by contracts currently). I guess the sweet spot will
end up being at a crossroad between kernel-only and userland-only
solution.

Thanks,

Mathieu

> 
> Thanks,
> 
> 	Ingo

Mathieu Desnoyers May 27, 2011, 1:29 p.m. UTC | #23

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
> > > it's offline right now:
> > > 
> > >  $ git clone git://git.lttng.org/urcu
> > >  Cloning into urcu...
> > >  fatal: The remote end hung up unexpectedly
> > 
> > This would be:
> > 
> > git clone git://git.lttng.org/userspace-rcu.git
> 
> Hey, my impression wasn't *entirely* wrong, your initial urcu commit:
> 
>  From 27b012e271a82b9a0d94543688904f207cd154ea Mon Sep 17 00:00:00 2001
>  From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
>  Date: Thu, 5 Feb 2009 19:06:44 -0500
>  Subject: [PATCH] init version
> 
>  ---
>   Makefile |    6 ++
>   urcu.c   |  250 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>   urcu.h   |   69 +++++++++++++++++
>   3 files changed, 325 insertions(+), 0 deletions(-)
> 
> Has:
> 
> +/* The "volatile" is due to gcc bugs */
> +#define barrier() __asm__ __volatile__("": : :"memory")
> +
> 
> Which code sequence i recognize very well as a kernel maintainer ;-) 
> Here's the kernel's compiler.h definition of the same:
> 
>   /* The "volatile" is due to gcc bugs */
>   #define barrier() __asm__ __volatile__("": : :"memory")
> 
> This:
> 
> +/* x86 32/64 specific */
> +#define mb()    asm volatile("mfence":::"memory")
> +#define rmb()   asm volatile("lfence":::"memory")
> +#define wmb()   asm volatile("sfence" ::: "memory")
> +
> +
> +
> +/* x86 32 */
> +static inline void atomic_inc(int *v)
> +{
> +       asm volatile("lock; incl %0"
> +                    : "+m" (v->counter));
> 
> is familiar to an arch/x86/ maintainer as well :-)
> 
> So yes, kernel code was obviously used in the making of urcu - just 
> not the RCU kernel code it appears.
> 
> Which is a pity i think! :-)

Heh :) You know, I really like the Linux kernel coding style, which is
what I tried to stick to within this library. So although I initially
imported some of the core Linux kernel macroisms, I had to reimplement
them (derived from a MIT-licensed code-base) as soon as I decided to go
for MIT-licensed low-level primitives and LGPL-licensed library.

About RCU, the picture seemed very much different in the userspace
landscape compared to the kernel (needed to use of per-thread RCU
nesting counters compared to per-CPU in the kernel because of lack of
integration with the scheduler), but more on that in my other follow-up
reply.

Thanks,

Mathieu

> 
> Thanks,
> 
> 	Ingo

Ingo Molnar May 27, 2011, 1:31 p.m. UTC | #24

* Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> I'm worried about "self-recursion" behaviors that could be 
> triggered though: if the userland callback code called from a page 
> fault triggers a page fault all by itself, then it looks like a 
> good way to bring the system to its knees. [...]

Not really, SIGIO isnt being reprocessed until the signal handler 
returns.

> [...] The same apply to context switches. Do you have a way to 
> handle this in mind ?

Shouldnt be a problem in theory: yes, in case of repeat migrations 
repeat signals will be sent, but they should not nest in any nasty 
fashion.

That's the theory, it needs checking! :-)

One furthr optimization would be possible: in case you think we can 
write the signal handler in assembly or build it with gcc flags that 
does not use SSE we might also add a 'lightweight signal handler' 
kind of flag to the kernel, which does not save FPU/vector-CPU(SSE) 
state. In this case signals become *really* fast on x86, almost as 
fast as interrupts.

One detail: you'd not want to use a queueing signal, because the 
siginfo queue might overflow. It's also unnecessary: RCU only needs 
the last migration event, previous history is uninteresting.

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Ingo Molnar May 27, 2011, 1:36 p.m. UTC | #25

* Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> > So yes, kernel code was obviously used in the making of urcu - 
> > just not the RCU kernel code it appears.
> > 
> > Which is a pity i think! :-)
> 
> Heh :) You know, I really like the Linux kernel coding style, which 
> is what I tried to stick to within this library. So although I 
> initially imported some of the core Linux kernel macroisms, I had 
> to reimplement them (derived from a MIT-licensed code-base) as soon 
> as I decided to go for MIT-licensed low-level primitives and 
> LGPL-licensed library.

Well, that might be a somewhat fragile assumption in certain 
jurisdictions, did you know that this is not necessarily a clean
room reimplementation of the urcu code anymore, so the chain of
derivation might still be present legally, right? :-)

[ Not that i can imagine Linus ever bothering you about barrier() or 
  atomic_inc() ;-) ]

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Mathieu Desnoyers May 27, 2011, 2:11 p.m. UTC | #26

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Ingo Molnar <mingo@elte.hu> wrote:
> 
> > > This code is very much tied with the kernel scheduler. [...]
> > 
> > It would not be particularly complex to enable user-space to 
> > request a callback on context switch events.
> > 
> > I was thinking on and off about allowing perf events to generate a 
> > per sampling event notification signal on specific events, such as 
> > page faults or context switches.
> 
> I was thinking about that on and off so loudly that Peter implemented 
> it long ago via fasync support on the perf event fd! :-)
> 
> So if you set a notification signal via fcntl(F_SETOWN) on the 
> scheduler context switch event fd, the user-space RCU code will get a 
> signal on every context switch.
> 
> I have not tried it for this purpose yet, so let us know if there are 
> unexpected complications :)

I'm worried about the following complications:

Let's define per-process, per-cpu data structures, mapped in userland,
that keep track of the per-cpu read-side C.S. nesting count. Let say we
use this signal-based mechanism to inform userland of migrations.

1) The first thing I notice is that we're not informed of threads being
   blocked. So multiple threads taking read-side C.S. in turn on the
   same CPU as they are being scheduled will corrupt each other's
   read-side C.S. counter.

2) The second thing I notice is that there is no form of synchronization
   between the userland execution and delivery of this signal, which
   leads to interesting race conditions.

Any thoughts on how to address these ?

Thanks,

Mathieu

Mathieu Desnoyers May 27, 2011, 2:18 p.m. UTC | #27

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Ingo Molnar <mingo@elte.hu> wrote:
> 
> > Note that you do not want the context switch event, but the CPU 
> > migration event: that will notify user-space when it gets migrated 
> > to another CPU. This is the case that RCU really needs.
> 
> Also note that the main current use-case of perf events is 
> instrumentation, thus if you make use of this facility for user-space 
> RCU you need to check whether the events are all precise and arrive 
> in time to the target process, etc.
> 
> Statistical behavior isnt a big problem for instrumentation but it's 
> a showstopper for RCU, obviously! :-)
> 
> If you find such bugs then we want to fix them, so there's no 
> fundamental *desire* from us for these events to be statistical and 
> inaccurate anywhere.

The accuracy vs speed tradeoff is actually quite different from the
instrumentation vs low-level-synchronization point of views. It might be
acceptable in some sampling situations to get some inaccuracy due to
lack of locking if it makes the data collection tool reasonably fast for
real-life use (note that I am talking about "sampling", not event-based
tracing here). So there are situations where adding locking will make
the overhead prohibitive for sampling, but would be required for the
perfect accuracy needed by RCU.

So although the desire might not be to get inaccurate data, the actual
desire to get it in a low-overhead fashion can lead to different
decisions regarding the accuracy vs speed tradeoff.

Thanks,

Mathieu

Sasha Levin May 27, 2011, 3:52 p.m. UTC | #28

On Fri, 2011-05-27 at 12:36 +0200, Ingo Molnar wrote:
> * Sasha Levin <levinsasha928@gmail.com> wrote:
> 
> > I see that in liburcu there is an implementation of a rcu linked 
> > list but no implementation of a rb-tree.
> 
> Another approach would be, until the RCU interactions are sorted out, 
> to implement a 'big reader lock' thing that is completely lockless on 
> the read side (!).
> 
> It works well if the write side is expensive, but very rare: which is 
> certainly the case for these ioport registration data structures used 
> in the mmio event demux fast path!
> 
> The write_lock() side signals all worker threads to finish whatever 
> they are doing now and to wait for the write_unlock(). Then the 
> modification can be done and the worker threads can be resumed.
> 
> This can be updated to RCU later on without much trouble.
> 
> The advantage is that this could be implemented with the existing 
> thread-pool primitives straight away i think, we'd need five 
> primitives:
> 
>   bread_lock();
>   bread_unlock();
>   bwrite_lock();
>   bwrite_lock();
> 
>   brlock_init();
> 
> and a data type:
> 
>   struct brlock;
> 
> bread_lock()/bread_unlock() is basically just a compiler barrier. 
> bwrite_lock() stops all (other) worker threads.
> bwrite_unlock() resumes them.
> 
> That's all - should be 50 lines of code, unless i'm missing something 
> :-)
> 
> Thanks,
> 
> 	Ingo

Isn't there something similar to this in the kernel?

I prefer not implementing a new lock type at the moment mostly because
we're not tackling a bug or an immediate problem, we don't really need
locking at the moment (we add all devices at init and don't support
hotplug yet). So I'd rather not write code just to solve it faster but
have it thrown away later.

Ingo Molnar May 27, 2011, 5:10 p.m. UTC | #29

* Sasha Levin <levinsasha928@gmail.com> wrote:

> On Fri, 2011-05-27 at 12:36 +0200, Ingo Molnar wrote:
> > * Sasha Levin <levinsasha928@gmail.com> wrote:
> > 
> > > I see that in liburcu there is an implementation of a rcu linked 
> > > list but no implementation of a rb-tree.
> > 
> > Another approach would be, until the RCU interactions are sorted out, 
> > to implement a 'big reader lock' thing that is completely lockless on 
> > the read side (!).
> > 
> > It works well if the write side is expensive, but very rare: which is 
> > certainly the case for these ioport registration data structures used 
> > in the mmio event demux fast path!
> > 
> > The write_lock() side signals all worker threads to finish whatever 
> > they are doing now and to wait for the write_unlock(). Then the 
> > modification can be done and the worker threads can be resumed.
> > 
> > This can be updated to RCU later on without much trouble.
> > 
> > The advantage is that this could be implemented with the existing 
> > thread-pool primitives straight away i think, we'd need five 
> > primitives:
> > 
> >   bread_lock();
> >   bread_unlock();
> >   bwrite_lock();
> >   bwrite_lock();
> > 
> >   brlock_init();
> > 
> > and a data type:
> > 
> >   struct brlock;
> > 
> > bread_lock()/bread_unlock() is basically just a compiler barrier. 
> > bwrite_lock() stops all (other) worker threads.
> > bwrite_unlock() resumes them.
> > 
> > That's all - should be 50 lines of code, unless i'm missing something 
> > :-)
> > 
> > Thanks,
> > 
> > 	Ingo
> 
> Isn't there something similar to this in the kernel?

Yeah, there's include/linux/lglock.h.

> I prefer not implementing a new lock type at the moment mostly because
> we're not tackling a bug or an immediate problem, we don't really need
> locking at the moment (we add all devices at init and don't support
> hotplug yet). So I'd rather not write code just to solve it faster but
> have it thrown away later.

We don't have to throw it away: RCU is rather complex to pull off 
here, and in many cases, where writes are very rare, brlocks are the 
best solution even with RCU present.

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Paul E. McKenney May 27, 2011, 5:22 p.m. UTC | #30

On Fri, May 27, 2011 at 11:12:20AM +0200, Ingo Molnar wrote:
> 
> * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> 
> > > > > I'm CC'ing Paul and Mathieu as well for urcu.
> > 
> > I am hoping we can get better convergence between the user-level 
> > and kernel-level URCU implementations once I get SRCU merged into 
> > the TREE_RCU and TINY_RCU implementations. [...]
> 
> Yeah.
> 
> > [...]  But it is early days for user-level RCU implementations -- 
> > for example, the kernel-level implementations have deep 
> > dependencies on being able to lock themselves cheaply to a given 
> > CPU, which does not exist at user level.
> 
> Correct - this is why i suggested a plain copy first, then look back 
> how we (and whether we!) want to share logic.

OK, here is an approach that I rejected long ago due to its not handling
existing code bases nicely.  But it should work fine for user-space
applications that are willing to adapt themselves to RCU, so it is well
worth considering for this set of use cases.

The basic trick is to pretend that each user-level thread is its own CPU.
This is easiest if the application does not do any RCU activity from
signal handlers, though app-code signal handlers can be dealt with as
well, if needed.  (But I hate POSIX signals...)

Given this trick, the code that is currently invoked from the
scheduling-clock interrupt can be instead invoked from a per-thread
SIGALRM.

Given the current implementation in -tip, the RCU core processing can be
done from separate threads, but things must be tweaked because TREE_RCU
assumes that the RCU core processing happens on the same CPU (which now
becomes in the same thread) that it corresponds to.  In other words,
the rcuc0 thread is hard-affinitied to CPU 0, the rcuc1 thread to CPU 1,
and so on.

One way to handle this would be to do the per-CPU kthread processing
in signal-handler context.  Then code segments that disable interrupts
(the __call_rcu() function comes immediately to mind) must block the
corresponding signals.  Which can easily be abstracted so that common
code handles it.

We could handle dyntick-idle if there is a convenient way to get
notification when a thread blocks (as opposed to being preempted).
There are a number of strategies that might work here, the first that
comes to mind is to notify only if the block is TASK_INTERRUPTIBLE,
which indicates a relatively long-term sleep.  This notification could
call rcu_enter_nohz() and friends.  So, is there a way to get
notification on TASK_INTERRUPTIBLE blocking and unblocking?

This is not a general-purpose solution (which is why I rejected it when
thinking along these lines some years ago), but it would be an interesting
way to share the in-kernel code.  And I believe that this approach would
be quite useful to a great number of user-level apps/tools/utilities
that were willing to live within its constraints.

The each-thread-is-a-CPU might seem limiting, but the current TREE_RCU
implementation would allow up to 4,194,304 threads on a 64-bit system
and up to 524,288 on a 32-bit system, which should prove sufficient for
most uses.  Famous last words...  But it would be easy to add a fifth
level of hierarchy if someone really does have a legitimate need for more
threads, which would bring us to 268,435,456 threads for 64-bit systems
and 16,777,216 threads for 32-bit systems.  And it is easy to add more
levels -- and the extra levels don't penalize people who don't need them.
With the current implementation, the maximum number of threads would
need to be specified at compile time, but again, this should be OK in
almost all cases.  Default to (say) 131,072 threads maximum and be happy.

> > But there seems to be an assumption that there should be only one 
> > URCU implementation, and I am not sure that this assumption holds.  
> 
> I'm not sure about that either. And sice tools/kvm/ lives in the 
> kernel repo it would be a mortal sin [*] to not explore the code 
> sharing angle!!! :-)
> 
> If a reasonable amount of sharing of logic is possible without making 
> it painful for the kernel RCU code we could do other nice things like 
> change the RCU logic and test it in user-space first and run 
> user-space rcutorture on some really big cluster.

That would be cool -- also, it would make the Linux-kernel code
more accessible, because people could play with it in userspace,
single-stepping, setting breakpoints, and so on.

> > [ ... ]
> >
> > All that aside, one advantage of http://lttng.org/urcu is that it 
> > already exists, which allows prototyping to proceed immediately.  
> 
> it's offline right now:
> 
>  $ git clone git://git.lttng.org/urcu
>  Cloning into urcu...
>  fatal: The remote end hung up unexpectedly
> 
> One complication is that it's LGPL while tools/kvm/ is GPLv2. I guess 
> we could copy a suitable implementation into tools/kvm/rcu/?

That is another reason why I believe that an in-kernel-tree version
of URCU is not a replacement for the variant that Mathieu is maintaining
(and that I am contributing to).  Mathieu's implementation can be used
by non-GPL applications and by applications that are not closely tied
to the Linux kernel.

So I really don't see a problem with having both of them around.

> Thanks,
> 
> 	Ingo
> 
> [1] punishable by death or eternal hacking of a Windows driver (i'd pick the former)

Ouch!!!

One of my college buddies left the field due to Windows taking over large
chunks of the embedded space in the 1990s.  So, like you, he definitely
rejected the latter.  ;-)

							Thanx, Paul
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sasha Levin May 27, 2011, 8:19 p.m. UTC | #31

On Fri, 2011-05-27 at 19:10 +0200, Ingo Molnar wrote:
> * Sasha Levin <levinsasha928@gmail.com> wrote:
> 
> > On Fri, 2011-05-27 at 12:36 +0200, Ingo Molnar wrote:
> > > * Sasha Levin <levinsasha928@gmail.com> wrote:
> > > 
> > > > I see that in liburcu there is an implementation of a rcu linked 
> > > > list but no implementation of a rb-tree.
> > > 
> > > Another approach would be, until the RCU interactions are sorted out, 
> > > to implement a 'big reader lock' thing that is completely lockless on 
> > > the read side (!).
> > > 
> > > It works well if the write side is expensive, but very rare: which is 
> > > certainly the case for these ioport registration data structures used 
> > > in the mmio event demux fast path!
> > > 
> > > The write_lock() side signals all worker threads to finish whatever 
> > > they are doing now and to wait for the write_unlock(). Then the 
> > > modification can be done and the worker threads can be resumed.
> > > 
> > > This can be updated to RCU later on without much trouble.
> > > 
> > > The advantage is that this could be implemented with the existing 
> > > thread-pool primitives straight away i think, we'd need five 
> > > primitives:
> > > 
> > >   bread_lock();
> > >   bread_unlock();
> > >   bwrite_lock();
> > >   bwrite_lock();
> > > 
> > >   brlock_init();
> > > 
> > > and a data type:
> > > 
> > >   struct brlock;
> > > 
> > > bread_lock()/bread_unlock() is basically just a compiler barrier. 
> > > bwrite_lock() stops all (other) worker threads.
> > > bwrite_unlock() resumes them.
> > > 
> > > That's all - should be 50 lines of code, unless i'm missing something 
> > > :-)
> > > 
> > > Thanks,
> > > 
> > > 	Ingo
> > 
> > Isn't there something similar to this in the kernel?
> 
> Yeah, there's include/linux/lglock.h.
> 
> > I prefer not implementing a new lock type at the moment mostly because
> > we're not tackling a bug or an immediate problem, we don't really need
> > locking at the moment (we add all devices at init and don't support
> > hotplug yet). So I'd rather not write code just to solve it faster but
> > have it thrown away later.
> 
> We don't have to throw it away: RCU is rather complex to pull off 
> here, and in many cases, where writes are very rare, brlocks are the 
> best solution even with RCU present.

So the basic plan here is to allocate a futex(?) for each VCPU thread,
and have the writer thread lock all futexes when it needs to write?

If we assume we only have one writer thread, it can stay pretty simple.

[4/6] kvm tools: Add rwlock wrapper

Commit Message

Comments

Patch