diff mbox series

[v6,4/5] 9pfs: T_readdir latency optimization

Message ID 14ec5d880cfca878bf32e643243c7ab3f4a52440.1587309014.git.qemu_oss@crudebyte.com (mailing list archive)
State New, archived
Headers show
Series 9pfs: readdir optimization | expand

Commit Message

Christian Schoenebeck April 19, 2020, 3:06 p.m. UTC
Make top half really top half and bottom half really bottom half:

Each T_readdir request handling is hopping between threads (main
I/O thread and background I/O driver threads) several times for
every individual directory entry, which sums up to huge latencies
for handling just a single T_readdir request.

Instead of doing that, collect now all required directory entries
(including all potentially required stat buffers for each entry) in
one rush on a background I/O thread from fs driver by calling the
previously added function v9fs_co_readdir_many() instead of
v9fs_co_readdir(), then assemble the entire resulting network
response message for the readdir request on main I/O thread. The
fs driver is still aborting the directory entry retrieval loop
(on the background I/O thread inside of v9fs_co_readdir_many())
as soon as it would exceed the client's requested maximum R_readdir
response size. So this will not introduce a performance penalty on
another end.

Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
---
 hw/9pfs/9p.c | 122 +++++++++++++++++++++++----------------------------
 1 file changed, 55 insertions(+), 67 deletions(-)

Comments

Christian Schoenebeck June 3, 2020, 5:16 p.m. UTC | #1
On Sonntag, 19. April 2020 17:06:17 CEST Christian Schoenebeck wrote:
> Make top half really top half and bottom half really bottom half:
> 
> Each T_readdir request handling is hopping between threads (main
> I/O thread and background I/O driver threads) several times for
> every individual directory entry, which sums up to huge latencies
> for handling just a single T_readdir request.
> 
> Instead of doing that, collect now all required directory entries
> (including all potentially required stat buffers for each entry) in
> one rush on a background I/O thread from fs driver by calling the
> previously added function v9fs_co_readdir_many() instead of
> v9fs_co_readdir(), then assemble the entire resulting network
> response message for the readdir request on main I/O thread. The
> fs driver is still aborting the directory entry retrieval loop
> (on the background I/O thread inside of v9fs_co_readdir_many())
> as soon as it would exceed the client's requested maximum R_readdir
> response size. So this will not introduce a performance penalty on
> another end.
> 
> Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> ---
>  hw/9pfs/9p.c | 122 +++++++++++++++++++++++----------------------------
>  1 file changed, 55 insertions(+), 67 deletions(-)

Ping. Anybody?

I would like to roll out this patch set soon and this is the only patch in 
this series missing a review yet.
Greg Kurz June 29, 2020, 4:39 p.m. UTC | #2
On Wed, 03 Jun 2020 19:16:08 +0200
Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:

> On Sonntag, 19. April 2020 17:06:17 CEST Christian Schoenebeck wrote:
> > Make top half really top half and bottom half really bottom half:
> > 
> > Each T_readdir request handling is hopping between threads (main
> > I/O thread and background I/O driver threads) several times for
> > every individual directory entry, which sums up to huge latencies
> > for handling just a single T_readdir request.
> > 
> > Instead of doing that, collect now all required directory entries
> > (including all potentially required stat buffers for each entry) in
> > one rush on a background I/O thread from fs driver by calling the
> > previously added function v9fs_co_readdir_many() instead of
> > v9fs_co_readdir(), then assemble the entire resulting network
> > response message for the readdir request on main I/O thread. The
> > fs driver is still aborting the directory entry retrieval loop
> > (on the background I/O thread inside of v9fs_co_readdir_many())
> > as soon as it would exceed the client's requested maximum R_readdir
> > response size. So this will not introduce a performance penalty on
> > another end.
> > 
> > Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> > ---
> >  hw/9pfs/9p.c | 122 +++++++++++++++++++++++----------------------------
> >  1 file changed, 55 insertions(+), 67 deletions(-)
> 
> Ping. Anybody?
> 
> I would like to roll out this patch set soon and this is the only patch in 
> this series missing a review yet.
> 

Hi Christian,

Sorry for getting back to this only now :-\

So I still have some concerns about the locking of the directory stream
pointer a fid. It was initially introduced to avoid concurrent accesses
by multiple threads to the corresponding internal glibc object, as
recommended in the readdir(3) manual page. Now, this patch considerably
extends the critical section to also contain calls to telldir() and all
the _many_ readdir()... so I'm not sure exactly what's the purpose of
that mutex right now. Please provide more details.

Cheers,

--
Greg
Christian Schoenebeck June 30, 2020, 3:16 p.m. UTC | #3
On Montag, 29. Juni 2020 18:39:02 CEST Greg Kurz wrote:
> On Wed, 03 Jun 2020 19:16:08 +0200
> 
> Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:
> > On Sonntag, 19. April 2020 17:06:17 CEST Christian Schoenebeck wrote:
> > > Make top half really top half and bottom half really bottom half:
> > > 
> > > Each T_readdir request handling is hopping between threads (main
> > > I/O thread and background I/O driver threads) several times for
> > > every individual directory entry, which sums up to huge latencies
> > > for handling just a single T_readdir request.
> > > 
> > > Instead of doing that, collect now all required directory entries
> > > (including all potentially required stat buffers for each entry) in
> > > one rush on a background I/O thread from fs driver by calling the
> > > previously added function v9fs_co_readdir_many() instead of
> > > v9fs_co_readdir(), then assemble the entire resulting network
> > > response message for the readdir request on main I/O thread. The
> > > fs driver is still aborting the directory entry retrieval loop
> > > (on the background I/O thread inside of v9fs_co_readdir_many())
> > > as soon as it would exceed the client's requested maximum R_readdir
> > > response size. So this will not introduce a performance penalty on
> > > another end.
> > > 
> > > Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> > > ---
> > > 
> > >  hw/9pfs/9p.c | 122 +++++++++++++++++++++++----------------------------
> > >  1 file changed, 55 insertions(+), 67 deletions(-)
> > 
> > Ping. Anybody?
> > 
> > I would like to roll out this patch set soon and this is the only patch in
> > this series missing a review yet.
> 
> Hi Christian,

Hi Greg,

> Sorry for getting back to this only now :-\
> 
> So I still have some concerns about the locking of the directory stream
> pointer a fid. It was initially introduced to avoid concurrent accesses
> by multiple threads to the corresponding internal glibc object, as
> recommended in the readdir(3) manual page. Now, this patch considerably
> extends the critical section to also contain calls to telldir() and all
> the _many_ readdir()... so I'm not sure exactly what's the purpose of
> that mutex right now. Please provide more details.

The intention of this lock is to prevent concurrent r/w (i.e. telldir()/
readdir()) of the dir stream position by two or more active readdir requests 
handled in parallel, provided that they would use the same FID. Due to the 
latter requirement for this to become relevant, we already agreed that this is 
not something that would usually happen with a Linux 9p client, but there is 
nothing from protocol point of view that would prohibit this to be done by a 
client, so the resulting undefined behaviour should be prevented, which this 
lock does.

What's your precise concern?

Best regards,
Christian Schoenebeck
Greg Kurz June 30, 2020, 4:39 p.m. UTC | #4
On Tue, 30 Jun 2020 17:16:40 +0200
Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:

> On Montag, 29. Juni 2020 18:39:02 CEST Greg Kurz wrote:
> > On Wed, 03 Jun 2020 19:16:08 +0200
> > 
> > Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:
> > > On Sonntag, 19. April 2020 17:06:17 CEST Christian Schoenebeck wrote:
> > > > Make top half really top half and bottom half really bottom half:
> > > > 
> > > > Each T_readdir request handling is hopping between threads (main
> > > > I/O thread and background I/O driver threads) several times for
> > > > every individual directory entry, which sums up to huge latencies
> > > > for handling just a single T_readdir request.
> > > > 
> > > > Instead of doing that, collect now all required directory entries
> > > > (including all potentially required stat buffers for each entry) in
> > > > one rush on a background I/O thread from fs driver by calling the
> > > > previously added function v9fs_co_readdir_many() instead of
> > > > v9fs_co_readdir(), then assemble the entire resulting network
> > > > response message for the readdir request on main I/O thread. The
> > > > fs driver is still aborting the directory entry retrieval loop
> > > > (on the background I/O thread inside of v9fs_co_readdir_many())
> > > > as soon as it would exceed the client's requested maximum R_readdir
> > > > response size. So this will not introduce a performance penalty on
> > > > another end.
> > > > 
> > > > Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> > > > ---
> > > > 
> > > >  hw/9pfs/9p.c | 122 +++++++++++++++++++++++----------------------------
> > > >  1 file changed, 55 insertions(+), 67 deletions(-)
> > > 
> > > Ping. Anybody?
> > > 
> > > I would like to roll out this patch set soon and this is the only patch in
> > > this series missing a review yet.
> > 
> > Hi Christian,
> 
> Hi Greg,
> 
> > Sorry for getting back to this only now :-\
> > 
> > So I still have some concerns about the locking of the directory stream
> > pointer a fid. It was initially introduced to avoid concurrent accesses
> > by multiple threads to the corresponding internal glibc object, as
> > recommended in the readdir(3) manual page. Now, this patch considerably
> > extends the critical section to also contain calls to telldir() and all
> > the _many_ readdir()... so I'm not sure exactly what's the purpose of
> > that mutex right now. Please provide more details.
> 
> The intention of this lock is to prevent concurrent r/w (i.e. telldir()/
> readdir()) of the dir stream position by two or more active readdir requests 
> handled in parallel, provided that they would use the same FID. Due to the 
> latter requirement for this to become relevant, we already agreed that this is 
> not something that would usually happen with a Linux 9p client, but there is 
> nothing from protocol point of view that would prohibit this to be done by a 
> client, so the resulting undefined behaviour should be prevented, which this 
> lock does.
> 

Makes sense. The dir stream position is no different from glibc's internal
dirent in that respect: it shouldn't be used by concurrent threads.

> What's your precise concern?
> 

My overall concern is still the same. The patches are big and I've
been away for too long, so I need some more discussion to reassemble
my thoughts and the puzzle :)

But now that I'm starting to understand why it's a good thing to
extend the critical section, I realize it should be extended
even more: we also have calls to seekdir() and rewindir() in
v9fs_readdir() and friends that could _theoretically_ cause the
very same kind of undefined behavior you're mentioning.

I think the change is important enough that it deserves its
own patch. I expect that moving the locking to the top-level
handler might also simplify the existing code, and thus your
series as well.

Please let me come up with a patch first.

> Best regards,
> Christian Schoenebeck
> 
> 

Cheers,

--
Greg
Christian Schoenebeck June 30, 2020, 6 p.m. UTC | #5
On Dienstag, 30. Juni 2020 18:39:57 CEST Greg Kurz wrote:
> On Tue, 30 Jun 2020 17:16:40 +0200
> 
> Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:
> > On Montag, 29. Juni 2020 18:39:02 CEST Greg Kurz wrote:
> > > On Wed, 03 Jun 2020 19:16:08 +0200
> > > 
> > > Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:
> > > > On Sonntag, 19. April 2020 17:06:17 CEST Christian Schoenebeck wrote:
> > > > > Make top half really top half and bottom half really bottom half:
> > > > > 
> > > > > Each T_readdir request handling is hopping between threads (main
> > > > > I/O thread and background I/O driver threads) several times for
> > > > > every individual directory entry, which sums up to huge latencies
> > > > > for handling just a single T_readdir request.
> > > > > 
> > > > > Instead of doing that, collect now all required directory entries
> > > > > (including all potentially required stat buffers for each entry) in
> > > > > one rush on a background I/O thread from fs driver by calling the
> > > > > previously added function v9fs_co_readdir_many() instead of
> > > > > v9fs_co_readdir(), then assemble the entire resulting network
> > > > > response message for the readdir request on main I/O thread. The
> > > > > fs driver is still aborting the directory entry retrieval loop
> > > > > (on the background I/O thread inside of v9fs_co_readdir_many())
> > > > > as soon as it would exceed the client's requested maximum R_readdir
> > > > > response size. So this will not introduce a performance penalty on
> > > > > another end.
> > > > > 
> > > > > Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> > > > > ---
> > > > > 
> > > > >  hw/9pfs/9p.c | 122
> > > > >  +++++++++++++++++++++++----------------------------
> > > > >  1 file changed, 55 insertions(+), 67 deletions(-)
> > > > 
> > > > Ping. Anybody?
> > > > 
> > > > I would like to roll out this patch set soon and this is the only
> > > > patch in
> > > > this series missing a review yet.
> > > 
> > > Hi Christian,
> > 
> > Hi Greg,
> > 
> > > Sorry for getting back to this only now :-\
> > > 
> > > So I still have some concerns about the locking of the directory stream
> > > pointer a fid. It was initially introduced to avoid concurrent accesses
> > > by multiple threads to the corresponding internal glibc object, as
> > > recommended in the readdir(3) manual page. Now, this patch considerably
> > > extends the critical section to also contain calls to telldir() and all
> > > the _many_ readdir()... so I'm not sure exactly what's the purpose of
> > > that mutex right now. Please provide more details.
> > 
> > The intention of this lock is to prevent concurrent r/w (i.e. telldir()/
> > readdir()) of the dir stream position by two or more active readdir
> > requests handled in parallel, provided that they would use the same FID.
> > Due to the latter requirement for this to become relevant, we already
> > agreed that this is not something that would usually happen with a Linux
> > 9p client, but there is nothing from protocol point of view that would
> > prohibit this to be done by a client, so the resulting undefined
> > behaviour should be prevented, which this lock does.
> 
> Makes sense. The dir stream position is no different from glibc's internal
> dirent in that respect: it shouldn't be used by concurrent threads.

Exactly, it is a conceptual issue per se in general, independent of whether 
glibc is used and independent of the fs driver.

> > What's your precise concern?
> 
> My overall concern is still the same. The patches are big and I've
> been away for too long, so I need some more discussion to reassemble
> my thoughts and the puzzle :)

Np, if you have questions, just come along.

> But now that I'm starting to understand why it's a good thing to
> extend the critical section, I realize it should be extended
> even more: we also have calls to seekdir() and rewindir() in
> v9fs_readdir() and friends that could _theoretically_ cause the
> very same kind of undefined behavior you're mentioning.

You are talking about the "old" code now. Yes, it is wrong and I did not 
bother to fix the "old" implementation, since this optimized readdir 
implementation fixed this issue as well in one rush, but also note ...

> I think the change is important enough that it deserves its
> own patch. I expect that moving the locking to the top-level
> handler might also simplify the existing code, and thus your
> series as well.

... please reconsider if you really want to open that box. The "old" readdir 
implementation on main thread is inherently wrong, and handling the lock there 
and error pathes is extremely cumbersome and error prone.

Also: it does not make sense to move locking on this series from fs driver 
back to main I/O thread. Atomicity must retain on driver side, not on top 
half.

Best regards,
Christian Schoenebeck
Greg Kurz July 1, 2020, 10:09 a.m. UTC | #6
On Tue, 30 Jun 2020 20:00:08 +0200
Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:

> On Dienstag, 30. Juni 2020 18:39:57 CEST Greg Kurz wrote:
> > On Tue, 30 Jun 2020 17:16:40 +0200
> > 
> > Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:
> > > On Montag, 29. Juni 2020 18:39:02 CEST Greg Kurz wrote:
> > > > On Wed, 03 Jun 2020 19:16:08 +0200
> > > > 
> > > > Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:
> > > > > On Sonntag, 19. April 2020 17:06:17 CEST Christian Schoenebeck wrote:
> > > > > > Make top half really top half and bottom half really bottom half:
> > > > > > 
> > > > > > Each T_readdir request handling is hopping between threads (main
> > > > > > I/O thread and background I/O driver threads) several times for
> > > > > > every individual directory entry, which sums up to huge latencies
> > > > > > for handling just a single T_readdir request.
> > > > > > 
> > > > > > Instead of doing that, collect now all required directory entries
> > > > > > (including all potentially required stat buffers for each entry) in
> > > > > > one rush on a background I/O thread from fs driver by calling the
> > > > > > previously added function v9fs_co_readdir_many() instead of
> > > > > > v9fs_co_readdir(), then assemble the entire resulting network
> > > > > > response message for the readdir request on main I/O thread. The
> > > > > > fs driver is still aborting the directory entry retrieval loop
> > > > > > (on the background I/O thread inside of v9fs_co_readdir_many())
> > > > > > as soon as it would exceed the client's requested maximum R_readdir
> > > > > > response size. So this will not introduce a performance penalty on
> > > > > > another end.
> > > > > > 
> > > > > > Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> > > > > > ---
> > > > > > 
> > > > > >  hw/9pfs/9p.c | 122
> > > > > >  +++++++++++++++++++++++----------------------------
> > > > > >  1 file changed, 55 insertions(+), 67 deletions(-)
> > > > > 
> > > > > Ping. Anybody?
> > > > > 
> > > > > I would like to roll out this patch set soon and this is the only
> > > > > patch in
> > > > > this series missing a review yet.
> > > > 
> > > > Hi Christian,
> > > 
> > > Hi Greg,
> > > 
> > > > Sorry for getting back to this only now :-\
> > > > 
> > > > So I still have some concerns about the locking of the directory stream
> > > > pointer a fid. It was initially introduced to avoid concurrent accesses
> > > > by multiple threads to the corresponding internal glibc object, as
> > > > recommended in the readdir(3) manual page. Now, this patch considerably
> > > > extends the critical section to also contain calls to telldir() and all
> > > > the _many_ readdir()... so I'm not sure exactly what's the purpose of
> > > > that mutex right now. Please provide more details.
> > > 
> > > The intention of this lock is to prevent concurrent r/w (i.e. telldir()/
> > > readdir()) of the dir stream position by two or more active readdir
> > > requests handled in parallel, provided that they would use the same FID.
> > > Due to the latter requirement for this to become relevant, we already
> > > agreed that this is not something that would usually happen with a Linux
> > > 9p client, but there is nothing from protocol point of view that would
> > > prohibit this to be done by a client, so the resulting undefined
> > > behaviour should be prevented, which this lock does.
> > 
> > Makes sense. The dir stream position is no different from glibc's internal
> > dirent in that respect: it shouldn't be used by concurrent threads.
> 
> Exactly, it is a conceptual issue per se in general, independent of whether 
> glibc is used and independent of the fs driver.
> 
> > > What's your precise concern?
> > 
> > My overall concern is still the same. The patches are big and I've
> > been away for too long, so I need some more discussion to reassemble
> > my thoughts and the puzzle :)
> 
> Np, if you have questions, just come along.
> 
> > But now that I'm starting to understand why it's a good thing to
> > extend the critical section, I realize it should be extended
> > even more: we also have calls to seekdir() and rewindir() in
> > v9fs_readdir() and friends that could _theoretically_ cause the
> > very same kind of undefined behavior you're mentioning.
> 
> You are talking about the "old" code now. Yes, it is wrong and I did not 

No I'm talking about code that isn't changed by this series:

    if (initial_offset == 0) {
        v9fs_co_rewinddir(pdu, fidp);
    } else {
        v9fs_co_seekdir(pdu, fidp, initial_offset);
    }
    count = v9fs_do_readdir(pdu, fidp, max_count);

Leaving these outside the critical section seems to negate
your arguments... please clarify.

> bother to fix the "old" implementation, since this optimized readdir 
> implementation fixed this issue as well in one rush, but also note ...
> 
> > I think the change is important enough that it deserves its
> > own patch. I expect that moving the locking to the top-level
> > handler might also simplify the existing code, and thus your
> > series as well.
> 
> ... please reconsider if you really want to open that box. The "old" readdir 
> implementation on main thread is inherently wrong, and handling the lock there 
> and error pathes is extremely cumbersome and error prone.
> 

There are indeed several issues with the current readdir:

1) potential inconsistency on concurrent Treaddir requests

2) excessive dispatching to worker threads

So we agreed that 1) should never happen with a regular linux
client (we could even dump the lock actually) but we want to
make it clear in the code anyway that actions on the directory
stream are serialized.

2) basically means moving some logic from the frontend to the
backend, ie. called from v9fs_co_run_in_worker(). This implies
that a very long request (ie. one that would span on many calls
to readdir()) cannot be interrupted by the client anymore, as
opposed to what we have now BTW.

I tend to think that addressing both issues in one "rush" is
as much "error prone".

> Also: it does not make sense to move locking on this series from fs driver 
> back to main I/O thread. Atomicity must retain on driver side, not on top 
> half.
> 

Then the whole seekdir/rewinddir/telldir/readdir sequence should
be called with a single v9fs_co_run_in_worker() invocation, in
which case the locking could just be something like:

int coroutine_fn v9fs_co_readdir_many(V9fsPDU *pdu, V9fsFidState *fidp,
                                      struct V9fsDirEnt **entries,
                                      int32_t maxsize, bool dostat)
{
    int err = 0;

    if (v9fs_request_cancelled(pdu)) {
        return -EINTR;
    }

    v9fs_readdir_lock(&fidp->fs.dir);

    v9fs_co_run_in_worker({
        err = do_readdir_many(pdu, fidp, entries, maxsize, dostat);
    });

    v9fs_readdir_unlock(&fidp->fs.dir);
    return err;
}

This would technically leave the locking in the main I/O thread,
but it isn't conceptually different from locking at the beginning
of do_readdir_lock() and unlocking before returning. My concern
is that I don't know how coroutine mutexes behave with multiple
worker threads...

> Best regards,
> Christian Schoenebeck
> 
> 

Cheers,

--
Greg
Christian Schoenebeck July 1, 2020, 11:47 a.m. UTC | #7
On Mittwoch, 1. Juli 2020 12:09:24 CEST Greg Kurz wrote:
> No I'm talking about code that isn't changed by this series:
> 
>     if (initial_offset == 0) {
>         v9fs_co_rewinddir(pdu, fidp);
>     } else {
>         v9fs_co_seekdir(pdu, fidp, initial_offset);
>     }
>     count = v9fs_do_readdir(pdu, fidp, max_count);
> 
> Leaving these outside the critical section seems to negate
> your arguments... please clarify.

Yeah, I admit I have neglected this issue, as concurrent readdir requests with 
same FID always was some kind of theoretical issue. But yes, you are right, 
that should be addressed by 

1. entirely removing the rewinddir/seekdir code here

2. passing initial_offset to v9fs_do_readdir(), which in turn would
   pass it to readdir_many()

3. readdir_many() would handle rewinddir/seekdir exclusively in its crticial
   section.

> There are indeed several issues with the current readdir:
> 
> 1) potential inconsistency on concurrent Treaddir requests
> 
> 2) excessive dispatching to worker threads
> 
> So we agreed that 1) should never happen with a regular linux
> client (we could even dump the lock actually) but we want to
> make it clear in the code anyway that actions on the directory
> stream are serialized.

My point is you are trying to fix a merely thereotical issue on code that's 
conceptually, inherently wrong and dead end code at first place. Top half code 
should always be designed to be as thin as possible. The old readdir code 
though is the complete opposite.

> 2) basically means moving some logic from the frontend to the
> backend, ie. called from v9fs_co_run_in_worker(). This implies
> that a very long request (ie. one that would span on many calls
> to readdir()) cannot be interrupted by the client anymore, as
> opposed to what we have now BTW.

3) Complexity of old readdir code is so much bigger compared to the new one
   such that probability of additional issues is also significantly higher.

> I tend to think that addressing both issues in one "rush" is
> as much "error prone".

No it's not. The optimized readdir implementation is quantifyable, 
significantly less complex than the old implementation (i.e. significantly 
smaller amount of branches and error pathes, determenistic clear separation of 
thread's task assignments which includes much smaller amount of thread hops). 
Less complexity and increased determinism consequently means reduced chance of 
misbehaviours. So that's not a subjective, but rather a quantifyable, proven 
statement.

You can't switch from the old (wrong) concept to the new concept without some 
minimum amount of changes, which yes are not small, but I don't see any way to 
make the change set smaller without yet introducing new negative side effects.

I am now using words that I heard from your side before: please be realistic. 
Currently man power on 9p code is extremely limited to put it mildly. Yes, we 
could spend time on fixing this (theoretical) issue on the old readdir could. 
But what would it buy? Due to the limited man power we can only move forward 
with compromises; in this case you are fighting for code that's DOA. The only 
thing that you achieve by still sticking to the old readdir code is you are 
preventing that we move forward at all. Remember: I originally sent this patch 
almost 7 months ago.

> > Also: it does not make sense to move locking on this series from fs driver
> > back to main I/O thread. Atomicity must retain on driver side, not on top
> > half.
> 
> Then the whole seekdir/rewinddir/telldir/readdir sequence should
> be called with a single v9fs_co_run_in_worker() invocation, in
> which case the locking could just be something like:
> 
> int coroutine_fn v9fs_co_readdir_many(V9fsPDU *pdu, V9fsFidState *fidp,
>                                       struct V9fsDirEnt **entries,
>                                       int32_t maxsize, bool dostat)
> {
>     int err = 0;
> 
>     if (v9fs_request_cancelled(pdu)) {
>         return -EINTR;
>     }
> 
>     v9fs_readdir_lock(&fidp->fs.dir);
> 
>     v9fs_co_run_in_worker({
>         err = do_readdir_many(pdu, fidp, entries, maxsize, dostat);
>     });
> 
>     v9fs_readdir_unlock(&fidp->fs.dir);
>     return err;
> }

That's exactly what should be prevented. The lock must be on driver thread 
side, not on main thread side. The goal is to reduce the time slice of 
individual locks. If the lock is on main thread, the time slice of a lock 
(even on huge directories with thousands of entries) is multiple factors 
larger than if the lock is on driver thread side. So that's not just few 
percent or so, it is huge.

> This would technically leave the locking in the main I/O thread,
> but it isn't conceptually different from locking at the beginning
> of do_readdir_lock() and unlocking before returning. My concern
> is that I don't know how coroutine mutexes behave with multiple
> worker threads...

Well, your Mutex -> CoMutex change was intended to fix a deadlock with the old 
readdir implementation. That deadlock could not happen with the new readdir 
implementation, which suggests that it probably makes sense to revert this 
change (i.e. CoMutex -> Mutex) for this new implementation.

Best regards,
Christian Schoenebeck
Greg Kurz July 1, 2020, 3:12 p.m. UTC | #8
On Wed, 01 Jul 2020 13:47:12 +0200
Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:

> On Mittwoch, 1. Juli 2020 12:09:24 CEST Greg Kurz wrote:
> > No I'm talking about code that isn't changed by this series:
> > 
> >     if (initial_offset == 0) {
> >         v9fs_co_rewinddir(pdu, fidp);
> >     } else {
> >         v9fs_co_seekdir(pdu, fidp, initial_offset);
> >     }
> >     count = v9fs_do_readdir(pdu, fidp, max_count);
> > 
> > Leaving these outside the critical section seems to negate
> > your arguments... please clarify.
> 
> Yeah, I admit I have neglected this issue, as concurrent readdir requests with 
> same FID always was some kind of theoretical issue. But yes, you are right, 

It's perfectly fine to miss things, that's what reviews are made for :)

> that should be addressed by 
> 
> 1. entirely removing the rewinddir/seekdir code here
> 
> 2. passing initial_offset to v9fs_do_readdir(), which in turn would
>    pass it to readdir_many()
> 
> 3. readdir_many() would handle rewinddir/seekdir exclusively in its crticial
>    section.
> 

Sounds good. v7 ?

> > There are indeed several issues with the current readdir:
> > 
> > 1) potential inconsistency on concurrent Treaddir requests
> > 
> > 2) excessive dispatching to worker threads
> > 
> > So we agreed that 1) should never happen with a regular linux
> > client (we could even dump the lock actually) but we want to
> > make it clear in the code anyway that actions on the directory
> > stream are serialized.
> 
> My point is you are trying to fix a merely thereotical issue on code that's 
> conceptually, inherently wrong and dead end code at first place. Top half code 

I'm not trying to fix anything. We already have locking in place to
deal with this theoretical issue and it interferes with your changes.
I don't care that much if a silly guest shoots itself in the foot
actually, so it'll be fine with me if you just dump the lock, as
long as it doesn't cause QEMU to hang or crash.

> should always be designed to be as thin as possible. The old readdir code 
> though is the complete opposite.
> 

It isn't readdir only, most requests span over multiple v9fs_co_*() calls...

> > 2) basically means moving some logic from the frontend to the
> > backend, ie. called from v9fs_co_run_in_worker(). This implies
> > that a very long request (ie. one that would span on many calls
> > to readdir()) cannot be interrupted by the client anymore, as
> > opposed to what we have now BTW.
> 

... most likely to allow clients to interrupt a long request with a
Tflush. Have you considered this aspect in your changes ?

> 3) Complexity of old readdir code is so much bigger compared to the new one
>    such that probability of additional issues is also significantly higher.
> 
> > I tend to think that addressing both issues in one "rush" is
> > as much "error prone".
> 
> No it's not. The optimized readdir implementation is quantifyable, 
> significantly less complex than the old implementation (i.e. significantly 
> smaller amount of branches and error pathes, determenistic clear separation of 
> thread's task assignments which includes much smaller amount of thread hops). 
> Less complexity and increased determinism consequently means reduced chance of 
> misbehaviours. So that's not a subjective, but rather a quantifyable, proven 
> statement.
> 
> You can't switch from the old (wrong) concept to the new concept without some 
> minimum amount of changes, which yes are not small, but I don't see any way to 
> make the change set smaller without yet introducing new negative side effects.
> 
> I am now using words that I heard from your side before: please be realistic. 
> Currently man power on 9p code is extremely limited to put it mildly. Yes, we 
> could spend time on fixing this (theoretical) issue on the old readdir could. 
> But what would it buy? Due to the limited man power we can only move forward 
> with compromises; in this case you are fighting for code that's DOA. The only 
> thing that you achieve by still sticking to the old readdir code is you are 
> preventing that we move forward at all. Remember: I originally sent this patch 
> almost 7 months ago.
> 
> > > Also: it does not make sense to move locking on this series from fs driver
> > > back to main I/O thread. Atomicity must retain on driver side, not on top
> > > half.
> > 
> > Then the whole seekdir/rewinddir/telldir/readdir sequence should
> > be called with a single v9fs_co_run_in_worker() invocation, in
> > which case the locking could just be something like:
> > 
> > int coroutine_fn v9fs_co_readdir_many(V9fsPDU *pdu, V9fsFidState *fidp,
> >                                       struct V9fsDirEnt **entries,
> >                                       int32_t maxsize, bool dostat)
> > {
> >     int err = 0;
> > 
> >     if (v9fs_request_cancelled(pdu)) {
> >         return -EINTR;
> >     }
> > 
> >     v9fs_readdir_lock(&fidp->fs.dir);
> > 
> >     v9fs_co_run_in_worker({
> >         err = do_readdir_many(pdu, fidp, entries, maxsize, dostat);
> >     });
> > 
> >     v9fs_readdir_unlock(&fidp->fs.dir);
> >     return err;
> > }
> 
> That's exactly what should be prevented. The lock must be on driver thread 
> side, not on main thread side. The goal is to reduce the time slice of 
> individual locks. If the lock is on main thread, the time slice of a lock 
> (even on huge directories with thousands of entries) is multiple factors 
> larger than if the lock is on driver thread side. So that's not just few 
> percent or so, it is huge.
> 

Sorry I don't get it...  In a contention-less situation, which is the
case we really care for, qemu_co_mutex_lock() has no overhead.

> > This would technically leave the locking in the main I/O thread,
> > but it isn't conceptually different from locking at the beginning
> > of do_readdir_lock() and unlocking before returning. My concern
> > is that I don't know how coroutine mutexes behave with multiple
> > worker threads...
> 
> Well, your Mutex -> CoMutex change was intended to fix a deadlock with the old 
> readdir implementation. That deadlock could not happen with the new readdir 
> implementation, which suggests that it probably makes sense to revert this 
> change (i.e. CoMutex -> Mutex) for this new implementation.
> 

No we can't because it is also used with 9p2000.u, that you said you
don't want to fix.

> Best regards,
> Christian Schoenebeck
> 
> 

Cheers,

--
Greg
Christian Schoenebeck July 2, 2020, 11:43 a.m. UTC | #9
On Mittwoch, 1. Juli 2020 17:12:40 CEST Greg Kurz wrote:
> On Wed, 01 Jul 2020 13:47:12 +0200
> 
> Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:
> > On Mittwoch, 1. Juli 2020 12:09:24 CEST Greg Kurz wrote:
> > > No I'm talking about code that isn't changed by this series:
> > >     if (initial_offset == 0) {
> > >     
> > >         v9fs_co_rewinddir(pdu, fidp);
> > >     
> > >     } else {
> > >     
> > >         v9fs_co_seekdir(pdu, fidp, initial_offset);
> > >     
> > >     }
> > >     count = v9fs_do_readdir(pdu, fidp, max_count);
> > > 
> > > Leaving these outside the critical section seems to negate
> > > your arguments... please clarify.
> > 
> > Yeah, I admit I have neglected this issue, as concurrent readdir requests
> > with same FID always was some kind of theoretical issue. But yes, you are
> > right,
> It's perfectly fine to miss things, that's what reviews are made for :)
> 
> > that should be addressed by
> > 
> > 1. entirely removing the rewinddir/seekdir code here
> > 
> > 2. passing initial_offset to v9fs_do_readdir(), which in turn would
> > 
> >    pass it to readdir_many()
> > 
> > 3. readdir_many() would handle rewinddir/seekdir exclusively in its
> > crticial> 
> >    section.
> 
> Sounds good. v7 ?

Sure, that's not the problem, I can repost this handled appropriately of 
course.

But would you please finalize your picture of this patch set first? I would 
really (try) to finally nail this thing with the next version.

> > > There are indeed several issues with the current readdir:
> > > 
> > > 1) potential inconsistency on concurrent Treaddir requests
> > > 
> > > 2) excessive dispatching to worker threads
> > > 
> > > So we agreed that 1) should never happen with a regular linux
> > > client (we could even dump the lock actually) but we want to
> > > make it clear in the code anyway that actions on the directory
> > > stream are serialized.
> > 
> > My point is you are trying to fix a merely thereotical issue on code
> > that's
> > conceptually, inherently wrong and dead end code at first place. Top half
> > code
> I'm not trying to fix anything. We already have locking in place to
> deal with this theoretical issue and it interferes with your changes.
> I don't care that much if a silly guest shoots itself in the foot
> actually, so it'll be fine with me if you just dump the lock, as
> long as it doesn't cause QEMU to hang or crash.

Ah ok, I got you as if you wanted to fix more details on the old readdir code, 
which would be a clear blocker for this patch set to proceed. Good then.

> > should always be designed to be as thin as possible. The old readdir code
> > though is the complete opposite.
> 
> It isn't readdir only, most requests span over multiple v9fs_co_*() calls...

Right, I know! And that's actually my root motivation to finally bring this 
patch set forward, since I am very aware that this patch set is just a small 
brick in the overall procecss of fixing similarly affected code portions. So 
yes, that's my plan to fix them with upcoming patch sets, too.

Having said that, could we probably try to make future reviews a bit more 
efficient at certain aspects? For instance it would help a lot if the patch 
set was reviewed entirely, and not stopped at the very first issue spotted and 
then suspended to ++version, as this creates large latencies in the overall 
review process?

And likewise it would also help if review is prioritized on relevant behaviour 
aspects (concept, design) first before arguing about code style details like 
variable names or other details invisible to users.

And finally: compromises. As man power on 9p is very limited, it would make 
sense to limit patch sets at a certain extent and agree to postpone certain 
non-critical issues to subsequent patches (or sets of), otherwise a patch set 
will grow and grow and it will take ages to get forward.

> > > 2) basically means moving some logic from the frontend to the
> > > backend, ie. called from v9fs_co_run_in_worker(). This implies
> > > that a very long request (ie. one that would span on many calls
> > > to readdir()) cannot be interrupted by the client anymore, as
> > > opposed to what we have now BTW.
> 
> ... most likely to allow clients to interrupt a long request with a
> Tflush. Have you considered this aspect in your changes ?

In this particular patch set, no, as for readdir this should not be an issue 
in practice for anybody. After this patch set is applied, even on huge 
directories, the fs driver's duration for readdir would barely be measurable. 
In fact the server's latency would always be much higher, yet.

But also for the upcoming, planned patch sets (i.e. other request types): That 
would be an example where I would ask you to lower you expectations a bit and 
that we could agree to postpone such an aspect to a subsequent, separate patch 
(or set of).

> > 3) Complexity of old readdir code is so much bigger compared to the new
> > one
> > 
> >    such that probability of additional issues is also significantly
> >    higher.
> > > 
> > > I tend to think that addressing both issues in one "rush" is
> > > as much "error prone".
> > 
> > No it's not. The optimized readdir implementation is quantifyable,
> > significantly less complex than the old implementation (i.e. significantly
> > smaller amount of branches and error pathes, determenistic clear
> > separation of thread's task assignments which includes much smaller
> > amount of thread hops). Less complexity and increased determinism
> > consequently means reduced chance of misbehaviours. So that's not a
> > subjective, but rather a quantifyable, proven statement.
> > 
> > You can't switch from the old (wrong) concept to the new concept without
> > some minimum amount of changes, which yes are not small, but I don't see
> > any way to make the change set smaller without yet introducing new
> > negative side effects.
> > 
> > I am now using words that I heard from your side before: please be
> > realistic. Currently man power on 9p code is extremely limited to put it
> > mildly. Yes, we could spend time on fixing this (theoretical) issue on
> > the old readdir could. But what would it buy? Due to the limited man
> > power we can only move forward with compromises; in this case you are
> > fighting for code that's DOA. The only thing that you achieve by still
> > sticking to the old readdir code is you are preventing that we move
> > forward at all. Remember: I originally sent this patch almost 7 months
> > ago.
> > 
> > > > Also: it does not make sense to move locking on this series from fs
> > > > driver
> > > > back to main I/O thread. Atomicity must retain on driver side, not on
> > > > top
> > > > half.
> > > 
> > > Then the whole seekdir/rewinddir/telldir/readdir sequence should
> > > be called with a single v9fs_co_run_in_worker() invocation, in
> > > which case the locking could just be something like:
> > > 
> > > int coroutine_fn v9fs_co_readdir_many(V9fsPDU *pdu, V9fsFidState *fidp,
> > > 
> > >                                       struct V9fsDirEnt **entries,
> > >                                       int32_t maxsize, bool dostat)
> > > 
> > > {
> > > 
> > >     int err = 0;
> > >     
> > >     if (v9fs_request_cancelled(pdu)) {
> > >     
> > >         return -EINTR;
> > >     
> > >     }
> > >     
> > >     v9fs_readdir_lock(&fidp->fs.dir);
> > >     
> > >     v9fs_co_run_in_worker({
> > >     
> > >         err = do_readdir_many(pdu, fidp, entries, maxsize, dostat);
> > >     
> > >     });
> > >     
> > >     v9fs_readdir_unlock(&fidp->fs.dir);
> > >     return err;
> > > 
> > > }
> > 
> > That's exactly what should be prevented. The lock must be on driver thread
> > side, not on main thread side. The goal is to reduce the time slice of
> > individual locks. If the lock is on main thread, the time slice of a lock
> > (even on huge directories with thousands of entries) is multiple factors
> > larger than if the lock is on driver thread side. So that's not just few
> > percent or so, it is huge.
> 
> Sorry I don't get it...  In a contention-less situation, which is the
> case we really care for, qemu_co_mutex_lock() has no overhead.

Yes, it only kicks in if there is concurrency.

> > > This would technically leave the locking in the main I/O thread,
> > > but it isn't conceptually different from locking at the beginning
> > > of do_readdir_lock() and unlocking before returning. My concern
> > > is that I don't know how coroutine mutexes behave with multiple
> > > worker threads...
> > 
> > Well, your Mutex -> CoMutex change was intended to fix a deadlock with the
> > old readdir implementation. That deadlock could not happen with the new
> > readdir implementation, which suggests that it probably makes sense to
> > revert this change (i.e. CoMutex -> Mutex) for this new implementation.
> 
> No we can't because it is also used with 9p2000.u, that you said you
> don't want to fix.

Yes and no, I did not say I don't want to fix readdir for 9p2000.u at all. 
What I said was I want to prioritize and concentrate on the most relevant 
issues first. 9p2000.L is the most commonly used protocol variant, so I would 
like to fix the most severe (e.g. performance) issues for 9p2000.L first 
before spending time on 9p2000.u which is AFAICS barely used in comparison, 
which I personally don't use at all, and which I am hence not testing in the 
same degree and cannot serve with the same quality as 9p2000.L right now.

Plus if there are really users caring for 9p2000.u, I can gladly assist them 
to address these issues for 9p2000.u as well, provided that they help at least 
with testing the required 9p2000.u changes.

Back to the actual topic: so what do we do about the mutex then? CoMutex for 
9p2000.u and Mutex for 9p2000.L? I know you find that ugly, but it would just 
be a transitional measure.

Best regards,
Christian Schoenebeck
Greg Kurz July 2, 2020, 3:35 p.m. UTC | #10
On Thu, 02 Jul 2020 13:43:11 +0200
Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:

> On Mittwoch, 1. Juli 2020 17:12:40 CEST Greg Kurz wrote:
> > On Wed, 01 Jul 2020 13:47:12 +0200
> > 
> > Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:
> > > On Mittwoch, 1. Juli 2020 12:09:24 CEST Greg Kurz wrote:
> > > > No I'm talking about code that isn't changed by this series:
> > > >     if (initial_offset == 0) {
> > > >     
> > > >         v9fs_co_rewinddir(pdu, fidp);
> > > >     
> > > >     } else {
> > > >     
> > > >         v9fs_co_seekdir(pdu, fidp, initial_offset);
> > > >     
> > > >     }
> > > >     count = v9fs_do_readdir(pdu, fidp, max_count);
> > > > 
> > > > Leaving these outside the critical section seems to negate
> > > > your arguments... please clarify.
> > > 
> > > Yeah, I admit I have neglected this issue, as concurrent readdir requests
> > > with same FID always was some kind of theoretical issue. But yes, you are
> > > right,
> > It's perfectly fine to miss things, that's what reviews are made for :)
> > 
> > > that should be addressed by
> > > 
> > > 1. entirely removing the rewinddir/seekdir code here
> > > 
> > > 2. passing initial_offset to v9fs_do_readdir(), which in turn would
> > > 
> > >    pass it to readdir_many()
> > > 
> > > 3. readdir_many() would handle rewinddir/seekdir exclusively in its
> > > crticial> 
> > >    section.
> > 
> > Sounds good. v7 ?
> 
> Sure, that's not the problem, I can repost this handled appropriately of 
> course.
> 
> But would you please finalize your picture of this patch set first? I would 
> really (try) to finally nail this thing with the next version.
> 

I'll do my best.

> > > > There are indeed several issues with the current readdir:
> > > > 
> > > > 1) potential inconsistency on concurrent Treaddir requests
> > > > 
> > > > 2) excessive dispatching to worker threads
> > > > 
> > > > So we agreed that 1) should never happen with a regular linux
> > > > client (we could even dump the lock actually) but we want to
> > > > make it clear in the code anyway that actions on the directory
> > > > stream are serialized.
> > > 
> > > My point is you are trying to fix a merely thereotical issue on code
> > > that's
> > > conceptually, inherently wrong and dead end code at first place. Top half
> > > code
> > I'm not trying to fix anything. We already have locking in place to
> > deal with this theoretical issue and it interferes with your changes.
> > I don't care that much if a silly guest shoots itself in the foot
> > actually, so it'll be fine with me if you just dump the lock, as
> > long as it doesn't cause QEMU to hang or crash.
> 
> Ah ok, I got you as if you wanted to fix more details on the old readdir code, 
> which would be a clear blocker for this patch set to proceed. Good then.
> 
> > > should always be designed to be as thin as possible. The old readdir code
> > > though is the complete opposite.
> > 
> > It isn't readdir only, most requests span over multiple v9fs_co_*() calls...
> 
> Right, I know! And that's actually my root motivation to finally bring this 
> patch set forward, since I am very aware that this patch set is just a small 
> brick in the overall procecss of fixing similarly affected code portions. So 
> yes, that's my plan to fix them with upcoming patch sets, too.
> 
> Having said that, could we probably try to make future reviews a bit more 
> efficient at certain aspects? For instance it would help a lot if the patch 
> set was reviewed entirely, and not stopped at the very first issue spotted and 
> then suspended to ++version, as this creates large latencies in the overall 
> review process?
> 

Review of 9pfs is a best effort thing... I usually stop reviewing when I'm fed
up or when all the time I could reasonably invest is elapsed. Breaking down
the series in smaller patches is the usual improvement for that.

> And likewise it would also help if review is prioritized on relevant behaviour 
> aspects (concept, design) first before arguing about code style details like 
> variable names or other details invisible to users.
> 

I don't remember questioning the overall concept behind these changes
because it looks reasonable enough (even if I would appreciate to be
able to verify them with a working reproducer).

Even if it is invisible to the users, coding style or how a series is
broken down in patches is important for developers.

> And finally: compromises. As man power on 9p is very limited, it would make 
> sense to limit patch sets at a certain extent and agree to postpone certain 
> non-critical issues to subsequent patches (or sets of), otherwise a patch set 
> will grow and grow and it will take ages to get forward.
> 

FWIW this patchset was more than 10 patches and now it is only 5 :)

The good news is that you'll soon be able to merge things by yourself.
I'll help you when I can but I won't be the gating factor anymore.

Hurrah !

> > > > 2) basically means moving some logic from the frontend to the
> > > > backend, ie. called from v9fs_co_run_in_worker(). This implies
> > > > that a very long request (ie. one that would span on many calls
> > > > to readdir()) cannot be interrupted by the client anymore, as
> > > > opposed to what we have now BTW.
> > 
> > ... most likely to allow clients to interrupt a long request with a
> > Tflush. Have you considered this aspect in your changes ?
> 
> In this particular patch set, no, as for readdir this should not be an issue 
> in practice for anybody. After this patch set is applied, even on huge 
> directories, the fs driver's duration for readdir would barely be measurable. 
> In fact the server's latency would always be much higher, yet.
> 

Reproducer ? Numbers ? ;)

> But also for the upcoming, planned patch sets (i.e. other request types): That 
> would be an example where I would ask you to lower you expectations a bit and 
> that we could agree to postpone such an aspect to a subsequent, separate patch 
> (or set of).
> 
> > > 3) Complexity of old readdir code is so much bigger compared to the new
> > > one
> > > 
> > >    such that probability of additional issues is also significantly
> > >    higher.
> > > > 
> > > > I tend to think that addressing both issues in one "rush" is
> > > > as much "error prone".
> > > 
> > > No it's not. The optimized readdir implementation is quantifyable,
> > > significantly less complex than the old implementation (i.e. significantly
> > > smaller amount of branches and error pathes, determenistic clear
> > > separation of thread's task assignments which includes much smaller
> > > amount of thread hops). Less complexity and increased determinism
> > > consequently means reduced chance of misbehaviours. So that's not a
> > > subjective, but rather a quantifyable, proven statement.
> > > 
> > > You can't switch from the old (wrong) concept to the new concept without
> > > some minimum amount of changes, which yes are not small, but I don't see
> > > any way to make the change set smaller without yet introducing new
> > > negative side effects.
> > > 
> > > I am now using words that I heard from your side before: please be
> > > realistic. Currently man power on 9p code is extremely limited to put it
> > > mildly. Yes, we could spend time on fixing this (theoretical) issue on
> > > the old readdir could. But what would it buy? Due to the limited man
> > > power we can only move forward with compromises; in this case you are
> > > fighting for code that's DOA. The only thing that you achieve by still
> > > sticking to the old readdir code is you are preventing that we move
> > > forward at all. Remember: I originally sent this patch almost 7 months
> > > ago.
> > > 
> > > > > Also: it does not make sense to move locking on this series from fs
> > > > > driver
> > > > > back to main I/O thread. Atomicity must retain on driver side, not on
> > > > > top
> > > > > half.
> > > > 
> > > > Then the whole seekdir/rewinddir/telldir/readdir sequence should
> > > > be called with a single v9fs_co_run_in_worker() invocation, in
> > > > which case the locking could just be something like:
> > > > 
> > > > int coroutine_fn v9fs_co_readdir_many(V9fsPDU *pdu, V9fsFidState *fidp,
> > > > 
> > > >                                       struct V9fsDirEnt **entries,
> > > >                                       int32_t maxsize, bool dostat)
> > > > 
> > > > {
> > > > 
> > > >     int err = 0;
> > > >     
> > > >     if (v9fs_request_cancelled(pdu)) {
> > > >     
> > > >         return -EINTR;
> > > >     
> > > >     }
> > > >     
> > > >     v9fs_readdir_lock(&fidp->fs.dir);
> > > >     
> > > >     v9fs_co_run_in_worker({
> > > >     
> > > >         err = do_readdir_many(pdu, fidp, entries, maxsize, dostat);
> > > >     
> > > >     });
> > > >     
> > > >     v9fs_readdir_unlock(&fidp->fs.dir);
> > > >     return err;
> > > > 
> > > > }
> > > 
> > > That's exactly what should be prevented. The lock must be on driver thread
> > > side, not on main thread side. The goal is to reduce the time slice of
> > > individual locks. If the lock is on main thread, the time slice of a lock
> > > (even on huge directories with thousands of entries) is multiple factors
> > > larger than if the lock is on driver thread side. So that's not just few
> > > percent or so, it is huge.
> > 
> > Sorry I don't get it...  In a contention-less situation, which is the
> > case we really care for, qemu_co_mutex_lock() has no overhead.
> 
> Yes, it only kicks in if there is concurrency.
> 

And we don't care to preserve performance for silly clients, do we ?

> > > > This would technically leave the locking in the main I/O thread,
> > > > but it isn't conceptually different from locking at the beginning
> > > > of do_readdir_lock() and unlocking before returning. My concern
> > > > is that I don't know how coroutine mutexes behave with multiple
> > > > worker threads...
> > > 
> > > Well, your Mutex -> CoMutex change was intended to fix a deadlock with the
> > > old readdir implementation. That deadlock could not happen with the new
> > > readdir implementation, which suggests that it probably makes sense to
> > > revert this change (i.e. CoMutex -> Mutex) for this new implementation.
> > 
> > No we can't because it is also used with 9p2000.u, that you said you
> > don't want to fix.
> 
> Yes and no, I did not say I don't want to fix readdir for 9p2000.u at all. 
> What I said was I want to prioritize and concentrate on the most relevant 
> issues first. 9p2000.L is the most commonly used protocol variant, so I would 
> like to fix the most severe (e.g. performance) issues for 9p2000.L first 
> before spending time on 9p2000.u which is AFAICS barely used in comparison, 
> which I personally don't use at all, and which I am hence not testing in the 
> same degree and cannot serve with the same quality as 9p2000.L right now.
> 
> Plus if there are really users caring for 9p2000.u, I can gladly assist them 
> to address these issues for 9p2000.u as well, provided that they help at least 
> with testing the required 9p2000.u changes.
> 

I would personally do the opposite... again ;)

Like you say we essentially care for 9p2000.L and we only do limited
support for 9p2000.u. If we have a change that we really want for
9p2000.L but it has an impact on 9p2000.u because of shared code,
it is fine to do the changes anyway, including changes to the 9p2000.u
specific code. Very basic testing of 9p2000.u (mount/ls or find/umount)
or maybe running pjd-fstest is enough IMHO. If we break someone's setup
and he complains, then we can ask him to test a fix before we merge it.

> Back to the actual topic: so what do we do about the mutex then? CoMutex for 
> 9p2000.u and Mutex for 9p2000.L? I know you find that ugly, but it would just 
> be a transitional measure.
> 

That would ruin my day...

More seriously, the recent fix for a deadlock condition that was present
for years proves that nobody seems to be using silly clients with QEMU.
So I think we should just dump the lock and add a one-time warning in
the top level handlers when we detect a duplicate readdir request on
the same fid. This should be a very simple patch (I can take care of
it right away).

> Best regards,
> Christian Schoenebeck
> 
> 

Cheers,

--
Greg
Christian Schoenebeck July 2, 2020, 5:23 p.m. UTC | #11
On Donnerstag, 2. Juli 2020 17:35:00 CEST Greg Kurz wrote:
> > > It isn't readdir only, most requests span over multiple v9fs_co_*()
> > > calls...> 
> > Right, I know! And that's actually my root motivation to finally bring
> > this
> > patch set forward, since I am very aware that this patch set is just a
> > small brick in the overall procecss of fixing similarly affected code
> > portions. So yes, that's my plan to fix them with upcoming patch sets,
> > too.
> > 
> > Having said that, could we probably try to make future reviews a bit more
> > efficient at certain aspects? For instance it would help a lot if the
> > patch
> > set was reviewed entirely, and not stopped at the very first issue spotted
> > and then suspended to ++version, as this creates large latencies in the
> > overall review process?
> 
> Review of 9pfs is a best effort thing... I usually stop reviewing when I'm
> fed up or when all the time I could reasonably invest is elapsed. Breaking
> down the series in smaller patches is the usual improvement for that.

No need to defend, you do these things voluntarily. I am glad that you spend 
time on this project at all. I would appreciate though if we could reduce the 
overall time for a patch set a bit, and my suggestion trying to walk through 
an entire set before re-posting a new version might indeed bring an 
improvement without anybody having to invest more time, but rather less.

> > And likewise it would also help if review is prioritized on relevant
> > behaviour aspects (concept, design) first before arguing about code style
> > details like variable names or other details invisible to users.
> 
> I don't remember questioning the overall concept behind these changes
> because it looks reasonable enough (even if I would appreciate to be
> able to verify them with a working reproducer).

What exactly do you mean here with working reproducer?

> Even if it is invisible to the users, coding style or how a series is
> broken down in patches is important for developers.

Dedication for detail beyond a certain fine graded level is luxury, and for 
that reason it is designated for projects which clearly can afford it. On 
(sub)projects with low activity and high latency (like this one) it easily 
leads to frustration, which is obviously contra productive.

I don't say don't care about code style, details et al., but caring a bit less 
on reviews would not hurt, so to say. A bit of (invisible) tolerance avoids 
friction and stagnancy.

> > And finally: compromises. As man power on 9p is very limited, it would
> > make
> > sense to limit patch sets at a certain extent and agree to postpone
> > certain
> > non-critical issues to subsequent patches (or sets of), otherwise a patch
> > set will grow and grow and it will take ages to get forward.
> 
> FWIW this patchset was more than 10 patches and now it is only 5 :)

Hey, no misleading number crunching please. ;-) Three dropped patches were 
never intended to be merged, they were pure benchmarks.

> The good news is that you'll soon be able to merge things by yourself.
> I'll help you when I can but I won't be the gating factor anymore.
> 
> Hurrah !

Yep, I'll do my best to pursue your work, I appreciate what you did and that 
you still volunteer to help!

> > > > > 2) basically means moving some logic from the frontend to the
> > > > > backend, ie. called from v9fs_co_run_in_worker(). This implies
> > > > > that a very long request (ie. one that would span on many calls
> > > > > to readdir()) cannot be interrupted by the client anymore, as
> > > > > opposed to what we have now BTW.
> > > 
> > > ... most likely to allow clients to interrupt a long request with a
> > > Tflush. Have you considered this aspect in your changes ?
> > 
> > In this particular patch set, no, as for readdir this should not be an
> > issue in practice for anybody. After this patch set is applied, even on
> > huge directories, the fs driver's duration for readdir would barely be
> > measurable. In fact the server's latency would always be much higher,
> > yet.
> 
> Reproducer ? Numbers ? ;)

Well, you have already seen and run the benchmark in this series yourself. You 
can easily hit several hundred ms server latency, which is very hard to top 
with any fs driver reading a directory. Ok, maybe with a tape drive you could, 
but that's pretty much it. :) With a 'normal' system the fs driver would 
rather block for something between <1ms .. 8ms, i.e. fs driver completes 
before it would be able to actually see a flush request.

But to make it also clear: it would not bite with the new design either. If 
someone really would see a necessity to abort readdir requests, that would be 
possible without huge changes.

> > > > > Then the whole seekdir/rewinddir/telldir/readdir sequence should
> > > > > be called with a single v9fs_co_run_in_worker() invocation, in
> > > > > which case the locking could just be something like:
> > > > > 
> > > > > int coroutine_fn v9fs_co_readdir_many(V9fsPDU *pdu, V9fsFidState
> > > > > *fidp,
> > > > > 
> > > > >                                       struct V9fsDirEnt **entries,
> > > > >                                       int32_t maxsize, bool dostat)
> > > > > 
> > > > > {
> > > > > 
> > > > >     int err = 0;
> > > > >     
> > > > >     if (v9fs_request_cancelled(pdu)) {
> > > > >     
> > > > >         return -EINTR;
> > > > >     
> > > > >     }
> > > > >     
> > > > >     v9fs_readdir_lock(&fidp->fs.dir);
> > > > >     
> > > > >     v9fs_co_run_in_worker({
> > > > >     
> > > > >         err = do_readdir_many(pdu, fidp, entries, maxsize, dostat);
> > > > >     
> > > > >     });
> > > > >     
> > > > >     v9fs_readdir_unlock(&fidp->fs.dir);
> > > > >     return err;
> > > > > 
> > > > > }
> > > > 
> > > > That's exactly what should be prevented. The lock must be on driver
> > > > thread
> > > > side, not on main thread side. The goal is to reduce the time slice of
> > > > individual locks. If the lock is on main thread, the time slice of a
> > > > lock
> > > > (even on huge directories with thousands of entries) is multiple
> > > > factors
> > > > larger than if the lock is on driver thread side. So that's not just
> > > > few
> > > > percent or so, it is huge.
> > > 
> > > Sorry I don't get it...  In a contention-less situation, which is the
> > > case we really care for, qemu_co_mutex_lock() has no overhead.
> > 
> > Yes, it only kicks in if there is concurrency.
> 
> And we don't care to preserve performance for silly clients, do we ?

We don't necessarily need to preserve performance for them, that's right. But 
it is already handled appropriately for them, so destroying it (i.e. slowing 
them down) only makes sense if there is a good reason for that.

> > > > > This would technically leave the locking in the main I/O thread,
> > > > > but it isn't conceptually different from locking at the beginning
> > > > > of do_readdir_lock() and unlocking before returning. My concern
> > > > > is that I don't know how coroutine mutexes behave with multiple
> > > > > worker threads...
> > > > 
> > > > Well, your Mutex -> CoMutex change was intended to fix a deadlock with
> > > > the
> > > > old readdir implementation. That deadlock could not happen with the
> > > > new
> > > > readdir implementation, which suggests that it probably makes sense to
> > > > revert this change (i.e. CoMutex -> Mutex) for this new
> > > > implementation.
> > > 
> > > No we can't because it is also used with 9p2000.u, that you said you
> > > don't want to fix.
> > 
> > Yes and no, I did not say I don't want to fix readdir for 9p2000.u at all.
> > What I said was I want to prioritize and concentrate on the most relevant
> > issues first. 9p2000.L is the most commonly used protocol variant, so I
> > would like to fix the most severe (e.g. performance) issues for 9p2000.L
> > first before spending time on 9p2000.u which is AFAICS barely used in
> > comparison, which I personally don't use at all, and which I am hence not
> > testing in the same degree and cannot serve with the same quality as
> > 9p2000.L right now.
> > 
> > Plus if there are really users caring for 9p2000.u, I can gladly assist
> > them to address these issues for 9p2000.u as well, provided that they
> > help at least with testing the required 9p2000.u changes.
> 
> I would personally do the opposite... again ;)
> 
> Like you say we essentially care for 9p2000.L and we only do limited
> support for 9p2000.u. If we have a change that we really want for
> 9p2000.L but it has an impact on 9p2000.u because of shared code,
> it is fine to do the changes anyway, including changes to the 9p2000.u
> specific code. Very basic testing of 9p2000.u (mount/ls or find/umount)
> or maybe running pjd-fstest is enough IMHO. If we break someone's setup
> and he complains, then we can ask him to test a fix before we merge it.

Well, so you want to handle the relevant 9p2000.u readdir optimizations by 
yourself, and you would send it as follow-up patch set (after this set is 
through), including test cases?

> > Back to the actual topic: so what do we do about the mutex then? CoMutex
> > for 9p2000.u and Mutex for 9p2000.L? I know you find that ugly, but it
> > would just be a transitional measure.
> 
> That would ruin my day...
> 
> More seriously, the recent fix for a deadlock condition that was present
> for years proves that nobody seems to be using silly clients with QEMU.
> So I think we should just dump the lock and add a one-time warning in
> the top level handlers when we detect a duplicate readdir request on
> the same fid. This should be a very simple patch (I can take care of
> it right away).

Looks like we have a consensus! Then I wait for your patch removing the lock, 
and for your feedback whether or not you see anything else in this patch set?

> > Best regards,
> > Christian Schoenebeck
> 
> Cheers,
> 
> --
> Greg

Best regards,
Christian Schoenebeck
Christian Schoenebeck July 3, 2020, 8:08 a.m. UTC | #12
On Donnerstag, 2. Juli 2020 19:23:35 CEST Christian Schoenebeck wrote:
> > > Back to the actual topic: so what do we do about the mutex then? CoMutex
> > > for 9p2000.u and Mutex for 9p2000.L? I know you find that ugly, but it
> > > would just be a transitional measure.
> > 
> > That would ruin my day...
> > 
> > More seriously, the recent fix for a deadlock condition that was present
> > for years proves that nobody seems to be using silly clients with QEMU.
> > So I think we should just dump the lock and add a one-time warning in
> > the top level handlers when we detect a duplicate readdir request on
> > the same fid. This should be a very simple patch (I can take care of
> > it right away).
> 
> Looks like we have a consensus! Then I wait for your patch removing the
> lock, and for your feedback whether or not you see anything else in this
> patch set?

Please wait before you work on this patch. I really do think your patch should 
be based on/after this optimization patch for one reason: if (and even though 
it's a big if) somebody comes along with a silly client as you named it, then 
your patch can simply be reverted, which would not be possible if it's next.

So I would really suggest I change this patch here to go the ugly way with 2 
mutex types for readdir 9p2000.L vs 9p2000.L, and your patch would get rid of 
that mess by removing the lock entirely, okay?

Best regards,
Christian Schoenebeck
Greg Kurz July 3, 2020, 3:53 p.m. UTC | #13
On Thu, 02 Jul 2020 19:23:35 +0200
Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:

> On Donnerstag, 2. Juli 2020 17:35:00 CEST Greg Kurz wrote:
> > > > It isn't readdir only, most requests span over multiple v9fs_co_*()
> > > > calls...> 
> > > Right, I know! And that's actually my root motivation to finally bring
> > > this
> > > patch set forward, since I am very aware that this patch set is just a
> > > small brick in the overall procecss of fixing similarly affected code
> > > portions. So yes, that's my plan to fix them with upcoming patch sets,
> > > too.
> > > 
> > > Having said that, could we probably try to make future reviews a bit more
> > > efficient at certain aspects? For instance it would help a lot if the
> > > patch
> > > set was reviewed entirely, and not stopped at the very first issue spotted
> > > and then suspended to ++version, as this creates large latencies in the
> > > overall review process?
> > 
> > Review of 9pfs is a best effort thing... I usually stop reviewing when I'm
> > fed up or when all the time I could reasonably invest is elapsed. Breaking
> > down the series in smaller patches is the usual improvement for that.
> 
> No need to defend, you do these things voluntarily. I am glad that you spend 
> time on this project at all. I would appreciate though if we could reduce the 
> overall time for a patch set a bit, and my suggestion trying to walk through 
> an entire set before re-posting a new version might indeed bring an 
> improvement without anybody having to invest more time, but rather less.
> 

I'll try to adjust but I think you should also try to split patches (like
you did eventually with the addition of patch 3).

> > > And likewise it would also help if review is prioritized on relevant
> > > behaviour aspects (concept, design) first before arguing about code style
> > > details like variable names or other details invisible to users.
> > 
> > I don't remember questioning the overall concept behind these changes
> > because it looks reasonable enough (even if I would appreciate to be
> > able to verify them with a working reproducer).
> 
> What exactly do you mean here with working reproducer?
> 

Some test program to be run in a real guest that gets improved performance
with this patch set for example.

> > Even if it is invisible to the users, coding style or how a series is
> > broken down in patches is important for developers.
> 
> Dedication for detail beyond a certain fine graded level is luxury, and for 
> that reason it is designated for projects which clearly can afford it. On 
> (sub)projects with low activity and high latency (like this one) it easily 
> leads to frustration, which is obviously contra productive.
> 

The only way to avoid frustration when things don't go to your pace
is to take control, as you decided to do :)

> I don't say don't care about code style, details et al., but caring a bit less 
> on reviews would not hurt, so to say. A bit of (invisible) tolerance avoids 
> friction and stagnancy.
> 

No, this friction and stagnancy is essentially the result of not sharing
the same agenda and different tastes also.

> > > And finally: compromises. As man power on 9p is very limited, it would
> > > make
> > > sense to limit patch sets at a certain extent and agree to postpone
> > > certain
> > > non-critical issues to subsequent patches (or sets of), otherwise a patch
> > > set will grow and grow and it will take ages to get forward.
> > 
> > FWIW this patchset was more than 10 patches and now it is only 5 :)
> 
> Hey, no misleading number crunching please. ;-) Three dropped patches were 
> never intended to be merged, they were pure benchmarks.
> 

Heh :)

> > The good news is that you'll soon be able to merge things by yourself.
> > I'll help you when I can but I won't be the gating factor anymore.
> > 
> > Hurrah !
> 
> Yep, I'll do my best to pursue your work, I appreciate what you did and that 
> you still volunteer to help!
> 

And if I'm to picky on reviews, feel free to ignore my remarks ;)

> > > > > > 2) basically means moving some logic from the frontend to the
> > > > > > backend, ie. called from v9fs_co_run_in_worker(). This implies
> > > > > > that a very long request (ie. one that would span on many calls
> > > > > > to readdir()) cannot be interrupted by the client anymore, as
> > > > > > opposed to what we have now BTW.
> > > > 
> > > > ... most likely to allow clients to interrupt a long request with a
> > > > Tflush. Have you considered this aspect in your changes ?
> > > 
> > > In this particular patch set, no, as for readdir this should not be an
> > > issue in practice for anybody. After this patch set is applied, even on
> > > huge directories, the fs driver's duration for readdir would barely be
> > > measurable. In fact the server's latency would always be much higher,
> > > yet.
> > 
> > Reproducer ? Numbers ? ;)
> 
> Well, you have already seen and run the benchmark in this series yourself. You 
> can easily hit several hundred ms server latency, which is very hard to top 
> with any fs driver reading a directory. Ok, maybe with a tape drive you could, 
> but that's pretty much it. :) With a 'normal' system the fs driver would 

Or some network file system on an extremely slow connection...

> rather block for something between <1ms .. 8ms, i.e. fs driver completes 
> before it would be able to actually see a flush request.
> 
> But to make it also clear: it would not bite with the new design either. If 
> someone really would see a necessity to abort readdir requests, that would be 
> possible without huge changes.
> 

Tflush requests are handled in the top-half exclusively by design, ie.
they rely on the worker thread handling the targeted request to yield
control back to the main I/O thread. Since this doesn't happen anymore
with your patches, I'm not sure how you be able to abort a _many readdir_ 
request once it's been fired into a worker thread.

A possible solution could be to break down a Treaddir into multiple
_many readdirs_ jobs, and have a knob or some logic to control the
latency ratio.

> > > > > > Then the whole seekdir/rewinddir/telldir/readdir sequence should
> > > > > > be called with a single v9fs_co_run_in_worker() invocation, in
> > > > > > which case the locking could just be something like:
> > > > > > 
> > > > > > int coroutine_fn v9fs_co_readdir_many(V9fsPDU *pdu, V9fsFidState
> > > > > > *fidp,
> > > > > > 
> > > > > >                                       struct V9fsDirEnt **entries,
> > > > > >                                       int32_t maxsize, bool dostat)
> > > > > > 
> > > > > > {
> > > > > > 
> > > > > >     int err = 0;
> > > > > >     
> > > > > >     if (v9fs_request_cancelled(pdu)) {
> > > > > >     
> > > > > >         return -EINTR;
> > > > > >     
> > > > > >     }
> > > > > >     
> > > > > >     v9fs_readdir_lock(&fidp->fs.dir);
> > > > > >     
> > > > > >     v9fs_co_run_in_worker({
> > > > > >     
> > > > > >         err = do_readdir_many(pdu, fidp, entries, maxsize, dostat);
> > > > > >     
> > > > > >     });
> > > > > >     
> > > > > >     v9fs_readdir_unlock(&fidp->fs.dir);
> > > > > >     return err;
> > > > > > 
> > > > > > }
> > > > > 
> > > > > That's exactly what should be prevented. The lock must be on driver
> > > > > thread
> > > > > side, not on main thread side. The goal is to reduce the time slice of
> > > > > individual locks. If the lock is on main thread, the time slice of a
> > > > > lock
> > > > > (even on huge directories with thousands of entries) is multiple
> > > > > factors
> > > > > larger than if the lock is on driver thread side. So that's not just
> > > > > few
> > > > > percent or so, it is huge.
> > > > 
> > > > Sorry I don't get it...  In a contention-less situation, which is the
> > > > case we really care for, qemu_co_mutex_lock() has no overhead.
> > > 
> > > Yes, it only kicks in if there is concurrency.
> > 
> > And we don't care to preserve performance for silly clients, do we ?
> 
> We don't necessarily need to preserve performance for them, that's right. But 
> it is already handled appropriately for them, so destroying it (i.e. slowing 
> them down) only makes sense if there is a good reason for that.
> 

Ending up with a mix of two different kind of mutexes for 9p2000.L and .u is
a good enough reason for me.

> > > > > > This would technically leave the locking in the main I/O thread,
> > > > > > but it isn't conceptually different from locking at the beginning
> > > > > > of do_readdir_lock() and unlocking before returning. My concern
> > > > > > is that I don't know how coroutine mutexes behave with multiple
> > > > > > worker threads...
> > > > > 
> > > > > Well, your Mutex -> CoMutex change was intended to fix a deadlock with
> > > > > the
> > > > > old readdir implementation. That deadlock could not happen with the
> > > > > new
> > > > > readdir implementation, which suggests that it probably makes sense to
> > > > > revert this change (i.e. CoMutex -> Mutex) for this new
> > > > > implementation.
> > > > 
> > > > No we can't because it is also used with 9p2000.u, that you said you
> > > > don't want to fix.
> > > 
> > > Yes and no, I did not say I don't want to fix readdir for 9p2000.u at all.
> > > What I said was I want to prioritize and concentrate on the most relevant
> > > issues first. 9p2000.L is the most commonly used protocol variant, so I
> > > would like to fix the most severe (e.g. performance) issues for 9p2000.L
> > > first before spending time on 9p2000.u which is AFAICS barely used in
> > > comparison, which I personally don't use at all, and which I am hence not
> > > testing in the same degree and cannot serve with the same quality as
> > > 9p2000.L right now.
> > > 
> > > Plus if there are really users caring for 9p2000.u, I can gladly assist
> > > them to address these issues for 9p2000.u as well, provided that they
> > > help at least with testing the required 9p2000.u changes.
> > 
> > I would personally do the opposite... again ;)
> > 
> > Like you say we essentially care for 9p2000.L and we only do limited
> > support for 9p2000.u. If we have a change that we really want for
> > 9p2000.L but it has an impact on 9p2000.u because of shared code,
> > it is fine to do the changes anyway, including changes to the 9p2000.u
> > specific code. Very basic testing of 9p2000.u (mount/ls or find/umount)
> > or maybe running pjd-fstest is enough IMHO. If we break someone's setup
> > and he complains, then we can ask him to test a fix before we merge it.
> 
> Well, so you want to handle the relevant 9p2000.u readdir optimizations by 
> yourself, and you would send it as follow-up patch set (after this set is 
> through), including test cases?
> 

Ah it wasn't my point. I was just saying that any valuable change for
9p2000.L prevails and if you have to change some common code (eg.
locking) that could impact the 9p2000.u experience, you can do it
anyway, even if you only do smoke testing with 9p2000.u.

> > > Back to the actual topic: so what do we do about the mutex then? CoMutex
> > > for 9p2000.u and Mutex for 9p2000.L? I know you find that ugly, but it
> > > would just be a transitional measure.
> > 
> > That would ruin my day...
> > 
> > More seriously, the recent fix for a deadlock condition that was present
> > for years proves that nobody seems to be using silly clients with QEMU.
> > So I think we should just dump the lock and add a one-time warning in
> > the top level handlers when we detect a duplicate readdir request on
> > the same fid. This should be a very simple patch (I can take care of
> > it right away).
> 
> Looks like we have a consensus! Then I wait for your patch removing the lock, 
> and for your feedback whether or not you see anything else in this patch set?
> 

I was ready to do that but now I'm reading you other mail... I'll
continue the discussion there.

> > > Best regards,
> > > Christian Schoenebeck
> > 
> > Cheers,
> > 
> > --
> > Greg
> 
> Best regards,
> Christian Schoenebeck
> 
>
Greg Kurz July 3, 2020, 4:08 p.m. UTC | #14
On Fri, 03 Jul 2020 10:08:09 +0200
Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:

> On Donnerstag, 2. Juli 2020 19:23:35 CEST Christian Schoenebeck wrote:
> > > > Back to the actual topic: so what do we do about the mutex then? CoMutex
> > > > for 9p2000.u and Mutex for 9p2000.L? I know you find that ugly, but it
> > > > would just be a transitional measure.
> > > 
> > > That would ruin my day...
> > > 
> > > More seriously, the recent fix for a deadlock condition that was present
> > > for years proves that nobody seems to be using silly clients with QEMU.
> > > So I think we should just dump the lock and add a one-time warning in
> > > the top level handlers when we detect a duplicate readdir request on
> > > the same fid. This should be a very simple patch (I can take care of
> > > it right away).
> > 
> > Looks like we have a consensus! Then I wait for your patch removing the
> > lock, and for your feedback whether or not you see anything else in this
> > patch set?
> 
> Please wait before you work on this patch. I really do think your patch should 
> be based on/after this optimization patch for one reason: if (and even though 
> it's a big if) somebody comes along with a silly client as you named it, then 
> your patch can simply be reverted, which would not be possible if it's next.
> 
> So I would really suggest I change this patch here to go the ugly way with 2 
> mutex types for readdir 9p2000.L vs 9p2000.L, and your patch would get rid of 
> that mess by removing the lock entirely, okay?
> 

If someones ever comes along with a silly client, she will get warnings
explaining that she might get silly results. If it turns out that we
really need to support that for valid reasons, it will be okay to focus
on the appropriate fix when the time comes, not now. So I don't say any
real benefit to postponing the removal of the lock after your series, but
I do at least see one benefit, it will reduce the level of noise.

> Best regards,
> Christian Schoenebeck
> 
>
Christian Schoenebeck July 3, 2020, 6:12 p.m. UTC | #15
On Freitag, 3. Juli 2020 17:53:15 CEST Greg Kurz wrote:
> > > I don't remember questioning the overall concept behind these changes
> > > because it looks reasonable enough (even if I would appreciate to be
> > > able to verify them with a working reproducer).
> > 
> > What exactly do you mean here with working reproducer?
> 
> Some test program to be run in a real guest that gets improved performance
> with this patch set for example.

Mhm, for now the benchmark patches already provided should be proof enough 
that this optimization will bring significant improvements in the end.

Like mentioned on a side note; to actually feel the difference with a real 
Linux client, a kernel patch is also required for the guest. But I definitely 
don't start dealing with LKML before the ground is laid on QEMU side (i.e. 
this patch set is merged on master).

> > I don't say don't care about code style, details et al., but caring a bit
> > less on reviews would not hurt, so to say. A bit of (invisible) tolerance
> > avoids friction and stagnancy.
> 
> No, this friction and stagnancy is essentially the result of not sharing
> the same agenda and different tastes also.

That summary is a bit too simple, and you know that. But okay, let's leave it 
that way.

> > rather block for something between <1ms .. 8ms, i.e. fs driver completes
> > before it would be able to actually see a flush request.
> > 
> > But to make it also clear: it would not bite with the new design either.
> > If
> > someone really would see a necessity to abort readdir requests, that would
> > be possible without huge changes.
> 
> Tflush requests are handled in the top-half exclusively by design, ie.
> they rely on the worker thread handling the targeted request to yield
> control back to the main I/O thread. Since this doesn't happen anymore
> with your patches, I'm not sure how you be able to abort a _many readdir_
> request once it's been fired into a worker thread.
> 
> A possible solution could be to break down a Treaddir into multiple
> _many readdirs_ jobs, and have a knob or some logic to control the
> latency ratio.

Too complicated, there is no need to break it down into subtasks or something. 
It can be handled with a simple abort condition in readdir_many()'s loop. But 
that's not an issue for now to discuss the details.

> > > > > > > Then the whole seekdir/rewinddir/telldir/readdir sequence should
> > > > > > > be called with a single v9fs_co_run_in_worker() invocation, in
> > > > > > > which case the locking could just be something like:
> > > > > > > 
> > > > > > > int coroutine_fn v9fs_co_readdir_many(V9fsPDU *pdu, V9fsFidState
> > > > > > > *fidp,
> > > > > > > 
> > > > > > >                                       struct V9fsDirEnt
> > > > > > >                                       **entries,
> > > > > > >                                       int32_t maxsize, bool
> > > > > > >                                       dostat)
> > > > > > > 
> > > > > > > {
> > > > > > > 
> > > > > > >     int err = 0;
> > > > > > >     
> > > > > > >     if (v9fs_request_cancelled(pdu)) {
> > > > > > >     
> > > > > > >         return -EINTR;
> > > > > > >     
> > > > > > >     }
> > > > > > >     
> > > > > > >     v9fs_readdir_lock(&fidp->fs.dir);
> > > > > > >     
> > > > > > >     v9fs_co_run_in_worker({
> > > > > > >     
> > > > > > >         err = do_readdir_many(pdu, fidp, entries, maxsize,
> > > > > > >         dostat);
> > > > > > >     
> > > > > > >     });
> > > > > > >     
> > > > > > >     v9fs_readdir_unlock(&fidp->fs.dir);
> > > > > > >     return err;
> > > > > > > 
> > > > > > > }
> > > > > > 
> > > > > > That's exactly what should be prevented. The lock must be on
> > > > > > driver
> > > > > > thread
> > > > > > side, not on main thread side. The goal is to reduce the time
> > > > > > slice of
> > > > > > individual locks. If the lock is on main thread, the time slice of
> > > > > > a
> > > > > > lock
> > > > > > (even on huge directories with thousands of entries) is multiple
> > > > > > factors
> > > > > > larger than if the lock is on driver thread side. So that's not
> > > > > > just
> > > > > > few
> > > > > > percent or so, it is huge.
> > > > > 
> > > > > Sorry I don't get it...  In a contention-less situation, which is
> > > > > the
> > > > > case we really care for, qemu_co_mutex_lock() has no overhead.
> > > > 
> > > > Yes, it only kicks in if there is concurrency.
> > > 
> > > And we don't care to preserve performance for silly clients, do we ?
> > 
> > We don't necessarily need to preserve performance for them, that's right.
> > But it is already handled appropriately for them, so destroying it (i.e.
> > slowing them down) only makes sense if there is a good reason for that.
> 
> Ending up with a mix of two different kind of mutexes for 9p2000.L and .u is
> a good enough reason for me.

Correct behaviour always has precedence before code aesthetics, but we already 
agreed to remove the lock entirely anyway, so let's move on.

> > > > > > > This would technically leave the locking in the main I/O thread,
> > > > > > > but it isn't conceptually different from locking at the
> > > > > > > beginning
> > > > > > > of do_readdir_lock() and unlocking before returning. My concern
> > > > > > > is that I don't know how coroutine mutexes behave with multiple
> > > > > > > worker threads...
> > > > > > 
> > > > > > Well, your Mutex -> CoMutex change was intended to fix a deadlock
> > > > > > with
> > > > > > the
> > > > > > old readdir implementation. That deadlock could not happen with
> > > > > > the
> > > > > > new
> > > > > > readdir implementation, which suggests that it probably makes
> > > > > > sense to
> > > > > > revert this change (i.e. CoMutex -> Mutex) for this new
> > > > > > implementation.
> > > > > 
> > > > > No we can't because it is also used with 9p2000.u, that you said you
> > > > > don't want to fix.
> > > > 
> > > > Yes and no, I did not say I don't want to fix readdir for 9p2000.u at
> > > > all.
> > > > What I said was I want to prioritize and concentrate on the most
> > > > relevant
> > > > issues first. 9p2000.L is the most commonly used protocol variant, so
> > > > I
> > > > would like to fix the most severe (e.g. performance) issues for
> > > > 9p2000.L
> > > > first before spending time on 9p2000.u which is AFAICS barely used in
> > > > comparison, which I personally don't use at all, and which I am hence
> > > > not
> > > > testing in the same degree and cannot serve with the same quality as
> > > > 9p2000.L right now.
> > > > 
> > > > Plus if there are really users caring for 9p2000.u, I can gladly
> > > > assist
> > > > them to address these issues for 9p2000.u as well, provided that they
> > > > help at least with testing the required 9p2000.u changes.
> > > 
> > > I would personally do the opposite... again ;)
> > > 
> > > Like you say we essentially care for 9p2000.L and we only do limited
> > > support for 9p2000.u. If we have a change that we really want for
> > > 9p2000.L but it has an impact on 9p2000.u because of shared code,
> > > it is fine to do the changes anyway, including changes to the 9p2000.u
> > > specific code. Very basic testing of 9p2000.u (mount/ls or find/umount)
> > > or maybe running pjd-fstest is enough IMHO. If we break someone's setup
> > > and he complains, then we can ask him to test a fix before we merge it.
> > 
> > Well, so you want to handle the relevant 9p2000.u readdir optimizations by
> > yourself, and you would send it as follow-up patch set (after this set is
> > through), including test cases?
> 
> Ah it wasn't my point. I was just saying that any valuable change for
> 9p2000.L prevails and if you have to change some common code (eg.
> locking) that could impact the 9p2000.u experience, you can do it
> anyway, even if you only do smoke testing with 9p2000.u.

Ah I see, so you would. But you don't. ;-)

And yes, I could. But I won't either, for the following reasons:

1. I would have to write readdir test cases for 9p2000.u as well, because the 
   9p2000.L tests are incompatible. -> My time

2. I have to change the 9p2000.u server code -> My time

3. This patch grows substantially, by both lines and amount of patches -> My
   time and probably +10 versions more of this patch series.

4. I would have to do at least some basic testing of 9p2000.u behaviour
   -> My time

All these things would bring me +x months farther away from reaching my actual 
goal: making 9p useable by Linux clients. So no, I don't waste time on 
9p2000.u optimizations before I see somebody actually caring for 9p2000.u 
efficiency.

Best regards,
Christian Schoenebeck
Christian Schoenebeck July 3, 2020, 6:27 p.m. UTC | #16
On Freitag, 3. Juli 2020 18:08:21 CEST Greg Kurz wrote:
> On Fri, 03 Jul 2020 10:08:09 +0200
> 
> Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:
> > On Donnerstag, 2. Juli 2020 19:23:35 CEST Christian Schoenebeck wrote:
> > > > > Back to the actual topic: so what do we do about the mutex then?
> > > > > CoMutex
> > > > > for 9p2000.u and Mutex for 9p2000.L? I know you find that ugly, but
> > > > > it
> > > > > would just be a transitional measure.
> > > > 
> > > > That would ruin my day...
> > > > 
> > > > More seriously, the recent fix for a deadlock condition that was
> > > > present
> > > > for years proves that nobody seems to be using silly clients with
> > > > QEMU.
> > > > So I think we should just dump the lock and add a one-time warning in
> > > > the top level handlers when we detect a duplicate readdir request on
> > > > the same fid. This should be a very simple patch (I can take care of
> > > > it right away).
> > > 
> > > Looks like we have a consensus! Then I wait for your patch removing the
> > > lock, and for your feedback whether or not you see anything else in this
> > > patch set?
> > 
> > Please wait before you work on this patch. I really do think your patch
> > should be based on/after this optimization patch for one reason: if (and
> > even though it's a big if) somebody comes along with a silly client as
> > you named it, then your patch can simply be reverted, which would not be
> > possible if it's next.
> > 
> > So I would really suggest I change this patch here to go the ugly way with
> > 2 mutex types for readdir 9p2000.L vs 9p2000.L, and your patch would get
> > rid of that mess by removing the lock entirely, okay?
> 
> If someones ever comes along with a silly client, she will get warnings
> explaining that she might get silly results. If it turns out that we
> really need to support that for valid reasons, it will be okay to focus
> on the appropriate fix when the time comes, not now. So I don't say any
> real benefit to postponing the removal of the lock after your series, but
> I do at least see one benefit, it will reduce the level of noise.

Reasons:

1. Less work for me, as I would have more conflicts to resolve manually if
   your patch would come next.

2. The dual mutex change is just like 3 lines of code more -> trivial (and my 
   problem)

3. If somebody complains, I just have to revert your patch.

4. For you, it does neither mean more time nor more efforts, as you haven't 
   started to write the patch yet.

5. The actual outcome from your side is the same: you get what you want, the
   lock will be gone entirely after all anyway.

and most notably:

6. I don't see any advantage if your patch would come next.

Best regards,
Christian Schoenebeck
diff mbox series

Patch

diff --git a/hw/9pfs/9p.c b/hw/9pfs/9p.c
index 43584aca41..8283a3cfbb 100644
--- a/hw/9pfs/9p.c
+++ b/hw/9pfs/9p.c
@@ -971,30 +971,6 @@  static int coroutine_fn fid_to_qid(V9fsPDU *pdu, V9fsFidState *fidp,
     return 0;
 }
 
-static int coroutine_fn dirent_to_qid(V9fsPDU *pdu, V9fsFidState *fidp,
-                                      struct dirent *dent, V9fsQID *qidp)
-{
-    struct stat stbuf;
-    V9fsPath path;
-    int err;
-
-    v9fs_path_init(&path);
-
-    err = v9fs_co_name_to_path(pdu, &fidp->path, dent->d_name, &path);
-    if (err < 0) {
-        goto out;
-    }
-    err = v9fs_co_lstat(pdu, &path, &stbuf);
-    if (err < 0) {
-        goto out;
-    }
-    err = stat_to_qid(pdu, &stbuf, qidp);
-
-out:
-    v9fs_path_free(&path);
-    return err;
-}
-
 V9fsPDU *pdu_alloc(V9fsState *s)
 {
     V9fsPDU *pdu = NULL;
@@ -2337,6 +2313,18 @@  size_t v9fs_readdir_response_size(V9fsString *name)
     return 24 + v9fs_string_size(name);
 }
 
+static void v9fs_free_dirents(struct V9fsDirEnt *e)
+{
+    struct V9fsDirEnt *next = NULL;
+
+    for (; e; e = next) {
+        next = e->next;
+        g_free(e->dent);
+        g_free(e->st);
+        g_free(e);
+    }
+}
+
 static int coroutine_fn v9fs_do_readdir(V9fsPDU *pdu, V9fsFidState *fidp,
                                         int32_t max_count)
 {
@@ -2345,54 +2333,53 @@  static int coroutine_fn v9fs_do_readdir(V9fsPDU *pdu, V9fsFidState *fidp,
     V9fsString name;
     int len, err = 0;
     int32_t count = 0;
-    off_t saved_dir_pos;
     struct dirent *dent;
+    struct stat *st;
+    struct V9fsDirEnt *entries = NULL;
 
-    /* save the directory position */
-    saved_dir_pos = v9fs_co_telldir(pdu, fidp);
-    if (saved_dir_pos < 0) {
-        return saved_dir_pos;
-    }
-
-    while (1) {
-        v9fs_readdir_lock(&fidp->fs.dir);
+    /*
+     * inode remapping requires the device id, which in turn might be
+     * different for different directory entries, so if inode remapping is
+     * enabled we have to make a full stat for each directory entry
+     */
+    const bool dostat = pdu->s->ctx.export_flags & V9FS_REMAP_INODES;
 
-        err = v9fs_co_readdir(pdu, fidp, &dent);
-        if (err || !dent) {
-            break;
-        }
-        v9fs_string_init(&name);
-        v9fs_string_sprintf(&name, "%s", dent->d_name);
-        if ((count + v9fs_readdir_response_size(&name)) > max_count) {
-            v9fs_readdir_unlock(&fidp->fs.dir);
+    /*
+     * Fetch all required directory entries altogether on a background IO
+     * thread from fs driver. We don't want to do that for each entry
+     * individually, because hopping between threads (this main IO thread
+     * and background IO driver thread) would sum up to huge latencies.
+     */
+    count = v9fs_co_readdir_many(pdu, fidp, &entries, max_count, dostat);
+    if (count < 0) {
+        err = count;
+        count = 0;
+        goto out;
+    }
+    count = 0;
 
-            /* Ran out of buffer. Set dir back to old position and return */
-            v9fs_co_seekdir(pdu, fidp, saved_dir_pos);
-            v9fs_string_free(&name);
-            return count;
-        }
+    for (struct V9fsDirEnt *e = entries; e; e = e->next) {
+        dent = e->dent;
 
         if (pdu->s->ctx.export_flags & V9FS_REMAP_INODES) {
-            /*
-             * dirent_to_qid() implies expensive stat call for each entry,
-             * we must do that here though since inode remapping requires
-             * the device id, which in turn might be different for
-             * different entries; we cannot make any assumption to avoid
-             * that here.
-             */
-            err = dirent_to_qid(pdu, fidp, dent, &qid);
+            st = e->st;
+            /* e->st should never be NULL, but just to be sure */
+            if (!st) {
+                err = -1;
+                break;
+            }
+
+            /* remap inode */
+            err = stat_to_qid(pdu, st, &qid);
             if (err < 0) {
-                v9fs_readdir_unlock(&fidp->fs.dir);
-                v9fs_co_seekdir(pdu, fidp, saved_dir_pos);
-                v9fs_string_free(&name);
-                return err;
+                break;
             }
         } else {
             /*
              * Fill up just the path field of qid because the client uses
              * only that. To fill the entire qid structure we will have
              * to stat each dirent found, which is expensive. For the
-             * latter reason we don't call dirent_to_qid() here. Only drawback
+             * latter reason we don't call stat_to_qid() here. Only drawback
              * is that no multi-device export detection of stat_to_qid()
              * would be done and provided as error to the user here. But
              * user would get that error anyway when accessing those
@@ -2405,25 +2392,26 @@  static int coroutine_fn v9fs_do_readdir(V9fsPDU *pdu, V9fsFidState *fidp,
             qid.version = 0;
         }
 
+        v9fs_string_init(&name);
+        v9fs_string_sprintf(&name, "%s", dent->d_name);
+
         /* 11 = 7 + 4 (7 = start offset, 4 = space for storing count) */
         len = pdu_marshal(pdu, 11 + count, "Qqbs",
                           &qid, dent->d_off,
                           dent->d_type, &name);
 
-        v9fs_readdir_unlock(&fidp->fs.dir);
+        v9fs_string_free(&name);
 
         if (len < 0) {
-            v9fs_co_seekdir(pdu, fidp, saved_dir_pos);
-            v9fs_string_free(&name);
-            return len;
+            err = len;
+            break;
         }
+
         count += len;
-        v9fs_string_free(&name);
-        saved_dir_pos = dent->d_off;
     }
 
-    v9fs_readdir_unlock(&fidp->fs.dir);
-
+out:
+    v9fs_free_dirents(entries);
     if (err < 0) {
         return err;
     }