Message ID | 168806401782.1034990.9686296943273298604.stgit@morisot.1015granger.net
---|---
Series | RPC service thread scheduler optimizations
On Fri, 30 Jun 2023, Chuck Lever wrote:
> Hi -
>
> Walking a linked list to find an idle thread is not CPU cache-
> friendly, and in fact I've noted palpable per-request latency
> impacts as the number of nfsd threads on the server increases.
>
> After discussing some possible improvements with Jeff at LSF/MM,
> I've been experimenting with the following series. I've measured an
> order of magnitude latency improvement in the thread lookup time,
> and have managed to keep the whole thing lockless.
>
> The only thing I don't like is that allocating the idle bitmaps in
> advance means we've got an /a priori/ cap on the number of NFSD
> threads that can be created. I'd love to find a way to enable
> the pool idle bitmaps to expand dynamically. Suggestions welcome.

Hi Chuck,

The series looks good. I did notice that patch 6/8 used UINT_MAX and
U32_MAX in different places for the same number, though the next patch
replaced them both with a new number - the same in both places now.

I agree that an a priori cap on the number of threads is not ideal.
Have you considered using the xarray to store only busy threads? I
think its lookup mechanism mostly relies on a bitmap of present
entries, but I'm not completely sure.

That would require some extra work for svc_stop_threads(), which is
the only place we are interested in threads that aren't busy. We would
need to record a target number of threads, and whenever a thread
becomes idle it checks whether that number is exceeded. If so, it
exits, decrementing the number of threads; otherwise it re-inserts
itself into the xarray (if it cannot find a transport to handle).

Alternately, we could store bitmaps as values in an xarray, much like
the ida code does.

Thanks,
NeilBrown
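A minimal sketch of the idle-path check Neil describes might look like
the following. The sp_nrthreads and sp_target_threads counters, the
sp_thread_xa array, the xa_index field, and the helper name are all
hypothetical names for illustration; only xa_store() and
svc_exit_thread() are existing kernel interfaces.

```c
#include <linux/xarray.h>
#include <linux/sunrpc/svc.h>

/*
 * Sketch only: sp_nrthreads, sp_target_threads, sp_thread_xa and
 * xa_index are hypothetical fields for this scheme, not existing
 * svc_pool/svc_rqst members.
 */
static void svc_thread_becomes_idle(struct svc_pool *pool,
				    struct svc_rqst *rqstp)
{
	if (atomic_read(&pool->sp_nrthreads) >
	    atomic_read(&pool->sp_target_threads)) {
		/* svc_stop_threads() lowered the target, so this
		 * thread is surplus: exit instead of re-inserting. */
		atomic_dec(&pool->sp_nrthreads);
		svc_exit_thread(rqstp);
		return;
	}

	/* Not surplus: make this thread findable again. */
	xa_store(&pool->sp_thread_xa, rqstp->xa_index, rqstp,
		 GFP_KERNEL);
}
```

Under this scheme svc_stop_threads() would only need to lower
sp_target_threads and wake the pool; surplus threads notice the new
target as they go idle and exit on their own.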
On Fri, 30 Jun 2023, NeilBrown wrote:
>
> I agree that an a priori cap on the number of threads is not ideal.
> Have you considered using the xarray to store only busy threads?
> I think its lookup mechanism mostly relies on a bitmap of present
> entries, but I'm not completely sure.

It might be even better to use xa_set_mark() and xa_clear_mark() to
manage the busy state. These are not atomic, so you would need an
atomic operation in the rqst.

#define XA_MARK_IDLE XA_MARK_1

	do {
		rqstp = xa_find(xa, ..., XA_MARK_IDLE);
		if (!rqstp)
			break;
		/* Claim the thread; on failure look for another. */
		if (!test_and_set_bit(RQ_BUSY, &rqstp->rq_flags)) {
			xa_clear_mark(xa, rqstp->xa_index, XA_MARK_IDLE);
			break;
		}
	} while (rqstp);

xa_find() should be nearly as fast as find_next_bit().

NeilBrown
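The release side of that scheme is implied but not shown. A sketch
under the same assumptions (xa_index is the hypothetical per-thread
index; clear_bit_unlock() is the existing release-ordered bit op, used
here so the mark becomes visible only after RQ_BUSY is clear):

```c
/* Sketch: thread has finished its work and is about to wait again. */
static void svc_thread_mark_idle(struct xarray *xa,
				 struct svc_rqst *rqstp)
{
	/* Drop the claim first, with release semantics... */
	clear_bit_unlock(RQ_BUSY, &rqstp->rq_flags);
	/* ...then advertise the thread as idle, so a searcher that
	 * finds the mark will usually win test_and_set_bit(). */
	xa_set_mark(xa, rqstp->xa_index, XA_MARK_IDLE);
}
```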
> On Jun 29, 2023, at 11:04 PM, NeilBrown <neilb@suse.de> wrote:
>
> On Fri, 30 Jun 2023, NeilBrown wrote:
>>
>> I agree that an a priori cap on the number of threads is not ideal.
>> Have you considered using the xarray to store only busy threads?
>> I think its lookup mechanism mostly relies on a bitmap of present
>> entries, but I'm not completely sure.
>
> It might be even better to use xa_set_mark() and xa_clear_mark() to
> manage the busy state. These are not atomic, so you would need an
> atomic operation in the rqst.
>
> #define XA_MARK_IDLE XA_MARK_1
>
>	do {
>		rqstp = xa_find(xa, ..., XA_MARK_IDLE);
>		if (!rqstp)
>			break;
>		/* Claim the thread; on failure look for another. */
>		if (!test_and_set_bit(RQ_BUSY, &rqstp->rq_flags)) {
>			xa_clear_mark(xa, rqstp->xa_index, XA_MARK_IDLE);
>			break;
>		}
>	} while (rqstp);
>
> xa_find() should be nearly as fast as find_next_bit().

Yes, xa_find() is fast, and takes only the RCU read lock.
Unfortunately, xa_clear_mark() and xa_set_mark() (and adding and
removing items) require taking the xa_lock(). The current list-based
mechanism is lockless, and so is the replacement I've proposed. Jeff
and I regarded the addition of a spin lock in the
select-an-idle-thread-and-wake-it-up function as something we want to
avoid.

Thus the pool's idle bitmap enables the use of atomic bit testing,
setting, and clearing to make it unnecessary to hold a spin lock while
selecting and marking the chosen thread. After selection, only the RCU
read lock is needed to safely map that bit position to an svc_rqst
pointer.

We might be able to use another sp_flags bit to cause thread selection
to sleep, if we also protect the thread selection process with the RCU
read lock. Flip the bit on, wait one RCU grace period, and it should
be safe to copy and replace the idle bitmap. Once bitmap replacement
is complete, clear the bit and wake up any threads waiting to find an
idle worker.

The balancing act is whether we want to add that kind of cleverness to
handle a situation that will be relatively rare. Is it sensible to
allow more than a few thousand RPC service threads in a pool?

--
Chuck Lever
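For contrast, a condensed sketch of the bitmap-based selection Chuck
describes (sp_idle_map, sp_thread_xa, and sp_nrthreads are assumed
names based on this description, not necessarily those in the posted
series):

```c
#include <linux/bitops.h>
#include <linux/rcupdate.h>
#include <linux/xarray.h>
#include <linux/sunrpc/svc.h>

static struct svc_rqst *svc_pool_find_idle_thread(struct svc_pool *pool)
{
	struct svc_rqst *rqstp = NULL;
	unsigned long bit;

	rcu_read_lock();
	for (bit = find_first_bit(pool->sp_idle_map, pool->sp_nrthreads);
	     bit < pool->sp_nrthreads;
	     bit = find_next_bit(pool->sp_idle_map, pool->sp_nrthreads,
				 bit + 1)) {
		/* Atomically claim the thread: no spin lock needed. */
		if (test_and_clear_bit(bit, pool->sp_idle_map)) {
			/* Map the bit position to an svc_rqst while
			 * holding only the RCU read lock. */
			rqstp = xa_load(&pool->sp_thread_xa, bit);
			break;
		}
	}
	rcu_read_unlock();
	return rqstp;
}
```

The only lock here is the RCU read-side critical section, which is
what keeps the wake-up fast path lockless.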