[RFC,v4,05/13] workqueue, ktask: renice helper threads to prevent starvation

With ktask helper threads running at MAX_NICE, it's possible for one or
more of them to begin chunks of the task and then have their CPU time
constrained by higher priority threads.  The main ktask thread, running
at normal priority, may finish all available chunks of the task and then
wait on the MAX_NICE helpers to finish the last in-progress chunks, for
longer than it would have if no helpers were used.

Avoid this by having the main thread assign its priority to each
unfinished helper one at a time so that on a heavily loaded system,
exactly one thread in a given ktask call is running at the main thread's
priority.  One thread at that priority is enough to ensure forward
progress, and capping it at one limits excessive multithreading.
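
As a rough illustration of the one-at-a-time handoff, here is a
single-threaded userspace model, not the ktask code.  Every name in it
(struct helper, boost_and_wait, renice_unfinished_helpers,
MODEL_MAX_NICE) is hypothetical, and "waiting" is modeled by simply
marking the helper finished.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names come from the patch. */
#define MODEL_MAX_NICE 19

struct helper {
	int nice;       /* nice level the helper is currently running at */
	bool finished;  /* true once its in-progress chunk has completed */
};

/*
 * Model of "give this helper the main thread's priority and wait for it".
 * In the model, a boosted helper always completes before we return.
 */
static void boost_and_wait(struct helper *h, int main_nice)
{
	h->nice = main_nice;
	h->finished = true;
}

/*
 * Visit helpers in order and boost only those still unfinished.  Because
 * each boost waits for completion before moving on, no two unfinished
 * helpers hold the main thread's priority at the same time.
 * Returns the number of helpers that had to be boosted.
 */
static size_t renice_unfinished_helpers(struct helper *helpers, size_t n,
					int main_nice)
{
	size_t boosted = 0;

	assert(helpers != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (helpers[i].finished)
			continue;   /* already done, leave it alone */
		boost_and_wait(&helpers[i], main_nice);
		boosted++;
	}
	return boosted;
}
```

Helpers that finished on their own are never touched, so on a lightly
loaded system the loop would boost nothing.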

Since the workqueue interface, on which ktask is built, does not provide
access to worker threads, ktask can't adjust their priorities directly,
so add a new interface to allow a previously-queued work item to run at
a different priority than the one controlled by the corresponding
workqueue's 'nice' attribute.  The worker assigned to the work item will
run the work at the given priority, temporarily overriding the worker's
priority.

The interface is flush_work_at_nice, which ensures the given work item's
assigned worker runs at the specified nice level and waits for the work
item to finish.
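
The intended semantics (override the worker's priority for one work
item, wait, then fall back to the pool's priority) can be sketched with
a small userspace model.  This is an assumption-laden toy, not the
workqueue implementation: the model_* names and struct layouts are
invented here, and the real interface adjusts a worker that may already
be running rather than invoking the callback itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; not the kernel's worker or work_struct. */
struct model_worker {
	int pool_nice;  /* nice level set by the workqueue's 'nice' attribute */
	int cur_nice;   /* nice level the worker is running at right now */
};

struct model_work {
	struct model_worker *worker;  /* worker assigned to this item */
	int seen_nice;                /* nice level observed while running */
	bool done;
};

/* Work callback for the model: record the priority it ran at. */
static void model_work_fn(struct model_work *work)
{
	work->seen_nice = work->worker->cur_nice;
}

/*
 * Model of flush-at-nice: run the item at 'nice', wait for it (here, by
 * calling it synchronously), then restore the worker's pool priority.
 * Returns false if the item had already finished, true otherwise.
 */
static bool model_flush_work_at_nice(struct model_work *work, int nice)
{
	struct model_worker *worker = work->worker;

	if (work->done)
		return false;

	assert(worker != NULL);
	worker->cur_nice = nice;               /* temporary override */
	model_work_fn(work);
	work->done = true;
	worker->cur_nice = worker->pool_nice;  /* override ends with the item */
	return true;
}
```

The restore step is what makes the override temporary: later work items
on the same worker run at the pool's nice level again.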

An alternative choice would have been to simply requeue the work item to
a pool with workers of the new priority, but this doesn't seem feasible
because a worker may have already started executing the work and there's
currently no way to interrupt it midway through.  The proposed interface
solves this issue because a worker's priority can be adjusted while it's
executing the work.

TODO:  flush_work_at_nice is a proof-of-concept only, and it may be
desirable for the interface to set the work's nice level without also
waiting for it to finish.  It's implemented in the flush path for this
RFC because it was fairly simple to write ;-)

I ran tests similar to the ones in the last patch with a couple of
differences:
 - The non-ktask workload uses 8 CPUs instead of 7 to compete with the
   main ktask thread as well as the ktask helpers, so that when the main
   thread finishes, its CPU is completely occupied by the non-ktask
   workload, meaning MAX_NICE helpers can't run as often.
 - The non-ktask workload starts before the ktask workload, rather
   than after, to maximize the chance that it starves helpers.

Runtimes in seconds.

Case 1: Synthetic, worst-case CPU contention

  ktask_test - a tight loop doing integer multiplication to max out on CPU;
               used for testing only, does not appear in this series
  stress-ng  - cpu stressor ("-c --cpu-method ackerman --cpu-ops 1200")

             8_ktask_thrs           8_ktask_thrs
               w/o_renice  (stdev)   with_renice  (stdev)  1_ktask_thr  (stdev)
             ------------------------------------------------------------------
  ktask_test        41.98  ( 0.22)         25.15  ( 2.98)        30.40  ( 0.61)
  stress-ng         44.79  ( 1.11)         46.37  ( 0.69)        53.29  ( 1.91)

Without renicing, ktask_test finishes just after stress-ng does because
stress-ng needs to free up CPUs for the helpers to finish (ktask_test
shows a shorter runtime than stress-ng because ktask_test was started
later).  Renicing lets ktask_test finish 40% sooner, and running the
same amount of work in ktask_test with 1 thread instead of 8 finishes in
a comparable amount of time, though longer than "with_renice" because
MAX_NICE threads still get some CPU time, and the effect over 8 threads
adds up.

stress-ng's total runtime gets a little longer going from no renicing to
renicing, as expected, because each reniced ktask thread takes more CPU
time than before when the helpers were starved.

Running with one ktask thread, stress-ng's reported walltime goes up
because that single thread interferes with fewer stress-ng threads,
but with more impact, causing a greater spread in the time it takes for
individual stress-ng threads to finish.  Averages of the per-thread
stress-ng times from "with_renice" to "1_ktask_thr" come out roughly
the same, though, 43.81 and 43.89 respectively.  So the total runtime of
stress-ng across all threads is unaffected, but the time stress-ng takes
to finish running its threads completely actually improves by spreading
the ktask_test work over more threads.

Case 2: Real-world CPU contention

  ktask_vfio - VFIO page pin a 32G kvm guest
  usemem     - faults in 86G of anonymous THP per thread, PAGE_SIZE stride;
               used to mimic the page clearing that dominates in ktask_vfio
               so that usemem competes for the same system resources

             8_ktask_thrs           8_ktask_thrs
               w/o_renice  (stdev)   with_renice  (stdev)  1_ktask_thr  (stdev)
             ------------------------------------------------------------------
  ktask_vfio        18.59  ( 0.19)         14.62  ( 2.03)        16.24  ( 0.90)
      usemem        47.54  ( 0.89)         48.18  ( 0.77)        49.70  ( 1.20)

These results are similar to case 1's, though the differences between
times are not as pronounced because ktask_vfio's runtime is short
relative to usemem's.
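
The relative improvement from renicing can be rechecked from the two
tables with a few lines of C.  The helper names below are made up for
this note; the inputs are the mean runtimes reported above.

```c
#include <stdio.h>

/* Percentage by which 'after' finishes sooner than 'before'. */
static double pct_sooner(double before, double after)
{
	return (before - after) / before * 100.0;
}

/* Print the renice improvement for both cases, using the table means. */
static void print_renice_gains(void)
{
	/* Case 1: ktask_test, 8 threads, without vs. with renice. */
	printf("ktask_test: %.1f%% sooner\n", pct_sooner(41.98, 25.15));
	/* Case 2: ktask_vfio, 8 threads, without vs. with renice. */
	printf("ktask_vfio: %.1f%% sooner\n", pct_sooner(18.59, 14.62));
}
```

This gives about 40% for ktask_test, matching the figure quoted in case
1, and about 21% for ktask_vfio.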

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 include/linux/workqueue.h |   5 ++
 kernel/ktask.c            |  81 ++++++++++++++++++-----------
 kernel/workqueue.c        | 106 +++++++++++++++++++++++++++++++++++---
 3 files changed, 156 insertions(+), 36 deletions(-)

Message ID	20181105165558.11698-6-daniel.m.jordan@oracle.com (mailing list archive)
State	New, archived
