cxl/pmem: Fix module reload vs workqueue state

A test of the form:

    while true; do modprobe -r cxl_pmem; modprobe cxl_pmem; done

May lead to a crash signature of the form:

    BUG: unable to handle page fault for address: ffffffffc0660030
    #PF: supervisor instruction fetch in kernel mode
    #PF: error_code(0x0010) - not-present page
    [..]
    Workqueue: cxl_pmem 0xffffffffc0660030
    RIP: 0010:0xffffffffc0660030
    Code: Unable to access opcode bytes at RIP 0xffffffffc0660006.
    [..]
    Call Trace:
     ? process_one_work+0x4ec/0x9c0
     ? pwq_dec_nr_in_flight+0x100/0x100
     ? rwlock_bug.part.0+0x60/0x60
     ? worker_thread+0x2eb/0x700

In that report the 0xffffffffc0660030 address corresponds to the former
function address of cxl_nvb_update_state() from a previous load of the
module, not the current address. Fix that by arranging for ->state_work
in the 'struct cxl_nvdimm_bridge' object to be reinitialized on cxl_pmem
module reload.

Details:

Recall that CXL subsystem wants to link a CXL memory expander device to
an NVDIMM sub-hierarchy when both a persistent memory range has been
registered by the CXL platform driver (cxl_acpi) *and* when that CXL
memory expander has published persistent memory capacity (Get Partition
Info). To this end the cxl_nvdimm_bridge driver arranges to rescan the
CXL bus when either of those conditions change. The helper
bus_rescan_devices() can not be called underneath the device_lock() for
any device on that bus, so the cxl_nvdimm_bridge driver uses a workqueue
for the rescan.

Typically a driver allocates driver data to hold a 'struct work_struct'
for a driven device, but for a workqueue that may run after ->remove()
returns, driver data will have been freed. The 'struct
cxl_nvdimm_bridge' object holds the state and work_struct directly.
Unfortunately it was only arranging for that infrastructure to be
initialized once per device creation rather than the necessary once per
workqueue (cxl_pmem_wq) creation.

Introduce is_cxl_nvdimm_bridge() and cxl_nvdimm_bridge_reset() in
support of invalidating stale references to a recently destroyed
cxl_pmem_wq.

Cc: <stable@vger.kernel.org>
Fixes: 8fdcb1704f61 ("cxl/pmem: Add initial infrastructure for pmem support")
Reported-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/cxl/core/pmem.c |    8 +++++++-
 drivers/cxl/cxl.h       |    8 ++++++++
 drivers/cxl/pmem.c      |   29 +++++++++++++++++++++++++++--
 3 files changed, 42 insertions(+), 3 deletions(-)

Message ID	163665474585.3505991.8397182770066720755.stgit@dwillia2-desk3.amr.corp.intel.com
State	Accepted
Commit	53989fad1286e652ea3655ae3367ba698da8d2ff
Headers	show Return-Path: <linux-cxl-owner@kernel.org> X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5964DC433EF for <linux-cxl@archiver.kernel.org>; Thu, 11 Nov 2021 18:19:11 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 3613961268 for <linux-cxl@archiver.kernel.org>; Thu, 11 Nov 2021 18:19:11 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234376AbhKKSV7 (ORCPT <rfc822;linux-cxl@archiver.kernel.org>); Thu, 11 Nov 2021 13:21:59 -0500 Received: from mga01.intel.com ([192.55.52.88]:46403 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232366AbhKKSV7 (ORCPT <rfc822;linux-cxl@vger.kernel.org>); Thu, 11 Nov 2021 13:21:59 -0500 X-IronPort-AV: E=McAfee;i="6200,9189,10165"; a="256696589" X-IronPort-AV: E=Sophos;i="5.87,226,1631602800"; d="scan'208";a="256696589" Received: from orsmga007.jf.intel.com ([10.7.209.58]) by fmsmga101.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 11 Nov 2021 10:19:06 -0800 X-IronPort-AV: E=Sophos;i="5.87,226,1631602800"; d="scan'208";a="492642362" Received: from dwillia2-desk3.jf.intel.com (HELO dwillia2-desk3.amr.corp.intel.com) ([10.54.39.25]) by orsmga007-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 11 Nov 2021 10:19:05 -0800 Subject: [PATCH] cxl/pmem: Fix module reload vs workqueue state From: Dan Williams <dan.j.williams@intel.com> To: linux-cxl@vger.kernel.org Cc: stable@vger.kernel.org, Vishal Verma <vishal.l.verma@intel.com>, ben.widawsky@intel.com, ira.weiny@intel.com, alison.schofield@intel.com, Jonathan.Cameron@huawei.com Date: Thu, 11 Nov 2021 10:19:05 -0800 Message-ID: <163665474585.3505991.8397182770066720755.stgit@dwillia2-desk3.amr.corp.intel.com> User-Agent: StGit/0.18-3-g996c MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Precedence: bulk List-ID: <linux-cxl.vger.kernel.org> X-Mailing-List: linux-cxl@vger.kernel.org
Series	cxl/pmem: Fix module reload vs workqueue state \| expand cxl/pmem: Fix module reload vs workqueue state

cxl/pmem: Fix module reload vs workqueue state

Commit Message

Comments

Patch