From patchwork Wed Mar 27 17:17:36 2024
X-Patchwork-Submitter: David Hildenbrand
X-Patchwork-Id: 13607117
From: David Hildenbrand
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, David Hildenbrand, Christian Borntraeger,
    Janosch Frank, Claudio Imbrenda, Heiko Carstens, Vasily Gorbik,
    Andrew Morton, Peter Xu, Alexander Gordeev, Sven Schnelle,
    Gerald Schaefer, Andrea Arcangeli, kvm@vger.kernel.org,
    linux-s390@vger.kernel.org
Subject: [PATCH v2 1/2] mm/userfaultfd: don't place zeropages when zeropages are disallowed
Date: Wed, 27 Mar 2024 18:17:36 +0100
Message-ID: <20240327171737.919590-2-david@redhat.com>
In-Reply-To: <20240327171737.919590-1-david@redhat.com>
References: <20240327171737.919590-1-david@redhat.com>

s390x must disable shared zeropages for processes running VMs, because
the VMs could end up making use of "storage keys" or protected
virtualization, which are incompatible with shared zeropages. Yet, with
userfaultfd it is possible to insert shared zeropages into such
processes. Let's fall back to simply allocating a fresh, zeroed
anonymous folio and insert that instead.

mm_forbids_zeropage() was introduced in commit 593befa6ab74 ("mm:
introduce mm_forbids_zeropage function"), briefly before userfaultfd
went upstream.
Note that we don't want to fail the UFFDIO_ZEROPAGE request like we do
for hugetlb: it would be rather unexpected. Further, we also cannot
really indicate "not supported" to user space ahead of time: it could
be that the MM disallows zeropages only after userfaultfd was already
registered.

Fixes: c1a4de99fada ("userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation")
Reviewed-by: Peter Xu
Signed-off-by: David Hildenbrand
---
 mm/userfaultfd.c | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 712160cd41ec..9d385696fb89 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -316,6 +316,37 @@ static int mfill_atomic_pte_copy(pmd_t *dst_pmd,
 	goto out;
 }
 
+static int mfill_atomic_pte_zeroed_folio(pmd_t *dst_pmd,
+		struct vm_area_struct *dst_vma, unsigned long dst_addr)
+{
+	struct folio *folio;
+	int ret = -ENOMEM;
+
+	folio = vma_alloc_zeroed_movable_folio(dst_vma, dst_addr);
+	if (!folio)
+		return ret;
+
+	if (mem_cgroup_charge(folio, dst_vma->vm_mm, GFP_KERNEL))
+		goto out_put;
+
+	/*
+	 * The memory barrier inside __folio_mark_uptodate() makes sure that
+	 * zeroing out the folio becomes visible before mapping the page
+	 * using set_pte_at(). See do_anonymous_page().
+	 */
+	__folio_mark_uptodate(folio);
+
+	ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr,
+				       &folio->page, true, 0);
+	if (ret)
+		goto out_put;
+
+	return 0;
+out_put:
+	folio_put(folio);
+	return ret;
+}
+
 static int mfill_atomic_pte_zeropage(pmd_t *dst_pmd,
 				     struct vm_area_struct *dst_vma,
 				     unsigned long dst_addr)
@@ -324,6 +355,9 @@ static int mfill_atomic_pte_zeropage(pmd_t *dst_pmd,
 	spinlock_t *ptl;
 	int ret;
 
+	if (mm_forbids_zeropage(dst_vma->vm_mm))
+		return mfill_atomic_pte_zeroed_folio(dst_pmd, dst_vma, dst_addr);
+
 	_dst_pte = pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr),
 					 dst_vma->vm_page_prot));
 	ret = -EAGAIN;
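For context (not part of this patch), a minimal, hypothetical user-space
sketch of how a fault handler resolves a missing fault via UFFDIO_ZEROPAGE;
the helper name is illustrative and the userfaultfd descriptor setup is
assumed to have happened elsewhere. With the change above, the same request
transparently falls back to a fresh anonymous folio on MMs where
mm_forbids_zeropage() applies, instead of mapping the shared zeropage:

#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>

/* Resolve a missing fault at 'addr' ('len' bytes, both page-aligned) with zeroes. */
static int uffd_fill_zeroes(int uffd, unsigned long addr, unsigned long len)
{
	struct uffdio_zeropage zp;

	memset(&zp, 0, sizeof(zp));
	zp.range.start = addr;
	zp.range.len = len;
	zp.mode = 0;

	/*
	 * On s390x VMs that forbid the shared zeropage, the kernel now
	 * installs a freshly zeroed anonymous folio here instead of
	 * mapping the shared zeropage.
	 */
	if (ioctl(uffd, UFFDIO_ZEROPAGE, &zp))
		return -1;
	return 0;
}
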
From patchwork Wed Mar 27 17:17:37 2024
X-Patchwork-Submitter: David Hildenbrand
X-Patchwork-Id: 13607118
From: David Hildenbrand
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, David Hildenbrand, Christian Borntraeger,
    Janosch Frank, Claudio Imbrenda, Heiko Carstens, Vasily Gorbik,
    Andrew Morton, Peter Xu, Alexander Gordeev, Sven Schnelle,
    Gerald Schaefer, Andrea Arcangeli, kvm@vger.kernel.org,
    linux-s390@vger.kernel.org
Subject: [PATCH v2 2/2] s390/mm: re-enable the shared zeropage for !PV and !skeys KVM guests
Date: Wed, 27 Mar 2024 18:17:37 +0100
Message-ID: <20240327171737.919590-3-david@redhat.com>
In-Reply-To: <20240327171737.919590-1-david@redhat.com>
References: <20240327171737.919590-1-david@redhat.com>
commit fa41ba0d08de ("s390/mm: avoid empty zero pages for KVM guests to
avoid postcopy hangs") introduced an undesired side effect when combined
with memory ballooning and VM migration: memory that is part of the
inflated memory balloon will consume memory.

Assume we have a 100GiB VM and inflated the balloon to 40GiB. Our VM
will consume ~60GiB of memory. If we now trigger a VM migration,
hypervisors like QEMU will read all VM memory. As s390x does not support
the shared zeropage, we'll end up allocating memory for all
previously-inflated memory that is part of the memory balloon: 40 GiB.
So we might easily (and unexpectedly) crash the VM on the migration
source.

Even worse, hypervisors like QEMU optimize zeropage migration to not
consume memory on the migration destination: when migrating a "page full
of zeroes", the migration destination checks whether the target memory
is already zero (by reading the destination memory) and avoids writing
to the memory, to not allocate memory. However, s390x will also allocate
memory here, implying that also on the migration destination we will end
up allocating all previously-inflated memory that is part of the memory
balloon.

This is especially bad if actual memory overcommit was not desired, when
memory ballooning is used for dynamic VM memory resizing: setting aside
some memory during boot that can be added later on demand. Alternatives
like virtio-mem that would avoid this issue are not yet available on
s390x.

There could be ways to optimize some cases in user space: before reading
memory in an anonymous private mapping on the migration source, check
via /proc/self/pagemap if anything is already populated. Similarly,
check on the migration destination before reading. While that would
avoid populating tables full of shared zeropages on all architectures,
it's harder to get right and performant, and requires user space
changes. Further, with postcopy live migration we must place a page, so
there, "avoid touching memory to avoid allocating memory" is not really
possible.
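To illustrate the user-space approach mentioned above (not part of this
series), a rough, hypothetical sketch of checking /proc/self/pagemap for a
populated page before touching it; the helper name and error handling are
illustrative only:

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Return 1 if the page backing 'vaddr' is populated (present or swapped),
 * 0 if not, -1 on error. Bit 63 of a pagemap entry means "present",
 * bit 62 means "swapped".
 */
static int page_is_populated(uintptr_t vaddr)
{
	long pagesize = sysconf(_SC_PAGESIZE);
	uint64_t entry;
	off_t offset = (off_t)(vaddr / pagesize) * sizeof(entry);
	int fd = open("/proc/self/pagemap", O_RDONLY);

	if (fd < 0)
		return -1;
	if (pread(fd, &entry, sizeof(entry), offset) != sizeof(entry)) {
		close(fd);
		return -1;
	}
	close(fd);
	return !!(entry & (3ULL << 62));
}
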
(Note that previously, we would have falsely inserted shared zeropages
into processes using UFFDIO_ZEROPAGE where mm_forbids_zeropage() would
have actually forbidden it.)

PV is currently incompatible with memory ballooning, and in the common
case, KVM guests don't make use of storage keys. Instead of zapping
zeropages when enabling storage keys / PV, which turned out to be
problematic in the past, let's do exactly what we do with KSM pages:
trigger unsharing faults to replace the shared zeropages with proper
anonymous folios.

What about the added latency when enabling storage keys? Having a lot of
zeropages in applicable environments (PV, legacy guests, unittests) is
unexpected. Further, KSM could today already unshare the zeropages, and
unmerging KSM pages when enabling storage keys would unshare the
KSM-placed zeropages in the same way, resulting in the same latency.

Reviewed-by: Christian Borntraeger
Tested-by: Christian Borntraeger
Fixes: fa41ba0d08de ("s390/mm: avoid empty zero pages for KVM guests to avoid postcopy hangs")
Signed-off-by: David Hildenbrand
---
 arch/s390/include/asm/gmap.h        |   2 +-
 arch/s390/include/asm/mmu.h         |   5 +
 arch/s390/include/asm/mmu_context.h |   1 +
 arch/s390/include/asm/pgtable.h     |  15 ++-
 arch/s390/kvm/kvm-s390.c            |   4 +-
 arch/s390/mm/gmap.c                 | 163 +++++++++++++++++++++-------
 6 files changed, 143 insertions(+), 47 deletions(-)

diff --git a/arch/s390/include/asm/gmap.h b/arch/s390/include/asm/gmap.h
index 5cc46e0dde62..9725586f4259 100644
--- a/arch/s390/include/asm/gmap.h
+++ b/arch/s390/include/asm/gmap.h
@@ -146,7 +146,7 @@ int gmap_mprotect_notify(struct gmap *, unsigned long start,
 void gmap_sync_dirty_log_pmd(struct gmap *gmap, unsigned long dirty_bitmap[4],
 			     unsigned long gaddr, unsigned long vmaddr);
 
-int gmap_mark_unmergeable(void);
+int s390_disable_cow_sharing(void);
 void s390_unlist_old_asce(struct gmap *gmap);
 int s390_replace_asce(struct gmap *gmap);
 void s390_uv_destroy_pfns(unsigned long count, unsigned long *pfns);
diff --git a/arch/s390/include/asm/mmu.h b/arch/s390/include/asm/mmu.h
index bb1b4bef1878..4c2dc7abc285 100644
--- a/arch/s390/include/asm/mmu.h
+++ b/arch/s390/include/asm/mmu.h
@@ -32,6 +32,11 @@ typedef struct {
 	unsigned int uses_skeys:1;
 	/* The mmu context uses CMM. */
 	unsigned int uses_cmm:1;
+	/*
+	 * The mmu context allows COW-sharing of memory pages (KSM, zeropage).
+	 * Note that COW-sharing during fork() is currently always allowed.
+	 */
+	unsigned int allow_cow_sharing:1;
 	/* The gmaps associated with this context are allowed to use huge pages. */
 	unsigned int allow_gmap_hpage_1m:1;
 } mm_context_t;
diff --git a/arch/s390/include/asm/mmu_context.h b/arch/s390/include/asm/mmu_context.h
index 929af18b0908..a7789a9f6218 100644
--- a/arch/s390/include/asm/mmu_context.h
+++ b/arch/s390/include/asm/mmu_context.h
@@ -35,6 +35,7 @@ static inline int init_new_context(struct task_struct *tsk,
 	mm->context.has_pgste = 0;
 	mm->context.uses_skeys = 0;
 	mm->context.uses_cmm = 0;
+	mm->context.allow_cow_sharing = 1;
 	mm->context.allow_gmap_hpage_1m = 0;
 #endif
 	switch (mm->context.asce_limit) {
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 60950e7a25f5..1a71cb19c089 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -566,10 +566,19 @@ static inline pud_t set_pud_bit(pud_t pud, pgprot_t prot)
 }
 
 /*
- * In the case that a guest uses storage keys
- * faults should no longer be backed by zero pages
+ * As soon as the guest uses storage keys or enables PV, we deduplicate all
+ * mapped shared zeropages and prevent new shared zeropages from getting
+ * mapped.
  */
-#define mm_forbids_zeropage	mm_has_pgste
+static inline int mm_forbids_zeropage(struct mm_struct *mm)
+{
+#ifdef CONFIG_PGSTE
+	if (!mm->context.allow_cow_sharing)
+		return 1;
+#endif
+	return 0;
+}
+
 static inline int mm_uses_skeys(struct mm_struct *mm)
 {
 #ifdef CONFIG_PGSTE
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 5147b943a864..db3392f0be21 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -2631,9 +2631,7 @@ static int kvm_s390_handle_pv(struct kvm *kvm, struct kvm_pv_cmd *cmd)
 		if (r)
 			break;
 
-		mmap_write_lock(current->mm);
-		r = gmap_mark_unmergeable();
-		mmap_write_unlock(current->mm);
+		r = s390_disable_cow_sharing();
 		if (r)
 			break;
 
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 094b43b121cd..9233b0acac89 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2549,41 +2549,6 @@ static inline void thp_split_mm(struct mm_struct *mm)
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-/*
- * Remove all empty zero pages from the mapping for lazy refaulting
- * - This must be called after mm->context.has_pgste is set, to avoid
- *   future creation of zero pages
- * - This must be called after THP was disabled.
- *
- * mm contracts with s390, that even if mm were to remove a page table,
- * racing with the loop below and so causing pte_offset_map_lock() to fail,
- * it will never insert a page table containing empty zero pages once
- * mm_forbids_zeropage(mm) i.e. mm->context.has_pgste is set.
- */
-static int __zap_zero_pages(pmd_t *pmd, unsigned long start,
-			    unsigned long end, struct mm_walk *walk)
-{
-	unsigned long addr;
-
-	for (addr = start; addr != end; addr += PAGE_SIZE) {
-		pte_t *ptep;
-		spinlock_t *ptl;
-
-		ptep = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
-		if (!ptep)
-			break;
-		if (is_zero_pfn(pte_pfn(*ptep)))
-			ptep_xchg_direct(walk->mm, addr, ptep, __pte(_PAGE_INVALID));
-		pte_unmap_unlock(ptep, ptl);
-	}
-	return 0;
-}
-
-static const struct mm_walk_ops zap_zero_walk_ops = {
-	.pmd_entry	= __zap_zero_pages,
-	.walk_lock	= PGWALK_WRLOCK,
-};
-
 /*
  * switch on pgstes for its userspace process (for kvm)
  */
@@ -2601,22 +2566,140 @@ int s390_enable_sie(void)
 	mm->context.has_pgste = 1;
 	/* split thp mappings and disable thp for future mappings */
 	thp_split_mm(mm);
-	walk_page_range(mm, 0, TASK_SIZE, &zap_zero_walk_ops, NULL);
 	mmap_write_unlock(mm);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(s390_enable_sie);
 
-int gmap_mark_unmergeable(void)
+static int find_zeropage_pte_entry(pte_t *pte, unsigned long addr,
+				   unsigned long end, struct mm_walk *walk)
+{
+	unsigned long *found_addr = walk->private;
+
+	/* Return 1 if the page is a zeropage. */
+	if (is_zero_pfn(pte_pfn(*pte))) {
+
+		/*
+		 * Shared zeropage in e.g., a FS DAX mapping? We cannot do the
+		 * right thing and likely don't care: FAULT_FLAG_UNSHARE
+		 * currently only works in COW mappings, which is also where
+		 * mm_forbids_zeropage() is checked.
+		 */
+		if (!is_cow_mapping(walk->vma->vm_flags))
+			return -EFAULT;
+
+		*found_addr = addr;
+		return 1;
+	}
+	return 0;
+}
+
+static const struct mm_walk_ops find_zeropage_ops = {
+	.pte_entry	= find_zeropage_pte_entry,
+	.walk_lock	= PGWALK_WRLOCK,
+};
+
+/*
+ * Unshare all shared zeropages, replacing them by anonymous pages. Note that
+ * we cannot simply zap all shared zeropages, because this could later
+ * trigger unexpected userfaultfd missing events.
+ *
+ * This must be called after mm->context.allow_cow_sharing was
+ * set to 0, to avoid future mappings of shared zeropages.
+ *
+ * mm contracts with s390, that even if mm were to remove a page table,
+ * and racing with walk_page_range_vma() calling pte_offset_map_lock()
+ * would fail, it will never insert a page table containing empty zero
+ * pages once mm_forbids_zeropage(mm) i.e.
+ * mm->context.allow_cow_sharing is set to 0.
+ */
+static int __s390_unshare_zeropages(struct mm_struct *mm)
+{
+	struct vm_area_struct *vma;
+	VMA_ITERATOR(vmi, mm, 0);
+	unsigned long addr;
+	int rc;
+
+	for_each_vma(vmi, vma) {
+		/*
+		 * We could only look at COW mappings, but it's more future
+		 * proof to catch unexpected zeropages in other mappings and
+		 * fail.
+		 */
+		if ((vma->vm_flags & VM_PFNMAP) || is_vm_hugetlb_page(vma))
+			continue;
+		addr = vma->vm_start;
+
+retry:
+		rc = walk_page_range_vma(vma, addr, vma->vm_end,
+					 &find_zeropage_ops, &addr);
+		if (rc <= 0)
+			continue;
+
+		/* addr was updated by find_zeropage_pte_entry() */
+		rc = handle_mm_fault(vma, addr,
+				     FAULT_FLAG_UNSHARE | FAULT_FLAG_REMOTE,
+				     NULL);
+		if (rc & VM_FAULT_OOM)
+			return -ENOMEM;
+		/*
+		 * See break_ksm(): even after handle_mm_fault() returned 0, we
+		 * must start the lookup from the current address, because
+		 * handle_mm_fault() may back out if there's any difficulty.
+		 *
+		 * VM_FAULT_SIGBUS and VM_FAULT_SIGSEGV are unexpected but
+		 * maybe they could trigger in the future on concurrent
+		 * truncation. In that case, the shared zeropage would be gone
+		 * and we can simply retry and make progress.
+		 */
+		cond_resched();
+		goto retry;
+	}
+
+	return rc;
+}
+
+static int __s390_disable_cow_sharing(struct mm_struct *mm)
 {
+	int rc;
+
+	if (!mm->context.allow_cow_sharing)
+		return 0;
+
+	mm->context.allow_cow_sharing = 0;
+
+	/* Replace all shared zeropages by anonymous pages. */
+	rc = __s390_unshare_zeropages(mm);
 	/*
 	 * Make sure to disable KSM (if enabled for the whole process or
 	 * individual VMAs). Note that nothing currently hinders user space
 	 * from re-enabling it.
 	 */
-	return ksm_disable(current->mm);
+	if (!rc)
+		rc = ksm_disable(mm);
+	if (rc)
+		mm->context.allow_cow_sharing = 1;
+	return rc;
+}
+
+/*
+ * Disable most COW-sharing of memory pages for the whole process:
+ * (1) Disable KSM and unmerge/unshare any KSM pages.
+ * (2) Disallow shared zeropages and unshare any zeropages that are mapped.
+ *
+ * Note that we currently don't bother with COW-shared pages that are shared
+ * with parent/child processes due to fork().
+ */
+int s390_disable_cow_sharing(void)
+{
+	int rc;
+
+	mmap_write_lock(current->mm);
+	rc = __s390_disable_cow_sharing(current->mm);
+	mmap_write_unlock(current->mm);
+	return rc;
 }
-EXPORT_SYMBOL_GPL(gmap_mark_unmergeable);
+EXPORT_SYMBOL_GPL(s390_disable_cow_sharing);
 
 /*
  * Enable storage key handling from now on and initialize the storage
@@ -2685,7 +2768,7 @@ int s390_enable_skey(void)
 		goto out_up;
 
 	mm->context.uses_skeys = 1;
-	rc = gmap_mark_unmergeable();
+	rc = __s390_disable_cow_sharing(mm);
 	if (rc) {
 		mm->context.uses_skeys = 0;
 		goto out_up;