From patchwork Tue Nov 29 11:22:28 2016
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Kirill A . Shutemov"
X-Patchwork-Id: 9451521
From: "Kirill A. Shutemov"
To: "Theodore Ts'o", Andreas Dilger, Jan Kara, Andrew Morton
Cc: Alexander Viro, Hugh Dickins, Andrea Arcangeli, Dave Hansen,
 Vlastimil Babka, Matthew Wilcox, Ross Zwisler,
 linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 linux-block@vger.kernel.org, "Kirill A. Shutemov"
Subject: [PATCHv5 00/36] ext4: support of huge pages
Date: Tue, 29 Nov 2016 14:22:28 +0300
Message-Id: <20161129112304.90056-1-kirill.shutemov@linux.intel.com>
X-Mailer: git-send-email 2.10.2
X-Mailing-List: linux-block@vger.kernel.org

Here's a respin of my huge ext4 patchset on top of Matthew's patchset,
with a few changes and fixes (see below).

Please review and consider applying.

I don't see any xfstests regressions with huge pages enabled. A patch
with new configurations for xfstests-bld is below.

The basics are the same as with tmpfs[1], which is in Linus' tree now,
and ext4 is built on top of it. The main difference is that we need to
handle read out from and write-back to backing storage.

As with other THPs, the implementation is built around compound pages:
a naturally aligned collection of pages that the memory management
subsystem [in most cases] treats as a single entity:

  - head page (the first subpage) on LRU represents the whole huge page;
  - head page's flags represent the state of the whole huge page (with
    few exceptions);
  - mm can't migrate subpages of the compound page individually;

For THP, we use PMD-sized huge pages.

The head page links buffer heads for the whole huge page.
Dirty/writeback/etc. tracking happens on a per-hugepage level, as all
subpages share the same page flags.

lock_page() on any subpage would lock the whole hugepage for the same
reason.

In the radix tree, a huge page is represented as a multi-order entry
of the same order (HPAGE_PMD_ORDER).
This allows us to track dirty/writeback on radix-tree tags with the
same granularity as on struct page.

On IO via syscalls, we are still limited to copying up to PAGE_SIZE
per iteration. The limitation here comes from how copy_page_to_iter()
and copy_page_from_iter() work wrt. highmem: they can only handle one
small page at a time.

On the write side, we also have a problem with assuming small pages:
write length and offset within the page are calculated before we know
whether a small or huge page is allocated. It's not easy to fix. Looks
like it would require a change in the ->write_begin() interface to
accept len > PAGE_SIZE.

On split_huge_page() we need to free buffers before splitting the
page. Page buffers take an additional pin on the page and can be a
vector to mess with the page during split. We want to avoid this.
If try_to_free_buffers() fails, split_huge_page() would return -EBUSY.

Readahead doesn't play well with huge pages: 128k max readahead
window, assumptions on page size, PageReadahead() to track hit/miss.
I've got it to allocate huge pages, but it doesn't provide any
readahead as such. I don't know how to do this right. It's not clear
at this point if we really need readahead with huge pages. I guess
it's good enough for now.

Shadow entries are ignored on allocation -- a recently evicted page is
not promoted to the active list. Not sure if the current workingset
logic is adequate for huge pages. On eviction, we split the huge page
and set up 4k shadow entries as usual.

Unlike tmpfs, ext4 makes use of tags in the radix tree. The approach I
used for tmpfs -- 512 entries in the radix tree per huge page --
doesn't work well if we want to have a coherent view of the tags. So
the first patch converts tmpfs to use multi-order entries in the radix
tree. The same infrastructure is used for ext4.

Encryption doesn't handle huge pages yet. To avoid regressions we just
disable huge pages for the inode if it has EXT4_INODE_ENCRYPT.

Tested with 4k, 1k, encryption and bigalloc. All with and without
huge=always.
I think it's reasonable coverage.

The patchset is also in git:

git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git hugeext4/v5

[1] http://lkml.kernel.org/r/1465222029-45942-1-git-send-email-kirill.shutemov@linux.intel.com

Changes since v4:
  - Rebase onto updated radix-tree interface;
  - Change interface to page cache lookups wrt. multi-order entries;
  - Do not mess with BIO_MAX_PAGES: ext4_mpage_readpages() now uses
    block_read_full_page() for THP read out;
  - Fix work with memcg enabled;
  - Drop bogus VM_BUG_ON() from wp_huge_pmd();

Changes since v3:
  - account huge page to dirty/writeback/reclaimable/etc. according to
    its size. It fixes background writeback.
  - move code that adds huge page to radix-tree to
    page_cache_tree_insert() (Jan);
  - make ramdisk work with huge pages;
  - fix unaccounting of shadow entries (Jan);
  - use try_to_release_page() instead of try_to_free_buffers() in
    split_huge_page() (Jan);
  - make thp_get_unmapped_area() respect S_HUGE_MODE;
  - use huge-page aligned address to zap page range in wp_huge_pmd();
  - use ext4_kvmalloc in ext4_mpage_readpages() instead of
    kmalloc() (Andreas);

Changes since v2:
  - fix intermittent crash in generic/299;
  - typo (condition inversion) in do_generic_file_read(),
    reported by Jitendra;

TODO:
  - on IO via syscalls, copy more than PAGE_SIZE per iteration to/from
    userspace;
  - readahead ?;
  - wire up madvise()/fadvise();
  - encryption with huge pages;
  - reclaim of file huge pages can be optimized -- split_huge_page() is
    not required for pages with backing storage;

From f523dd3aad026f5a3f8cbabc0ec69958a0618f6b Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov"
Date: Fri, 12 Aug 2016 19:44:30 +0300
Subject: [PATCH] Add few more configurations to test ext4 with huge pages

Four new configurations: huge_4k, huge_1k, huge_bigalloc, huge_encrypt.

Signed-off-by: Kirill A. Shutemov
---
 .../test-appliance/files/root/fs/ext4/cfg/all.list       |  4 ++++
 .../test-appliance/files/root/fs/ext4/cfg/huge_1k        |  6 ++++++
 .../test-appliance/files/root/fs/ext4/cfg/huge_4k        |  6 ++++++
 .../test-appliance/files/root/fs/ext4/cfg/huge_bigalloc  | 14 ++++++++++++++
 .../files/root/fs/ext4/cfg/huge_bigalloc.exclude         |  7 +++++++
 .../test-appliance/files/root/fs/ext4/cfg/huge_encrypt   |  5 +++++
 .../files/root/fs/ext4/cfg/huge_encrypt.exclude          | 16 ++++++++++++++++
 kvm-xfstests/util/parse_cli                              |  1 +
 8 files changed, 59 insertions(+)
 create mode 100644 kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_1k
 create mode 100644 kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_4k
 create mode 100644 kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_bigalloc
 create mode 100644 kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_bigalloc.exclude
 create mode 100644 kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_encrypt
 create mode 100644 kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_encrypt.exclude

diff --git a/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/all.list b/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/all.list
index 7ec37f4bafaa..14a8e72d2e6e 100644
--- a/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/all.list
+++ b/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/all.list
@@ -9,3 +9,7 @@ dioread_nolock
 data_journal
 bigalloc
 bigalloc_1k
+huge_4k
+huge_1k
+huge_bigalloc
+huge_encrypt
diff --git a/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_1k b/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_1k
new file mode 100644
index 000000000000..209c76a8a6c1
--- /dev/null
+++ b/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_1k
@@ -0,0 +1,6 @@
+export FS=ext4
+export TEST_DEV=$SM_TST_DEV
+export TEST_DIR=$SM_TST_MNT
+export MKFS_OPTIONS="-q -b 1024"
+export EXT_MOUNT_OPTIONS="huge=always"
+TESTNAME="Ext4 1k block with huge pages"
diff --git a/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_4k b/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_4k
new file mode 100644
index 000000000000..bae901cb2bab
--- /dev/null
+++ b/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_4k
@@ -0,0 +1,6 @@
+export FS=ext4
+export TEST_DEV=$PRI_TST_DEV
+export TEST_DIR=$PRI_TST_MNT
+export MKFS_OPTIONS="-q"
+export EXT_MOUNT_OPTIONS="huge=always"
+TESTNAME="Ext4 4k block with huge pages"
diff --git a/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_bigalloc b/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_bigalloc
new file mode 100644
index 000000000000..b3d87562bce6
--- /dev/null
+++ b/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_bigalloc
@@ -0,0 +1,14 @@
+SIZE=large
+export MKFS_OPTIONS="-O bigalloc"
+export EXT_MOUNT_OPTIONS="huge=always"
+
+# Until we can teach xfstests the difference between cluster size and
+# block size, avoid collapse_range, insert_range, and zero_range since
+# these will fail due the fact that these operations require
+# cluster-aligned ranges.
+export FSX_AVOID="-C -I -z"
+export FSSTRESS_AVOID="-f collapse=0 -f insert=0 -f zero=0"
+export XFS_IO_AVOID="fcollapse finsert zero"
+
+TESTNAME="Ext4 4k block w/bigalloc"
+
diff --git a/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_bigalloc.exclude b/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_bigalloc.exclude
new file mode 100644
index 000000000000..bd779be99518
--- /dev/null
+++ b/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_bigalloc.exclude
@@ -0,0 +1,7 @@
+# bigalloc does not support on-line defrag
+ext4/301
+ext4/302
+ext4/303
+ext4/304
+ext4/307
+ext4/308
diff --git a/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_encrypt b/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_encrypt
new file mode 100644
index 000000000000..29f058ba937d
--- /dev/null
+++ b/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_encrypt
@@ -0,0 +1,5 @@
+SIZE=small
+export MKFS_OPTIONS=""
+export EXT_MOUNT_OPTIONS="test_dummy_encryption,huge=always"
+REQUIRE_FEATURE=encryption
+TESTNAME="Ext4 encryption"
diff --git a/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_encrypt.exclude b/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_encrypt.exclude
new file mode 100644
index 000000000000..b91cc58b5aa3
--- /dev/null
+++ b/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_encrypt.exclude
@@ -0,0 +1,16 @@
+ext4/004 # dump/restore doesn't handle quotas
+
+# encryption doesn't play well with quota
+generic/082
+generic/219
+generic/230
+generic/231
+generic/232
+generic/233
+generic/235
+generic/270
+
+# generic/204 tests ENOSPC handling; it doesn't correctly
+# anticipate the external extended attribute required when
+# using a 1k block size
+generic/204
diff --git a/kvm-xfstests/util/parse_cli b/kvm-xfstests/util/parse_cli
index 83400ea71985..ba64ce5df016 100644
--- a/kvm-xfstests/util/parse_cli
+++ b/kvm-xfstests/util/parse_cli
@@ -36,6 +36,7 @@ print_help ()
 	echo "Common file system configurations are:"
 	echo " 4k 1k ext3 nojournal ext3conv metacsum dioread_nolock "
 	echo " data_journal bigalloc bigalloc_1k inline"
+	echo " huge_4k huge_1k huge_bigalloc huge_encrypt"
 	echo ""
 	echo "xfstest names have the form: ext4/NNN generic/NNN shared/NNN"
 	echo ""