From patchwork Wed Dec 4 14:06:52 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Nikos Tsironis X-Patchwork-Id: 11272987 X-Patchwork-Delegate: snitzer@redhat.com Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id A73E66C1 for ; Wed, 4 Dec 2019 14:07:33 +0000 (UTC) Received: from us-smtp-1.mimecast.com (us-smtp-delivery-1.mimecast.com [205.139.110.120]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 66278206DF for ; Wed, 4 Dec 2019 14:07:33 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="UYH+ufNF" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 66278206DF Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=arrikto.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=dm-devel-bounces@redhat.com DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1575468452; h=from:from:sender:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:list-id:list-help: list-unsubscribe:list-subscribe:list-post; bh=uoz1ECgV1jPk1Ug2BQGL+FW1rWauMVVeFS0AgYdHgc0=; b=UYH+ufNFitL/W57AqClBShtNoZ6ACzsLx20EJIJbamXwnRPBXuIjpba/kPRLFkI4DfjcZS aQuaPKRgBxlF205SFO3Cj7aqWDPblO/WvSvsM3nrGLjL0TvdqjIyvjnJAm8qM29UYy41ix YDI5XS7b/Ox+HdJBywiztRBNCiVwvOU= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-360--CBU0gZhMU2kJZeF-0jTXQ-1; Wed, 04 Dec 2019 09:07:30 -0500 Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.phx2.redhat.com [10.5.11.16]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 4D2BE8005BF; Wed, 4 Dec 2019 14:07:26 +0000 (UTC) Received: from colo-mx.corp.redhat.com (colo-mx01.intmail.prod.int.phx2.redhat.com [10.5.11.20]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 296AF5C21B; Wed, 4 Dec 2019 14:07:26 +0000 (UTC) Received: from lists01.pubmisc.prod.ext.phx2.redhat.com (lists01.pubmisc.prod.ext.phx2.redhat.com [10.5.19.33]) by colo-mx.corp.redhat.com (Postfix) with ESMTP id E177C18089CE; Wed, 4 Dec 2019 14:07:25 +0000 (UTC) Received: from smtp.corp.redhat.com (int-mx03.intmail.prod.int.rdu2.redhat.com [10.11.54.3]) by lists01.pubmisc.prod.ext.phx2.redhat.com (8.13.8/8.13.8) with ESMTP id xB4E79AA006868 for ; Wed, 4 Dec 2019 09:07:09 -0500 Received: by smtp.corp.redhat.com (Postfix) id 872D910545F0; Wed, 4 Dec 2019 14:07:09 +0000 (UTC) Delivered-To: dm-devel@redhat.com Received: from mimecast-mx02.redhat.com (mimecast03.extmail.prod.ext.rdu2.redhat.com [10.11.55.19]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 81AB910545EF for ; Wed, 4 Dec 2019 14:07:07 +0000 (UTC) Received: from us-smtp-1.mimecast.com (us-smtp-1.mimecast.com [207.211.31.81]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-SHA384 (256/256 bits)) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 33848801036 for ; Wed, 4 Dec 2019 14:07:07 +0000 (UTC) Received: from mail-lf1-f67.google.com (mail-lf1-f67.google.com [209.85.167.67]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-147-fHivghJ2OmunSUrmAzDJgw-1; Wed, 04 Dec 2019 09:07:03 -0500 Received: by mail-lf1-f67.google.com with SMTP id y5so6249513lfy.7 for ; Wed, 04 Dec 2019 06:07:03 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=ZP96pYpb+yJSS/U166/rrikcYlNmw66MMOATGiY9JSc=; b=UlrJPdcfssT3I0aRn0T0z9TpbhjZiZarvgezfA5GdqBQ2qozTqmopwhNqA7f2mZpap 3qQqZlgTGCijg++iQUzUXoHEmYYfKRegixnLfNWGgAcPRzxotNxy74fsgqrL+s6lycE6 dfrLjHD7lNEhSYf6IYNU52lP589qHhExzX3xcfVhAiOAYa3nZzj58RFUUdA45l4oK1HH ilRthQ8Zk1LBaG8bsNwJo+rGq58Paf+PEBTGkeS/MWx3omUSbPz1Du9SE8VRGU3A92Ta v+z7aiANZmik9uxMBu/7xdM0I3MS1Ux7MeUbfNlQyqwTybwtF9r2egbJuxy8PXaQri8S hdBw== X-Gm-Message-State: APjAAAVeT+N0pshOoH3MD0Bd2jDZcnbkY3yPbDsMsMYTK3+7X33/pKSC p4GxO4al9El7FiGDyAjwd1Pfpw== X-Google-Smtp-Source: APXvYqyLY3TxWYUrO/xeC+19Os4vdfm2f764wvpJqsM39mJdpvARqYDng271SiOls/tHgLHhww9MYg== X-Received: by 2002:ac2:5b86:: with SMTP id o6mr2212526lfn.44.1575468421832; Wed, 04 Dec 2019 06:07:01 -0800 (PST) Received: from snf-864.vm.snf.arr ([31.177.62.212]) by smtp.gmail.com with ESMTPSA id i19sm3335533ljj.24.2019.12.04.06.07.00 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 04 Dec 2019 06:07:01 -0800 (PST) From: Nikos Tsironis To: snitzer@redhat.com, agk@redhat.com, dm-devel@redhat.com Date: Wed, 4 Dec 2019 16:06:52 +0200 Message-Id: <20191204140654.26214-2-ntsironis@arrikto.com> In-Reply-To: <20191204140654.26214-1-ntsironis@arrikto.com> References: <20191204140654.26214-1-ntsironis@arrikto.com> X-MC-Unique: fHivghJ2OmunSUrmAzDJgw-1 X-MC-Unique: -CBU0gZhMU2kJZeF-0jTXQ-1 X-Scanned-By: MIMEDefang 2.78 on 10.11.54.3 X-MIME-Autoconverted: from quoted-printable to 8bit by lists01.pubmisc.prod.ext.phx2.redhat.com id xB4E79AA006868 X-loop: dm-devel@redhat.com Cc: vkoukis@arrikto.com, ntsironis@arrikto.com, iliastsi@arrikto.com Subject: [dm-devel] [PATCH 1/3] dm clone metadata: Track exact changes per transaction X-BeenThere: dm-devel@redhat.com X-Mailman-Version: 2.1.12 Precedence: junk List-Id: device-mapper development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Sender: dm-devel-bounces@redhat.com Errors-To: dm-devel-bounces@redhat.com X-Scanned-By: MIMEDefang 2.79 on 10.5.11.16 X-Mimecast-Spam-Score: 0 Extend struct dirty_map with a second bitmap which tracks the exact regions that were hydrated during the current metadata transaction. Moreover, fix __flush_dmap() to only commit the metadata of the regions that were hydrated during the current transaction. This is required by the following commits to fix a data corruption bug. Fixes: 7431b7835f55 ("dm: add clone target") Cc: stable@vger.kernel.org # v5.4+ Signed-off-by: Nikos Tsironis --- drivers/md/dm-clone-metadata.c | 90 +++++++++++++++++++++++++++++------------- 1 file changed, 62 insertions(+), 28 deletions(-) diff --git a/drivers/md/dm-clone-metadata.c b/drivers/md/dm-clone-metadata.c index 08c552e5e41b..ee870a425ab8 100644 --- a/drivers/md/dm-clone-metadata.c +++ b/drivers/md/dm-clone-metadata.c @@ -67,23 +67,34 @@ struct superblock_disk { * To save constantly doing look ups on disk we keep an in core copy of the * on-disk bitmap, the region_map. * - * To further reduce metadata I/O overhead we use a second bitmap, the dmap - * (dirty bitmap), which tracks the dirty words, i.e. longs, of the region_map. + * In order to track which regions are hydrated during a metadata transaction, + * we use a second set of bitmaps, the dmap (dirty bitmap), which includes two + * bitmaps, namely dirty_regions and dirty_words. The dirty_regions bitmap + * tracks the regions that got hydrated during the current metadata + * transaction. The dirty_words bitmap tracks the dirty words, i.e. longs, of + * the dirty_regions bitmap. + * + * This allows us to precisely track the regions that were hydrated during the + * current metadata transaction and update the metadata accordingly, when we + * commit the current transaction. This is important because dm-clone should + * only commit the metadata of regions that were properly flushed to the + * destination device beforehand. Otherwise, in case of a crash, we could end + * up with a corrupted dm-clone device. * * When a region finishes hydrating dm-clone calls * dm_clone_set_region_hydrated(), or for discard requests * dm_clone_cond_set_range(), which sets the corresponding bits in region_map * and dmap. * - * During a metadata commit we scan the dmap for dirty region_map words (longs) - * and update accordingly the on-disk metadata. Thus, we don't have to flush to - * disk the whole region_map. We can just flush the dirty region_map words. + * During a metadata commit we scan dmap->dirty_words and dmap->dirty_regions + * and update the on-disk metadata accordingly. Thus, we don't have to flush to + * disk the whole region_map. We can just flush the dirty region_map bits. * - * We use a dirty bitmap, which is smaller than the original region_map, to - * reduce the amount of memory accesses during a metadata commit. As dm-bitset - * accesses the on-disk bitmap in 64-bit word granularity, there is no - * significant benefit in tracking the dirty region_map bits with a smaller - * granularity. + * We use the helper dmap->dirty_words bitmap, which is smaller than the + * original region_map, to reduce the amount of memory accesses during a + * metadata commit. Moreover, as dm-bitset also accesses the on-disk bitmap in + * 64-bit word granularity, the dirty_words bitmap helps us avoid useless disk + * accesses. * * We could update directly the on-disk bitmap, when dm-clone calls either * dm_clone_set_region_hydrated() or dm_clone_cond_set_range(), buts this @@ -92,12 +103,13 @@ struct superblock_disk { * e.g., in a hooked overwrite bio's completion routine, and further reduce the * I/O completion latency. * - * We maintain two dirty bitmaps. During a metadata commit we atomically swap - * the currently used dmap with the unused one. This allows the metadata update - * functions to run concurrently with an ongoing commit. + * We maintain two dirty bitmap sets. During a metadata commit we atomically + * swap the currently used dmap with the unused one. This allows the metadata + * update functions to run concurrently with an ongoing commit. */ struct dirty_map { unsigned long *dirty_words; + unsigned long *dirty_regions; unsigned int changed; }; @@ -461,22 +473,40 @@ static size_t bitmap_size(unsigned long nr_bits) return BITS_TO_LONGS(nr_bits) * sizeof(long); } -static int dirty_map_init(struct dm_clone_metadata *cmd) +static int __dirty_map_init(struct dirty_map *dmap, unsigned long nr_words, + unsigned long nr_regions) { - cmd->dmap[0].changed = 0; - cmd->dmap[0].dirty_words = kvzalloc(bitmap_size(cmd->nr_words), GFP_KERNEL); + dmap->changed = 0; - if (!cmd->dmap[0].dirty_words) { - DMERR("Failed to allocate dirty bitmap"); + dmap->dirty_words = kvzalloc(bitmap_size(nr_words), GFP_KERNEL); + if (!dmap->dirty_words) + return -ENOMEM; + + dmap->dirty_regions = kvzalloc(bitmap_size(nr_regions), GFP_KERNEL); + if (!dmap->dirty_regions) { + kvfree(dmap->dirty_words); return -ENOMEM; } - cmd->dmap[1].changed = 0; - cmd->dmap[1].dirty_words = kvzalloc(bitmap_size(cmd->nr_words), GFP_KERNEL); + return 0; +} - if (!cmd->dmap[1].dirty_words) { +static void __dirty_map_exit(struct dirty_map *dmap) +{ + kvfree(dmap->dirty_words); + kvfree(dmap->dirty_regions); +} + +static int dirty_map_init(struct dm_clone_metadata *cmd) +{ + if (__dirty_map_init(&cmd->dmap[0], cmd->nr_words, cmd->nr_regions)) { DMERR("Failed to allocate dirty bitmap"); - kvfree(cmd->dmap[0].dirty_words); + return -ENOMEM; + } + + if (__dirty_map_init(&cmd->dmap[1], cmd->nr_words, cmd->nr_regions)) { + DMERR("Failed to allocate dirty bitmap"); + __dirty_map_exit(&cmd->dmap[0]); return -ENOMEM; } @@ -487,8 +517,8 @@ static int dirty_map_init(struct dm_clone_metadata *cmd) static void dirty_map_exit(struct dm_clone_metadata *cmd) { - kvfree(cmd->dmap[0].dirty_words); - kvfree(cmd->dmap[1].dirty_words); + __dirty_map_exit(&cmd->dmap[0]); + __dirty_map_exit(&cmd->dmap[1]); } static int __load_bitset_in_core(struct dm_clone_metadata *cmd) @@ -633,21 +663,23 @@ unsigned long dm_clone_find_next_unhydrated_region(struct dm_clone_metadata *cmd return find_next_zero_bit(cmd->region_map, cmd->nr_regions, start); } -static int __update_metadata_word(struct dm_clone_metadata *cmd, unsigned long word) +static int __update_metadata_word(struct dm_clone_metadata *cmd, + unsigned long *dirty_regions, + unsigned long word) { int r; unsigned long index = word * BITS_PER_LONG; unsigned long max_index = min(cmd->nr_regions, (word + 1) * BITS_PER_LONG); while (index < max_index) { - if (test_bit(index, cmd->region_map)) { + if (test_bit(index, dirty_regions)) { r = dm_bitset_set_bit(&cmd->bitset_info, cmd->bitset_root, index, &cmd->bitset_root); - if (r) { DMERR("dm_bitset_set_bit failed"); return r; } + __clear_bit(index, dirty_regions); } index++; } @@ -721,7 +753,7 @@ static int __flush_dmap(struct dm_clone_metadata *cmd, struct dirty_map *dmap) if (word == cmd->nr_words) break; - r = __update_metadata_word(cmd, word); + r = __update_metadata_word(cmd, dmap->dirty_regions, word); if (r) return r; @@ -802,6 +834,7 @@ int dm_clone_set_region_hydrated(struct dm_clone_metadata *cmd, unsigned long re dmap = cmd->current_dmap; __set_bit(word, dmap->dirty_words); + __set_bit(region_nr, dmap->dirty_regions); __set_bit(region_nr, cmd->region_map); dmap->changed = 1; @@ -830,6 +863,7 @@ int dm_clone_cond_set_range(struct dm_clone_metadata *cmd, unsigned long start, if (!test_bit(region_nr, cmd->region_map)) { word = region_nr / BITS_PER_LONG; __set_bit(word, dmap->dirty_words); + __set_bit(region_nr, dmap->dirty_regions); __set_bit(region_nr, cmd->region_map); dmap->changed = 1; } From patchwork Wed Dec 4 14:06:53 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Nikos Tsironis X-Patchwork-Id: 11272985 X-Patchwork-Delegate: snitzer@redhat.com Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 4F15B159A for ; Wed, 4 Dec 2019 14:07:21 +0000 (UTC) Received: from us-smtp-delivery-1.mimecast.com (us-smtp-1.mimecast.com [205.139.110.61]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 0F502206DF for ; Wed, 4 Dec 2019 14:07:20 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="GzZ4YS3E" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 0F502206DF Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=arrikto.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=dm-devel-bounces@redhat.com DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1575468439; h=from:from:sender:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:list-id:list-help: list-unsubscribe:list-subscribe:list-post; bh=c4f1kxGNaqNH4SLpCsDxfJkMjSGoeIijyAdZtI4+BSw=; b=GzZ4YS3EXkrMs8zJb9lGpUsarxQHjn7/SYW9+tr+gHYvYaBAnImAFBkKGtOfOLFHh8Entb j/z5l4LsmsrGtUk1XidbSvssnwHInmSQSR51GARBTrSxcNFF9EaJpFQB6COdJEWP0SKSXP Un7KSTE/G9Apz/3tPpKNTRzbNM/wPYg= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-24-7JB24jlIPgWT0CxMR8gC4w-1; Wed, 04 Dec 2019 09:07:17 -0500 Received: from smtp.corp.redhat.com (int-mx07.intmail.prod.int.phx2.redhat.com [10.5.11.22]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 7A2177A0FA; Wed, 4 Dec 2019 14:07:13 +0000 (UTC) Received: from colo-mx.corp.redhat.com (colo-mx01.intmail.prod.int.phx2.redhat.com [10.5.11.20]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 1040410842CD; Wed, 4 Dec 2019 14:07:13 +0000 (UTC) Received: from lists01.pubmisc.prod.ext.phx2.redhat.com (lists01.pubmisc.prod.ext.phx2.redhat.com [10.5.19.33]) by colo-mx.corp.redhat.com (Postfix) with ESMTP id 5011E18089C8; Wed, 4 Dec 2019 14:07:12 +0000 (UTC) Received: from smtp.corp.redhat.com (int-mx04.intmail.prod.int.rdu2.redhat.com [10.11.54.4]) by lists01.pubmisc.prod.ext.phx2.redhat.com (8.13.8/8.13.8) with ESMTP id xB4E7BT1006878 for ; Wed, 4 Dec 2019 09:07:11 -0500 Received: by smtp.corp.redhat.com (Postfix) id 2649A2026D67; Wed, 4 Dec 2019 14:07:11 +0000 (UTC) Delivered-To: dm-devel@redhat.com Received: from mimecast-mx02.redhat.com (mimecast06.extmail.prod.ext.rdu2.redhat.com [10.11.55.22]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 214FE2026D68 for ; Wed, 4 Dec 2019 14:07:09 +0000 (UTC) Received: from us-smtp-1.mimecast.com (us-smtp-delivery-1.mimecast.com [207.211.31.120]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-SHA384 (256/256 bits)) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id D6426185B0B1 for ; Wed, 4 Dec 2019 14:07:08 +0000 (UTC) Received: from mail-lj1-f195.google.com (mail-lj1-f195.google.com [209.85.208.195]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-383-aFYrLxDuOG-yHCknbcToRQ-1; Wed, 04 Dec 2019 09:07:04 -0500 Received: by mail-lj1-f195.google.com with SMTP id e10so8260108ljj.6 for ; Wed, 04 Dec 2019 06:07:04 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=2RpnrFCFR6N50EhDrGlni6CocyYnr33aVI9CHVRmkT0=; b=TxEr3yvMemHCbPAJvULoTaBWjGT2V0mpSZi2G2MjOPBlQIShlyK4WTytEsZ2vpPsCs +Ou6Ji9Lq1BOcrMcmNKyPXO7ghqH8jJc46JxD647hQFq9dlSHxtWZh8PDE+OFw5y/S5+ Swgwu9lp7yllizMFDCkdPZgGsTrUyukO4cm2/fsfTMXXvwPj8/qYT25/UrvwMUBCiW8A Be6EHasJ3bmyrvCvcr7pxUgZmBfy/Sd096AZubhvJhCUWVNVWVHOwzrtNaiw+DZ5TX0u PVgMp6WGL3J1AZwwdhFW8oEuAFk1OSvy4TSUsNnSkckWgaDVX0ktUg2ymUUXbfGoCwAz sR6w== X-Gm-Message-State: APjAAAUAAeYUOEyeyIraehLum3UCzBSkFSUzHTVK+zfh3i5iWIFuKDG6 RxV0yQ/K/RqbMSsd7LFbQGVJ+w== X-Google-Smtp-Source: APXvYqwwra2W3KLdl8GmJZAwG6qbUHEkG2TsLgNiLkcuZGuSFk9q3fJO0MRXuAM2R2KlEaUIV0YR9w== X-Received: by 2002:a2e:9a55:: with SMTP id k21mr1859978ljj.85.1575468423110; Wed, 04 Dec 2019 06:07:03 -0800 (PST) Received: from snf-864.vm.snf.arr ([31.177.62.212]) by smtp.gmail.com with ESMTPSA id i19sm3335533ljj.24.2019.12.04.06.07.01 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 04 Dec 2019 06:07:02 -0800 (PST) From: Nikos Tsironis To: snitzer@redhat.com, agk@redhat.com, dm-devel@redhat.com Date: Wed, 4 Dec 2019 16:06:53 +0200 Message-Id: <20191204140654.26214-3-ntsironis@arrikto.com> In-Reply-To: <20191204140654.26214-1-ntsironis@arrikto.com> References: <20191204140654.26214-1-ntsironis@arrikto.com> X-MC-Unique: aFYrLxDuOG-yHCknbcToRQ-1 X-MC-Unique: 7JB24jlIPgWT0CxMR8gC4w-1 X-Scanned-By: MIMEDefang 2.78 on 10.11.54.4 X-MIME-Autoconverted: from quoted-printable to 8bit by lists01.pubmisc.prod.ext.phx2.redhat.com id xB4E7BT1006878 X-loop: dm-devel@redhat.com Cc: vkoukis@arrikto.com, ntsironis@arrikto.com, iliastsi@arrikto.com Subject: [dm-devel] [PATCH 2/3] dm clone metadata: Use a two phase commit X-BeenThere: dm-devel@redhat.com X-Mailman-Version: 2.1.12 Precedence: junk List-Id: device-mapper development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Sender: dm-devel-bounces@redhat.com Errors-To: dm-devel-bounces@redhat.com X-Scanned-By: MIMEDefang 2.84 on 10.5.11.22 X-Mimecast-Spam-Score: 0 Split the metadata commit in two parts: 1. dm_clone_metadata_pre_commit(): Prepare the current transaction for committing. After this is called, all subsequent metadata updates, done through either dm_clone_set_region_hydrated() or dm_clone_cond_set_range(), will be part of the next transaction. 2. dm_clone_metadata_commit(): Actually commit the current transaction to disk and start a new transaction. This is required by the following commit. It allows dm-clone to flush the destination device after step (1) to ensure that all freshly hydrated regions, for which we are updating the metadata, are properly written to non-volatile storage and won't be lost in case of a crash. Fixes: 7431b7835f55 ("dm: add clone target") Cc: stable@vger.kernel.org # v5.4+ Signed-off-by: Nikos Tsironis --- drivers/md/dm-clone-metadata.c | 46 +++++++++++++++++++++++++++++++++--------- drivers/md/dm-clone-metadata.h | 17 ++++++++++++++++ drivers/md/dm-clone-target.c | 7 ++++++- 3 files changed, 60 insertions(+), 10 deletions(-) diff --git a/drivers/md/dm-clone-metadata.c b/drivers/md/dm-clone-metadata.c index ee870a425ab8..c05b12110456 100644 --- a/drivers/md/dm-clone-metadata.c +++ b/drivers/md/dm-clone-metadata.c @@ -127,6 +127,9 @@ struct dm_clone_metadata { struct dirty_map dmap[2]; struct dirty_map *current_dmap; + /* Protected by lock */ + struct dirty_map *committing_dmap; + /* * In core copy of the on-disk bitmap to save constantly doing look ups * on disk. @@ -511,6 +514,7 @@ static int dirty_map_init(struct dm_clone_metadata *cmd) } cmd->current_dmap = &cmd->dmap[0]; + cmd->committing_dmap = NULL; return 0; } @@ -775,15 +779,17 @@ static int __flush_dmap(struct dm_clone_metadata *cmd, struct dirty_map *dmap) return 0; } -int dm_clone_metadata_commit(struct dm_clone_metadata *cmd) +int dm_clone_metadata_pre_commit(struct dm_clone_metadata *cmd) { - int r = -EPERM; + int r = 0; struct dirty_map *dmap, *next_dmap; down_write(&cmd->lock); - if (cmd->fail_io || dm_bm_is_read_only(cmd->bm)) + if (cmd->fail_io || dm_bm_is_read_only(cmd->bm)) { + r = -EPERM; goto out; + } /* Get current dirty bitmap */ dmap = cmd->current_dmap; @@ -795,7 +801,7 @@ int dm_clone_metadata_commit(struct dm_clone_metadata *cmd) * The last commit failed, so we don't have a clean dirty-bitmap to * use. */ - if (WARN_ON(next_dmap->changed)) { + if (WARN_ON(next_dmap->changed || cmd->committing_dmap)) { r = -EINVAL; goto out; } @@ -805,11 +811,33 @@ int dm_clone_metadata_commit(struct dm_clone_metadata *cmd) cmd->current_dmap = next_dmap; spin_unlock_irq(&cmd->bitmap_lock); - /* - * No one is accessing the old dirty bitmap anymore, so we can flush - * it. - */ - r = __flush_dmap(cmd, dmap); + /* Set old dirty bitmap as currently committing */ + cmd->committing_dmap = dmap; +out: + up_write(&cmd->lock); + + return r; +} + +int dm_clone_metadata_commit(struct dm_clone_metadata *cmd) +{ + int r = -EPERM; + + down_write(&cmd->lock); + + if (cmd->fail_io || dm_bm_is_read_only(cmd->bm)) + goto out; + + if (WARN_ON(!cmd->committing_dmap)) { + r = -EINVAL; + goto out; + } + + r = __flush_dmap(cmd, cmd->committing_dmap); + if (!r) { + /* Clear committing dmap */ + cmd->committing_dmap = NULL; + } out: up_write(&cmd->lock); diff --git a/drivers/md/dm-clone-metadata.h b/drivers/md/dm-clone-metadata.h index 3fe50a781c11..14af1ebd853f 100644 --- a/drivers/md/dm-clone-metadata.h +++ b/drivers/md/dm-clone-metadata.h @@ -75,7 +75,23 @@ void dm_clone_metadata_close(struct dm_clone_metadata *cmd); /* * Commit dm-clone metadata to disk. + * + * We use a two phase commit: + * + * 1. dm_clone_metadata_pre_commit(): Prepare the current transaction for + * committing. After this is called, all subsequent metadata updates, done + * through either dm_clone_set_region_hydrated() or + * dm_clone_cond_set_range(), will be part of the **next** transaction. + * + * 2. dm_clone_metadata_commit(): Actually commit the current transaction to + * disk and start a new transaction. + * + * This allows dm-clone to flush the destination device after step (1) to + * ensure that all freshly hydrated regions, for which we are updating the + * metadata, are properly written to non-volatile storage and won't be lost in + * case of a crash. */ +int dm_clone_metadata_pre_commit(struct dm_clone_metadata *cmd); int dm_clone_metadata_commit(struct dm_clone_metadata *cmd); /* @@ -112,6 +128,7 @@ int dm_clone_metadata_abort(struct dm_clone_metadata *cmd); * Switches metadata to a read only mode. Once read-only mode has been entered * the following functions will return -EPERM: * + * dm_clone_metadata_pre_commit() * dm_clone_metadata_commit() * dm_clone_set_region_hydrated() * dm_clone_cond_set_range() diff --git a/drivers/md/dm-clone-target.c b/drivers/md/dm-clone-target.c index b3d89072d21c..613c913c296c 100644 --- a/drivers/md/dm-clone-target.c +++ b/drivers/md/dm-clone-target.c @@ -1122,8 +1122,13 @@ static int commit_metadata(struct clone *clone) goto out; } - r = dm_clone_metadata_commit(clone->cmd); + r = dm_clone_metadata_pre_commit(clone->cmd); + if (unlikely(r)) { + __metadata_operation_failed(clone, "dm_clone_metadata_pre_commit", r); + goto out; + } + r = dm_clone_metadata_commit(clone->cmd); if (unlikely(r)) { __metadata_operation_failed(clone, "dm_clone_metadata_commit", r); goto out; From patchwork Wed Dec 4 14:06:54 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Nikos Tsironis X-Patchwork-Id: 11272989 X-Patchwork-Delegate: snitzer@redhat.com Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 8AA606C1 for ; Wed, 4 Dec 2019 14:07:40 +0000 (UTC) Received: from us-smtp-delivery-1.mimecast.com (us-smtp-1.mimecast.com [207.211.31.81]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 47F1F2073C for ; Wed, 4 Dec 2019 14:07:40 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="VcG1Lotf" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 47F1F2073C Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=arrikto.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=dm-devel-bounces@redhat.com DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1575468459; h=from:from:sender:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:list-id:list-help: list-unsubscribe:list-subscribe:list-post; bh=xwKZOY6WgEX7ym2okmUxvIUUY6+djd9v7IFlCcy9mxw=; b=VcG1LotfUBoKgQeTvj/XkmtFt97uDWnbRmqTJ89rd93Lt1WdAxCiSvG/RcVktn9wO4XKJD hJr8LDWBFGq2iF47kWr8wnTBtoIqijVeExEYF9mb48wyUKeip1YeeYKJzWuNu2jZAN3rYe apw5j81nm31EPUuO+Z7CLINeYU8wqZc= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-205--Ate9EkiN-6LUwGdjCfBCA-1; Wed, 04 Dec 2019 09:07:37 -0500 Received: from smtp.corp.redhat.com (int-mx07.intmail.prod.int.phx2.redhat.com [10.5.11.22]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 52543990AD; Wed, 4 Dec 2019 14:07:29 +0000 (UTC) Received: from colo-mx.corp.redhat.com (colo-mx01.intmail.prod.int.phx2.redhat.com [10.5.11.20]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 300921001B23; Wed, 4 Dec 2019 14:07:29 +0000 (UTC) Received: from lists01.pubmisc.prod.ext.phx2.redhat.com (lists01.pubmisc.prod.ext.phx2.redhat.com [10.5.19.33]) by colo-mx.corp.redhat.com (Postfix) with ESMTP id EC2A118089CF; Wed, 4 Dec 2019 14:07:28 +0000 (UTC) Received: from smtp.corp.redhat.com (int-mx03.intmail.prod.int.rdu2.redhat.com [10.11.54.3]) by lists01.pubmisc.prod.ext.phx2.redhat.com (8.13.8/8.13.8) with ESMTP id xB4E7Bn6006891 for ; Wed, 4 Dec 2019 09:07:11 -0500 Received: by smtp.corp.redhat.com (Postfix) id B19AA10545F5; Wed, 4 Dec 2019 14:07:11 +0000 (UTC) Delivered-To: dm-devel@redhat.com Received: from mimecast-mx02.redhat.com (mimecast01.extmail.prod.ext.rdu2.redhat.com [10.11.55.17]) by smtp.corp.redhat.com (Postfix) with ESMTPS id AB3E410545EF for ; Wed, 4 Dec 2019 14:07:08 +0000 (UTC) Received: from us-smtp-1.mimecast.com (us-smtp-1.mimecast.com [205.139.110.61]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-SHA384 (256/256 bits)) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 6462585A32B for ; Wed, 4 Dec 2019 14:07:08 +0000 (UTC) Received: from mail-lj1-f196.google.com (mail-lj1-f196.google.com [209.85.208.196]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-224-uFTvmP6MOaWAZ7Xkee5zrg-1; Wed, 04 Dec 2019 09:07:06 -0500 Received: by mail-lj1-f196.google.com with SMTP id c19so8200037lji.11 for ; Wed, 04 Dec 2019 06:07:05 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=2FKSdUh5oYjVbB/mhP4PZCnVvTtHIBBs/6ko75iZmOM=; b=qrXHqqXuq2YZX1M1+17w8077Nr6Outb9DbqfqjzpiomJYsFUIwCMuhD/EgGGtzpyjQ mw9mpBcyZtKp9kmqoVoB0hBJiqZYnnQEIEtOck9mOA59xrpc8f0x7CrUz2u3N+HYfgmo Pq0uA75kAN/0+CFUIqZWvy3SRy+oU/SvF+7aBpwaIuZUu8Vph0Q6D4rOYgGHSbyMIb+D +xTARfQq7+I0UL0Zrd8Lq5ONnFuGyqmSHdiLQG8HHombKhp+Y3vXLEA0R9swExcCu3Qh Isgp+jLBUsLEnftGdT7zPiEVfIvtqrEf1vhYx+Dv267jW22P09lDuvpO1wKbkAPQijV/ CzXw== X-Gm-Message-State: APjAAAXVrAASUjSvcF3msVP1VLFiK7thZeCVy97ElrwXatN8tm7f1anb lNbJ71AVRTPg8Gu9iXih94mDhA== X-Google-Smtp-Source: APXvYqyYrseq4vR0tB//bjOFtvkyicIWXAs6c75VRJ5g9+POmSx4Ab7aEXMmKpJruFgHEwLEGAjrQg== X-Received: by 2002:a2e:58c:: with SMTP id 134mr2206481ljf.12.1575468424465; Wed, 04 Dec 2019 06:07:04 -0800 (PST) Received: from snf-864.vm.snf.arr ([31.177.62.212]) by smtp.gmail.com with ESMTPSA id i19sm3335533ljj.24.2019.12.04.06.07.03 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 04 Dec 2019 06:07:03 -0800 (PST) From: Nikos Tsironis To: snitzer@redhat.com, agk@redhat.com, dm-devel@redhat.com Date: Wed, 4 Dec 2019 16:06:54 +0200 Message-Id: <20191204140654.26214-4-ntsironis@arrikto.com> In-Reply-To: <20191204140654.26214-1-ntsironis@arrikto.com> References: <20191204140654.26214-1-ntsironis@arrikto.com> X-MC-Unique: uFTvmP6MOaWAZ7Xkee5zrg-1 X-MC-Unique: -Ate9EkiN-6LUwGdjCfBCA-1 X-Scanned-By: MIMEDefang 2.78 on 10.11.54.3 X-MIME-Autoconverted: from quoted-printable to 8bit by lists01.pubmisc.prod.ext.phx2.redhat.com id xB4E7Bn6006891 X-loop: dm-devel@redhat.com Cc: vkoukis@arrikto.com, ntsironis@arrikto.com, iliastsi@arrikto.com Subject: [dm-devel] [PATCH 3/3] dm clone: Flush destination device before committing metadata X-BeenThere: dm-devel@redhat.com X-Mailman-Version: 2.1.12 Precedence: junk List-Id: device-mapper development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Sender: dm-devel-bounces@redhat.com Errors-To: dm-devel-bounces@redhat.com X-Scanned-By: MIMEDefang 2.84 on 10.5.11.22 X-Mimecast-Spam-Score: 0 dm-clone maintains an on-disk bitmap which records which regions are valid in the destination device, i.e., which regions have already been hydrated, or have been written to directly, via user I/O. Setting a bit in the on-disk bitmap meas the corresponding region is valid in the destination device and we redirect all I/O regarding it to the destination device. Suppose the destination device has a volatile write-back cache and the following sequence of events occur: 1. A region gets hydrated, either through the background hydration or because it was written to directly, via user I/O. 2. The commit timeout expires and we commit the metadata, marking that region as valid in the destination device. 3. The system crashes and the destination device's cache has not been flushed, meaning the region's data are lost. The next time we read that region we read it from the destination device, since the metadata have been successfully committed, but the data are lost due to the crash, so we read garbage instead of the old data. This has several implications: 1. In case of background hydration or of writes with size smaller than the region size (which means we first copy the whole region and then issue the smaller write), we corrupt data that the user never touched. 2. In case of writes with size equal to the device's logical block size, we fail to provide atomic sector writes. When the system recovers the user will read garbage from the sector instead of the old data or the new data. 3. In case of writes without the FUA flag set, after the system recovers, the written sectors will contain garbage instead of a random mix of sectors containing either old data or new data, thus we fail again to provide atomic sector writes. 4. Even when the user flushes the dm-clone device, because we first commit the metadata and then pass down the flush, the same risk for corruption exists (if the system crashes after the metadata have been committed but before the flush is passed down). The only case which is unaffected is that of writes with size equal to the region size and with the FUA flag set. But, because FUA writes trigger metadata commits, this case can trigger the corruption indirectly. To solve this and avoid the potential data corruption we flush the destination device **before** committing the metadata. This ensures that any freshly hydrated regions, for which we commit the metadata, are properly written to non-volatile storage and won't be lost in case of a crash. Fixes: 7431b7835f55 ("dm: add clone target") Cc: stable@vger.kernel.org # v5.4+ Signed-off-by: Nikos Tsironis --- drivers/md/dm-clone-target.c | 46 ++++++++++++++++++++++++++++++++++++++------ 1 file changed, 40 insertions(+), 6 deletions(-) diff --git a/drivers/md/dm-clone-target.c b/drivers/md/dm-clone-target.c index 613c913c296c..d1e1b5b56b1b 100644 --- a/drivers/md/dm-clone-target.c +++ b/drivers/md/dm-clone-target.c @@ -86,6 +86,12 @@ struct clone { struct dm_clone_metadata *cmd; + /* + * bio used to flush the destination device, before committing the + * metadata. + */ + struct bio flush_bio; + /* Region hydration hash table */ struct hash_table_bucket *ht; @@ -1108,10 +1114,13 @@ static bool need_commit_due_to_time(struct clone *clone) /* * A non-zero return indicates read-only or fail mode. */ -static int commit_metadata(struct clone *clone) +static int commit_metadata(struct clone *clone, bool *dest_dev_flushed) { int r = 0; + if (dest_dev_flushed) + *dest_dev_flushed = false; + mutex_lock(&clone->commit_lock); if (!dm_clone_changed_this_transaction(clone->cmd)) @@ -1128,6 +1137,19 @@ static int commit_metadata(struct clone *clone) goto out; } + bio_reset(&clone->flush_bio); + bio_set_dev(&clone->flush_bio, clone->dest_dev->bdev); + clone->flush_bio.bi_opf = REQ_OP_WRITE | REQ_PREFLUSH; + + r = submit_bio_wait(&clone->flush_bio); + if (unlikely(r)) { + __metadata_operation_failed(clone, "flush destination device", r); + goto out; + } + + if (dest_dev_flushed) + *dest_dev_flushed = true; + r = dm_clone_metadata_commit(clone->cmd); if (unlikely(r)) { __metadata_operation_failed(clone, "dm_clone_metadata_commit", r); @@ -1199,6 +1221,7 @@ static void process_deferred_bios(struct clone *clone) static void process_deferred_flush_bios(struct clone *clone) { struct bio *bio; + bool dest_dev_flushed; struct bio_list bios = BIO_EMPTY_LIST; struct bio_list bio_completions = BIO_EMPTY_LIST; @@ -1218,7 +1241,7 @@ static void process_deferred_flush_bios(struct clone *clone) !(dm_clone_changed_this_transaction(clone->cmd) && need_commit_due_to_time(clone))) return; - if (commit_metadata(clone)) { + if (commit_metadata(clone, &dest_dev_flushed)) { bio_list_merge(&bios, &bio_completions); while ((bio = bio_list_pop(&bios))) @@ -1232,8 +1255,17 @@ static void process_deferred_flush_bios(struct clone *clone) while ((bio = bio_list_pop(&bio_completions))) bio_endio(bio); - while ((bio = bio_list_pop(&bios))) - generic_make_request(bio); + while ((bio = bio_list_pop(&bios))) { + if ((bio->bi_opf & REQ_PREFLUSH) && dest_dev_flushed) { + /* We just flushed the destination device as part of + * the metadata commit, so there is no reason to send + * another flush. + */ + bio_endio(bio); + } else { + generic_make_request(bio); + } + } } static void do_worker(struct work_struct *work) @@ -1405,7 +1437,7 @@ static void clone_status(struct dm_target *ti, status_type_t type, /* Commit to ensure statistics aren't out-of-date */ if (!(status_flags & DM_STATUS_NOFLUSH_FLAG) && !dm_suspended(ti)) - (void) commit_metadata(clone); + (void) commit_metadata(clone, NULL); r = dm_clone_get_free_metadata_block_count(clone->cmd, &nr_free_metadata_blocks); @@ -1839,6 +1871,7 @@ static int clone_ctr(struct dm_target *ti, unsigned int argc, char **argv) bio_list_init(&clone->deferred_flush_completions); clone->hydration_offset = 0; atomic_set(&clone->hydrations_in_flight, 0); + bio_init(&clone->flush_bio, NULL, 0); clone->wq = alloc_workqueue("dm-" DM_MSG_PREFIX, WQ_MEM_RECLAIM, 0); if (!clone->wq) { @@ -1912,6 +1945,7 @@ static void clone_dtr(struct dm_target *ti) struct clone *clone = ti->private; mutex_destroy(&clone->commit_lock); + bio_uninit(&clone->flush_bio); for (i = 0; i < clone->nr_ctr_args; i++) kfree(clone->ctr_args[i]); @@ -1966,7 +2000,7 @@ static void clone_postsuspend(struct dm_target *ti) wait_event(clone->hydration_stopped, !atomic_read(&clone->hydrations_in_flight)); flush_workqueue(clone->wq); - (void) commit_metadata(clone); + (void) commit_metadata(clone, NULL); } static void clone_resume(struct dm_target *ti)