From patchwork Wed Feb  6 20:46:13 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Josef Bacik <josef@toxicpanda.com>
X-Patchwork-Id: 10800039
From: Josef Bacik <josef@toxicpanda.com>
To: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH 0/2] Fix missing reference aborts when resuming snapshot
 delete
Date: Wed,  6 Feb 2019 15:46:13 -0500
Message-Id: <20190206204615.5862-1-josef@toxicpanda.com>
X-Mailer: git-send-email 2.14.3
Sender: linux-btrfs-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-btrfs.vger.kernel.org>
X-Mailing-List: linux-btrfs@vger.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

With my delayed refs rsv patches in place we started hitting issues on our build
servers, which do a lot of snapshot deletions.  It turns out there was a bug in
btrfs_end_transaction_throttle() that caused it to basically always commit the
transaction, which uncovered this particular bug.

The gory details are in the change logs for both patches, but generally speaking
it's a problem with how we update our root_item->drop_progress key.  We
sometimes skip updating it even though we have dropped references to blocks.  If
we crash or unmount at one of these times, the resumed delete starts at an
earlier point than it should and tries to free blocks that we already freed, so
we end up with a transaction abort because we couldn't find the extent
reference.
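
To make the failure mode concrete, here is a toy model in Python.  It is only a
sketch: the real delete walks a btree and stores a key, not a list index, and
every name below (delete_snapshot, update_every, crash_after, MissingRef) is
made up for illustration.  It does not describe what the patches change, it
just shows why a stale drop_progress leads to a missing-reference abort.

```python
class MissingRef(Exception):
    """Stands in for the transaction abort on a missing extent reference."""


def delete_snapshot(blocks, refs, start, update_every, crash_after=None):
    """Drop one reference per block, starting at index `start`.

    Returns the saved progress, i.e. the index a resumed delete starts from.
    update_every > 1 models skipping some drop_progress updates (hypothetical).
    crash_after stops the walk after that many blocks were dropped.
    """
    saved_progress = start
    for dropped, idx in enumerate(range(start, len(blocks)), start=1):
        block = blocks[idx]
        if refs[block] == 0:
            # We are trying to free something that was already freed.
            raise MissingRef(f"no extent reference for block {block}")
        refs[block] -= 1
        if dropped % update_every == 0:
            saved_progress = idx + 1  # progress only recorded sometimes
        if crash_after is not None and dropped == crash_after:
            return saved_progress  # crash/unmount: only saved progress survives
    return len(blocks)


def crash_and_resume(update_every, nblocks=10, crash_after=6):
    """Run a delete, crash part way through, then resume from saved progress."""
    blocks = list(range(nblocks))
    refs = {b: 1 for b in blocks}
    progress = delete_snapshot(blocks, refs, 0, update_every, crash_after)
    try:
        delete_snapshot(blocks, refs, progress, update_every)
    except MissingRef:
        return "abort"
    return "ok"
```

In this model crash_and_resume(4) aborts, because progress was last saved after
4 blocks while 6 had been dropped, and crash_and_resume(1) completes, because
progress is saved every time a reference is dropped.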

There are 2 patches: one to deal with already broken file systems, and one to
keep this problem from happening in the first place.

Reproducing this easily is sort of tricky.  I had to add a couple of debug
patches to the kernel, basically just to make sure we actually commit the
transaction every time we finish a walk_down_tree/walk_up_tree combo.

The reproducer:

1) Creates a base subvolume.
2) Creates 100k files in the subvolume.
3) Snapshots the base subvolume (snap1).
4) Touches files 5000-6000 in snap1.
5) Snapshots snap1 (snap2).
6) Deletes snap1.
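
The steps above can be sketched as a small Python helper that builds the command
list.  This is not the actual reproducer script: the mount point, the file
naming scheme (file0, file1, ...) and the inclusive reading of "5000-6000" are
assumptions, and the debug patches and the dm-log-writes setup are not covered.
Only the standard "btrfs subvolume create/snapshot/delete" commands and touch
are used.

```python
import subprocess


def reproducer_plan(mnt, nfiles=100_000, touch_range=(5000, 6000)):
    """Return the reproducer steps as a list of argv lists (nothing is run)."""
    base = f"{mnt}/base"    # hypothetical subvolume names
    snap1 = f"{mnt}/snap1"
    snap2 = f"{mnt}/snap2"
    lo, hi = touch_range    # assumed inclusive on both ends

    # 1) Create the base subvolume.
    plan = [["btrfs", "subvolume", "create", base]]
    # 2) Create the files in the subvolume.
    plan += [["touch", f"{base}/file{i}"] for i in range(nfiles)]
    # 3) Snapshot the base subvolume.
    plan.append(["btrfs", "subvolume", "snapshot", base, snap1])
    # 4) Touch a range of files in snap1 so it diverges from base.
    plan += [["touch", f"{snap1}/file{i}"] for i in range(lo, hi + 1)]
    # 5) Snapshot snap1.
    plan.append(["btrfs", "subvolume", "snapshot", snap1, snap2])
    # 6) Delete snap1.
    plan.append(["btrfs", "subvolume", "delete", snap1])
    return plan


def run_plan(plan):
    """Execute the plan; needs root and a mounted btrfs file system."""
    for argv in plan:
        subprocess.run(argv, check=True)
```

Calling run_plan(reproducer_plan("/mnt/test")) would run the steps against a
file system mounted at /mnt/test (a placeholder path).  Spawning one touch per
file is slow, so a real script would likely create the files in a loop instead.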

I do this with dm-log-writes, and then replay to every FUA in the log and fsck
the fs.  Without these patches this falls over pretty quickly.  With just the
first patch we can mount the fs at the point that the fsck fails and it cleans
everything up properly.  With both patches applied the fsck never fails and
we're golden.  Thanks,

Josef