summaryrefslogtreecommitdiff
path: root/README.md
blob: 369107c6d268f5de83145f81bd8258c698f23e87 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
To: linux-btrfs@vger.kernel.org
From: Luke T. Shumaker <lukeshu@lukeshu.com>
Subject: btrfs-rec: Recover (data from) a broken btrfs filesystem

Inspired by a mis-typed `dd` command, for the last year I've been
working on a tool for recovering corrupt btrfs filesystems; at first
idly here and there, but more actively in the last few months.  I hope
to get it incorporated into btrfs-progs, though perhaps that is
problematic for a few reasons I'll get to.  If the code can't be
incorporated into btrfs-progs, at least the ideas and algorithms
should be.

    https://git.lukeshu.com/btrfs-progs-ng/

Highlights:

 - In general, it's more tolerant of corrupt filesystems than
   `btrfs check --repair`, `btrfs rescue` or `btrfs restore`.

 - `btrfs-rec inspect rebuild-mappings` is a better
   `btrfs rescue chunk-recover`.

 - `btrfs-rec inspect rebuild-trees` can re-attach lost branches to
   broken B+ trees.

 - `btrfs-rec inspect mount` is a read-only FUSE implementation of
   btrfs.  This is conceptually a replacement for `btrfs restore`.

 - It's entirely written in Go.  I'm not saying that's a good thing,
   but it's an interesting thing.

Hopefully some folks will find it useful, or at least neat!

    1.      Motivation
    2.      Overview of use
    3.      Prior art
    4.      Internals/Design
    4.1.      Overview of the source tree layout
    4.2.      Base decisions: CLI structure, Go, JSON
    4.3.      Algorithms
    4.3.1.      The `rebuild-mappings` algorithm
    4.3.2.      The `--rebuild` algorithm
    4.3.2.1.      rebuilt forrest behavior
    4.3.2.2.      rebuilt individual tree behavior
    4.3.3.      The `rebuild-trees` algorithm
    4.3.3.1.      initialization
    4.3.3.2.      the main loop
    4.3.3.3.      graph callbacks
    5.      Future work
    6.      Problems for merging this code into btrfs-progs

# 1. Motivation

Have you ever ended up with a corrupt btrfs filesystem (through no
fault of btrfs itself, but perhaps a failing drive, or a mistaken `dd`
invocation)?  Surely losing less than 100MB of data from a drive
should not render hundreds of GB of perfectly intact data unreadable!
And yet, the existing tools are unable to even attempt to read that
data:

    $ btrfs check --repair --force dump-zero.1.img
    enabling repair mode
    Opening filesystem to check...
    checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
    checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
    bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
    ERROR: cannot read chunk root
    ERROR: cannot open file system

or

    $ btrfs check --init-extent-tree --force dump-zero.1.img
    Opening filesystem to check...
    checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
    checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
    bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
    ERROR: cannot read chunk root
    ERROR: cannot open file system

or

    $ btrfs check --init-csum-tree --force dump-zero.1.img
    Creating a new CRC tree
    Opening filesystem to check...
    checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
    checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
    bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
    ERROR: cannot read chunk root
    ERROR: cannot open file system

or

    $ btrfs rescue chunk-recover dump-zero.1.img
    Scanning: DONE in dev0
    corrupt node: root=1 block=160410271744 slot=0, corrupt node: root=1 block=160410271744, nritems too large, have 39 expect range [1,0]
    Couldn't read tree root
    open with broken chunk error

or

    $ btrfs rescue zero-log dump-zero.1.img
    checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
    ERROR: cannot read chunk root
    ERROR: could not open ctree

or

    $ mkdir out
    $ btrfs restore dump-zero.1.img out
    checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
    checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
    bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
    ERROR: cannot read chunk root
    Could not open root, trying backup super
    checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
    checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
    bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
    ERROR: cannot read chunk root
    Could not open root, trying backup super
    ERROR: superblock bytenr 274877906944 is larger than device size 256060514304
    Could not open root, trying backup super

or

    $ btrfs restore --list-roots dump-zero.1.img
    checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
    checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
    bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
    ERROR: cannot read chunk root
    Could not open root, trying backup super
    checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
    checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
    bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
    ERROR: cannot read chunk root
    Could not open root, trying backup super
    ERROR: superblock bytenr 274877906944 is larger than device size 256060514304
    Could not open root, trying backup super

or

    $ btrfs-find-root dump-zero.1.img
    WARNING: cannot read chunk root, continue anyway
    Superblock thinks the generation is 6596071
    Superblock thinks the level is 1

Well, have I got a tool for you!

(FWIW, I also tried manipulating the filesystem and patching to tools
to try to get past those errors, only to get a different set of
errors.  Some of these patches I am separately submitting to
btrfs-progs.)

# 2. Overview of use

There are two `btrfs-rec` sub-command groups:
`btrfs-rec inspect SUBCMD` and `btrfs-rec repair SUBCMD`, and you can
find out about various sub-commands with `btrfs-rec help`.  These are
both told about devices or images with the `--pv` flag.

`btrfs-rec inspect SUBCMD` commands open the filesystem read-only, and
(generally speaking) write extracted or rebuilt information to stdout.
`btrfs-rec repair SUBCMD` commands open the filesystem read+write, and
consume information from `btrfs-rec inspect SUBCMD` commands to
actually repair the filesystem (except I haven't actually implemented
any `repair` commands yet...  despite the lack of `repair` commands, I
believe that `btrfs-rec` is already a useful because of the `btrfs-rec
inspect mount` command to get data out of the broken filesystem).
This split allows you to try things without being scared by WARNINGs
about not using these tools unless you're an expert or have been told
to by a developer.

In the broken `dump-zero.1.img` example above (which has a perfectly
intact superblock, but a totally broken `CHUNK_TREE`), to "repair" it
I'd:

 1. Start by using `btrfs-rec inspect rebuild-mappings` to rebuild the
    broken chunk/dev/blockgroup trees:

        $ btrfs-rec inspect rebuild-mappings \
            --pv=dump-zero.1.img \
            > mappings-1.json

 2. If it only mostly succeeds, but on stderr tells us about a few
    regions of the image that it wasn't able to figure out the chunks
    for.  Using some human-level knowledge, you can write those
    yourself, inserting them into the generated `mappings.json`, and
    ask `rebuild-mappings` to normalize what you wrote:

        $ btrfs-rec inspect rebuild-mappings \
            --pv=dump-zero.1.img \
            --mappings=<(sed <mappings-1.json \
                -e '2a{"LAddr":5242880,"PAddr":{"Dev":1,"Addr":5242880},"Size":1},' \
                -e '2a{"LAddr":13631488,"PAddr":{"Dev":1,"Addr":13631488},"Size":1},') \
            > mappings-2.json

 3. Now that it has functioning chunk/dev/blockgroup trees, we can use
    `btrfs-rec inspect rebuild-trees` to rebuild other trees that rely
    on those:

        $ btrfs-rec inspect rebuild-mappings \
            --pv=dump-zero.1.img \
            --mappings=mappings-2.json \
            > trees.json

 4. Now that (hopefully) everything that was damaged has been
    reconstructed, we can use `btrfs-rec inspect mount` to mount the
    filesystem read-only and copy out our data:

        $ mkdir mnt
        $ sudo btrfs-rec inspect mount \
            --pv=dump-zero.1.img \
            --mappings=mappings-2.json \
            --trees=trees.json \
            ./mnt

This example is fleshed out more (and the manual edits to
`mappings.json` explained more) in `./examples/main.sh`.

# 3. Prior art

Comparing `btrfs-rec inspect mount` with the existing
https://github.com/adam900710/btrfs-fuse project:

 - Again, mine has better fault tolerance
 - Mine is read-only
 - Mine supports xattrs ("TODO" in Adam's)
 - Mine supports separate inode address spaces for subvolumes; Adam's
   doesn't due to limitations in FUSE, mine works around this by
   lazily setting up separate mountpoints for each subvolume (though
   this does mean that the process needs to run as root, which is a
   bummer).

# 4. Internals/Design

## 4.1. Overview of the source tree layout

 - `examples/` has example scripts showing how to use `btrfs-rec`.

 - `lib/btrfs/` is the core btrfs implementation.

 - `lib/btrfscheck/` and `lib/btrfsutil/` are libraries for
   "btrfs-progs" type programs, that are userland-y things that I
   thought should be separate from the core implementation; something
   that frustrated me about libbtrfs was having to figure out "is this
   thing here in support of btrfs bits-on-disk, or in support of a
   higher-level 'how btrfs-progs wants to think about things'?"

 - `cmd/btrfs-rec/` is where the command implementations live.  If a
   sub-command fits in a single file, it's
   `cmd/btrfs-rec/inspect_SUBCMD.go`, otherwise, it's in a separate
   `cmd/btrfs-rec/inspect/SUBCMD/` package.

 - `lib/textui/` is reasonably central to how the commands implement a
   text/CLI user-interface.

 - `lib/binstruct/`, `lib/diskio/`, and `lib/streamio/` are
   non-btrfs-specific libraries related to the problem domain.

 - `lib/containers/`, `lib/fmtutil/`, `lib/maps/`, `lib/slices/`, and
   `lib/profile/` are all generic Go libraries that have nothing to do
   with btrfs or the problem domain, but weren't in the Go standard
   library and I didn't find/know-of exiting implementations that I
   liked.  Of these, all but `containers` are pretty simple utility
   libraries.  Also, some of these things have been added to the
   standard library since I started the project.

## 4.2. Base decisions: CLI structure, Go, JSON

I started with trying to enhance btrfs-progs, but ended up writing a
wholy new program in Go, for several reasons:

 - writing a new thing: I was having to learn both the btrfs-progs
   codebase and how btrfs-bits-on-disk work, and it got to the point
   that I decided I should just focus on learning btrfs-bits-on-disk.

 - writing a new thing: It was becoming increasingly apparent to me
   that it was going to be an uphill-fight of having recovery-tools
   share the same code as the main-tools, as the routines used by the
   main-tools rightly have validity checks, where recovery-tools want
   to say "yes, I know it's invalid, can you give it to me anyway?".

 - writing it in not-C: I love me some C, but higher level languages
   are good for productivity.  And I was trying to write a whole lot
   of code at once, I needed a productivity boost.

 - writing it in not-C: This forced me to learn btrfs-bits-on-disk
   better, instead of just cribbing from btrfs-progs.  That knowledge
   is particularly important for having ideas on how to deal with
   corrupt bits-on-disk.

 - writing it in Go: At the time I started, my day job was writing Go,
   so I had Go swapped into my brain.  And Go still feels close to C
   but provides *a lot* of niceness and safety over C.

It turned out that Go was perhaps not the best choice, but we'll come
back to that.

I wanted to separate things into a pipeline.  For instance: Instead of
`btrfs rescue chunk-recover` trying to do everything to rebuild a
broken chunk tree, I wanted to separate I/O from computation from
repairs.  So I have `btrfs-rec inspect rebuild-mappings scan` that
reads all the info necessary to rebuild the chunk tree, then dump that
as a 2GB glob of JSON.  Then I can feed that JSON to `btrfs-rec
inspect rebuild-mappings process` which actually rebuilds the mappings
in the chunk tree, and dumps them as JSON.  And then other commands
can consume that `mappings.json` to use that instead of trying to read
the chunk tree from the actual FS, so that you don't have to make
potentially destructive writes to inspect an FS with a broken chunk
tree, and can inspect it more forensically.  Or then use
`btrfs-rec repair SOME_SUBCMD_I_HAVENT_WRITTEN_YET` to write that
chunk tree in `mappings.json` back to the filesystem.

(But also, the separate steps thing was useful just so I could iterate
on the algorithms of `rebuild-mappings process` separately from having
to scan the entire FS)

So, I made the decision that `btrfs-rec inspect SUBCMD` commands
should all only open the FS read-only, and output their work to a
separate file; that writing that info back to the FS should be
separate in `btrfs-rec repair SUBCMD`.

For connecting those parts of the pipeline, I chose JSON, for a few
reasons:

 - I wanted something reasonably human-readable, so that I could debug
   it easier.

 - I wanted something reasonably human-readable, so that human
   end-users could make manual edits; for example, in
   `examples/main.sh` I have an example of manually editing
   `mappings.json` to resolve a region that the algorithm couldn't
   figure out, but with knowledge of what caused the corruption a
   human can.

 - I didn't want to invent my own DSL and have to handle writing a
   parser.  (This part didn't pay off!  See below.)

 - I wanted something that I thought would have good support in a
   variety of languages, so that if Go is problematic for getting
   things merged upstream it could be rewritten in C (or maybe Rust?)
   piece-meal where each subcommand can be rewritten one at a time.

It turned out that JSON was perhaps not the best choice.

OK, so: Go and/or JSON maybe being mistakes:

 - I spent a lot of time getting the garbage collector to not just
   kill performance.

 - The `btrfs-rec inspect rebuild-mappings SUBCMD` subcommands all
   throw a lot of data through the JSON encoder/decoder, and I learned
   that the Go stdlib `encoding/json` package has memory use that
   grows O(n^2) (-ish? I didn't study the implementation, but that's
   what the curve looks like just observing it) on the size of the
   data being shoved through it, so I had to go take a break and go
   write https://pkg.go.dev/git.lukeshu.com/go/lowmemjson which is a
   mostly-drop-in-replacement that tries to be as close-as possible to
   O(1) memory use.  So I did end up having to write my own parser
   anyway :(

## 4.3. Algorithms

There are 3 algorithms of note in `btrfs-rec`, that I think are worth
getting into mainline btrfs-progs even if the code of `btrfs-rec`
doesn't get in:

 1. The `btrfs-rec inspect rebuild-mappings` algoritithm to rebuild
    information from the `CHUNK_TREE`, `DEV_TREE`, and
    `BLOCK_GROUP_TREE`.

 2. The `btrfs-rec --rebuild` algorithm to cope with reading broken B+
    trees.

 3. The `btrfs-rec inspect rebuild-trees` algorithm to re-attach lost
    branches to broken B+ trees.

### 4.3.1. The `rebuild-mappings` algorithm

(This step-zero scan is `btrfs-rec inspect rebuild-mappings scan`, and
principally lives in `./lib/btrfsutil/scan.go` and
`./cmd/btrfs-rec/inspect/rebuidmappings/scan.go`)

 0. Similar to `btrfs rescue chunk-recover`, scan each device for
    things that look like nodes; keep track of:
     - Checksums of every block on the device
     - Which physical addresses contain nodes that claim to be at a
       given logical addess.
     - Any found Chunk items, BlockGroup items, DevExtent, and CSum
       items.  Keep track of the key for each of these, and for CSum
       items also track the generation.

Create a bucket of the data from Chunks, DevExtents, and BlockGroups;
since these are mostly a Chunk and a DevExtent+BlockGroup store pretty
much the same information; we can use one to reconstruct the other.
How we "merge" these and handle conflicts is in
`./lib/btrfs/btrfsvol/lvm.go:addMapping()`, I don't think this
part is particularly clever, but given that `btrfs rescue
chunk-recover` crashes if it encounters two overlapping chunks, I
suppose I should spell it out:

 - A "mapping" is represented as a group of 4 things:

   + logical address
   + a list of 1 or more physical addresses (device ID and offset)
   + size, and a Boolean indicator of whether the size is "locked"
   + block group flags, and a Boolean presence-indicator

 - Mappings must be merged if their logical or physical regions
   overlap.

 - If a mapping has a "locked" size, then when merging it may subsume
   smaller mappings with unlocked sizes, but its size cannot be
   changed; trying to merge a locked-size mapping with another mapping
   that is not for a subset region should return an error.

 - If a mapping has block group flags present, then those flags may
   not be changed; it may only be merged with another mapping that
   does not have flags present, or has identical flags.

 - When returning an error because of overlapping non-mergeable
   mappings, just log an error on stderr and keep going.  That's an
   important design thing that is different than normal filesystem
   code; if there's an error, yeah, detect and notify about it, **but
   don't bail out of the whole routine**.  Just skip that one item or
   whatever.

Now that we know how to "add a mapping", let's do that:

(The following main-steps are `btrfs-rec inspect rebuild-mappings
process`, and principally live in
`./cmd/btrfs-rec/inspect/rebuidmappings/process.go`)

 1. Add all found Chunks.

 2. Add all found DevExtents.

 3. Add a phyical:logical mapping of length nodesize for each node
    that was found.

 4. Any mappings from steps 2 or 3 that are missing blockgroup flags
    (that is: they weren't able to be merged with a mapping from step
    1), use the found BlockGroups to fill in those flags.

 5. Now we'll merge all found CSum items into a map of the sums of the
    logical address space.  Sort all of the csum items by generation,
    then by address.  Loop over them in that order, inserting their
    sums into the map.  If two csum items overlap, but agree about the
    sums of the overlapping region, that's fine, just take their
    union.  For overlaps that disagree, items with a newer generation
    kick out items with an older generation.  If disagreeing items
    have the same generation... I don't think that can happen except
    by a filesystem bug (i.e. not by a failing drive or other external
    corruption), so I wasn't too concerned about it, so I just log an
    error on stderr and skip the later-processed item.  See
    `./cmd/btrfs-rec/inspect/rebuidmappings/process_sums_logical.go`.

    Look at regions of the logical address space that meet all the 3
    criteria:

     - we have CSum items for them
     - we have a BlockGroup for them
     - we don't have a Chunk/DevExtent mapping them to the pysical
       address space.

    Pair those CSums up with BlockGroups, and for each BlockGroup,
    search the list of checksums of physical blocks to try to find a
    physical region that matches the logical csums (and isn't already
    mapped to a different logical region).  I used a
    Knuth-Morris-Pratt search, modified to handle holes in the logical
    csum list as wildcards.

    Insert any found mappings into our bucket of mappings.

 6. Do the same again, but with a fuzzy search (we can re-use the csum
    map of the logical address space).  My implementation of this is
    comparatively time and space intensive; I just walk over the
    entire unmapped physical address space, noting what % of match
    each BlockGroup has if placed at that location.  I keep track of
    the best 2 matches for each BlockGroup.  If the best match is
    better than a 50% match, and the second best is less than a 50%
    match, then I add the best match.  In my experience, the best
    match is >90% (or at whatever the maximum percent is for how much
    of the BlockGroup has logical sums), and the second best is 0% or
    1%.  The point of tracking both is that if there isn't a clear-cut
    winner, I don't want it to commit to a potentially wrong choice.

### 4.3.2. The `--rebuild` algorithm

The `--rebuild` flag is implied by the `--trees=trees.json` flag, and
triggers an algorithm that allows "safely" reading from a broken B+
tree, rather than the usual B+ tree lookup and search functions.  I
probably should have tried to understand the `btrfs restore`
algorithm, maybe I reinvented the wheel...

This algorithm requires a list of all nodes on the filesystem; we find
these using the same scan as above (`./lib/btrfsutil/scan.go`), the
same procedure as `btrfs rescue chunk-recover`.

We walk all of those nodes, and build a reasonably lightweight
in-memory graph of all nodes (`./lib/btrfsutil/graph.go`), tracking

 - each node's
   + logical address
   + level
   + generation
   + tree
   + each item's key and size
 - each keypointer's
   + source node
   + source slot within the node
   + tree of the source node
   + destination node
   + destination level implied by the level of the source node
   + destination key
   + destination generation
 - logical addresses and error messages for nodes that are pointed to
   by a keypointer or the superblock, but can't be read (because that
   logical address isn't mapped, or it doesn't look like a node,
   or...)
 - an index such that for a given node we can quickly list both
   keypointers both originating at that node and pointing to that
   node.

#### 4.3.2.1. rebuilt forrest behavior (looking up trees)

(see: `./lib/btrfsutil/rebuilt_forrest.go`)

 - The `ROOT_TREE`, `CHUNK_TREE`, `TREE_LOG`, and `BLOCK_GROUP_TREE`
   (the trees pointed to directy by the superblock) work as you'd
   expect.
 - For other trees, we (as you'd expect) look up the root item in the
   rebuilt `ROOT_TREE`, and then (if rootitem.ParentUUID is non-zero)
   eagerly also look up the parent tree (recursing on ourself).  We
   try to use the `UUID_TREE` tree to help with this, but fall back to
   just doing a linear scan over the `ROOT_TREE`.  If we fail to look
   up the parent tree (or its parent, or a more distant ancestor),
   then (depending on a flag) we either make a note of that, or error
   out and fail to look up the child tree.  For `--rebuild` and
   `--trees=trees.json` we are permissive of this error, and just make
   note of it; but we'll re-use this algorithm in the `rebuild-trees`
   algorithm below, and it needs the more strict handling.
 - When creating the rebuilt individual tree, we start by adding the
   root node specified by the superblock/root-item.  But we may also
   add additional root nodes grafted on to the tree by the
   `--trees=trees.json` flag or by the `rebuild-trees` algorithm
   below.  So a tree may have more than 1 root node.

#### 4.3.2.2. rebuilt individual tree behavior

(see: `./lib/btrfsutil/rebuilt_tree.go`)

In order to read from a tree, we first have to build a few indexes.
We store these indexes in an Adaptive Replacement Cache; they are all
re-buildable based on the tree's list of roots and the above graph; if
we have a bunch of trees we don't need to keep all of this in memory
at once.  Note that this is done 100% with the in-memory graph, we
don't need to read anything from the filesystem during these
procedures.

 - The first index we build is the "node index".  This is an index
   that for every node tells us what root(s) the tree would need to
   have in order for the tree to include that node, and also what the
   highest item key would be acceptable in the node if the tree
   includes that root.  We track both a `loMaxItem` and a `hiMaxItem`,
   in case the tree is real broken and there are multiple paths from
   the root to the node; as these different paths may imply different
   max-item constraints.  Put more concretely, the type of the index
   is:

       map[ nodeID → map[ rootNodeID → {loMaxItem, hiMaxItem} ] ]

   We'll do a loop over the graph, using dynamic-programming
   memoization to figure out ordering and avoid processing the same
   node twice; for each node we'll

   + Check whether the owner-tree is this tree or one of this tree's
      ancestors (and if it's an ancestor, that the node's generation
      isn't after the point that the child tree was forked from the
      parent tree).  If not, we are done processing that node (record
      an empty/nil set of roots for it).

   + Create an empty map of `rootID` → {`loMaxItem`, `hiMaxItem`}.

   + Look at each keypointer that that points at the node and:

     * Skip the keypointer if its expectations of the node aren't met:
       if the level, generation, and min-key constraints don't match
       up.  If the keypointer isn't in the last slot in the source
       node, we also go ahead and include checking that the
       destination node's max-key is under the min-key of the
       keypointer in the next slot, since that's cheap to do now.

     * Skip the keypointer if its source node's owner-tree isn't this
       tree or one of this tree's ancestors (and if it's an ancestor,
       that the node's generation isn't after the point that the child
       tree was forked from the parent tree).

     * dynamic-programming recurse and index the keypointer's source
       node.

     * for every root that would result in the keypointer's source
       node being included in the tree:

       . If the keypointer is in the last slot, look at what the what
         the source node's last-item constraints would be if that root
         is included, and can now check the max-item of our
         destination node.  We check against the `hiMaxItem`; as if
         there is any valid path from the root to this node, then we
         want to be permissive and include it.  If that check fails,
         then we're done with this keypointer.  Also, make node of
         those `loMaxItem` and `hiMaxItem` values, we'll use them
         again in just a moment.

       . Otherwise, set both `loMaxItem` and `hiMaxItem` to 1-under
         the min-item of the keypointer in the next slot.

       . Insert that `loMaxItem` and `hiMaxItem` pair into the
         `rootID` → {`loMaxItem`, `hiMaxItem`} map we created above.
         If an entry already exists for this root (since a broken tree
         might have multiple paths from the root to our node), then
         set `loMaxItem` to the min of the existing entry and our
         value, and `hiMaxItem` to the max.

   + If that `rootID` → {`loMaxItem`, `hiMaxItem`} map is still empty,
     then consider this node to be a (potential) root, and insert
     `rootID=thisNode` -> {`loMaxItem=maxKey`, `hiMaxItem=maxKey`}
     (where `maxKey` is the maximum value of the key datatype).

   + Take that `rootID` → {`loMaxItem`, `hiMaxItem`} map and insert it
     into the index as the entry for this node.

 - The next index we build is the "item index".  This is a "sorted
   map" (implemented as a red-black tree, supporting sub-range
   iteration) of `key` → {`nodeID`, `slotNumber`}; a map that for each
   key tells us where to find the item with that key.

   + Loop over the node index, and for each node check if both (a) it
     has `level==0` (is a leaf node containing items), and (b) its set
     of roots that would include it has any overlap with the tree's
     set of roots.

   + Loop over each of those included leaf nodes, and loop over the
     items in each node.  Insert the `key` → {`nodeId`, `slot`} into
     our sorted map.  If there is already an entry for that key,
     decide which one wins by:

     * Use the one from the node with the owner-tree that is closer to
       this tree; node with owner=thisTree wins over a node with
       owner=thisTree.parent, which would win over a node with
       owner.thisTree.parent.parent.  If that's a tie, then...

     * Use the one from the node with the higher generation.  If
       that's a tie, then...

     * I don't know, I have the code `panic`:

           // TODO: This is a panic because I'm not really sure what the
           // best way to handle this is, and so if this happens I want the
           // program to crash and force me to figure out how to handle it.
           panic(fmt.Errorf("dup nodes in tree=%v: old=%v=%v ; new=%v=%v",
               tree.ID,
               oldNode, tree.forrest.graph.Nodes[oldNode],
               newNode, tree.forrest.graph.Nodes[newNode]))

   Note that this algorithm means that for a given node we may use a
   few items from that node, while having other items from that same
   node be overridden by another node.

 - The final index we build is the "error index".  This is an index of
   what errors correspond to which range of keys, so that we can
   report them, and give an idea of "there may be entries missing from
   this directory" and similar.

   For each error, we'll track the min-key and max-key of the range it
   applies to, the node it came from, and what the error string is.
   We'll store these into an interval tree keyed on that
   min-key/max-key range.

   + Create an empty set `nodesToProcess`.  Now populate it:

     * Once again, we'll loop over the node index, but this time we'll
       only check that there's overlap between the set of roots that
       would include the node and the tree's set of roots.  The nodes
       that are included in this tree, insert both that node itself
       and all node IDs that it has keypointers pointing to into the
       `nodesToProcess` set.

     * Also insert all of the tree's roots into `nodesToProcess`; this
       is in case the superblock/root-item points to an invalid node
       that we couldn't read.

   + Now loop over `nodesToProcess`.  For each node, create an empty
     list of errors.  Use the keypointers pointing to and the min
     `loMaxItem` from the node index to construct a set of
     expectations for the node; this should be reasonably
     straight-forward, given:

     * If different keypointers have disagreeing levels, insert an
       error in to the list, and don't bother with checking the node's
       level.

     * If different keypointers have disagreeing generations, insert
       an error in to the list, and don't bother with checking the
       node's generation.

     * If different keypointers have different min-item expectations,
       use the max of them.

     Then:

     * If the node is a "bad node" in the graph, insert the error
       message associated with it.  Otherwise, check those
       expectations against the node in the graph.

     If the list of error messages is non-empty, then insert their
     concatenation into the interval tree, with the range set to the
     min of the min-item expectations from the keypointers through the
     max of the `hiMaxItem`s from the node index.  If the min min-item
     expectation turns out to be higher than the max `hiMaxItem`, then
     set the range to the zero-key through the max-key.

From there, it should be trivial to implement the usual B+ tree
operations using those indexes; exact-lookup using the item index, and
range-lookups and walks using the item index together with the error
index.  Efficiently searching the `CSUM_TREE` requires knowing item
sizes, so that's why we recorded the item sizes into the graph.

### 4.3.3. The `rebuild-trees` algorithm

The `btrfs inspect rebuild-trees` algorithm finds nodes to attach as
extra roots to trees.  I think that conceptually it's the the simplest
of the 3 algorithms, but turned out to be the hardest to get right.
So... maybe more than the others reference the source code too
(`./cmd/btrfs-rec/inspect/rebuildtrees/`) because I might forget some
small but important detail.

The core idea here is that we're just going to walk each tree,
inspecting each item in the tree, and checking for any items that are
implied by other items (e.g.: a dir entry item implies the existence
of inode item for the inode that it points at).  If an implied item is
not in the tree, but is in some other node, then we look at which
potential roots we could add to the tree that would add that other
node.  Then, after we've processed all of the items in the filesystem,
we go add those various roots to the various trees, keeping track of
which items are added or updated.  If any of those added/updated items
have a version with a newer generation on a different node, see what
roots we could add to get that newer version.  Then add those roots,
keeping track of items that are added/updated.  Once we reach
steady-state with the newest version of each item has been added, loop
back and inspect all added/updated items for implied items, keeping
track of roots we could add.  Repeat until a steady-state is reached.

There are lots of little details in that process, some of which are
for correctness, and some of which are for "it should run in hours
instead of weeks."

#### 4.3.3.1. initialization

First up, we're going to build and in-memory graph, same as above.
But this time, while we're reading the nodes to do that, we're also
going to watch for some specific items and record a few things about
them.

(see: `./cmd/btrfs-rec/inspect/rebuildtrees/scan.go`)

For each {`nodeID`, `slotNumber`} pair that matches one of these item
types, we're going to record:

 - flags:
   + `INODE_ITEM`s: whether it has the `INODE_NODATASUM` flag set
 - names:
   + `DIR_INDEX` items: the file's name
 - sizes:
   + `EXTENT_CSUM` items: the number of bytes that this is a sum for
     (i.e. the item size over the checksum size, times the block size)
   + `EXTENT_DATA` items: the number of bytes in this extent
     (i.e. either the item size minus
     `offsetof(btrfs_file_extent_item.disk_bytenr)` if
     `FILE_EXTENT_INLINE`, or else the item's `num_bytes`).
 - data backrefs:
   - `EXTENT_ITEM`s and `METADATA_ITEM`s: a list of the same length as
     the number of refs embedded in the item; for embeded
     ExtentDataRefs, the list entry is the subvolume tree ID that the
     ExtentDataRef points at, otherwise it is zero.
   - `EXTENT_DATA_REF` items: a list of length 1, with the sole member
     being the subvolume tree ID that the ExtentDataRef points at.

#### 4.3.3.2. the main loop

(see: `./cmd/btrfs-rec/inspect/rebuildtrees/rebuild.go`)

Start with that scan data (graph + info about items), and also a
rebuilt forrest from the above algorithm, but with:

 - the flag set so that it refuses to look up a tree if it can't look
   up all of that tree's ancestors

 - an additional "potential-item index" that is similar to the item
   index.  It is generated the same way and can cache/evict the same
   way; the difference is that we invert the check for if the set of
   roots for a node has overlap with the tree's set of roots; we're
   looking for *potential* nodes that we could add to this tree.

 - some callbacks; we'll get to what we do in these callbacks in a
   bit, but for now, what the callbacks are:

   + a callback that is called for each added/updated item when we add
     a root.

   + a callback that is called whenever we add a root

   + a callback that intercepts looking up a root item

   + a callback that intercepts resolving an UUID to an object ID.

  (The callbacks are in
  `./cmd/btrfs-rec/inspect/rebuildtrees/rebuild_treecb.go`)

We have 5 unordered queues ("work lists"?); these are sets that when
it's time to drain them we'll sort the members and process them in
that order.

 1. the tree queue: a list of tree IDs that we need to crawl
 2. the retry-item queue: for each tree ID, a set of items that we
    should re-process if we add a root to that tree
 3. the added-item queue: a set of key/tree pairs identifying items
    that have been added by adding a root to a tree
 4. the settled-item-queue: a set of key/tree pairs that have have not
    just been added by adding a root, but we've also verified that
    they are the newest-generation item with that key that we could
    add to the tree.
 5. the augment queue: for each item that we want to add to a tree,
    the list of roots that we could add to get that item.

The roots all start out empty, except for the tree queue, which we
seed with the `ROOT_TREE`, the `CHUNK_TREE`, and the
`BLOCK_GROUP_TREE` (It is a "TODO" task that it should probably also
be seeded with the `TREE_LOG`, but as I will say below in the "future
work" section, I don't actually understand the `TREE_LOG`, so I
couldn't implement it).

Now we're going to loop until the tree queue, added-item queue,
settled-item queue, and augment queue are all empty (all queues except
for the retry-item queue).  Each loop "pass" has 3 substeps:

 1. Crawl the trees (drain the tree queue, fill the added-item queue).

 2. Either:

    a. if the added-item queue is non-empty: "settle" those items
       (drain the added-item queue, fill the augment queue and the
       settled-item queue).

    b. otherwise: process items (drain the settled-item queue, fill
       the augment queue and the tree queue)

 3. Apply augments (drain the augment queue and maybe the retry-item
    queue, fill the added-item queue).

OK, let's look at those 3 substeps in more detail:

 1. Crawl the trees; drain the tree queue, fill the added-item queue.

    We just look up the tree in the rebuilt forrest, which will (per
    the above `--rebuild` algorithm) will either fail to look up the
    tree, or succeed, and add to that tree the root node from the
    superblock/root-item.  Because we set an item-added callback, when
    adding that root it will loop over the nodes added by that root,
    and call our callback for each item in one of the added nodes.
    Our callback inserts each item into the added-item queue.  The
    forrest also calls our root-added callback, but because of the way
    this algorithm works, that turns out to be a no-op at this step.

    I mentioned that we added callbacks to intercept the forrest's
    looking up of root items and resolving UUIDs; we override the
    forrest's "lookup root item" routine and "resolve UUID" routine to
    instead of doing normal lookups on the `ROOT_TREE` and
    `UUID_TREE`, use the above `WantXXX` routines that we'll define
    below in the "graph callbacks" section.

    It shouldn't matter what order this queue is processed in, but I
    sort tree IDs numerically.

    The crawling is fairly fast because it's just in-memory, the only
    accesses to disk are looking up root items and resolving UUIDs.

 2. Either:

    a. Settle items from the added-item queue to the settled-item queue
       (and fill the augment queue).

       For each item in the queue, we look in the tree's item index to
       get the {node, slot} pair for it, then we do the same in the
       tree's potential-item index.  If the potential-item index
       contains an entry for the item's key, then we check if the
       potential-item's node should "win" over the queue item's node,
       deciding the "winner" using the same routine as when building
       the item index.  If the potential-item's node wins, then we add
       the potential node's set of roots to the augment queue.  If the
       queue-item's node wins, then we add the item to the
       settled-item queue (except, as an optimization, if the item is
       of a type that cannot possibly imply the existence of another
       item, then we just drop it and don't add it to the settled-item
       queue).

       It shouldn't matter what order this queue is processed in, but
       I sort it numerically by treeID and then by item key.

       This step is fairly fast because it's entirely in-memory,
       making no accesses to disk.

    b. Process items from the settled-item queue (drain the
       settled-item queue, fill the augment queue and the tree queue).

       This step accesses disk, and so the order we process the queue
       in turns out to be pretty important in order to keep our disk
       access patterns cache-friendly.  For the most part, we just
       sort each queue item by tree, then by key.  But, we have
       special handling for `EXTENT_ITEM`s, `METADATA_ITEM`s, and
       `EXTENT_DATA_REF` items: We break `EXTENT_ITEM`s and
       `METADATA_ITEM`s in to "sub-items", treating each ref embedded
       in them as a separate item.  For those embedded items that are
       `EXTENT_DATA_REF`s, and for stand-alone `EXTENT_DATA_REF`
       items, we sort them not with the `EXTENT_TREE` items, but with
       the items of the tree that the extent data ref points at.
       Recall that during the intitial scan step, we took note of
       which tree every extent data ref points at, so we can perform
       this sort without accessing disk yet.  This splitting does mean
       that we may visit/read an `EXTENT_ITEM` or `METADATA_ITEM`
       multiple times as we process the queue, but to do otherwise is
       to solve MinLA, which is NP-hard and also an optimal MinLA
       solution I still think would perform worse than this; there is
       a reasonably lengthy discussion of this in a comment in
       `./cmd/btrfs-rec/inspect/rebuildtrees/rebuild.go:sortSettledItemQueue()`.

       Now we loop over that sorted queue.  In the code, this loop is
       deceptively simple.  Read the item, then pass it to a function
       that tells us what other items are implied by it.  That
       function is large, but simple; it's just a giant table.  The
       trick is how it tells us about implied items; we give it set of
       callbacks that it calls to tell us these things; the real
       complexity is in the callbacks.  These "graph callbacks" will
       be discussed in detail below, but as an illustrative example:
       It may call `.WantOff()` with a tree ID, object ID, item type,
       and offset to specify a precise item that it believes should
       exist.

       If we encounter a `ROOT_ITEM`, add the tree described by that
       item to the tree queue.

    (Both the "can this item even imply the existence of another item"
    check and the "what items are implied by this item" routine are in
    `./lib/btrfscheck/graph.go`)

 3. Apply augments; drain the augment queue (and maybe the retry-item
    queue), fill the added-item queuee.

    It is at this point that I call out that the augment queue isn't
    implemented as a simple map/set like the others, the
    `treeAugmentQueue struct` has special handling for sets of
    different sizes; optimizing the space for empty and len()==1 sized
    sets, and falling back to normal the usual implementation for
    larger sets; this is important because those small sets are the
    overwhelming majority, and otherwise there's no way the program
    would be able to run on my 32GB RAM laptop.  Now that I think
    about it, I bet it would even be worth it to add optimized storage
    for len()==2 sized sets.

    The reason is that each "want" from above is tracked in the queue
    separately; if we were OK merging them, then this optimized
    storage wouldn't be nescessary.  But we keep them separate, so
    that:

    - For all "wants", including ones with empty sets, graph callbacks
      can check if a want has already been processed; avoiding
      re-doing any work (see the description of the graph callbacks
      below).

    - For "wants" with non-empty sets, we can see how many different
      "wants" could be satisfied with a given root, in order to decide
      which root to choose.

    Anyway, we loop over the trees in the augment queue.  For each
    tree we look at that tree's augment queue and look at all the
    choices of root nodes to add (below), and decide on a list to add.
    The we add each of those roots to the tree; the adding of each
    root triggers several calls to our item-added callback (filling
    the added-item queue), and our root-added callback.  The
    root-added callback moves any items from the retry-item queue for
    this tree to the added-item queue.

    How do we decide between choices of root nodes to add?
    `./cmd/btrfs-rec/inspect/rebuildtrees/rebuild.go:resolveTreeAugments()`
    has a good comment explaining the criteria we'd like to optimize
    for, and then code that does an OK-ish job of actually optimizing
    for that:
	
    - It loops over the augment queue for that tree, building a list
      of possible roots, for each possible root making note of 3
      things:

       a. how many "wants" that root satisfies,

       b. how far from treee the root's owner is (owner=tree is a
          distance of 0, owner=tree.parent is a distance of 1,
          owner=tree.parent.parent is a distance of 2, and so on), and
          
       c. what the generation of that root is.
       
    - We sort that list first by highest-count-first, then by
      lowest-distance-first, then by highest-generation-first.
      
    - We create a "return" set and an "illegal" set.  We loop over the
      sorted list; for each possible root if it is in the illegal set,
      we skip it, otherwise we insert it into the return set and for
      each "want" that includes this root we all all roots that
      satisfy that want to the illegal list.

It is important that the rebuilt forrest have the flag set so that it
refuses to look up a tree if it can't look up all of that tree's
ancestors; otherwise the potential-items index would be garbage as we
wouldn't have a good idea of which nodes are OK to consider; but this
does have the downside that it won't even attempt to improve a tree
with a missing parent.  Perhaps the algorithm should flip the flag
once the loop terminates, and then re-seed the tree queue with each
`ROOT_ITEM` from the `ROOT_TREE`?

#### 4.3.3.3. graph callbacks

(see: `./cmd/btrfs-rec/inspect/rebuildtrees/rebuild_wantcb.go`)

The graph callbacks are what tie the above together.

For each of these callbacks, whenever I say that it looks up something
in a tree's item index or potential-item index, that implies looking
the tree up from the forrest; if the forrest cannot look up that tree,
then the callback returns early, after either:

 - if we are in substep 1 and are processing a tree: we add the tree
   that is being processed to the tree queue.  (TODO: Wait, this
   assumes that an augment will be applied to the `ROOT_TREE` before
   the next pass... if that isn't the case, this will result in the
   loop never terminating... I guess I need to add a separate
   retry-tree queue?)

 - if we are in substep 2 and are processing an item: we add the item
   that is being processed to the retry-item queue for the tree that
   cannot be looked up

The 6 methods in the `brfscheck.GraphCallbacks` interface are:

 1. `FSErr()`: There's an error with the filesystem; this callback
    just spits it out on stderr.  I say such a trivial matter because,
    again, for a recovery tool I think it's worth putting care in to
    how you handle errors and where you expect them: We expect them
    here, so we have to check for them to avoid reading invalid data
    or whatever, but we don't actually need to do anything other than
    watch our step.

 2. `Want()`: We want an item in a given tree with a given object ID
    and item type, but we don't care about what the item's offset is.

    The callback works by searching the item index to see if it can
    find such an item; if so, it has nothing else to do and returns.
    Otherwise, it searches the potential-item index; for each matching
    item it finds it looks in the node index for the node containing
    that item, and adds the roots that would add that node, and adds
    those roots to a set.  Once it has finished searching the
    potential-item index, it adds that set to the augment queue (even
    if that set is still empty).

 3. `WantOff()`: The same, but we want a specific offset.

 4. `WantDirIndex()`: We want a `DIR_INDEX` item for a given inode and
    filename, but we don't know what the offset of that item is.

    First we scan over the item index, looking at all `DIR_INDEX`
    items for that inode number.  For each item, we can check the scan
    data to see what the filename in that `DIR_INDEX` is, so we can
    see if the item satisfies this want without accessing the disk.
    If there's a match, then there is nothing else to do, so we
    return.  Otherwise, we do that same search over the potential-item
    index; if we find any matches, then we build the set of roots to
    add to the augment queue the same as in `Want`.

 5. `WantFileExt()`: We want 1 or more `DATA_EXTENT` items in the
    given tree for the given inode, and we want them to cover from 0
    to a given size bytes of that file.

    First we walk that range in the item index, to build a list of the
    gaps that we need to fill ("Step 1" in
    `rebuild_wantcb.go:_wantRange()`).  This walk
    (`rebuild_wantcb.go:_walkRange()`) requires knowing the size of
    each file extent; so doing this quickly without hitting disk is
    why we recorded the size of each file extent in our initialization
    step.
	
	Then ("Step 2" in `_wantRange()`) we iterate over each of the
    gaps, and for each gap do a very similar walk (again, by calling
    `_walkRange()`, but this time over the potential-item index.  For
    each file extent we find that has is entirely within the gap, we
    "want" that extent, and move the beginning of of the gap forward
    to the end of that extent.  This algorithm is dumb and greedy,
    potentially making sub-optimal selections; and so could probably
    stand to be improved; but in my real-world use, it seems to be
    "good enough".

 6. `WantCSum()`: We want 1 or more `EXTENT_CSUM` items to cover the
    half-open interval [`lo_logical_addr`, `hi_logical_addr`).  Well,
    maybe.  It also takes a subvolume ID and an inode number; and
    looks up in the scan data whether that inode has the
    `INODE_NODATASUM` flag set; if it does have the flag set, then it
    returns early without looking for any `EXTENT_CSUM` items.  If it
    doesn't return early, then it performs the same want-range routine
    as `WantFileExt`, but with the appropriate tree, object ID, and
    item types for csums as opposed to data extents.

For each of these callbacks, we generate a "wantKey", a tuple
representing the function and its arguments; we check the
augment-queue to see if we've already enqueued a set of roots for that
want, and if so, that callback can return early without checking the
potential-item index.

# 5. Future work

It's in a reasonably useful place, I think; and so now I'm going to
take a break from it for a while.  But there's still lots of work to
do:

 - RAID almost certainly doesn't work.

 - Encryption is not implemented.

 - It doesn't understand (ignores) the `TREE_LOG` (because I don't
   understand the `TREE_LOG`).

 - `btrfs-rec inspect mount` should add "lost+found" directories for
   inodes that are included in the subvolume's tree but aren't
   reachable from the tree's root inode

 - I still need to implement `btrfs-rec repair SUBCMD` subcommands to
   write rebuilt-information from `btrfs-rec inspect` back to the
   filesystem.

 - I need to figure out the error handling/reporting story for
   `mount`.

 - It needs a lot more tests

    + I'd like to get the existing btrfs-progs fsck tests to run on
      it.

 - In the process of writing this email, I realized that I probably
   need to add a retry-tree queue; see the "graph callbacks" section
   in the description of the `rebuild-trees` algorithm above.

 - Shere are a number of "TODO" comments or panics in the code:

    + Some of them definitely need done.

    + Some of them are `panic("TODO")` on the basis that if it's
      seeing something on the filesystem that it doesn't recognize,
      it's probably that I didn't get to implementing that
      thing/situation, but it's possible that the thing is just
      corrupt.  This should only be for situations that the node
      passed the checksum test, so it being corrupt would have to be
      caused by a bug in btrfs rather than a failing drive or other
      corruption; I wasn't too worried about btrfs bugs.

 - `btrfs-rec inspect rebuild-trees` is slow, and can probably be made
   a lot faster.

   Just to give you an idea of the speeds, the run-times for the
   various steps on my ThinkPad E15 for a 256GB disk image are as
   follows:

        btrfs-rec inspect rebuild-mappings scan       :     7m 31s
        btrfs-rec inspect rebuild-mappings list-nodes :        47s
        btrfs-rec inspect rebuild-mappings process    :     8m 22s
        btrfs-rec inspect rebuild-trees               : 1h  4m 55s
        btrfs-rec inspect ls-files                    :    29m 55s
        btrfs-rec inspect ls-trees                    :     8m 40s

   For the most part, it's all single-threaded (with the main
   exception that in several places I/O has been moved to a separate
   thread from the main CPU-heavy thread), but a lot of the algorithms
   could be parallelized.

 - There are a lot of "tunable" values that I haven't really spent
   time tuning.  These are all annotated with `textui.Tunable()`.  I
   sort-of intended for them to be adjustable on the CLI.

 - Perhaps the `btrfs inspect rebuild-trees` algorithm could be
   adjusted to also try to rebuild trees with missing parents; see the
   above discussion of the algorithm.

# 6. Problems for merging this code into btrfs-progs

 - It's written in Go, not C.

 - It's effectively GPLv3+ (not GPLv2-only or GPLv2+) because of use
   of some code under the Apache 2.0 license (2 files in the codebase
   itself that are based off of Apache-licensed code, and use of
   unmodified 3rd-party libraries).

 - It uses ARC (Adaptive Replacement Cache), which is patented by IBM,
   and the patent doesn't expire for another 7 months.  An important
   property of ARC over LRU is that it is scan-resistant; the above
   algorithms do a lot of scanning.  On that note, now that RedHat is
   owned by IBM: who in the company do we need to get to talk to
   eachother so that we can get ARC into the Linux kernel before then?

-- 
Happy hacking,
~ Luke Shumaker