1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
|
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Announcing: btrfs-rec: Recover (data from) a broken btrfs filesystem — Luke Shumaker</title>
<link rel="stylesheet" href="assets/style.css">
<link rel="alternate" type="application/atom+xml" href="./index.atom" name="web log entries"/>
</head>
<body>
<header><a href="/">Luke Shumaker</a> » <a href=/blog>blog</a> » btrfs-rec</header>
<article>
<h1
id="announcing-btrfs-rec-recover-data-from-a-broken-btrfs-filesystem">Announcing:
btrfs-rec: Recover (data from) a broken btrfs filesystem</h1>
<blockquote>
<p>I originally sent this email on 2023-07-10, but it has been caught by
their bogofilter. Yes, it was <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/README.md?id=18e6066c241cf3d252b6521150843ffc858d8434">plaintext</a>.
No, I didn't use GMail. Yes, I've successfully participated in vger
lists in the past. Yes, I've reached out to postmaster; no, I haven't
received a reply yet (as of 2023-07-14).</p>
</blockquote>
<div style="font-family: monospace">
<p>To: linux-btrfs@vger.kernel.org<br/> From: Luke T. Shumaker
<lukeshu@lukeshu.com><br/> Subject: btrfs-rec: Recover (data from)
a broken btrfs filesystem<br/> Date: Mon, 10 Jul 2023 21:23:41
-0600<br/> Message-ID:
<87jzv7uo5e.wl-lukeshu@lukeshu.com><br/></p>
</div>
<p>Inspired by a mis-typed <code>dd</code> command, for the last year
I've been working on a tool for recovering corrupt btrfs filesystems; at
first idly here and there, but more actively in the last few months. I
hope to get it incorporated into btrfs-progs, though perhaps that is
problematic for a few reasons I'll get to. If the code can't be
incorporated into btrfs-progs, at least the ideas and algorithms should
be.</p>
<p><a
href="https://git.lukeshu.com/btrfs-progs-ng/">https://git.lukeshu.com/btrfs-progs-ng/</a></p>
<p>Highlights:</p>
<ul>
<li><p>In general, it's more tolerant of corrupt filesystems than
<code>btrfs check --repair</code>, <code>btrfs rescue</code> or
<code>btrfs restore</code>.</p></li>
<li><p><code>btrfs-rec inspect rebuild-mappings</code> is a better
<code>btrfs rescue chunk-recover</code>.</p></li>
<li><p><code>btrfs-rec inspect rebuild-trees</code> can re-attach lost
branches to broken B+ trees.</p></li>
<li><p><code>btrfs-rec inspect mount</code> is a read-only FUSE
implementation of btrfs. This is conceptually a replacement for
<code>btrfs restore</code>.</p></li>
<li><p>It's entirely written in Go. I'm not saying that's a good thing,
but it's an interesting thing.</p></li>
</ul>
<p>Hopefully some folks will find it useful, or at least neat!</p>
<ul>
<li><a href="#motivation">1. Motivation</a></li>
<li><a href="#overview-of-use">2. Overview of use</a></li>
<li><a href="#prior-art">3. Prior art</a></li>
<li><a href="#internalsdesign">4. Internals/Design</a></li>
<li><a href="#overview-of-the-source-tree-layout">4.1. Overview of the
source tree layout</a></li>
<li><a href="#base-decisions-cli-structure-go-json">4.2. Base decisions:
CLI structure, Go, JSON</a></li>
<li><a href="#algorithms">4.3. Algorithms</a></li>
<li><a href="#the-rebuild-mappings-algorithm">4.3.1. The
<code>rebuild-mappings</code> algorithm</a></li>
<li><a href="#the---rebuild-algorithm">4.3.2. The <code>--rebuild</code>
algorithm</a></li>
<li><a href="#rebuilt-forrest-behavior-looking-up-trees">4.3.2.1.
rebuilt forrest behavior</a></li>
<li><a href="#rebuilt-individual-tree-behavior">4.3.2.2. rebuilt
individual tree behavior</a></li>
<li><a href="#the-rebuild-trees-algorithm">4.3.3. The
<code>rebuild-trees</code> algorithm</a></li>
<li><a href="#initialization">4.3.3.1. initialization</a></li>
<li><a href="#the-main-loop">4.3.3.2. the main loop</a></li>
<li><a href="#graph-callbacks">4.3.3.3. graph callbacks</a></li>
<li><a href="#future-work">5. Future work</a></li>
<li><a href="#problems-with-merging-this-code-into-btrfs">6. Problems
for merging this code into btrfs-progs</a></li>
</ul>
<h1 id="motivation">1. Motivation</h1>
<p>Have you ever ended up with a corrupt btrfs filesystem (through no
fault of btrfs itself, but perhaps a failing drive, or a mistaken
<code>dd</code> invocation)? Surely losing less than 100MB of data from
a drive should not render hundreds of GB of perfectly intact data
unreadable! And yet, the existing tools are unable to even attempt to
read that data:</p>
<pre><code>$ btrfs check --repair --force dump-zero.1.img
enabling repair mode
Opening filesystem to check...
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
ERROR: cannot read chunk root
ERROR: cannot open file system</code></pre>
<p>or</p>
<pre><code>$ btrfs check --init-extent-tree --force dump-zero.1.img
Opening filesystem to check...
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
ERROR: cannot read chunk root
ERROR: cannot open file system</code></pre>
<p>or</p>
<pre><code>$ btrfs check --init-csum-tree --force dump-zero.1.img
Creating a new CRC tree
Opening filesystem to check...
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
ERROR: cannot read chunk root
ERROR: cannot open file system</code></pre>
<p>or</p>
<pre><code>$ btrfs rescue chunk-recover dump-zero.1.img
Scanning: DONE in dev0
corrupt node: root=1 block=160410271744 slot=0, corrupt node: root=1 block=160410271744, nritems too large, have 39 expect range [1,0]
Couldn't read tree root
open with broken chunk error</code></pre>
<p>or</p>
<pre><code>$ btrfs rescue zero-log dump-zero.1.img
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
ERROR: cannot read chunk root
ERROR: could not open ctree</code></pre>
<p>or</p>
<pre><code>$ mkdir out
$ btrfs restore dump-zero.1.img out
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
ERROR: cannot read chunk root
Could not open root, trying backup super
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
ERROR: cannot read chunk root
Could not open root, trying backup super
ERROR: superblock bytenr 274877906944 is larger than device size 256060514304
Could not open root, trying backup super</code></pre>
<p>or</p>
<pre><code>$ btrfs restore --list-roots dump-zero.1.img
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
ERROR: cannot read chunk root
Could not open root, trying backup super
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
ERROR: cannot read chunk root
Could not open root, trying backup super
ERROR: superblock bytenr 274877906944 is larger than device size 256060514304
Could not open root, trying backup super</code></pre>
<p>or</p>
<pre><code>$ btrfs-find-root dump-zero.1.img
WARNING: cannot read chunk root, continue anyway
Superblock thinks the generation is 6596071
Superblock thinks the level is 1</code></pre>
<p>Well, have I got a tool for you!</p>
<p>(FWIW, I also tried manipulating the filesystem and patching to tools
to try to get past those errors, only to get a different set of errors.
Some of these patches I am separately submitting to btrfs-progs.)</p>
<h1 id="overview-of-use">2. Overview of use</h1>
<p>There are two <code>btrfs-rec</code> sub-command groups:
<code>btrfs-rec inspect <var>SUBCMD</var></code> and <code>btrfs-rec
repair <var>SUBCMD</var></code>, and you can find out about various
sub-commands with <code>btrfs-rec help</code>. These are both told about
devices or images with the <code>--pv</code> flag.</p>
<p><code>btrfs-rec inspect <var>SUBCMD</var></code> commands open the
filesystem read-only, and (generally speaking) write extracted or
rebuilt information to stdout. <code>btrfs-rec repair
<var>SUBCMD</var></code> commands open the filesystem read+write, and
consume information from <code>btrfs-rec inspect
<var>SUBCMD</var></code> commands to actually repair the filesystem
(except I haven't actually implemented any <code>repair</code> commands
yet... despite the lack of <code>repair</code> commands, I believe that
<code>btrfs-rec</code> is already a useful because of the
<code>btrfs-rec inspect mount</code> command to get data out of the
broken filesystem). This split allows you to try things without being
scared by WARNINGs about not using these tools unless you're an expert
or have been told to by a developer.</p>
<p>In the broken <code>dump-zero.1.img</code> example above (which has a
perfectly intact superblock, but a totally broken
<code>CHUNK_TREE</code>), to "repair" it I'd:</p>
<ol type="1">
<li><p>Start by using <code>btrfs-rec inspect rebuild-mappings</code> to
rebuild the broken chunk/dev/blockgroup trees:</p>
<pre><code>$ btrfs-rec inspect rebuild-mappings \
--pv=dump-zero.1.img \
> mappings-1.json</code></pre></li>
<li><p>If it only mostly succeeds, but on stderr tells us about a few
regions of the image that it wasn't able to figure out the chunks for.
Using some human-level knowledge, you can write those yourself,
inserting them into the generated <code>mappings.json</code>, and ask
<code>rebuild-mappings</code> to normalize what you wrote:</p>
<pre><code>$ btrfs-rec inspect rebuild-mappings \
--pv=dump-zero.1.img \
--mappings=<(sed <mappings-1.json \
-e '2a{"LAddr":5242880,"PAddr":{"Dev":1,"Addr":5242880},"Size":1},' \
-e '2a{"LAddr":13631488,"PAddr":{"Dev":1,"Addr":13631488},"Size":1},') \
> mappings-2.json</code></pre></li>
<li><p>Now that it has functioning chunk/dev/blockgroup trees, we can
use <code>btrfs-rec inspect rebuild-trees</code> to rebuild other trees
that rely on those:</p>
<pre><code>$ btrfs-rec inspect rebuild-mappings \
--pv=dump-zero.1.img \
--mappings=mappings-2.json \
> trees.json</code></pre></li>
<li><p>Now that (hopefully) everything that was damaged has been
reconstructed, we can use <code>btrfs-rec inspect mount</code> to mount
the filesystem read-only and copy out our data:</p>
<pre><code>$ mkdir mnt
$ sudo btrfs-rec inspect mount \
--pv=dump-zero.1.img \
--mappings=mappings-2.json \
--trees=trees.json \
./mnt</code></pre></li>
</ol>
<p>This example is fleshed out more (and the manual edits to
<code>mappings.json</code> explained more) in <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/examples/main.sh?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./examples/main.sh</code></a>.</p>
<h1 id="prior-art">3. Prior art</h1>
<p>Comparing <code>btrfs-rec inspect mount</code> with the existing
https://github.com/adam900710/btrfs-fuse project:</p>
<ul>
<li>Again, mine has better fault tolerance</li>
<li>Mine is read-only</li>
<li>Mine supports xattrs ("TODO" in Adam's)</li>
<li>Mine supports separate inode address spaces for subvolumes; Adam's
doesn't due to limitations in FUSE, mine works around this by lazily
setting up separate mountpoints for each subvolume (though this does
mean that the process needs to run as root, which is a bummer).</li>
</ul>
<h1 id="internalsdesign">4. Internals/Design</h1>
<h2 id="overview-of-the-source-tree-layout">4.1. Overview of the source
tree layout</h2>
<ul>
<li><p><a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/examples?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>examples/</code></a>
has example scripts showing how to use <code>btrfs-rec</code>.</p></li>
<li><p><a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/btrfs?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>lib/btrfs/</code></a>
is the core btrfs implementation.</p></li>
<li><p><a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/btrfscheck?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>lib/btrfscheck/</code></a>
and <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/btrfsutil?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>lib/btrfsutil/</code></a>
are libraries for "btrfs-progs" type programs, that are userland-y
things that I thought should be separate from the core implementation;
something that frustrated me about libbtrfs was having to figure out "is
this thing here in support of btrfs bits-on-disk, or in support of a
higher-level 'how btrfs-progs wants to think about things'?"</p></li>
<li><p><a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>cmd/btrfs-rec/</code></a>
is where the command implementations live. If a sub-command fits in a
single file, it's
<code>cmd/btrfs-rec/inspect_<var>SUBCMD</var>.go</code>, otherwise, it's
in a separate <code>cmd/btrfs-rec/inspect/<var>SUBCMD</var>/</code>
package.</p></li>
<li><p><a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/textui?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>lib/textui/</code></a>
is reasonably central to how the commands implement a text/CLI
user-interface.</p></li>
<li><p><a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/binstruct?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>lib/binstruct/</code></a>,
<a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/diskio?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>lib/diskio/</code></a>,
and <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/streamio?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>lib/streamio/</code></a>
are non-btrfs-specific libraries related to the problem domain.</p></li>
<li><p><a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/containers?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>lib/containers/</code></a>,
<a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/fmtutil?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>lib/fmtutil/</code></a>,
<a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/maps?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>lib/maps/</code></a>,
<a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/slices?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>lib/slices/</code></a>,
and <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/profile?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>lib/profile/</code></a>
are all generic Go libraries that have nothing to do with btrfs or the
problem domain, but weren't in the Go standard library and I didn't
find/know-of exiting implementations that I liked. Of these, all but
<code>containers</code> are pretty simple utility libraries. Also, some
of these things have been added to the standard library since I started
the project.</p></li>
</ul>
<h2 id="base-decisions-cli-structure-go-json">4.2. Base decisions: CLI
structure, Go, JSON</h2>
<p>I started with trying to enhance btrfs-progs, but ended up writing a
wholy new program in Go, for several reasons:</p>
<ul>
<li><p>writing a new thing: I was having to learn both the btrfs-progs
codebase and how btrfs-bits-on-disk work, and it got to the point that I
decided I should just focus on learning btrfs-bits-on-disk.</p></li>
<li><p>writing a new thing: It was becoming increasingly apparent to me
that it was going to be an uphill-fight of having recovery-tools share
the same code as the main-tools, as the routines used by the main-tools
rightly have validity checks, where recovery-tools want to say "yes, I
know it's invalid, can you give it to me anyway?".</p></li>
<li><p>writing it in not-C: I love me some C, but higher level languages
are good for productivity. And I was trying to write a whole lot of code
at once, I needed a productivity boost.</p></li>
<li><p>writing it in not-C: This forced me to learn btrfs-bits-on-disk
better, instead of just cribbing from btrfs-progs. That knowledge is
particularly important for having ideas on how to deal with corrupt
bits-on-disk.</p></li>
<li><p>writing it in Go: At the time I started, my day job was writing
Go, so I had Go swapped into my brain. And Go still feels close to C but
provides <em>a lot</em> of niceness and safety over C.</p></li>
</ul>
<p>It turned out that Go was perhaps not the best choice, but we'll come
back to that.</p>
<p>I wanted to separate things into a pipeline. For instance: Instead of
<code>btrfs rescue chunk-recover</code> trying to do everything to
rebuild a broken chunk tree, I wanted to separate I/O from computation
from repairs. So I have
<code>btrfs-rec inspect rebuild-mappings scan</code> that reads all the
info necessary to rebuild the chunk tree, then dump that as a 2GB glob
of JSON. Then I can feed that JSON to
<code>btrfs-rec inspect rebuild-mappings process</code> which actually
rebuilds the mappings in the chunk tree, and dumps them as JSON. And
then other commands can consume that <code>mappings.json</code> to use
that instead of trying to read the chunk tree from the actual FS, so
that you don't have to make potentially destructive writes to inspect an
FS with a broken chunk tree, and can inspect it more forensically. Or
then use <code>btrfs-rec repair
<var>SOME_SUBCMD_I_HAVENT_WRITTEN_YET</var></code> to write that chunk
tree in <code>mappings.json</code> back to the filesystem.</p>
<p>(But also, the separate steps thing was useful just so I could
iterate on the algorithms of <code>rebuild-mappings process</code>
separately from having to scan the entire FS)</p>
<p>So, I made the decision that <code>btrfs-rec inspect
<var>SUBCMD</var></code> commands should all only open the FS read-only,
and output their work to a separate file; that writing that info back to
the FS should be separate in <code>btrfs-rec repair
<var>SUBCMD</var></code>.</p>
<p>For connecting those parts of the pipeline, I chose JSON, for a few
reasons:</p>
<ul>
<li><p>I wanted something reasonably human-readable, so that I could
debug it easier.</p></li>
<li><p>I wanted something reasonably human-readable, so that human
end-users could make manual edits; for example, in <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/examples/main.sh?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>examples/main.sh</code></a>
I have an example of manually editing <code>mappings.json</code> to
resolve a region that the algorithm couldn't figure out, but with
knowledge of what caused the corruption a human can.</p></li>
<li><p>I didn't want to invent my own DSL and have to handle writing a
parser. (This part didn't pay off! See below.)</p></li>
<li><p>I wanted something that I thought would have good support in a
variety of languages, so that if Go is problematic for getting things
merged upstream it could be rewritten in C (or maybe Rust?) piece-meal
where each subcommand can be rewritten one at a time.</p></li>
</ul>
<p>It turned out that JSON was perhaps not the best choice.</p>
<p>OK, so: Go and/or JSON maybe being mistakes:</p>
<ul>
<li><p>I spent a lot of time getting the garbage collector to not just
kill performance.</p></li>
<li><p>The <code>btrfs-rec inspect rebuild-mappings
<var>SUBCMD</var></code> subcommands all throw a lot of data through the
JSON encoder/decoder, and I learned that the Go stdlib
<code>encoding/json</code> package has memory use that grows O(n^2)
(-ish? I didn't study the implementation, but that's what the curve
looks like just observing it) on the size of the data being shoved
through it, so I had to go take a break and go write
https://pkg.go.dev/git.lukeshu.com/go/lowmemjson which is a
mostly-drop-in-replacement that tries to be as close-as possible to O(1)
memory use. So I did end up having to write my own parser anyway
:(</p></li>
</ul>
<h2 id="algorithms">4.3. Algorithms</h2>
<p>There are 3 algorithms of note in <code>btrfs-rec</code>, that I
think are worth getting into mainline btrfs-progs even if the code of
<code>btrfs-rec</code> doesn't get in:</p>
<ol type="1">
<li><p>The <code>btrfs-rec inspect rebuild-mappings</code> algoritithm
to rebuild information from the <code>CHUNK_TREE</code>,
<code>DEV_TREE</code>, and <code>BLOCK_GROUP_TREE</code>.</p></li>
<li><p>The <code>btrfs-rec --rebuild</code> algorithm to cope with
reading broken B+ trees.</p></li>
<li><p>The <code>btrfs-rec inspect rebuild-trees</code> algorithm to
re-attach lost branches to broken B+ trees.</p></li>
</ol>
<h3 id="the-rebuild-mappings-algorithm">4.3.1. The
<code>rebuild-mappings</code> algorithm</h3>
<p>(This step-zero scan is
<code>btrfs-rec inspect rebuild-mappings scan</code>, and principally
lives in <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/btrfsutil/scan.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./lib/btrfsutil/scan.go</code></a>
and <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec/inspect/rebuildmappings/scan.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./cmd/btrfs-rec/inspect/rebuildmappings/scan.go</code></a>)</p>
<ol start="0" type="1">
<li>Similar to <code>btrfs rescue chunk-recover</code>, scan each device
for things that look like nodes; keep track of:
<ul>
<li>Checksums of every block on the device</li>
<li>Which physical addresses contain nodes that claim to be at a given
logical addess.</li>
<li>Any found Chunk items, BlockGroup items, DevExtent, and CSum items.
Keep track of the key for each of these, and for CSum items also track
the generation.</li>
</ul></li>
</ol>
<p>Create a bucket of the data from Chunks, DevExtents, and BlockGroups;
since these are mostly a Chunk and a DevExtent+BlockGroup store pretty
much the same information; we can use one to reconstruct the other. How
we "merge" these and handle conflicts is in <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/btrfs/btrfsvol/lvm.go?id=18e6066c241cf3d252b6521150843ffc858d8434#n121"><code>./lib/btrfs/btrfsvol/lvm.go:addMapping()</code></a>,
I don't think this part is particularly clever, but given that
<code>btrfs rescue chunk-recover</code> crashes if it encounters two
overlapping chunks, I suppose I should spell it out:</p>
<ul>
<li><p>A "mapping" is represented as a group of 4 things:</p>
<ul>
<li>logical address</li>
<li>a list of 1 or more physical addresses (device ID and offset)</li>
<li>size, and a Boolean indicator of whether the size is "locked"</li>
<li>block group flags, and a Boolean presence-indicator</li>
</ul></li>
<li><p>Mappings must be merged if their logical or physical regions
overlap.</p></li>
<li><p>If a mapping has a "locked" size, then when merging it may
subsume smaller mappings with unlocked sizes, but its size cannot be
changed; trying to merge a locked-size mapping with another mapping that
is not for a subset region should return an error.</p></li>
<li><p>If a mapping has block group flags present, then those flags may
not be changed; it may only be merged with another mapping that does not
have flags present, or has identical flags.</p></li>
<li><p>When returning an error because of overlapping non-mergeable
mappings, just log an error on stderr and keep going. That's an
important design thing that is different than normal filesystem code; if
there's an error, yeah, detect and notify about it, <strong>but don't
bail out of the whole routine</strong>. Just skip that one item or
whatever.</p></li>
</ul>
<p>Now that we know how to "add a mapping", let's do that:</p>
<p>(The following main-steps are
<code>btrfs-rec inspect rebuild-mappings process</code>, and principally
live in <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec/inspect/rebuildmappings/process.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./cmd/btrfs-rec/inspect/rebuildmappings/process.go</code></a>)</p>
<ol type="1">
<li><p>Add all found Chunks.</p></li>
<li><p>Add all found DevExtents.</p></li>
<li><p>Add a phyical:logical mapping of length nodesize for each node
that was found.</p></li>
<li><p>Any mappings from steps 2 or 3 that are missing blockgroup flags
(that is: they weren't able to be merged with a mapping from step 1),
use the found BlockGroups to fill in those flags.</p></li>
<li><p>Now we'll merge all found CSum items into a map of the sums of
the logical address space. Sort all of the csum items by generation,
then by address. Loop over them in that order, inserting their sums into
the map. If two csum items overlap, but agree about the sums of the
overlapping region, that's fine, just take their union. For overlaps
that disagree, items with a newer generation kick out items with an
older generation. If disagreeing items have the same generation... I
don't think that can happen except by a filesystem bug (i.e. not by a
failing drive or other external corruption), so I wasn't too concerned
about it, so I just log an error on stderr and skip the later-processed
item. See <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec/inspect/rebuildmappings/process_sums_logical.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./cmd/btrfs-rec/inspect/rebuildmappings/process_sums_logical.go</code></a>.</p>
<p>Look at regions of the logical address space that meet all the 3
criteria:</p>
<ul>
<li>we have CSum items for them</li>
<li>we have a BlockGroup for them</li>
<li>we don't have a Chunk/DevExtent mapping them to the pysical address
space.</li>
</ul>
<p>Pair those CSums up with BlockGroups, and for each BlockGroup, search
the list of checksums of physical blocks to try to find a physical
region that matches the logical csums (and isn't already mapped to a
different logical region). I used a Knuth-Morris-Pratt search, modified
to handle holes in the logical csum list as wildcards.</p>
<p>Insert any found mappings into our bucket of mappings.</p></li>
<li><p>Do the same again, but with a fuzzy search (we can re-use the
csum map of the logical address space). My implementation of this is
comparatively time and space intensive; I just walk over the entire
unmapped physical address space, noting what % of match each BlockGroup
has if placed at that location. I keep track of the best 2 matches for
each BlockGroup. If the best match is better than a 50% match, and the
second best is less than a 50% match, then I add the best match. In my
experience, the best match is >90% (or at whatever the maximum
percent is for how much of the BlockGroup has logical sums), and the
second best is 0% or 1%. The point of tracking both is that if there
isn't a clear-cut winner, I don't want it to commit to a potentially
wrong choice.</p></li>
</ol>
<h3 id="the---rebuild-algorithm">4.3.2. The <code>--rebuild</code>
algorithm</h3>
<p>The <code>--rebuild</code> flag is implied by the
<code>--trees=trees.json</code> flag, and triggers an algorithm that
allows "safely" reading from a broken B+ tree, rather than the usual B+
tree lookup and search functions. I probably should have tried to
understand the <code>btrfs restore</code> algorithm, maybe I reinvented
the wheel...</p>
<p>This algorithm requires a list of all nodes on the filesystem; we
find these using the same scan as above (<a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/btrfsutil/scan.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./lib/btrfsutil/scan.go</code></a>),
the same procedure as <code>btrfs rescue chunk-recover</code>.</p>
<p>We walk all of those nodes, and build a reasonably lightweight
in-memory graph of all nodes (<a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/btrfsutil/graph.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./lib/btrfsutil/graph.go</code></a>),
tracking</p>
<ul>
<li>each node's
<ul>
<li>logical address</li>
<li>level</li>
<li>generation</li>
<li>tree</li>
<li>each item's key and size</li>
</ul></li>
<li>each keypointer's
<ul>
<li>source node</li>
<li>source slot within the node</li>
<li>tree of the source node</li>
<li>destination node</li>
<li>destination level implied by the level of the source node</li>
<li>destination key</li>
<li>destination generation</li>
</ul></li>
<li>logical addresses and error messages for nodes that are pointed to
by a keypointer or the superblock, but can't be read (because that
logical address isn't mapped, or it doesn't look like a node,
or...)</li>
<li>an index such that for a given node we can quickly list both
keypointers both originating at that node and pointing to that
node.</li>
</ul>
<h4 id="rebuilt-forrest-behavior-looking-up-trees">4.3.2.1. rebuilt
forrest behavior (looking up trees)</h4>
<p>(see: <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/btrfsutil/rebuilt_forrest.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./lib/btrfsutil/rebuilt_forrest.go</code></a>)</p>
<ul>
<li>The <code>ROOT_TREE</code>, <code>CHUNK_TREE</code>,
<code>TREE_LOG</code>, and <code>BLOCK_GROUP_TREE</code> (the trees
pointed to directy by the superblock) work as you'd expect.</li>
<li>For other trees, we (as you'd expect) look up the root item in the
rebuilt <code>ROOT_TREE</code>, and then (if rootitem.ParentUUID is
non-zero) eagerly also look up the parent tree (recursing on ourself).
We try to use the <code>UUID_TREE</code> tree to help with this, but
fall back to just doing a linear scan over the <code>ROOT_TREE</code>.
If we fail to look up the parent tree (or its parent, or a more distant
ancestor), then (depending on a flag) we either make a note of that, or
error out and fail to look up the child tree. For <code>--rebuild</code>
and <code>--trees=trees.json</code> we are permissive of this error, and
just make note of it; but we'll re-use this algorithm in the
<code>rebuild-trees</code> algorithm below, and it needs the more strict
handling.</li>
<li>When creating the rebuilt individual tree, we start by adding the
root node specified by the superblock/root-item. But we may also add
additional root nodes grafted on to the tree by the
<code>--trees=trees.json</code> flag or by the
<code>rebuild-trees</code> algorithm below. So a tree may have more than
1 root node.</li>
</ul>
<h4 id="rebuilt-individual-tree-behavior">4.3.2.2. rebuilt individual
tree behavior</h4>
<p>(see: <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/btrfsutil/rebuilt_tree.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./lib/btrfsutil/rebuilt_tree.go</code></a>)</p>
<p>In order to read from a tree, we first have to build a few indexes.
We store these indexes in an Adaptive Replacement Cache; they are all
re-buildable based on the tree's list of roots and the above graph; if
we have a bunch of trees we don't need to keep all of this in memory at
once. Note that this is done 100% with the in-memory graph, we don't
need to read anything from the filesystem during these procedures.</p>
<ul>
<li><p>The first index we build is the "node index". This is an index
that for every node tells us what root(s) the tree would need to have in
order for the tree to include that node, and also what the highest item
key would be acceptable in the node if the tree includes that root. We
track both a <code>loMaxItem</code> and a <code>hiMaxItem</code>, in
case the tree is real broken and there are multiple paths from the root
to the node; as these different paths may imply different max-item
constraints. Put more concretely, the type of the index is:</p>
<pre><code>map[ nodeID → map[ rootNodeID → {loMaxItem, hiMaxItem} ] ]</code></pre>
<p>We'll do a loop over the graph, using dynamic-programming memoization
to figure out ordering and avoid processing the same node twice; for
each node we'll</p>
<ul>
<li><p>Check whether the owner-tree is this tree or one of this tree's
ancestors (and if it's an ancestor, that the node's generation isn't
after the point that the child tree was forked from the parent tree). If
not, we are done processing that node (record an empty/nil set of roots
for it).</p></li>
<li><p>Create an empty map of <code>rootID</code> →
{<code>loMaxItem</code>, <code>hiMaxItem</code>}.</p></li>
<li><p>Look at each keypointer that that points at the node and:</p>
<ul>
<li><p>Skip the keypointer if its expectations of the node aren't met:
if the level, generation, and min-key constraints don't match up. If the
keypointer isn't in the last slot in the source node, we also go ahead
and include checking that the destination node's max-key is under the
min-key of the keypointer in the next slot, since that's cheap to do
now.</p></li>
<li><p>Skip the keypointer if its source node's owner-tree isn't this
tree or one of this tree's ancestors (and if it's an ancestor, that the
node's generation isn't after the point that the child tree was forked
from the parent tree).</p></li>
<li><p>dynamic-programming recurse and index the keypointer's source
node.</p></li>
<li><p>for every root that would result in the keypointer's source node
being included in the tree:</p>
<p>. If the keypointer is in the last slot, look at what the what the
source node's last-item constraints would be if that root is included,
and can now check the max-item of our destination node. We check against
the <code>hiMaxItem</code>; as if there is any valid path from the root
to this node, then we want to be permissive and include it. If that
check fails, then we're done with this keypointer. Also, make node of
those <code>loMaxItem</code> and <code>hiMaxItem</code> values, we'll
use them again in just a moment.</p>
<p>. Otherwise, set both <code>loMaxItem</code> and
<code>hiMaxItem</code> to 1-under the min-item of the keypointer in the
next slot.</p>
<p>. Insert that <code>loMaxItem</code> and <code>hiMaxItem</code> pair
into the <code>rootID</code> → {<code>loMaxItem</code>,
<code>hiMaxItem</code>} map we created above. If an entry already exists
for this root (since a broken tree might have multiple paths from the
root to our node), then set <code>loMaxItem</code> to the min of the
existing entry and our value, and <code>hiMaxItem</code> to the
max.</p></li>
</ul></li>
<li><p>If that <code>rootID</code> → {<code>loMaxItem</code>,
<code>hiMaxItem</code>} map is still empty, then consider this node to
be a (potential) root, and insert <code>rootID=thisNode</code> ->
{<code>loMaxItem=maxKey</code>, <code>hiMaxItem=maxKey</code>} (where
<code>maxKey</code> is the maximum value of the key datatype).</p></li>
<li><p>Take that <code>rootID</code> → {<code>loMaxItem</code>,
<code>hiMaxItem</code>} map and insert it into the index as the entry
for this node.</p></li>
</ul></li>
<li><p>The next index we build is the "item index". This is a "sorted
map" (implemented as a red-black tree, supporting sub-range iteration)
of <code>key</code> → {<code>nodeID</code>, <code>slotNumber</code>}; a
map that for each key tells us where to find the item with that key.</p>
<ul>
<li><p>Loop over the node index, and for each node check if both (a) it
has <code>level==0</code> (is a leaf node containing items), and (b) its
set of roots that would include it has any overlap with the tree's set
of roots.</p></li>
<li><p>Loop over each of those included leaf nodes, and loop over the
items in each node. Insert the <code>key</code> → {<code>nodeId</code>,
<code>slot</code>} into our sorted map. If there is already an entry for
that key, decide which one wins by:</p>
<ul>
<li><p>Use the one from the node with the owner-tree that is closer to
this tree; node with owner=thisTree wins over a node with
owner=thisTree.parent, which would win over a node with
owner.thisTree.parent.parent. If that's a tie, then...</p></li>
<li><p>Use the one from the node with the higher generation. If that's a
tie, then...</p></li>
<li><p>I don't know, I have the code <code>panic</code>:</p>
<pre><code>// TODO: This is a panic because I'm not really sure what the
// best way to handle this is, and so if this happens I want the
// program to crash and force me to figure out how to handle it.
panic(fmt.Errorf("dup nodes in tree=%v: old=%v=%v ; new=%v=%v",
tree.ID,
oldNode, tree.forrest.graph.Nodes[oldNode],
newNode, tree.forrest.graph.Nodes[newNode]))</code></pre></li>
</ul></li>
</ul>
<p>Note that this algorithm means that for a given node we may use a few
items from that node, while having other items from that same node be
overridden by another node.</p></li>
<li><p>The final index we build is the "error index". This is an index
of what errors correspond to which range of keys, so that we can report
them, and give an idea of "there may be entries missing from this
directory" and similar.</p>
<p>For each error, we'll track the min-key and max-key of the range it
applies to, the node it came from, and what the error string is. We'll
store these into an interval tree keyed on that min-key/max-key
range.</p>
<ul>
<li><p>Create an empty set <code>nodesToProcess</code>. Now populate
it:</p>
<ul>
<li><p>Once again, we'll loop over the node index, but this time we'll
only check that there's overlap between the set of roots that would
include the node and the tree's set of roots. The nodes that are
included in this tree, insert both that node itself and all node IDs
that it has keypointers pointing to into the <code>nodesToProcess</code>
set.</p></li>
<li><p>Also insert all of the tree's roots into
<code>nodesToProcess</code>; this is in case the superblock/root-item
points to an invalid node that we couldn't read.</p></li>
</ul></li>
<li><p>Now loop over <code>nodesToProcess</code>. For each node, create
an empty list of errors. Use the keypointers pointing to and the min
<code>loMaxItem</code> from the node index to construct a set of
expectations for the node; this should be reasonably straight-forward,
given:</p>
<ul>
<li><p>If different keypointers have disagreeing levels, insert an error
in to the list, and don't bother with checking the node's
level.</p></li>
<li><p>If different keypointers have disagreeing generations, insert an
error in to the list, and don't bother with checking the node's
generation.</p></li>
<li><p>If different keypointers have different min-item expectations,
use the max of them.</p></li>
</ul>
<p>Then:</p>
<ul>
<li>If the node is a "bad node" in the graph, insert the error message
associated with it. Otherwise, check those expectations against the node
in the graph.</li>
</ul>
<p>If the list of error messages is non-empty, then insert their
concatenation into the interval tree, with the range set to the min of
the min-item expectations from the keypointers through the max of the
<code>hiMaxItem</code>s from the node index. If the min min-item
expectation turns out to be higher than the max <code>hiMaxItem</code>,
then set the range to the zero-key through the max-key.</p></li>
</ul></li>
</ul>
<p>From there, it should be trivial to implement the usual B+ tree
operations using those indexes; exact-lookup using the item index, and
range-lookups and walks using the item index together with the error
index. Efficiently searching the <code>CSUM_TREE</code> requires knowing
item sizes, so that's why we recorded the item sizes into the graph.</p>
<h3 id="the-rebuild-trees-algorithm">4.3.3. The
<code>rebuild-trees</code> algorithm</h3>
<p>The <code>btrfs inspect rebuild-trees</code> algorithm finds nodes to
attach as extra roots to trees. I think that conceptually it's the the
simplest of the 3 algorithms, but turned out to be the hardest to get
right. So... maybe more than the others reference the source code too
(<a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec/inspect/rebuildtrees?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./cmd/btrfs-rec/inspect/rebuildtrees/</code></a>)
because I might forget some small but important detail.</p>
<p>The core idea here is that we're just going to walk each tree,
inspecting each item in the tree, and checking for any items that are
implied by other items (e.g.: a dir entry item implies the existence of
inode item for the inode that it points at). If an implied item is not
in the tree, but is in some other node, then we look at which potential
roots we could add to the tree that would add that other node. Then,
after we've processed all of the items in the filesystem, we go add
those various roots to the various trees, keeping track of which items
are added or updated. If any of those added/updated items have a version
with a newer generation on a different node, see what roots we could add
to get that newer version. Then add those roots, keeping track of items
that are added/updated. Once we reach steady-state with the newest
version of each item has been added, loop back and inspect all
added/updated items for implied items, keeping track of roots we could
add. Repeat until a steady-state is reached.</p>
<p>There are lots of little details in that process, some of which are
for correctness, and some of which are for "it should run in hours
instead of weeks."</p>
<h4 id="initialization">4.3.3.1. initialization</h4>
<p>First up, we're going to build and in-memory graph, same as above.
But this time, while we're reading the nodes to do that, we're also
going to watch for some specific items and record a few things about
them.</p>
<p>(see: <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec/inspect/rebuildtrees/scan.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./cmd/btrfs-rec/inspect/rebuildtrees/scan.go</code></a>)</p>
<p>For each {<code>nodeID</code>, <code>slotNumber</code>} pair that
matches one of these item types, we're going to record:</p>
<ul>
<li>flags:
<ul>
<li><code>INODE_ITEM</code>s: whether it has the
<code>INODE_NODATASUM</code> flag set</li>
</ul></li>
<li>names:
<ul>
<li><code>DIR_INDEX</code> items: the file's name</li>
</ul></li>
<li>sizes:
<ul>
<li><code>EXTENT_CSUM</code> items: the number of bytes that this is a
sum for (i.e. the item size over the checksum size, times the block
size)</li>
<li><code>EXTENT_DATA</code> items: the number of bytes in this extent
(i.e. either the item size minus
<code>offsetof(btrfs_file_extent_item.disk_bytenr)</code> if
<code>FILE_EXTENT_INLINE</code>, or else the item's
<code>num_bytes</code>).</li>
</ul></li>
<li>data backrefs:
<ul>
<li><code>EXTENT_ITEM</code>s and <code>METADATA_ITEM</code>s: a list of
the same length as the number of refs embedded in the item; for embeded
ExtentDataRefs, the list entry is the subvolume tree ID that the
ExtentDataRef points at, otherwise it is zero.</li>
<li><code>EXTENT_DATA_REF</code> items: a list of length 1, with the
sole member being the subvolume tree ID that the ExtentDataRef points
at.</li>
</ul></li>
</ul>
<h4 id="the-main-loop">4.3.3.2. the main loop</h4>
<p>(see: <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec/inspect/rebuildtrees/rebuild.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./cmd/btrfs-rec/inspect/rebuildtrees/rebuild.go</code></a>)</p>
<p>Start with that scan data (graph + info about items), and also a
rebuilt forrest from the above algorithm, but with:</p>
<ul>
<li><p>the flag set so that it refuses to look up a tree if it can't
look up all of that tree's ancestors</p></li>
<li><p>an additional "potential-item index" that is similar to the item
index. It is generated the same way and can cache/evict the same way;
the difference is that we invert the check for if the set of roots for a
node has overlap with the tree's set of roots; we're looking for
<em>potential</em> nodes that we could add to this tree.</p></li>
<li><p>some callbacks; we'll get to what we do in these callbacks in a
bit, but for now, what the callbacks are:</p>
<ul>
<li><p>a callback that is called for each added/updated item when we add
a root.</p></li>
<li><p>a callback that is called whenever we add a root</p></li>
<li><p>a callback that intercepts looking up a root item</p></li>
<li><p>a callback that intercepts resolving an UUID to an object
ID.</p></li>
</ul></li>
</ul>
<p>(The callbacks are in <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec/inspect/rebuildtrees/rebuild_treecb.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./cmd/btrfs-rec/inspect/rebuildtrees/rebuild_treecb.go</code></a>)</p>
<p>We have 5 unordered queues ("work lists"?); these are sets that when
it's time to drain them we'll sort the members and process them in that
order.</p>
<ol type="1">
<li>the tree queue: a list of tree IDs that we need to crawl</li>
<li>the retry-item queue: for each tree ID, a set of items that we
should re-process if we add a root to that tree</li>
<li>the added-item queue: a set of key/tree pairs identifying items that
have been added by adding a root to a tree</li>
<li>the settled-item-queue: a set of key/tree pairs that have have not
just been added by adding a root, but we've also verified that they are
the newest-generation item with that key that we could add to the
tree.</li>
<li>the augment queue: for each item that we want to add to a tree, the
list of roots that we could add to get that item.</li>
</ol>
<p>The roots all start out empty, except for the tree queue, which we
seed with the <code>ROOT_TREE</code>, the <code>CHUNK_TREE</code>, and
the <code>BLOCK_GROUP_TREE</code> (It is a "TODO" task that it should
probably also be seeded with the <code>TREE_LOG</code>, but as I will
say below in the "future work" section, I don't actually understand the
<code>TREE_LOG</code>, so I couldn't implement it).</p>
<p>Now we're going to loop until the tree queue, added-item queue,
settled-item queue, and augment queue are all empty (all queues except
for the retry-item queue). Each loop "pass" has 3 substeps:</p>
<ol type="1">
<li><p>Crawl the trees (drain the tree queue, fill the added-item
queue).</p></li>
<li><p>Either:</p>
<ol type="a">
<li><p>if the added-item queue is non-empty: "settle" those items (drain
the added-item queue, fill the augment queue and the settled-item
queue).</p></li>
<li><p>otherwise: process items (drain the settled-item queue, fill the
augment queue and the tree queue)</p></li>
</ol></li>
<li><p>Apply augments (drain the augment queue and maybe the retry-item
queue, fill the added-item queue).</p></li>
</ol>
<p>OK, let's look at those 3 substeps in more detail:</p>
<ol type="1">
<li><p>Crawl the trees; drain the tree queue, fill the added-item
queue.</p>
<p>We just look up the tree in the rebuilt forrest, which will (per the
above <code>--rebuild</code> algorithm) will either fail to look up the
tree, or succeed, and add to that tree the root node from the
superblock/root-item. Because we set an item-added callback, when adding
that root it will loop over the nodes added by that root, and call our
callback for each item in one of the added nodes. Our callback inserts
each item into the added-item queue. The forrest also calls our
root-added callback, but because of the way this algorithm works, that
turns out to be a no-op at this step.</p>
<p>I mentioned that we added callbacks to intercept the forrest's
looking up of root items and resolving UUIDs; we override the forrest's
"lookup root item" routine and "resolve UUID" routine to instead of
doing normal lookups on the <code>ROOT_TREE</code> and
<code>UUID_TREE</code>, use the above <code>Want<var>XXX</var></code>
routines that we'll define below in the "graph callbacks" section.</p>
<p>It shouldn't matter what order this queue is processed in, but I sort
tree IDs numerically.</p>
<p>The crawling is fairly fast because it's just in-memory, the only
accesses to disk are looking up root items and resolving UUIDs.</p></li>
<li><p>Either:</p>
<ol type="a">
<li><p>Settle items from the added-item queue to the settled-item queue
(and fill the augment queue).</p>
<p>For each item in the queue, we look in the tree's item index to get
the {node, slot} pair for it, then we do the same in the tree's
potential-item index. If the potential-item index contains an entry for
the item's key, then we check if the potential-item's node should "win"
over the queue item's node, deciding the "winner" using the same routine
as when building the item index. If the potential-item's node wins, then
we add the potential node's set of roots to the augment queue. If the
queue-item's node wins, then we add the item to the settled-item queue
(except, as an optimization, if the item is of a type that cannot
possibly imply the existence of another item, then we just drop it and
don't add it to the settled-item queue).</p>
<p>It shouldn't matter what order this queue is processed in, but I sort
it numerically by treeID and then by item key.</p>
<p>This step is fairly fast because it's entirely in-memory, making no
accesses to disk.</p></li>
<li><p>Process items from the settled-item queue (drain the settled-item
queue, fill the augment queue and the tree queue).</p>
<p>This step accesses disk, and so the order we process the queue in
turns out to be pretty important in order to keep our disk access
patterns cache-friendly. For the most part, we just sort each queue item
by tree, then by key. But, we have special handling for
<code>EXTENT_ITEM</code>s, <code>METADATA_ITEM</code>s, and
<code>EXTENT_DATA_REF</code> items: We break <code>EXTENT_ITEM</code>s
and <code>METADATA_ITEM</code>s in to "sub-items", treating each ref
embedded in them as a separate item. For those embedded items that are
<code>EXTENT_DATA_REF</code>s, and for stand-alone
<code>EXTENT_DATA_REF</code> items, we sort them not with the
<code>EXTENT_TREE</code> items, but with the items of the tree that the
extent data ref points at. Recall that during the intitial scan step, we
took note of which tree every extent data ref points at, so we can
perform this sort without accessing disk yet. This splitting does mean
that we may visit/read an <code>EXTENT_ITEM</code> or
<code>METADATA_ITEM</code> multiple times as we process the queue, but
to do otherwise is to solve MinLA, which is NP-hard and also an optimal
MinLA solution I still think would perform worse than this; there is a
reasonably lengthy discussion of this in a comment in <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec/inspect/rebuildtrees/rebuild.go?id=18e6066c241cf3d252b6521150843ffc858d8434#n251"><code>./cmd/btrfs-rec/inspect/rebuildtrees/rebuild.go:sortSettledItemQueue()</code></a>.</p>
<p>Now we loop over that sorted queue. In the code, this loop is
deceptively simple. Read the item, then pass it to a function that tells
us what other items are implied by it. That function is large, but
simple; it's just a giant table. The trick is how it tells us about
implied items; we give it set of callbacks that it calls to tell us
these things; the real complexity is in the callbacks. These "graph
callbacks" will be discussed in detail below, but as an illustrative
example: It may call <code>.WantOff()</code> with a tree ID, object ID,
item type, and offset to specify a precise item that it believes should
exist.</p>
<p>If we encounter a <code>ROOT_ITEM</code>, add the tree described by
that item to the tree queue.</p></li>
</ol>
<p>(Both the "can this item even imply the existence of another item"
check and the "what items are implied by this item" routine are in <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/btrfscheck/graph.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./lib/btrfscheck/graph.go</code></a>)</p></li>
<li><p>Apply augments; drain the augment queue (and maybe the retry-item
queue), fill the added-item queuee.</p>
<p>It is at this point that I call out that the augment queue isn't
implemented as a simple map/set like the others, the
<code>treeAugmentQueue struct</code> has special handling for sets of
different sizes; optimizing the space for empty and len()==1 sized sets,
and falling back to normal the usual implementation for larger sets;
this is important because those small sets are the overwhelming
majority, and otherwise there's no way the program would be able to run
on my 32GB RAM laptop. Now that I think about it, I bet it would even be
worth it to add optimized storage for len()==2 sized sets.</p>
<p>The reason is that each "want" from above is tracked in the queue
separately; if we were OK merging them, then this optimized storage
wouldn't be nescessary. But we keep them separate, so that:</p>
<ul>
<li><p>For all "wants", including ones with empty sets, graph callbacks
can check if a want has already been processed; avoiding re-doing any
work (see the description of the graph callbacks below).</p></li>
<li><p>For "wants" with non-empty sets, we can see how many different
"wants" could be satisfied with a given root, in order to decide which
root to choose.</p></li>
</ul>
<p>Anyway, we loop over the trees in the augment queue. For each tree we
look at that tree's augment queue and look at all the choices of root
nodes to add (below), and decide on a list to add. The we add each of
those roots to the tree; the adding of each root triggers several calls
to our item-added callback (filling the added-item queue), and our
root-added callback. The root-added callback moves any items from the
retry-item queue for this tree to the added-item queue.</p>
<p>How do we decide between choices of root nodes to add? <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec/inspect/rebuildtrees/rebuild.go?id=18e6066c241cf3d252b6521150843ffc858d8434#n528"><code>./cmd/btrfs-rec/inspect/rebuildtrees/rebuild.go:resolveTreeAugments()</code></a>
has a good comment explaining the criteria we'd like to optimize for,
and then code that does an OK-ish job of actually optimizing for
that:</p>
<ul>
<li><p>It loops over the augment queue for that tree, building a list of
possible roots, for each possible root making note of 3 things:</p>
<ol type="a">
<li><p>how many "wants" that root satisfies,</p></li>
<li><p>how far from treee the root's owner is (owner=tree is a distance
of 0, owner=tree.parent is a distance of 1, owner=tree.parent.parent is
a distance of 2, and so on), and</p></li>
<li><p>what the generation of that root is.</p></li>
</ol></li>
<li><p>We sort that list first by highest-count-first, then by
lowest-distance-first, then by highest-generation-first.</p></li>
<li><p>We create a "return" set and an "illegal" set. We loop over the
sorted list; for each possible root if it is in the illegal set, we skip
it, otherwise we insert it into the return set and for each "want" that
includes this root we all all roots that satisfy that want to the
illegal list.</p></li>
</ul></li>
</ol>
<p>It is important that the rebuilt forrest have the flag set so that it
refuses to look up a tree if it can't look up all of that tree's
ancestors; otherwise the potential-items index would be garbage as we
wouldn't have a good idea of which nodes are OK to consider; but this
does have the downside that it won't even attempt to improve a tree with
a missing parent. Perhaps the algorithm should flip the flag once the
loop terminates, and then re-seed the tree queue with each
<code>ROOT_ITEM</code> from the <code>ROOT_TREE</code>?</p>
<h4 id="graph-callbacks">4.3.3.3. graph callbacks</h4>
<p>(see: <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec/inspect/rebuildtrees/rebuild_wantcb.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./cmd/btrfs-rec/inspect/rebuildtrees/rebuild_wantcb.go</code></a>)</p>
<p>The graph callbacks are what tie the above together.</p>
<p>For each of these callbacks, whenever I say that it looks up
something in a tree's item index or potential-item index, that implies
looking the tree up from the forrest; if the forrest cannot look up that
tree, then the callback returns early, after either:</p>
<ul>
<li><p>if we are in substep 1 and are processing a tree: we add the tree
that is being processed to the tree queue. (TODO: Wait, this assumes
that an augment will be applied to the <code>ROOT_TREE</code> before the
next pass... if that isn't the case, this will result in the loop never
terminating... I guess I need to add a separate retry-tree
queue?)</p></li>
<li><p>if we are in substep 2 and are processing an item: we add the
item that is being processed to the retry-item queue for the tree that
cannot be looked up</p></li>
</ul>
<p>The 6 methods in the <code>brfscheck.GraphCallbacks</code> interface
are:</p>
<ol type="1">
<li><p><code>FSErr()</code>: There's an error with the filesystem; this
callback just spits it out on stderr. I say such a trivial matter
because, again, for a recovery tool I think it's worth putting care in
to how you handle errors and where you expect them: We expect them here,
so we have to check for them to avoid reading invalid data or whatever,
but we don't actually need to do anything other than watch our
step.</p></li>
<li><p><code>Want()</code>: We want an item in a given tree with a given
object ID and item type, but we don't care about what the item's offset
is.</p>
<p>The callback works by searching the item index to see if it can find
such an item; if so, it has nothing else to do and returns. Otherwise,
it searches the potential-item index; for each matching item it finds it
looks in the node index for the node containing that item, and adds the
roots that would add that node, and adds those roots to a set. Once it
has finished searching the potential-item index, it adds that set to the
augment queue (even if that set is still empty).</p></li>
<li><p><code>WantOff()</code>: The same, but we want a specific
offset.</p></li>
<li><p><code>WantDirIndex()</code>: We want a <code>DIR_INDEX</code>
item for a given inode and filename, but we don't know what the offset
of that item is.</p>
<p>First we scan over the item index, looking at all
<code>DIR_INDEX</code> items for that inode number. For each item, we
can check the scan data to see what the filename in that
<code>DIR_INDEX</code> is, so we can see if the item satisfies this want
without accessing the disk. If there's a match, then there is nothing
else to do, so we return. Otherwise, we do that same search over the
potential-item index; if we find any matches, then we build the set of
roots to add to the augment queue the same as in
<code>Want</code>.</p></li>
<li><p><code>WantFileExt()</code>: We want 1 or more
<code>DATA_EXTENT</code> items in the given tree for the given inode,
and we want them to cover from 0 to a given size bytes of that file.</p>
<p>First we walk that range in the item index, to build a list of the
gaps that we need to fill ("Step 1" in <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec/inspect/rebuildtrees/rebuild_wantcb.go?id=18e6066c241cf3d252b6521150843ffc858d8434#n260"><code>rebuild_wantcb.go:_wantRange()</code></a>).
This walk (<a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec/inspect/rebuildtrees/rebuild_wantcb.go?id=18e6066c241cf3d252b6521150843ffc858d8434#n195"><code>rebuild_wantcb.go:_walkRange()</code></a>)
requires knowing the size of each file extent; so doing this quickly
without hitting disk is why we recorded the size of each file extent in
our initialization step.</p>
<p>Then ("Step 2" in <code>_wantRange()</code>) we iterate over each of
the gaps, and for each gap do a very similar walk (again, by calling
<code>_walkRange()</code>, but this time over the potential-item index.
For each file extent we find that has is entirely within the gap, we
"want" that extent, and move the beginning of of the gap forward to the
end of that extent. This algorithm is dumb and greedy, potentially
making sub-optimal selections; and so could probably stand to be
improved; but in my real-world use, it seems to be "good
enough".</p></li>
<li><p><code>WantCSum()</code>: We want 1 or more
<code>EXTENT_CSUM</code> items to cover the half-open interval
[<code>lo_logical_addr</code>, <code>hi_logical_addr</code>). Well,
maybe. It also takes a subvolume ID and an inode number; and looks up in
the scan data whether that inode has the <code>INODE_NODATASUM</code>
flag set; if it does have the flag set, then it returns early without
looking for any <code>EXTENT_CSUM</code> items. If it doesn't return
early, then it performs the same want-range routine as
<code>WantFileExt</code>, but with the appropriate tree, object ID, and
item types for csums as opposed to data extents.</p></li>
</ol>
<p>For each of these callbacks, we generate a "wantKey", a tuple
representing the function and its arguments; we check the augment-queue
to see if we've already enqueued a set of roots for that want, and if
so, that callback can return early without checking the potential-item
index.</p>
<h1 id="future-work">5. Future work</h1>
<p>It's in a reasonably useful place, I think; and so now I'm going to
take a break from it for a while. But there's still lots of work to
do:</p>
<ul>
<li><p>RAID almost certainly doesn't work.</p></li>
<li><p>Encryption is not implemented.</p></li>
<li><p>It doesn't understand (ignores) the <code>TREE_LOG</code>
(because I don't understand the <code>TREE_LOG</code>).</p></li>
<li><p><code>btrfs-rec inspect mount</code> should add "lost+found"
directories for inodes that are included in the subvolume's tree but
aren't reachable from the tree's root inode</p></li>
<li><p>I still need to implement <code>btrfs-rec repair
<var>SUBCMD</var></code> subcommands to write rebuilt-information from
<code>btrfs-rec inspect</code> back to the filesystem.</p></li>
<li><p>I need to figure out the error handling/reporting story for
<code>mount</code>.</p></li>
<li><p>It needs a lot more tests</p>
<ul>
<li>I'd like to get the existing btrfs-progs fsck tests to run on
it.</li>
</ul></li>
<li><p>In the process of writing this email, I realized that I probably
need to add a retry-tree queue; see the "graph callbacks" section in the
description of the <code>rebuild-trees</code> algorithm above.</p></li>
<li><p>Shere are a number of "TODO" comments or panics in the code:</p>
<ul>
<li><p>Some of them definitely need done.</p></li>
<li><p>Some of them are <code>panic("TODO")</code> on the basis that if
it's seeing something on the filesystem that it doesn't recognize, it's
probably that I didn't get to implementing that thing/situation, but
it's possible that the thing is just corrupt. This should only be for
situations that the node passed the checksum test, so it being corrupt
would have to be caused by a bug in btrfs rather than a failing drive or
other corruption; I wasn't too worried about btrfs bugs.</p></li>
</ul></li>
<li><p><code>btrfs-rec inspect rebuild-trees</code> is slow, and can
probably be made a lot faster.</p>
<p>Just to give you an idea of the speeds, the run-times for the various
steps on my ThinkPad E15 for a 256GB disk image are as follows:</p>
<pre><code> btrfs-rec inspect rebuild-mappings scan : 7m 31s
btrfs-rec inspect rebuild-mappings list-nodes : 47s
btrfs-rec inspect rebuild-mappings process : 8m 22s
btrfs-rec inspect rebuild-trees : 1h 4m 55s
btrfs-rec inspect ls-files : 29m 55s
btrfs-rec inspect ls-trees : 8m 40s</code></pre>
<p>For the most part, it's all single-threaded (with the main exception
that in several places I/O has been moved to a separate thread from the
main CPU-heavy thread), but a lot of the algorithms could be
parallelized.</p></li>
<li><p>There are a lot of "tunable" values that I haven't really spent
time tuning. These are all annotated with <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/textui/tunable.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>textui.Tunable()</code></a>.
I sort-of intended for them to be adjustable on the CLI.</p></li>
<li><p>Perhaps the <code>btrfs inspect rebuild-trees</code> algorithm
could be adjusted to also try to rebuild trees with missing parents; see
the above discussion of the algorithm.</p></li>
</ul>
<h1 id="problems-for-merging-this-code-into-btrfs-progs">6. Problems for
merging this code into btrfs-progs</h1>
<ul>
<li><p>It's written in Go, not C.</p></li>
<li><p>It's effectively GPLv3+ (not GPLv2-only or GPLv2+) because of use
of some code under the Apache 2.0 license (2 files in the codebase
itself that are based off of Apache-licensed code, and use of unmodified
3rd-party libraries).</p></li>
<li><p>It uses ARC (Adaptive Replacement Cache), which is patented by
IBM, and the patent doesn't expire for another 7 months. An important
property of ARC over LRU is that it is scan-resistant; the above
algorithms do a lot of scanning. On that note, now that RedHat is owned
by IBM: who in the company do we need to get to talk to eachother so
that we can get ARC into the Linux kernel before then?</p></li>
</ul>
<div style="font-family: monospace">
<p>-- <br/> Happy hacking,<br/> ~ Luke Shumaker<br/></p>
</div>
</article>
<footer>
<p>The content of this page is Copyright © 2023 <a href="mailto:lukeshu@sbcglobal.net">Luke Shumaker</a>.</p>
<p>This page is licensed under the <a href="https://creativecommons.org/licenses/by-sa/3.0/">CC BY-SA-3.0</a> license.</p>
</footer>
</body>
</html>
|