# Copyright 2024 Wolfgang Hoschek AT mac DOT com
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
15"""Caching snapshot metadata to minimize 'zfs list -t snapshot' calls. 

16 

17Purpose 

18======= 

19The ``--cache-snapshots`` mode speeds up snapshot scheduling, replication, and monitoring by storing just enough 

20metadata in fast local inodes (no external DB, no daemon). Instead of repeatedly invoking costly 

21``zfs list -t snapshot ...`` across potentially thousands or even millions of datasets, we keep tiny (i.e. empty) 

22per-dataset files whose inode atime/mtime atomically encode what we need to know. This reduces latency, load on ZFS, and 

23network chatter, while remaining dependency free and robust under crashes or concurrent runs. 

24 

Correctness scope and analysis guidance
=======================================
The cache files described here are an internal implementation detail of the ``--cache-snapshots`` feature. The supported
contract is that only ``bzfs`` itself creates and updates these files via its documented CLI flows; third-party tools or
manual edits under ``.cache/mods`` are outside the design and correctness model.

From a correctness perspective, the *current* design is deliberately fail-safe under *benign* external modifications:

- Deleting any subset of cache files or directories (including wiping the entire ``.cache`` tree) is safe. ``bzfs`` will
  simply fall back to live ``zfs list -t snapshot`` and repopulate caches conservatively.
- Touching existing cache files so their timestamps move away from the true ZFS properties (for example via ``touch`` or
  copying cache files without preserving timestamps) cannot cause incorrect behavior. Such changes only make cache
  entries look stale; all ``bzfs`` consumers re-validate cached timestamps against live ZFS properties plus maturity
  checks and treat any mismatch as a cache miss, at the cost of extra ZFS I/O and latency but never weakening safety.

The cache directories themselves are created with strict Unix permissions: ``--log-dir`` and its subdirectories (including
``.cache``) use ``DIR_PERMISSIONS`` (``rwx------``), and cache files are written with ``FILE_PERMISSIONS`` (``rw-------``),
with the process-wide ``UMASK`` set accordingly. As a result, only the same Unix user account (and root) can read or
modify the cache tree; other users and groups have no access by default. Analyses shall therefore not assume arbitrary
hostile writes by other unprivileged users on the same system.

When reasoning about correctness, or when searching for bugs, analyses shall be restricted to states that can be reached
via real use of the public CLIs plus standard ZFS operations. Fabricated scenarios that assume arbitrary or adversarial
rewrites of cache files (for example, scripts that deliberately write hand-crafted timestamps into ``.cache/mods`` or
invoke internal helpers directly) are explicitly out of scope and shall not be treated as bugs in ``--cache-snapshots``.
Under the stated assumptions and invariants below, the fast paths never weaken the safety guarantees of snapshot
creation, replication, or monitoring; at worst they cause extra ``zfs list -t snapshot`` calls.

Assumptions
===========
- OpenZFS >= 2.2 provides two key UTC times with integer-second resolution: ``snapshots_changed`` (dataset level)
  and snapshot ``creation`` (snapshot level).
  - ``snapshots_changed`` specifies the UTC time at which a snapshot for a dataset was last created or deleted.
    See https://openzfs.github.io/openzfs-docs/man/7/zfsprops.7.html#snapshots_changed
  - ``creation`` specifies the UTC time the snapshot was created.
    See https://openzfs.github.io/openzfs-docs/man/master/7/zfsprops.7.html#creation
- Unix atime/mtime are reliable to read and atomically updatable.
- Multiple jobs may touch the same cache files concurrently and out of order. Correctness must rely on per-file locking
  plus monotonicity guards rather than global serialization or a single writer model.
- System clocks may differ by small skews across hosts; equal-second races can happen. We gate "freshness" with a small
  maturity time threshold (``MATURITY_TIME_THRESHOLD_SECS``) before trusting a value as authoritative.

Design Rationale
================
We intentionally encode only minimal invariants into inode timestamps and not arbitrary text payloads. This keeps I/O
tiny and allows safe, atomic, low-latency updates via a single ``utime`` call under an exclusive advisory file lock
(``flock(2)``).

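At its core, each cache write is just "lock the inode, then stamp it". The following is a minimal sketch of that
primitive; ``stamp`` here is purely illustrative, and the production helper, ``set_last_modification_time()`` at the
bottom of this module, layers the monotonic guard, exclusive-create handling, and an ownership check on top of it::

    import fcntl
    import os

    def stamp(path: str, atime: int, mtime: int) -> None:
        fd = os.open(path, os.O_WRONLY | os.O_NOFOLLOW | os.O_CLOEXEC)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX)  # exclusive advisory lock; auto-released when the fd is closed
            os.utime(fd, times=(atime, mtime))  # one atomic inode metadata write
        finally:
            os.close(fd)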

Cache root and hashed path segments
-----------------------------------
The cache tree lives under ``<log_parent_dir>/.cache/mods`` (see ``LogParams.last_modified_cache_dir``). To keep paths
short and safe, variable path segments are stored as URL-safe base64-encoded SHA-256 digests without padding. In what
follows, ``hash(X)`` denotes ``sha256_urlsafe_base64(str(X), padding=False)`` or a truncated variant used for brevity.

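For illustration only, such a digest is assumed to be computable with the standard library roughly like so (the real
helper is ``sha256_urlsafe_base64()`` from ``bzfs_main.util.utils`` and may differ in details such as truncation)::

    import base64
    import hashlib

    def hash_segment(text: str) -> str:  # hypothetical stand-in for sha256_urlsafe_base64(text, padding=False)
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")  # drop '=' padding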

The cache consists of four families:
------------------------------------
1) Dataset-level ("=") per dataset and location (src or dst); for --create-src-snapshots, --replicate, --monitor-snapshots
   - Path: ``<cache_root>/<hash(user@host[#port])>/<hash(dataset)>/=``
   - mtime: the ZFS ``snapshots_changed`` time observed for that dataset. Monotonic writes only.
   - Used by: snapshot scheduler, replicate, monitor - as the anchor for cache equality checks.

2) Replication-scoped ("==") per source dataset and destination dataset+filters; for --replicate
   - Path: ``<cache_root>/<hash(src_user@host[#port])>/<hash(src_dataset)>/==/<hash(dst_user@host[#port])>/<hash(dst_dataset)>/<hash(filters)>``
   - Path label encodes destination namespace, destination dataset and the snapshot-filter hash.
   - mtime: last replicated source ``snapshots_changed`` for that destination and filter set. Monotonic.
   - Used by: replicate - to cheaply decide "src unchanged since last successful run to this dst+filters".

3) Monitor ("===") per dataset and label (Latest/Oldest); for --monitor-snapshots
   - Path: ``<cache_root>/<hash(user@host[#port])>/<hash(dataset)>/===/<kind>/<hash(notimestamp_label)>/<hash(alert_plan)>``
   - ``kind``: alert check mode; either "L" (Latest) or "O" (Oldest).
   - ``hash(alert_plan)``: stable digest over the monitor alert plan to scope caches per plan.
   - atime: creation time of the relevant latest/oldest snapshot.
   - mtime: dataset ``snapshots_changed`` observed when that creation was recorded. Monotonic.
   - Used by: monitor - to alert on stale snapshots without listing them every time.

4) Snapshot scheduler per-label files (under the source dataset); for --create-src-snapshots
   - Path: ``<cache_root>/<hash(src_user@host[#port])>/<hash(src_dataset)>/<hash(notimestamp_label)>``
   - atime: creation time of the latest snapshot matching that label.
   - mtime: the dataset-level "=" value at the time of the write (i.e., the then-current ``snapshots_changed``).
   - Used by: ``--create-src-snapshots`` - to cheaply decide whether a label is due without ``zfs list -t snapshot``.

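Given these layouts, building a concrete path is just a join of hashed segments under the cache root. The following is a
sketch that mirrors ``SnapshotCache.last_modified_cache_file()`` below for families 1) and 4), with ``cache_root``
standing in for ``LogParams.last_modified_cache_dir`` and ``label``, if given, being an already-hashed per-label leaf::

    import os
    from bzfs_main.util.utils import sha256_urlsafe_base64

    def cache_file_path(cache_root: str, userhost: str, dataset: str, label: str | None = None) -> str:
        leaf = "=" if label is None else label  # "=" is DATASET_CACHE_FILE_PREFIX
        userhost_dir = sha256_urlsafe_base64(userhost, padding=False)
        dataset_dir = sha256_urlsafe_base64(dataset, padding=False)
        return os.path.join(cache_root, userhost_dir, dataset_dir, leaf)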

How trust in a cache file is established
========================================
For a cache file to be trusted and used as a fast path, three conditions must hold:
1) Equality: the dataset-level "=" mtime must equal the live ZFS ``snapshots_changed`` of the corresponding dataset.
   This ensures the filesystem state that the cache describes is the same as the live state.
2) Maturity: that live ``snapshots_changed`` is strictly older than ``now - MATURITY_TIME_THRESHOLD_SECS`` to avoid
   equal-second races and tame small clock skew between initiator and ZFS hosts.
3) Internal consistency for per-label/monitor cache files: their mtime must equal the current dataset-level "=" value,
   and their atime must be a plausible creation time not later than mtime (atime <= mtime). A zero atime/mtime indicates
   unknown provenance and must force fallback.

If any condition fails, the code falls back to ``zfs list -t snapshot`` for just those datasets; upon completion it
rewrites the relevant cache files, monotonically.

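Expressed as a sketch (the real checks are spread across the scheduler, replication, and monitor consumers, so the helper
name below is hypothetical), the gate for a per-label or monitor cache file looks roughly like this::

    import time

    def is_trusted(cached_atime: int, cached_mtime: int, live_snapshots_changed: int) -> bool:
        if 0 in (cached_atime, cached_mtime, live_snapshots_changed):
            return False  # unknown provenance or unavailable property -> fall back to 'zfs list -t snapshot'
        if cached_mtime != live_snapshots_changed:
            return False  # equality failed: the cache describes some other filesystem state
        if live_snapshots_changed >= time.time() - MATURITY_TIME_THRESHOLD_SECS:  # module constant defined below
            return False  # not yet mature: equal-second races or small clock skew are still possible
        return cached_atime <= cached_mtime  # internal consistency: creation time cannot postdate snapshots_changed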

Concurrency and correctness mechanics
=====================================
All writes go through ``set_last_modification_time_safe()`` which:
- Creates parent directories if necessary, opens the file with ``O_NOFOLLOW|O_CLOEXEC`` and takes an exclusive ``flock``.
- Updates times atomically via ``os.utime(fd, times=(atime, mtime))``.
- Applies a monotonic guard: with ``if_more_recent=True``, older timestamps never clobber newer ones. This is what
  makes concurrent runs safe and idempotent.

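A typical writer therefore records values it has just read from ZFS like so (a usage sketch only; the variables are
placeholders for values obtained elsewhere in ``bzfs``)::

    # dataset-level '=' file: atime and mtime are both set to the observed snapshots_changed
    set_last_modification_time_safe(dataset_cache_file, unixtime_in_secs=snapshots_changed, if_more_recent=True)

    # per-label file: atime := creation time of the latest matching snapshot, mtime := snapshots_changed
    set_last_modification_time_safe(label_cache_file, unixtime_in_secs=(creation, snapshots_changed), if_more_recent=True)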

Cache invalidation - what and why
=================================
Two forms exist:
1) Dataset-level invalidation by directory (non-recursive): when a mismatch is detected, top-level files for the
   dataset ("=" and flat per-label files) are zeroed to force subsequent ``zfs list -t snapshot``. Monitor caches
   live in subdirectories and are refreshed by monitor runs; they are trusted only under the equality+maturity criteria
   above.
2) Selective invalidation on property unavailability: when ZFS reports ``snapshots_changed=0`` (unavailable), the
   dataset-level "=" file is reset to 0, while per-label creation caches are preserved.

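In both cases, "invalidate" simply means stamping zero times, which every reader interprets as "unknown, do not trust"
(compare ``invalidate_last_modified_cache_dataset()`` and ``get_snapshots_changed2()`` below)::

    os.utime(cache_file, times=(0, 0))  # next consumer sees 0 and falls back to 'zfs list -t snapshot'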

What could be removed without losing correctness - and why we keep it
=====================================================================
Because all consumers already require equality + maturity before trusting cache state, explicit invalidation is not
strictly necessary for correctness; stale cache files would simply be ignored and later overwritten. However, we keep
the invalidation steps because they improve operational observability and clarity:

Zeroing the top-level "=" is an explicit "do not trust" signal. All processes then deterministically skip the cache and
probe via ``zfs list -t snapshot`` once, after which monotonic rewrites restore a consistent, trusted state. This
simplifies observability for operators inspecting (or debugging) cache trees, and immediately establishes a stable
cache snapshot of reality. The cost is tiny (an inode metadata write) while the benefit is greater operational simplicity.

The result is a design that favors simplicity and safety: tiny inode-based atomic updates, conservative guardrails before
any cache is trusted, and minimal, well-scoped invalidation to keep the system observable under change and concurrency.
"""

from __future__ import (
    annotations,
)
import errno
import fcntl
import os
from subprocess import (
    CalledProcessError,
)
from typing import (
    TYPE_CHECKING,
    Final,
    final,
)

from bzfs_main.parallel_batch_cmd import (
    itr_ssh_cmd_parallel,
)
from bzfs_main.util.utils import (
    DIR_PERMISSIONS,
    FILE_PERMISSIONS,
    LOG_TRACE,
    SortedInterner,
    sha256_urlsafe_base64,
    stderr_to_str,
)

if TYPE_CHECKING:  # pragma: no cover - for type hints only
    from bzfs_main.bzfs import (
        Job,
    )
    from bzfs_main.configuration import (
        Remote,
        SnapshotLabel,
    )

# constants:
DATASET_CACHE_FILE_PREFIX: Final[str] = "="
REPLICATION_CACHE_FILE_PREFIX: Final[str] = "=="
MONITOR_CACHE_FILE_PREFIX: Final[str] = "==="
MATURITY_TIME_THRESHOLD_SECS: Final[float] = 1.1  # 1 sec ZFS creation time resolution + NTP clock skew is typically < 10ms


#############################################################################
@final
class SnapshotCache:
    """Handles last-modified cache operations for snapshot management."""

    def __init__(self, job: Job) -> None:
        # immutable variables:
        self.job: Final[Job] = job

    def get_snapshots_changed(self, path: str) -> int:
        """Returns numeric timestamp from cached snapshots-changed file."""
        return self.get_snapshots_changed2(path)[1]

    @staticmethod
    def get_snapshots_changed2(path: str) -> tuple[int, int]:
        """Like zfs_get_snapshots_changed() but reads from local cache."""

        try:  # perf: inode metadata reads and writes are fast - ballpark O(200k) ops/sec.
            s = os.stat(path, follow_symlinks=False)
            return round(s.st_atime), round(s.st_mtime)
        except FileNotFoundError:
            return 0, 0  # harmless

    def last_modified_cache_file(self, remote: Remote, dataset: str, label: str | None = None) -> str:
        """Returns the path of the cache file that is tracking last snapshot modification."""
        cache_file: str = DATASET_CACHE_FILE_PREFIX if label is None else label
        userhost_dir: str = sha256_urlsafe_base64(remote.cache_namespace(), padding=False)
        dataset_dir: str = sha256_urlsafe_base64(dataset, padding=False)
        return os.path.join(self.job.params.log_params.last_modified_cache_dir, userhost_dir, dataset_dir, cache_file)

    def invalidate_last_modified_cache_dataset(self, dataset: str) -> None:
        """Resets the timestamps of top-level cache files of the given dataset to zero.

        Purpose: Best-effort invalidation to force ``zfs list -t snapshot`` when the dataset-level '=' cache is stale.
        Assumptions: Only top-level files (the '=' file and flat per-label files) are reset; nested monitor caches
        (e.g., '===/...') are not recursively traversed.
        Design Rationale: Monitor caches are refreshed by monitor runs and guarded by snapshots_changed equality and
        maturity checks, preserving correctness without recursive work.
        """
        p = self.job.params
        cache_file: str = self.last_modified_cache_file(p.src, dataset)
        if not p.dry_run:
            try:  # Best-effort: no locking needed. Not recursive on purpose.
                zero_times = (0, 0)
                os_utime = os.utime
                with os.scandir(os.path.dirname(cache_file)) as iterator:
                    for entry in iterator:
                        os_utime(entry.path, times=zero_times)
                os_utime(cache_file, times=zero_times)
            except FileNotFoundError:
                pass  # harmless

    def update_last_modified_cache(self, datasets_to_snapshot: dict[SnapshotLabel, list[str]]) -> None:
        """Perf: copy last-modified time of the source dataset into the local cache to reduce future 'zfs list -t snapshot' calls."""
        p = self.job.params
        src = p.src
        src_datasets_set: set[str] = set()
        for datasets in datasets_to_snapshot.values():
            src_datasets_set.update(datasets)  # union

        sorted_datasets: list[str] = sorted(src_datasets_set)
        snapshots_changed_dict: dict[str, int] = self.zfs_get_snapshots_changed(src, sorted_datasets)
        for src_dataset in sorted_datasets:
            snapshots_changed: int = snapshots_changed_dict.get(src_dataset, 0)
            self.job.src_properties[src_dataset].snapshots_changed = snapshots_changed
            dataset_cache_file: str = self.last_modified_cache_file(src, src_dataset)
            if not p.dry_run:
                if snapshots_changed == 0:
                    try:  # selective invalidation: only zero the dataset-level '=' cache file
                        os.utime(dataset_cache_file, times=(0, 0))
                    except FileNotFoundError:
                        pass  # harmless
                else:  # update dataset-level '=' cache monotonically; do NOT touch per-label creation caches here
                    set_last_modification_time_safe(
                        dataset_cache_file, unixtime_in_secs=snapshots_changed, if_more_recent=True
                    )

    def zfs_get_snapshots_changed(self, r: Remote, sorted_datasets: list[str]) -> dict[str, int]:
        """For each given dataset, returns the ZFS dataset property "snapshots_changed", which is a UTC Unix time in integer
        seconds; See https://openzfs.github.io/openzfs-docs/man/7/zfsprops.7.html#snapshots_changed"""

        def _run_zfs_list_command(_cmd: list[str], batch: list[str]) -> list[str]:
            try:
                return self.job.run_ssh_command_with_retries(r, LOG_TRACE, print_stderr=False, cmd=_cmd + batch).splitlines()
            except CalledProcessError as e:
                return stderr_to_str(e.stdout).splitlines()
            except UnicodeDecodeError:
                return []

        assert (not self.job.is_test_mode) or sorted_datasets == sorted(sorted_datasets), "List is not sorted"
        p = self.job.params
        cmd: list[str] = p.split_args(f"{p.zfs_program} list -t filesystem,volume -s name -Hp -o snapshots_changed,name")
        results: dict[str, int] = {}
        interner: SortedInterner[str] = SortedInterner(sorted_datasets)  # reduces memory footprint
        for lines in itr_ssh_cmd_parallel(
            self.job, r, [(cmd, sorted_datasets)], lambda _cmd, batch: _run_zfs_list_command(_cmd, batch), ordered=False
        ):
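            # Each output line is "<snapshots_changed>\t<dataset>"; '-' or an empty field means the property is unavailable.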

            for line in lines:
                if "\t" not in line:
                    break  # partial output from failing 'zfs list' command; subsequent lines in curr batch cannot be trusted
                snapshots_changed, dataset = line.split("\t", 1)
                if not dataset:
                    break  # partial output from failing 'zfs list' command; subsequent lines in curr batch cannot be trusted
                dataset = interner.interned(dataset)
                if snapshots_changed == "-" or not snapshots_changed:
                    snapshots_changed = "0"
                results[dataset] = int(snapshots_changed)
        return results


def set_last_modification_time_safe(
    path: str,
    unixtime_in_secs: int | tuple[int, int],
    if_more_recent: bool = False,
) -> None:
    """Like set_last_modification_time() but creates directories if necessary."""
    try:
        os.makedirs(os.path.dirname(path), mode=DIR_PERMISSIONS, exist_ok=True)
        set_last_modification_time(path, unixtime_in_secs=unixtime_in_secs, if_more_recent=if_more_recent)
    except FileNotFoundError:
        pass  # harmless


def set_last_modification_time(
    path: str,
    unixtime_in_secs: int | tuple[int, int],
    if_more_recent: bool = False,
) -> None:
    """Atomically sets the atime/mtime of the file with the given ``path``, with a monotonic guard.

    if_more_recent=True is a concurrency control mechanism that prevents us from overwriting a newer (monotonically
    increasing) mtime=snapshots_changed value (which is a UTC Unix time in integer seconds) that might have been written to
    the cache file by a different, more up-to-date bzfs process.

    For a brand-new file created by this call, we always update the file's timestamp to avoid retaining the file's implicit
    creation time ("now") instead of the intended timestamp.

    Design Rationale: Open without O_CREAT first; if missing, create exclusively (O_CREAT|O_EXCL) to detect that this call
    created the file. Only apply the monotonic early-return check when the file pre-existed; otherwise perform the initial
    timestamp write unconditionally. This preserves concurrency safety and prevents silent skips on first write.
    """
    unixtimes = (unixtime_in_secs, unixtime_in_secs) if isinstance(unixtime_in_secs, int) else unixtime_in_secs
    flags_base: int = os.O_WRONLY | os.O_NOFOLLOW | os.O_CLOEXEC
    preexisted: bool = True

    try:
        fd = os.open(path, flags_base)
    except FileNotFoundError:
        try:
            fd = os.open(path, flags_base | os.O_CREAT | os.O_EXCL, mode=FILE_PERMISSIONS)
            preexisted = False
        except FileExistsError:
            fd = os.open(path, flags_base)  # we lost the race, open existing file

    try:
        # Acquire an exclusive lock; will block if lock is already held by this process or another process.
        # The (advisory) lock is auto-released when the process terminates or the fd is closed.
        fcntl.flock(fd, fcntl.LOCK_EX)

        stats = os.fstat(fd)
        st_uid: int = stats.st_uid
        if st_uid != os.geteuid():  # verify ownership is current effective UID; same as open_nofollow()
            raise PermissionError(errno.EPERM, f"{path!r} is owned by uid {st_uid}, not {os.geteuid()}", path)

        # Monotonic guard: only skip when the file pre-existed, to not skip the very first write.
        if preexisted and if_more_recent:
            st_mtime: int = round(stats.st_mtime)
            if unixtimes[1] < st_mtime:
                return
            if unixtimes[1] == st_mtime and unixtimes[0] == round(stats.st_atime):
                return
        os.utime(fd, times=unixtimes)  # write timestamps
    finally:
        os.close(fd)