Coverage for bzfs_main/snapshot_cache.py: 95%

116 statements  

« prev     ^ index     » next       coverage.py v7.11.0, created at 2025-11-07 04:44 +0000

1# Copyright 2024 Wolfgang Hoschek AT mac DOT com 

2# 

3# Licensed under the Apache License, Version 2.0 (the "License"); 

4# you may not use this file except in compliance with the License. 

5# You may obtain a copy of the License at 

6# 

7# http://www.apache.org/licenses/LICENSE-2.0 

8# 

9# Unless required by applicable law or agreed to in writing, software 

10# distributed under the License is distributed on an "AS IS" BASIS, 

11# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 

12# See the License for the specific language governing permissions and 

13# limitations under the License. 

14# 

15"""Caching snapshot metadata to minimize 'zfs list -t snapshot' calls. 

16 

17Purpose 

18======= 

19The ``--cache-snapshots`` mode speeds up snapshot scheduling, replication, and monitoring by storing just enough 

20metadata in fast local inodes (no external DB, no daemon). Instead of repeatedly invoking costly 

21``zfs list -t snapshot ...`` across potentially thousands or even millions of datasets, we keep tiny (i.e. empty) 

22per-dataset files whose inode atime/mtime atomically encode what we need to know. This reduces latency, load on ZFS, and 

23network chatter, while remaining dependency free and robust under crashes or concurrent runs. 

24 

25Assumptions 

26=========== 

27- OpenZFS >= 2.2 provides two key UTC times with integer-second resolution: ``snapshots_changed`` (dataset level) 

28 and snapshot ``creation`` (snapshot level). 

29 - ``snapshots_changed``: Specifies the UTC time at which a snapshot for a dataset was last created or deleted. 

30 See https://openzfs.github.io/openzfs-docs/man/7/zfsprops.7.html#snapshots_changed 

31 - ``creation`` specifies the UTC time the snapshot was created. 

32 See https://openzfs.github.io/openzfs-docs/man/master/7/zfsprops.7.html#creation 

33- Unix atime/mtime are reliable to read and atomically updatable; 

34- Multiple jobs may touch the same cache files concurrently and out of order. Correctness must rely on per-file locking 

35 plus monotonicity guards rather than global serialization or a single writer model. 

36- System clocks may differ by small skews across hosts; equal-second races can happen. We gate "freshness" with a small 

37 maturity time threshold (``MATURITY_TIME_THRESHOLD_SECS``) before trusting a value as authoritative. 

38 

39Design Rationale 

40================ 

41We intentionally encode only minimal invariants into inode timestamps and not arbitrary text payloads. This keeps I/O 

42tiny and allows safe, atomic low-latency updates via a single ``utime`` call under an exclusive advisory file lock via 

43flock(2). 

44 

45Cache root and hashed path segments 

46----------------------------------- 

47The cache tree lives under ``<log_parent_dir>/.cache/mods`` (see ``LogParams.last_modified_cache_dir``). To keep paths 

48short and safe, variable path segments are stored as URL-safe base64-encoded SHA-256 digests without padding. In what 

49follows, ``hash(X)`` denotes ``sha256_urlsafe_base64(str(X), padding=False)`` or a truncated variant used for brevity. 

50 

51The cache consists of four families: 

52------------------------------------ 

531) Dataset-level ("=") per dataset and location (src or dst); for --create-src-snapshots, --replicate, --monitor-snapshots 

54 - Path: ``<cache_root>/<hash(user@host[#port])>/<hash(dataset)>/=`` 

55 - mtime: the ZFS ``snapshots_changed`` time observed for that dataset. Monotonic writes only. 

56 - Used by: snapshot scheduler, replicate, monitor - as the anchor for cache equality checks. 

57 

582) Replication-scoped ("==") per source dataset and destination dataset+filters; for --replicate 

59 - Path: ``<cache_root>/<hash(src_user@host[#port])>/<hash(src_dataset)>/==/<hash(dst_user@host[#port])>/<hash(dst_dataset)>/<hash(filters)>`` 

60 - Path label encodes destination namespace, destination dataset and the snapshot-filter hash. 

61 - mtime: last replicated source ``snapshots_changed`` for that destination and filter set. Monotonic. 

62 - Used by: replicate - to cheaply decide "src unchanged since last successful run to this dst+filters". 

63 

643) Monitor ("===") per dataset and label (Latest/Oldest); for --monitor-snapshots 

65 - Path: ``<cache_root>/<hash(user@host[#port])>/<hash(dataset)>/===/<kind>/<hash(notimestamp_label)>/<hash(alert_plan)>`` 

66 - ``kind``: alert check mode; either "L" (Latest) or "O" (Oldest). 

67 - ``hash(alert_plan)``: stable digest over the monitor alert plan to scope caches per plan. 

68 - atime: creation time of the relevant latest/oldest snapshot. 

69 - mtime: dataset ``snapshots_changed`` observed when that creation was recorded. Monotonic. 

70 - Used by: monitor - to alert on stale snapshots without listing them every time. 

71 

724) Snapshot scheduler per-label files (under the source dataset); for --create-src-snapshots 

73 - Path: ``<cache_root>/<hash(src_user@host[#port])>/<hash(src_dataset)>/<hash(notimestamp_label)>`` 

74 - atime: creation time of the latest snapshot matching that label. 

75 - mtime: the dataset-level "=" value at the time of the write (i.e., the then-current ``snapshots_changed``). 

76 - Used by: ``--create-src-snapshots`` - to cheaply decide whether a label is due without ``zfs list -t snapshot``. 

77 

78How trust in a cache file is established 

79======================================== 

80For a cache file to be trusted and used as a fast path, three conditions must hold: 

811) Equality: the dataset-level "=" mtime must equal the live ZFS ``snapshots_changed`` of the corresponding dataset. 

82 This ensures the filesystem state that the cache describes is the same as the live state. 

832) Maturity: that live ``snapshots_changed`` is strictly older than ``now - MATURITY_TIME_THRESHOLD_SECS`` to avoid 

84 equal-second races and tame small clock skew between initiator and ZFS hosts. 

853) Internal consistency for per-label/monitor cache files: their mtime must equal the current dataset-level "=" value, 

86 and their atime must be a plausible creation time not later than mtime (atime <= mtime). A zero atime/mtime indicates 

87 unknown provenance and must force fallback. 

88 

89If any condition fails, the code falls back to ``zfs list -t snapshot`` for just those datasets; upon completion it 

90rewrites the relevant cache files, monotonically. 

91 

92Concurrency and correctness mechanics 

93===================================== 

94All writes go through ``set_last_modification_time_safe()`` which: 

95- Creates parent directories if necessary, opens the file with ``O_NOFOLLOW|O_CLOEXEC`` and takes an exclusive ``flock``. 

96- Updates times atomically via ``os.utime(fd, times=(atime, mtime))``. 

97- Applies a monotonic guard: with ``if_more_recent=True``, older timestamps never clobber newer ones. This is what 

98 makes concurrent runs safe and idempotent. 

99 

100Cache invalidation - what and why 

101================================= 

102Two forms exist: 

1031) Dataset-level invalidation by directory (non-recursive): when a mismatch is detected, top-level files for the 

104 dataset ("=" and flat per-label files) are zeroed to force subsequent ``zfs list -t snapshot``. Monitor caches 

105 live in subdirectories and are refreshed by monitor runs; they are trusted only under the equality+maturity criteria 

106 above. 

1072) Selective invalidation on property unavailability: when ZFS reports ``snapshots_changed=0`` (unavailable), the 

108 dataset-level "=" file is reset to 0, while per-label creation caches are preserved. 

109 

110What could be removed without losing correctness - and why we keep it 

111===================================================================== 

112Because all consumers already require equality + maturity before trusting cache state, explicit invalidation is not 

113strictly necessary for correctness; stale cache files would simply be ignored and later overwritten. However, we keep 

114the invalidation steps to improve operational observability and clarity: 

115 

116The equality+maturity gates already prevent incorrect cache hits, but invalidation improves operational clarity. 

117Zeroing the top-level "=" is an explicit "do not trust" signal. All processes then deterministically skip cache and 

118probe via ``zfs list -t snapshot`` once, after which monotonic rewrites restore a consistent, trusted state. This 

119simplifies observability for operators inspecting (or debugging) cache trees, and immediately establishes a stable 

120cache snapshot of reality. The cost is tiny (an inode metadata write) while the benefit is more operational simplicity. 

121 

122The result is a design that favors simplicity and safety: tiny inode-based atomic updates, conservative guardrails before 

123any cache is trusted, and minimal, well-scoped invalidation to keep the system observable under change and concurrency. 

124""" 

125 

126from __future__ import ( 

127 annotations, 

128) 

129import errno 

130import fcntl 

131import os 

132import stat 

133from subprocess import ( 

134 CalledProcessError, 

135) 

136from typing import ( 

137 TYPE_CHECKING, 

138 Final, 

139) 

140 

141from bzfs_main.connection import ( 

142 run_ssh_command, 

143) 

144from bzfs_main.parallel_batch_cmd import ( 

145 itr_ssh_cmd_parallel, 

146) 

147from bzfs_main.utils import ( 

148 DIR_PERMISSIONS, 

149 LOG_TRACE, 

150 SortedInterner, 

151 sha256_urlsafe_base64, 

152 stderr_to_str, 

153) 

154 

155if TYPE_CHECKING: # pragma: no cover - for type hints only 

156 from bzfs_main.bzfs import ( 

157 Job, 

158 ) 

159 from bzfs_main.configuration import ( 

160 Remote, 

161 SnapshotLabel, 

162 ) 

163 

164# constants: 

165DATASET_CACHE_FILE_PREFIX: Final[str] = "=" 

166REPLICATION_CACHE_FILE_PREFIX: Final[str] = "==" 

167MONITOR_CACHE_FILE_PREFIX: Final[str] = "===" 

168 

169 

170############################################################################# 

171class SnapshotCache: 

172 """Handles last-modified cache operations for snapshot management.""" 

173 

174 def __init__(self, job: Job) -> None: 

175 # immutable variables: 

176 self.job: Final[Job] = job 

177 

178 def get_snapshots_changed(self, path: str) -> int: 

179 """Returns numeric timestamp from cached snapshots-changed file.""" 

180 return self.get_snapshots_changed2(path)[1] 

181 

182 @staticmethod 

183 def get_snapshots_changed2(path: str) -> tuple[int, int]: 

184 """Like zfs_get_snapshots_changed() but reads from local cache.""" 

185 try: # perf: inode metadata reads and writes are fast - ballpark O(200k) ops/sec. 

186 s = os.stat(path, follow_symlinks=False) 

187 return round(s.st_atime), round(s.st_mtime) 

188 except FileNotFoundError: 

189 return 0, 0 # harmless 

190 

191 def last_modified_cache_file(self, remote: Remote, dataset: str, label: str | None = None) -> str: 

192 """Returns the path of the cache file that is tracking last snapshot modification.""" 

193 cache_file: str = DATASET_CACHE_FILE_PREFIX if label is None else label 

194 userhost_dir: str = sha256_urlsafe_base64(remote.cache_namespace(), padding=False) 

195 dataset_dir: str = sha256_urlsafe_base64(dataset, padding=False) 

196 return os.path.join(self.job.params.log_params.last_modified_cache_dir, userhost_dir, dataset_dir, cache_file) 

197 

198 def invalidate_last_modified_cache_dataset(self, dataset: str) -> None: 

199 """Resets the timestamps of top-level cache files of the given dataset to zero. 

200 

201 Purpose: Best-effort invalidation to force ``zfs list -t snapshot`` when the dataset-level '=' cache is stale. 

202 Assumptions: Only top-level files (the '=' file and flat per-label files) are reset; nested monitor caches 

203 (e.g., '===/...') are not recursively traversed. 

204 Design Rationale: Monitor caches are refreshed by monitor runs and guarded by snapshots_changed equality and 

205 maturity checks, preserving correctness without recursive work. 

206 """ 

207 p = self.job.params 

208 cache_file: str = self.last_modified_cache_file(p.src, dataset) 

209 if not p.dry_run: 209 ↛ exitline 209 didn't return from function 'invalidate_last_modified_cache_dataset' because the condition on line 209 was always true

210 try: # Best-effort: no locking needed. Not recursive on purpose. 

211 zero_times = (0, 0) 

212 os_utime = os.utime 

213 with os.scandir(os.path.dirname(cache_file)) as iterator: 

214 for entry in iterator: 

215 os_utime(entry.path, times=zero_times) 

216 os_utime(cache_file, times=zero_times) 

217 except FileNotFoundError: 

218 pass # harmless 

219 

220 def update_last_modified_cache(self, datasets_to_snapshot: dict[SnapshotLabel, list[str]]) -> None: 

221 """Perf: copy last-modified time of the source dataset into the local cache to reduce future 'zfs list -t snapshot' calls.""" 

222 p = self.job.params 

223 src = p.src 

224 src_datasets_set: set[str] = set() 

225 for datasets in datasets_to_snapshot.values(): 

226 src_datasets_set.update(datasets) # union 

227 

228 sorted_datasets: list[str] = sorted(src_datasets_set) 

229 snapshots_changed_dict: dict[str, int] = self.zfs_get_snapshots_changed(src, sorted_datasets) 

230 for src_dataset in sorted_datasets: 

231 snapshots_changed: int = snapshots_changed_dict.get(src_dataset, 0) 

232 self.job.src_properties[src_dataset].snapshots_changed = snapshots_changed 

233 dataset_cache_file: str = self.last_modified_cache_file(src, src_dataset) 

234 if not p.dry_run: 

235 if snapshots_changed == 0: 

236 try: # selective invalidation: only zero the dataset-level '=' cache file 

237 os.utime(dataset_cache_file, times=(0, 0)) 

238 except FileNotFoundError: 

239 pass # harmless 

240 else: # update dataset-level '=' cache monotonically; do NOT touch per-label creation caches here 

241 set_last_modification_time_safe( 

242 dataset_cache_file, unixtime_in_secs=snapshots_changed, if_more_recent=True 

243 ) 

244 

245 def zfs_get_snapshots_changed(self, remote: Remote, sorted_datasets: list[str]) -> dict[str, int]: 

246 """For each given dataset, returns the ZFS dataset property "snapshots_changed", which is a UTC Unix time in integer 

247 seconds; See https://openzfs.github.io/openzfs-docs/man/7/zfsprops.7.html#snapshots_changed""" 

248 

249 def try_zfs_list_command(_cmd: list[str], batch: list[str]) -> list[str]: 

250 try: 

251 return run_ssh_command(self.job, remote, LOG_TRACE, print_stderr=False, cmd=_cmd + batch).splitlines() 

252 except CalledProcessError as e: 

253 return stderr_to_str(e.stdout).splitlines() 

254 except UnicodeDecodeError: 

255 return [] 

256 

257 assert (not self.job.is_test_mode) or sorted_datasets == sorted(sorted_datasets), "List is not sorted" 

258 p = self.job.params 

259 cmd: list[str] = p.split_args(f"{p.zfs_program} list -t filesystem,volume -s name -Hp -o snapshots_changed,name") 

260 results: dict[str, int] = {} 

261 interner: SortedInterner[str] = SortedInterner(sorted_datasets) # reduces memory footprint 

262 for lines in itr_ssh_cmd_parallel( 

263 self.job, remote, [(cmd, sorted_datasets)], lambda _cmd, batch: try_zfs_list_command(_cmd, batch), ordered=False 

264 ): 

265 for line in lines: 

266 if "\t" not in line: 

267 break # partial output from failing 'zfs list' command; subsequent lines in curr batch cannot be trusted 

268 snapshots_changed, dataset = line.split("\t", 1) 

269 if not dataset: 

270 break # partial output from failing 'zfs list' command; subsequent lines in curr batch cannot be trusted 

271 dataset = interner.interned(dataset) 

272 if snapshots_changed == "-" or not snapshots_changed: 

273 snapshots_changed = "0" 

274 results[dataset] = int(snapshots_changed) 

275 return results 

276 

277 

278def set_last_modification_time_safe( 

279 path: str, 

280 unixtime_in_secs: int | tuple[int, int], 

281 if_more_recent: bool = False, 

282) -> None: 

283 """Like set_last_modification_time() but creates directories if necessary.""" 

284 try: 

285 os.makedirs(os.path.dirname(path), mode=DIR_PERMISSIONS, exist_ok=True) 

286 set_last_modification_time(path, unixtime_in_secs=unixtime_in_secs, if_more_recent=if_more_recent) 

287 except FileNotFoundError: 

288 pass # harmless 

289 

290 

291def set_last_modification_time( 

292 path: str, 

293 unixtime_in_secs: int | tuple[int, int], 

294 if_more_recent: bool = False, 

295) -> None: 

296 """Atomically sets the atime/mtime of the file with the given ``path``, with a monotonic guard. 

297 

298 if_more_recent=True is a concurrency control mechanism that prevents us from overwriting a newer (monotonically 

299 increasing) snapshots_changed value (which is a UTC Unix time in integer seconds) that might have been written to the 

300 cache file by a different, more up-to-date bzfs process. 

301 

302 For a brand-new file created by this call, we always update the file's timestamp to avoid retaining the file's implicit 

303 creation time ("now") instead of the intended timestamp. 

304 

305 Design Rationale: Open without O_CREAT first; if missing, create exclusively (O_CREAT|O_EXCL) to detect that this call 

306 created the file. Only apply the monotonic early-return check when the file pre-existed; otherwise perform the initial 

307 timestamp write unconditionally. This preserves concurrency safety and prevents silent skips on first write. 

308 """ 

309 unixtimes = (unixtime_in_secs, unixtime_in_secs) if isinstance(unixtime_in_secs, int) else unixtime_in_secs 

310 perm: int = stat.S_IRUSR | stat.S_IWUSR # rw------- (user read + write) 

311 flags_base: int = os.O_WRONLY | os.O_NOFOLLOW | os.O_CLOEXEC 

312 preexisted: bool = True 

313 

314 try: 

315 fd = os.open(path, flags_base) 

316 except FileNotFoundError: 

317 try: 

318 fd = os.open(path, flags_base | os.O_CREAT | os.O_EXCL, mode=perm) 

319 preexisted = False 

320 except FileExistsError: 

321 fd = os.open(path, flags_base) # we lost the race, open existing file 

322 

323 try: 

324 # Acquire an exclusive lock; will block if lock is already held by this process or another process. 

325 # The (advisory) lock is auto-released when the process terminates or the fd is closed. 

326 fcntl.flock(fd, fcntl.LOCK_EX) 

327 

328 stats = os.fstat(fd) 

329 st_uid: int = stats.st_uid 

330 if st_uid != os.geteuid(): # verify ownership is current effective UID; same as open_nofollow() 

331 raise PermissionError(errno.EPERM, f"{path!r} is owned by uid {st_uid}, not {os.geteuid()}", path) 

332 

333 # Monotonic guard: only skip when the file pre-existed, to not skip the very first write. 

334 if preexisted and if_more_recent and unixtimes[1] <= round(stats.st_mtime): 

335 return 

336 os.utime(fd, times=unixtimes) # write timestamps 

337 finally: 

338 os.close(fd)