Coverage for bzfs_main / snapshot_cache.py: 95%
119 statements
coverage.py v7.13.0, created at 2025-12-22 08:03 +0000
1# Copyright 2024 Wolfgang Hoschek AT mac DOT com
2#
3# Licensed under the Apache License, Version 2.0 (the "License");
4# you may not use this file except in compliance with the License.
5# You may obtain a copy of the License at
6#
7# http://www.apache.org/licenses/LICENSE-2.0
8#
9# Unless required by applicable law or agreed to in writing, software
10# distributed under the License is distributed on an "AS IS" BASIS,
11# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12# See the License for the specific language governing permissions and
13# limitations under the License.
14#
15"""Caching snapshot metadata to minimize 'zfs list -t snapshot' calls.
17Purpose
18=======
19The ``--cache-snapshots`` mode speeds up snapshot scheduling, replication, and monitoring by storing just enough
20metadata in fast local inodes (no external DB, no daemon). Instead of repeatedly invoking costly
21``zfs list -t snapshot ...`` across potentially thousands or even millions of datasets, we keep tiny (i.e. empty)
22per-dataset files whose inode atime/mtime atomically encode what we need to know. This reduces latency, load on ZFS, and
23network chatter, while remaining dependency free and robust under crashes or concurrent runs.
25Correctness scope and analysis guidance
26=======================================
27The cache files described here are an internal implementation detail of the ``--cache-snapshots`` feature. The supported
28contract is that only ``bzfs`` itself creates and updates these files via its documented CLI flows; third-party tools or
29manual edits under ``.cache/mods`` are outside the design and correctness model.
31From a correctness perspective, the *current* design is deliberately fail-safe under *benign* external modifications:
33- Deleting any subset of cache files or directories (including wiping the entire ``.cache`` tree) is safe. ``bzfs`` will
34 simply fall back to live ``zfs list -t snapshot`` and repopulate caches conservatively.
35- Touching existing cache files so their timestamps move away from the true ZFS properties (for example via ``touch`` or
36 copying cache files without preserving timestamps) cannot cause incorrect behavior. Such changes only make cache
37 entries look stale; all ``bzfs`` consumers re-validate cached timestamps against live ZFS properties plus maturity
38 checks and treat any mismatch as a cache miss, at the cost of extra ZFS I/O and latency but never weakening safety.
40The cache directories themselves are created with strict Unix permissions: ``--log-dir`` and its subdirectories (including
41``.cache``) use ``DIR_PERMISSIONS`` (``rwx------``), and cache files are written with ``FILE_PERMISSIONS`` (``rw-------``),
42with the process-wide ``UMASK`` set accordingly. As a result, only the same Unix user account (and root) can read or
43modify the cache tree; other users and groups have no access by default. Analyses shall therefore not assume arbitrary
44hostile writes coming from other unprivileged users on the same system.
46When reasoning about correctness, or when searching for bugs, analyses shall be restricted to states that can be reached
47via real use of the public CLIs plus standard ZFS operations. Fabricated scenarios that assume arbitrary or adversarial
48rewrites of cache files (for example, scripts that deliberately write hand-crafted timestamps into ``.cache/mods`` or
49invoke internal helpers directly) are explicitly out of scope and shall not be treated as bugs in ``--cache-snapshots``.
50Under the stated assumptions and invariants below, the fast paths never weaken the safety guarantees of snapshot
51creation, replication, or monitoring; at worst they cause extra ``zfs list -t snapshot`` calls.
53Assumptions
54===========
55- OpenZFS >= 2.2 provides two key UTC times with integer-second resolution: ``snapshots_changed`` (dataset level)
56 and snapshot ``creation`` (snapshot level).
57 - ``snapshots_changed``: Specifies the UTC time at which a snapshot for a dataset was last created or deleted.
58 See https://openzfs.github.io/openzfs-docs/man/7/zfsprops.7.html#snapshots_changed
59 - ``creation`` specifies the UTC time the snapshot was created.
60 See https://openzfs.github.io/openzfs-docs/man/master/7/zfsprops.7.html#creation
61- Unix atime/mtime are reliable to read and atomically updatable.
62- Multiple jobs may touch the same cache files concurrently and out of order. Correctness must rely on per-file locking
63 plus monotonicity guards rather than global serialization or a single writer model.
64- System clocks may differ by small skews across hosts; equal-second races can happen. We gate "freshness" with a small
65 maturity time threshold (``MATURITY_TIME_THRESHOLD_SECS``) before trusting a value as authoritative.
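To make the first assumption concrete, here is a minimal sketch that parses tab-separated ``snapshots_changed`` values
of the kind printed by ``zfs list -Hp -o snapshots_changed,name``. The helper name ``parse_snapshots_changed`` and the
sample lines are hypothetical and exist only for illustration; a value of ``-`` (property unavailable) is mapped to 0.

```python
TAB = "\t"


def parse_snapshots_changed(lines: list) -> dict:
    # Hypothetical helper: maps dataset name -> snapshots_changed (UTC Unix time in integer seconds).
    results = {}
    for line in lines:
        if TAB not in line:
            break  # malformed line; do not trust the remainder of this batch
        value, dataset = line.split(TAB, 1)
        if not dataset:
            break  # malformed line; do not trust the remainder of this batch
        results[dataset] = 0 if value in ("", "-") else int(value)
    return results


# Made-up sample lines; not taken from a real system.
sample = ["1700000000" + TAB + "tank/home", "-" + TAB + "tank/tmp"]
parsed = parse_snapshots_changed(sample)
```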
67Design Rationale
68================
69We intentionally encode only minimal invariants into inode timestamps and not arbitrary text payloads. This keeps I/O
70tiny and allows safe, atomic low-latency updates via a single ``utime`` call under an exclusive advisory file lock via
71flock(2).
73Cache root and hashed path segments
74-----------------------------------
75The cache tree lives under ``<log_parent_dir>/.cache/mods`` (see ``LogParams.last_modified_cache_dir``). To keep paths
76short and safe, variable path segments are stored as URL-safe base64-encoded SHA-256 digests without padding. In what
77follows, ``hash(X)`` denotes ``sha256_urlsafe_base64(str(X), padding=False)`` or a truncated variant used for brevity.
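As an illustration of the hashed path segments, here is a minimal sketch of an unpadded, URL-safe base64 encoding of a
SHA-256 digest. The function ``urlsafe_sha256_sketch`` is a hypothetical stand-in written for this example; the real
``sha256_urlsafe_base64`` helper lives in ``bzfs_main.util.utils`` and may differ in detail.

```python
import base64
import hashlib


def urlsafe_sha256_sketch(text: str, padding: bool = True) -> str:
    # Hypothetical stand-in: SHA-256 digest rendered as URL-safe base64, optionally without '=' padding.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    encoded = base64.urlsafe_b64encode(digest).decode("ascii")
    return encoded if padding else encoded.rstrip("=")


# A 32-byte digest encodes to 44 base64 characters, of which the last one is padding.
segment = urlsafe_sha256_sketch("tank/home/alice", padding=False)
```

The resulting segment has a fixed length and contains no path separators, regardless of how long or unusual the
dataset name is.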
79The cache consists of four families:
80------------------------------------
811) Dataset-level ("=") per dataset and location (src or dst); for --create-src-snapshots, --replicate, --monitor-snapshots
82 - Path: ``<cache_root>/<hash(user@host[#port])>/<hash(dataset)>/=``
83 - mtime: the ZFS ``snapshots_changed`` time observed for that dataset. Monotonic writes only.
84 - Used by: snapshot scheduler, replicate, monitor - as the anchor for cache equality checks.
862) Replication-scoped ("==") per source dataset and destination dataset+filters; for --replicate
87 - Path: ``<cache_root>/<hash(src_user@host[#port])>/<hash(src_dataset)>/==/<hash(dst_user@host[#port])>/<hash(dst_dataset)>/<hash(filters)>``
88 - Path label encodes destination namespace, destination dataset and the snapshot-filter hash.
89 - mtime: last replicated source ``snapshots_changed`` for that destination and filter set. Monotonic.
90 - Used by: replicate - to cheaply decide "src unchanged since last successful run to this dst+filters".
923) Monitor ("===") per dataset and label (Latest/Oldest); for --monitor-snapshots
93 - Path: ``<cache_root>/<hash(user@host[#port])>/<hash(dataset)>/===/<kind>/<hash(notimestamp_label)>/<hash(alert_plan)>``
94 - ``kind``: alert check mode; either "L" (Latest) or "O" (Oldest).
95 - ``hash(alert_plan)``: stable digest over the monitor alert plan to scope caches per plan.
96 - atime: creation time of the relevant latest/oldest snapshot.
97 - mtime: dataset ``snapshots_changed`` observed when that creation was recorded. Monotonic.
98 - Used by: monitor - to alert on stale snapshots without listing them every time.
1004) Snapshot scheduler per-label files (under the source dataset); for --create-src-snapshots
101 - Path: ``<cache_root>/<hash(src_user@host[#port])>/<hash(src_dataset)>/<hash(notimestamp_label)>``
102 - atime: creation time of the latest snapshot matching that label.
103 - mtime: the dataset-level "=" value at the time of the write (i.e., the then-current ``snapshots_changed``).
104 - Used by: ``--create-src-snapshots`` - to cheaply decide whether a label is due without ``zfs list -t snapshot``.
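The four path layouts above can be sketched as follows. The helper ``h`` is a hypothetical stand-in for ``hash(X)``,
and the cache root, host names, dataset names, label, filter, and alert-plan strings are made-up example values.

```python
import base64
import hashlib
import os


def h(value: object) -> str:
    # Hypothetical stand-in for hash(X): unpadded URL-safe base64 of the SHA-256 digest.
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


cache_root = os.path.join("logs", ".cache", "mods")  # example value
src_namespace, dst_namespace = "alice@src-host", "bob@dst-host"  # example values
src_dataset, dst_dataset = "tank/data", "backup/data"  # example values
label = "example-label-without-timestamp"  # hypothetical label string
dataset_dir = os.path.join(cache_root, h(src_namespace), h(src_dataset))

dataset_file = os.path.join(dataset_dir, "=")  # family 1
replication_file = os.path.join(dataset_dir, "==", h(dst_namespace), h(dst_dataset), h("example-filters"))  # family 2
monitor_file = os.path.join(dataset_dir, "===", "L", h(label), h("example-alert-plan"))  # family 3
label_file = os.path.join(dataset_dir, h(label))  # family 4
```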
106How trust in a cache file is established
107========================================
108For a cache file to be trusted and used as a fast path, three conditions must hold:
1091) Equality: the dataset-level "=" mtime must equal the live ZFS ``snapshots_changed`` of the corresponding dataset.
110 This ensures the filesystem state that the cache describes is the same as the live state.
1112) Maturity: that live ``snapshots_changed`` is strictly older than ``now - MATURITY_TIME_THRESHOLD_SECS`` to avoid
112 equal-second races and tame small clock skew between initiator and ZFS hosts.
1133) Internal consistency for per-label/monitor cache files: their mtime must equal the current dataset-level "=" value,
114 and their atime must be a plausible creation time not later than mtime (atime <= mtime). A zero atime/mtime indicates
115 unknown provenance and must force fallback.
117If any condition fails, the code falls back to ``zfs list -t snapshot`` for just those datasets; upon completion it
118rewrites the relevant cache files, monotonically.
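The three conditions can be summarized in a simplified model. This is a sketch only: the function ``is_cache_trusted``
and its parameters are hypothetical, and the real checks are spread across the scheduler, replication, and monitor
code paths rather than living in a single function.

```python
from typing import Optional, Tuple

MATURITY_TIME_THRESHOLD_SECS = 1.1  # mirrors the module-level constant


def is_cache_trusted(
    dataset_mtime: int,
    live_snapshots_changed: int,
    now: float,
    label_times: Optional[Tuple[int, int]] = None,
) -> bool:
    # 1) Equality: the cached "=" mtime must match the live property; zero means unknown provenance.
    if dataset_mtime == 0 or dataset_mtime != live_snapshots_changed:
        return False
    # 2) Maturity: the live value must be strictly older than now - threshold.
    if not live_snapshots_changed < now - MATURITY_TIME_THRESHOLD_SECS:
        return False
    # 3) Internal consistency of a per-label/monitor file, given as (atime, mtime).
    if label_times is not None:
        atime, mtime = label_times
        if atime == 0 or mtime != dataset_mtime or atime > mtime:
            return False
    return True
```

Any ``False`` result corresponds to a cache miss, that is, a fallback to ``zfs list -t snapshot`` for that dataset.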
120Concurrency and correctness mechanics
121=====================================
122All writes go through ``set_last_modification_time_safe()`` which:
123- Creates parent directories if necessary, opens the file with ``O_NOFOLLOW|O_CLOEXEC`` and takes an exclusive ``flock``.
124- Updates times atomically via ``os.utime(fd, times=(atime, mtime))``.
125- Applies a monotonic guard: with ``if_more_recent=True``, older timestamps never clobber newer ones. This is what
126 makes concurrent runs safe and idempotent.
128Cache invalidation - what and why
129=================================
130Two forms exist:
1311) Dataset-level invalidation by directory (non-recursive): when a mismatch is detected, top-level files for the
132 dataset ("=" and flat per-label files) are zeroed to force subsequent ``zfs list -t snapshot``. Monitor caches
133 live in subdirectories and are refreshed by monitor runs; they are trusted only under the equality+maturity criteria
134 above.
1352) Selective invalidation on property unavailability: when ZFS reports ``snapshots_changed=0`` (unavailable), the
136 dataset-level "=" file is reset to 0, while per-label creation caches are preserved.
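The non-recursive form of invalidation can be sketched as follows. The helper ``zero_top_level_times`` and the file
names are hypothetical; the sketch only demonstrates that zeroing the direct children of a dataset directory leaves
nested (monitor-style) entries untouched.

```python
import os
import tempfile


def zero_top_level_times(dataset_dir: str) -> None:
    # Hypothetical helper: resets atime/mtime of the direct children only; does not recurse.
    try:
        with os.scandir(dataset_dir) as entries:
            for entry in entries:
                os.utime(entry.path, times=(0, 0))
    except FileNotFoundError:
        pass  # nothing cached yet; harmless


with tempfile.TemporaryDirectory() as dataset_dir:
    top_file = os.path.join(dataset_dir, "=")
    nested_dir = os.path.join(dataset_dir, "===")
    nested_file = os.path.join(nested_dir, "monitor-entry")
    os.mkdir(nested_dir)
    for path in (top_file, nested_file):
        with open(path, "w"):
            pass  # create an empty file
        os.utime(path, times=(1700000000, 1700000000))
    zero_top_level_times(dataset_dir)
    top_mtime = round(os.stat(top_file).st_mtime)  # reset to 0
    nested_mtime = round(os.stat(nested_file).st_mtime)  # left untouched
```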
138What could be removed without losing correctness - and why we keep it
139=====================================================================
140Because all consumers already require equality + maturity before trusting cache state, explicit invalidation is not
141strictly necessary for correctness; stale cache files would simply be ignored and later overwritten. However, we keep
142the invalidation steps to improve operational observability and clarity.
145Zeroing the top-level "=" is an explicit "do not trust" signal. All processes then deterministically skip cache and
146probe via ``zfs list -t snapshot`` once, after which monotonic rewrites restore a consistent, trusted state. This
147simplifies observability for operators inspecting (or debugging) cache trees, and immediately establishes a stable
148cache snapshot of reality. The cost is tiny (an inode metadata write) while the benefit is greater operational simplicity.
150The result is a design that favors simplicity and safety: tiny inode-based atomic updates, conservative guardrails before
151any cache is trusted, and minimal, well-scoped invalidation to keep the system observable under change and concurrency.
152"""
154from __future__ import (
155    annotations,
156)
157import errno
158import fcntl
159import os
160from subprocess import (
161    CalledProcessError,
162)
163from typing import (
164    TYPE_CHECKING,
165    Final,
166    final,
167)
169from bzfs_main.parallel_batch_cmd import (
170    itr_ssh_cmd_parallel,
171)
172from bzfs_main.util.utils import (
173    DIR_PERMISSIONS,
174    FILE_PERMISSIONS,
175    LOG_TRACE,
176    SortedInterner,
177    sha256_urlsafe_base64,
178    stderr_to_str,
179)
181if TYPE_CHECKING:  # pragma: no cover - for type hints only
182    from bzfs_main.bzfs import (
183        Job,
184    )
185    from bzfs_main.configuration import (
186        Remote,
187        SnapshotLabel,
188    )
190# constants:
191DATASET_CACHE_FILE_PREFIX: Final[str] = "="
192REPLICATION_CACHE_FILE_PREFIX: Final[str] = "=="
193MONITOR_CACHE_FILE_PREFIX: Final[str] = "==="
194MATURITY_TIME_THRESHOLD_SECS: Final[float] = 1.1 # 1 sec ZFS creation time resolution + NTP clock skew is typically < 10ms
197#############################################################################
198@final
199class SnapshotCache:
200    """Handles last-modified cache operations for snapshot management."""
202    def __init__(self, job: Job) -> None:
203        # immutable variables:
204        self.job: Final[Job] = job
206    def get_snapshots_changed(self, path: str) -> int:
207        """Returns the mtime of the given cache file in integer seconds, or 0 if the file does not exist."""
208        return self.get_snapshots_changed2(path)[1]
210    @staticmethod
211    def get_snapshots_changed2(path: str) -> tuple[int, int]:
212        """Like zfs_get_snapshots_changed() but reads (atime, mtime) from local cache; returns (0, 0) if the file is missing."""
213        try:  # perf: inode metadata reads and writes are fast - ballpark O(200k) ops/sec.
214            s = os.stat(path, follow_symlinks=False)
215            return round(s.st_atime), round(s.st_mtime)
216        except FileNotFoundError:
217            return 0, 0  # harmless
219    def last_modified_cache_file(self, remote: Remote, dataset: str, label: str | None = None) -> str:
220        """Returns the path of the cache file that is tracking last snapshot modification."""
221        cache_file: str = DATASET_CACHE_FILE_PREFIX if label is None else label
222        userhost_dir: str = sha256_urlsafe_base64(remote.cache_namespace(), padding=False)
223        dataset_dir: str = sha256_urlsafe_base64(dataset, padding=False)
224        return os.path.join(self.job.params.log_params.last_modified_cache_dir, userhost_dir, dataset_dir, cache_file)
226    def invalidate_last_modified_cache_dataset(self, dataset: str) -> None:
227        """Resets the timestamps of top-level cache files of the given dataset to zero.
229        Purpose: Best-effort invalidation to force ``zfs list -t snapshot`` when the dataset-level '=' cache is stale.
230        Assumptions: Only top-level files (the '=' file and flat per-label files) are reset; nested monitor caches
231        (e.g., '===/...') are not recursively traversed.
232        Design Rationale: Monitor caches are refreshed by monitor runs and guarded by snapshots_changed equality and
233        maturity checks, preserving correctness without recursive work.
234        """
235        p = self.job.params
236        cache_file: str = self.last_modified_cache_file(p.src, dataset)
237        if not p.dry_run:  # coverage: 237 ↛ exit (line 237 didn't return from function 'invalidate_last_modified_cache_dataset' because the condition on line 237 was always true)
238            try:  # Best-effort: no locking needed. Not recursive on purpose.
239                zero_times = (0, 0)
240                os_utime = os.utime
241                with os.scandir(os.path.dirname(cache_file)) as iterator:
242                    for entry in iterator:
243                        os_utime(entry.path, times=zero_times)
244                os_utime(cache_file, times=zero_times)
245            except FileNotFoundError:
246                pass  # harmless
248    def update_last_modified_cache(self, datasets_to_snapshot: dict[SnapshotLabel, list[str]]) -> None:
249        """Perf: copy last-modified time of the source dataset into the local cache to reduce future 'zfs list -t snapshot' calls."""
250        p = self.job.params
251        src = p.src
252        src_datasets_set: set[str] = set()
253        for datasets in datasets_to_snapshot.values():
254            src_datasets_set.update(datasets)  # union
256        sorted_datasets: list[str] = sorted(src_datasets_set)
257        snapshots_changed_dict: dict[str, int] = self.zfs_get_snapshots_changed(src, sorted_datasets)
258        for src_dataset in sorted_datasets:
259            snapshots_changed: int = snapshots_changed_dict.get(src_dataset, 0)
260            self.job.src_properties[src_dataset].snapshots_changed = snapshots_changed
261            dataset_cache_file: str = self.last_modified_cache_file(src, src_dataset)
262            if not p.dry_run:
263                if snapshots_changed == 0:
264                    try:  # selective invalidation: only zero the dataset-level '=' cache file
265                        os.utime(dataset_cache_file, times=(0, 0))
266                    except FileNotFoundError:
267                        pass  # harmless
268                else:  # update dataset-level '=' cache monotonically; do NOT touch per-label creation caches here
269                    set_last_modification_time_safe(
270                        dataset_cache_file, unixtime_in_secs=snapshots_changed, if_more_recent=True
271                    )
273    def zfs_get_snapshots_changed(self, r: Remote, sorted_datasets: list[str]) -> dict[str, int]:
274        """For each given dataset, returns the ZFS dataset property "snapshots_changed", which is a UTC Unix time in integer
275        seconds; See https://openzfs.github.io/openzfs-docs/man/7/zfsprops.7.html#snapshots_changed"""
277        def _run_zfs_list_command(_cmd: list[str], batch: list[str]) -> list[str]:
278            try:
279                return self.job.run_ssh_command_with_retries(r, LOG_TRACE, print_stderr=False, cmd=_cmd + batch).splitlines()
280            except CalledProcessError as e:
281                return stderr_to_str(e.stdout).splitlines()
282            except UnicodeDecodeError:
283                return []
285        assert (not self.job.is_test_mode) or sorted_datasets == sorted(sorted_datasets), "List is not sorted"
286        p = self.job.params
287        cmd: list[str] = p.split_args(f"{p.zfs_program} list -t filesystem,volume -s name -Hp -o snapshots_changed,name")
288        results: dict[str, int] = {}
289        interner: SortedInterner[str] = SortedInterner(sorted_datasets)  # reduces memory footprint
290        for lines in itr_ssh_cmd_parallel(
291            self.job, r, [(cmd, sorted_datasets)], lambda _cmd, batch: _run_zfs_list_command(_cmd, batch), ordered=False
292        ):
293            for line in lines:
294                if "\t" not in line:
295                    break  # partial output from failing 'zfs list' command; subsequent lines in curr batch cannot be trusted
296                snapshots_changed, dataset = line.split("\t", 1)
297                if not dataset:
298                    break  # partial output from failing 'zfs list' command; subsequent lines in curr batch cannot be trusted
299                dataset = interner.interned(dataset)
300                if snapshots_changed == "-" or not snapshots_changed:
301                    snapshots_changed = "0"
302                results[dataset] = int(snapshots_changed)
303        return results
306def set_last_modification_time_safe(
307    path: str,
308    unixtime_in_secs: int | tuple[int, int],
309    if_more_recent: bool = False,
310) -> None:
311    """Like set_last_modification_time() but creates directories if necessary."""
312    try:
313        os.makedirs(os.path.dirname(path), mode=DIR_PERMISSIONS, exist_ok=True)
314        set_last_modification_time(path, unixtime_in_secs=unixtime_in_secs, if_more_recent=if_more_recent)
315    except FileNotFoundError:
316        pass  # harmless
319def set_last_modification_time(
320    path: str,
321    unixtime_in_secs: int | tuple[int, int],
322    if_more_recent: bool = False,
323) -> None:
324    """Atomically sets the atime/mtime of the file with the given ``path``, with a monotonic guard.
326    if_more_recent=True is a concurrency control mechanism that prevents us from overwriting a newer (monotonically
327    increasing) mtime=snapshots_changed value (which is a UTC Unix time in integer seconds) that might have been written to
328    the cache file by a different, more up-to-date bzfs process.
330    For a brand-new file created by this call, we always update the file's timestamp to avoid retaining the file's implicit
331    creation time ("now") instead of the intended timestamp.
333    Design Rationale: Open without O_CREAT first; if missing, create exclusively (O_CREAT|O_EXCL) to detect that this call
334    created the file. Only apply the monotonic early-return check when the file pre-existed; otherwise perform the initial
335    timestamp write unconditionally. This preserves concurrency safety and prevents silent skips on first write.
336    """
337    unixtimes = (unixtime_in_secs, unixtime_in_secs) if isinstance(unixtime_in_secs, int) else unixtime_in_secs
338    flags_base: int = os.O_WRONLY | os.O_NOFOLLOW | os.O_CLOEXEC
339    preexisted: bool = True
341    try:
342        fd = os.open(path, flags_base)
343    except FileNotFoundError:
344        try:
345            fd = os.open(path, flags_base | os.O_CREAT | os.O_EXCL, mode=FILE_PERMISSIONS)
346            preexisted = False
347        except FileExistsError:
348            fd = os.open(path, flags_base)  # we lost the race, open existing file
350    try:
351        # Acquire an exclusive lock; will block if lock is already held by this process or another process.
352        # The (advisory) lock is auto-released when the process terminates or the fd is closed.
353        fcntl.flock(fd, fcntl.LOCK_EX)
355        stats = os.fstat(fd)
356        st_uid: int = stats.st_uid
357        if st_uid != os.geteuid():  # verify ownership is current effective UID; same as open_nofollow()
358            raise PermissionError(errno.EPERM, f"{path!r} is owned by uid {st_uid}, not {os.geteuid()}", path)
360        # Monotonic guard: only skip when the file pre-existed, to not skip the very first write.
361        if preexisted and if_more_recent:
362            st_mtime: int = round(stats.st_mtime)
363            if unixtimes[1] < st_mtime:
364                return
365            if unixtimes[1] == st_mtime and unixtimes[0] == round(stats.st_atime):
366                return
367        os.utime(fd, times=unixtimes)  # write timestamps
368    finally:
369        os.close(fd)