# Copyright 2024 Wolfgang Hoschek AT mac DOT com
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
"""Caching snapshot metadata to minimize 'zfs list -t snapshot' calls.

Purpose
=======
The ``--cache-snapshots`` mode speeds up snapshot scheduling, replication, and monitoring by storing just enough
metadata in fast local inodes (no external DB, no daemon). Instead of repeatedly invoking costly
``zfs list -t snapshot ...`` across potentially thousands or even millions of datasets, we keep tiny (i.e. empty)
per-dataset files whose inode atime/mtime atomically encode what we need to know. This reduces latency, load on ZFS, and
network chatter, while remaining dependency-free and robust under crashes or concurrent runs.

Assumptions
===========
- OpenZFS >= 2.2 provides two key UTC times with integer-second resolution: ``snapshots_changed`` (dataset level)
  and snapshot ``creation`` (snapshot level); see the illustrative sketch after this list.
  - ``snapshots_changed``: Specifies the UTC time at which a snapshot for a dataset was last created or deleted.
    See https://openzfs.github.io/openzfs-docs/man/7/zfsprops.7.html#snapshots_changed
  - ``creation``: Specifies the UTC time the snapshot was created.
    See https://openzfs.github.io/openzfs-docs/man/master/7/zfsprops.7.html#creation
- Unix atime/mtime are reliable to read and atomically updatable.
- Multiple jobs may touch the same cache files concurrently and out of order. Correctness must rely on per-file locking
  plus monotonicity guards rather than global serialization or a single-writer model.
- System clocks may differ by small skews across hosts; equal-second races can happen. We gate "freshness" with a small
  maturity time threshold (``MATURITY_TIME_THRESHOLD_SECS``) before trusting a value as authoritative.
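
For example, each property can be read as a plain integer via standard ``zfs get`` flags. A minimal sketch, assuming a
dataset ``tank/data`` and a snapshot ``tank/data@hourly`` exist (the production code below instead batches reads via
``zfs list -Hp -o snapshots_changed,name``)::

    import subprocess

    def zfs_get_unixtime(target: str, prop: str) -> int:
        """Returns the given ZFS property as an integer Unix time; 0 if unavailable ('-')."""
        out = subprocess.run(
            ["zfs", "get", "-Hp", "-o", "value", prop, target],
            check=True, capture_output=True, text=True,
        ).stdout.strip()
        return 0 if out in ("", "-") else int(out)

    # zfs_get_unixtime("tank/data", "snapshots_changed")  # dataset-level, e.g. 1736870400
    # zfs_get_unixtime("tank/data@hourly", "creation")    # snapshot-level, e.g. 1736866800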

Design Rationale
================
We intentionally encode only minimal invariants into inode timestamps and not arbitrary text payloads. This keeps I/O
tiny and allows safe, atomic, low-latency updates via a single ``utime`` call under an exclusive advisory file lock
taken with flock(2).

Cache root and hashed path segments
-----------------------------------
The cache tree lives under ``<log_parent_dir>/.cache/mods`` (see ``LogParams.last_modified_cache_dir``). To keep paths
short and safe, variable path segments are stored as URL-safe base64-encoded SHA-256 digests without padding. In what
follows, ``hash(X)`` denotes ``sha256_urlsafe_base64(str(X), padding=False)`` or a truncated variant used for brevity.
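
A minimal sketch of how a dataset-level "=" cache path could be derived (the user@host, dataset name, and cache root
below are purely illustrative; the authoritative logic is ``SnapshotCache.last_modified_cache_file()`` further down)::

    import base64, hashlib, os

    def hash_segment(text: str) -> str:
        """URL-safe base64 of the SHA-256 digest, without '=' padding."""
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")

    cache_root = "/var/log/bzfs/.cache/mods"  # assumed <log_parent_dir>/.cache/mods, for illustration only
    path = os.path.join(cache_root, hash_segment("alice@host1"), hash_segment("tank/data"), "=")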

The cache consists of four families (an example tree follows this list):
------------------------------------------------------------------------
1) Dataset-level ("=") per dataset and location (src or dst); for --create-src-snapshots, --replicate, --monitor-snapshots
   - Path: ``<cache_root>/<hash(user@host[#port])>/<hash(dataset)>/=``
   - mtime: the ZFS ``snapshots_changed`` time observed for that dataset. Monotonic writes only.
   - Used by: snapshot scheduler, replicate, monitor - as the anchor for cache equality checks.

2) Replication-scoped ("==") per source dataset and destination dataset+filters; for --replicate
   - Path: ``<cache_root>/<hash(src_user@host[#port])>/<hash(src_dataset)>/==/<hash(dst_user@host[#port])>/<hash(dst_dataset)>/<hash(filters)>``
   - Path label encodes destination namespace, destination dataset, and the snapshot-filter hash.
   - mtime: last replicated source ``snapshots_changed`` for that destination and filter set. Monotonic.
   - Used by: replicate - to cheaply decide "src unchanged since last successful run to this dst+filters".

3) Monitor ("===") per dataset and label (Latest/Oldest); for --monitor-snapshots
   - Path: ``<cache_root>/<hash(user@host[#port])>/<hash(dataset)>/===/<kind>/<hash(notimestamp_label)>/<hash(alert_plan)>``
   - ``kind``: alert check mode; either "L" (Latest) or "O" (Oldest).
   - ``hash(alert_plan)``: stable digest over the monitor alert plan to scope caches per plan.
   - atime: creation time of the relevant latest/oldest snapshot.
   - mtime: dataset ``snapshots_changed`` observed when that creation was recorded. Monotonic.
   - Used by: monitor - to alert on stale snapshots without listing them every time.

4) Snapshot scheduler per-label files (under the source dataset); for --create-src-snapshots
   - Path: ``<cache_root>/<hash(src_user@host[#port])>/<hash(src_dataset)>/<hash(notimestamp_label)>``
   - atime: creation time of the latest snapshot matching that label.
   - mtime: the dataset-level "=" value at the time of the write (i.e., the then-current ``snapshots_changed``).
   - Used by: ``--create-src-snapshots`` - to cheaply decide whether a label is due without ``zfs list -t snapshot``.
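
Putting it together, a populated cache tree might look like this (hash segments abbreviated and all values invented
purely for illustration)::

    <cache_root>/
        kJ8.../                          # hash(src_user@host[#port])
            A3f.../                      # hash(src_dataset)
                =                        # family 1: dataset-level snapshots_changed
                Q9z...                   # family 4: per-label creation cache (--create-src-snapshots)
                ==/B7c.../D1e.../F4a...  # family 2: replication-scoped cache (--replicate)
                ===/L/Q9z.../M2b...      # family 3: monitor cache for "Latest" (--monitor-snapshots)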

How trust in a cache file is established
========================================
For a cache file to be trusted and used as a fast path, three conditions must hold:
1) Equality: the dataset-level "=" mtime must equal the live ZFS ``snapshots_changed`` of the corresponding dataset.
   This ensures the filesystem state that the cache describes is the same as the live state.
2) Maturity: that live ``snapshots_changed`` is strictly older than ``now - MATURITY_TIME_THRESHOLD_SECS`` to avoid
   equal-second races and tame small clock skew between initiator and ZFS hosts.
3) Internal consistency for per-label/monitor cache files: their mtime must equal the current dataset-level "=" value,
   and their atime must be a plausible creation time not later than mtime (atime <= mtime). A zero atime/mtime indicates
   unknown provenance and must force fallback.

If any condition fails, the code falls back to ``zfs list -t snapshot`` for just those datasets; upon completion it
rewrites the relevant cache files, monotonically.
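
A condensed sketch of the equality + maturity gate (the threshold value and function name are illustrative, not this
module's actual API; the real checks live in the bzfs consumers that read the cache via this module)::

    import time

    MATURITY_TIME_THRESHOLD_SECS = 1.1  # assumed value, for illustration only

    def is_trusted(cached_mtime: int, live_snapshots_changed: int) -> bool:
        """True if the cached dataset-level "=" value may be used as a fast path."""
        return (
            live_snapshots_changed != 0
            and cached_mtime == live_snapshots_changed  # equality
            and live_snapshots_changed < time.time() - MATURITY_TIME_THRESHOLD_SECS  # maturity
        )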

Concurrency and correctness mechanics
=====================================
All writes go through ``set_last_modification_time_safe()`` (usage sketched after this list), which:
- Creates parent directories if necessary, opens the file with ``O_NOFOLLOW|O_CLOEXEC`` and takes an exclusive ``flock``.
- Updates times atomically via ``os.utime(fd, times=(atime, mtime))``.
- Applies a monotonic guard: with ``if_more_recent=True``, older timestamps never clobber newer ones. This is what
  makes concurrent runs safe and idempotent.
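
For example, a writer might record a freshly observed ``snapshots_changed`` value like so (the path and timestamp are
invented for illustration)::

    set_last_modification_time_safe(
        "/var/log/bzfs/.cache/mods/kJ8.../A3f.../=",  # hypothetical dataset-level "=" cache file
        unixtime_in_secs=1736870400,                  # the observed snapshots_changed value
        if_more_recent=True,                          # never clobber a newer value written by a concurrent run
    )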

Cache invalidation - what and why
=================================
Two forms exist:
1) Dataset-level invalidation by directory (non-recursive): when a mismatch is detected, top-level files for the
   dataset ("=" and flat per-label files) are zeroed to force subsequent ``zfs list -t snapshot``. Monitor caches
   live in subdirectories and are refreshed by monitor runs; they are trusted only under the equality+maturity criteria
   above.
2) Selective invalidation on property unavailability: when ZFS reports ``snapshots_changed=0`` (unavailable), the
   dataset-level "=" file is reset to 0, while per-label creation caches are preserved.
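
In both forms, "zeroing" boils down to a single inode metadata write, conceptually (``os`` is the standard library
module imported below; ``cache_file`` stands for any of the cache paths described above)::

    os.utime(cache_file, times=(0, 0))  # atime = mtime = 0 marks the entry as "do not trust"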

What could be removed without losing correctness - and why we keep it
=====================================================================
Because all consumers already require equality + maturity before trusting cache state, explicit invalidation is not
strictly necessary for correctness; stale cache files would simply be ignored and later overwritten. However, we keep
the invalidation steps to improve operational observability and clarity:

Zeroing the top-level "=" is an explicit "do not trust" signal. All processes then deterministically skip the cache and
probe via ``zfs list -t snapshot`` once, after which monotonic rewrites restore a consistent, trusted state. This
simplifies observability for operators inspecting (or debugging) cache trees, and immediately establishes a stable
cache snapshot of reality. The cost is tiny (an inode metadata write) while the benefit is more operational simplicity.

The result is a design that favors simplicity and safety: tiny inode-based atomic updates, conservative guardrails before
any cache is trusted, and minimal, well-scoped invalidation to keep the system observable under change and concurrency.
"""

from __future__ import (
    annotations,
)
import errno
import fcntl
import os
import stat
from subprocess import (
    CalledProcessError,
)
from typing import (
    TYPE_CHECKING,
    Final,
)

from bzfs_main.connection import (
    run_ssh_command,
)
from bzfs_main.parallel_batch_cmd import (
    itr_ssh_cmd_parallel,
)
from bzfs_main.utils import (
    DIR_PERMISSIONS,
    LOG_TRACE,
    SortedInterner,
    sha256_urlsafe_base64,
    stderr_to_str,
)

if TYPE_CHECKING:  # pragma: no cover - for type hints only
    from bzfs_main.bzfs import (
        Job,
    )
    from bzfs_main.configuration import (
        Remote,
        SnapshotLabel,
    )

# constants:
DATASET_CACHE_FILE_PREFIX: Final[str] = "="
REPLICATION_CACHE_FILE_PREFIX: Final[str] = "=="
MONITOR_CACHE_FILE_PREFIX: Final[str] = "==="


#############################################################################
class SnapshotCache:
    """Handles last-modified cache operations for snapshot management."""

    def __init__(self, job: Job) -> None:
        # immutable variables:
        self.job: Final[Job] = job

    def get_snapshots_changed(self, path: str) -> int:
        """Returns numeric timestamp from cached snapshots-changed file."""
        return self.get_snapshots_changed2(path)[1]

    @staticmethod
    def get_snapshots_changed2(path: str) -> tuple[int, int]:
        """Like zfs_get_snapshots_changed() but reads from local cache."""
        try:  # perf: inode metadata reads and writes are fast - ballpark O(200k) ops/sec.
            s = os.stat(path, follow_symlinks=False)
            return round(s.st_atime), round(s.st_mtime)
        except FileNotFoundError:
            return 0, 0  # harmless

    def last_modified_cache_file(self, remote: Remote, dataset: str, label: str | None = None) -> str:
        """Returns the path of the cache file that is tracking last snapshot modification."""
        cache_file: str = DATASET_CACHE_FILE_PREFIX if label is None else label
        userhost_dir: str = sha256_urlsafe_base64(remote.cache_namespace(), padding=False)
        dataset_dir: str = sha256_urlsafe_base64(dataset, padding=False)
        return os.path.join(self.job.params.log_params.last_modified_cache_dir, userhost_dir, dataset_dir, cache_file)

    def invalidate_last_modified_cache_dataset(self, dataset: str) -> None:
        """Resets the timestamps of top-level cache files of the given dataset to zero.

        Purpose: Best-effort invalidation to force ``zfs list -t snapshot`` when the dataset-level '=' cache is stale.
        Assumptions: Only top-level files (the '=' file and flat per-label files) are reset; nested monitor caches
        (e.g., '===/...') are not recursively traversed.
        Design Rationale: Monitor caches are refreshed by monitor runs and guarded by snapshots_changed equality and
        maturity checks, preserving correctness without recursive work.
        """
        p = self.job.params
        cache_file: str = self.last_modified_cache_file(p.src, dataset)
        if not p.dry_run:
            try:  # Best-effort: no locking needed. Not recursive on purpose.
                zero_times = (0, 0)
                os_utime = os.utime
                with os.scandir(os.path.dirname(cache_file)) as iterator:
                    for entry in iterator:
                        os_utime(entry.path, times=zero_times)
                os_utime(cache_file, times=zero_times)
            except FileNotFoundError:
                pass  # harmless

    def update_last_modified_cache(self, datasets_to_snapshot: dict[SnapshotLabel, list[str]]) -> None:
        """Perf: copy last-modified time of the source dataset into the local cache to reduce future 'zfs list -t snapshot' calls."""
        p = self.job.params
        src = p.src
        src_datasets_set: set[str] = set()
        for datasets in datasets_to_snapshot.values():
            src_datasets_set.update(datasets)  # union

        sorted_datasets: list[str] = sorted(src_datasets_set)
        snapshots_changed_dict: dict[str, int] = self.zfs_get_snapshots_changed(src, sorted_datasets)
        for src_dataset in sorted_datasets:
            snapshots_changed: int = snapshots_changed_dict.get(src_dataset, 0)
            self.job.src_properties[src_dataset].snapshots_changed = snapshots_changed
            dataset_cache_file: str = self.last_modified_cache_file(src, src_dataset)
            if not p.dry_run:
                if snapshots_changed == 0:
                    try:  # selective invalidation: only zero the dataset-level '=' cache file
                        os.utime(dataset_cache_file, times=(0, 0))
                    except FileNotFoundError:
                        pass  # harmless
                else:  # update dataset-level '=' cache monotonically; do NOT touch per-label creation caches here
                    set_last_modification_time_safe(
                        dataset_cache_file, unixtime_in_secs=snapshots_changed, if_more_recent=True
                    )

    def zfs_get_snapshots_changed(self, remote: Remote, sorted_datasets: list[str]) -> dict[str, int]:
        """For each given dataset, returns the ZFS dataset property "snapshots_changed", which is a UTC Unix time in integer
        seconds; See https://openzfs.github.io/openzfs-docs/man/7/zfsprops.7.html#snapshots_changed"""

        def try_zfs_list_command(_cmd: list[str], batch: list[str]) -> list[str]:
            try:
                return run_ssh_command(self.job, remote, LOG_TRACE, print_stderr=False, cmd=_cmd + batch).splitlines()
            except CalledProcessError as e:
                return stderr_to_str(e.stdout).splitlines()
            except UnicodeDecodeError:
                return []

        assert (not self.job.is_test_mode) or sorted_datasets == sorted(sorted_datasets), "List is not sorted"
        p = self.job.params
        cmd: list[str] = p.split_args(f"{p.zfs_program} list -t filesystem,volume -s name -Hp -o snapshots_changed,name")
        results: dict[str, int] = {}
        interner: SortedInterner[str] = SortedInterner(sorted_datasets)  # reduces memory footprint
        for lines in itr_ssh_cmd_parallel(
            self.job, remote, [(cmd, sorted_datasets)], lambda _cmd, batch: try_zfs_list_command(_cmd, batch), ordered=False
        ):
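            # Each line is expected to look like "<snapshots_changed>\t<dataset>", e.g. "1736870400\ttank/data";
            # a missing/unavailable property is reported as "-" and mapped to 0 below.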
            for line in lines:
                if "\t" not in line:
                    break  # partial output from failing 'zfs list' command; subsequent lines in curr batch cannot be trusted
                snapshots_changed, dataset = line.split("\t", 1)
                if not dataset:
                    break  # partial output from failing 'zfs list' command; subsequent lines in curr batch cannot be trusted
                dataset = interner.interned(dataset)
                if snapshots_changed == "-" or not snapshots_changed:
                    snapshots_changed = "0"
                results[dataset] = int(snapshots_changed)
        return results


def set_last_modification_time_safe(
    path: str,
    unixtime_in_secs: int | tuple[int, int],
    if_more_recent: bool = False,
) -> None:
    """Like set_last_modification_time() but creates directories if necessary."""
    try:
        os.makedirs(os.path.dirname(path), mode=DIR_PERMISSIONS, exist_ok=True)
        set_last_modification_time(path, unixtime_in_secs=unixtime_in_secs, if_more_recent=if_more_recent)
    except FileNotFoundError:
        pass  # harmless


def set_last_modification_time(
    path: str,
    unixtime_in_secs: int | tuple[int, int],
    if_more_recent: bool = False,
) -> None:
    """Atomically sets the atime/mtime of the file with the given ``path``, with a monotonic guard.

    if_more_recent=True is a concurrency control mechanism that prevents us from overwriting a newer (monotonically
    increasing) snapshots_changed value (which is a UTC Unix time in integer seconds) that might have been written to the
    cache file by a different, more up-to-date bzfs process.

    For a brand-new file created by this call, we always update the file's timestamp to avoid retaining the file's implicit
    creation time ("now") instead of the intended timestamp.

    Design Rationale: Open without O_CREAT first; if missing, create exclusively (O_CREAT|O_EXCL) to detect that this call
    created the file. Only apply the monotonic early-return check when the file pre-existed; otherwise perform the initial
    timestamp write unconditionally. This preserves concurrency safety and prevents silent skips on first write.
    """
    unixtimes = (unixtime_in_secs, unixtime_in_secs) if isinstance(unixtime_in_secs, int) else unixtime_in_secs
    perm: int = stat.S_IRUSR | stat.S_IWUSR  # rw------- (user read + write)
    flags_base: int = os.O_WRONLY | os.O_NOFOLLOW | os.O_CLOEXEC
    preexisted: bool = True

    try:
        fd = os.open(path, flags_base)
    except FileNotFoundError:
        try:
            fd = os.open(path, flags_base | os.O_CREAT | os.O_EXCL, mode=perm)
            preexisted = False
        except FileExistsError:
            fd = os.open(path, flags_base)  # we lost the race, open existing file

    try:
        # Acquire an exclusive lock; will block if lock is already held by this process or another process.
        # The (advisory) lock is auto-released when the process terminates or the fd is closed.
        fcntl.flock(fd, fcntl.LOCK_EX)

        stats = os.fstat(fd)
        st_uid: int = stats.st_uid
        if st_uid != os.geteuid():  # verify ownership is current effective UID; same as open_nofollow()
            raise PermissionError(errno.EPERM, f"{path!r} is owned by uid {st_uid}, not {os.geteuid()}", path)

        # Monotonic guard: only skip when the file pre-existed, to not skip the very first write.
        if preexisted and if_more_recent and unixtimes[1] <= round(stats.st_mtime):
            return
        os.utime(fd, times=unixtimes)  # write timestamps
    finally:
        os.close(fd)