BlobStore

From GnuCash
Revision as of 02:17, 4 May 2026 by Christopherlam (talk | contribs) (intitial discussion... design doc from claude. very rough.)

GnuCash Content-Addressable Blob Storage (CAS)

Overview

Add a content-addressable storage (CAS) system to GnuCash for attaching files (PDFs, images, etc.) to transactions and splits. Blobs are stored by SHA256 content hash with SQLite-backed refcounting, providing deduplication, safe cleanup, and a third option alongside the existing doclink (URL/filepath) mechanism.

Key properties:

  • Blob files are stored in a per-book sibling directory (<book-path>.blobs/), with metadata in a SQLite database (<book-path>.blobs/blobs.sqlite)
  • Refcounted; garbage collected on book save
  • Address (BlobId) stored in transaction/split KVP at /kvp/attachments/<id>
  • Launchable via external handler
  • UI shows attachment icon in transaction view

Storage Model

File Layout

Assuming book file is at ~/.local/share/gnucash/MyBook.gnucash:

~/.local/share/gnucash/
├── MyBook.gnucash                          # Book file
├── MyBook.blobs/                           # Per-book blob directory (sibling)
│   ├── blobs.sqlite                        # Metadata database
│   ├── a1/
│   │   ├── a1b2c3d4e5f6...pdf
│   │   └── a1f7a8b9c0d1...jpg
│   ├── b2/
│   └── ff/
│       └── ffa1b2c3d4e5...png

Location rule: If book is at path /some/dir/MyBook.gnucash, blobs live in /some/dir/MyBook.blobs/ (sibling directory with .blobs suffix).

Sharding: First 2 hex chars of SHA256 hash form subdirectory. Reduces filesystem strain on large collections.
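
The location rule and the sharding rule can be sketched as two small path helpers. This is only an illustration: blobs_dir_for_book and blob_path are hypothetical names, and the sketch assumes the blob keeps the original file's extension, as in the layout shown above.

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: /some/dir/MyBook.gnucash -> /some/dir/MyBook.blobs
// (the book file's extension is swapped for ".blobs").
fs::path blobs_dir_for_book(fs::path book_file) {
    return book_file.replace_extension(".blobs");
}

// Hypothetical helper: <blobs-dir>/<first 2 hex chars>/<blob-id><ext>
// `ext` is the original file's extension including the dot, e.g. ".pdf".
fs::path blob_path(const fs::path& blobs_dir, const std::string& blob_id,
                   const std::string& ext) {
    return blobs_dir / blob_id.substr(0, 2) / (blob_id + ext);
}
```

With two hex characters per shard there are at most 256 subdirectories, so each one holds roughly 1/256 of the collection.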

Per-Book Metadata Database

File: <book-path>.blobs/blobs.sqlite, where <book-path> is the book file path with its extension removed (per the location rule above)

Example: /home/user/.local/share/gnucash/MyBook.blobs/blobs.sqlite

Schema:

CREATE TABLE blobs (
    blob_id TEXT PRIMARY KEY,           -- SHA256 hex digest
    original_filename TEXT NOT NULL,    -- e.g. "invoice_2024_Q1.pdf"
    mime_type TEXT,                     -- e.g. "application/pdf"
    size_bytes INTEGER NOT NULL,
    created_at INTEGER NOT NULL,        -- Unix timestamp
    refcount INTEGER NOT NULL DEFAULT 1 -- Number of KVP references
);

CREATE INDEX idx_refcount ON blobs(refcount);

Invariant: refcount >= 0, and every blob still referenced from KVP has refcount > 0. Rows whose refcount has dropped to 0 are orphaned: they stay in the table until the next GC sweep removes them.


API & Lifecycle

BlobStore Class

class BlobStore {
public:
    // Lifecycle
    BlobStore(QofBook* book);
    ~BlobStore();
    
    // Core operations
    BlobId attach(const char* filepath, const char* mime_type);
    // -> Computes SHA256. If a blob with that id already exists, increments
    //    its refcount. Otherwise copies the file to the sharded dir and
    //    inserts a row with refcount=1. Returns the blob_id.
    
    void launch(BlobId id);
    // -> Looks up original_filename and mime_type in blobs table.
    //    Opens blob in appropriate external handler (xdg-open on Linux, etc.).
    
    void detach(BlobId id);
    // -> Decrements refcount in blobs table.
    //    If refcount reaches 0, the blob is orphaned and awaits GC.
    //    Otherwise it stays. Actual file deletion is deferred to GC sweep.
    
    void gc_sweep();
    // -> Scans blobs table. For any row with refcount <= 0,
    //    delete the file from disk and remove row from table.
    
    // Metadata lookup
    gboolean get_blob_info(BlobId id, char** out_filename, char** out_mime);
    // -> Returns original_filename and mime_type for UI display.
};

Attachment Workflow

User attaches PDF to transaction:

  1. User clicks "Attach File" in transaction editor
  2. File picker dialog opens
  3. User selects invoice.pdf
  4. blob_store->attach("invoice.pdf", "application/pdf") is called
  5. BlobStore:
    • Reads file, computes SHA256 → a1b2c3d4e5f6...
    • Creates directory <book-path>.blobs/a1/ if needed
    • Copies file to <book-path>.blobs/a1/a1b2c3d4e5f6.pdf
    • Inserts row: (a1b2c3d4e5f6, "invoice.pdf", "application/pdf", 4096, now, 1) into blobs table
    • Returns BlobId("a1b2c3d4e5f6")
  6. Transaction KVP is updated: /kvp/attachments/a1b2c3d4e5f6 → "a1b2c3d4e5f6" (or just store the id)
  7. UI shows attachment icon next to transaction

User launches blob:

  1. User clicks attachment icon
  2. blob_store->launch(BlobId("a1b2c3d4e5f6")) is called
  3. BlobStore looks up "invoice.pdf" and "application/pdf" from blobs table
  4. Constructs full path: <book-path>.blobs/a1/a1b2c3d4e5f6.pdf
  5. Calls system handler: xdg-open (or equivalent)
  6. PDF viewer opens

User detaches blob from transaction:

  1. User clicks "Remove attachment" in transaction editor
  2. Transaction KVP entry /kvp/attachments/a1b2c3d4e5f6 is deleted
  3. blob_store->detach(BlobId("a1b2c3d4e5f6")) is called
  4. BlobStore decrements refcount: 1 → 0
  5. Blob is marked for cleanup but file stays on disk

Book is saved:

  1. qof_book_save() is called (synchronous, main GTK thread)
  2. Before writing, trigger blob_store->gc_sweep()
  3. GC scans blobs table for any row with refcount <= 0
  4. For each: delete file from disk, remove row from table
  5. Book is written to disk (KVP no longer references deleted blobs)

KVP Structure

Transaction-Level Attachment

BlobId is stored in transaction KVP as a string:

/kvp/attachments/<blob-id> → "a1b2c3d4e5f6"

Design decision: Store just the BlobId string in KVP. Metadata (original filename, mime type) lives in blobs.sqlite and is fetched on demand by UI. Blob management is entirely C++ (libgnucash/engine/gnc-blobstore.cpp|hpp); Scheme has no involvement.

Split-Level Attachment (Future)

If blobs are also allowed on splits, the same pattern applies to the split's KVP:

/kvp/attachments/<blob-id> → "b2c3d4e5f6a1"

(Also managed by C++ transaction/split editor code, not Scheme.)


Deduplication & Refcounting

Scenario: User attaches the same PDF to two different transactions.

  1. First attach: SHA256(invoice.pdf) → a1b2c3d4e5f6
    • File copied to blobs/a1/a1b2c3d4e5f6.pdf
    • Row inserted: (a1b2c3d4e5f6, "invoice.pdf", ..., refcount=1)
  2. Second attach: Same PDF file
    • SHA256 hash matches → a1b2c3d4e5f6
    • File already exists; skip copy
    • Increment refcount: 1 → 2
    • Both transactions' KVP point to same BlobId
    • Single file on disk, two references in metadata
  3. User detaches from first transaction:
    • Decrement refcount: 2 → 1
    • File stays; only one reference left
    • Second transaction still has working link
  4. User detaches from second transaction:
    • Decrement refcount: 1 → 0
    • On next save, GC sweep deletes the file

Benefit: No disk duplication. 100 transactions with the same invoice PDF = 1 file + 1 row with refcount=100.
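
The refcount transitions in this scenario can be modelled with a small in-memory table. The sketch below stands in for the refcount column of blobs.sqlite; RefTable and its method names are hypothetical and chosen only for illustration.

```cpp
#include <map>
#include <string>
#include <vector>

// In-memory stand-in for the refcount column of the blobs table.
class RefTable {
public:
    // Adds one reference. Returns true when the blob had no live
    // references before this call (the caller then stores the file).
    bool attach(const std::string& blob_id) {
        return ++counts_[blob_id] == 1;
    }

    // Drops one reference; never goes below zero. Unknown ids are ignored.
    void detach(const std::string& blob_id) {
        auto it = counts_.find(blob_id);
        if (it != counts_.end() && it->second > 0) --it->second;
    }

    int refcount(const std::string& blob_id) const {
        auto it = counts_.find(blob_id);
        return it == counts_.end() ? 0 : it->second;
    }

    // Blob ids whose refcount reached zero and await the next GC sweep.
    std::vector<std::string> orphans() const {
        std::vector<std::string> out;
        for (const auto& [id, n] : counts_)
            if (n == 0) out.push_back(id);
        return out;
    }

private:
    std::map<std::string, int> counts_;
};
```

Attaching the same id twice and detaching twice walks the count through 1, 2, 1, 0, at which point the id shows up in orphans().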


Garbage Collection

Trigger

GC runs synchronously during qof_book_save(), before the book is written to disk.

void qof_book_save(...) {
    // ... pre-save setup ...
    
    // GC first
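    // Data registered with qof_book_set_data_fin() is read back with
    // qof_book_get_data(), which returns a gpointer.
    auto* bs = static_cast<BlobStore*>(qof_book_get_data(book, "blob-store"));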
    if (bs) {
        bs->gc_sweep();
    }
    
    // Then save
    // ...
}

Algorithm

  1. Query blobs table: SELECT blob_id, refcount FROM blobs WHERE refcount <= 0
  2. For each orphaned blob:
    • Construct path: <book-path>.blobs/<first-2-chars>/<blob-id>.*
    • Delete file from disk
    • DELETE FROM blobs WHERE blob_id = ?
  3. Commit transaction to <book-path>.blobs/blobs.sqlite

Robustness: If file doesn't exist on disk (corrupted state), log warning and continue deleting the row.


Thread Safety & Locking

Current assumption: qof_book_save() is synchronous and runs on the main GTK thread. No background I/O during attach/detach/GC.

Synchronization strategy:

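  • blobs.sqlite relies on SQLite's built-in file locking; WAL (Write-Ahead Logging) journal mode can be enabled with PRAGMA journal_mode=WAL if concurrent readers are needed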
  • No explicit mutex needed if all access is serialized by GTK main loop
  • If future work adds async save: add pthread_mutex_t to BlobStore and wrap all SQLite operations

Invariant to maintain: Only one BlobStore instance per QofBook. Enforce via singleton pattern tied to qof_book_set_data_fin().


Lifecycle & Cleanup

Book Open

  1. gnc_book_load() or equivalent
  2. Construct BlobStore instance
  3. Derive blobs directory: <book-path>.blobs/ (create if needed)
  4. Open <book-path>.blobs/blobs.sqlite; create table if needed
  5. Attach to book via qof_book_set_data_fin(book, "blob-store", bs, cleanup_fn)

Book Save

  1. User clicks Save
  2. qof_book_save() calls blob_store->gc_sweep()
  3. Orphaned files and rows deleted
  4. Book written to disk (KVP now has only valid BlobIds)

Book Close

  1. qof_book_close() triggers cleanup function
  2. BlobStore destructor:
    • No GC here (already ran on save)
    • Close blobs.sqlite connection
    • Free memory

UI Integration

Transaction View (Existing)

Add attachment icon/indicator next to transaction:

  • Icon shows "📎" or similar if /kvp/attachments contains any BlobId
  • On hover: tooltip lists original filenames from blobs.sqlite
  • Click icon → opens Blob Manager dialog (see below)

Blob Manager Dialog (New)

Optional but recommended global inventory:

┌─ Blob Manager ──────────────────────┐
│ Book: MyBook.gnucash                │
│                                      │
│ Filename          │ Size  │ Refcount│
│ ─────────────────────────────────── │
│ invoice_2024.pdf  │ 4.2MB │    3    │
│ receipt_jan.jpg   │ 1.1MB │    1    │
│ contract.pdf      │ 2.3MB │    0 ⚠️ │
│                                      │
│ [Delete Orphaned]  [Launch] [Detach]│
│                  [Close]             │
└──────────────────────────────────────┘

Features:

  • List all blobs with metadata
  • Show refcount; highlight refcount <= 0 as orphaned
  • Quick launch/detach from here
  • Manual "Delete Orphaned" to run GC now (for user impatience)
  • Useful for diagnostics: "Why is this blob still taking space?"

Comparison with Existing doclink

Feature        | doclink                      | BlobStore
---------------+------------------------------+----------------------------
Storage        | External (user manages)      | Internal (GnuCash manages)
Format         | URL or absolute filepath     | SHA256 content hash
Deduplication  | No                           | Yes (refcounted)
Portability    | Poor (breaks if file moved)  | Good (lives in book dir)
Cleanup        | Manual                       | Automatic (GC on save)
Use case       | Links to external assets     | Embedded blobs

Note: Both can coexist. A transaction can have a doclink and a blob attachment. Doclink remains for external references (e.g., "see our website"); blob for embedded PDFs.


Implementation Roadmap

Phase 1: Core (Minimal)

  1. Create BlobStore class: attach(), launch(), detach(), gc_sweep()
  2. Per-book blobs.sqlite with refcount table
  3. Hook into qof_book_load() and qof_book_save()
  4. KVP integration: store/retrieve BlobId in /kvp/attachments/

Phase 2: UI (Recommended)

  1. Transaction editor: "Attach File" button + attachment icon
  2. Click icon → Blob Manager dialog
  3. Launchable from manager

Phase 3: Polish (Future)

  1. Drag-and-drop attachment to transaction
  2. Thumbnail preview in Blob Manager
  3. Search/filter blobs by filename or date
  4. Export option to bundle blobs with book backup

CLI Extension: Blobs GC

Optional command-line tool for explicit garbage collection without opening the GUI.

Invocation

gnucash --blobs-gc /path/to/MyBook.gnucash [--dry-run] [--report] [--json]

Options

  • --dry-run: Report orphans but don't delete; print the count of files that would be freed (exit codes are listed under Implementation Notes)
  • --report: List all blobs with blob_id, filename, size, refcount, and orphan status
  • --json: Output --report in JSON format (for scripting)
  • (no flags): Actually delete orphans (like a normal GC)
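
One way to interpret these flags is sketched below. GcOptions and parse_gc_args are hypothetical names, and the sketch assumes it receives only the arguments that follow --blobs-gc. How this would hook into GnuCash's real option parsing is left open.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical option set mirroring the flags listed above.
struct GcOptions {
    std::string book_path;
    bool dry_run = false;
    bool report = false;
    bool json = false;
};

// Parses the arguments that follow --blobs-gc. Returns std::nullopt on an
// unknown flag, a second positional argument, or a missing book path.
std::optional<GcOptions> parse_gc_args(const std::vector<std::string>& args) {
    GcOptions opts;
    for (const std::string& a : args) {
        if (a == "--dry-run") opts.dry_run = true;
        else if (a == "--report") opts.report = true;
        else if (a == "--json") opts.json = true;
        else if (a.rfind("--", 0) == 0) return std::nullopt;  // unknown flag
        else if (opts.book_path.empty()) opts.book_path = a;
        else return std::nullopt;  // more than one positional argument
    }
    if (opts.book_path.empty()) return std::nullopt;
    return opts;
}
```

With all three booleans false, the caller would perform a real sweep. That matches the "(no flags)" case.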

Use Cases

  • Impatience: User wants to free disk space immediately without waiting for next save
  • Diagnostics: Run headless to report blob state: gnucash --blobs-gc book.gnucash --report
  • Scripting: Automated cleanup in cron: gnucash --blobs-gc book.gnucash --json | jq '.orphaned | length'
  • Safety: Dry-run first to see what will be deleted: gnucash --blobs-gc book.gnucash --dry-run

Implementation Notes

  • Open only <book-path>.blobs/blobs.sqlite; skip loading full book (fast, lightweight)
  • Serialize to JSON (if --json): {"orphaned": [...], "total_files": N, "bytes_freed": M}
  • Human-readable table output (default): aligned columns with file/size/refcount
  • Exit codes: 0 (success), 1 (error opening book/DB), 2 (no blobs.sqlite found)
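
The JSON shape and exit codes above could be produced along these lines. The sketch assumes that "orphaned" is an array of blob id strings; the notes above do not specify the element type. GcReport, ExitCode and to_json are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Exit codes as listed in the implementation notes.
enum ExitCode { kSuccess = 0, kOpenError = 1, kNoBlobDb = 2 };

// Hypothetical summary produced by a sweep or dry run.
struct GcReport {
    std::vector<std::string> orphaned;  // blob ids with refcount <= 0
    std::uint64_t total_files = 0;
    std::uint64_t bytes_freed = 0;
};

// Serializes the report. Blob ids are hex digests, so they need no JSON
// string escaping; arbitrary filenames would.
std::string to_json(const GcReport& r) {
    std::ostringstream out;
    out << "{\"orphaned\": [";
    for (std::size_t i = 0; i < r.orphaned.size(); ++i) {
        if (i) out << ", ";
        out << '"' << r.orphaned[i] << '"';
    }
    out << "], \"total_files\": " << r.total_files
        << ", \"bytes_freed\": " << r.bytes_freed << "}";
    return out.str();
}
```

If the report later includes original filenames, a proper JSON library or escaping routine would be needed, since filenames can contain quotes and backslashes.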

Open Questions & Notes

  1. MIME type detection: Use libmagic or user input? (Suggest libmagic, or content-type guessing via GIO's g_content_type_guess())
  2. File size limits: Any constraints? E.g., reject >50MB blobs?
  3. Orphan notification: Should UI warn user if closing book with orphaned blobs?
  4. Export/backup: When user exports book to XML/backup, do blobs get included? (Suggest bundling into ZIP.)
  5. Cross-book references: Can transaction A (book X) reference a blob from book Y? (No for now; blobs per-book.)

Testing Strategy

  • Unit: attach same file twice → verify refcount increment
  • Unit: detach both → verify GC cleanup
  • Integration: attach to split, save, reopen, verify blob present
  • Edge case: corrupt blobs.sqlite → verify graceful recovery
  • Performance: attach 1000-item collection → measure GC time

Security & Privacy Considerations

  • Filename squatting: Original filename stored in metadata; not used to create paths (SHA256 path is canonical).
  • Symlink attacks: Validate that the resolved blob path is within <book-path>.blobs/ before open().
  • Disk space: No quota enforcement; assume admin manages storage.
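
The containment check for the symlink case could be sketched as follows, assuming a POSIX-like filesystem. is_inside_blobs_dir is a hypothetical name. The check compares resolved paths component by component, so that a sibling such as MyBook.blobs-evil does not pass a plain string-prefix test.

```cpp
#include <algorithm>
#include <filesystem>

namespace fs = std::filesystem;

// True if `candidate` resolves to a location strictly inside `blobs_dir`.
// weakly_canonical resolves symlinks and ".." for the parts of the path
// that exist, so a link pointing outside the directory is rejected.
bool is_inside_blobs_dir(const fs::path& blobs_dir, const fs::path& candidate) {
    const fs::path root = fs::weakly_canonical(blobs_dir);
    const fs::path target = fs::weakly_canonical(candidate);
    // `root` must be a strict component-wise prefix of `target`.
    auto [root_it, target_it] =
        std::mismatch(root.begin(), root.end(), target.begin(), target.end());
    return root_it == root.end() && target_it != target.end();
}
```

The check is only meaningful if it runs immediately before the file is opened, since the filesystem can change between validation and use.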