BlobStore
Contents
- 1 GnuCash Content-Addressable Blob Storage (CAS)
- 1.1 Overview
- 1.2 Storage Model
- 1.3 API & Lifecycle
- 1.4 KVP Structure
- 1.5 Deduplication & Refcounting
- 1.6 Garbage Collection
- 1.7 Thread Safety & Locking
- 1.8 Lifecycle & Cleanup
- 1.9 UI Integration
- 1.10 Comparison with Existing doclink
- 1.11 Implementation Roadmap
- 1.12 CLI Extension: Blobs GC
- 1.13 Open Questions & Notes
- 1.14 Testing Strategy
- 1.15 Security & Privacy Considerations
GnuCash Content-Addressable Blob Storage (CAS)
Overview
Add a content-addressable storage (CAS) system to GnuCash for attaching files (PDFs, images, etc.) to transactions and splits. Blobs are stored by SHA256 content hash with SQLite-backed refcounting, providing deduplication, safe cleanup, and a third option alongside the existing doclink (URL/filepath) mechanism.
Key properties:
- Stores in per-book SQLite database (<book-path>.blobs/blobs.sqlite)
- Refcounted; garbage collected on book save
- Address (BlobId) stored in transaction/split KVP at /kvp/attachments/<id>
- Launchable via external handler
- UI shows attachment icon in transaction view
Storage Model
File Layout
Assuming book file is at ~/.local/share/gnucash/MyBook.gnucash:
~/.local/share/gnucash/
├── MyBook.gnucash          # Book file
├── MyBook.blobs/           # Per-book blob directory (sibling)
│   ├── blobs.sqlite        # Metadata database
│   ├── a1/
│   │   ├── a1b2c3d4e5f6...pdf
│   │   └── a1f7g8h9i0j1...jpg
│   ├── b2/
│   └── ff/
│       └── ffa1b2c3d4e5...png
Location rule: If book is at path /some/dir/MyBook.gnucash, blobs live in /some/dir/MyBook.blobs/ (sibling directory with .blobs suffix).
Sharding: First 2 hex chars of SHA256 hash form subdirectory. Reduces filesystem strain on large collections.
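The location and sharding rules can be sketched as a pair of path helpers. This is a minimal illustration, not part of the proposed API: `blobs_dir_for_book` and `blob_file_path` are hypothetical names, and the sketch assumes the file extension is passed in separately with its leading dot.

```cpp
#include <filesystem>
#include <stdexcept>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: sibling directory that holds the blobs for a book file.
// /some/dir/MyBook.gnucash -> /some/dir/MyBook.blobs
fs::path blobs_dir_for_book(const fs::path& book_path) {
    return book_path.parent_path() / (book_path.stem().string() + ".blobs");
}

// Hypothetical helper: on-disk location of a single blob.
// `extension` includes the leading dot, e.g. ".pdf".
fs::path blob_file_path(const fs::path& book_path,
                        const std::string& blob_id,
                        const std::string& extension) {
    if (blob_id.size() < 2) {
        throw std::invalid_argument("blob_id too short to shard");
    }
    const std::string shard = blob_id.substr(0, 2);  // first 2 hex chars
    return blobs_dir_for_book(book_path) / shard / (blob_id + extension);
}
```

With the example book above, a blob id starting with `a1` lands in the `a1/` subdirectory of `MyBook.blobs/`.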
Per-Book Metadata Database
File: <book-path>.blobs/blobs.sqlite
Example: /home/user/.local/share/gnucash/MyBook.blobs/blobs.sqlite (following the location rule above, <book-path> is the book's path without the .gnucash extension)
Schema:
CREATE TABLE blobs (
blob_id TEXT PRIMARY KEY, -- SHA256 hex digest
original_filename TEXT NOT NULL, -- e.g. "invoice_2024_Q1.pdf"
mime_type TEXT, -- e.g. "application/pdf"
size_bytes INTEGER NOT NULL,
created_at INTEGER NOT NULL, -- Unix timestamp
refcount INTEGER NOT NULL DEFAULT 1 -- Number of KVP references
);
CREATE INDEX idx_refcount ON blobs(refcount);
Invariant: refcount >= 0 always. A row with refcount = 0 is orphaned: it is marked for deletion and removed, together with its file, during the next GC sweep.
API & Lifecycle
BlobStore Class
class BlobStore {
public:
// Lifecycle
BlobStore(QofBook* book);
~BlobStore();
// Core operations
BlobId attach(const char* filepath, const char* mime_type);
// -> Computes SHA256. If the blob is new, copies the file to the sharded dir
// and inserts a row with refcount=1; if it already exists, increments refcount.
// Returns the blob_id.
void launch(BlobId id);
// -> Looks up original_filename and mime_type in blobs table.
// Opens blob in appropriate external handler (xdg-open on Linux, etc.).
void detach(BlobId id);
// -> Decrements refcount in blobs table.
// If refcount reaches 0, the blob is orphaned and left for GC; otherwise it stays.
// Actual file deletion deferred to GC sweep.
void gc_sweep();
// -> Scans blobs table. For any row with refcount <= 0,
// delete the file from disk and remove row from table.
// Metadata lookup
gboolean get_blob_info(BlobId id, char** out_filename, char** out_mime);
// -> Returns original_filename and mime_type for UI display.
};
Attachment Workflow
User attaches PDF to transaction:
- User clicks "Attach File" in transaction editor
- File picker dialog opens
- User selects invoice.pdf
- blob_store->attach("invoice.pdf", "application/pdf") is called
- BlobStore:
  - Reads file, computes SHA256 → a1b2c3d4e5f6...
  - Creates directory <book-path>.blobs/a1/ if needed
  - Copies file to <book-path>.blobs/a1/a1b2c3d4e5f6.pdf
  - Inserts row: (a1b2c3d4e5f6, "invoice.pdf", "application/pdf", 4096, now, 1) into blobs table
  - Returns BlobId("a1b2c3d4e5f6")
- Transaction KVP is updated: /kvp/attachments/a1b2c3d4e5f6 → "a1b2c3d4e5f6" (or just store the id)
- UI shows attachment icon next to transaction
User launches blob:
- User clicks attachment icon
- blob_store->launch(BlobId("a1b2c3d4e5f6")) is called
- BlobStore looks up "invoice.pdf" and "application/pdf" from blobs table
- Constructs full path: <book-path>.blobs/a1/a1b2c3d4e5f6.pdf
- Calls system handler: xdg-open (or equivalent)
- PDF viewer opens
User detaches blob from transaction:
- User clicks "Remove attachment" in transaction editor
- Transaction KVP entry /kvp/attachments/a1b2c3d4e5f6 is deleted
- blob_store->detach(BlobId("a1b2c3d4e5f6")) is called
- BlobStore decrements refcount: 1 → 0
- Blob is marked for cleanup but file stays on disk
Book is saved:
- qof_book_save() is called (synchronous, main GTK thread)
- Before writing, trigger blob_store->gc_sweep()
- GC scans blobs table for any row with refcount <= 0
- For each: delete file from disk, remove row from table
- Book is written to disk (KVP no longer references deleted blobs)
KVP Structure
Transaction-Level Attachment
BlobId is stored in transaction KVP as a string:
/kvp/attachments/<blob-id> → "a1b2c3d4e5f6"
Design decision: Store just the BlobId string in KVP. Metadata (original filename, mime type) lives in blobs.sqlite and is fetched on demand by UI. Blob management is entirely C++ (libgnucash/engine/gnc-blobstore.cpp|hpp); Scheme has no involvement.
Split-Level Attachment (Future)
If blobs can attach to splits as well, same pattern:
/kvp/attachments/<blob-id> → "b2c3d4e5f6a1"
(Also managed by C++ transaction/split editor code, not Scheme.)
Deduplication & Refcounting
Scenario: User attaches the same PDF to two different transactions.
- First attach: SHA256(invoice.pdf) → a1b2c3d4e5f6
  - File copied to blobs/a1/a1b2c3d4e5f6.pdf
  - Row inserted: (a1b2c3d4e5f6, "invoice.pdf", ..., refcount=1)
- Second attach: Same PDF file
  - SHA256 hash matches → a1b2c3d4e5f6
  - File already exists; skip copy
  - Increment refcount: 1 → 2
  - Both transactions' KVP point to same BlobId
  - Single file on disk, two references in metadata
- User detaches from first transaction:
  - Decrement refcount: 2 → 1
  - File stays; only one reference left
  - Second transaction still has working link
- User detaches from second transaction:
  - Decrement refcount: 1 → 0
  - On next save, GC sweep deletes the file
Benefit: No disk duplication. 100 transactions with the same invoice PDF = 1 file + 1 row with refcount=100.
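The refcount transitions in the scenario above can be modelled with a small in-memory table. This is only a sketch: `RefcountTable` is a hypothetical stand-in for the `refcount` column of blobs.sqlite, and it leaves out hashing, file copying, and persistence.

```cpp
#include <map>
#include <string>

// Hypothetical in-memory stand-in for the refcount column of the blobs table.
class RefcountTable {
public:
    // Adds one reference. Returns true when this is the first live reference,
    // i.e. the caller would need to copy the file into the blob directory.
    bool attach(const std::string& blob_id) {
        int& count = counts_[blob_id];  // value-initialized to 0 for a new id
        ++count;
        return count == 1;
    }

    // Drops one reference and returns the new refcount; never goes below 0.
    int detach(const std::string& blob_id) {
        auto it = counts_.find(blob_id);
        if (it == counts_.end() || it->second == 0) {
            return 0;
        }
        return --it->second;
    }

    // Current refcount, or 0 for an unknown id.
    int refcount(const std::string& blob_id) const {
        auto it = counts_.find(blob_id);
        return it == counts_.end() ? 0 : it->second;
    }

private:
    std::map<std::string, int> counts_;
};
```

Attaching the same id twice leaves one entry with a count of 2; two detaches bring it to 0, at which point the blob waits for the next GC sweep.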
Garbage Collection
Trigger
GC runs synchronously during qof_book_save(), before the book is written to disk.
void qof_book_save(...) {
// ... pre-save setup ...
// GC first
// qof_book_get_data() returns a gpointer, so C++ needs an explicit cast
auto* bs = static_cast<BlobStore*>(qof_book_get_data(book, "blob-store"));
if (bs) {
bs->gc_sweep();
}
// Then save
// ...
}
Algorithm
- Query blobs table: SELECT blob_id, refcount FROM blobs WHERE refcount <= 0
- For each orphaned blob:
  - Construct path: <book-path>.blobs/<first-2-chars>/<blob-id>.*
  - Delete file from disk
  - DELETE FROM blobs WHERE blob_id = ?
- Commit transaction to <book-path>.blobs/blobs.sqlite
Robustness: If file doesn't exist on disk (corrupted state), log warning and continue deleting the row.
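The sweep, including the missing-file case, might look roughly like the following. It is a sketch under stated assumptions: the table is an in-memory `std::map` rather than blobs.sqlite, `BlobRow` is a hypothetical type, and keeping the row when deletion fails for a reason other than a missing file is a choice made here, not something the design specifies.

```cpp
#include <filesystem>
#include <map>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical in-memory row; the real data would live in blobs.sqlite.
struct BlobRow {
    fs::path file;  // full path of the blob on disk
    int refcount;
};

// Removes every blob whose refcount is <= 0 and returns the ids removed.
// A file that is already missing is tolerated: the row is dropped anyway
// (a real implementation would log a warning at that point).
std::vector<std::string> gc_sweep(std::map<std::string, BlobRow>& table) {
    std::vector<std::string> removed;
    for (auto it = table.begin(); it != table.end();) {
        if (it->second.refcount > 0) {
            ++it;
            continue;
        }
        std::error_code ec;
        fs::remove(it->second.file, ec);  // a missing file is not an error here
        if (ec) {
            // Assumption: on a real I/O error, keep the row and retry next sweep.
            ++it;
            continue;
        }
        removed.push_back(it->first);
        it = table.erase(it);
    }
    return removed;
}
```

In the SQLite-backed version the loop body would sit inside a single transaction, so a crash mid-sweep leaves rows for files that are already gone; the missing-file tolerance above is what makes the next sweep recover from that state.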
Thread Safety & Locking
Current assumption: qof_book_save() is synchronous and runs on the main GTK thread. No background I/O during attach/detach/GC.
Synchronization strategy:
- blobs.sqlite relies on SQLite's own file locking; WAL (Write-Ahead Logging) journal mode can be enabled so that readers do not block the writer
- No explicit mutex needed if all access is serialized by GTK main loop
- If future work adds async save: add pthread_mutex_t to BlobStore and wrap all SQLite operations
Invariant to maintain: Only one BlobStore instance per QofBook. Enforce via singleton pattern tied to qof_book_set_data_fin().
Lifecycle & Cleanup
Book Open
- gnc_book_load() or equivalent
- Construct BlobStore instance
- Derive blobs directory: <book-path>.blobs/ (create if needed)
- Open <book-path>.blobs/blobs.sqlite; create table if needed
- Attach to book via qof_book_set_data_fin(book, "blob-store", bs, cleanup_fn)
Book Save
- User clicks Save
- qof_book_save() calls blob_store->gc_sweep()
- Orphaned files and rows deleted
- Book written to disk (KVP now has only valid BlobIds)
Book Close
- qof_book_close() triggers cleanup function
- BlobStore destructor:
  - No GC here (already ran on save)
  - Close blobs.sqlite connection
  - Free memory
UI Integration
Transaction View (Existing)
Add attachment icon/indicator next to transaction:
- Icon shows "📎" or similar if /kvp/attachments contains any BlobId
- On hover: tooltip lists original filenames from blobs.sqlite
- Click icon → opens Blob Manager dialog (see below)
Blob Manager Dialog (New)
Optional but recommended global inventory:
┌─ Blob Manager ──────────────────────┐
│ Book: MyBook.gnucash                │
│                                     │
│ Filename         │ Size  │ Refcount │
│ ─────────────────────────────────── │
│ invoice_2024.pdf │ 4.2MB │ 3        │
│ receipt_jan.jpg  │ 1.1MB │ 1        │
│ contract.pdf     │ 2.3MB │ 0 ⚠️     │
│                                     │
│ [Delete Orphaned] [Launch] [Detach] │
│ [Close]                             │
└─────────────────────────────────────┘
Features:
- List all blobs with metadata
- Show refcount; highlight refcount <= 0 as orphaned
- Quick launch/detach from here
- Manual "Delete Orphaned" to run GC immediately, without waiting for the next save
- Useful for diagnostics: "Why is this blob still taking space?"
Comparison with Existing doclink
| Feature | doclink | BlobStore |
|---|---|---|
| Storage | External (user manages) | Internal (GnuCash manages) |
| Format | URL or absolute filepath | SHA256 content hash |
| Deduplication | No | Yes (refcounted) |
| Portability | Poor (breaks if file moved) | Good (lives in book dir) |
| Cleanup | Manual | Automatic (GC on save) |
| Use case | Links to external assets | Embedded blobs |
Note: Both can coexist. A transaction can have a doclink and a blob attachment. Doclink remains for external references (e.g., "see our website"); blob for embedded PDFs.
Implementation Roadmap
Phase 1: Core (Minimal)
- Create BlobStore class: attach(), launch(), detach(), gc_sweep()
- Per-book blobs.sqlite with refcount table
- Hook into qof_book_load() and qof_book_save()
- KVP integration: store/retrieve BlobId in /kvp/attachments/
Phase 2: UI (Recommended)
- Transaction editor: "Attach File" button + attachment icon
- Click icon → Blob Manager dialog
- Launchable from manager
Phase 3: Polish (Future)
- Drag-and-drop attachment to transaction
- Thumbnail preview in Blob Manager
- Search/filter blobs by filename or date
- Export option to bundle blobs with book backup
CLI Extension: Blobs GC
Optional command-line tool for explicit garbage collection without opening the GUI.
Invocation
gnucash --blobs-gc /path/to/MyBook.gnucash [--dry-run] [--report] [--json]
Options
- --dry-run: Report orphans but don't delete; print the count of files that would be freed
- --report: List all blobs with blob_id, filename, size, refcount, and orphan status
- --json: Output --report in JSON format (for scripting)
- (no flags): Actually delete orphans (like a normal GC)
Use Cases
- Impatience: User wants to free disk space immediately without waiting for next save
- Diagnostics: Run headless to report blob state: gnucash --blobs-gc book.gnucash --report
- Scripting: Automated cleanup in cron: gnucash --blobs-gc book.gnucash --json | jq '.orphaned | length'
- Safety: Dry-run first to see what will be deleted: gnucash --blobs-gc book.gnucash --dry-run
Implementation Notes
- Open only <book-path>.blobs/blobs.sqlite; skip loading full book (fast, lightweight)
- Serialize to JSON (if --json): {"orphaned": [...], "total_files": N, "bytes_freed": M}
- Human-readable table output (default): aligned columns with file/size/refcount
- Exit codes: 0 (success), 1 (error opening book/DB), 2 (no blobs.sqlite found)
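As an illustration of the --json shape listed above, the report could be serialized along these lines. `GcReport` and `to_json` are hypothetical names, and the sketch assumes the `orphaned` array holds blob ids (hex digests), which need no JSON escaping; the design does not say what the array elements are.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical summary of one GC run.
struct GcReport {
    std::vector<std::string> orphaned;  // assumed: blob ids with refcount <= 0
    std::size_t total_files = 0;
    std::uint64_t bytes_freed = 0;
};

// Serializes the report in the {"orphaned": [...], "total_files": N,
// "bytes_freed": M} shape. Ids are hex digests, so no escaping is done.
std::string to_json(const GcReport& report) {
    std::ostringstream out;
    out << "{\"orphaned\": [";
    for (std::size_t i = 0; i < report.orphaned.size(); ++i) {
        if (i > 0) {
            out << ", ";
        }
        out << '"' << report.orphaned[i] << '"';
    }
    out << "], \"total_files\": " << report.total_files
        << ", \"bytes_freed\": " << report.bytes_freed << "}";
    return out.str();
}
```

If the array later carries filenames as well, a proper JSON library would be the safer choice, since original filenames can contain quotes and backslashes.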
Open Questions & Notes
- MIME type detection: Use libmagic, GLib/GIO's g_content_type_guess(), or user input? (Suggest g_content_type_guess(), since GLib is already a dependency.)
- File size limits: Any constraints? E.g., reject >50MB blobs?
- Orphan notification: Should UI warn user if closing book with orphaned blobs?
- Export/backup: When user exports book to XML/backup, do blobs get included? (Suggest bundling into ZIP.)
- Cross-book references: Can transaction A (book X) reference a blob from book Y? (No for now; blobs per-book.)
Testing Strategy
- Unit: attach same file twice → verify refcount increment
- Unit: detach both → verify GC cleanup
- Integration: attach to split, save, reopen, verify blob present
- Edge case: corrupt blobs.sqlite → verify graceful recovery
- Performance: attach 1000-item collection → measure GC time
Security & Privacy Considerations
- Filename squatting: Original filename stored in metadata; not used to create paths (SHA256 path is canonical).
- Symlink attacks: Validate that the resolved blob path is within <book-path>.blobs/ before open().
- Disk space: No quota enforcement; assume admin manages storage.
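The containment check for the symlink case could be sketched with std::filesystem as below. `is_inside_blob_dir` is a hypothetical helper; it resolves symlinks in the parts of each path that exist and then compares the results lexically. Note that `weakly_canonical` can throw on I/O errors, and the check is not atomic with the later open(), so it narrows the attack window rather than closing it.

```cpp
#include <filesystem>

namespace fs = std::filesystem;

// Hypothetical helper: true only if `candidate` resolves to a location
// strictly inside `blob_root`. Symlinks in existing path components are
// resolved by weakly_canonical before the lexical comparison.
bool is_inside_blob_dir(const fs::path& blob_root, const fs::path& candidate) {
    const fs::path root = fs::weakly_canonical(blob_root);
    const fs::path target = fs::weakly_canonical(candidate);
    const fs::path rel = target.lexically_relative(root);
    if (rel.empty() || rel.string() == ".") {
        return false;  // unrelated roots, or the blob directory itself
    }
    // A relative path that starts with ".." climbs out of the blob directory.
    return rel.begin()->string() != "..";
}
```

A path such as `MyBook.blobs/../MyBook.gnucash` normalizes to a location outside the blob directory and is rejected, as is any absolute path elsewhere on disk.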