Overview
Bichon is designed with a Single-Instance Storage philosophy at the Account Level. The primary goal is to ensure that any unique email is stored only once within an account, regardless of how many folders it appears in.
1. Storage Architecture
Bichon decouples email information into two separate storage layers to optimize both search performance and storage integrity:
- Metadata Index: Extracted headers and properties (e.g., Subject, Sender, Date) are stored in a dedicated index for rapid querying.
- Blob Storage: The full, raw email content (MIME) is stored in a separate directory.
The Primary Key
Both the Metadata and the Full Content use a unique key derived from the email's Message-ID. This serves as the "Primary Key" for the lifecycle of that email within the system.
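As a rough illustration of what "derived from the Message-ID" could look like, the sketch below hashes the header together with an account identifier. The hashing scheme, the `storage_key` function, and the `account_id` parameter are all assumptions made for this example; the text above does not specify how Bichon actually computes the key.

```python
import hashlib
from email import message_from_bytes


def storage_key(account_id: int, raw_email: bytes) -> str:
    """Hypothetical key: SHA-256 over the account id and the Message-ID.

    This is an assumed derivation, not Bichon's documented one.
    """
    msg = message_from_bytes(raw_email)
    # Normalize the header: trim whitespace and the surrounding angle brackets.
    message_id = (msg.get("Message-ID") or "").strip().strip("<>")
    if not message_id:
        raise ValueError("email has no Message-ID header")
    # Including the account id keeps the key scoped to a single account.
    return hashlib.sha256(f"{account_id}:{message_id}".encode("utf-8")).hexdigest()
```

Under this scheme, two copies of the same email map to the same key within an account even if their other headers or bodies differ, which is what lets both storage layers address them as one record.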
2. De-duplication Logic: "Delete-then-Write"
To maximize write efficiency, Bichon does not perform a traditional "update" or "check-if-exists" read operation. Instead, it follows a strict sequence:
- Extract: Identify the `Message-ID` of the incoming email.
- Purge: Immediately delete any existing Metadata and Full Content associated with that `Message-ID`.
- Insert: Write the new Metadata and Full Content to the storage layers.
This "Last-In-Wins" strategy ensures that the database remains clean and that write operations are not slowed down by complex conflict resolution.
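The sequence can be sketched with a toy in-memory model. `AccountStore`, its two dictionaries, and the `ingest` method are hypothetical names standing in for the Metadata Index and the Blob Storage of one account; the Message-ID is used directly as the key here for brevity, whereas the real key is only described as derived from it.

```python
from dataclasses import dataclass, field


@dataclass
class AccountStore:
    """Toy stand-in for one account's two storage layers (illustrative only)."""
    metadata: dict = field(default_factory=dict)  # key -> extracted headers
    blobs: dict = field(default_factory=dict)     # key -> raw MIME bytes

    def ingest(self, message_id: str, headers: dict, raw: bytes) -> None:
        key = message_id  # Extract
        # Purge: unconditional delete, with no prior existence check.
        self.metadata.pop(key, None)
        self.blobs.pop(key, None)
        # Insert: the incoming version becomes the only version.
        self.metadata[key] = headers
        self.blobs[key] = raw


store = AccountStore()
store.ingest("<x@example.com>", {"Subject": "first"}, b"raw v1")
store.ingest("<x@example.com>", {"Subject": "second"}, b"raw v2")
```

After the second call the store still holds a single entry per layer, and both entries carry the later version.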
3. Practical Implications
Folder Synchronization (e.g., The Trash Scenario)
Because Bichon maintains only one copy per Message-ID at the account level, moving emails between folders results in an "overwrite" rather than a "duplicate":
- Example:
  - Email X is synced from the `Inbox`. Bichon stores it.
  - You move Email X to the `Trash` folder in your mail client.
  - When Bichon syncs the `Trash` folder, it sees Email X again.
  - Bichon deletes the "Inbox version" of Email X and writes the "Trash version."
- Outcome: The email appears to have moved. While Bichon's primary intent is simply to ensure only one copy exists, the side effect is a clean representation of the email's latest state.
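The Trash scenario can be traced in a few lines. This sketch assumes that the stored metadata records the folder an email was last seen in; the `index` dictionary and the `sync_folder` function are hypothetical and only model the behavior described above.

```python
# Hypothetical per-account metadata index: Message-ID -> stored record.
index = {}


def sync_folder(folder: str, message_ids: list) -> None:
    """Apply delete-then-write for every email seen in one folder sync."""
    for mid in message_ids:
        index.pop(mid, None)             # purge whatever an earlier sync stored
        index[mid] = {"folder": folder}  # write the version seen in this folder


sync_folder("Inbox", ["<x@example.com>"])  # Email X is first seen in the Inbox
sync_folder("Trash", ["<x@example.com>"])  # later it is seen again in Trash
```

Once both syncs have run, the index holds exactly one record for Email X, and that record points at `Trash`.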
Bulk Imports via nosync
When using the nosync tool to import large datasets:
- If the source data contains duplicate emails (same `Message-ID`), the version that is processed last will be the one that persists in Bichon.
- This ensures that no matter how many times a duplicate is imported, the storage footprint does not grow unnecessarily.
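The effect on storage footprint can be checked with a small simulation. It does not call the nosync tool itself; `import_batch` and the sample batch are made up for illustration and only mimic the delete-then-write rule on a plain dictionary.

```python
def import_batch(store: dict, emails: list) -> None:
    """Apply delete-then-write to a list of (message_id, raw_bytes) pairs."""
    for message_id, raw in emails:
        store.pop(message_id, None)  # drop any earlier copy
        store[message_id] = raw      # keep the copy processed last


store = {}
batch = [
    ("<a@example.com>", b"copy 1"),
    ("<b@example.com>", b"other email"),
    ("<a@example.com>", b"copy 2"),  # duplicate Message-ID, processed last
]
import_batch(store, batch)
import_batch(store, batch)  # importing the same data again adds nothing
```

The store ends with two entries rather than six, and the surviving entry for `<a@example.com>` is the copy that was processed last.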
4. Summary Table
| Feature | Implementation |
|---|---|
| Deduplication Scope | Account Level |
| Primary Key | Derived from Message-ID |
| Storage Strategy | Separate Metadata (Index) vs. Full Content (Filesystem) |
| Write Pattern | Atomic Delete-then-Write |
| Design Goal | High write throughput & Single-instance storage |