De-duplication
rustmailer edited this page 2026-01-23 13:21:10 +08:00

Overview

Bichon applies single-instance storage at the account level: each unique email is stored only once within an account, regardless of how many folders it appears in.

1. Storage Architecture

Bichon decouples email information into two separate storage layers to optimize both search performance and storage integrity:

  • Metadata Index: Extracted headers and properties (e.g., Subject, Sender, Date) are stored in a dedicated index for rapid querying.
  • Blob Storage: The full, raw email content (MIME) is stored in a separate directory.

The Primary Key

Both the Metadata and the Full Content use a unique key derived from the email's Message-ID. This serves as the "Primary Key" for the lifecycle of that email within the system.
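The page does not specify how the key is derived from the Message-ID, so the following is only a minimal sketch under stated assumptions: the function name `storage_key`, the SHA-256 hash, the normalization step, and the inclusion of an account ID (to reflect account-level scope) are all hypothetical and not taken from Bichon's code.

```python
import hashlib


def storage_key(account_id: int, message_id: str) -> str:
    """Derive a stable key from an account and a Message-ID (hypothetical scheme)."""
    # Assumption: trim whitespace and the angle brackets that usually wrap a Message-ID,
    # so that "<id@host>" and "id@host" map to the same key.
    normalized = message_id.strip().strip("<>")
    # Assumption: scope the key to the account, since de-duplication is per account.
    material = f"{account_id}:{normalized}".encode("utf-8")
    return hashlib.sha256(material).hexdigest()


key_a = storage_key(1, "<abc123@example.com>")
key_b = storage_key(1, "  <abc123@example.com>  ")  # same message, extra whitespace
key_c = storage_key(2, "<abc123@example.com>")      # same message, different account
```

Under this scheme the same Message-ID always yields the same key within an account, so the metadata entry and the stored content can both be addressed by it.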

2. De-duplication Logic: "Delete-then-Write"

To maximize write efficiency, Bichon does not perform a traditional "update" or "check-if-exists" read operation. Instead, it follows a strict sequence:

  1. Extract: Identify the Message-ID of the incoming email.
  2. Purge: Immediately delete any existing Metadata and Full Content associated with that Message-ID.
  3. Insert: Write the new Metadata and Full Content to the storage layers.

This "Last-In-Wins" strategy keeps at most one stored copy per Message-ID within an account, and write operations are not slowed down by existence checks or conflict resolution.
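The Extract, Purge, Insert sequence can be sketched with two in-memory dictionaries standing in for the metadata index and the blob storage. The class `MailStore`, its method names, and the use of the raw Message-ID as the key are assumptions made for illustration; the real storage layers are not described on this page.

```python
class MailStore:
    """Toy model of the two storage layers (hypothetical, in-memory only)."""

    def __init__(self):
        self.metadata = {}  # stands in for the metadata index
        self.blobs = {}     # stands in for the raw MIME blob storage

    def ingest(self, message_id: str, folder: str, subject: str, raw: bytes) -> None:
        # 1. Extract: the Message-ID is used directly as the key in this sketch.
        key = message_id
        # 2. Purge: delete unconditionally; no branch on whether the key exists.
        self.metadata.pop(key, None)
        self.blobs.pop(key, None)
        # 3. Insert: write the incoming version to both layers.
        self.metadata[key] = {"folder": folder, "subject": subject}
        self.blobs[key] = raw


store = MailStore()
store.ingest("<m1@example.com>", "INBOX", "Hello", b"raw v1")
store.ingest("<m1@example.com>", "Archive", "Hello", b"raw v2")  # same Message-ID again
```

After the second call the store still holds exactly one entry for `<m1@example.com>`, and it is the later version.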

3. Practical Implications

Folder Synchronization (e.g., The Trash Scenario)

Because Bichon maintains only one copy per Message-ID at the account level, moving emails between folders results in an "overwrite" rather than a "duplicate":

  • Example:
    1. Email X is synced from the Inbox. Bichon stores it.
    2. You move Email X to the Trash folder in your mail client.
    3. When Bichon syncs the Trash folder, it sees Email X again.
    4. Bichon deletes the "Inbox version" of Email X and writes the "Trash version."
  • Outcome: The email appears to have moved. While Bichon's primary intent is simply to ensure only one copy exists, the side effect is a clean representation of the email's latest state.
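The Trash scenario above can be replayed in a few lines. This is a sketch only: `sync_folder` and the `index` dictionary are hypothetical stand-ins, and the real sync process is not shown on this page.

```python
index = {}  # hypothetical metadata index: Message-ID -> properties


def sync_folder(folder: str, message_ids: list) -> None:
    """Apply delete-then-write for every message seen in one folder."""
    for message_id in message_ids:
        index.pop(message_id, None)              # purge whatever version exists
        index[message_id] = {"folder": folder}   # write the version just seen


sync_folder("INBOX", ["<x@example.com>"])  # step 1: Email X is synced from the Inbox
sync_folder("Trash", ["<x@example.com>"])  # steps 3-4: X is seen again in the Trash
current_folder = index["<x@example.com>"]["folder"]
```

Here `current_folder` ends up as `"Trash"` and the index holds a single entry, which is why the email appears to have moved.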

Bulk Imports via nosync

When using the nosync tool to import large datasets:

  • If the source data contains duplicate emails (same Message-ID), the version that is processed last will be the one that persists in Bichon.
  • This ensures that no matter how many times a duplicate is imported, the storage footprint does not grow unnecessarily.
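To illustrate why repeated duplicates do not grow the storage footprint, the sketch below runs a batch containing duplicates through the same delete-then-write rule. It does not model the nosync tool itself; `import_batch` and the sample records are hypothetical.

```python
def import_batch(records):
    """Apply delete-then-write to (message_id, raw_bytes) pairs, in order."""
    blobs = {}
    for message_id, raw in records:
        blobs.pop(message_id, None)  # purge any earlier copy
        blobs[message_id] = raw      # insert the copy being processed
    return blobs


records = [
    ("<a@example.com>", b"first copy of A"),
    ("<b@example.com>", b"only copy of B"),
    ("<a@example.com>", b"second copy of A"),
    ("<a@example.com>", b"third copy of A"),
]
blobs = import_batch(records)
footprint = sum(len(raw) for raw in blobs.values())  # bytes kept after the import
```

Four records go in, two survive, and the copy of A that was processed last is the one that persists.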

4. Summary Table

Feature              | Implementation
---------------------|--------------------------------------------------------
De-duplication Scope | Account Level
Primary Key          | Derived from Message-ID
Storage Strategy     | Separate Metadata (Index) vs. Full Content (Filesystem)
Write Pattern        | Atomic Delete-then-Write
Design Goal          | High write throughput & Single-instance storage