LSM Tree vs B+ Tree Notes

October 4, 2024

How They Work

LSM Tree

(Diagram: sequential writes in an LSM tree)

  1. Write Process:

    • New data is first written to an in-memory buffer called MemTable.
    • When MemTable reaches a size threshold, it's flushed to disk as an immutable SST (Sorted String Table) file.
    • Periodically, multiple SST files are merged and sorted in a process called compaction.
  2. Read Process:

    • Reads check the MemTable first.
    • If the key is not found there, the read checks the most recent SST file and then proceeds through progressively older SSTs.
    • Bloom filters are often used to quickly determine which SSTs might contain the data (a toy sketch of this write/read path follows the list).
  3. Levels:

    • Data is organized into multiple levels, with each deeper level holding more data that is rewritten less often.
    • Lower-numbered levels (e.g. L0, closest to memory) are smaller and hold the most recently written data; higher-numbered levels are larger and change less frequently.
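
A minimal sketch of this write/read/compaction cycle, using a plain dict as the MemTable and sorted in-memory lists standing in for SST files. The class name, flush threshold, and single-run compaction are made up for illustration, not how any real engine is laid out:

```python
import bisect

class MiniLSM:
    """Toy LSM store: a mutable MemTable plus immutable sorted runs ("SSTs")."""

    def __init__(self, flush_threshold=4):
        self.memtable = {}          # newest data, mutable, in memory
        self.sstables = []          # list of sorted [(key, value), ...]; newest run last
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memtable[key] = value  # fast in-memory write
        if len(self.memtable) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        # Freeze the MemTable into an immutable, sorted run (one "SST").
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:                 # 1. check the MemTable first
            return self.memtable[key]
        for sst in reversed(self.sstables):      # 2. newest SST to oldest
            i = bisect.bisect_left(sst, (key,))
            if i < len(sst) and sst[i][0] == key:
                return sst[i][1]
        return None                              # not found anywhere

    def compact(self):
        # Merge all runs into one, keeping only the newest value for each key.
        merged = {}
        for sst in self.sstables:                # oldest first, so newer values win
            merged.update(dict(sst))
        self.sstables = [sorted(merged.items())]
```

In a real engine the runs live on disk and compaction is organized into levels, but the control flow is the same: cheap in-memory writes, reads that walk from newest data to oldest.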

B+ Tree

  1. Structure:

    • Balanced tree with a root, internal nodes, and leaf nodes.
    • Internal nodes contain keys and pointers to child nodes.
    • Leaf nodes contain keys and either data or pointers to data.
    • Leaf nodes are typically linked for efficient range queries.
  2. Write Process:

    • Tree is traversed from root to appropriate leaf.
    • If leaf has space, data is inserted.
    • If leaf is full, it splits into two nodes, and the split propagates up the tree if necessary.
  3. Read Process:

    • Tree is traversed from root to leaf, using keys to guide the search.
    • Once the leaf is reached, data can be retrieved directly or via a pointer (see the traversal sketch after this list).
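
A minimal sketch of the root-to-leaf traversal and the linked-leaf range scan described above; insertion and splitting are omitted, and the tiny hand-built two-level tree is only for illustration:

```python
import bisect

class Leaf:
    def __init__(self, keys, values, next_leaf=None):
        self.keys, self.values = keys, values
        self.next = next_leaf           # sibling link used for range scans

class Internal:
    def __init__(self, keys, children):
        self.keys, self.children = keys, children   # len(children) == len(keys) + 1

def search(node, key):
    # Walk from the root down to a leaf, choosing the child by key comparison.
    while isinstance(node, Internal):
        node = node.children[bisect.bisect_right(node.keys, key)]
    i = bisect.bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.values[i]
    return None

def range_scan(node, lo, hi):
    # Descend to the leaf that would hold `lo`, then follow sibling links right.
    while isinstance(node, Internal):
        node = node.children[bisect.bisect_right(node.keys, lo)]
    out = []
    while node is not None:
        for k, v in zip(node.keys, node.values):
            if lo <= k <= hi:
                out.append((k, v))
            elif k > hi:
                return out
        node = node.next
    return out

# Tiny hand-built tree: one root, two leaves, leaves linked left-to-right.
right = Leaf([30, 40], ["c", "d"])
left = Leaf([10, 20], ["a", "b"], next_leaf=right)
root = Internal([30], [left, right])

print(search(root, 20))           # "b"
print(range_scan(root, 15, 35))   # [(20, "b"), (30, "c")]
```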

Performance Characteristics Explained

LSM Tree

  1. Excellent Write Performance (O(1) amortized):

    • Why: Writes are initially to in-memory MemTable, which is very fast.
    • Flushing to disk happens in large sequential writes, which are efficient.
    • Amortized O(1) because occasional larger operations (flushes and compactions) are spread over many writes.
  2. Good but Potentially Degrading Read Performance (O(log N) worst case):

    • Why: May need to check multiple SST files, increasing with the number of levels.
    • Bloom filters help skip SSTs that cannot contain the key, but false positives still trigger disk reads, and each additional SST adds another filter check.
    • Worst case occurs when data is in the oldest/largest SST level.
  3. High Space Efficiency:

    • Why: Compaction process merges and removes obsolete data.
    • Sequential storage in SSTs allows for better compression.
  4. Write Amplification:

    • Why: Data may be rewritten multiple times during compaction.
    • As a key migrates down the levels, it (along with overlapping data in the destination level) can be rewritten roughly once per level.
  5. Read Amplification:

    • Why: Might need to check multiple SST files for a single read.
    • Can be mitigated by caching and Bloom filters, but is still present (see the Bloom filter sketch after this list).
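
Read amplification is usually attacked with one Bloom filter per SST, so most files that cannot contain the key are skipped without touching disk. A minimal sketch, assuming a fixed bit count and double hashing derived from SHA-256 (real engines size the filter from the key count and a target false-positive rate):

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                   # bit array packed into one integer

    def _positions(self, key):
        # Derive k bit positions from two independent 64-bit hashes.
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely not here"; True means "possibly here".
        return all(self.bits & (1 << pos) for pos in self._positions(key))

# One filter per SST: build it at flush time, consult it before reading the file.
sst_keys = ["user:1", "user:7", "user:42"]
bf = BloomFilter()
for k in sst_keys:
    bf.add(k)

print(bf.might_contain("user:7"))    # True  -> worth searching this SST
print(bf.might_contain("user:999"))  # almost certainly False -> skip the SST
```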

B+ Tree

(Diagram: random writes in a B+ tree)

  1. Good Write Performance (O(log N)):

    • Why: Needs to traverse the tree for each write, hence logarithmic.
    • Splits and rebalances can cause additional writes, but are infrequent.
  2. Excellent Read Performance (O(log N)):

    • Why: Tree structure allows efficient traversal to locate data.
    • Height of the tree grows logarithmically with the amount of data (see the height calculation after this list).
    • Once leaf level is reached, data access is direct and fast.
  3. Moderate Space Efficiency:

    • Why: Maintains structure for fast access, which uses some extra space.
    • Internal nodes contain only keys and pointers, saving some space.
    • Can have some internal fragmentation due to partially filled nodes.
  4. Lower Write Amplification (compared to LSM):

    • Why: Updates are often in-place or require minimal restructuring.
    • No equivalent to the LSM compaction process.
  5. Minimal Read Amplification:

    • Why: Each read operation typically only traverses the tree once.
    • No need to check multiple files or structures for a single read.
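
To make the O(log N) behavior concrete: with a fan-out of a few hundred keys per page (plausible for 8-16 KB pages), the tree stays only a few levels deep even for very large tables. The fan-out and key counts below are illustrative assumptions, not measurements:

```python
import math

def btree_height(num_keys, fanout):
    # Smallest number of levels h such that fanout**h >= num_keys.
    return max(1, math.ceil(math.log(num_keys, fanout)))

for n in (1_000_000, 100_000_000, 10_000_000_000):
    print(f"{n:>14,} keys, fanout 300 -> height ~{btree_height(n, 300)}")
# Roughly 3-5 levels even for billions of keys, so a point read touches only a few pages.
```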

Use Case Suitability

LSM Tree

  • Ideal for write-heavy workloads due to fast initial writes.
  • Good for scenarios with high insert rates but less frequent updates or deletes.
  • Suitable for time-series data, log data, and some types of analytical workloads.

B+ Tree

  • Well-suited for workloads with a mix of reads and writes.
  • Excellent for scenarios requiring frequent range queries.
  • Ideal for transactional workloads in traditional relational databases.

References

https://planetscale.com/blog/btrees-and-database-indexes

https://blog.jcole.us/innodb/

https://blog.jcole.us/2013/01/10/btree-index-structures-in-innodb/

https://github.com/facebook/rocksdb/wiki/RocksDB-Overview

https://www.reddit.com/r/databasedevelopment/comments/187cp1g/comment/kbdv6j2/