[chord] my plans to change merkle

Fri Aug 1 13:53:25 EDT 2003

yeah, i can do that. when?

as far as i can tell, the bottom level of the Merkle tree does not
actually contain the keys, but just a hash over all the keys.
currently, when you insert a key into the Merkle tree, you will
need to rehash all the keys in that range to produce the hash.
josh, please correct me if i am wrong.

that said, the db3 documentation claims that with the b-tree access
methods, the index is contained entirely in memory. so iterating
them should not be too bad...

but, a simple test shows otherwise. if you use dbm to insert 1024
blocks, trying 32 blocks at a time, this is the performance

trial 0  - 10.5 s
trial 1  - 14.8 s
trial 2  - 18.8 s
trial 3  - 38.1 s
trial 4  - 53.3 s
trial 5  - 32.0 s  <--- probably some db3 caching parameter is changed here
trial 6  - 35.6 s
trial 7  - 40.4 s
trial 8  - 44.4 s
trial 9  - 50.3 s
trial 10 - 55.0 s

to investigate why the performance drops, i made the following changes
  - on a block insert, only insert data into db, but don't insert key
    into the merkle tree. instead, keep it in a vector
  - after 1024 blocks, schedule a callback to insert keys in the vector
    into the merkle tree (note, i didn't change the actual merkle tree
    insert code, so it still iterates the db on every insert)
  - measure time takes to insert keys into the tree

note, i am cheating here, because each dbm call also inserts 1024 blocks.
but that's intentional, i want to see where the time is spent...

          dbm time    time to insert into merkle tree
trial 0   7.5         4.7
trial 1   7.6         8.5
trial 2   7.4         22.5
trial 3   7.2         40.7

okay, so we now know where the time goes. there are several possible causes

  - iterating is expensive

  - to iterate, we use DB_SET_RANGE to find the smallest key of a range,
    but it doesn't seem like db3 can guarantee it will return that range
    before returning a key out of that range.

having done this, i think one of the two following changes may be
acceptable

  - keep Merkle tree completely in cache, even with the lowest level,
    so we don't ever need to iterate the db. 1,000,000 blocks require
    20M... perhaps a bit expensive

  - batch insert keys into the Merkle tree, like i am doing now, but
    only iterate the entire DB once for all the inserts. this is probably
    the best solution.

but i will discuss this in a meeting before committing anything.

and josh, any insights? please chip in... thanks.

benjie

On Fri, Aug 01, 2003 at 12:16:01PM -0400, Frank Dabek wrote:
> Benjie,
> 
> 	We talked about this at Chord meeting today and are all a bit confused.
> Isn't the bottom level of the Merkle tree already an in-memory cache of
> all of the keys in the system? 
> 
> 	Why don't you present these changes at a meeting next week before you
> start hacking the code?
> 
> --Frank
> 
> On Fri, 2003-08-01 at 09:16, Benjie Chen wrote:
> > yes, i was planning to insert into the db, but not into the
> > merkle tree, on every block insert. when inserting keys into
> > the tree, we will end up traversing the db to keep the tree
> > fresh.
> > 
> > benjie
> > 
> > 
> > On Fri, Aug 01, 2003 at 07:32:41AM -0400, Frans Kaashoek wrote:
> > > wouldn't it be better to do the insert in the database, but
> > > don't iteratate throught the whole database on each insert?
> > > 
> > > perhaps i don't understand the plan, but this sounds like a
> > > bad hack.
> > > 
> > > On Fri, 01 August 2003 at 00:21 (-0400), Benjie Chen wrote:
> > > > 
> > > > so i mentioned earlier that there is a slight efficiency issue with
> > > > the Merkle implementation, namely that it tries to iterate thru the
> > > > database on every db insert, and that this could be quite expensive.
> > > > 
> > > > my plan to fix/change this is to keep a fixed sized array of
> > > > inserted keys in memory, and periodically inserts them into the
> > > > merkle tree, instead of insert into the merkle tree on every store.
> > > > in the worst case, i will insert keys into the merkle tree once
> > > > every time the synchronization protocol runs.
> > > > 
> > > > with only 200K, i can keep 10,000 keys. so this should be okay
> > > > memory-wise.
> > > > 
> > > > anyone has a problem with this?
> > > > 
> > > > benjie
> > > > 
> > > > 
> > > > -- 
> > > > benjie chen
> > > > benjie at lcs.mit.edu
> > > > _______________________________________________
> > > > chord mailing list
> > > > chord at amsterdam.lcs.mit.edu
> > > > https://amsterdam.lcs.mit.edu/mailman/listinfo/chord

-- 
benjie chen
benjie at lcs.mit.edu