[chord] dhashclient question

Yanyan Wang Yanyan.Wang at colorado.edu
Sun Apr 9 14:40:33 EDT 2006


Thank you very much for your reply. From my calculation, the probability that a
block is available when half of the CFS nodes are down is about 0.6, based on
the formula with m=7, l=14, p0=0.5. This is much worse robustness than when
there are just 6 replicas of a block.
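To spell out my calculation: with l fragments per block, any m of which suffice
to reconstruct it, and each node surviving independently with probability p0,
block availability is the binomial tail sum over k = m..l of
C(l,k) * p0^k * (1-p0)^(l-k). A short Python sketch with the numbers from this
thread (the function names are my own):

```python
from math import comb

def frag_availability(l, m, p0):
    """Probability that at least m of l independently stored
    fragments survive, each surviving with probability p0."""
    return sum(comb(l, k) * p0**k * (1 - p0)**(l - k)
               for k in range(m, l + 1))

def replica_availability(n, p0):
    """Probability that at least one of n whole-block replicas survives."""
    return 1 - (1 - p0)**n

print(frag_availability(14, 7, 0.5))   # ~0.605 with 7-of-14 erasure coding
print(replica_availability(6, 0.5))    # ~0.984 with 6 whole-block replicas
```

So at p0 = 0.5 the 7-of-14 fragment scheme does come out far less available
than 6 plain replicas, which is what prompted my question.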

From your dhblock.c source code, you configure both replicas and erasure-coded
fragments:

#define set_int Configurator::only ().set_int
  /** MTU **/
  ok = ok && set_int ("dhash.mtu", 1210);
  /** Number of fragments to encode each block into */
  ok = ok && set_int ("dhash.efrags", 14);
  /** XXX Number of fragments needed to reconstruct a given block */
  ok = ok && set_int ("dhash.dfrags", 7);
  /** XXX Number of replica for each mutable block **/
  ok = ok && set_int ("dhash.replica", 5);
  assert (ok);
#undef set_int
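If I read this together with Frank's suggestion below (dfrags = 1, efrags =
replication level), a pure-replication setup with 6 copies would mean
overriding these defaults with something like the following in the
configuration file (the option names come from the set_int calls above; the
exact file syntax is my guess):

  dhash.dfrags = 1
  dhash.efrags = 6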

So do you use replicas and erasure-coded fragments at the same time to
guarantee the robustness of the CFS network? Thank you very much!

Thanks,
Yanyan :)


Quoting Frank Dabek <fdabek at gmail.com>:

> DHash's default replication strategy creates 14 erasure coded
> fragments of each block; 7 fragments are necessary to reconstruct the
> block. The math for figuring out what failure rate you should expect
> is a little harder, but you can find it in our paper at the 1st NSDI
> "Designing a DHT for..."; Also look for a Weatherspoon paper that has
> a similar formula. Hopefully this works out to the right number. If
> you'd rather not do the math you can configure DHash to use
> replication. Try setting the efrags and dfrags parameters in your
> configuration file. You'll want dfrags = 1 and efrags = (replication
> level)
>
> Some unsolicited advice: what are you trying to model by killing half
> the nodes? CA falls into the ocean? I know that's how we did it in our
> early papers, but, in retrospect, it was a pretty unrealistic
> approach.
>
> --Frank
>
> On 4/8/06, Yanyan Wang <Yanyan.Wang at colorado.edu> wrote:
> > This is a follow-up question to a question I asked you before. I am trying to
> > do robustness experiments exactly like the experiments in section 7.2.5
> > (effect of failure) of your SOSP '01 paper "Wide-area cooperative storage
> > with CFS". From your experiment results and explanations, I should be able to
> > observe an error rate of about 0.5^6 = 0.016 when the fraction of failed
> > nodes equals 0.5 and CFS keeps 6 copies of a block. Based on my understanding
> > of your source code, you have set the number of copies of a block to 6. But I
> > always get an error rate of about 0.002 in my experiments; only my first
> > experiment run gave an error rate of 0.008. I don't know what the problem is,
> > so I have to ask for your help in thinking of some explanations.
> >
> > I did my experiments on Emulab. For each experiment, I copied the compiled
> > CFS binary distribution ("lsd", "filestore") onto each testbed host. Then I
> > started 1000 CFS servers on these testbed hosts with "lsd". A client script
> > on one of the CFS servers sent 1000 "filestore" store requests to the CFS
> > network. Then half of the lsd processes (chosen randomly) were killed, and
> > the client script sent 1000 "filestore" retrieve requests to the CFS
> > network, one per second. After each experiment run, I cleaned up the
> > execution environments of all the CFS servers (basically the db for each
> > server).
> >
> > Another observation is that all the errors I observed happened for the
> > retrieve requests sent early. But I think the errors that should happen in
> > this experiment are those caused by the loss of all six copies of a block,
> > and such errors could happen for any retrieval request because they are
> > unrecoverable.
> >
> > So I am very confused by my observations. I am wondering if CFS has another
> > mechanism for decreasing the error rate, or if there is a problem in my
> > experiment setup. Thank you very much for your help!
> >
> > Thanks,
> > Yanyan :)
> >
> >
> > Quoting Yanyan Wang <Yanyan.Wang at colorado.edu>:
> >
> > > Hello chord authors,
> > >
> > > I met a weird problem when I tried to test the robustness of a chord
> > > network. Could you please kindly help me explain it? I used the chord
> > > prototype to do the experiment. I did the following:
> > >
> > > 1. I started two chord nodes, each of which has 16 virtual nodes;
> > > 2. I inserted 32 strings into the chord network and got 32 corresponding
> > > keys;
> > > 3. I killed one of the lsd processes;
> > > 4. I retrieved the 32 keys.
> > >
> > > From the explanation in your sigcomm paper, I expected about half of the
> > > 32 key retrievals to fail because of the keys lost as a result of my
> > > killing one lsd process. But in my result all the retrievals succeed in
> > > getting the corresponding strings. I am very surprised at this result. The
> > > logs of the live chord node for the retrieval of key
> > > 9f8ff2967cff70d7373cb304807e053a44d8a047 are:
> > >
> > > ...
> > > lsd: will order successors 1
> > > lsd: dhash_download failed: 9f8ff2967cff70d7373cb304807e053a44d8a047:0:1:
> > > DHASH_NOENT at 3eb6ccaf98300a77989e0059cbe9a465bbc6f35d
> > > lsd: dhash_download failed: 9f8ff2967cff70d7373cb304807e053a44d8a047:0:1:
> > > DHASH_NOENT at 40f746245f6c3bb924ad6e5ddcf15811ea456e9a
> > > lsd: dhash_download failed: 9f8ff2967cff70d7373cb304807e053a44d8a047:0:1:
> > > DHASH_NOENT at 1ac46b2343075c88e9c0f601e96aa64b3651d8d7
> > > lsd: dhash_download failed: 9f8ff2967cff70d7373cb304807e053a44d8a047:0:1:
> > > DHASH_NOENT at 48896252b67623a835fb117b6559c70eeb583fc7
> > > lsd: dhash_download failed: 9f8ff2967cff70d7373cb304807e053a44d8a047:0:1:
> > > DHASH_NOENT at 15282e83543d99cbf76669c03b4d26b0e15b96da
> > > lsd: dhash_download failed: 9f8ff2967cff70d7373cb304807e053a44d8a047:0:1:
> > > DHASH_NOENT at 6793bc04e0c544d5c23e84c6c0605cd0998ee2b7
> > > lsd: dhash_download failed: 9f8ff2967cff70d7373cb304807e053a44d8a047:0:1:
> > > DHASH_NOENT at 766a83e4c5a2f0e76723006c6dad002de53aeaf9
> > >
> > > It seems all the dhash_downloads failed. But why can it still get the
> > > correct string for the key? I am very confused, and I would appreciate
> > > any ideas. Thanks a lot!
> > >
> > > Thanks!
> > > Yanyan :)
> > >
> > >
> > > =================================
> > > Yanyan Wang
> > > Department of Computer Science
> > > University of Colorado at Boulder
> > > Boulder, CO, 80302
> > > =================================
> > >
> > > _______________________________________________
> > > chord mailing list
> > > chord at amsterdam.lcs.mit.edu
> > > https://amsterdam.lcs.mit.edu/mailman/listinfo/chord
> > >
> >
> >
> >
>


