[chord] dhashclient question

Frank Dabek fdabek at gmail.com
Sun Apr 9 12:46:13 EDT 2006


DHash's default replication strategy creates 14 erasure-coded
fragments of each block; any 7 fragments are sufficient to reconstruct
the block. The math for figuring out what failure rate you should
expect is a little harder, but you can find it in our paper at the 1st
NSDI, "Designing a DHT for...". Also look for a paper by Weatherspoon
that has a similar formula. Hopefully this works out to the right
number. If you'd rather not do the math, you can configure DHash to
use whole-block replication instead: set the efrags and dfrags
parameters in your configuration file. You'll want dfrags = 1 and
efrags = (replication level).
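If you want the back-of-the-envelope numbers: a block is lost when
fewer than 7 of its 14 fragments survive. Here's a quick sketch (it
assumes fragments sit on distinct nodes that fail independently, and
it ignores the repair and placement details the NSDI paper models):

```python
from math import comb

def block_loss_prob(p_fail, total, needed):
    """Probability a block is unrecoverable when each fragment's holder
    fails independently with probability p_fail: the block is lost if
    fewer than `needed` of `total` fragments survive."""
    p_live = 1.0 - p_fail
    p_ok = sum(comb(total, k) * p_live**k * p_fail**(total - k)
               for k in range(needed, total + 1))
    return 1.0 - p_ok

# Default DHash coding: 14 fragments, any 7 reconstruct the block.
print(block_loss_prob(0.5, 14, 7))   # about 0.40 when half the nodes fail
# Whole-block replication with 6 copies: lost only if all 6 holders fail.
print(block_loss_prob(0.5, 6, 1))    # 0.5^6 = 0.015625
```

Note that at a 50% failure rate, 7-of-14 coding actually loses more
blocks than 6-way replication; coding's advantage is storage and
bandwidth at comparable availability, not raw survival under failure
rates that extreme.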

Some unsolicited advice: what are you trying to model by killing half
the nodes? CA falls into the ocean? I know that's how we did it in our
early papers, but, in retrospect, it was a pretty unrealistic
approach.
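As a sanity check on the 0.5^6 figure you mention below: killing
exactly half of the nodes, rather than failing each node
independently, gives essentially the same loss rate. A quick
simulation (a sketch; it assumes each block's 6 replicas sit on 6
distinct nodes chosen uniformly at random, which successor-list
placement only approximates):

```python
import random

def simulate(n_nodes=1000, n_blocks=1000, replicas=6, trials=20):
    """Fraction of blocks lost when half the nodes are killed; a block
    survives iff at least one of its replica holders survives."""
    lost = 0
    for _ in range(trials):
        # Place each block's replicas on distinct random nodes.
        holders = [random.sample(range(n_nodes), replicas)
                   for _ in range(n_blocks)]
        # Kill exactly half the nodes, chosen uniformly at random.
        dead = set(random.sample(range(n_nodes), n_nodes // 2))
        lost += sum(all(h in dead for h in hs) for hs in holders)
    return lost / (trials * n_blocks)

random.seed(0)
print(simulate())   # should come out near 0.5**6 ~ 0.016
```

So with 1000 retrievals you'd expect roughly 15 or 16 failures under
this model.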

--Frank

On 4/8/06, Yanyan Wang <Yanyan.Wang at colorado.edu> wrote:
> This is a follow-up to a question I asked you before. I am trying to run
> robustness experiments exactly like those in section 7.2.5, "Effect of
> failure", of your SOSP '01 paper "Wide-area cooperative storage with CFS".
> From your experimental results and explanations, I should observe an error
> rate of about 0.5^6 ≈ 0.016 when the fraction of failed nodes is 0.5 and
> CFS keeps 6 copies of a block. Based on my reading of your source code, you
> have set the number of copies of a block to 6. But I consistently get an
> error rate of about 0.002 in my experiments; only my first run gave an
> error rate of 0.008. I don't know what the problem is, so I have to ask
> for your help in thinking of possible explanations.
>
> I did my experiments on Emulab. For each experiment, I copied the compiled
> CFS binaries "lsd" and "filestore" onto each testbed host. Then I started
> 1000 CFS servers on these testbed hosts with "lsd". A client script on one
> of the CFS servers sent 1000 "filestore" store requests to the CFS network.
> Then half of the lsd processes (chosen randomly) were killed, and the
> client script sent 1000 "filestore" retrieve requests to the CFS network,
> one per second. After each experiment run, I cleaned up the execution
> environments of all the CFS servers (basically the db of every server).
>
> Another observation: all the errors I saw happened on the retrieve
> requests sent early in the run. But the errors this experiment should
> produce are those caused by losing all six copies of a block, and such
> errors are unrecoverable, so they should be spread evenly across the
> retrieve requests.
>
> So I am very confused by my observations. I am wondering whether CFS has
> some other mechanism that decreases the error rate, or whether there is a
> problem in my experiment setup. Thank you very much for your help!
>
> Thanks,
> Yanyan :)
>
>
> Quoting Yanyan Wang <Yanyan.Wang at colorado.edu>:
>
> > Hello chord authors,
> >
> > I ran into a weird problem when I tested the robustness of a chord
> > network. Could you please help me explain it? I used the chord prototype
> > and did the following:
> >
> > 1. I started two chord nodes each of which has 16 virtual nodes;
> > 2. I inserted 32 strings into the chord network and got 32 corresponding
> > keys;
> > 3. I killed one of the lsd processes;
> > 4. I retrieved the 32 keys.
> >
> > From the explanation in your SIGCOMM paper, I expected about half of
> > the 32 key retrievals to fail, because keys are lost when I kill one of
> > the lsd processes. But in fact all the retrievals succeed in getting the
> > corresponding strings, which surprises me. The logs of the live chord
> > node for the retrieval of key
> > 9f8ff2967cff70d7373cb304807e053a44d8a047 are:
> >
> > ...
> > lsd: will order successors 1
> > lsd: dhash_download failed: 9f8ff2967cff70d7373cb304807e053a44d8a047:0:1:
> > DHASH_NOENT at 3eb6ccaf98300a77989e0059cbe9a465bbc6f35d
> > lsd: dhash_download failed: 9f8ff2967cff70d7373cb304807e053a44d8a047:0:1:
> > DHASH_NOENT at 40f746245f6c3bb924ad6e5ddcf15811ea456e9a
> > lsd: dhash_download failed: 9f8ff2967cff70d7373cb304807e053a44d8a047:0:1:
> > DHASH_NOENT at 1ac46b2343075c88e9c0f601e96aa64b3651d8d7
> > lsd: dhash_download failed: 9f8ff2967cff70d7373cb304807e053a44d8a047:0:1:
> > DHASH_NOENT at 48896252b67623a835fb117b6559c70eeb583fc7
> > lsd: dhash_download failed: 9f8ff2967cff70d7373cb304807e053a44d8a047:0:1:
> > DHASH_NOENT at 15282e83543d99cbf76669c03b4d26b0e15b96da
> > lsd: dhash_download failed: 9f8ff2967cff70d7373cb304807e053a44d8a047:0:1:
> > DHASH_NOENT at 6793bc04e0c544d5c23e84c6c0605cd0998ee2b7
> > lsd: dhash_download failed: 9f8ff2967cff70d7373cb304807e053a44d8a047:0:1:
> > DHASH_NOENT at 766a83e4c5a2f0e76723006c6dad002de53aeaf9
> >
> > It seems that all the dhash_download calls failed. But then why can it
> > still get the correct string for the key? I am very confused and would
> > appreciate any ideas. Thanks a lot!
> >
> > Thanks!
> > Yanyan :)
> >
> >
> > =================================
> > Yanyan Wang
> > Department of Computer Science
> > University of Colorado at Boulder
> > Boulder, CO, 80302
> > =================================
> >
> > _______________________________________________
> > chord mailing list
> > chord at amsterdam.lcs.mit.edu
> > https://amsterdam.lcs.mit.edu/mailman/listinfo/chord
> >
>
>


