FAQ for Ray

Q: Is Ray used in practice?

A: Ray is an open-source project (https://github.com/ray-project/ray)
with many users, including OpenAI.  Anyscale offers Ray as a
commercial service.

Q: What applications benefit from Ray?

A: Ray is in the line of systems such as MapReduce, Spark, etc., but
can handle a wider class of applications by supporting both functional
and stateful computations and by providing low-latency invocation of
computations.

Q: What is the scheduling policy that Ray uses?

A: The default policy is to try to run a task locally. If the local
scheduler doesn't have enough capacity for the request, it asks a
random other scheduler whether it can run the task (based on the
resources the task requests).  The details are here:
https://docs.ray.io/en/latest/ray-core/scheduling/index.html
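
A minimal sketch of the "local first, then a random other scheduler"
idea, in plain Python rather than Ray's actual scheduler code. The
names Scheduler and place_task are hypothetical, capacity is reduced
to a single CPU count, and asking the remaining schedulers in random
order until one accepts is an assumption made to keep the example
complete.

```python
import random
from dataclasses import dataclass


@dataclass
class Scheduler:
    """Hypothetical per-node scheduler that tracks free CPUs only."""
    name: str
    free_cpus: int

    def can_run(self, cpus_needed: int) -> bool:
        return self.free_cpus >= cpus_needed

    def run(self, cpus_needed: int) -> str:
        # Reserve the resources and report which node took the task.
        self.free_cpus -= cpus_needed
        return self.name


def place_task(local, others, cpus_needed, rng=random):
    """Return the name of the node that accepts the task, or None."""
    # Step 1: prefer the local node.
    if local.can_run(cpus_needed):
        return local.run(cpus_needed)
    # Step 2 (assumption): ask the other schedulers in random order.
    for candidate in rng.sample(others, k=len(others)):
        if candidate.can_run(cpus_needed):
            return candidate.run(cpus_needed)
    # Nobody has capacity; a real system would queue or retry.
    return None
```

With a local node that has 1 free CPU, a 1-CPU task stays local, while
a 2-CPU task spills to whichever other node has room.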

Q: What's the difference between actors and tasks?

A: An actor is a kind of task: a stateful one that can be invoked
multiple times and keeps its state between invocations.
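
The difference can be illustrated with a small plain-Python stand-in
(this is not Ray's API; square and CounterActor are hypothetical
names). The task is a stateless function, while the actor keeps a
counter between invocations and handles calls one at a time through a
single-threaded mailbox.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    """Task-like: stateless, the result depends only on the argument."""
    return x * x


class CounterActor:
    """Actor-like: private state, invocations handled one at a time."""

    def __init__(self):
        self._count = 0
        # A single worker thread serializes all invocations on this actor.
        self._mailbox = ThreadPoolExecutor(max_workers=1)

    def _increment(self, amount):
        self._count += amount
        return self._count

    def increment(self, amount=1):
        """Submit an invocation and return a future for its result."""
        return self._mailbox.submit(self._increment, amount)

    def shutdown(self):
        self._mailbox.shutdown(wait=True)


pool = ThreadPoolExecutor(max_workers=2)
# Running the same task twice gives the same answer both times.
task_results = [pool.submit(square, 3).result() for _ in range(2)]

counter = CounterActor()
# Each actor invocation sees the state left by the previous one.
actor_results = [counter.increment().result() for _ in range(3)]

counter.shutdown()
pool.shutdown(wait=True)
print(task_results, actor_results)  # [9, 9] [1, 2, 3]
```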

Q: What is the actor model?

A: A model of computation in which a computation can maintain local
state across invocations. For more specifics, see
https://en.wikipedia.org/wiki/Actor_model

Q: Why even bother supporting actors?

A: Developers may want to have long-running tasks that can maintain
state between invocations. For example, because the Router in 3(a) is
an actor, it can perform batching. Each model actor in 3(a) keeps its
weights between invocations so that they stay warm in local GPU memory.

Q: What is lineage reconstruction?

A: Lineage reconstruction recomputes lost results by tracing back the
call graph to a task that hasn't crashed and then re-executing the
tasks that follow it, thereby reproducing the results of the tasks
that crashed.  The computations must be idempotent.
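
A toy illustration of the idea, assuming a single-process object store
that remembers which function and inputs produced each value.
LineageStore and its methods are hypothetical names, not part of Ray.
When a value is missing, get() walks back through the recorded lineage
and re-runs the producing tasks.

```python
class LineageStore:
    """Toy object store that remembers how each value was produced."""

    def __init__(self):
        self.values = {}       # object id -> value currently available
        self.lineage = {}      # object id -> (function, ids of its inputs)
        self.reexecutions = 0  # how many tasks were re-run during recovery

    def put(self, obj_id, value):
        """Store a value with no lineage (it cannot be reconstructed)."""
        self.values[obj_id] = value

    def submit(self, obj_id, func, *input_ids):
        """Record the lineage for obj_id, then run the task once."""
        self.lineage[obj_id] = (func, input_ids)
        self.values[obj_id] = func(*(self.get(i) for i in input_ids))

    def lose(self, obj_id):
        """Simulate a crash that destroys the stored value."""
        self.values.pop(obj_id, None)

    def get(self, obj_id):
        if obj_id in self.values:
            return self.values[obj_id]
        # The value is gone: walk back through the lineage and re-execute.
        func, input_ids = self.lineage[obj_id]
        self.reexecutions += 1
        value = func(*(self.get(i) for i in input_ids))
        self.values[obj_id] = value
        return value


store = LineageStore()
store.put("x", 2)
store.submit("y", lambda x: x + 1, "x")   # y = 3
store.submit("z", lambda y: y * 10, "y")  # z = 30
store.lose("y")
store.lose("z")
recovered = store.get("z")  # re-runs the tasks for y and z
```

Recovery gives the same answer only because both functions are
deterministic and free of side effects, which is why idempotence
matters.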


Q: What are the tradeoffs between recovering values through lineage
tree re-execution and recovering values from a persistent log?

A: Lineage recovery requires that the computations are
idempotent. Lineage recovery may also be slow and expensive, since
the system may have to re-run many computations, which may produce
large amounts of data.  This is why Ray tries to speed up recovery by
re-using secondary copies: if a secondary copy of a value exists, Ray
doesn't have to re-run the computation that generates it.

Q: How would the system handle non-idempotent tasks, or what would be
the strategy to deal with side effects?

A: Ray by itself doesn't handle non-idempotent tasks. You would need,
for example, a transaction-like scheme with write-ahead logs.  This is
out of scope for the Ray system itself, but actors could implement
such a scheme on their own.
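
One way an actor might do this, sketched in plain Python under strong
assumptions: every request carries a unique id, and a Python list
stands in for durable storage. LoggedAccount is a hypothetical name,
and this is an illustration of the write-ahead idea, not something Ray
provides.

```python
class LoggedAccount:
    """Actor-like object that makes deposits safe to re-execute."""

    def __init__(self, log):
        self.log = log     # stand-in for a durable write-ahead log
        self.balance = 0
        self.applied = {}  # request id -> balance after that request
        # Recovery: rebuild the state by replaying the log in order.
        for request_id, amount in log:
            self._apply(request_id, amount)

    def _apply(self, request_id, amount):
        self.balance += amount
        self.applied[request_id] = self.balance

    def deposit(self, request_id, amount):
        # A re-executed request is recognized and not applied again.
        if request_id in self.applied:
            return self.applied[request_id]
        # Write ahead: record the intent before changing the state.
        self.log.append((request_id, amount))
        self._apply(request_id, amount)
        return self.applied[request_id]


log = []
account = LoggedAccount(log)
first = account.deposit("r1", 10)
replayed = account.deposit("r1", 10)  # duplicate caused by re-execution
second = account.deposit("r2", 5)

# Simulate a restart: a fresh actor recovers its state from the log.
restarted = LoggedAccount(log)
```

A real implementation would also have to make the log write itself
durable and atomic, which this sketch ignores.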

Q: What is fate sharing?

A: Fate sharing refers to the general idea of forcing a task to share
the fate of a task that failed; that is, failing it too.  In Ray this
is important because a task blocked on a dangling future could
otherwise block forever: the future will never complete, because the
task that would have completed it has crashed.
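
A small sketch of why this helps, using Python's standard Future type.
Supervisor, expect, and on_crash are hypothetical names. When the task
that was supposed to complete a future crashes, the supervisor fails
the future, so a waiter gets an error instead of blocking forever.

```python
from concurrent.futures import Future


class TaskFailed(Exception):
    """Raised to waiters whose producer task has crashed."""


class Supervisor:
    def __init__(self):
        # task name -> futures that this task is expected to complete
        self.pending = {}

    def expect(self, task_name):
        """Create a future that task_name is responsible for completing."""
        future = Future()
        self.pending.setdefault(task_name, []).append(future)
        return future

    def on_crash(self, task_name):
        """Fate sharing: fail every future the crashed task still owed."""
        for future in self.pending.pop(task_name, []):
            if not future.done():
                future.set_exception(TaskFailed(f"{task_name} crashed"))


supervisor = Supervisor()
result = supervisor.expect("producer")
supervisor.on_crash("producer")

try:
    result.result(timeout=1)  # returns immediately with the failure
    outcome = "completed"
except TaskFailed:
    outcome = "failed with the producer"
```

Without the on_crash step, the call to result() would simply wait
until its timeout, or forever if no timeout were given.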

Q: Why is the system called Ray?

A: There is no real reason behind the name. According to the authors,
it was one of several choices and it sounded nice.

Q: Is the idea of "ownership" here inspired by ownership in other
contexts, e.g. programming language design (such as Rust)?

A: Ownership is a common idea in systems.  I don't think
there is much overlap between Ray and Rust in terms of their specific
use of ownership.  For example, a borrower can modify an object in
Ray, which is disallowed in Rust.

Q: If Ray were implemented in Go could one use a distributed garbage
collector instead of reference counting?

A: One could implement a distributed garbage collector that leverages
Go's local garbage collector, but that is particularly challenging in
distributed systems, because of failures and because of potentially
long pause times (because the GC has to contact a remote node).  Here
is a survey article on distributed garbage collectors:
http://portal.acm.org/citation.cfm?doid=292469.292471

Q: What is the security model in Ray?

A: Each Ray deployment is for a single application, isolated from
other deployments using standard approaches: renting private machines,
virtual machines, or containers.  That is, Ray itself is not
multi-tenant.