Resource Overbooking and Application Profiling in Shared Hosting Platforms

Summary: This paper describes a system that uses overbooking to improve resource utilization while providing statistical resource reservation guarantees to applications in a shared hosting cluster environment. This approach allows untrusted applications to run on the same platform. To provide statistical guarantees, the resource requirements of applications are derived offline using synthetic workloads.

Contributions: The main contributions of this paper are the following: 1) the use of offline profiling to derive application-level resource requirements in terms of a token bucket model, 2) the use of statistical overbooking to improve resource utilization, and 3) an algorithm for assigning the components of an application to different machines in a cluster environment.

Analysis of the paper: I believe that all of the contributions are novel, but it is the combination of these contributions that makes the paper interesting. The paper clearly explains how a token bucket model for an application can be derived using profiling. This model, together with an overbooking tolerance, can be used to perform admission control so that a new application is admitted only if the overbooking tolerance of every currently running application is still respected. Finally, for applications with multiple components, the paper describes how the admission control test can be used to implement an algorithm for placing components on multiple machines.
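To make the admission test concrete, here is a minimal sketch of how I read it; this is my own reconstruction, not the authors' implementation. Each capsule's offline profile yields a peak requirement (the paper derives a token-bucket characterization, which I approximate here by the profiled maximum), and a new capsule is admitted on a node only if, after discounting every capsule's peak by its overbooking tolerance O, the total still fits within the node's capacity. The names Capsule and admits, and the example distributions, are assumptions of mine.

```python
from dataclasses import dataclass
from typing import List

import numpy as np

@dataclass
class Capsule:
    usage: np.ndarray     # offline-profiled usage samples, as fractions of node capacity
    overbook_tol: float   # O: allowed probability of violating the reservation

    @property
    def peak(self) -> float:
        """Maximum requirement observed in the offline profile."""
        return float(self.usage.max())

def admits(node_capacity: float, resident: List[Capsule], new: Capsule) -> bool:
    """Admit `new` only if the discounted peaks of all capsules still fit on the node."""
    capsules = resident + [new]
    demand = sum((1.0 - c.overbook_tol) * c.peak for c in capsules)
    return demand <= node_capacity

# Illustrative use: a bursty capsule already resides on a unit-capacity node,
# and a second, steadier capsule asks to be admitted.
rng = np.random.default_rng(0)
bursty = Capsule(usage=rng.exponential(0.05, 10_000).clip(max=0.6), overbook_tol=0.01)
steady = Capsule(usage=np.full(10_000, 0.35), overbook_tol=0.01)
print(admits(1.0, [bursty], steady))
```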
Issues with the paper (in no particular order):

1. Fig 4b seems to indicate that provisioning resources based on 99th-percentile requirements requires 20 times more resources than provisioning based on median (or average) requirements. The throughput benefit, on the other hand, appears to be only a factor of two (first column of Table 2). The authors need to describe this cost associated with their provisioning approach.

2. Profiling of resource requirements is done one application at a time. Typically, when applications are multiplexed, they require more resources because of effects such as context switching and cache pollution. It is not clear how the authors take this problem into account.

3. Profiling of resource requirements uses a particular machine. The authors need to address what happens when the CPU speed (or memory, or disk speed) of the hosting machines changes over time. In particular, what are the costs of profiling, and how long does it take to develop a sound profile?

4. The simulation results in Figure 7 are not convincing. The linear speedup does not take into account real issues such as memory, disk, and network bottlenecks. The authors need to run web servers in their experiments rather than simply simulate Equations 1 and 3.

5. The experiments in the paper are performed with workloads that were also used during profiling. The authors need to show how their system behaves when the tail of the workload differs from the profiled requirements.

6. (this one is borrowed from a student who thought about it!) The use of O, the overbooking tolerance, seems incorrect. It is defined as the probability that an application's requirement can be violated, and it therefore depends on the application's usage distribution U. However, it is used in the paper without considering this distribution. For example, Equation 1 incorrectly assumes that assigning (1-O) of a capsule's maximum resource needs will satisfy the overbooking tolerance. This is not so if an application is not bursty at all. In that case, assigning 99% (O = 0.01) of the maximum requirements will not meet any of the application's requirements! (A sketch at the end of this review illustrates this point.)

7. The authors do not show the benefits of their approach: do the applications meet their overbooking tolerance? This needs to be shown, especially given point 6 above, and also because, if the overbooking tolerance is not managed well, it is unclear why the system should over-provision so much (see point 1 above).

8. (this one is borrowed from a student who thought about it!) It is unclear whether the greedy capsule placement algorithm will always find a placement when one exists. If two nodes A and B have capacities 3 and 5, and three capsules C1, C2, and C3 need 2, 3, and 3 units respectively, then the capsules can be assigned to the nodes (C2 on A, C1 and C3 on B). However, the greedy algorithm does not seem to find a placement: it could place C1 on A (under random placement), after which there is no feasible assignment for the remaining capsules. (A sketch at the end of this review demonstrates this.)
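Regarding point 6: the sketch below, using made-up usage distributions of my own, shows why allocating (1-O) of a capsule's maximum requirement does not by itself yield a violation probability of O. A long-tailed (bursty) capsule sees far fewer violations than O allows, while a capsule with constant usage exceeds the allocation all the time.

```python
import numpy as np

def violation_rate(usage: np.ndarray, overbook_tol: float) -> float:
    """Fraction of samples in which usage exceeds an allocation of (1 - O) * max(usage)."""
    allocation = (1.0 - overbook_tol) * usage.max()
    return float((usage > allocation).mean())

rng = np.random.default_rng(0)
O = 0.01

bursty = rng.exponential(scale=0.05, size=100_000)  # long-tailed usage
steady = np.full(100_000, 0.5)                      # not bursty at all: constant usage

print(f"bursty capsule: violation rate {violation_rate(bursty, O):.5f}")  # well below O
print(f"steady capsule: violation rate {violation_rate(steady, O):.5f}")  # 1.0, not O
```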
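Regarding point 8: the sketch below runs the counterexample. The greedy routine is my own first-fit construction, chosen to mimic the unlucky choice described above (placing C1 on A first); it is not necessarily the authors' exact algorithm. An exhaustive search confirms that a feasible placement exists.

```python
from itertools import product
from typing import Dict, List, Optional

def greedy_place(capsules: List[int], nodes: Dict[str, int]) -> Optional[Dict[int, str]]:
    """Place each capsule on the first node with enough residual capacity."""
    residual = dict(nodes)
    placement: Dict[int, str] = {}
    for i, need in enumerate(capsules):
        for name, free in residual.items():
            if free >= need:
                residual[name] -= need
                placement[i] = name
                break
        else:
            return None  # the greedy choice left no node with enough room
    return placement

def exhaustive_place(capsules: List[int], nodes: Dict[str, int]) -> Optional[Dict[int, str]]:
    """Try every assignment of capsules to nodes and return a feasible one, if any."""
    for assignment in product(nodes, repeat=len(capsules)):
        load = {n: 0 for n in nodes}
        for need, n in zip(capsules, assignment):
            load[n] += need
        if all(load[n] <= nodes[n] for n in nodes):
            return dict(enumerate(assignment))
    return None

nodes = {"A": 3, "B": 5}
capsules = [2, 3, 3]  # C1, C2, C3
print("greedy:    ", greedy_place(capsules, nodes))      # None: C1 goes to A, then C3 cannot fit
print("exhaustive:", exhaustive_place(capsules, nodes))  # C1 on B, C2 on A, C3 on B
```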