Resource Overbooking and Application Profiling in Shared Hosting Platforms

Summary: This paper describes a system that uses overbooking to improve resource utilization while providing statistical resource reservation guarantees to applications in a shared hosting cluster environment. This approach allows untrusted applications to run on the same platform. To provide statistical guarantees, the resource requirements of applications are derived offline using synthetic workloads.

Contributions: The main contributions of this paper are the following: 1) the use of offline profiling to derive application-level resource requirements in terms of a token bucket model, 2) the use of statistical overbooking to improve resource utilization, and 3) an algorithm for assigning the components of an application to different machines in a cluster environment.

Analysis of the paper: I believe that all of the contributions are novel, but it is the combination of these contributions that makes the paper interesting. The paper clearly explains how a token bucket model for an application can be derived using profiling. This model, together with an overbooking tolerance, can be used to perform admission control so that a new application is admitted only if the overbooking tolerance of every currently running application is still respected. Finally, for applications with multiple components, the paper describes how the admission control test can be used to implement an algorithm for placing components on multiple machines.
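To make the admission test concrete, here is a minimal sketch of how I read it; this is my own reconstruction, not the authors' implementation. Each capsule's offline profile yields a peak requirement (the paper derives a token-bucket characterization, which I approximate here by the profiled maximum), and a new capsule is admitted on a node only if, after discounting every capsule's peak by its overbooking tolerance O, the total still fits within the node's capacity. The names Capsule and admits, and the example distributions, are assumptions of mine.

```python
from dataclasses import dataclass
from typing import List

import numpy as np

@dataclass
class Capsule:
    usage: np.ndarray     # offline-profiled usage samples, as fractions of node capacity
    overbook_tol: float   # O: allowed probability of violating the reservation

    @property
    def peak(self) -> float:
        """Maximum requirement observed in the offline profile."""
        return float(self.usage.max())

def admits(node_capacity: float, resident: List[Capsule], new: Capsule) -> bool:
    """Admit `new` only if the discounted peaks of all capsules still fit on the node."""
    capsules = resident + [new]
    demand = sum((1.0 - c.overbook_tol) * c.peak for c in capsules)
    return demand <= node_capacity

# Illustrative use: a bursty capsule already resides on a unit-capacity node,
# and a second, steadier capsule asks to be admitted.
rng = np.random.default_rng(0)
bursty = Capsule(usage=rng.exponential(0.05, 10_000).clip(max=0.6), overbook_tol=0.01)
steady = Capsule(usage=np.full(10_000, 0.35), overbook_tol=0.01)
print(admits(1.0, [bursty], steady))
```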
Issues with the paper (in no particular order):

1. Fig 4b seems to indicate that provisioning resources based on 99th-percentile requirements requires 20 times more resources than provisioning based on median (or average) requirements. The throughput benefit, on the other hand, appears to be only a factor of two (first column of Table 2). The authors need to describe this cost associated with their provisioning approach.

2. Profiling of resource requirements is done one application at a time. Typically, when applications are multiplexed, they require more resources because of effects such as context switching and cache pollution. It is not clear how the authors take this problem into account.

3. Profiling of resource requirements uses a particular machine. The authors need to address what happens when the CPU speed (or memory, or disk speed) of the hosting machines changes over time. In particular, what are the costs of profiling, and how long does it take to develop a sound profile?

4. The simulation results in Figure 7 are not convincing. The linear speedup does not take into account real issues such as memory, disk, and network bottlenecks. The authors need to run web servers in their experiments rather than simply simulate Equations 1 and 3.

5. The experiments in the paper are performed with workloads that were also used during profiling. The authors need to show how their system behaves when the tail of the workload differs from the profiled requirements.

6. (this one is borrowed from a student who thought about it!) The use of O, the overbooking tolerance, seems incorrect. It is defined as the probability that an application's requirement can be violated, and it therefore depends on the application's usage distribution U. However, it is used in the paper without considering this distribution. For example, Equation 1 incorrectly assumes that assigning (1-O) of a capsule's maximum resource needs will satisfy the overbooking tolerance. This is not so if an application is not bursty at all. In that case, assigning 99% (O = 0.01) of the maximum requirements will not meet any of the application's requirements! (A sketch at the end of this review illustrates this point.)

7. The authors do not show the benefits of their approach: do the applications meet their overbooking tolerance? This needs to be shown, especially given point 6 above, and also because, if the overbooking tolerance is not managed well, it is unclear why the system should over-provision so much (see point 1 above).

8. (this one is borrowed from a student who thought about it!) It is unclear whether the greedy capsule placement algorithm will always find a placement when one exists. If two nodes A and B have capacities 3 and 5, and three capsules C1, C2, and C3 need 2, 3, and 3 units respectively, then the capsules can be assigned to the nodes (C2 on A, C1 and C3 on B). However, the greedy algorithm does not seem to find a placement: it could place C1 on A (under random placement), after which there is no feasible assignment for the remaining capsules. (A sketch at the end of this review demonstrates this.)
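Regarding point 6: the sketch below, using made-up usage distributions of my own, shows why allocating (1-O) of a capsule's maximum requirement does not by itself yield a violation probability of O. A long-tailed (bursty) capsule sees far fewer violations than O allows, while a capsule with constant usage exceeds the allocation all the time.

```python
import numpy as np

def violation_rate(usage: np.ndarray, overbook_tol: float) -> float:
    """Fraction of samples in which usage exceeds an allocation of (1 - O) * max(usage)."""
    allocation = (1.0 - overbook_tol) * usage.max()
    return float((usage > allocation).mean())

rng = np.random.default_rng(0)
O = 0.01

bursty = rng.exponential(scale=0.05, size=100_000)  # long-tailed usage
steady = np.full(100_000, 0.5)                      # not bursty at all: constant usage

print(f"bursty capsule: violation rate {violation_rate(bursty, O):.5f}")  # well below O
print(f"steady capsule: violation rate {violation_rate(steady, O):.5f}")  # 1.0, not O
```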
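Regarding point 8: the sketch below runs the counterexample. The greedy routine is my own first-fit construction, chosen to mimic the unlucky choice described above (placing C1 on A first); it is not necessarily the authors' exact algorithm. An exhaustive search confirms that a feasible placement exists.

```python
from itertools import product
from typing import Dict, List, Optional

def greedy_place(capsules: List[int], nodes: Dict[str, int]) -> Optional[Dict[int, str]]:
    """Place each capsule on the first node with enough residual capacity."""
    residual = dict(nodes)
    placement: Dict[int, str] = {}
    for i, need in enumerate(capsules):
        for name, free in residual.items():
            if free >= need:
                residual[name] -= need
                placement[i] = name
                break
        else:
            return None  # the greedy choice left no node with enough room
    return placement

def exhaustive_place(capsules: List[int], nodes: Dict[str, int]) -> Optional[Dict[int, str]]:
    """Try every assignment of capsules to nodes and return a feasible one, if any."""
    for assignment in product(nodes, repeat=len(capsules)):
        load = {n: 0 for n in nodes}
        for need, n in zip(capsules, assignment):
            load[n] += need
        if all(load[n] <= nodes[n] for n in nodes):
            return dict(enumerate(assignment))
    return None

nodes = {"A": 3, "B": 5}
capsules = [2, 3, 3]  # C1, C2, C3
print("greedy:    ", greedy_place(capsules, nodes))      # None: C1 goes to A, then C3 cannot fit
print("exhaustive:", exhaustive_place(capsules, nodes))  # C1 on B, C2 on A, C3 on B
```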