DieCast: Testing Distributed Systems with an Accurate Scale Model
Large-scale network services can consist of tens of thousands of machines running thousands of unique software configurations spread across hundreds of physical networks. Testing such services for complex performance problems and configuration errors remains a difficult problem. Existing testing techniques, such as simulation or running smaller instances of a service, have limitations in predicting overall service behavior at such scales. Testing large services should ideally be done at the same scale and configuration as the target deployment, which can be technically and economically infeasible. We present DieCast, an approach to scaling network services in which we multiplex all of the nodes in a given service configuration as virtual machines across a much smaller number of physical machines in a test harness. We show how to accurately scale CPU, network, and disk to provide the illusion that each VM matches a machine in the original service in terms of both available computing resources and communication behavior. We present the architecture and evaluation of a system we built to support such experimentation and discuss its limitations. We show that for a variety of services---including a commercial high-performance cluster-based file system---and resource utilization levels, DieCast matches the behavior of the original service while using a fraction of the physical resources.