Wednesday, September 26, 2012


CLUSTER SIZING - No One (Cluster) Size Fits All: Automatic Cluster Sizing for Data-intensive Analytics

Motivation:
Cloud providers offer Infrastructure-as-a-Service (IaaS) platforms: pay-as-you-go computing resources on which users run their jobs. Users must provision these resources themselves, and since jobs vary, the resources they need vary as well. To make resource allocation easier for the user, the authors developed the Elastisizer, a system that answers declarative-style queries about cluster sizing for a specific job.

Four major use cases are considered in the evaluation of the Elastisizer; a rough sketch of how the first one can be phrased as a what-if question follows the list.
  • Tuning for elastic workloads: if a job takes 3 hours on 10 m1.large nodes, what happens if we add 5 more nodes?
  • Planning the move from a development cluster to production: when a job is run on the development cluster, the Elastisizer estimates how various configurations would perform on the production cluster.
  • Cluster provisioning under multiple objectives: what happens when more than one objective must be met, for example minimizing both completion time and monetary cost?
  • Shifting workloads in time to lower execution costs: some instance types are priced by time of use; can running the job at a cheaper time on such instances lower the execution cost?
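As a rough illustration of the first use case, the sketch below phrases it as a what-if question in code. Every class here, and the naive 1/nodes scaling rule, are made up purely for illustration; the real What-if Engine answers such questions with detailed models, not linear scaling.

    // Hypothetical sketch of use case 1 posed as a "what-if" question.
    // None of these classes come from the Elastisizer; the linear speedup
    // below is only a placeholder for the What-if Engine's actual models.
    public class WhatIfSketch {

        // Illustrative cluster description: instance type and node count.
        record ClusterSpec(String instanceType, int nodes) {}

        // Placeholder estimate: assumes running time scales with 1/nodes,
        // which the real What-if Engine does NOT assume.
        static double estimateHours(double measuredHours,
                                    ClusterSpec measuredOn,
                                    ClusterSpec target) {
            return measuredHours * measuredOn.nodes() / (double) target.nodes();
        }

        public static void main(String[] args) {
            ClusterSpec current  = new ClusterSpec("m1.large", 10);
            ClusterSpec proposed = new ClusterSpec("m1.large", 15); // add 5 nodes
            double est = estimateHours(3.0, current, proposed);
            System.out.printf("Naive estimate on 15 nodes: %.1f hours%n", est);
        }
    }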

Elastisizer:
The Elastisizer is built from a What-if Engine and two Enumeration and Optimization Engines (EOEs). Most of the analysis and modeling is done by the What-if Engine; the EOEs come into the picture when query optimization or multi-objective cluster provisioning is required.
The Elastisizer creates job profiles for jobs running on a cluster and, from the data collected, builds virtual profiles for similar jobs with different input data and cluster resources. A profile describes a MapReduce job's execution in terms of cost fields, cost statistics, dataflow fields, and dataflow statistics. The profiler collects this run-time monitoring data through dynamic instrumentation: since Hadoop is implemented in Java, the profiler uses BTrace, a dynamic instrumentation tool for Java, to collect raw monitoring data from the Java classes internal to Hadoop.
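To give a feel for this, here is a minimal BTrace-style timing probe of the kind such a profiler might install. The target Hadoop class and method, the handler name, and the use of the older com.sun.btrace package are illustrative assumptions; the Elastisizer's actual profiler records far more detailed statistics than a single duration.

    // Minimal BTrace sketch: print how long each map task's run() call takes.
    // The probe point is illustrative, not taken from the paper.
    import com.sun.btrace.annotations.*;
    import static com.sun.btrace.BTraceUtils.*;

    @BTrace
    public class MapTaskTimer {

        // Fires when a map task's run() method returns; @Duration supplies
        // the elapsed time of the call in nanoseconds.
        @OnMethod(
            clazz = "org.apache.hadoop.mapred.MapTask",
            method = "run",
            location = @Location(Kind.RETURN))
        public static void onMapTaskDone(@Duration long durationNanos) {
            println(strcat("map task time (ms): ",
                           str(durationNanos / 1_000_000)));
        }
    }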

Estimating Virtual Job Profiles:
The authors use both black-box and white-box models to estimate the virtual job profile. Black-box models are used where the internals of the process are not known, for example how the job's behavior changes with new input data and new cluster resources; these effects are estimated using black-box models built from training samples.
White-box models are used to estimate the dataflow statistics, dataflow fields, and cost fields with the help of the What-if Engine. Unlike a database system, the Hadoop MapReduce framework has no query semantics or data representation, so statistical information about the input data cannot be derived directly. The What-if Engine therefore assumes that dataflow is proportional to the input data size and scales the dataflow statistics in the virtual profile accordingly. When additional information about the input data is available, for example from higher-level systems such as Hive or Pig, the input statistics can be overridden and more precise dataflow statistics computed. The What-if Engine then uses these dataflow statistics to calculate the dataflow and cost fields based on its models.
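To make the proportionality assumption concrete, the sketch below scales two made-up dataflow statistics by the ratio of new to old input size. The field names and this two-field "profile" are stand-ins for the much richer set of statistics in a real job profile and do not come from the paper.

    // Sketch of the proportionality assumption: with no extra knowledge of
    // the input data, dataflow statistics from the measured profile are
    // scaled by the ratio of new to old input size.
    public class DataflowScaling {

        // Illustrative subset of dataflow statistics from a measured profile.
        record DataflowStats(long mapOutputBytes, long reduceInputRecords) {}

        static DataflowStats scaleByInputSize(DataflowStats measured,
                                              long oldInputBytes,
                                              long newInputBytes) {
            double ratio = (double) newInputBytes / oldInputBytes;
            return new DataflowStats(
                Math.round(measured.mapOutputBytes() * ratio),
                Math.round(measured.reduceInputRecords() * ratio));
        }

        public static void main(String[] args) {
            DataflowStats measured = new DataflowStats(40L << 30, 200_000_000L);
            // Virtual profile for a 3x larger input, all else held equal.
            DataflowStats scaled = scaleByInputSize(measured, 10L << 30, 30L << 30);
            System.out.println("Scaled map output bytes: " + scaled.mapOutputBytes());
        }
    }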

Comments:
  • I feel that the Elastisizer is best suited to the second use case, i.e., moving jobs from a development cluster to production, because the jobs themselves stay largely the same.
  • Dynamic instrumentation introduces some slowdown, since collecting the monitoring data adds overhead to the profiled job.
  • The training samples used for black-box modeling are scaled-down or shorter-running versions of the original workload. Network bandwidth or other system performance issues may only show up when the full, data-heavy versions are run.
  • Other jobs might be running on the cluster while the training samples are collected, which could skew the measurements.
  • The paper does not say much about lowering costs by shifting workloads in time.