Automated Anomaly Detection and Optimization in Cloud Computing

February 8, 2012

There is an evolution happening in how organizations of all shapes and sizes consume business services. More and more, business services are being consumed from either public or private “clouds”. But as Kirill Sheynkman points out in “21 Experts Define Cloud Computing”:

“The ‘cloud’ model initially has focused on making the hardware layer consumable as on-demand compute and storage capacity. This is an important first step, but for companies to harness the power of the cloud, complete application infrastructure needs to be easily configured, deployed, dynamically-scaled and managed in these virtualized hardware environments.”

In other words, there is a big difference between an application that runs on a cloud and an application that is actually “cloud aware” and can take advantage of the dynamic environment in which it exists. The challenge is that to be “cloud aware” an application must be able to scale well not just vertically but also horizontally. This adds a whole new level of complexity because the application now has to take into account cloud computing infrastructure elements such as cache, storage, network, bus, management, and cost.

Automated horizontal scaling of applications in a cloud computing environment is a complex problem. One fundamental roadblock is the inability of current monitoring/management systems to adequately capture the current state, and predict the future state, of an application and the cloud computing infrastructure in which it is running. Both capabilities are fundamental to automating the horizontal scaling of applications in the cloud.

Today’s monitoring/management tools generally rely on experts to configure thresholds manually on single data streams and then trigger an alert or workflow if a threshold is exceeded. This approach is woefully inadequate for managing large, dynamic systems such as cloud computing applications and infrastructures.
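
To make the limitation of per-stream thresholds concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of any particular monitoring product: the metric names, threshold values, and the crude joint score (a sum of squared z-scores against recent history) are all hypothetical.

```python
from statistics import mean, stdev

# Hypothetical static thresholds, as an operator might configure them by hand.
THRESHOLDS = {"cpu_pct": 90.0, "io_latency_ms": 50.0}


def threshold_alerts(sample):
    """Return the metrics whose value exceeds its own static threshold."""
    return [name for name, limit in THRESHOLDS.items() if sample[name] > limit]


def joint_deviation(history, sample):
    """Sum of squared z-scores across all metrics.

    A deliberately crude multivariate score: it ignores correlations between
    metrics and assumes every metric varies in the history (non-zero spread).
    """
    score = 0.0
    for name in THRESHOLDS:
        values = [h[name] for h in history]
        mu, sigma = mean(values), stdev(values)
        score += ((sample[name] - mu) / sigma) ** 2
    return score


# Illustrative (made-up) recent observations and a new sample.
history = [
    {"cpu_pct": 40, "io_latency_ms": 10},
    {"cpu_pct": 42, "io_latency_ms": 11},
    {"cpu_pct": 38, "io_latency_ms": 9},
    {"cpu_pct": 41, "io_latency_ms": 10},
    {"cpu_pct": 39, "io_latency_ms": 10},
]
sample = {"cpu_pct": 70, "io_latency_ms": 30}

print(threshold_alerts(sample))          # no single threshold is crossed
print(joint_deviation(history, sample))  # yet the sample is far from baseline
```

In this made-up data, neither metric crosses its threshold, so a threshold-only tool stays silent even though both metrics have moved far from their recent baseline at the same time.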
This has been widely recognized and has been the source of numerous academic efforts, including IBM Research’s autonomic computing initiative. Attempts to overcome this roadblock, create an adequate picture of the current and likely future state of a cloud, and thereby allow intelligent self-management have focused on advanced anomaly detection and machine learning techniques. Both are important fields of research and have been applied in diverse areas. A lot of theoretical work has been done, and many components are available, such as high performance message buses, real time data correlation engines, and efficient machine learning algorithms, which can be leveraged to create functional solutions. Commercially, these components have already been applied in advanced financial and medical analysis solutions.

One commercial product that stands out in the market as a potential solution for automating the horizontal scaling of cloud applications is VMware’s vSphere Distributed Resource Scheduler (DRS). VMware claims that DRS can, on a continuous basis, “intelligently allocate available (cloud) resources among virtual machines according to business needs.” It does this through pre-defined resource allocation rules, predictive usage algorithms, continuous monitoring of physical and virtual machine resource usage, and the triggering of automated administrative workflows.

Although DRS is arguably the most sophisticated cloud optimization solution on the market today, its scope is very limited: 1) DRS is a VMware-only tool; 2) its purpose is to optimize the placement of VMs on a cluster, not to optimize application performance; 3) it is not Service Level Objective (SLO) aware; 4) its workload optimization logic considers only CPU and RAM utilization metrics and not I/O or network latency, which are critical to application-level performance.
Each of these limitations rules out DRS as a tool for creating truly “cloud aware” applications capable of intelligently scaling horizontally and, thereby, of taking advantage of one of the key benefits of cloud computing.

To begin to apply anomaly detection and machine learning techniques and systems in the context of creating “cloud aware” applications, we suggest the following basic design principles/requirements:

The solution:
Must automate the optimal horizontal scaling of stateless and stateful cloud applications in reference to pre-defined SLOs and other standard affinity, availability, and costing rules.
Must be able to detect, log, and alert on anomalous patterns across multiple, real time, multivariate data streams.
Must be (as much as possible) application and cloud agnostic.
Must be able to respond quickly to workload spikes and scale back more slowly.
Should use a modular control framework based on Model Predictive Control (MPC) principles, which uses a model of the system and its current state to compute the (near) optimal sequence of actions that maintains the desired constraints, using short timelines and an iterative methodology.
Reference signals to the control framework should be based on multi-dimensional workload calculations, including CPU utilization, RAM utilization, I/O latency, and network latency. It should be possible to add additional factors.
The control framework should use Machine Learning (ML) algorithms to automate system modeling in order to a) predict the probability that an action will achieve the desired SLOs and b) identify complex patterns.
The control framework should have a loosely coupled, event-driven architecture in which all main modules are publishers or subscribers to a high performance message bus/event broker.
The control framework should use a high performance Complex Event Processing (CEP) engine to aggregate and normalize multiple, real-time data streams
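
Two of the requirements above, a multi-dimensional workload signal and a controller that reacts quickly to spikes but scales back more slowly, can be illustrated with a small Python sketch. This is a simple rule-based stand-in under stated assumptions, not an MPC or ML implementation: the class names, the "worst dimension wins" score, the thresholds, and the cooldown length are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """One observation, each dimension normalized so 1.0 means 'at its limit'."""
    cpu: float
    ram: float
    io_latency: float
    net_latency: float

    def score(self) -> float:
        # Assumption: the most stressed dimension drives the scaling decision.
        return max(self.cpu, self.ram, self.io_latency, self.net_latency)


class AsymmetricScaler:
    """Scale out immediately on a spike; scale in only after sustained calm."""

    def __init__(self, scale_up_at=0.8, scale_down_at=0.4,
                 cooldown_ticks=3, min_replicas=1, max_replicas=10):
        self.scale_up_at = scale_up_at
        self.scale_down_at = scale_down_at
        self.cooldown_ticks = cooldown_ticks
        self.min_replicas = min_replicas
        self.max_replicas = max_replicas
        self.replicas = min_replicas
        self._calm_ticks = 0

    def step(self, workload: Workload) -> int:
        """Consume one observation and return the desired replica count."""
        s = workload.score()
        if s > self.scale_up_at:
            # Spike: add a replica right away and reset the calm counter.
            self._calm_ticks = 0
            self.replicas = min(self.max_replicas, self.replicas + 1)
        elif s < self.scale_down_at:
            # Calm: remove a replica only after several consecutive calm ticks.
            self._calm_ticks += 1
            if self._calm_ticks >= self.cooldown_ticks:
                self._calm_ticks = 0
                self.replicas = max(self.min_replicas, self.replicas - 1)
        else:
            # In between: hold steady and restart the calm count.
            self._calm_ticks = 0
        return self.replicas


scaler = AsymmetricScaler()
hot = Workload(cpu=0.95, ram=0.5, io_latency=0.3, net_latency=0.2)
calm = Workload(cpu=0.1, ram=0.1, io_latency=0.1, net_latency=0.1)
trace = [scaler.step(w) for w in (hot, hot, calm, calm, calm)]
print(trace)
```

Because `score()` takes the maximum over all four dimensions, a latency problem alone can trigger a scale-out even when CPU and RAM look healthy, and further factors can be added as extra fields.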