To manage or not to manage: Addressing the benefit overhead tradeoff in network management
Danny Raz, Technion

The increased complexity of networking infrastructure and protocols, together with the desire to provide high-quality services at the lowest possible cost, drives many organizations to deploy more network and system management tools in their networks. It is often argued that, due to the high complexity of management, a much more cost-effective way to assure performance is simply to acquire more resources. This is particularly true for performance management of Information Technology (IT), where the goal is to coordinate networked resources so that business-level objectives are met at all times, at the lowest possible cost, and with optimal capacity. As Service-Oriented Architecture (SOA) spreads as a popular way of organizing and providing distributed capabilities to solve business problems, cost-effective performance management becomes essential.

Practicing IT administrators know well that committing more resources to management improves the overall quality of service only up to a certain point, after which management costs start dominating the total cost of ownership and management offsets its own advantages. Thus, although it is naturally desirable that the network perform at the highest possible level, this may not be the best solution due to the associated cost. The time is now ripe for the research community to address this fundamental tradeoff in a rigorous way, by showing exactly how much effort should be invested in management to gain the maximal benefit. In order to do that, one must accurately define both the cost associated with the management process and the expected benefit. Of course, considering the overall benefit of general management systems, and all aspects of the associated overhead, may be impossible due to the variety of aspects involved and the diversity of network conditions.
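The "up to a certain point" behavior described above can be sketched with a toy model. Everything here is an illustrative assumption, not part of the article: service-quality benefit is taken to grow with management effort m under diminishing returns, while management cost grows linearly, so the net benefit peaks at a finite effort level.

```python
import math

def benefit(m):
    # Hypothetical diminishing-returns benefit of management effort m.
    return 10 * (1 - math.exp(-m))

def cost(m):
    # Hypothetical management overhead, growing linearly with effort.
    return 2 * m

def net(m):
    # Net benefit: what the tradeoff asks us to maximize.
    return benefit(m) - cost(m)

# Crude scan for the optimal working point on a grid of effort levels.
best_m = max((i / 100 for i in range(0, 501)), key=net)
```

For this particular (assumed) shape, calculus gives the optimum where the marginal benefit equals the marginal cost, i.e., 10·exp(-m) = 2, or m = ln 5 ≈ 1.61; the scan above recovers the same point. Beyond it, every additional unit of management effort costs more than it returns.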
However, when this question is applied to specific tasks within the network management domain, one can rigorously define the tradeoff and then provide a general tool for finding optimal working points for such systems. Consider, for example, a service provided by a set of servers over the network. The goal of the service provider is to deliver the best service (say, minimize the service time) given the amount of available resources (e.g., the number of servers). The provider can add a load sharing system (for example, as suggested in RFC 2391) and improve the service time. However, the same resources (budget) could instead be used to add servers to the system and thus provide better service to end customers. The dilemma here is between adding more computational power and adding management capabilities, where the goal is to achieve the best improvement in overall system performance. Note that in order to be effective, the load sharing system needs updated load information from the servers. Handling such load information requests requires small but nonzero resources (e.g., CPU) from each server. Thus, it is not easy to predict the actual improvement to be expected from a specific configuration. Yet, for this concrete example, one can formalize the cost and the expected benefit and define an optimal working point. As this example indicates, it is important to identify just the right amount of resources to allocate to management tasks (such as monitoring) in order to maximize overall system performance. In addition to being an important and interesting research direction, this approach can yield practical tools that help provide cost-effective services to the community.
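The servers-versus-load-balancer dilemma can be made concrete with a small queueing sketch. All numbers, function names, and modeling choices below are illustrative assumptions, not taken from the article: blind traffic splitting is modeled as k independent M/M/1 servers, an ideal load sharing system as a pooled M/M/k queue (Erlang C), and the load-information overhead as a fraction eps of each server's capacity lost to monitoring.

```python
from math import factorial

def t_random_split(lam, mu, k):
    """Mean response time with k independent M/M/1 servers,
    total arrival rate lam split evenly (no load sharing)."""
    per_server = lam / k
    assert per_server < mu, "each server must be stable"
    return 1.0 / (mu - per_server)

def erlang_c(k, a):
    """Erlang C formula: probability an arrival waits in an
    M/M/k queue with offered load a = lam / mu."""
    s = sum(a**n / factorial(n) for n in range(k))
    top = a**k / factorial(k) * (k / (k - a))
    return top / (s + top)

def t_load_balanced(lam, mu, k, eps):
    """Mean response time with an ideal load balancer (M/M/k pooling),
    where monitoring consumes a fraction eps of each server's capacity."""
    mu_eff = mu * (1 - eps)          # capacity left after answering load queries
    a = lam / mu_eff
    assert a < k, "system must be stable"
    return erlang_c(k, a) / (k * mu_eff - lam) + 1.0 / mu_eff
```

With toy numbers (lam = 8 jobs/s, mu = 1 job/s per server, k = 10 servers), a load balancer with a 2% monitoring overhead yields a much lower mean response time than spending the same budget on an eleventh unmanaged server; but if the overhead grows to 19% of server capacity, the extra server wins. The optimal working point thus depends on the management overhead, which is exactly the tradeoff the article argues should be formalized.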