Network Troubleshooting: A Collaborative Effort

The Internet has not yet reached the "five nines" availability achieved by the phone network. Service disruptions can happen for a variety of reasons, ranging from equipment failures to malicious attacks. Ideally the network would hide these events from end users, but the Internet is still far from that ideal. Worse, when a service disruption happens, end users and operators have very little support for troubleshooting its root cause. Even large ISPs do not have automated tools for troubleshooting. Network fault diagnosis is done in an ad hoc fashion (using traceroute or mining network monitoring data). This manual process is slow and ineffective.

The very features that have made the Internet a success (such as decentralized control and the IP protocol stack) also represent the greatest challenges to troubleshooting. The origin of a service disruption may be at any layer of the protocol stack, in any of the networks on the path from the source to the destination, or in other components such as DHCP or DNS. Even if we consider only service disruptions caused by routing changes, no single entity is fully equipped to diagnose them.

End users, or a troubleshooting system running on the end-user machine, are the only ones that have a detailed description of the problem: the context in which it manifests itself and the destinations it affects. Unfortunately, there is very little end users can do with this information to find the actual origin of the problem and solve it. The state of the art is, in increasing order of desperation: run traceroute (note that traceroute only gives reachability information), reboot your equipment, try again later in the hope that the problem will "fix itself", and call your provider if it does not.

Providers can have a lot of information about their own networks, and one can imagine an automatic system that would take as input IGP and BGP messages, router logs and configurations, SNMP statistics, etc. to pinpoint the origin of a path change or an unreachability issue. If the problem appears in any of these datasets, then providers should eventually be able to detect and hopefully fix it. However, there are scenarios in which a customer is experiencing a problem that is not observable from the provider's network. The problem might not be in the core network; it might be in the customer's LAN or access connection, or at the destination end. If it is in the core network, it might be in some other AS in the Internet or between ASes. If the customer is multihomed, then diagnosis is even more complicated, because we need to know which of the providers is used to reach the destination.

These factors indicate that troubleshooting should be a collaborative effort between the edge and the core. Inside the core, multiple ASes also need to cooperate. In order to build such a large-scale collaborative system, we need to address a number of technical and business questions. How can we build an automatic troubleshooting system that is accurate and precise? How should the task be split between edge and core? What are the incentives for ASes to cooperate? How much information would ISPs be willing to exchange?
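
To make the edge/core split discussed above more concrete, the following is a minimal sketch of how an edge report might be combined with per-AS observations. It is not a design taken from the text: the names EdgeReport and localize, the idea that each cooperating AS answers with a simple fault/no-fault flag, and the decision rules are all assumptions made for illustration. The AS numbers and the address come from ranges reserved for documentation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class EdgeReport:
    """Hypothetical report produced at the end-user machine."""
    destination: str      # destination the user cannot reach
    provider: str         # which provider carries this traffic (matters when multihomed)
    as_path: List[int]    # ASes on the path toward the destination, in order
    reachable: bool       # whether the edge could reach the destination


def localize(report: EdgeReport, core_view: Dict[int, bool]) -> str:
    """Combine an edge report with what cooperating ASes observed.

    core_view maps an AS number to True if that AS sees a fault toward
    the destination and False if it sees none. An AS missing from the
    map is assumed not to cooperate (an assumption of this sketch).
    """
    if report.reachable:
        return "no fault"

    # First AS on the path that reports a fault is the best candidate.
    faulty = [asn for asn in report.as_path if core_view.get(asn) is True]
    if faulty:
        return f"core: AS{faulty[0]}"

    # Without data from every AS on the path, the core cannot be ruled out.
    unknown = [asn for asn in report.as_path if asn not in core_view]
    if unknown:
        return "undetermined: no data from " + ", ".join(f"AS{a}" for a in unknown)

    # Every AS on the path reports no fault: suspect the edges.
    return "edge: customer LAN, access link, or destination end"


report = EdgeReport("192.0.2.10", "ISP-A", [64496, 64497, 64498], reachable=False)
print(localize(report, {64496: False, 64497: True, 64498: False}))
print(localize(report, {64496: False}))
```

Even this toy version shows why the open questions matter: the result degrades to "undetermined" as soon as one AS on the path withholds its view, so the usefulness of such a system depends directly on the incentives for ASes to cooperate and on how much information they are willing to share.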