Minutes of the 22nd NMRG meeting
IETF 68, Prague, Czech Republic
23 March 2007

Minutes: Vladislav Marinov, Juergen Schoenwaelder

Participants:
The meeting was attended by about 40 people. The names were recorded on the roster, which went to the IETF secretariat.

Agenda:
09:00  Agenda bashing and administrivia - Juergen Schoenwaelder (Chair)
09:10  Report from the NMRG workshop in Utrecht - Aiko Pras
09:30  Last call review draft-irtf-nmrg-snmp-measure-01.txt - Juergen Schoenwaelder
10:00  SNMP Traffic Analysis Update - Juergen Schoenwaelder
10:30  Distributed monitoring algorithms - Alberto Gonzalez Prieto
11:00  Discussion - Everybody
11:20  Wrap-up - Juergen Schoenwaelder (Chair)

Abbreviations:
JS: Juergen Schoenwaelder
AP: Aiko Pras
BW: Bert Wijnen
DP: David Perkins
RB: Randy Bush
LD: Luca Deri
DH: David Harrington
SL: Simon Leinen
DR: Dan Romascanu
AG: Alberto Gonzalez Prieto
AB: Andy Bierman
PD: Petre Dini

1. Introduction to NMRG (JS)
The agenda was updated because the NETCONF RBAC presenters from France could not make it to the Prague meeting. The relevant I-D is draft-cridlig-netconf-rbac-00.txt.

2. EMANICS / NMRG Workshop Report (AP)
AP reviewed the EMANICS/NMRG workshop on future research directions held in October 2006. See his slides for details. Only about 25% of the attendees had been at the IAB plenary the night before the NMRG meeting, so AP decided to go over all the slides (and not just the ones not presented at the plenary).
C: Your bullet that "Network management is risk management" was very interesting. There is always risk in uncertainty; we are never sure what is happening in the network.
Q: About the economics slides: is there work in this area?
A: Yes, one of the EMANICS work packages is economic management, but the work still focuses on SLAs and legal issues. I would like to see answers to this question myself in the future.

3. Last call review draft-irtf-nmrg-snmp-measure-01.txt (JS)
This document was last called and 15 reviews were submitted. Bert Wijnen acts as the shepherd for this document.
JS went through all the reviews, discussing the issues he felt were worth bringing up (i.e., those where the required edits might not be fully understood); see the issues list in the proceedings. Below are the actions agreed on during the meeting:
- Change "inform" to "inform-request" but leave "trap2".
- JS to contact Frank: he wants to see a figure, but it is unclear what the figure should show.
- Change the requirement to keep trace sources from a "must keep" to a "should keep". Explain why it is good to keep the original data sets, but also explain the issues arising from this material being sensitive.
- The anonymization section should provide sufficient pointers for people interested in anonymization, but it is not the goal of the document to provide an extensive discussion of this topic. The document is intended to make the operator communities aware that we want trace data; making it too long will make it harder to use for its intended purpose. It was decided to add a summary of the state of the art in this area.
- The main reason to use names for protocol operations is that they are frequently read by humans for debugging and analysis, while the error-status value is usually of secondary importance (as is the version number). [Ed: The more I think about it, the more it seems useful to use strings rather than numbers.]
- Add a statement on how the RELAX NG compact notation schema can be converted into other XML schema languages.
- Add a section with warnings about generalizing conclusions from biased data sets; conclusions derived from a small, biased data set might not be universally true.
- Change the base type mapping so as not to lose information (i.e., report Counter32 and friends (including Opaque) correctly; do not use the SMIng base type model).
- Clarify that the IPv6 address type is only there to capture source and destination addresses (do we have to add MAC addresses?).
- State that it was a design decision to keep the conversion tools MIB agnostic. Hence, MIB-specific OID formats (e.g., symbolic names) cannot be produced.
  (JS explained that he wrote a separate tool called smixlate to handle this.)
- Document that packets containing malformed SNMP messages, or messages that are neither SNMPv1, SNMPv2c, nor SNMPv3, can be dropped, and that implementations should report the number of such dropped messages.
- Add examples for CSV and XML formatted packets.
- Add rough numbers about trace sizes, e.g., a flow polling N interfaces (inOctets, outOctets) every 5 minutes causes traces of size xxx pcap, yyy csv, zzz xml. It would be nice to extract such a flow to get the numbers.
- Keep the XML format since it retains all information; the CSV format only retains a subset of the information.
- Rewrite section 2.5 so that it stresses that some trace providers may prefer not to give away trace data but are willing to execute analysis scripts, provided the scripts can be checked to ensure that they do no harm. This requires using high-level languages typically understood by operators (e.g., Perl; make clear that Perl is just an example).
- Keep section 3, as it was generally considered useful, especially for explaining how traces may be used and for giving an idea of which questions this research is expected to answer.
- Explain that pcap is a bad format in this context since it requires complex analysis code (IP reassembly, BER decoding) and thus makes it difficult to verify that an analysis script does no harm.

4. Trace Analysis Update (JS)
JS presented an update of the trace analysis work done at Jacobs University and the University of Twente. See the slides for the details of the presentation. Below are some questions that came up during the presentation.
Q: You did observe SNMPv3 messages?
A: Yes, but very sporadically; this might have been an attempt to test whether an agent supports SNMPv3.
Q: Were there responses to the SNMPv3 messages?
A: I do not recall precisely, but I assume so.
Q: There were only very few informs in the traces that use SNMPv2c.
A: Yes, it seems that GetBulk is a stronger incentive to switch to SNMPv2c in the traces we have collected so far.
Q: Can we extrapolate to a larger scale?
A: No, the data set is far too small for generalizations.
Q: Which applications were used?
A: We did not ask for this information, and we do not even know whether we can expect operators to know.
Q: Operators should provide us with information about what happened on the network during the trace collection period. It would be useful to capture which applications were used.
A: We want to identify communication patterns without knowing which applications generated them. Note that we already try to track metadata, but we do not necessarily expect it to be complete.
Q: Is the flow definition consistent with the IPFIX definition?
A: No, because the IPFIX definition uses port numbers, while many management scripts start short-lived SNMP retrieval processes, so the manager processes are assigned dynamic port numbers.
C: There was also one long trace, huge in size. It turned out to be someone walking into a TimeFilter table and never getting out, staying there for days.
Q: At some point it was mentioned that there are some dumb applications that repeatedly retrieve the same read-only data.
A: Applications should probably look at the semantics of the data, and then caching is not that hard; of course, a generic MIB browser or data poller will not do this.
C: Applications do not support caching anyway.
C: Caching is probably not worth the optimization effort: routers sometimes get rebooted and reconfigured, so the configuration might change suddenly, and the risk of relying on cached data outweighs the benefit of optimizing how applications retrieve read-only data.

5. Distributed Monitoring and Aggregation (KTH)
AG presented the work done at KTH on distributed monitoring and aggregation, based on a live demo. The slides are recorded in the proceedings.
Questions:
Q: SL, as an operator, is this useful for your environment?
A: Our network is not very big, so there is no need to optimize queries; one could instead optimize the number of samples in order to improve the mean error objective.
Q: I would be interested in short-term samples. Your algorithm is adaptive: when the network is running stably, it adapts so that the network is queried only every second hour. What about short-term changes? What happens if a failure occurs a few minutes later?
A: We are not polling. The protocol follows the changes: when a value differs significantly from the previous value, a message is sent immediately.
Q: What about polling overhead? Routers should do useful work, and polling introduces overhead.
A: Aggregation is not performed in the router. None of the routers has to process more than one message per second.
Q: You are using routers as observation points. Is there any research on which observation points are best suited for different applications?
A: I am not aware of any.

6. Open Discussion
Q: [AB] Discussion during the IETF ops-area meetings concentrated on NETCONF, XSD, data model contexts, concerns about document control, and life cycle management. Is that something the IRTF can help with, or should the IETF do it? Is XSD good enough so that we can move on?
C: [DP] Work needs to be done. It is a serious issue when people need to do modeling work and have no language for it. Work on finding a good language needs to be done quickly; otherwise, every company will have its own way of doing things in different languages.
Q: [AP] Do you want a standard or research?
A: [DP] IETF => standard.
Q: XML Schema is an optimal language. What is the problem?
A: [DP] Why it is not sufficient will be explained in the standardization work.
C: We cannot wait for research activities. I think this is an IETF activity.
C: [DR] The focus should be on what the NMRG should do: which specific research questions we should focus on, and which data modeling techniques and languages are available. I suggest a research discussion on NETCONF data modeling. The NMRG should prepare a state-of-the-art overview of what is happening in research.
C: [JS] A lesson from SMIng is that protocol independence is difficult to achieve; perhaps people can write down what the pitfalls are. Agreeing on syntax is engineering, not research.
C: Get more data from companies and research how much metadata is used and how much of it, in particular descriptive text, is actually read by humans.
C: Perhaps we can produce a lessons-learned document from the SMIng experiment before Chicago.
C: [PD] I think a good research topic would be manageability, in particular self-manageability and the other "self-*" properties. I have not seen guidelines on how to build autonomic systems and components and how to manage them. This would be a good topic for research. I suggest forming a sub-group and coming up with a draft document. Some work in the IETF will be implemented as system entities which sooner or later support self-manageability, and some IETF documents have management-related aspects. I would be interested in a small subsection on entities that have "self" properties inside.
C: [JS] Do you mean the interaction between control loops?
C: Self-managing networks are interesting because there you have to say how you can predict unexpected behavior.
C: We still use aggregation, e.g., traffic aggregation: which properties were aggregated or combined? When two entities interact, some of their features disappear.
C: Such behavior can be harmful to the network.
C: [JS] Document what should be considered when autonomic work is done, such as control loop interactions.
C: Where does authoritative control come from: the box itself or a centralized entity?
C: Acceptance of autonomic management correlates with acceptance of policy-based network management approaches. Should we research policy-based networking?