Re: Interface load > 100% how come?

Paul Koch (koch@smople.thehub.com.au)
Fri, 6 Sep 1996 10:40:18 +1000 (EST)

On Thu, 5 Sep 1996, Juergen Schoenwaelder wrote:

>
> Tatsuo Natsukawa <natsu@natsu.itjit.ad.jp> said:
>
> Tatsuo> I think ifload, which comes with scotty, and tkined uses the
> Tatsuo> following formula to compute an interface load:
>
> [ ... formula deleted ... ]
>
> Tatsuo> Sometimes, ifload and tkined report an interface load greater than
> Tatsuo> 100%, e.g. 106%. How could it be?
>
> Good question. I think that the current ifload implementation should
> really get it right. However, there is always a chance for errors. The
> only way to track this down is to add a couple of "puts" to the
> monitoring script to save the values returned by the SNMP responses
> and to redo the calculation by hand. This will show if the calculation
> performed by the scotty script is broken or if the agent returns
> inconsistent values. (Note, early versions of snmp_monitor.tcl are
> broken since they used the local time interval instead of the time
> interval measured by the agent. Even a small unusual network delay
> could lead to incorrect results. But this has been fixed a couple of
> months ago.)
> Juergen
> --
> Juergen Schoenwaelder schoenw@cs.utwente.nl http://www.cs.utwente.nl/~schoenw
> Computer Science Department, University of Twente, (Fax: +31-53-489-3247)
> P.O. Box 217, NL-7500 AE Enschede, The Netherlands. (Tel. +31-53-489-3678)
>

Getting over 100 percent utilisation is quite common. How can this be,
you ask? Well, from experience, this is what quite a few types of routers
actually report. It all depends on where in the software the router
grabs the port byte counts.

For example, 3Com routers can have a single logical port configured with
multiple physical paths (I think the port and path terms were mixed up
by someone at 3Com). Anyway, from what I can see, the SNMP ifInOctets and
ifOutOctets are counted well before they actually get to the software
controlling the interface.

Let's take an example setup:

Unix box ----- router_A --- serial link --- router_B ----- Unix box

I did some testing with no other devices on the network and found I could
push the link to almost 100%. The Unix boxes would not over-run the link.
So in this example no traffic was being dropped and the link was being used
to its best capacity.

Unix box ----- router_A --- serial link --- router_B ----- Unix box
DOS crap                                                   DOS crap
Other devices                                              Other devices

Then I did the next test of generating lots of different traffic with lots
of different software. From the ifInOctets and ifOutOctets I could see
the link was being hit at around 250-300%. Thus traffic worth around
150-200% of the link's capacity was being dropped and retransmitted.

I checked my calculations by hand and even took it over a long time period
(i.e. 10-20 minutes) to discount the possible collection time error
mentioned above by Juergen.
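
For anyone who wants to redo the calculation by hand, here is a minimal
sketch in Python. It assumes two SNMP samples of ifInOctets (or
ifOutOctets), sysUpTime and ifSpeed have already been collected; the
function names are my own for illustration and are not taken from ifload
or snmp_monitor.tcl. The interval comes from the agent's sysUpTime
(TimeTicks, hundredths of a second) rather than the local clock, as
Juergen describes, and a single 32-bit counter wrap is allowed for.

```python
COUNTER32_MODULUS = 2 ** 32      # ifInOctets/ifOutOctets and sysUpTime are 32-bit
TIMETICKS_PER_SECOND = 100       # sysUpTime counts hundredths of a second


def counter_delta(old, new, modulus=COUNTER32_MODULUS):
    """Difference between two samples, allowing for at most one wrap."""
    return (new - old) % modulus


def utilisation_percent(octets_old, octets_new,
                        uptime_old, uptime_new, if_speed_bps):
    """Utilisation of one direction, timed by the agent's own sysUpTime."""
    ticks = counter_delta(uptime_old, uptime_new)
    if ticks == 0:
        raise ValueError("both samples carry the same sysUpTime")
    seconds = ticks / TIMETICKS_PER_SECOND
    bits = counter_delta(octets_old, octets_new) * 8
    return 100.0 * bits / (seconds * if_speed_bps)


# Hypothetical 64 kbit/s link sampled 60 s apart (6000 TimeTicks).
print(utilisation_percent(0, 480000, 1000, 7000, 64000))
```

With these made-up numbers the link comes out at exactly 100%. If the
same octet count were divided by a locally measured 57 seconds instead of
the agent's 60, the figure would come out at about 105%, which is the
kind of collection time error a long sampling period helps to hide.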

Sigh! I jumped up and kicked the router in the butt!
But, then I changed my position. This is not a 'bad' feature, but a very
useful one. From the %utilisation, you can instantly determine if your
links are running at their best capacity or being over-run. A link running
at 100% for a period of time may not be a problem at all. But a link running
at 106% means that traffic worth 6% of the link's capacity is being dropped
and retransmitted.
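
The arithmetic behind a figure like 106% can be sketched as follows. This
assumes, as observed above, that the router counts octets before they
reach the software driving the link, so the counters reflect offered
rather than carried traffic. The function name is hypothetical.

```python
def excess_load(utilisation_percent):
    """Split an over-100% reading into the part the link cannot carry.

    Returns (excess as % of link capacity, excess as % of offered traffic).
    """
    over_capacity = max(0.0, utilisation_percent - 100.0)
    if utilisation_percent <= 0:
        return 0.0, 0.0
    share_of_offered = 100.0 * over_capacity / utilisation_percent
    return over_capacity, share_of_offered


capacity_share, offered_share = excess_load(106.0)
print(capacity_share, round(offered_share, 2))
```

For a reading of 106% the excess is 6% of the link's capacity, which is a
little under 6% of the traffic actually offered to the link.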

I then did some other tests and found that it was mainly the fault of
DOS boxes and their cruddy IP/IPX protocol stacks.

For example:

Unix box ----- router_A --- serial link --- router_B ----- Unix box
FreeBSD box                                                FreeBSD box
FreeBSD box                                                FreeBSD box

I started up lots of ftp transfers between these boxes and found the link
would still only run at 100% because the TCP stacks would back off. I
performed the same tests with DOS boxes and the ifInOctets and ifOutOctets
showed the link going through the roof.

By the way, when calculating the utilisation of a serial link, you have
to keep the transmit and receive utilisations separate because it is not
statistically sound to combine them and have a single figure.
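
A small sketch of what keeping the two directions separate looks like,
assuming a full-duplex serial link where each direction has the full line
rate to itself. The function name and the sample numbers are made up for
illustration.

```python
def direction_utilisation(in_octets, out_octets, seconds, link_bps):
    """Utilisation per direction; each direction is measured on its own."""
    capacity_bits = seconds * link_bps
    return {
        "in": 100.0 * in_octets * 8 / capacity_bits,
        "out": 100.0 * out_octets * 8 / capacity_bits,
    }


# Hypothetical 60 s sample on a 64 kbit/s link.
util = direction_utilisation(480000, 48000, 60, 64000)
print(util)
print(sum(util.values()) / 2)   # the misleading combined figure
```

Here the receive side is saturated at 100% while the transmit side sits
at 10%. Averaged into one number the link looks 55% busy, which hides the
direction that is actually full.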

My wife is a statistician and has taught me a lot about statistics.

One of the bad things about network management and monitoring is you
really need to be a jack of all trades. You need good networking, PC,
Unix, programming and statistics knowledge. Sigh! Well, back to the grind.

Oh, by the way Juergen, I have been doing some programming with scotty
lately and find it really really nice. I have 2.1.1 running on a FreeBSD
2.1.0 and 2.1.5 box with TCL7.5 and TK4.1. It compiled with no problems
straight out of the box, well off the net anyway. I am designing some
automated collection, analysis and graphing utilities for network
monitoring. I find that the fast bulk requests done with scotty generate
much less traffic than SunNetManager. I compared them both side by side
with my LanStat statistical LAN analyser. I am building a monitoring
station using FreeBSD, scotty, fping, jgraph, ghostscript, a web server,
some analysis code I have had to write and my LanStat analyser.

rgds

Paul Koch
NetRepair Technical Services Pty Ltd
P.O. Box 12717
Elizabeth Street
Brisbane, 4002
Queensland, Australia
Phone: +61 18 746017 (Mobile)
+61 7 3801 2083 (After Hours)
Email: koch@thehub.com.au