[tkined] SQL Databases...

From: Cary O'Brien (cobrien@optimsys.com)
Date: Thu Feb 24 2000 - 14:17:12 MET


Sent this wrongly. Retrying. Again! But this time for sure...

> Date: Mon, 21 Feb 2000 18:06:39 +0100 (CET)
> From: Szokoli Gabor <szocske@vaskutya.sch.bme.hu>
> Precedence: bulk
>
> On Thu, 17 Feb 2000, Eddie Corns wrote:
>
> > The major limit
> > in this instance turned out to be the unexpected (to me) serial update problem
> > for SQL (as I believe it's technically known).
> You scared me.
> I'm not good at RDBMSs either, what's this problem?
>
> Szocske
>
> This is how I understand it. Imagine you were creating a database to hold
> this sort of data, something that would surely spring to mind would be create
> a sequential list of records such that as you do each poll you just write into
> the _next_ record. This would be the fastest you could do. However when
> dealing with SQL (probably similar on other RDBMSs) a) you have no idea
> whether it will put sequential records next to each other and b) you have no
> way of telling it you are doing things in sequence anyway. This means that
> for every single record it has to actually search for the place to insert the
> record.
>

Not exactly (although nothing is guaranteed). In most cases the table is
stored in one, well, thing [1], and the optional index in another, so
inserts usually go at the end of the table. What can be expensive is
updating the index of the keys, which is usually a B-tree or, less
often, a hash table.

There are also things called "clusters". A cluster tells the database to
store records from different tables that share the same key values in the
same place. So if I have "employees" and "publications" tables, they can
be "clustered" so that the database tries to keep matching entries
physically near each other (to maximize the chance they are all in cache
and minimize disk seeks if they aren't).

One way to speed inserts up is to group a bunch of them in a transaction,
by issuing a "begin" beforehand and a "commit" at the end. This tells the
database to basically [2] keep the information in memory during the
transaction and write it all out in one batch at the commit.
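
For example (just a sketch; the "samples" table, host names and timestamp
are made up, and it assumes a pipe to psql like the one shown further
down):

        puts $db "begin;"
        puts $db "insert into samples values ('router1', 951404232);"
        puts $db "insert into samples values ('router2', 951404232);"
        puts $db "commit;"
        flush $db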

Another classic 'hack' to speed up inserts is to drop (delete) the indexes,
add the data, and then recreate the indexes. If you have enough rows
to insert this can be quicker overall, but while the indexes are dropped,
searches on the table can be slow. This can be avoided at the cost of
disk space by keeping two copies of the table and ping-ponging between
the two tables.
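
Roughly like this (again only a sketch through the same kind of psql
pipe; the table, column and index names are invented):

        # rebuild the index only once, after all the rows are in
        puts $db "drop index samples_host_idx;"
        puts $db "begin;"
        foreach {host time} $results {
            puts $db "insert into samples values ('$host', $time);"
        }
        puts $db "commit;"
        puts $db "create index samples_host_idx on samples (host);"
        flush $db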

In addition, most databases have some sort of fast load option. In
PostgreSQL it is the COPY command, and in Oracle it is the direct-path
load in SQL*Loader. The price here is that you need to drop the
indexes and rebuild them afterwards. But for bulk insertion the overall
procedure can be MUCH faster.
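
With psql you can even feed COPY through the same pipe: send a "copy ...
from stdin" statement, then one tab-separated line per row, then a line
containing just "\." to finish (a sketch; the table and variable names
are made up):

        puts $db "copy samples from stdin;"
        foreach {host time} $results {
            # columns are tab-separated, one row per line
            puts $db "$host\t$time"
        }
        # a line with just backslash-dot ends the copy
        puts $db {\.}
        flush $db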

> In practical terms, my first attempt to update the database by doing an UPDATE
> command for each poll required several hours for each iteration, however once
> I converted to outputting the data to a file then importing it, it only took a
> few seconds! I assume Stuart was noticing similarly huge differences.
>

Good!

> I've just discovered that most of the time now (of the 2 minutes per
> iteration) is between the SNMP walk command (that gets the ifDescr) and
> timeouts for hosts that don't respond. I've just recoded a test version of
> the walk code to do the parallel SNMP gets that I do for the main data
> gathering and have reduced the time from 1 minute 20 seconds to about 8
> seconds. I also tried various other versions including sending out every
> single SNMP get request asynchronously to all ifDescr.X for all 80+ hosts
> (after extracting X from ifNumber.0), then waiting for all the results,
> that one was still 45 seconds.
>

That's pretty good. But correctly set up, you ought to be able to load
80 rows into a database in a couple of seconds. It may just be a case
of pipelining or something. Something like:

        set db [open "|psql dbname" w]
        ...
        puts $db "insert into table values ('$host',$time);"

You may want to bracket groups of 50-100 with begin/commit as described
above.
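
Something like this (a sketch; the batch size and the "samples" table
are guesses, and it assumes the $db pipe from above):

        set n 0
        puts $db "begin;"
        foreach {host time} $results {
            puts $db "insert into samples values ('$host', $time);"
            # commit every 100 rows and start a new transaction
            if {[incr n] % 100 == 0} {
                puts $db "commit;"
                puts $db "begin;"
            }
        }
        puts $db "commit;"
        flush $db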

As you've probably guessed, I'm a big database fan. Except in very
special cases, my feeling is that the effort up front in getting it to
work with the database will be repaid many times over in the
flexibility of reporting. Plus, knowing about RDBMSs may come in handy
for other tasks.

In any case, isolate the data storage routines from the rest of the
code. Then you have the option of swapping out the RDBMS by changing
a single "package require" line in your startup code.

- - -- cary

[1] PostgreSQL uses separate files. Oracle allocates space itself in
    things it builds out of sets of files (or raw devices) called
    tablespaces (right?).

[2] Transactions are really more complicated than this.

--
!! This message is brought to you via the `tkined & scotty' mailing list.
!! Please do not reply to this message to unsubscribe. To subscribe or
!! unsubscribe, send a mail message to <tkined-request@ibr.cs.tu-bs.de>.
!! See http://wwwsnmp.cs.utwente.nl/~schoenw/scotty/ for more information.


