Report on lr_dlf2xml performance
Francis J. Lacoste
francis.lacoste at Contre.COM
Sat Mar 16 19:15:41 CET 2002
Hello all,
I've implemented the idea of mixed algorithm which chose between
the algorithm using a in memory hash table or the one using an array
and sorted DLF based on the number of different keys present in the DLF.
This improved the performance of the report engine by :
- lowering the number of time we need to sort the DLF.
- lowering the number of time we have to process the DLFs
- we don't need to sort the DLF on timestamp anymore
- memory usage is still bounded
This mixed algorithm used a little bit more memory than the pure sorted
one, but is overall faster than the sequential and sort ones.
The major change is that when we first count the records in the DLF, we
also count the number of different keys present for each non-numeric
field. This is done using an in-memory hash, but in order not to use too
much memory, once the number of keys is higher than 5000, we use
interpolation to approximate the number of keys. Those statistics are
stored in a Lire::AsciiDlf::DlfInfo object which is available when the
report operations are initialized. Based on those statistics, the Group
operation can choose between the faster in-memory hash or slower but
memory bounded array data structure. This had also the nice side effect
of removing the need of processing the timegroup element sorted by
timestamp. We could improve the performance a little bit by raising the
keys threshold (at the cost of higher memory usage) used to choose
between the two data structures. It is now 500 for top-level group and
50 for second-level group. (For example, this means that a
top-client_host report will use an in-memory hash if there are less than
500 client hosts in the DLF file, otherwise it will sort the DLF on
client_host and uses an array to compute the reports.)
This new algorithm is now commited in the CVS. I think we can live with
those performance until we reach our 1.0 release.
Notes concerning those results
------------------------------
- The cisco log file didn't 1 month of data but 1 year.
- The numbers for the parallel, sequential and sort algorithm are
the same as last time.
- There seems to be some skewed in the results (particularly in the
system time) depending on when the tests were run. For example, after
restarting X, the system time was a lot higher than before (<2 before,
can reach 65 after the restart).
IIS log file containing less than one day of data
=================================================
Parallel algorithm
Input size Real User System Vsize RSS
5K 25 22.85 1.2 9124K 7848K
50K 229 215.17 9.88 12244K 10960K
100K 477 429.12 20.92 15428K 14112K
250K 1454 1354.81 69.21 17248K 14976K
Sequential algorithm
Input size Real User System Vsize RSS
5K 30 28.33 0.21 8896K 7640K
50K 271 266.83 2.11 10804K 9520K
100K 551 538.93 4.86 12732K 10128K
250K 1387 1350.58 12.48 17248K 14076K
Sort algorithm
Input size Real User System Vsize RSS
5K 30 26.98 0.24 8396K 7136K
50K 293 251.17 1.89 8744K 7484K
100K 581 497.57 4.1 9128K 7852K
250K 1792 1261.16 9.44 10032K 6676K
Mixed algorithm (500/50)
Input size Real User System Vsize RSS
5K 27 26.3 0.22 8500K 7244K
50K 264 249.66 1.72 9036K 7772K
100K 560 502.32 24.6 9620K 8356K
250K 1505 1262.17 62.31 10116K 7512K
Cisco log containing 1 year of data
====================================
Parallel algorithm
Input size Real User System Vsize RSS
5K 10 10.3 0.07 10104K 8840K
50K 95 94.27 0.2 36144K 34716K
100K 188 187.36 0.47 64500K 62836K
240K 451 430.19 0.98 128908K 124480K
Sequential algorithm
Input size Real User System Vsize RSS
5K 13 12.32 0.12 9372K 8108K
50K 120 116.44 0.53 26776K 25328K
100K 235 230.66 0.97 46064K 44404K
240K 549 535.83 1.99 86920K 82196K
Sort algorithm
Input size Real User System Vsize RSS
5K 14 13.23 0.05 7664K 6404K
50K 131 123.4 0.3 8920K 7612K
100K 267 246.14 0.4 9956K 8584K
240K 637 563.66 1.28 12308K 9656K
Mixed algorithm (500/50)
Input size Real User System Vsize RSS
5K 12 12.17 0.02 7892K 6620K
50K 119 113.8 0.17 9088K 7784K
100K 242 223.65 0.39 10124K 8756K
240K 556 508.71 0.63 12464K 10844K
--
Francis J. Lacoste
francis at Contre.COM
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 232 bytes
Desc: not available
Url : http://lists.logreport.org/pipermail/development/attachments/20020316/04a4d8aa/attachment.bin
More information about the Development
mailing list