Report on lr_dlf2xml performance

Francis J. Lacoste francis.lacoste at Contre.COM
Sat Mar 16 19:15:41 CET 2002


Hello all,

I've implemented the idea of mixed algorithm which chose between
the algorithm using a in memory hash table  or the one using an array
and sorted DLF based on the number of different keys present in the DLF.

This improved the performance of the report engine by :
- lowering the number of time we need to sort the DLF.
- lowering the number of time we have to process the DLFs
- we don't need to sort the DLF on timestamp anymore
- memory usage is still bounded

This mixed algorithm used a little bit more memory than the pure sorted
one, but is overall faster than the sequential and sort ones.

The major change is that when we first count the records in the DLF, we
also count the number of different keys present for each non-numeric
field. This is done using an in-memory hash, but in order not to use too
much memory, once the number of keys is higher than 5000, we use
interpolation to approximate the number of keys. Those statistics are
stored in a Lire::AsciiDlf::DlfInfo object which is available when the
report operations are initialized. Based on those statistics, the Group
operation can choose between the faster in-memory hash or slower but
memory bounded array data structure. This had also the nice side effect
of removing the need of processing the timegroup element sorted by
timestamp. We could improve the performance a little bit by raising the
keys threshold (at the cost of higher memory usage) used to choose
between the two data structures. It is now 500 for top-level group and
50 for second-level group. (For example, this means that a
top-client_host report will use an in-memory hash if there are less than
500 client hosts in the DLF file, otherwise it will sort the DLF on
client_host and uses an array to compute the reports.)

This new algorithm is now commited in the CVS. I think we can live with
those performance until we reach our 1.0 release.

Notes concerning those results
------------------------------

- The cisco log file didn't 1 month of data but 1 year.
- The numbers for the parallel, sequential and sort algorithm are
  the same as last time.
- There seems to be some skewed in the results (particularly in the
  system time) depending on when the tests were run. For example, after
  restarting X, the system time was a lot higher than before (<2 before,
  can reach 65 after the restart). 


IIS log file containing less than one day of data
=================================================

Parallel algorithm
Input size	Real	User	System	Vsize	RSS
5K		25	22.85	1.2	9124K	7848K
50K		229	215.17	9.88	12244K	10960K
100K		477	429.12	20.92	15428K	14112K
250K		1454	1354.81 69.21	17248K  14976K

Sequential algorithm
Input size	Real	User	System	Vsize	RSS
5K		30	28.33	0.21	8896K	7640K
50K		271	266.83	2.11	10804K	9520K
100K		551	538.93	4.86	12732K	10128K
250K		1387	1350.58	12.48	17248K  14076K

Sort algorithm
Input size	Real	User	System	Vsize	RSS
5K		30	26.98	0.24	8396K	7136K
50K		293	251.17	1.89	8744K	7484K
100K		581	497.57	4.1	9128K	7852K
250K		1792	1261.16	9.44	10032K	6676K

Mixed algorithm (500/50)
Input size	Real	User	System	Vsize	RSS
5K		27	26.3	0.22	8500K	7244K
50K		264	249.66	1.72	9036K	7772K
100K		560	502.32	24.6	9620K	8356K
250K		1505	1262.17	62.31	10116K	7512K

Cisco log containing 1 year of data
====================================

Parallel algorithm
Input size	Real	User	System	Vsize	RSS
5K		10	10.3	0.07	10104K	8840K
50K		95	94.27	0.2	36144K	34716K
100K		188	187.36	0.47	64500K	62836K
240K		451	430.19  0.98	128908K 124480K

Sequential algorithm
Input size	Real	User	System	Vsize	RSS
5K		13	12.32	0.12	9372K	8108K
50K		120	116.44	0.53	26776K	25328K
100K		235	230.66	0.97	46064K	44404K
240K		549	535.83	1.99	86920K	82196K

Sort algorithm
Input size	Real	User	System	Vsize	RSS
5K		14	13.23	0.05	7664K	6404K
50K		131	123.4	0.3	8920K	7612K
100K		267	246.14	0.4	9956K	8584K
240K		637	563.66	1.28	12308K  9656K

Mixed algorithm (500/50)
Input size	Real	User	System	Vsize	RSS
5K		12	12.17	0.02	7892K	6620K
50K		119	113.8	0.17	9088K	7784K
100K		242	223.65	0.39	10124K	8756K
240K		556	508.71	0.63	12464K  10844K

-- 
Francis J. Lacoste
francis at Contre.COM
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 232 bytes
Desc: not available
Url : http://lists.logreport.org/pipermail/development/attachments/20020316/04a4d8aa/attachment.bin 


More information about the Development mailing list