Notes for implementation of the report generation on top of SQLite
Francis J. Lacoste
flacoste at logreport.org
Sun Jul 13 04:51:20 CEST 2003
Hello everyone,
Here is a design document for the implementation of the current
report engine on top of SQLite.
It starts by a section which describes the current design. The second
section discuss the main challenge in implementing the Lire report
framework on top of the relational model. After that comes a brief
discussion of the benefit of SQLite. Finally, the new design is
described. It is not complete (it doesn't flesh out all the details,
but it should give a good idea of how to implement the thing).
Of course, comments, criticisms and suggestions for improvements are
welcome.
Kind regards,
Francis J. Lacoste
--
Francis J. Lacoste . . http://www.logreport.org
/^LogReport$/ . . flacoste at logreport.org
1. Current Report Generation Design
================================
Here are the major components of the current report generation
framework:
Lire::ReportSpec
----------------
This object represent a report specification. These objects are
usually instantiated from a XML file. A report specification is a
parametrisable object which can be used to generate a table of
data (a Lire::Report::Subreport object). (A ReportSpec should
probably renamed TableSpec which reflects better the purpose of
the object). The two most important attributes of this object are
filter_spec() and calc_spec() which are tree of objects that know
how to compute the data.
We have four kind of objects defined there:
Lire::Aggregator -- These are the objects that can be used to
group DLF together. Current Lire::Aggregator are:
- Lire::Group which groups DLF records according to one or more
field containing discrete values. For example, all DLF records
related to the client IP address can be grouped together by this
aggregator.
- Lire::Rangegroup which groups DLF records based on a numerical
field. This aggregator can be used to group together DLF records
which are in the same file size range.
- Lire::Timegroup which groups DLF records in a configurable
period of time. For example, you can use this aggregator to
group all DLF records which happened during the same hour.
- Lire::Timeslot which is similar to the Lire::Timegroup
aggregator but is cyclical. For example, if the DLF records span
two days, a Timegroup grouping by hour will generate 48 groups
(one for each hours of each days) whereas the Timeslot will
always generate 24 groups (one for each hour of the day).
Lire::GroupOp -- (This should be renamed to Lire::Aggregate).
These are aggregate functions which will compute a value from all
the DLF records in a group. In Lire we have the following
Aggregates: Min, Max, Avg, Count, First and Last.
Lire::FilterExpr -- These are objects which can be used to select
a subset of the DLF records. Currently, filters can be set at the
Table level and in sections of the report configuration.
Lire::Records -- This is an operator which will include in the
table fields from each DLF records.
Lire::Report
-------------
This object contains the complete generated report. The
Lire::Report can be instantiated from a XML report file or it can
be created by the Lire::ReportConfig object. The actual report's
data will be filled by each Lire::ReportSpec object. The
Lire::Report object contains one Lire::Report::Section object for
each configured section in the report.
The data are contained in Lire::Subreport objects (this should
really be renamed to Lire::Report::Table. Thus would also ends the
Report/Subreport confusion). The table's data is contained in
Group and Entry objects. Lire's table aren't real tables because
the table's hierarchy is maintained.
Here is the mapping from the ReportSpec package to their
data-holding counterparts:
Lire::ReportConfig produces Lire::Report
Lire::Section -> Lire::Report::Section
Lire::ReportSpec -> Lire::Report::Subreport
Lire::Aggregator -> Lire::Report::Group
Lire::GroupOp -> Lire::Report::Entry
Lire::Records -> Lire::Report::Entry in a
Lire::Report::Group
Lire::ReportConfig
------------------
Object which contains the configuration of how the report should
be generated. It contains a list of Lire::Section. Each
Lire::Section can also contain Lire::FilterSpec which should be
used by all the Lire::ReportSpec in that section. The
Lire::ReportConfig object is responsible for instantiating the
configured Lire::ReportSpec and setting their parameters.
Lire::ReportGenerator
---------------------
This object implements the algorithm to generate a complete
Lire::Report using a Lire::ReportConfig and a Lire::DlfStore. The
algorithm is straightforward:
1. Generate DLF streams for all extended and derived schemas used
in the report.
2. Call init_report() on all ReportSpec.
3. Call update_report() on each ReportSpec for all DLF records
selected by the FilterSpec.
4. Call end_report() on each ReportSpec.
5 Call create_report() on the ReportConfig object which will ask
each ReportSpec to create a Lire::Report::Subreport object.
2. Main Difference between SQL and Lire Report Specs
=================================================
The main problem when using SQL to implement our report generation
language is that the Lire query model doesn't follow the relational
model except for the most simple reports. A "Top-10 requesting hosts"
table maps easily to a SQL query. A "Top-5 request's type for each
top-10 requesting hosts" doesn't.
Plus, our "table" contains more information than the result of a SQL
query. In Lire all tables also contains summary values which are the
value of the aggregate functions for the whole table. For example, the
"Top-10 requesting hosts", will also contains the total number of
requests. At least 2 SQL92 queries would be needed to generate our
simple table.
Of course, Oracle has an extension to SQL92 (CONNECT BY) for tree-like
queries. (I think it even support producing summary rows in such
queries). But since Oracle is far from being free software and that
these nice extensions aren't available in any DFSG RDBMS, we will need
a more complex algorithm to implement our queries on top of the SQL92
subset supported by SQLite.
3. Why SQLite?
===========
SQLite is a fast and portable DB library which is extensible (new
functions can be easily added in C or perl).
Because of its single-process/library design, it has less overhead
than the network-multi-user RDBMS like MySQL or PostgreSQL.
It is perfectly suited for the type of data laundering we need to do
in Lire.
4. New Report Generation Design
============================
Two new objects would be introduced to bridge the gap between SQL92
and our Lire report language:
Lire::DlfQuery
--------------
This object would be used to represent a SQL query on a DLF table.
It would have the following attributes:
schema -- The DLF schema on which it will operates.
fields -- The list of values that will be available in each result.
This can be either a DLF field (for the Records implementation) or an
SQL aggregate function which will produce one value (for use by
Aggregate implementation). This will map to the SELECT ... FROM of a
SQL query.
group_fields -- This is the list of fields (a subset of the fields
attribute) that are used for aggregation. This will ends up in the
GROUP BY clause of the SQL query.
order_fields -- This is the list of fields (a subset of the fields
attribute) that should be used to order the result of the query.
This will ends up in the ORDER BY clause of the SQL query.
filter_expr -- The SQL WHERE clause. FilterExpr will now be compiled
to a SQL expression that can be used in the WHERE clause.
limit -- This is an integer which would be used to limit the number
of rows returned. (LIMIT extension available in SQLite).
nested_queries -- A list of DlfQueries which are nested in this one
and that should share the group_fields, filter_expr and order_fields.
Constructor:
new( $schema )
Construction methods:
add_field( $name, [ $expr ] )
add_group_field ( $name, [ $expr ] )
add_order_by_field( $name, [ $expr ] )
In these methods, $name is the name under which the value will be
available in the DlfResult object. $expr is the SQL expression
used to compute the value. It can be omitted when the value is a
dlf's field value.
set_filter_expr( $expr ) -- Sets the WHERE clause of the SQL query.
set_limit( $limit )
add_nested_query() -- Creates a new DlfQuery which will share this
query's group fields and order by clauses.
Two methods will be used to get the two kind of results needed in
Lire's report:
execute_summary( $store ) -- This will run the query using only the
aggregate fields and its parent's group by fields. This method
returns a DlfResult object.
execute( $store ) -- Executes the query. It will returns the table
row. This method returns a DlfResult object.
Each Aggregator in Lire will know how to create a DlfQuery that can be
used to generate its report data.
create_dlf_query( [$parent] ) -- This method will be added to the
Lire::Aggregator interface. The $parent parameter will be used
when the Aggregator is nested in another Aggregator. It will
contains the parent's Lire::DlfQuery.
Each Aggregate will know how to add
Some examples:
1. "Top-10 requesting hosts":
execute():
SELECT client_host, count() AS "requests"
FROM dlf_dns
GROUP BY client_host
ORDER BY requests DESC
LIMIT 10
execute_summary():
SELECT count() AS "requests" FROM dlf_dns
2. Top-5 request's type for each top-10 requesting hosts
execute():
Parent:
SELECT client_host, count() AS "requests"
FROM dlf_dns
GROUP BY client_host
ORDER BY requests DESC
LIMIT 10
Nested:
SELECT client_host, requesttype, count() AS "requests"
FROM dlf_dns
GROUP BY client_host, requesttype
ORDER BY requests DESC
LIMIT 10
execute_summary():
Parent:
SELECT count() AS "requests"
FROM dlf_dns
Nested:
SELECT client_host, count() AS "requests"
FROM dlf_dns
GROUP BY client_host
ORDER BY requests DESC
Object dumps:
{
fields => [ "client_host", 'count() AS "requests" ],
group_by => [ "client_host" ],
order_by => [ "-requests" ],
limit => 10,
nested => [
{
fields => [ "requesttype" ],
group_by => [ "requesttype" ],
limit => 5
},
],
}
Other implementation notes
--------------------------
Since SQLite doesn't have many built-in functions, each of our
Aggregate and Aggregator will need to register extensions to do its
job.
The Group Aggregator is the only which won't requires any extension.
The Rangegroup and Timegroup will need to register an extension
function which will compute a value that can be used for
aggregation:
GROUP BY lr_timegroup( time, '1d' ) -- dlf_field, period
GROUP BY lr_timeslot( time, '1d' ) -- dlf_field, unit
GROUP BY lr_rangegroup( size, 10, ... ) -- dlf_field, range-size,
...
Even the first and last aggregate, which could be theoretically
implemented using a SQL subselect will need such extensions.
(Because SQLite doesn't support subselects referencing the parent's
query values)
SELECT ..., lr_first( timestamp, 'timestamp' ) -- dlf_field,
sort_order
Since the patched SQLite supports new functions defined in perl, it
will be a simple matter of putting the existing code in a function
that will be called from SQLite.
Lire::DlfResult
---------------
The Lire::DlfResult object offers an interface on top of the SQL
results.
Its interface will have the following methods:
next() -- Position the cursor on the next row of data returned.
Returns 0 when there is no more rows.
get_field( $name ) -- Returns the value of the column $name in
the result.
New method in Lire::Aggregator
-------------------------------
create_table( $store ) -- Will create a Lire::Report::Subreport by
running the Lire::DlfQuery on $store. This will be a
TemplateMethod (implement a parametrisable algorithm) with the
following pseudo-code:
$query->execute_summary()
For each rows:
Create a group
Set its summary values.
$query->execute()
For each rows:
Create an entry in the correct group.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : http://lists.logreport.org/pipermail/development/attachments/20030712/d6ecf912/attachment.bin
More information about the Development
mailing list