Notes for implementation of the report generation on top of SQLite

Francis J. Lacoste flacoste at logreport.org
Sun Jul 13 04:51:20 CEST 2003


Hello everyone,

Here is a design document for the implementation of the current 
report engine on top of SQLite.

It starts by a section which describes the current design. The second
section discuss the main challenge in implementing the Lire report
framework on top of the relational model. After that comes a brief
discussion of the benefit of SQLite. Finally, the new design is
described. It is not complete (it doesn't flesh out all the details,
but it should give a good idea of how to implement the thing).

Of course, comments, criticisms and suggestions for improvements are
welcome.

Kind regards,

Francis J. Lacoste

-- 
Francis J. Lacoste              . .           http://www.logreport.org
/^LogReport$/               . .               flacoste at logreport.org


1. Current Report Generation Design
   ================================

  Here are the major components of the current report generation
framework:

  Lire::ReportSpec
  ----------------

    This object represent a report specification. These objects are
    usually instantiated from a XML file. A report specification is a
    parametrisable object which can be used to generate a table of
    data (a Lire::Report::Subreport object). (A ReportSpec should
    probably renamed TableSpec which reflects better the purpose of
    the object). The two most important attributes of this object are
    filter_spec() and calc_spec() which are tree of objects that know
    how to compute the data.

    We have four kind of objects defined there:

    Lire::Aggregator -- These are the objects that can be used to
    group DLF together. Current Lire::Aggregator are:

      - Lire::Group which groups DLF records according to one or more
      field containing discrete values. For example, all DLF records
      related to the client IP address can be grouped together by this
      aggregator.

      - Lire::Rangegroup which groups DLF records based on a numerical
      field. This aggregator can be used to group together DLF records
      which are in the same file size range.

      - Lire::Timegroup which groups DLF records in a configurable
      period of time. For example, you can use this aggregator to
      group all DLF records which happened during the same hour.

      - Lire::Timeslot which is similar to the Lire::Timegroup
      aggregator but is cyclical. For example, if the DLF records span
      two days, a Timegroup grouping by hour will generate 48 groups
      (one for each hours of each days) whereas the Timeslot will
      always generate 24 groups (one for each hour of the day).

    Lire::GroupOp -- (This should be renamed to Lire::Aggregate).
    These are aggregate functions which will compute a value from all
    the DLF records in a group. In Lire we have the following
    Aggregates: Min, Max, Avg, Count, First and Last.

    Lire::FilterExpr -- These are objects which can be used to select
    a subset of the DLF records. Currently, filters can be set at the
    Table level and in sections of the report configuration.

    Lire::Records -- This is an operator which will include in the
    table fields from each DLF records.

  Lire::Report
 -------------

    This object contains the complete generated report. The
    Lire::Report can be instantiated from a XML report file or it can
    be created by the Lire::ReportConfig object. The actual report's
    data will be filled by each Lire::ReportSpec object. The
    Lire::Report object contains one Lire::Report::Section object for
    each configured section in the report.

    The data are contained in Lire::Subreport objects (this should
    really be renamed to Lire::Report::Table. Thus would also ends the
    Report/Subreport confusion). The table's data is contained in
    Group and Entry objects. Lire's table aren't real tables because
    the table's hierarchy is maintained.

    Here is the mapping from the ReportSpec package to their
    data-holding counterparts:

    Lire::ReportConfig produces Lire::Report
    Lire::Section      ->       Lire::Report::Section
    Lire::ReportSpec   ->       Lire::Report::Subreport
    Lire::Aggregator   ->       Lire::Report::Group
    Lire::GroupOp      ->       Lire::Report::Entry
    Lire::Records      ->       Lire::Report::Entry in a
Lire::Report::Group

  Lire::ReportConfig
  ------------------

    Object which contains the configuration of how the report should
    be generated. It contains a list of Lire::Section. Each
    Lire::Section can also contain Lire::FilterSpec which should be
    used by all the Lire::ReportSpec in that section. The
    Lire::ReportConfig object is responsible for instantiating the
    configured Lire::ReportSpec and setting their parameters.

  Lire::ReportGenerator
  ---------------------

    This object implements the algorithm to generate a complete
    Lire::Report using a Lire::ReportConfig and a Lire::DlfStore. The
    algorithm is straightforward:

     1. Generate DLF streams for all extended and derived schemas used
     in the report.

     2. Call init_report() on all ReportSpec.

     3. Call update_report() on each ReportSpec for all DLF records
     selected by the FilterSpec.

     4. Call end_report() on each ReportSpec.

     5 Call create_report() on the ReportConfig object which will ask
     each ReportSpec to create a Lire::Report::Subreport object.
 
2. Main Difference between SQL and Lire Report Specs
   =================================================

  The main problem when using SQL to implement our report generation
  language is that the Lire query model doesn't follow the relational
  model except for the most simple reports. A "Top-10 requesting hosts"
  table maps easily to a SQL query. A "Top-5 request's type for each
  top-10 requesting hosts" doesn't.

  Plus, our "table" contains more information than the result of a SQL
  query. In Lire all tables also contains summary values which are the
  value of the aggregate functions for the whole table. For example, the
  "Top-10 requesting hosts", will also contains the total number of
  requests. At least 2 SQL92 queries would be needed to generate our
  simple table.

  Of course, Oracle has an extension to SQL92 (CONNECT BY) for tree-like
  queries. (I think it even support producing summary rows in such
  queries). But since Oracle is far from being free software and that
  these nice extensions aren't available in any DFSG RDBMS, we will need
  a more complex algorithm to implement our queries on top of the SQL92
  subset supported by SQLite.

3. Why SQLite?
   ===========

  SQLite is a fast and portable DB library which is extensible (new
  functions can be easily added in C or perl).

  Because of its single-process/library design, it has less overhead
  than the network-multi-user RDBMS like MySQL or PostgreSQL.

  It is perfectly suited for the type of data laundering we need to do
  in Lire.

4. New Report Generation Design
   ============================

  Two new objects would be introduced to bridge the gap between SQL92
  and our Lire report language:

  Lire::DlfQuery 
  --------------

  This object would be used to represent a SQL query on a DLF table.
  It would have the following attributes:

  schema -- The DLF schema on which it will operates.

  fields -- The list of values that will be available in each result.
  This can be either a DLF field (for the Records implementation) or an
  SQL aggregate function which will produce one value (for use by
  Aggregate implementation). This will map to the SELECT ... FROM of a
  SQL query.

  group_fields -- This is the list of fields (a subset of the fields
  attribute) that are used for aggregation. This will ends up in the
  GROUP BY clause of the SQL query.

  order_fields -- This is the list of fields (a subset of the fields
  attribute) that should be used to order the result of the query.
  This will ends up in the ORDER BY clause of the SQL query.

  filter_expr -- The SQL WHERE clause. FilterExpr will now be compiled
  to a SQL expression that can be used in the WHERE clause.

  limit -- This is an integer which would be used to limit the number
  of rows returned. (LIMIT extension available in SQLite).

  nested_queries -- A list of DlfQueries which are nested in this one
  and that should share the group_fields, filter_expr and order_fields.

  Constructor:

    new( $schema )

  Construction methods:

    add_field( $name, [ $expr ] )

    add_group_field ( $name, [ $expr ] )

    add_order_by_field( $name, [ $expr ] )

    In these methods, $name is the name under which the value will be
    available in the DlfResult object. $expr is the SQL expression
    used to compute the value. It can be omitted when the value is a
    dlf's field value.

    set_filter_expr( $expr ) -- Sets the WHERE clause of the SQL query.

    set_limit( $limit )

    add_nested_query() -- Creates a new DlfQuery which will share this
    query's group fields and order by clauses.

  Two methods will be used to get the two kind of results needed in
  Lire's report:

    execute_summary( $store ) -- This will run the query using only the
    aggregate fields and its parent's group by fields. This method
    returns a DlfResult object.

    execute( $store ) -- Executes the query. It will returns the table
    row. This method returns a DlfResult object.

  Each Aggregator in Lire will know how to create a DlfQuery that can be
  used to generate its report data. 

      create_dlf_query( [$parent] ) -- This method will be added to the
      Lire::Aggregator interface. The $parent parameter will be used
      when the Aggregator is nested in another Aggregator. It will
      contains the parent's Lire::DlfQuery.

  Each Aggregate will know how to add

  Some examples:

  1. "Top-10 requesting hosts":

    execute():

    SELECT client_host, count() AS "requests" 
      FROM dlf_dns
    GROUP BY client_host
    ORDER BY requests DESC
      LIMIT 10

   execute_summary():

    SELECT count() AS "requests" FROM dlf_dns

  2. Top-5 request's type for each top-10 requesting hosts

    execute():

      Parent:
          SELECT client_host, count() AS "requests" 
            FROM dlf_dns
          GROUP BY client_host
          ORDER BY requests DESC
          LIMIT 10

      Nested:
          SELECT client_host, requesttype, count() AS "requests" 
            FROM dlf_dns
          GROUP BY client_host, requesttype
          ORDER BY requests DESC
          LIMIT 10

    execute_summary():

      Parent:
          SELECT count() AS "requests" 
            FROM dlf_dns

      Nested:
          SELECT client_host, count() AS "requests" 
            FROM dlf_dns
          GROUP BY client_host
          ORDER BY requests DESC

    Object dumps:

    {
      fields => [ "client_host", 'count() AS "requests" ],
      group_by => [ "client_host" ],
      order_by => [ "-requests" ],
      limit => 10,
      nested => [ 
          { 
              fields => [ "requesttype" ],
              group_by => [ "requesttype" ],
              limit => 5
          },
      ],
    }

  Other implementation notes
  --------------------------

    Since SQLite doesn't have many built-in functions, each of our
    Aggregate and Aggregator will need to register extensions to do its
    job.

    The Group Aggregator is the only which won't requires any extension.
    The Rangegroup and Timegroup will need to register an extension
    function which will compute a value that can be used for
    aggregation:

      GROUP BY lr_timegroup( time, '1d' ) -- dlf_field, period

      GROUP BY lr_timeslot( time, '1d' ) -- dlf_field, unit

      GROUP BY lr_rangegroup( size, 10, ... ) -- dlf_field, range-size,
...

    Even the first and last aggregate, which could be theoretically
    implemented using a SQL subselect will need such extensions.
    (Because SQLite doesn't support subselects referencing the parent's
    query values)

     SELECT ..., lr_first( timestamp, 'timestamp' ) -- dlf_field,
sort_order

    Since the patched SQLite supports new functions defined in perl, it
    will be a simple matter of putting the existing code in a function
    that will be called from SQLite.

  Lire::DlfResult
  ---------------

    The Lire::DlfResult object offers an interface on top of the SQL
    results.

    Its interface will have the following methods:

      next() -- Position the cursor on the next row of data returned.
      Returns 0 when there is no more rows.

      get_field( $name ) -- Returns the value of the column $name in
      the result.

  New method in Lire::Aggregator
  -------------------------------

    create_table( $store ) -- Will create a Lire::Report::Subreport by
    running the Lire::DlfQuery on $store. This will be a
    TemplateMethod (implement a parametrisable algorithm) with the 
    following pseudo-code:

        $query->execute_summary()
        For each rows:
            Create a group
            Set its summary values. 

        $query->execute()
        For each rows:
            Create an entry in the correct group. 



-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : http://lists.logreport.org/pipermail/development/attachments/20030712/d6ecf912/attachment.bin 


More information about the Development mailing list