Last update: 2005-04-18

Proposal for Lire-2.0 related development

tools for computer/network log file analysis

This document can serve as the basis of a sponsorship request to NLnet for more Lire development. It has four parts. The first discusses the two audiences of Lire. The second lists the features for which sponsorship is requested; the benefits of each feature are discussed in light of these two audiences. The third discusses manpower availability for this development. Finally, the fourth discusses how the schedule and release management could be handled.

Audience for 2.0 development

I think that the Lire audience is made up of two classes of users. (As a side note, it will be important to keep this in mind while designing the user survey, at least to try to verify this hypothesis.) We could also say that there are two Lire.

Lire - The Log Analyser

This first class of users sees Lire as an application which can generate reports for a lot of log formats. What these people are interested in is:

Support for more log formats.
Better support for existing ones.
Good default reports.
Tools to make it easier to configure Lire.
Better output formats.

I think that most current Lire users fall into this group.

Lire - The Log Analysis Toolkit

This second usage pattern is less concerned with the default reports or the supported log formats and more interested in using Lire as a toolkit, that is, a set of tools that can easily be integrated into a custom deployment. These people are interested in:

Clean API which makes it straightforward to integrate and extend Lire.
Documentation on how to integrate Lire.
Good output formats.
Better reporting functionalities.

We have few users of this kind, but they are the ones making the most contributions to Lire: Sun, MavEtJu, Noël Bardelot (the one integrating Lire into Cyber Sentry). This was also the important use case for the NFB project and for future consulting gigs.

Complementary Areas

Of course, there are areas of interest to both audiences. One of these would be performance. It could also be argued that the first use case is a specific customisation of the toolkit. I've selected features that appeal to both classes of users or that appeal to the toolkit's users. If the Board feels that it would be a better strategy to focus on the "Log Analyser" case, I could modify the feature list accordingly.

Features discussion

Each feature starts with a discussion of what the scope of the implementation should be. When possible, an initial breakdown of the tasks involved in implementing the feature is given.

Next comes an estimate of the number of development days needed to implement the feature. The estimates are made without dependencies. That is, each one estimates the amount of work needed to add that feature to the current source tree without adding anything else. This makes release management easier, as the features can be added in any order.

After that comes a discussion of the benefits of adding the feature. The benefits are discussed from the point of view of the two classes of users previously identified.

Finally, risks related to that feature are discussed. If possible, a risk-avoidance strategy will be proposed.

SQLite Based Report Generation

This feature is about re-implementing the report generation component to use SQLite as the query engine. I've sent an email to the development list with ideas on how to implement that feature.

Tasks

Implement DlfQuery and DlfResult.
Map Lire::FilterExpr to SQL.
Implement Group, Timegroup, Timeslot, Rangegroup, Records, Min, Max, First, Last, Count, Avg, Sum using DlfQuery and DlfResult.
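
To give an idea of what mapping a filter expression to SQL could involve, here is a minimal sketch in Python (chosen for brevity; Lire itself is written in perl). The tuple-based filter tree, the function name filter_to_sql and the table dlf_www are assumptions made for illustration and do not reflect the actual Lire::FilterExpr structure. The point is only that a recursive walk can produce a WHERE clause plus bound parameters.

```python
import sqlite3

# Hypothetical filter tree (not Lire::FilterExpr's real structure):
#   ("eq" | "ne" | "lt" | "le" | "gt" | "ge", field_name, value)
#   ("not", expr)
#   ("and" | "or", expr, expr, ...)
COMPARISONS = {"eq": "=", "ne": "<>", "lt": "<", "le": "<=", "gt": ">", "ge": ">="}


def filter_to_sql(expr):
    """Return (where_clause, params) for a filter tree."""
    kind = expr[0]
    if kind in COMPARISONS:
        _, field, value = expr
        if not field.isidentifier():
            raise ValueError(f"invalid field name: {field!r}")
        # Values travel as bound parameters, never inside the SQL text.
        return f'"{field}" {COMPARISONS[kind]} ?', [value]
    if kind == "not":
        sql, params = filter_to_sql(expr[1])
        return f"NOT ({sql})", params
    if kind in ("and", "or") and len(expr) > 1:
        clauses, params = [], []
        for sub in expr[1:]:
            sql, sub_params = filter_to_sql(sub)
            clauses.append(f"({sql})")
            params.extend(sub_params)
        return f" {kind.upper()} ".join(clauses), params
    raise ValueError(f"unsupported filter expression: {expr!r}")


# Example: GET requests that are not smaller than 1024 bytes.
expr = ("and", ("eq", "http_method", "GET"), ("not", ("lt", "bytes", 1024)))
where, params = filter_to_sql(expr)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dlf_www (http_method TEXT, bytes INTEGER)")
conn.executemany(
    "INSERT INTO dlf_www VALUES (?, ?)",
    [("GET", 2048), ("GET", 100), ("POST", 4096)],
)
matching = conn.execute(
    f"SELECT COUNT(*) FROM dlf_www WHERE {where}", params
).fetchone()[0]
```

Passing values as bound parameters keeps quoting problems out of the generated SQL; only field names are interpolated, and those are validated first.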

Estimate: 15 days.

Benefits:

I think that if only one feature were to be implemented, this should be the one, as it should bring a considerable performance improvement. Performance improvements will appeal to both classes of users.
From a development point of view, this will also move some of the complexity outside of Lire. By using SQLite, much of the optimisation of the core query language is left to an outside package (an RDBMS whose only purpose is to run queries fast).
It will also make it relatively easy to add new operators to our language, since we can build on the existing features of SQLite.

Risks

The performance gain might be less than I think it is. Although we will move to a faster data structure than a plain text file and queries will involve more C code, it is possible that the overhead of callbacks to perl diminishes the overall performance gain.
The best strategy to avoid this risk is to start by writing a small test program that generates a nested timegroup table using SQLite and to compare it with the 1.3 release to get the order of magnitude of the performance gain.
It is possible that some of our semantics are hard to implement using the proposed DlfQuery design. The count operator that counts the unique values would be such a beast. It doesn't map well to the SQL language. (I think it is not possible to generate such results with only one SQL query.)

It will be best if we start implementing that more complex operator to exercise the usefulness of the design.
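
The kind of small test program mentioned above could start from something like the following Python sketch. The table dlf_www, its columns and the sample rows are hypothetical. The query groups by day and then by page (a nested timegroup/group) and uses COUNT(DISTINCT ...), which current SQLite releases accept; whether that aggregate matches the semantics of Lire's unique-count operator in every nesting is an open question that the prototype would have to answer. Timing against real log data is deliberately left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical DLF table; ts is a Unix timestamp.
conn.execute(
    "CREATE TABLE dlf_www (ts INTEGER, client_host TEXT, requested_page TEXT)"
)

base = 1113782400  # 2005-04-18 00:00:00 UTC
conn.executemany(
    "INSERT INTO dlf_www VALUES (?, ?, ?)",
    [
        (base + 10, "a.example", "/index.html"),
        (base + 20, "a.example", "/about.html"),
        (base + 30, "b.example", "/index.html"),
        (base + 86400 + 5, "a.example", "/index.html"),
    ],
)

# Outer grouping by day, inner grouping by page, with a hit count and a
# count of unique client hosts for each (day, page) pair.
table = conn.execute(
    """
    SELECT strftime('%Y-%m-%d', ts, 'unixepoch') AS day,
           requested_page,
           COUNT(*) AS hits,
           COUNT(DISTINCT client_host) AS unique_hosts
    FROM dlf_www
    GROUP BY day, requested_page
    ORDER BY day, requested_page
    """
).fetchall()

for day, page, hits, unique_hosts in table:
    print(day, page, hits, unique_hosts)
```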

Store Configuration

This is about exposing Lire::DlfStore in the user interface, that is, making it possible to use lr_config to configure a Lire::DlfStore. This means the following possibilities:

Configuring one or more periodic log acquisitions (Lire::LogSource) (and eventually the analysers that should be run on this data).
Configuring one or more reports that should be generated periodically from the data. These XML reports should be saved in the store itself. Reports should be generated from the DLF or by merging, depending on data availability.
Configuring one or more periodic "report destinations" (that is, formatting one of the generated reports and sending it to a destination address). This could also mean generating HTML reports in a specific directory.

Tasks

Add support for LogSource, ReportGenerationJob and ReportOutputJob to lr_config.
Integrate these new job types into lr_cron.
Add support for storing XML reports and report configurations in Lire::DlfStore.
Modify ReportGenerator to use an existing Lire::DlfStore instead of creating a temporary one from a DLF file.
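
The rule "generate from the DLF or by merging, depending on data availability" could be decided along the lines of the Python sketch below. The StoreContents class, the function plan_report and the day-level granularity are all assumptions made for illustration; nothing here describes the existing Lire::DlfStore API.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class StoreContents:
    """Hypothetical summary of what a store currently holds."""
    dlf_days: frozenset      # days for which DLF records are still available
    report_days: frozenset   # days for which a daily XML report was saved


def days_in(start, end):
    """All days from start to end, inclusive."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


def plan_report(store, start, end):
    """Decide how a report covering start..end could be produced."""
    days = days_in(start, end)
    if all(day in store.dlf_days for day in days):
        return "dlf"          # query the DLF records directly
    if all(day in store.report_days for day in days):
        return "merge"        # merge previously saved daily reports
    return "unavailable"      # neither source covers the whole period


today = date(2005, 4, 18)
yesterday = today - timedelta(days=1)
store = StoreContents(
    dlf_days=frozenset({today}),
    report_days=frozenset({yesterday, today}),
)
print(plan_report(store, today, today))      # DLF still covers today
print(plan_report(store, yesterday, today))  # only saved reports cover both days
```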

Estimate: 10 days

Benefits:

This will make it practical to implement multi-schema reporting, since one store can contain data from more than one source.
This will also make it practical to use the log continuation feature.
This will enable easier long-term reporting by integrating the scheduling of report merging.

Risks:

It might be problematic to support both usage scenarios (batch scheduling of lr_log2report and the store-oriented configuration) in lr_config. A solution might be to drop support for the old job style.
Designing user-interface components is often more tedious than one would initially think. Since I don't have much experience in that area, my estimate might be optimistic.

Internationalisation

It can be argued that a reporting solution which doesn't support the user's native language isn't of much value. (All web reporting tools out there support reports in languages other than English.) Although making Lire actually support languages other than English would be a big task (because of the translations involved), making it possible to support them is a lot less work.

Tasks:

Support non-ASCII encoding of our XML files.
Use Locale::gettext to internationalise Lire's messages.
Use one of the various XML i18n tools (KDE and Gnome have some) to support I18N of our various XML files.

Estimate: 8 days

Benefits:

Localisation is an area where free software works well. A lot of users who can't help with the code might help to localise Lire into their language.
Users have requested support for encodings other than ASCII in reports. (A Russian user wanted his report specification to use the KOI8-R encoding.)

Risks:

The integration of the XML I18N tools of Gnome or KDE might be problematic, in which case the L10N of our reports might be limited. The scope of the I18N effort could then be reduced; the most important part of this feature is really to support more encodings in the XML report specifications.

Multiple Schemas Reporting

With the new DlfStore in place, it would be possible to generate reports using data coming from multiple log files and thus using multiple schemas.

This feature isn't about storing the report configuration in an XML format, nor about editing report configuration files using lr_config (what Wessel started working on).

Tasks:

Modify ReportConfig to support report specifications coming from more than one schema.
Modify ReportGenerator to use a DlfStore instead of a Dlf file.
Modify ReportMerger to support multiple schemas.
Update the output formatters for multiple schemas. (They usually print "www services".)
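
As a rough picture of what a multi-schema report means once the data lives in one store, the Python sketch below runs report specifications belonging to two schemas against a single SQLite database and collects the results per schema. The table names, columns, specification tuples and the generate_report function are hypothetical and only stand in for ReportConfig and ReportGenerator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One store holding DLF data for two hypothetical schemas.
conn.executescript(
    """
    CREATE TABLE dlf_www (client_host TEXT, bytes INTEGER);
    CREATE TABLE dlf_email (from_domain TEXT, size INTEGER);
    INSERT INTO dlf_www VALUES ('a.example', 100), ('a.example', 50), ('b.example', 10);
    INSERT INTO dlf_email VALUES ('example.org', 2000), ('example.org', 500);
    """
)

# Each specification names the schema it belongs to, a title and a query.
SPECS = [
    ("www", "Bytes by client host",
     "SELECT client_host, SUM(bytes) FROM dlf_www "
     "GROUP BY client_host ORDER BY client_host"),
    ("email", "Volume by sender domain",
     "SELECT from_domain, SUM(size) FROM dlf_email "
     "GROUP BY from_domain ORDER BY from_domain"),
]


def generate_report(conn, specs):
    """Return {schema: [(title, rows), ...]} for all specifications."""
    report = {}
    for schema, title, sql in specs:
        rows = conn.execute(sql).fetchall()
        report.setdefault(schema, []).append((title, rows))
    return report


report = generate_report(conn, SPECS)
for schema, subreports in report.items():
    for title, rows in subreports:
        print(schema, title, rows)
```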

Estimate: 4 days

Benefits:

Users (especially the ones providing ISP-like services) have asked to be able to generate one report for each virtual service: email, ftp and web. Implementing this feature would make such a report possible.
It would be possible to provide generic schemas (like login, daemon or connection) that could be re-used by multiple Dlf converters.

Risks:

No major risks foreseen.

Analysers API

Analysers are an important part of Lire, but their API needs to be updated to match the functionalities of the DlfConverter. Analysers should also be run after each log importation step and not every time a report is generated.

Tasks:

Define an API for ExtendedSchema analysers (the ones that extract values for each DLF record).
Define an API for DerivedSchema analysers (the ones that create M new DLF records from N DLF records).
Store only the fields of the ExtendedSchema DLF records in the SQL table. (It currently also stores the fields of the main schema.)
Modify the ReportGenerator to support an ExtendedSchema that stores its DLF in a separate table.
Find a way to specify how the analysers are configured at the job level and integrate that into DlfConverterProcess so that analysis is done right after the conversion to DLF.
Write backward compatibility wrappers.
Documentation and examples for the new API.
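
The two kinds of analysers could be given interfaces shaped roughly like the Python sketch below. The class names, method signatures, field names and the dlf_id key are assumptions made for illustration, not a proposal for the final perl API. The extended analyser returns only the extra fields for each record, which is what would allow them to be stored in a separate table keyed by record id; the derived analyser turns N records into M new ones.

```python
from abc import ABC, abstractmethod


class ExtendedAnalyser(ABC):
    """Computes extra field values for each DLF record."""

    @abstractmethod
    def analyse(self, record):
        """Return a dict holding only the extended fields for this record."""


class DerivedAnalyser(ABC):
    """Creates M new DLF records from N input records."""

    @abstractmethod
    def analyse(self, records):
        """Return an iterable of new records."""


class BrowserFamily(ExtendedAnalyser):
    # Hypothetical example: derive a coarse browser name from the user agent.
    def analyse(self, record):
        agent = record.get("useragent", "")
        return {"browser": "Firefox" if "Firefox" in agent else "other"}


class HostSummary(DerivedAnalyser):
    # Hypothetical example: one derived record per client host.
    def analyse(self, records):
        totals = {}
        for record in records:
            host = record["client_host"]
            totals[host] = totals.get(host, 0) + 1
        return [
            {"client_host": host, "requests": count}
            for host, count in sorted(totals.items())
        ]


def run_extended(analyser, records):
    """Map each record id to its extended fields, without copying base fields."""
    return {record["dlf_id"]: analyser.analyse(record) for record in records}


records = [
    {"dlf_id": 1, "client_host": "a.example", "useragent": "Mozilla/5.0 Firefox/1.0"},
    {"dlf_id": 2, "client_host": "a.example", "useragent": "Lynx"},
    {"dlf_id": 3, "client_host": "b.example", "useragent": "Lynx"},
]
extended = run_extended(BrowserFamily(), records)
derived = list(HostSummary().analyse(records))
```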

Estimate: 20 days

Benefits:

This feature is mostly useful to our second class of users. Although we have received many contributed DLF converters, there have been no contributed analysers. But analysers are crucial because they make it easier to write DLF converters: converters are simpler if they don't need complex analysis logic in them. The best example is the current email converters, each of which needs to implement similar complex logic. Most of the NFB project was about writing analysers.
The only interest for our first class of users is the performance gain we might see. The first gain would come from the fact that analysers would be run only once rather than on each report generation. The second would be that the disk space required to generate reports would be lowered. Right now each extended analyser more than doubles the disk space required because it copies the content of the main DLF fields.

Risks:

This is a high-risk item because the scope of the feature and the implementation strategy aren't really independent of some of the other features (the SQLite backend and the Store Configuration). The estimate should be re-evaluated once the order of the features is settled.

Better HTML Output

The HTML reports are really suboptimal. They look awful because we are using DocBook as an intermediary format and the table support in the DocBook stylesheets is deficient. Developing an HTML formatter which generates HTML directly from the XML report would bring many benefits to Lire.
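
To show how small a direct XML-to-HTML formatter can be, here is a Python sketch. The subreport, title, entry, name and value elements are invented for the example and are not the real Lire report markup; the function name and the CSS class names are hypothetical as well. Each subreport becomes a table whose appearance is left entirely to a separate stylesheet.

```python
import html
import xml.etree.ElementTree as ET


def subreport_to_html(xml_text):
    """Format one (hypothetical) subreport element as an HTML table."""
    root = ET.fromstring(xml_text)
    title = html.escape(root.findtext("title", default=""))
    lines = ['<table class="lire-subreport">', f"<caption>{title}</caption>"]
    for entry in root.findall("entry"):
        # Escape every value so report data cannot break the markup.
        name = html.escape(entry.findtext("name", default=""))
        value = html.escape(entry.findtext("value", default=""))
        lines.append(
            f'<tr><td class="name">{name}</td><td class="value">{value}</td></tr>'
        )
    lines.append("</table>")
    return "\n".join(lines)


sample = (
    "<subreport><title>Top pages</title>"
    "<entry><name>/a&amp;b</name><value>3</value></entry>"
    "</subreport>"
)
print(subreport_to_html(sample))
```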

Estimate: 4 days

Benefits:

Having nice HTML reports would make it easier to offer professional services around Lire, as this is usually the report format corporate users are interested in.
By implementing the HTML formatter using perl, we would remove dependencies on libxslt and the XSL stylesheets for an important output format.
The HTML reports could benefit from the possibility of cross-referencing information (to explanations of fields, for example) or frames.
By using a separate CSS stylesheet, we could make it easier for users to style the generated reports.

Risks:

Not using the DocBook stylesheets means we are going to need to reimplement a subset of the DocBook-to-HTML formatting (since we are using DocBook as our markup language in <description> elements). This risk is somewhat limited by the fact that we have already implemented a DocBook text formatter.

Better PDF Output

The PDF output format could also be improved, for about the same reasons as the HTML format: DocBook tables look really ugly.

Estimate: 5 days

Benefits:

The argument about an easier sales pitch for professional services still holds, since corporate users will also like a nice-looking PDF report.
By using a perl PDF generation library (PDF::API2, for example), we could remove the requirements on TeTeX, the DocBook DSSSL style sheets, jade and jadetex for an important output format.

Risks:

It is not certain that text and tables would really be easier to implement nicely with a perl PDF library. (One important limitation of PDF libraries is usually their lack of good hyphenation.) This risk could be avoided by using LaTeX as an intermediary format. LaTeX has really nice table support. This wouldn't bring us the benefit of dropping the requirement on TeX, but we would still drop many requirements and have really nice-looking reports.

Improved Charting Tool

The charts generated by Lire could be improved in several ways. It should be possible to generate charts which plot more than one variable on the same chart. The charts should be configurable in other ways than by changing report specifications.

Tasks:

Add new directives to the ReportConfig to configure charts.
Update PloticusChartWriter to generate the charts using the charts configured in the ReportConfig.
Modify the Lire Report Markup Language so that a chart can exist independently of a subreport (for the case where the plotted variables come from different subreports).
Update HTML and PDF output formatters to support the new chart element.

Estimate: 8 days

Benefits:

Charts make it easier for users to see patterns and trends in the report's data. Having two or more variables on the same chart makes it easier to compare trends.
Better charts would also make it easier to sell Lire as a viable toolkit in corporate environments.
This feature has been requested by several users.

Risks:

The changes to the XML report language are risky because they involve changes to several components (output formatters, parsers, Lire::Report package). To reduce risk, the scope of the feature could be reduced to only having the chart configuration in the ReportConfig and plotting two or more variables from the same subreport.

Man-power

For the rest of the summer, I'm available to work part-time on Lire at a rate of about 24 hours a week. (I have another contract, two days a week, with a university research group.) In September, my availability could be increased to 32 hours.

I have a suggestion to make, though, to increase the work effort available to develop Lire. Wolfgang Sourdeau, a friend of mine, is also available for work. Joost already knows him, but for those who don't, I'll introduce him briefly: he's a Belgian who has been living in Montreal for the past 4 or 5 years. He was one of my employees when I worked at iNsu. He's a Debian developer and is also skilled in perl and C.

Besides these qualities, there is another thing that could be interesting for the Lire project: we could code major parts of it using pair programming. Wolfgang and I started doing this a few weeks ago on some other projects. (Since he was also looking for work, I proposed two months ago that we collaborate on future projects.)

For those who don't know what pair programming is, it's a practice advocated by the proponents of eXtreme Programming in which all production code is written by two programmers at the same desk. What might at first be seen as a waste of resources gives really interesting results. The most important benefits of pair programming are:

Two people know the code instead of one. For Lire this would mean that at least one other person is familiar with the innards of Lire.
It makes "knowledge transfer" faster when working with somebody who has less experience with the existing code.
When you work alone, you often encounter those "empty moments" where you just stare at the code wondering what to do. When you work in a pair and this happens, it's time to switch the keyboard.
All the reasons why we say that "Two heads are better than one".
Other arguments in favour of Pair Programming can be found on http://www.extremeprogramming.org/rules/pair.html

Schedule and Release Management

In summary, here are the various features along with their estimates:

1.	SQLite Based Report Generation	15 days
2.	Store Configuration	10 days
3.	Internationalisation	8 days
4.	Multiple Schemas Reporting	4 days
5.	Analysers API	20 days
6.	Better HTML Output	4 days
7.	Better PDF Output	5 days
8.	Improved Charting Tool	8 days	+
	Total:	74 days

All of these features are independent of each other (except for the Analysers API item, whose estimate needs to be re-evaluated depending on whether "SQLite Based Report Generation" and "Store Configuration" are implemented or not).

This means that we can reorganise the priority of the features and we can make a release anytime a subset of these features is considered enough for a release. (With the unit tests and functional tests, making a release is a relatively straightforward job: it took about 2 days last time.)

Apart from "SQLite Based Report Generation", which I think should be our top priority, I don't have a strong opinion as to the order of the other features. (Although putting "Internationalisation" in early will make it less work to I18N the other features.)

This also means that the feature list could be prioritised according to input from our users. This could be made part of the user survey that was suggested at the last Board meeting.

As far as release dates go, I think it would be wise to let two weeks of pair programming pass in order to measure our velocity (that is, the number of development days we are able to complete per week). Release dates could then be picked according to the features we want implemented before making a release (and the above estimates).
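
Once a velocity has been measured, turning the estimates into a rough release horizon is simple arithmetic, as the Python sketch below shows. The estimates are the ones listed above; the velocity of 6 development days per week is a purely hypothetical figure used to make the calculation concrete, not a measurement or a commitment.

```python
import math

# Estimates from the summary above, in development days.
ESTIMATES = {
    "SQLite Based Report Generation": 15,
    "Store Configuration": 10,
    "Internationalisation": 8,
    "Multiple Schemas Reporting": 4,
    "Analysers API": 20,
    "Better HTML Output": 4,
    "Better PDF Output": 5,
    "Improved Charting Tool": 8,
}


def weeks_needed(features, velocity_days_per_week):
    """Calendar weeks needed for the given features, rounded up."""
    total_days = sum(ESTIMATES[name] for name in features)
    return math.ceil(total_days / velocity_days_per_week)


HYPOTHETICAL_VELOCITY = 6  # development days per week; an assumed value

total = sum(ESTIMATES.values())
all_features = weeks_needed(ESTIMATES, HYPOTHETICAL_VELOCITY)
first_release = weeks_needed(
    ["SQLite Based Report Generation", "Store Configuration"],
    HYPOTHETICAL_VELOCITY,
)
print(total, all_features, first_release)
```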

As a final note, I think we should make more releases rather than fewer. Since all of these features bring important benefits to the users, I think we should push out these benefits sooner rather than later.

I'm eagerly waiting for your comments on this proposal and will gladly answer any questions about it. Also, if the Board would like to focus more on the "Log Analyser" audience, you can suggest other features and I'll integrate them into this proposal.

Best regards
Francis J. Lacoste
