Final Report about the AGFL/GNU Project

parser generator system for natural languages

The goal of the AGFL/GNU project is to make the AGFL linguistic parser generator system publicly available as a tool for the development of applications involving the linguistic processing of full-text documents.

The project, sponsored for 20 man-months by Stichting NLnet from January 2000 to September 2001, consisted of the following tasks:

revision of the AGFL formalism, to bring the notation in line with other forms of Affix Grammars (EAG, CDL3)
introduction of lexical and syntactical probabilities
improvement of the performance of AGFL parsers
production of documentation and examples
bringing the AGFL system under the GNU General Public License
development of a Web-based application to show AGFL's usefulness.

ACHIEVEMENTS

1. The Revision

In cooperation with the maintainers of CDL3 (Paul Jones) and EAG (Marc Seutter), the syntax and semantics of AGFL have been revised. The revised language definition forms part of the user manual. The three different Affix Grammar formalisms can now be understood as different sets of restrictions on a general formalism.
The inconvenience caused to current AGFL users by the modifications in notation is mitigated somewhat by the provision of a migration tool for converting lexica.
Kees Koster is working on a textbook uniting the AG formalisms.

2. Probabilistic parsing

The revised syntax includes both syntactical and lexical probabilities. The lexical probabilities have been implemented in the lexicon generation system and the analysis phase, allowing the use of probabilistic parsing to obtain the most probable analysis of a sentence. In combination with the newly introduced segment parsing, this makes AGFL parsers useful for practical applications in Information Retrieval. Lexical probabilities can be obtained from readily available tagged corpora.
Syntactic probabilities are for the time being ignored, but they can be implemented later when the required syntactic data become available.
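
As an illustration of how lexical probabilities might be derived from a tagged corpus, the following minimal sketch estimates the probability of each tag for a word by relative frequency. The function name, the (word, tag) input format and the tag names are assumptions made for this example; they do not reflect the AGFL lexicon format or the method used in the lexicon generation system.

```python
from collections import Counter, defaultdict


def lexical_probabilities(tagged_tokens):
    """Estimate P(tag | word) by relative frequency.

    `tagged_tokens` is a list of (word, tag) pairs, a hypothetical
    stand-in for whatever format a tagged corpus actually uses.
    """
    pair_counts = Counter(tagged_tokens)
    word_counts = Counter(word for word, _ in tagged_tokens)
    probs = defaultdict(dict)
    for (word, tag), count in pair_counts.items():
        probs[word][tag] = count / word_counts[word]
    return dict(probs)


# Tiny invented corpus: "can" occurs once as a noun and twice as a verb.
corpus = [
    ("the", "DET"),
    ("can", "NOUN"),
    ("can", "VERB"),
    ("can", "VERB"),
    ("rusts", "VERB"),
]
probs = lexical_probabilities(corpus)
```

A parser could use such estimates to rank the readings of an ambiguous word, preferring the more frequent one when several analyses of a sentence are possible.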

3. Performance improvement

The performance of the parser generator itself, and in particular the generation of lexica, has been improved so much that a complete compilation of a grammar and an associated lexicon of 300 000 entries is performed in less than a minute, making the AGFL system suitable for tight development cycles.
In optimizing the performance of the parsers generated by the system, we have been less successful. An elaborate positive and negative memoization technique has been implemented, but this development has not yet led to a significant speedup.
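
The general idea of positive and negative memoization can be sketched as follows. This is a generic top-down recognizer for a toy grammar, not the AGFL implementation: the grammar, the token categories and all names are assumptions made for the example, and the sketch assumes a grammar without left recursion. For each pair of nonterminal and input position it records the set of end positions found; a non-empty set is a positive memo, an empty set a negative one (a known failure that need not be retried).

```python
# Toy grammar (hypothetical): nonterminals map to lists of alternatives.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["det", "noun"], ["noun"]],
    "VP": [["verb", "NP"], ["verb"]],
}


def make_recognizer(grammar, tokens):
    memo = {}            # (symbol, position) -> frozenset of end positions
    stats = {"hits": 0}  # number of times a memo entry was reused

    def ends(symbol, pos):
        """Return the positions where `symbol` can end when started at `pos`."""
        if symbol not in grammar:
            # Terminal: matches exactly one token of that category.
            if pos < len(tokens) and tokens[pos] == symbol:
                return {pos + 1}
            return set()
        key = (symbol, pos)
        if key in memo:
            stats["hits"] += 1
            return memo[key]
        result = set()
        for alternative in grammar[symbol]:
            frontier = {pos}
            for member in alternative:
                frontier = {e for p in frontier for e in ends(member, p)}
                if not frontier:
                    break  # this alternative cannot succeed
            result |= frontier
        # An empty result is stored too: that is the negative memo.
        memo[key] = frozenset(result)
        return memo[key]

    return ends, memo, stats


tokens = ["det", "noun", "verb", "noun"]
ends, memo, stats = make_recognizer(GRAMMAR, tokens)
accepted = len(tokens) in ends("S", 0)
```

In this sketch the memo table only saves work when the same nonterminal is tried again at the same position, so any speedup depends on how often the grammar causes such repeated attempts.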

4. Documentation

A new user manual has been written, comprising an informal introduction to AGFL, the language definition and a chapter about transduction and input modes. Many small examples should help linguists in understanding transduction, probabilistic parsing and robustness strategies.
For computer scientists, the main attraction will be the free English grammar and lexicon EP4IR, which allow anybody to generate parsers for inclusion in their own applications. Implementation documentation is included in the source texts, but it is sparse and cryptic in a typical academic way.

5. Bringing AGFL under GNU

The KUN and the developers of AGFL have agreed to transfer their rights on the AGFL system to the Free Software Foundation. At present the system is still under evaluation by their reviewers. In the meantime the system is freely available under GPL conditions, apart from the runtime system, which is under LGPL conditions.

6. Web-based application

It was agreed with Stichting NLnet that an attractive web-relevant application of AGFL should be provided (for those who find an English parser too abstract). To this end, a system for searching, navigating and browsing a collection of documents based on a user-specified profile of phrases (rather than keywords) was implemented. Given the short time available, the system is no more than a demonstrable prototype, but (after improving the efficiency of the AGFL system) it can serve as a vehicle for research in new techniques for document disclosure and search.

The new AGFL system will be unveiled to the world in many ways, in particular by means of a workshop for linguists (January 24/25 2002) and one or more articles at linguistic conferences resulting from that workshop.
A short paper announcing its availability for applications has been submitted to the Usenix conference. In the PEKING project at KUN, AGFL will be put to the test for IR applications, which should result in many more publications.

PROBLEMS

Much has been achieved, but some problems remain to be solved. The largest is the disappointing performance of the generated parsers. This appears to be due to errors in the implementation of the memoization and branch-and-bound heuristics, resulting in erratic behaviour. Since good performance is crucial for the above-mentioned PEKING project, it has been decided to set aside programming capacity from that project to solve this problem.

THINGS TO DO

The current release 2.0 of the system is workable, and a number of linguists are using it daily. A number of enhancements are desirable:

tools for tracing and profiling. Due to the nondeterministic character of parsers (like PROLOG programs) they are hard to debug with conventional tools.
a tool for paraphrasing, i.e. the generation of random examples from the grammar and the lexicon. Experience with previous versions has shown that such a tool is very helpful in developing new grammars.
extension of the lexical phase for parsing a trellis rather than a string. This opens the door to applications like the parsing of speech and handwriting.
full implementation of probabilistic parsing.

In the coming years, these will be realised by students, and possibly by professional programmers in case new funding is found.
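
To make the paraphrasing enhancement listed above more concrete, the following minimal sketch generates random sentences from a toy context-free grammar and lexicon. The grammar, the lexicon and the function name are assumptions made for this example; AGFL grammars carry affixes and other features that the sketch ignores.

```python
import random

# Toy grammar and lexicon (hypothetical, for illustration only).
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["det", "noun"], ["noun"]],
    "VP": [["verb", "NP"], ["verb"]],
}
LEXICON = {
    "det": ["the", "a"],
    "noun": ["parser", "grammar", "lexicon"],
    "verb": ["generates", "accepts"],
}


def generate(symbol, grammar, lexicon, rng):
    """Expand `symbol` into a list of words by random choices."""
    if symbol in lexicon:
        # Lexical category: pick one word form at random.
        return [rng.choice(lexicon[symbol])]
    # Nonterminal: pick one alternative and expand each member in turn.
    alternative = rng.choice(grammar[symbol])
    words = []
    for member in alternative:
        words.extend(generate(member, grammar, lexicon, rng))
    return words


rng = random.Random(0)  # fixed seed so that runs are repeatable
sentence = generate("S", GRAMMAR, LEXICON, rng)
print(" ".join(sentence))
```

Reading a batch of such generated sentences is a quick way for a grammar writer to spot rules that overgenerate, which is presumably why the tool proved helpful in earlier versions.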