This thesis examines the use of machine learning techniques in various natural language
processing tasks, mainly information extraction from text. The objectives are to improve
the adaptability of information extraction systems to new thematic domains (or even
languages) and to improve their performance using as few resources (linguistic or human)
as possible. The thesis follows two main axes: a) the study and assessment of existing
machine learning algorithms, mainly in the stages of linguistic pre-processing (such as
part-of-speech tagging) and named-entity recognition, and b) the creation of a new machine
learning algorithm and its assessment on synthetic data, as well as on real-world data from
the task of relation extraction between named entities.
This doctoral thesis investigates how machine learning techniques can be exploited in
natural language processing, with the aim of addressing the problems of upgrading and
adapting natural language processing systems to new thematic domains or languages. The
research is confined to three important tasks of information extraction systems:
– Part-of-speech tagging for the Greek language.
– Named entity recognition.
– Relation extraction between recognised named entities.
This thesis examines how machine learning methods and techniques can be exploited to
develop systems that support these tasks and that can be adapted more easily to new
thematic domains and languages than conventional rule-based systems, which are often
hand-crafted by experts. More specifically, this thesis investigates machine learning
techniques along two main axes:
1. The application of existing techniques (both symbolic and statistical) to selected
information extraction tasks. These techniques are evaluated against each other in both
Greek and English. All the existing machine learning algorithms that were examined require
a fixed-length vector as input. However, transforming natural language into fixed-length
vectors is not always easy without imposing arbitrary limits on the maximum number of
words. This observation motivated the creation of a new machine learning algorithm that
does not require fixed-length vectors as input.
2. The development of a new machine learning algorithm that does not require
fixed-length vectors as input. This new algorithm learns context-free grammars from
positive examples, guided by heuristics such as minimum description length.
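The fixed-length input constraint described in the first axis can be illustrated with a minimal sketch. The function name `window_features`, the padding symbol and the example sentence are assumptions made for illustration and do not come from the thesis; the sketch only shows how a word is typically encoded as a fixed window of neighbouring tokens, and how any context outside that window is lost.

```python
from typing import List

PAD = "<PAD>"  # hypothetical padding symbol for positions outside the sentence


def window_features(tokens: List[str], index: int, size: int = 2) -> List[str]:
    """Return the tokens in a fixed window of `size` words on each side of `index`.

    The result always has 2 * size + 1 entries, whatever the sentence length.
    """
    features = []
    for offset in range(-size, size + 1):
        pos = index + offset
        features.append(tokens[pos] if 0 <= pos < len(tokens) else PAD)
    return features


# Illustrative sentence (an assumption, not data from the thesis).
sentence = "Mr Smith was appointed chairman of Acme Corp yesterday".split()

# Encode the word "Smith" (position 1) with two words of context on each side.
vec = window_features(sentence, 1, size=2)
```

The window size plays the role of the arbitrary limit mentioned above: with `size=2`, the words "chairman of Acme Corp" never reach the learner when it classifies "Smith", although they may be informative.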
Regarding the first axis, named entity recognition systems were developed and evaluated,
based on existing machine learning algorithms such as decision trees and neural networks.
The developed systems cover various thematic domains (management succession events,
financial news, and judicial decisions) in both Greek and English. These systems were
evaluated on Greek texts, and the evaluation revealed the disadvantages and restrictions
that the examined algorithms impose when applied to natural language data. From this
analysis we concluded that one of the main problems in applying machine learning is the
difficulty of handling variable-length data, such as the information concerning all the
words of a sentence. In contrast, a syntactic parser can easily decide whether a sentence
(or part of a sentence) is described by a given grammar. However, the manual development
of grammars suitable for a specific task is a complex process, and the results frequently
depend on the thematic domain and, of course, on the language. Consequently, if such a
grammar can be acquired automatically through machine learning, the adaptation of systems
that use such grammars to new thematic domains or languages can be considerably simplified.
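To make concrete the point that a parser can decide grammar membership for input of any length, the following is a minimal sketch of a CYK recogniser. The toy grammar, its category names and the function `recognises` are hypothetical and chosen only for illustration; the thesis does not state which parser or grammar formalism its systems use beyond context-free grammars.

```python
from itertools import product
from typing import Dict, List, Set, Tuple

# Toy grammar in Chomsky normal form (hypothetical, for illustration only).
BINARY_RULES: Dict[Tuple[str, str], Set[str]] = {
    ("TITLE", "NAME"): {"PERSON"},   # PERSON -> TITLE NAME
    ("PERSON", "NAME"): {"PERSON"},  # PERSON -> PERSON NAME
}
LEXICAL_RULES: Dict[str, Set[str]] = {
    "Mr": {"TITLE"},
    "Dr": {"TITLE"},
    "John": {"NAME"},
    "Smith": {"NAME"},
}


def recognises(tokens: List[str], start: str = "PERSON") -> bool:
    """Return True if `tokens` can be derived from `start` (CYK algorithm)."""
    n = len(tokens)
    if n == 0:
        return False
    # table[i][j] holds the non-terminals that derive tokens[i..j] inclusive.
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, token in enumerate(tokens):
        table[i][i] = set(LEXICAL_RULES.get(token, set()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):  # split point between the two sub-spans
                for left, right in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= BINARY_RULES.get((left, right), set())
    return start in table[0][n - 1]
```

With this toy grammar, `recognises(["Mr", "Smith"])` and `recognises(["Mr", "John", "Smith"])` both succeed without any change to the input representation, whereas `recognises(["John", "Smith"])` fails because no rule covers a name without a title.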
The contribution of the developed systems is significant. The named entity recognition
systems developed for the Greek language are among the first of their kind reported in
the literature. At the same time, the performance of the developed systems is
satisfactory and directly comparable to that of similar systems reported in the
literature for the corresponding time period.
Regarding the second axis, a new machine learning technique has been developed to address
the problems associated with the application of existing techniques. This new technique
belongs to the category of inductive grammar learning. Its main advantages over other
machine learning methods are the ability to handle textual data and the possibility of
using learned grammars in existing systems, replacing manually developed grammars. The
main objective of this new technique is to automate grammar creation, so that the
resulting grammars can be used with the many syntactic parsers available in the
literature, replacing existing (and probably manually constructed) grammars for various
tasks in information extraction systems.
To apply inductive grammar learning, a new algorithm has been developed that learns
grammars from positive examples only. This new algorithm can infer context-free grammars
and is based on the existing GRIDS algorithm, improving both the heuristic used and the
search process in the space of possible grammars, while also increasing the applicability
of the new algorithm to larger data collections. The requirement that the algorithm work
with positive examples only stems from the frequent absence of negative examples in
natural language processing. It should be noted that most existing grammatical inference
algorithms require negative examples in order to operate. The new algorithm has been
designed so that it can be used in classification tasks, such as named entity
recognition. This kind of usage differs from the usual application of grammatical
inference algorithms, as the verification or syntactic analysis of sentences according to
a grammar is not required. Instead, we are mainly interested in recognising sentence
parts (phrases) and classifying them into predefined semantic categories. The new
algorithm has been evaluated on both synthetic languages and real-world data for the task
of relation extraction between named entities.
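The idea of searching the space of grammars under a description-length heuristic, using positive examples only, can be sketched as follows. This is a toy illustration under stated assumptions and is neither the algorithm of this thesis nor GRIDS itself: the cost function is a crude symbol-count approximation of description length, only a single "introduce a new non-terminal for a frequent pair" operator is shown, and the names `description_length`, `replace_pair`, `create_step` and the example phrases are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple

Rule = Tuple[str, Tuple[str, ...]]  # (left-hand side, right-hand side)


def description_length(rules: List[Rule]) -> float:
    """Crude grammar cost in bits (an assumption): each symbol occurrence
    costs log2(number of distinct symbols in the grammar)."""
    symbols = {lhs for lhs, _ in rules} | {s for _, rhs in rules for s in rhs}
    occurrences = sum(1 + len(rhs) for _, rhs in rules)
    return occurrences * math.log2(len(symbols))


def replace_pair(rhs: Tuple[str, ...], pair: Tuple[str, str], new: str) -> Tuple[str, ...]:
    """Rewrite every non-overlapping occurrence of `pair` in `rhs` as `new`."""
    out, i = [], 0
    while i < len(rhs):
        if tuple(rhs[i:i + 2]) == pair:
            out.append(new)
            i += 2
        else:
            out.append(rhs[i])
            i += 1
    return tuple(out)


def create_step(rules: List[Rule], counter: int) -> List[Rule]:
    """Propose a new non-terminal for the most frequent adjacent pair and
    keep the proposal only if it shortens the description length."""
    pairs = Counter(p for _, rhs in rules for p in zip(rhs, rhs[1:]))
    if not pairs:
        return rules
    pair, _ = pairs.most_common(1)[0]
    new_symbol = f"N{counter}"
    candidate = [(lhs, replace_pair(rhs, pair, new_symbol)) for lhs, rhs in rules]
    candidate.append((new_symbol, pair))
    return candidate if description_length(candidate) < description_length(rules) else rules


# Initial grammar: one flat rule per positive example (illustrative phrases).
examples = [
    "Mr Smith joined Acme", "Mr Smith left Acme",
    "Mr Smith joined Initech", "Mr Smith left Initech",
    "Mr Smith joined Globex", "Mr Smith left Globex",
]
initial = [("S", tuple(e.split())) for e in examples]
learned = create_step(initial, counter=1)
```

On the six examples, the step introduces a rule `N1 -> Mr Smith` because the shorter right-hand sides outweigh the cost of the extra rule. On only the first three examples the same proposal is rejected, which shows how the heuristic, not a set of negative examples, decides when a generalisation is worth keeping.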