Named Entity Recognition and new metrics for big economic data

Qualifications Required: 
a) experience with writing and consuming APIs b) good programming skills in Java c) MySQL
Qualifications Desired: 
reading and writing research articles, machine learning, natural language processing and Apache Solr

Over the last five years, a growing number of initiatives have started publishing detailed economic data, such as business information (e.g. https://opencorporates.com/) and public procurement data (e.g. http://platform.yourdatastories.eu/ and https://opentender.eu/).
These initiatives aim to increase accountability in the context of “follow public money” projects and, at the same time, create opportunities for business intelligence solutions.
The project focuses on two aspects: (1) named entity recognition of the contracting parties in public procurement around the world, drawing on Natural Language Processing and Computational Linguistics, the Semantic Web and Linked Open Data, and Business Registries and Corporate Databases; and (2) new metrics for supporting business intelligence solutions, such as risk factors for public procurement and business survival rate.
In the area of NLP in particular, research efforts focus on addressing data heterogeneities, such as misspellings and name or acronym mismatches, at the lexical, syntactic and semantic levels. These methodologies and practices can be applied to general problems and usually follow a traditional pipeline: text normalization, lexical analysis, part-of-speech (POS) tagging of words according to a grammar, and semantic analysis to filter content or provide a service such as information extraction, reporting, sentiment analysis or opinion mining. In this context, a series of tools (most of them in the form of an API), such as NLTK for Python; LingPipe, OpenNLP and GATE for Java; WEKA; and the Apache Lucene and Solr search engines, have been created to support the development of natural-language-based applications.
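The text normalization step mentioned above can be illustrated with a minimal Python sketch for company names. It is only an illustration under stated assumptions, not the project's method: the function name `normalize_company_name` and the `LEGAL_FORMS` list are hypothetical, and the list is deliberately incomplete. It uses only the Python standard library.

```python
import re
import unicodedata

# Hypothetical, deliberately incomplete set of legal-form tokens;
# a real system would need a curated, per-country list.
LEGAL_FORMS = {"ltd", "limited", "inc", "corp", "llc", "plc", "sa", "gmbh", "ae"}


def normalize_company_name(raw: str) -> str:
    """Return a simplified form of a company name for later matching."""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower()
    # Remove dots so dotted acronyms collapse into one token ("s.a." -> "sa").
    text = text.replace(".", "")
    # Replace any remaining punctuation with spaces.
    text = re.sub(r"[^\w\s]", " ", text)
    # Split on whitespace and drop legal-form tokens.
    tokens = [tok for tok in text.split() if tok not in LEGAL_FORMS]
    return " ".join(tokens)


if __name__ == "__main__":
    for name in ["ACME Ltd.", "Acme, Limited", "Société Générale S.A."]:
        print(name, "->", normalize_company_name(name))
```

With this sketch, spelling variants such as "ACME Ltd." and "Acme, Limited" reduce to the same string, which makes the later matching steps simpler. Removing legal forms is a design choice that trades precision for recall, since two distinct companies may differ only in their legal form.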
Entity reconciliation techniques for uniquely identifying resources are also investigated in the field of the Semantic Web and Linked Open Data (LOD). Specifically, an entity reconciliation process can be briefly defined as a method for looking up two different concepts or entities and mapping them to each other according to a certain threshold. These techniques have been applied to linking entities in the LOD realm, for instance using DBpedia.
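A threshold-based reconciliation step of this kind could look roughly like the following Python sketch. It assumes a plain string-similarity score from the standard library's `difflib.SequenceMatcher`; the function names `similarity` and `reconcile`, the sample registry and the default threshold of 0.85 are all hypothetical choices for illustration, and a real system would likely combine several similarity measures and additional attributes.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional, Tuple


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity score in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def reconcile(
    name: str,
    candidates: Iterable[str],
    threshold: float = 0.85,  # assumed value, to be tuned on real data
) -> Optional[Tuple[str, float]]:
    """Return the best-scoring candidate and its score, or None if no
    candidate reaches the threshold."""
    best: Optional[Tuple[str, float]] = None
    for candidate in candidates:
        score = similarity(name, candidate)
        if best is None or score > best[1]:
            best = (candidate, score)
    if best is not None and best[1] >= threshold:
        return best
    return None


if __name__ == "__main__":
    # Hypothetical registry entries, used only for illustration.
    registry = ["Acme Trading", "Globex Corporation", "Initech"]
    print(reconcile("Acme Tradng", registry))  # misspelled name
    print(reconcile("Umbrella", registry))     # no plausible match
```

In this sketch a misspelled name such as "Acme Tradng" is still mapped to "Acme Trading", while a name with no close counterpart in the registry yields no mapping. The threshold controls the trade-off between false matches and missed matches.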
According to the Global Open Data Index* and the Open Company Data Index**, only a few countries provide their business registries as high-quality open data. Even where corporate information, such as the company name, address, unique identifier, owner, capital, and approval and registration dates, is made publicly available by the government, re-using this valuable information can be tedious for various reasons, such as differences in content, formats and update processes.

*http://index.okfn.org/place/
**http://registries.opencorporates.com/


© 2018 - Institute of Informatics and Telecommunications | National Centre for Scientific Research "Demokritos"