Project Wiki
Extensive documentation about the integration of Greek language to spaCy.
NLPBuddy is a side result of the project "Addition of Greek language to spaCy" for Google Summer of Code 2018.
This page serves as a report for the Google Summer of Code 2018 Project.
The project was developed under the auspices of GFOSS - Open Technologies Alliance.
This section provides links to the source code, the documentation and the project timeline.
The main purpose of the section is to list all the places in which you can find work implemented during Google Summer of Code.
There is extensive analysis of the results of the work in the Results section and in the Deliverables section. However, if you need a quick tour of the whole project, this section is for you.
Note: There are two repositories and two wiki pages.
The first repo and the corresponding wiki page include everything that has to do with the addition of the Greek language to spaCy.
The second repo and the corresponding wiki page include everything that has to do with the implementation of a demo app on top of spaCy that demonstrates its capabilities and supports various features such as sentiment analysis, topic classification, etc.
The project proposal was mainly about adding Greek language support to the spaCy platform.
This goal has been accomplished and the source code is provided in the following repository:
There is extensive documentation of every aspect of the process of adding the Greek language to spaCy in the following Wiki page:
NLPBuddy is a demo produced during the Google Summer of Code. It is built on top of spaCy and implements various interesting tasks, all of which are also supported for the Greek language.
It makes use of the first part of the Google Summer of Code project, the addition of the Greek language to spaCy, and it offers features such as syntax analysis, emotion analysis, topic classification and a lot more.
The project repo is the following:
The corresponding Wiki page is the following:
Extensive, explanatory documentation about the implementation, the usage and the reproduction of the demo.
There is a timeline that tracks the whole Google Summer of Code work on a daily basis.
It is divided into the following sections: In progress, Done, TODO, Need test, Need improvement - Future work.
You can find the timeline here.
DISCLAIMER: Due to the huge complexity of the project, it is almost impossible to list everything that was implemented during Google Summer of Code 2018. There are over 50 Completed Tasks in the Timeline, but the list may be enriched in the near future.
We live in the era of data. Every minute, 3.8 billion internet users produce content: more than 120 million emails, 500,000 Facebook comments and 3 million Google searches. If we want to process that amount of data efficiently, we need to process natural language. Open source projects such as spaCy, TextBlob and NLTK contribute significantly in that direction, and thus they need to be reinforced.
This project is about improving the quality of natural language processing for the Greek language.
The project goals can be categorized as follows:
Note: All the project goals have been achieved. In addition, many side results were produced during Google Summer of Code 2018. Analysis of the achievements (with pull requests, links to production-ready modules, etc.) follows in the next two sections.
The Greek language has been successfully added to spaCy, which was the most important goal of the project.
Two pull requests have been made; the first pull request is about the initial addition of the language and the second pull request contains important optimizations and additions that extend the features the Greek language class supports.
Addition of the language: You can see the first pull request here (Status: Merged)
Optimizations to the Greek language class: You can see the second pull request here (Status: Merged)
Each part of the process of integrating the Greek language into spaCy is discussed in detail in the Wiki page of the project.
Two models for the Greek language have been produced.
There is an ongoing process of uploading them to spaCy.
After that, you will be able to install them with the following commands:
The Greek language models support most of the capabilities that you will find in the deliverables section. Sentence splitting, tokenization, part-of-speech tagging, syntax analysis using DEP tags, named entity recognition, lexical attribute extraction, norm exceptions and stop-word lists are all included in the Greek language models. The big Greek model (el_core_web_lg) includes word vectors, so it supports features such as similarity detection between texts.
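As a rough illustration of what similarity detection with word vectors involves, the sketch below computes the cosine similarity between the averaged word vectors of two texts, which is how spaCy's default `Doc.similarity` estimate works. The three-dimensional vectors are made up for the example; a real model would supply one vector per token with many more dimensions.

```python
import math


def average(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy word vectors (hypothetical values), one inner list per token.
doc_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
doc_b = [[1.0, 1.0, 0.0]]
doc_c = [[0.0, 0.0, 1.0]]

# Each text is represented by the average of its token vectors.
similarity_ab = cosine(average(doc_a), average(doc_b))
similarity_ac = cosine(average(doc_a), average(doc_c))

print(f"a vs b: {similarity_ab:.3f}")
print(f"a vs c: {similarity_ac:.3f}")
```

With a vector-bearing model loaded as `nlp`, the corresponding spaCy call would be `nlp(text_a).similarity(nlp(text_b))`.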
You can find more about the models production, usage and maintenance, in the models page of the wiki.
Some visualizations from the models usage:
NLPBuddy is an open source text analysis tool that has been developed as a demonstration of the project results.
NLPBuddy leverages spaCy's capabilities to extract as much information as possible from raw text.
Briefly, in this demo you can perform the following tasks with your text in any of the supported languages:
The supported languages at the moment are the following: Greek, English, German, Spanish, Portuguese, French, Italian and Dutch.
Text can either be provided directly or imported from a URL. To preprocess text imported from a URL, the following libraries are used: python readability and BeautifulSoup4.
Note: All the functionalities that the demo supports (and some more) are implemented as modules so anybody can use them independently.
Those modules are extensively discussed in the deliverables section. The central idea is that this Google Summer of Code project should produce results that are going to be used later on by people all around the world. For that reason, together with my mentor, Markos Gogoulos, we have implemented an API for the demo so anybody can access the results that it provides (see more here).
A side goal of the project is to empower spaCy itself. There is an open dialogue with the creators of spaCy, whom we would like to thank for their continuous support and enthusiasm.
A pull request for documentation improvements was successfully merged.
The pull request was about a small error found in the spaCy documentation in the pseudocode provided for overriding the spaCy tokenizer.
You can see the pull request here.
I have been invited to write an article for the Explosion AI blog about the integration of the Greek language into spaCy, due to the innovative approaches followed during Google Summer of Code 2018.
The article is still being written and reviewed; it may be published after the end of Google Summer of Code.
A link to the post will be published here when it's ready.
In the process of integrating the Greek language into spaCy, some new approaches were followed. Hopefully, these approaches will inspire work on other languages too.
Deliverables are independent functionality submodules and/or useful resources that were produced either during the process of integrating the Greek language into spaCy or while experimenting with the functionalities of spaCy and implementing the demo.
A list of the deliverables and a short description of each of them follows. You can find the functionality submodules in the res/modules folder of the project repo (here), serving as usage examples.
Each of the deliverables is labelled with one of the following tags: greek-spacy-support, nlp-task, resource.
In computing, stop words are words which are filtered out before or after processing of natural language data. Though "stop words" usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removing these stop words to support phrase search.
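A minimal sketch of stop-word filtering follows. The stop-word set is a small hypothetical sample chosen for the example, not the list shipped with the Greek language class, and whitespace splitting stands in for real tokenization.

```python
# Hypothetical mini stop-word list, for illustration only.
GREEK_STOP_WORDS = {"η", "ο", "το", "τα", "της", "στην", "και", "έχει"}


def remove_stop_words(tokens, stop_words=GREEK_STOP_WORDS):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]


# Naive whitespace tokenization; a real pipeline would use the spaCy tokenizer.
tokens = "Η εταιρεία Google έχει τα γραφεία της στην Καλιφόρνια".split()
content_tokens = remove_stop_words(tokens)
print(content_tokens)
```

In spaCy itself, the same check is available per token through the `is_stop` attribute.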
spaCy usually tries to normalise words with different spellings to a single, common spelling. This has no effect on any other token attributes, or tokenization in general, but it ensures that equivalent tokens receive similar representations. This can improve the model's predictions on words that weren't common in the training data, but are equivalent to other words – for example, "realize" and "realise", or "thx" and "thanks".
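The idea can be sketched as a simple lookup table. The table below is hypothetical and only mirrors the examples mentioned above; the direction of the mapping (for instance "realise" to "realize") is an assumption made for the illustration.

```python
# Hypothetical norm-exception table: variant spelling -> common spelling.
NORM_EXCEPTIONS = {
    "realise": "realize",
    "thx": "thanks",
}


def norm(token_text, exceptions=NORM_EXCEPTIONS):
    """Return the normalised form of a token.

    Falls back to the lowercase form when no exception is listed,
    so the token text itself is never modified.
    """
    lowered = token_text.lower()
    return exceptions.get(lowered, lowered)


for word in ["realise", "realize", "Thx", "thanks"]:
    print(word, "->", norm(word))
```

In spaCy, the normalised form of a token is exposed as `token.norm_`, while `token.text` keeps the original spelling.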
Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
The Greek language models support the following NER tags: ORG, PERSON, LOC, GPE, EVENT, PRODUCT. With one of the Greek models installed, you can use the NER tagger:
Sample Input: Η εταιρεία Google έχει τα γραφεία της στην Καλιφόρνια.
Sample Output: Entity:Google, Label:ORG, Entity:Καλιφόρνια, Label:GPE
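A sample like the one above could be produced with a sketch along these lines. The helper names `extract_entities` and `analyse` are hypothetical, and the model name is taken from this report; the name under which the models are finally published may differ.

```python
def extract_entities(doc):
    """Return (entity text, entity label) pairs from a spaCy Doc."""
    return [(ent.text, ent.label_) for ent in doc.ents]


def analyse(text, model_name="el_core_web_lg"):
    """Load a model and return the entities found in `text`.

    spaCy is imported lazily so that `extract_entities` can be reused
    without spaCy installed. The default model name is an assumption.
    """
    import spacy

    nlp = spacy.load(model_name)
    return extract_entities(nlp(text))


# Example call (requires spaCy and an installed Greek model):
#   for text, label in analyse("Η εταιρεία Google έχει τα γραφεία της στην Καλιφόρνια."):
#       print(f"Entity:{text}, Label:{label}")
```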
Visualization using displaCy:
Noun chunks are "base noun phrases" – flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun – for example, "the lavish green grass" or "the world's largest tech fund".
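As a hedged sketch, noun chunks can be read from a parsed document through spaCy's `doc.noun_chunks` iterator, which requires a model with a dependency parser. The helper name `noun_chunk_summary` is hypothetical.

```python
def noun_chunk_summary(doc):
    """Return (chunk text, head noun text) pairs for each base noun phrase."""
    return [(chunk.text, chunk.root.text) for chunk in doc.noun_chunks]


# Example call (requires spaCy and a model with a parser; the model name
# below is the one mentioned in this report and is an assumption):
#   import spacy
#   nlp = spacy.load("el_core_web_lg")
#   print(noun_chunk_summary(nlp("Η εταιρεία Google έχει τα γραφεία της στην Καλιφόρνια.")))
```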
In this section, some suggestions for future work are listed. Each task is assigned a difficulty label and comes with some guidelines to start with. Further labels indicate whether a task concerns improving Greek language support or adding or improving a general NLP task. For more info on contribution, you can always have a look at the contribute page of the project wiki.
Each language modifies the spaCy tokenization procedure by adding tokenizer exceptions. The tokenizer exceptions approach is not scalable for languages such as Greek. The reasons are pretty much the same as with the lemmatizer. A new approach, rule-based tokenization, is proposed. The suggested steps are the following: