Natural language tools

The csc.nl module provides tools for working with natural language text.

To use any of these methods, you must first get an NLTools object. Here is how to get that object for English:

>>> from csc.nl import get_nl
>>> en_nl = get_nl('en')

Because Language objects store a reference to their NLTools, an alternative way to get the same object is:

>>> from csc.conceptnet4.models import Language
>>> en = Language.get('en')
>>> en_nl = en.nl

class csc.nl.NLTools

An NLTools object provides methods for dealing with natural language text in a particular language.

So far, we have three classes of languages:

  • “Lemmatized” languages are languages where we have an MBLEM lemmatizer for removing or adding inflections to words. So far, this is just English, but it could easily include Dutch or German as well.
  • “Stemmed” languages are languages where we rely on a Snowball (Porter) stemmer to remove inflections from words. As there is a Snowball stemmer for most European languages, we treat most of them as stemmed languages.
  • “Default” languages are ones where we don’t really know how to implement any NLP tools yet. All the methods perform trivial operations. Japanese, Korean, Chinese, and Arabic are currently “default” languages.

With an NLTools object, you can perform these operations:

  • Detecting stopwords, or other words that we want to handle specially
  • Tokenizing a sentence by adding spaces and manipulating punctuation
  • Stemming/lemmatizing a phrase (which replaces inflected words with a single base form)
  • Synthesizing a phrase from a lemmatized form and a set of inflections

The subclasses of NLTools define how these operations actually work.

class csc.nl.euro.EuroNL(lang, exceptions=None)

A language that generally follows our assumptions about European languages, including:

  • Words are made of uppercase and lowercase letters, which are variant forms of each other, and apostrophes, which are kind of special.
  • Words are separated by spaces or punctuation.

Only the subclasses of EuroNL – StemmedEuroNL and LemmatizedEuroNL – implement all of the NLTools operations.

is_stopword(word)

A stopword is a word that contributes little to the semantic meaning of a text and should be ignored. These tend to be short, common words such as “of”, “the”, and “you”.

Stopwords are often members of closed classes such as articles and prepositions.

Whether a word is a stopword or not is a judgement call that depends on the application. In ConceptNet, we began with the stock lists of stopwords from NLTK, but we have refined and tweaked the lists (especially in English) over the years.

Examples:

>>> en_nl.is_stopword('the')
True
>>> en_nl.is_stopword('THE')
True
>>> en_nl.is_stopword('defenestrate')
False

>>> pt_nl = get_nl('pt')      # This time, in Portuguese
>>> pt_nl.is_stopword('os')
True
>>> pt_nl.is_stopword('the')
False
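
The examples above suggest that the lookup is case-insensitive and specific to each language. A minimal sketch of that behavior follows. The names toy_is_stopword and TOY_STOPWORDS are hypothetical, the word sets are tiny placeholders rather than the real lists, and lowercasing before the lookup is an assumption inferred from the 'THE' example, not a description of the actual implementation.

```python
# Hypothetical toy stopword sets; the real lists are larger and hand-tuned.
TOY_STOPWORDS = {
    "en": {"of", "the", "you"},
    "pt": {"os", "de", "que"},
}


def toy_is_stopword(lang, word):
    """Return True if `word` is in the toy stopword set for `lang`."""
    # Assumption: lowercase first so that 'THE' and 'the' are treated alike.
    return word.lower() in TOY_STOPWORDS[lang]


print(toy_is_stopword("en", "THE"))
print(toy_is_stopword("pt", "the"))
```

Keeping one set per language is what lets 'the' be a stopword in English but not in Portuguese.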

is_blacklisted(text)

The blacklist is used to discover and discard particularly unhelpful phrases.

A phrase is considered “blacklisted” if every word in it appears on the blacklist. The empty string is always blacklisted.

>>> en_nl.is_blacklisted('x')
True
>>> en_nl.is_blacklisted('the')
False
>>> en_nl.is_blacklisted('a b c d')
True
>>> en_nl.is_blacklisted('a b c d puppies')
False
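
The "every word is on the blacklist" rule can be sketched in a few lines. This is only an illustration: toy_is_blacklisted and TOY_BLACKLIST are hypothetical names, and the blacklist here is assumed to contain nothing but single letters, which is enough to reproduce the examples above. The contents of the real blacklist are not shown in this document.

```python
import string

# Hypothetical blacklist: single lowercase letters only.
TOY_BLACKLIST = set(string.ascii_lowercase)


def toy_is_blacklisted(text):
    """Return True if every whitespace-separated word is on the toy blacklist."""
    # all() over an empty sequence is True, so the empty string is blacklisted.
    return all(word in TOY_BLACKLIST for word in text.lower().split())


print(toy_is_blacklisted("a b c d"))
print(toy_is_blacklisted("a b c d puppies"))
```

Using all() also accounts for the empty-string case without a special check, because a phrase with no words has no word that fails the test.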

tokenize(text)

Tokenizing a sentence inserts spaces so that punctuation is separated from words and contractions are split up, producing the form that many natural language tools (especially parsers) expect as input.

>>> en_nl.tokenize("Time is an illusion. Lunchtime, doubly so.")
'Time is an illusion . Lunchtime , doubly so .'
>>> untok = '''
... "Very deep," said Arthur, "you should send that in to the
... Reader's Digest. They've got a page for people like you."
... '''
>>> tok = en_nl.tokenize(untok)
>>> tok
"`` Very deep , '' said Arthur , `` you should send that in to the Reader 's Digest . They 've got a page for people like you . ''"
>>> en_nl.untokenize(tok)
'"Very deep," said Arthur, "you should send that in to the Reader\'s Digest. They\'ve got a page for people like you."'
>>> en_nl.untokenize(tok) == untok.replace('\n', ' ').strip()
True

untokenize(text)

Untokenizing a text undoes the tokenizing operation, restoring punctuation and spaces to the places that people expect them to be.

Ideally, untokenize(tokenize(text)) should be identical to text, except for line breaks.
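
To make the round-trip property concrete, here is a deliberately simplified sketch. The functions toy_tokenize and toy_untokenize are hypothetical and handle only a handful of punctuation marks. They ignore quotation marks and contractions, which the real tokenizer does handle, so this is not how EuroNL is implemented.

```python
import re


def toy_tokenize(text):
    """Surround basic punctuation with spaces, then collapse all whitespace."""
    spaced = re.sub(r"([.,!?;:])", r" \1 ", text)
    # split() with no arguments also swallows line breaks.
    return " ".join(spaced.split())


def toy_untokenize(text):
    """Remove the space that toy_tokenize put before each punctuation mark."""
    return re.sub(r" ([.,!?;:])", r"\1", text)


original = "Time is an illusion.\nLunchtime, doubly so."
tokens = toy_tokenize(original)
print(tokens)
print(toy_untokenize(tokens) == original.replace("\n", " "))
```

Because whitespace is collapsed during tokenizing, line breaks are the one thing the round trip does not restore, which matches the caveat above.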

class csc.nl.euro.LemmatizedEuroNL(lang, exceptions=None)

normalize(text)

When you normalize a string (no relation to the operation of normalizing a vector), you remove its stopwords and inflections so that it becomes equivalent to similar strings.

Normalizing involves running lemma_split() and keeping only the first factor, thus discarding the information that would be used to reconstruct the full string.

>>> en_nl.normalize("This is the testiest test that ever was tested")
u'testy test ever test'

word_split(word)

Divide a single word into a string representing its lemma form (its base form without inflections), and a second string representing the inflections that were removed.

Instead of abstract symbols for the inflection, we currently represent inflections as their most common natural language string. For example, the inflection string ‘s’ represents both “plural” and “third-person singular”.

This odd representation basically makes the assumption that, when two inflections look the same, they will act the same on any word. Thus, we can avoid trying to disambiguate different inflections when they will never make a difference. (There are cases where this is not technically correct, such as “leafs/leaves” in “there were leaves on the ground” versus “he leafs through the pages”, but we don’t lose sleep over it.)

>>> en_nl.word_split(u'lemmatizing')
(u'lemmatize', u'ing')
>>> en_nl.word_split(u'cow')
(u'cow', u'')
>>> en_nl.word_split(u'went')
(u'go', u'ed')
>>> en_nl.word_split(u'people')
(u'person', u's')

lemma_split(text, keep_stopwords=False)

When you lemma split or lemma factor a string, you get two strings back:

  1. The normal form, a string containing all the lemmas of the non-stopwords in the string.
  2. The residue, a string containing all the stopwords and the inflections that were removed.

These two strings can be recombined with lemma_combine().

>>> en_nl.lemma_split("This is the testiest test that ever was tested")
(u'testy test ever test', u'this is the 1iest 2 that 3 was 4ed')

lemma_combine(lemmas, residue)

This is the inverse of lemma_split(): it takes a normal form and a residue, and reassembles them into a phrase that is hopefully comprehensible.

>>> en_nl.lemma_combine(u'testy test ever test',
... u'this is the 1iest 2 that 3 was 4ed')
u'this is the testiest test that ever was tested'
>>> en_nl.lemma_combine(u'person', u'1s')
u'people'
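
The residue format is easier to follow with a toy model. In the sketch below, each numbered token in the residue points at the lemma in that position of the normal form, followed by the inflection that was removed. Everything here is hypothetical: the toy_ functions, the stopword set, and the three-entry lemma table stand in for the MBLEM lemmatizer and the tuned stopword lists. Recombination is plain string concatenation, so irregular cases such as u'person' plus u's' giving u'people' are out of its reach.

```python
import re

# Hypothetical toy data standing in for the real stopword list and lemmatizer.
TOY_STOPWORDS = {"this", "is", "the", "that", "was"}
TOY_LEMMAS = {"tested": ("test", "ed"), "tests": ("test", "s"), "walked": ("walk", "ed")}


def toy_word_split(word):
    """Return (lemma, inflection); unknown words are their own lemma."""
    return TOY_LEMMAS.get(word, (word, ""))


def toy_lemma_split(text):
    """Return (normal form, residue) for a lowercased, space-separated phrase."""
    lemmas, residue = [], []
    for word in text.lower().split():
        if word in TOY_STOPWORDS:
            residue.append(word)  # stopwords are stored verbatim in the residue
        else:
            lemma, inflection = toy_word_split(word)
            lemmas.append(lemma)
            # 1-based index of the lemma, followed by the removed inflection
            residue.append("%d%s" % (len(lemmas), inflection))
    return " ".join(lemmas), " ".join(residue)


def toy_normalize(text):
    """Keep only the first factor, discarding the residue."""
    return toy_lemma_split(text)[0]


def toy_lemma_combine(lemmas, residue):
    """Rebuild a phrase by naive concatenation of lemma and inflection."""
    lemma_list = lemmas.split()
    words = []
    for token in residue.split():
        match = re.match(r"(\d+)(.*)$", token)
        if match:
            words.append(lemma_list[int(match.group(1)) - 1] + match.group(2))
        else:
            words.append(token)
    return " ".join(words)


normal, residue = toy_lemma_split("This is the test that ever was tested")
print(normal)
print(residue)
print(toy_lemma_combine(normal, residue))
```

The sketch also shows why normalize() loses information: once the residue is dropped, "test" and "tested" are indistinguishable.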

lemmatizer

The .lemmatizer property lazily loads an MBLEM lemmatizer from the disk. The resulting object is an instance of csc.nl.mblem.trie.Trie.

unlemmatizer

The .unlemmatizer property lazily loads an MBLEM unlemmatizer from the disk. The resulting object is a dictionary of tries, one for each possible combination of part-of-speech and inflection that can be added.

class csc.nl.euro.StemmedEuroNL(lang, exceptions=None)
