The csc.nl module provides tools for working with natural language text.
To use any of these methods, you must first get an NLTools object. Here is how to get that object for English:
>>> from csc.nl import get_nl
>>> en_nl = get_nl('en')
Because Language objects store a reference to their NLTools, an alternate way to get the same object is:
>>> from csc.conceptnet4.models import Language
>>> en = Language.get('en')
>>> en_nl = en.nl
An NLTools object provides methods for dealing with natural language text in a particular language.
So far, we have three classes of languages:
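With an NLTools object, you can perform these operations: checking for stopwords (is_stopword), checking the blacklist (is_blacklisted), tokenizing and untokenizing text (tokenize, untokenize), normalizing text (normalize), splitting a word into its lemma and inflections (word_split), and splitting and recombining whole phrases (lemma_split, lemma_combine).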
The subclasses of NLTools define how these operations actually work.
A language that generally follows our assumptions about European languages, including:
Only the subclasses of EuroNL – StemmedEuroNL and LemmatizedEuroNL – implement all of the NLTools operations.
A stopword is a word that contributes little to the semantic meaning of a text and should be ignored. These tend to be short, common words such as “of”, “the”, and “you”.
Stopwords are often members of closed classes such as articles and prepositions.
Whether a word is a stopword or not is a judgement call that depends on the application. In ConceptNet, we began with the stock lists of stopwords from NLTK, but we have refined and tweaked the lists (especially in English) over the years.
Examples:
>>> en_nl.is_stopword('the')
True
>>> en_nl.is_stopword('THE')
True
>>> en_nl.is_stopword('defenestrate')
False
>>> pt_nl = get_nl('pt') # This time, in Portuguese
>>> pt_nl.is_stopword('os')
True
>>> pt_nl.is_stopword('the')
False
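To make the lookup concrete, here is a minimal sketch of a case-insensitive, per-language stopword check. It is not the csc.nl implementation: the names TOY_STOPWORDS and toy_is_stopword are hypothetical, and the tiny word lists only stand in for the real, much longer ones.

```python
# Hypothetical, heavily abbreviated stopword lists keyed by language code.
# The real lists are far longer; these exist only for illustration.
TOY_STOPWORDS = {
    'en': {'of', 'the', 'you'},
    'pt': {'os', 'de', 'que'},
}


def toy_is_stopword(word, lang):
    """Return True if `word` is on the toy stopword list for `lang`."""
    # Lowercase first so that 'THE' and 'the' are treated alike.
    return word.lower() in TOY_STOPWORDS[lang]


print(toy_is_stopword('THE', 'en'))
print(toy_is_stopword('defenestrate', 'en'))
print(toy_is_stopword('the', 'pt'))
```

Keeping one list per language is what lets 'the' count as a stopword in English but not in Portuguese.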
The blacklist is used to discover and discard particularly unhelpful phrases.
A phrase is considered “blacklisted” if every word in it appears on the blacklist. The empty string is always blacklisted.
>>> en_nl.is_blacklisted('x')
True
>>> en_nl.is_blacklisted('the')
False
>>> en_nl.is_blacklisted('a b c d')
True
>>> en_nl.is_blacklisted('a b c d puppies')
False
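The "every word" rule can be sketched in a few lines. This is an illustration under stated assumptions, not the library's code: TOY_BLACKLIST and toy_is_blacklisted are hypothetical names, the blacklist contents are invented, and splitting on whitespace is an assumption about how a phrase is divided into words.

```python
# Hypothetical blacklist containing only a few single letters.
TOY_BLACKLIST = {'a', 'b', 'c', 'd', 'x'}


def toy_is_blacklisted(phrase):
    """Return True if every whitespace-separated word is blacklisted."""
    # all() over an empty sequence is True, so the empty string
    # (which has no words) always counts as blacklisted.
    return all(word in TOY_BLACKLIST for word in phrase.split())


print(toy_is_blacklisted('a b c d'))
print(toy_is_blacklisted('a b c d puppies'))
print(toy_is_blacklisted(''))
```

A single word that is not on the list, such as "puppies", is enough to keep the whole phrase.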
Tokenizing a sentence inserts spaces so that punctuation is separated from words and contractions are split up, producing the form that many natural language tools (especially parsers) expect as input.
>>> en_nl.tokenize("Time is an illusion. Lunchtime, doubly so.")
'Time is an illusion . Lunchtime , doubly so .'
>>> untok = '''
... "Very deep," said Arthur, "you should send that in to the
... Reader's Digest. They've got a page for people like you."
... '''
>>> tok = en_nl.tokenize(untok)
>>> tok
"`` Very deep , '' said Arthur , `` you should send that in to the Reader 's Digest . They 've got a page for people like you . ''"
>>> en_nl.untokenize(tok)
'"Very deep," said Arthur, "you should send that in to the Reader\'s Digest. They\'ve got a page for people like you."'
>>> en_nl.untokenize(tok) == untok.replace('\n', ' ').strip()
True
Untokenizing a text undoes the tokenizing operation, restoring punctuation and spaces to the places that people expect them to be.
Ideally, untokenize(tokenize(text)) should be identical to text, except for line breaks.
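The round-trip property can be illustrated with a deliberately small sketch. The functions simple_tokenize and simple_untokenize are hypothetical and much cruder than the real tokenizer: they handle only a handful of punctuation marks and do not split contractions or rewrite quotation marks.

```python
import re

# Punctuation marks this toy tokenizer knows about.
PUNCTUATION = r'[.,!?;:]'


def simple_tokenize(text):
    """Separate punctuation from words with single spaces."""
    # Surround each punctuation mark with spaces, then collapse
    # all runs of whitespace (including line breaks) to one space.
    spaced = re.sub('(%s)' % PUNCTUATION, r' \1 ', text)
    return ' '.join(spaced.split())


def simple_untokenize(tokenized):
    """Remove the space that tokenizing placed before punctuation."""
    return re.sub(' (%s)' % PUNCTUATION, r'\1', tokenized)


text = "Time is an illusion. Lunchtime, doubly so."
tok = simple_tokenize(text)
print(tok)
print(simple_untokenize(tok) == text)
```

Because tokenizing collapses whitespace, a line break in the input comes back as a space, which is why the round trip is only expected to match up to line breaks.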
When you normalize a string (no relation to the operation of normalizing a vector), you remove its stopwords and inflections so that it becomes equivalent to similar strings.
Normalizing involves running lemma_split() and keeping only the first factor, thus discarding the information that would be used to reconstruct the full string.
>>> en_nl.normalize("This is the testiest test that ever was tested")
u'testy test ever test'
word_split() divides a single word into two strings: its lemma (the base form without inflections) and the inflections that were removed.
Instead of abstract symbols for the inflection, we currently represent inflections as their most common natural language string. For example, the inflection string ‘s’ represents both “plural” and “third-person singular”.
This odd representation assumes that, when two inflections look the same, they will act the same on any word. Thus, we can avoid trying to disambiguate different inflections when they will never make a difference. (There are cases where this is not technically correct, such as “leafs/leaves” in “there were leaves on the ground” versus “he leafs through the pages”, but we don’t lose sleep over it.)
>>> en_nl.word_split(u'lemmatizing')
(u'lemmatize', u'ing')
>>> en_nl.word_split(u'cow')
(u'cow', u'')
>>> en_nl.word_split(u'went')
(u'go', u'ed')
>>> en_nl.word_split(u'people')
(u'person', u's')
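When you lemma split or lemma factor a string, you get two strings back: the normal form, which holds the lemmas of the non-stopwords, and the residue, which holds the stopwords and inflections needed to reconstruct the original text.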
These two strings can be recombined with lemma_combine().
>>> en_nl.lemma_split("This is the testiest test that ever was tested")
(u'testy test ever test', u'this is the 1iest 2 that 3 was 4ed')
This is the inverse of lemma_split() (lemma factoring): it takes in a normal form and a residue, and re-assembles them into a phrase that is hopefully comprehensible.
>>> en_nl.lemma_combine(u'testy test ever test',
... u'this is the 1iest 2 that 3 was 4ed')
u'this is the testiest test that ever was tested'
>>> en_nl.lemma_combine(u'person', u'1s')
u'people'
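The relationship between the normal form and the residue can be sketched with a toy, table-driven version of the two operations. Everything here is hypothetical (TOY_STOPWORDS, TOY_WORD_SPLITS, toy_lemma_split, toy_lemma_combine), and the lookup table replaces the real lemmatizer; the point is only to show how numbered placeholders in the residue refer back to lemmas in the normal form.

```python
import re

# Hypothetical data standing in for the real stopword list and lemmatizer.
TOY_STOPWORDS = {'this', 'is', 'the', 'that', 'was'}
TOY_WORD_SPLITS = {
    'testiest': ('testy', 'iest'),
    'tested': ('test', 'ed'),
    'people': ('person', 's'),
}
# Reverse table used to re-inflect a lemma.
TOY_WORD_JOINS = {pair: word for word, pair in TOY_WORD_SPLITS.items()}


def toy_lemma_split(text):
    """Return (normal form, residue) for a phrase."""
    lemmas, residue = [], []
    for word in text.lower().split():
        if word in TOY_STOPWORDS:
            # Stopwords appear only in the residue.
            residue.append(word)
        else:
            lemma, inflection = TOY_WORD_SPLITS.get(word, (word, ''))
            lemmas.append(lemma)
            # A 1-based index points back at the lemma's position.
            residue.append('%d%s' % (len(lemmas), inflection))
    return ' '.join(lemmas), ' '.join(residue)


def toy_lemma_combine(normal, residue):
    """Rebuild a phrase from a normal form and a residue."""
    lemmas = normal.split()
    words = []
    for token in residue.split():
        match = re.match(r'(\d+)(.*)$', token)
        if match is None:
            words.append(token)  # a stopword, copied through unchanged
        else:
            lemma = lemmas[int(match.group(1)) - 1]
            inflection = match.group(2)
            # Fall back to plain concatenation for words not in the table.
            words.append(TOY_WORD_JOINS.get((lemma, inflection),
                                            lemma + inflection))
    return ' '.join(words)


normal, residue = toy_lemma_split(
    "This is the testiest test that ever was tested")
print(normal)
print(residue)
print(toy_lemma_combine(normal, residue))
```

In this sketch, normalizing corresponds to keeping only the first element of the pair returned by toy_lemma_split and discarding the residue.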