The fundamental objects represented by Divisi are tensors and views.
In Divisi, you collect and manipulate data in Tensors. A Tensor, in short, is an array of data that we can do math to.
This is the general term that we use for such things throughout Divisi. The term “tensor” may evoke images of complicated theoretical physics, but that’s not what we’re going for. If you understand vectors and matrices, you understand two examples of tensors.
The most important thing to know about a tensor is its order, which is effectively its number of “dimensions”. The order of a tensor tells you how many inputs you have to give it to get a value out.
A vector is a 1st-order tensor. To look up a value in a vector, you give it a single input (an index) and get a number as an output.
In the vector v = [3, 5, -2], the possible inputs are the indices 0, 1, and 2, which give you the outputs 3, 5, and -2 respectively.
A matrix is a 2nd-order tensor. To look up a value in a matrix, you give it two inputs: a row index and a column index.
Divisi can represent tensors of order 3 or higher, as well.
Tensors of order 2 and higher can be sparse or dense. In a dense tensor, you can look up essentially any combination of indices and find a useful value. The output of an SVD, for example, is a dense matrix. Dense matrices can be expressed by arrays that list all of their values, which is how the DenseTensor class works.
In a sparse tensor, most combinations of indices bring you to values that are zero or “missing”. Only a relatively small subset of values – the useful ones – are non-zero. As an example, when ConceptNet is represented as a matrix, it is a sparse matrix. Sparse tensors can be expressed as dictionaries that enumerate only the useful values, leaving the large number of zero values implied, which is how the DictTensor class works.
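To make the contrast concrete, here is a minimal sketch in plain Python. It does not use Divisi's own classes: the names `dense`, `sparse`, and `sparse_lookup` are made up for this illustration, and the flat `(row, column)` keys are a simplifying assumption.

```python
# Dense storage: every entry is listed, including the zeros.
dense = [
    [0, 0, 2, 0],
    [0, 0, 0, 0],
    [0, -1, 0, 0],
]

# Sparse storage: only the non-zero entries are recorded.
sparse = {(0, 2): 2, (2, 1): -1}


def sparse_lookup(entries, row, col, default=0):
    """Return the stored value, or the implied default for a missing key."""
    return entries.get((row, col), default)


# Both representations answer every lookup the same way.
same = all(
    dense[r][c] == sparse_lookup(sparse, r, c)
    for r in range(3)
    for c in range(4)
)
```

The dense form stores twelve numbers; the sparse form stores two and leaves the remaining ten implied.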
Tensors support various operations through their API, described below. However, they also act like Python dictionaries wherever possible – for example, you can use values() to get a list of their values, the [] operator to access those values, and iteritems() to get an iterator over keys and values.
A View wraps around a Tensor to change the way you look at the data in it. One of the most important uses of Views in Divisi is to label a tensor.
In a standard mathematical vector or matrix, the indices are simply integers, like in a Python list. However, your data is presumably not just numbers indexed by other numbers.
Instead, Divisi tensors can refer to data using meaningful indices that you choose, such as strings of text, which makes them act more like Python dictionaries than lists. This functionality is provided by divisi.labeled_view.LabeledView; see The labeled_view module for more.
Different kinds of views can be layered on top of one another. To access the contents of the view v, no matter whether it wraps a tensor or another view, you can call v.unwrap() (as of version 0.6). unwrap() also works on plain Tensors, returning the array or dictionary that is being used to store the data.
Here is an example of using a LabeledView to store labeled data:
>>> from csc.divisi.labeled_view import make_sparse_labeled_tensor
>>> t = make_sparse_labeled_tensor(ndim=2)
>>> t['grass', 'green'] = 2
>>> t['grass', 'red'] = -2
>>> t['apple', 'red'] = 3
>>> t
<LabeledView of <DictTensor shape: (2, 2); 3 items>, keys like: ('grass', 'green')>
Now let’s see what we got by removing layers of abstraction:
>>> t.unwrap()
<DictTensor shape: (2, 2); 3 items>
>>> t.unwrap().unwrap()
{0: {0: 2, 1: -2}, 1: {1: 3}}
>>> t.label_lists()
[OrderedSet(['grass', 'apple']), OrderedSet(['green', 'red'])]
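The output above suggests how labeling can work: each mode keeps an ordered list of labels, and a label's position in that list is the integer index used by the underlying tensor. The following is a sketch of that idea in plain Python, under that assumption. `LabelIndex` is a hypothetical helper, not Divisi's OrderedSet.

```python
class LabelIndex:
    """Hypothetical stand-in for an ordered set of labels."""

    def __init__(self):
        self.labels = []      # position -> label
        self.positions = {}   # label -> position

    def add(self, label):
        """Return the label's integer index, assigning a new one if needed."""
        if label not in self.positions:
            self.positions[label] = len(self.labels)
            self.labels.append(label)
        return self.positions[label]


rows, cols = LabelIndex(), LabelIndex()
data = {}  # nested dictionaries: row index -> {column index -> value}

entries = [("grass", "green", 2), ("grass", "red", -2), ("apple", "red", 3)]
for row, col, value in entries:
    data.setdefault(rows.add(row), {})[cols.add(col)] = value
```

With the same three entries as the example, `data` ends up with the same nested structure that `t.unwrap().unwrap()` printed, and `rows.labels` and `cols.labels` hold the labels in the order they were first seen.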
Note
Whether a tensor is labeled is independent of whether it is sparse or dense. All combinations of these are possible in Divisi.
The base objects of Divisi, Tensors and Views, are defined in the tensor module.
A Tensor, in short, is an n-dimensional array that we can do math to.
Often, n=2, in which case the tensor would be better known as a matrix. Or sometimes n=1, in which case it’s a vector. But Divisi can also deal with n=3 and beyond.
Divisi uses many different kinds of tensors to store data. Fundamentally, there are DenseTensors, which are built around NumPy arrays, and DictTensors, which store data sparsely in dictionaries; then, there are various kinds of views that wrap around them to let you work with your data.
Tensors are the main type of object handled by Divisi. This is the base class for all Tensors.
Tensors act like dictionaries whenever possible, so you can use dictionary methods such as keys, iteritems, and update as you would for a dictionary. They also act like NumPy arrays in some ways, but acting like a dictionary takes priority.
Many tensor operations refer to modes; these are the dimensions by which a tensor is indexed. For example, a matrix has two modes, numbered 0 and 1. Mode 0 refers to rows, and mode 1 refers to columns.
Obscure terminology: For higher-dimensional tensors, mode 2 has sometimes been called “tubes”. Modes 3 and higher don’t have names.
Layer a view onto this tensor.
Tensors can be wrapped by various kinds of View. This method adds a view in the appropriate place.
In this case, this is a plain Tensor, so it simply passes it to the view’s constructor, but other Views will override this to layer on the new View in the appropriate way.
Returns the product of this tensor and a scalar.
The * operation does this when given a scalar as its second argument.
Takes two tensors and combines them element by element using op.
For example, given input tensors a and b, result[i] = op(a[i], b[i]), for all indices i in a and b. This operation requires a and b to have the same indices; otherwise, it doesn’t make any sense.
(Note that, for the sake of efficiency, this doesn’t run op on keys that neither a nor b have.)
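One way to read the note above, sketched in plain Python rather than with Divisi's classes: for sparse data, only keys stored in at least one of the two tensors are visited. The function name `combine` and the `default` parameter are assumptions made for this illustration.

```python
import operator


def combine(a, b, op, default=0):
    """Combine two dict-based tensors entry by entry using op.

    Only keys stored in a or b are visited; keys that neither
    tensor stores are skipped entirely.
    """
    keys = set(a) | set(b)
    return {key: op(a.get(key, default), b.get(key, default)) for key in keys}


a = {(0,): 1, (1,): 2}
b = {(1,): 10, (2,): 5}
total = combine(a, b, operator.add)
```

Here `total` has three entries, one for each key that appears in `a` or `b`.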
Return the cosine of the angle of this vector with another.
A dot B = |A| |B| cos θ => cos θ = (A dot B) / (|A| |B|)
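The identity above translates directly into code. This is a sketch for vectors stored as plain dictionaries, not Divisi's implementation; the name `cosine` and the choice to treat missing entries as zero are assumptions.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two dict-based vectors.

    Entries missing from either vector are treated as zero.
    Raises ZeroDivisionError if either vector has zero norm.
    """
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b)


parallel = cosine({0: 1.0, 1: 2.0}, {0: 2.0, 1: 4.0})
orthogonal = cosine({0: 1.0}, {1: 1.0})
```

Parallel vectors give a cosine of 1 (up to floating-point rounding) and orthogonal vectors give 0.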
Get the product of two tensors, using matrix multiplication.
When two tensors a and b are multiplied, the entries in the result come from the dot products of the last mode of a with the first mode of b. So the product of a k-dimensional tensor with an m-dimensional tensor will have (k + m - 2) dimensions.
The * operation on two tensors uses this method.
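The rule about modes can be checked with a small sketch. It stores tensors as dictionaries keyed by index tuples and uses a deliberately simple double loop, so it illustrates the contraction rather than Divisi's actual algorithm; `tensor_dot` is a hypothetical name.

```python
from collections import defaultdict


def tensor_dot(a, b):
    """Contract the last mode of a with the first mode of b.

    Each result key is a's key without its last index followed by
    b's key without its first index, so a k-mode tensor times an
    m-mode tensor yields k + m - 2 modes.
    """
    result = defaultdict(float)
    for key_a, value_a in a.items():
        for key_b, value_b in b.items():
            if key_a[-1] == key_b[0]:
                result[key_a[:-1] + key_b[1:]] += value_a * value_b
    return dict(result)


matrix = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}
vector = {(0,): 5, (1,): 6}
product = tensor_dot(matrix, vector)
```

A matrix (2 modes) times a vector (1 mode) gives a result whose keys have one index, which matches k + m - 2 = 1.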
Extract the items at either extreme of the tensor. Returns (biggest, smallest).
In many applications, we’re not interested in all the values in a particular tensor we calculated, just the ones with the highest magnitude. extremes returns the n_biggest biggest and n_smallest smallest values in a tensor, along with their indices, in the form used by dict.items().
The results are ordered by increasing value. For example, if you ask for:
biggest, smallest = tensor.extreme_items(10, 10)
Then the smallest item will be found as smallest[0] and the biggest as biggest[-1].
If this is a vector (as it often is), note that the indices are going to be tuples with one thing in them, for consistency. If you don’t like that, pass detuple=True.
You can pass a filter(key, value) that will be called on each item; the item will be included only if the filter returns a true value.
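The behaviour described above can be approximated in a few lines of plain Python. This sketch is not Divisi's code: the function name `extremes_sketch` is made up, and the filter argument is called `keep` here to avoid shadowing Python's built-in `filter`.

```python
def extremes_sketch(tensor, n_biggest=10, n_smallest=10, detuple=False, keep=None):
    """Return (biggest, smallest) lists of (key, value) pairs.

    Both lists are ordered by increasing value. `tensor` is a dict
    keyed by index tuples; `keep(key, value)` optionally filters items.
    """
    items = [
        (key, value)
        for key, value in tensor.items()
        if keep is None or keep(key, value)
    ]
    if detuple:
        # Unwrap one-element index tuples, e.g. (3,) -> 3.
        items = [(key[0], value) for key, value in items]
    items.sort(key=lambda item: item[1])
    biggest = items[-n_biggest:] if n_biggest else []
    smallest = items[:n_smallest]
    return biggest, smallest


vec = {(0,): 3, (1,): 5, (2,): -2, (3,): 0}
biggest, smallest = extremes_sketch(vec, 2, 1)
```

As in the description, `smallest[0]` holds the smallest item and `biggest[-1]` holds the biggest.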
Return a normalized version of this tensor (generally a vector), by dividing by its norm.
This is not the same as .normalized(), which normalizes all the things (generally vectors) that make up a tensor (generally a matrix), using a NormalizedView.
Multiplies this tensor in-place by a scalar.
The *= operator uses this method.
Get an iterator over the keys of a specified mode of the tensor.
Example usage:
for row in tensor.iter_dim_keys(0):
    do_something_to(row)
Get a LabeledView of this tensor.
labels should be a list of OrderedSets, one for each mode, which assign labels to its indices, or None if that mode should remain unlabeled.
Calculate the Frobenius (or Euclidean) norm of this tensor.
The Frobenius norm of a tensor is simply the square root of the sum of the squares of its elements. For a vector, this is the same as the Euclidean norm.
NOTE: This function was incorrect before 2009-06-01. Check your usage.
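As a point of comparison when checking your usage, the definition can be written out in plain Python. This is a sketch over a dictionary of stored values, not Divisi's implementation; `frobenius_norm` is a hypothetical name.

```python
import math


def frobenius_norm(tensor):
    """Square root of the sum of the squares of all stored values."""
    return math.sqrt(sum(value * value for value in tensor.values()))


vector_norm = frobenius_norm({(0,): 3.0, (1,): 4.0})
matrix_norm = frobenius_norm({(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 4.0})
```

For the vector [3, 4] this is the familiar Euclidean length 5; the 2×2 matrix has squared entries summing to 25, so its norm is also 5.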
Get a divisi.normalized_view.NormalizedView of this tensor.
Gets a tensor of only the entries indexed by index on mode mode.
The resulting tensor will have one mode fewer than the original tensor. For example, a slice of a 2-D matrix is a 1-D vector.
Examples:
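As a rough illustration in plain Python (not Divisi's own implementation), slicing a tensor stored as a dictionary of index tuples could look like this. `slice_mode` is a hypothetical name.

```python
def slice_mode(tensor, mode, index):
    """Keep entries whose key has `index` at position `mode`,
    dropping that position from each key."""
    return {
        key[:mode] + key[mode + 1:]: value
        for key, value in tensor.items()
        if key[mode] == index
    }


matrix = {(0, 0): 1, (0, 1): 2, (1, 1): 3}
row_0 = slice_mode(matrix, 0, 0)     # slice on mode 0: a row
column_1 = slice_mode(matrix, 1, 1)  # slice on mode 1: a column
```

Each slice of the two-mode matrix has keys with a single index, so it is a vector.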
Computes the weighted sum of slices along mode. Weights are specified as (key, weight).
This is a slight generalization of the concept of an ad hoc category.
You can also pass in just a list of keys as weights and set constant_weight to a value (like 1.0) that you want to weight everything by.
Some items may be missing. Pass ignore_missing=True, and you’ll get back a tuple: (weighted sum, missing items).
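The options described above can be sketched as follows, again over a plain dictionary keyed by index tuples. The name `weighted_slice_sum_sketch` is made up, and raising `KeyError` for a missing key when `ignore_missing` is false is an assumption, since the text does not say what happens in that case.

```python
def weighted_slice_sum_sketch(tensor, mode, weights,
                              constant_weight=None, ignore_missing=False):
    """Sum the slices of `tensor` along `mode`, each scaled by its weight."""
    if constant_weight is not None:
        # `weights` is a bare list of keys; give each the same weight.
        weights = [(key, constant_weight) for key in weights]
    present = {key[mode] for key in tensor}
    total, missing = {}, []
    for index, weight in weights:
        if index not in present:
            if not ignore_missing:
                raise KeyError(index)  # assumed behaviour, see lead-in
            missing.append(index)
            continue
        for key, value in tensor.items():
            if key[mode] == index:
                rest = key[:mode] + key[mode + 1:]
                total[rest] = total.get(rest, 0.0) + weight * value
    return (total, missing) if ignore_missing else total


t = {("grass", "green"): 2, ("grass", "red"): -2, ("apple", "red"): 3}
weighted = weighted_slice_sum_sketch(t, 0, [("grass", 1.0), ("apple", 2.0)])
```

Here `weighted` combines the "grass" row with twice the "apple" row, giving a vector over the column labels.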
A sparse tensor that stores data in nested dictionaries.
The first level of dictionaries specifies the rows, the second level specifies columns, and so on for higher modes. Therefore, slicing by rows is the easiest to do. Despite this, you can slice on any mode, possibly returning a divisi.sliced_view.SlicedView for the sake of efficiency.
DictTensors can save a lot of memory, can efficiently provide input to a Lanczos SVD (see The svd module), and work well with divisi.labeled_view.LabeledView objects. However, for some operations you may need to convert the DictTensor to a DenseTensor.
Create a new, empty DictTensor with ndim dimensions.
Frequently, ndim is 2, creating a sparse matrix.
default_value is the value of all unspecified entries. An SVD will only work when default_value is 0.0.
Returns a new DictTensor that is the transpose of this tensor.
Only works for matrices (i.e., tensor.ndim == 2).
Calculate how dense the tensor is.
Returns (num specified elements)/(num possible elements).
Note that some specified elements may be zero.
Removes any values that are specified as the default value.
Note: this method can’t remove any empty rows or columns, since that would require changing indices. You may be interested in csc.divisi.labeled_view.LabeledView.with_zeros_removed().
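Density and the removal of default values interact, because an explicitly stored zero still counts as a specified element. A small sketch in plain Python shows the effect; `density` and `purge_defaults` are hypothetical names, not Divisi's methods.

```python
def density(entries, shape):
    """(number of specified elements) / (number of possible elements)."""
    possible = 1
    for size in shape:
        possible *= size
    return len(entries) / possible


def purge_defaults(entries, default_value=0.0):
    """Return a copy without entries equal to the default value."""
    return {
        key: value
        for key, value in entries.items()
        if value != default_value
    }


entries = {(0, 0): 2.0, (0, 1): 0.0, (1, 1): 3.0}
before = density(entries, (2, 2))                 # the stored zero is counted
after = density(purge_defaults(entries), (2, 2))  # the stored zero is gone
```

The shape stays (2, 2) in both cases, consistent with the note that rows and columns are never removed.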
A representation of a Tensor, based on NumPy arrays.
DenseTensors can be created from NumPy arrays and converted to NumPy arrays. This makes DenseTensors good for performing math operations, since it allows you to use NumPy’s optimized math libraries.
Create a DenseTensor from a NumPy array.
Make a new DenseTensor containing all the rows in this DenseTensor, followed by all the rows in another.
The Tensors to be concatenated need to have compatible shapes.
Divide the data into clusters using k-means clustering.
Returns the pair of lists (means, clusters). For each index i, means[i] is the mean vector of a cluster, and clusters[i] is the list of items in that cluster.
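For readers unfamiliar with the technique, here is a minimal k-means sketch in plain Python that returns the same kind of (means, clusters) pair. It is not Divisi's implementation: the name `k_means_sketch`, the use of the first k points as starting means, and the fixed iteration cap are all assumptions made for illustration, and the clusters here hold the points themselves.

```python
def k_means_sketch(points, k, iterations=20):
    """Cluster equal-length tuples of numbers; returns (means, clusters)."""
    means = [tuple(point) for point in points[:k]]  # naive initialisation
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest mean.
        clusters = [[] for _ in range(k)]
        for point in points:
            distances = [
                sum((x - m) ** 2 for x, m in zip(point, mean))
                for mean in means
            ]
            clusters[distances.index(min(distances))].append(point)
        # Update step: recompute each mean; an empty cluster keeps its old mean.
        new_means = [
            tuple(sum(column) / len(cluster) for column in zip(*cluster))
            if cluster else mean
            for cluster, mean in zip(clusters, means)
        ]
        if new_means == means:
            break  # converged
        means = new_means
    return means, clusters


points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)]
means, clusters = k_means_sketch(points, 2)
```

For each index i, `means[i]` is the mean of the points in `clusters[i]`, mirroring the return value described above.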
A view is a wrapper around a tensor (always self.tensor) that performs some operations (usually __getitem__ and/or __setitem__) differently.
For almost all purposes, it acts like a Tensor itself. Unknown methods are passed through to the underlying Tensor.
This is just the abstract base class. Some useful Views include csc.divisi.labeled_view.LabeledView and csc.divisi.normalized_view.NormalizedView.
Create a new View wrapping a Tensor.
Generate all valid indices in a tensor of the given shape.
>>> list(outer_tuple_iterator((3,)))
[(0,), (1,), (2,)]
>>> list(outer_tuple_iterator((2,2)))
[(0, 0), (0, 1), (1, 0), (1, 1)]
>>> list(outer_tuple_iterator(()))
[]
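The same sequences can be produced with the standard library. The sketch below is an assumption about one way to do it, not Divisi's source; `outer_tuples` is a hypothetical name. Note that `itertools.product()` with no arguments yields a single empty tuple, so the empty shape is handled separately to match the output shown above.

```python
from itertools import product


def outer_tuples(shape):
    """Yield every valid index tuple for a tensor of the given shape."""
    if not shape:
        # Match the documented behaviour: an empty shape has no indices.
        return iter(())
    return product(*(range(size) for size in shape))


indices_1d = list(outer_tuples((3,)))
indices_2d = list(outer_tuples((2, 2)))
indices_empty = list(outer_tuples(()))
```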