Word Co-occurrence Matrix

A common data structure in NLP (natural language processing) is the "occurrence matrix". Each cell in the occurrence matrix records the number of times (or the frequency or probability) that a given word occurs in a given set of words, usually a document or webpage. For example, each row might correspond to a word and each column to a document.
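As a minimal sketch of how such a matrix might be built, the Python snippet below counts words in a tiny, made-up corpus. The corpus, the whitespace tokenization, the words-as-rows layout, and the function name `occurrence_matrix` are all assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical toy corpus: each document is a short string.
docs = ["the cat sat", "the dog sat", "the cat ran"]

def occurrence_matrix(documents):
    """Return (vocab, matrix) where matrix[i][j] counts word i in document j."""
    # Assumption: words are separated by whitespace and need no normalization.
    counts = [Counter(doc.split()) for doc in documents]
    vocab = sorted(set().union(*counts))
    # A Counter returns 0 for words that do not appear in a document.
    matrix = [[c[word] for c in counts] for word in vocab]
    return vocab, matrix

vocab, occ = occurrence_matrix(docs)
for word, row in zip(vocab, occ):
    print(f"{word:>4} {row}")
```

Raw counts are used here; dividing each column by the length of its document would give frequencies instead.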

At a fundamental level, this occurrence matrix is really a graph (a network of connections): the element at row i and column j is the value (weight) of the connection (edge) from vertex i to vertex j. Here the vertices are of two types, words and documents, and every edge links a word to a document.
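To make the matrix-as-graph reading concrete, here is a small Python sketch that turns each nonzero cell into a weighted edge. The labels, the example values, and the helper name `matrix_to_edges` are hypothetical.

```python
def matrix_to_edges(row_labels, col_labels, matrix):
    """List one (row vertex, column vertex, weight) edge per nonzero cell."""
    edges = []
    for i, row in enumerate(matrix):
        for j, weight in enumerate(row):
            if weight:  # a zero cell means there is no edge
                edges.append((row_labels[i], col_labels[j], weight))
    return edges

# Assumed example: rows are words, columns are documents.
words = ["cat", "dog", "sat"]
documents = ["doc0", "doc1"]
occ = [[1, 0],
       [0, 1],
       [2, 1]]

edges = matrix_to_edges(words, documents, occ)
for edge in edges:
    print(edge)
```

The matrix and the edge list carry the same information; the edge list simply skips the zero cells.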

So if you start at a word in the occurrence graph, traverse to its document nodes, and from each of those documents reach all the word nodes connected to it, you can identify the words that "co-occur" in the same document. To form a new, smaller graph whose nodes are all of the same type, you can delete all the document nodes and their edges and replace them with the equivalent word-to-word edges.
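The traversal described above can be sketched in Python roughly as follows. The adjacency mapping `word_to_docs` and the function name `co_occurring_words` are made up for the example, and edge weights are ignored for simplicity.

```python
from collections import defaultdict

def co_occurring_words(start, word_to_docs):
    """Walk word -> documents -> words and collect the words reached."""
    # Invert the mapping so each document lists the words it contains.
    doc_to_words = defaultdict(set)
    for word, docs in word_to_docs.items():
        for doc in docs:
            doc_to_words[doc].add(word)
    neighbors = set()
    for doc in word_to_docs.get(start, ()):
        neighbors |= doc_to_words[doc]
    neighbors.discard(start)  # a word is not counted as its own neighbor
    return neighbors

# Hypothetical occurrence graph: each word maps to the documents it appears in.
word_to_docs = {
    "cat": {"doc0", "doc2"},
    "dog": {"doc1"},
    "sat": {"doc0", "doc1"},
    "ran": {"doc2"},
}

print(sorted(co_occurring_words("cat", word_to_docs)))
```

Each word found this way corresponds to one word-to-word edge that would replace a path through a document node.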

This results in a co-occurrence graph, which can itself be viewed as a matrix (just like the occurrence matrix). Each element ij represents the value of an edge from vertex i to vertex j, where both vertices are now words.
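One way to compute such a matrix, sketched below in Python, is to multiply the occurrence matrix by its own transpose, so that element ij sums, over all documents, the product of the counts of word i and word j. This weighting is an assumption made for illustration (other choices, such as counting only whether two words share a document, are equally plausible), and the function name and example values are hypothetical.

```python
def co_occurrence_matrix(occ):
    """Multiply a word-by-document matrix by its transpose.

    Element [i][k] sums, over documents, count(word i) * count(word k).
    """
    n_words = len(occ)
    return [
        [sum(a * b for a, b in zip(occ[i], occ[k])) for k in range(n_words)]
        for i in range(n_words)
    ]

# Assumed occurrence matrix: rows are cat, dog, sat; columns are three documents.
occ = [[1, 0, 1],
       [0, 1, 0],
       [1, 1, 0]]

co = co_occurrence_matrix(occ)
for row in co:
    print(row)
```

The result is symmetric, since an edge from word i to word j has the same value as the edge from j to i. The diagonal holds each word's product with itself and is often ignored or zeroed out.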