Tutorial

This tutorial walks you through all the steps required to analyze and visualize your data using textnets. It first presents a self-contained example before turning to practical questions that arise when you apply textnets to your own data.

Example

Tip

Download this example as a Jupyter notebook so you can follow along: tutorial.ipynb.

You can also make this tutorial “live” so you can adjust the example code and re-run it.

To use textnets in a project, you typically need the following imports:

from textnets import Corpus, Textnet

For the purposes of demonstration, we also import the bundled example data:

from textnets import examples

Construct the corpus from the example data:

corpus = Corpus(examples.moon_landing)

What is this moon_landing example all about?

corpus
label
The Guardian         3:56 am: Man Steps On to the Moon
New York Times       Men Walk on Moon -- Astronauts Land on Plain, Collect Rocks, Plant Flag
Boston Globe         Man Walks on Moon
Houston Chronicle    Armstrong and Aldrich "Take One Small Step for Man" on the Moon
Washington Post      The Eagle Has Landed -- Two Men Walk on the Moon
Chicago Tribune      Giant Leap for Mankind -- Armstrong Takes 1st Step on Moon
Los Angeles Times    Walk on Moon -- That's One Small Step for Man, One Giant Leap for Mankind
Corpus (Docs: 7, Lang: en_core_web_sm)

Note

Hat tip to Chris Bail for this example data!

Next, we create the textnet:

tn = Textnet(corpus.tokenized(), min_docs=1)

We’re using tokenized with all defaults, so textnets is removing stop words, applying stemming, and removing punctuation marks, numbers, URLs and the like. However, we’re overriding the default setting for min_docs, opting to keep even words that appear in only one document (that is, a single newspaper headline).

Let’s take a look:

tn.plot(label_term_nodes=True,
        label_doc_nodes=True,
        show_clusters=True)
_images/tutorial_5_0.svg

The show_clusters option marks the partitions found by the Leiden community detection algorithm. It identifies document–term groups that appear to form part of the same theme in the texts.

You may be wondering: why is the moon drifting off by itself in the network plot? That's because the word moon appears in every single document (exactly once in each headline), so its inverse document frequency, and with it its tf-idf weight, is zero for every document.
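If you want to verify this by hand, here is the arithmetic, assuming the standard inverse document frequency formula idf = log(N / df):

from math import log

n_docs = 7          # headlines in the corpus
docs_with_moon = 7  # "moon" occurs in every headline

idf = log(n_docs / docs_with_moon)  # log(1) == 0.0
# any term frequency multiplied by an idf of zero yields a tf-idf weight of zero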

We can also visualize the projected networks.

First, the network of newspapers:

papers = tn.project(node_type='doc')
papers.plot(label_nodes=True)
_images/tutorial_6_0.svg

As in the bipartite network above, we can see that the Houston Chronicle, Chicago Tribune, and Los Angeles Times cluster more closely together.

Next, the term network:

words = tn.project(node_type='term')
words.plot(label_nodes=True,
           show_clusters=True)
_images/tutorial_7_0.svg

Aside from visualization, we can also analyze our corpus using network metrics. For instance, documents with high betweenness centrality (or “cultural betweenness”; [Bai16]) might link together themes, thereby stimulating exchange across symbolic divides.

papers.top_betweenness()
Los Angeles Times    7.0
Boston Globe         0.0
Chicago Tribune      0.0
Houston Chronicle    0.0
New York Times       0.0
The Guardian         0.0
Washington Post      0.0
dtype: float64

As we can see, the Los Angeles Times is a cultural bridge linking the headline themes of the East Coast newspapers to the others.

words.top_betweenness()
walk         72.00
man          18.00
step         16.00
small        12.75
land          6.00
giant         6.00
leap          6.00
mankind       6.00
armstrong     3.25
plain         0.00
dtype: float64

It’s because the Times uses the word “walk” in its headline, linking the “One Small Step” cluster to the “Man on Moon” cluster.

We can produce the term graph plot again, this time scaling nodes according to their betweenness centrality, and pruning edges from the graph using “backbone extraction” ([SBogunaV09]).

We can also use color_clusters (instead of show_clusters) to color nodes according to their partition.

And we can filter node labels, labeling only those nodes that have a betweenness centrality score above the median. This is particularly useful in graphs with many nodes, where labeling every single node would create too much visual clutter.

words.plot(label_nodes=True,
           scale_nodes_by='betweenness',
           color_clusters=True,
           alpha=0.5,
           node_label_filter=lambda n: n.betweenness() > words.betweenness.median())
_images/tutorial_10_1.svg

Wrangling Text & Mangling Data

How to go from this admittedly contrived example to working with your own data? The following snippets are meant to help you get started. The first thing is to get your data in the right shape.

A textnet is built from a collection—or corpus—of texts, so we use the Corpus class to get our data ready. Each of the following snippets assumes that you have imported Corpus and Textnet like in the preceding example.

From Pandas

You may already have your texts in a Python data structure. Corpus can read documents directly from pandas’ Series or DataFrame; mangling your data into the appropriate format should only take one or two easy steps. The important thing is to have the texts in one column, and the document labels as the index.
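For instance, here is a minimal sketch of getting data into that shape; the column name tekst, the labels, and the placeholder texts are made up for illustration:

import pandas as pd

# Hypothetical input: three labeled documents in a single text column
df = pd.DataFrame(
    {'tekst': ['Eerste document ...', 'Tweede document ...', 'Derde document ...']},
    index=['Document A', 'Document B', 'Document C'])

# Equivalently, a Series: texts as values, document labels as the index
series = df['tekst']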

corpus = Corpus(series, lang='nl')
# or alternately:
corpus = Corpus.from_df(df, doc_col='tekst', lang='nl')

If you do not specify doc_col, textnets assumes that the first column containing strings is the one you meant.

You can specify which language model to use with the lang argument. The default is English, but you don't have to be monolingual to use textnets. (Languages listed in LANGS are fully supported, since spaCy provides statistical language models for them. Other languages are only partially supported, so noun_phrases will likely not function.)

From a database or CSV file

You can also use Corpus to load your documents from a database or comma-separated value file using from_sql and from_csv respectively.

import sqlite3

with sqlite3.connect('documents.db') as conn:
    articles = Corpus.from_sql('SELECT title, text FROM articles', conn)

As before, you can pass doc_col to specify which column contains your texts. You can also pass label_col to indicate the column containing document labels. By default, from_sql uses the first column as the label_col and the first column after that containing strings as the doc_col.

blog = Corpus.from_csv('blog-posts.csv',
                       label_col='slug',
                       doc_col='summary',
                       sep=';')

Both from_sql and from_csv accept additional keyword arguments that are passed to pandas.read_sql and pandas.read_csv respectively.

From Files

Perhaps you have each document you want to include in your textnet stored on disk in a separate text file. For such cases, Corpus comes with a utility, from_files(). You can simply pass a path to it using a globbing pattern:

corpus = Corpus.from_files('/path/to/texts/*.txt')

You can also pass it a list of paths:

corpus = Corpus.from_files(['kohl.txt', 'schroeder.txt', 'merkel.txt'],
                           doc_labels=['Kohl', 'Schröder', 'Merkel'],
                           lang='de')

You can optionally pass explicit labels for your documents using the argument doc_labels. Without this, labels are inferred from file names by stripping off the file suffix.

Break It Up

The textnet is built from chunks of texts. Corpus offers three methods for breaking your texts into chunks: tokenized, ngrams, and noun_phrases. The first breaks your texts up into individual words, the second into n-grams of desired size, while the third looks for noun phrases such as “my husband,” “our prime minister,” or “the virus.”

trigrams = corpus.ngrams(3)
np = corpus.noun_phrases(remove=['Lilongwe', 'Mzuzu', 'Blantyre'])

Warning

For large corpora, some of these operations can be computationally intense. Use your friendly neighborhood HPC cluster or be prepared for your laptop to get hot.

Calling these methods results in another data frame, which we can feed to Textnet to make our textnet.

Make Connections

A textnet is a bipartite network of terms (words or phrases) and documents (which often represent the people or groups who authored them). We create the textnet from the processed corpus using the Textnet class.

tn = Textnet(np)

Textnet takes a few optional arguments. The most important one is min_docs. It determines how many documents a term must appear in to be included in the textnet. A term that appears only in a single document creates no link, so the default value is 2. However, this can lead to a very noisy graph, and usually only terms that appear in a significant proportion of documents really indicate latent topics, so it is common to pass a higher value.

A boolean argument, sublinear, decides whether to use sublinear (logarithmic) scaling when calculating tf-idf for edge weights. The default is True because sublinear scaling is considered good practice in the information retrieval literature ([MRSchutze08]), but there may be good reason to turn it off.
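For instance, both settings can be adjusted in a single call (the threshold of 5 below is an arbitrary choice for illustration):

# keep only terms that appear in at least 5 documents and use
# plain (linear) term frequencies when computing tf-idf edge weights
tn = Textnet(corpus.tokenized(), min_docs=5, sublinear=False)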

doc_attrs allows setting additional attributes for documents that become node attributes in the resulting network graph. For instance, if texts represent views of members of different parties, we can set a party attribute.

tn = Textnet(corpus.tokenized(), doc_attrs=df[['party']].to_dict())

Seeing Results

You are now ready to see the first results. Textnet comes with a utility method, plot, which allows you to quickly visualize the bipartite graph.

For bipartite graphs, it can be helpful to use a layout option, such as bipartite_layout, circular_layout, or sugiyama_layout, which help to spatially separate the two node types.

You may want terms that are used in more documents to appear bigger in the graph. In that case, use the scale_nodes_by argument with the value degree. Other useful options include label_term_nodes, label_doc_nodes, and label_edges. These are all boolean options, so simply pass the value True to enable them.

Finally, enabling show_clusters will draw polygons around detected groups of nodes with a community structure.
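Putting these options together might look like the following sketch; it assumes that, as with the other settings, a layout is selected by passing a boolean keyword argument:

tn.plot(sugiyama_layout=True,      # spatially separate term and document nodes
        scale_nodes_by='degree',   # terms used in more documents appear bigger
        label_term_nodes=True,
        label_doc_nodes=True,
        label_edges=True,
        show_clusters=True)        # draw polygons around detected communities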

Projecting

Depending on your research question, you may be interested in how either the terms or the documents are connected. You can project the bipartite network into a single-mode network of either kind.

groups = tn.project(node_type='doc')
groups.summary()

The resulting network only contains nodes of the chosen type (doc or term). Edge weights are calculated, and node attributes are maintained.

Like the bipartite network, the projected textnet also has a plot method. This takes an optional argument, alpha, which can help “de-clutter” the resulting visualization by removing edges. The value of this argument is a significance level, and only edges significant at or below that level are kept. What remains in the pruned graph is called the “backbone” in the network science literature. Commonly chosen values for alpha range between 0.2 and 0.6 (with lower values resulting in more aggressive pruning).
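For example, the following prunes the projected document network at a moderate level (0.4 is just one choice within that range):

# keep only edges significant at alpha <= 0.4 -- the network's "backbone"
groups.plot(label_nodes=True, alpha=0.4)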

In visualizations of the projected network, you may want to scale nodes according to centrality. Pass the argument scale_nodes_by with a value of “betweenness,” “closeness,” “degree,” “strength,” or “eigenvector_centrality.”

Label nodes using the boolean argument label_nodes. As above, show_clusters will mark groups of nodes with a community structure.

Analysis

The tutorial above gives some examples of using centrality measures to analyze your corpus. Aside from top_betweenness, the package also provides the methods top_closeness, top_degree (for unweighted degree), top_strength (for weighted degree), and top_ev (for eigenvector centrality). By default, they each output the ten top nodes for each centrality measure.
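For example, the other rankings can be requested in the same way as top_betweenness above:

words.top_closeness()  # ten terms with the highest closeness centrality
words.top_degree()     # ten terms with the highest unweighted degree
words.top_strength()   # ten terms with the highest weighted degree
words.top_ev()         # ten terms with the highest eigenvector centrality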

In addition, you can use top_cluster_nodes to help interpret the community structure of your textnet. Clusters can either be interpreted as latent themes (in the word graph) or as groupings of documents using similar words or phrases (in the document graph).
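For instance, applied to the term graph from the tutorial above:

# list the top-ranked nodes within each detected cluster to help
# interpret the themes they represent
words.top_cluster_nodes()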