Tutorial

This tutorial walks you through all the steps required to analyze and visualize your data using textnets. It first presents a self-contained example before turning to practical questions that arise when you apply textnets to your own data.

Example

Tip

Download this example as a Jupyter notebook so you can follow along: tutorial.ipynb.

You can also make this tutorial “live” so you can adjust the example code and re-run it.

To use textnets in a project, you typically need the following imports:

from textnets import Corpus, Textnet

For the purposes of demonstration, we also import the bundled example data:

from textnets import examples

Construct the corpus from the example data:

corpus = Corpus(examples.moon_landing)

What is this moon_landing example all about?

corpus
label
The Guardian         3:56 am: Man Steps On to the Moon
New York Times       Men Walk on Moon -- Astronauts Land on Plain, Collect Rocks, Plant Flag
Boston Globe         Man Walks on Moon
Houston Chronicle    Armstrong and Aldrich "Take One Small Step for Man" on the Moon
Washington Post      The Eagle Has Landed -- Two Men Walk on the Moon
Chicago Tribune      Giant Leap for Mankind -- Armstrong Takes 1st Step on Moon
Los Angeles Times    Walk on Moon -- That's One Small Step for Man, One Giant Leap for Mankind
Corpus (Docs: 7, Lang: en_core_web_sm)

Note

Hat tip to Chris Bail for this example data!

Next, we create the textnet:

tn = Textnet(corpus.tokenized(), min_docs=1)

We’re using tokenized with all defaults, so textnets is removing stop words, applying stemming, and removing punctuation marks, numbers, URLs and the like. However, we’re overriding the default setting for min_docs, opting to keep even words that appear in only one document (that is, a single newspaper headline).

Let’s take a look:

tn.plot(label_term_nodes=True,
        label_doc_nodes=True,
        show_clusters=True)
_images/tutorial_5_0.svg

The show_clusters option marks the partitions found by the Leiden community detection algorithm. It identifies document–term groups that appear to form part of the same theme in the texts.

You may be wondering: why is the moon drifting off by itself in the network plot? That's because the word moon appears in every single document (exactly once in each headline), so its inverse document frequency, and with it its tf-idf weight, is zero for every document.
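If you want to verify this by hand, here is the arithmetic, assuming the standard inverse document frequency formula idf = log(N / df):

from math import log

n_docs = 7          # headlines in the corpus
docs_with_moon = 7  # "moon" occurs in every headline

idf = log(n_docs / docs_with_moon)  # log(1) == 0.0
# any term frequency multiplied by an idf of zero yields a tf-idf weight of zero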

We can also visualize the projected networks.

First, the network of newspapers:

papers = tn.project(node_type='doc')
papers.plot(label_nodes=True)
_images/tutorial_6_0.svg

As in the bipartite network above, we can see that the Houston Chronicle, Chicago Tribune, and Los Angeles Times cluster more closely together.

Next, the term network:

words = tn.project(node_type='term')
words.plot(label_nodes=True,
           show_clusters=True)
_images/tutorial_7_0.svg

Aside from visualization, we can also analyze our corpus using network metrics. For instance, documents with high betweenness centrality (or “cultural betweenness”; [Bai16]) might link together themes, thereby stimulating exchange across symbolic divides.

papers.top_betweenness()
Los Angeles Times    7.0
Boston Globe         0.0
Chicago Tribune      0.0
Houston Chronicle    0.0
New York Times       0.0
The Guardian         0.0
Washington Post      0.0
dtype: float64

As we can see, the Los Angeles Times is a cultural bridge linking the headline themes of the East Coast newspapers to the others.

words.top_betweenness()
walk         72.00
man          18.00
step         16.00
small        12.75
land          6.00
giant         6.00
leap          6.00
mankind       6.00
armstrong     3.25
plain         0.00
dtype: float64

It’s because the Times uses the word “walk” in its headline, linking the “One Small Step” cluster to the “Man on Moon” cluster.

We can produce the term graph plot again, this time scaling nodes according to their betweenness centrality, and pruning edges from the graph using “backbone extraction” ([SBogunaV09]).

We can also use color_clusters (instead of show_clusters) to color nodes according to their partition.

And we can filter node labels, labeling only those nodes that have a betweenness centrality score above the median. This is particularly useful in graphs with many nodes, where labeling every single node would create too much visual clutter.

words.plot(label_nodes=True,
           scale_nodes_by='betweenness',
           color_clusters=True,
           alpha=0.5,
           node_label_filter=lambda n: n.betweenness() > words.betweenness.median())
_images/tutorial_10_1.svg

Wrangling Text & Mangling Data

How to go from this admittedly contrived example to working with your own data? The following snippets are meant to help you get started. The first thing is to get your data in the right shape.

A textnet is built from a collection—or corpus—of texts, so we use the Corpus class to get our data ready. Each of the following snippets assumes that you have imported Corpus and Textnet like in the preceding example.

From Pandas

You may already have your texts in a Python data structure. Corpus can read documents directly from pandas’ Series or DataFrame; mangling your data into the appropriate format should only take one or two easy steps. The important thing is to have the texts in one column, and the document labels as the index.
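For instance, here is a minimal sketch of getting data into that shape; the column name tekst, the labels, and the placeholder texts are made up for illustration:

import pandas as pd

# Hypothetical input: three labeled documents in a single text column
df = pd.DataFrame(
    {'tekst': ['Eerste document ...', 'Tweede document ...', 'Derde document ...']},
    index=['Document A', 'Document B', 'Document C'])

# Equivalently, a Series: texts as values, document labels as the index
series = df['tekst']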

corpus = Corpus(series, lang='nl')
# or alternately:
corpus = Corpus.from_df(df, doc_col='tekst', lang='nl')

If you do not specify doc_col, textnets assumes that the first column containing strings is the one you meant.

You can specify which language model to use with the lang argument. The default is English, but you don't have to be monolingual to use textnets. (Languages listed in LANGS are fully supported, since spaCy provides statistical language models for them. Other languages are only partially supported, so noun_phrases will likely not function.)

From a database or CSV file

You can also use Corpus to load your documents from a database or comma-separated value file using from_sql and from_csv respectively.

import sqlite3

with sqlite3.connect('documents.db') as conn:
    articles = Corpus.from_sql('SELECT title, text FROM articles', conn)

As before, you can pass doc_col to specify which column contains your texts. You can also pass label_col to indicate the column containing document labels. By default, from_sql uses the first column as the label_col and the first column after that containing strings as the doc_col.

blog = Corpus.from_csv('blog-posts.csv',
                       label_col='slug',
                       doc_col='summary',
                       sep=';')

Both from_sql and from_csv accept additional keyword arguments that are passed to pandas.read_sql and pandas.read_csv respectively.

From Files

Perhaps you have each document you want to include in your textnet stored on disk in a separate text file. For such cases, Corpus comes with a utility, from_files(). You can simply pass a path to it using a globbing pattern:

corpus = Corpus.from_files('/path/to/texts/*.txt')

You can also pass it a list of paths:

corpus = Corpus.from_files(['kohl.txt', 'schroeder.txt', 'merkel.txt'],
                           doc_labels=['Kohl', 'Schröder', 'Merkel'],
                           lang='de')

You can optionally pass explicit labels for your documents using the argument doc_labels. Without this, labels are inferred from file names by stripping off the file suffix.

Break It Up

The textnet is built from chunks of texts. Corpus offers three methods for breaking your texts into chunks: tokenized, ngrams, and noun_phrases. The first breaks your texts up into individual words, the second into n-grams of desired size, while the third looks for noun phrases such as “my husband,” “our prime minister,” or “the virus.”

trigrams = corpus.ngrams(3)
np = corpus.noun_phrases(remove=['Lilongwe', 'Mzuzu', 'Blantyre'])

Warning

For large corpora, some of these operations can be computationally intense. Use your friendly neighborhood HPC cluster or be prepared for your laptop to get hot.

Calling these methods results in another data frame, which we can feed to Textnet to make our textnet.

Make Connections

A textnet is a bipartite network of terms (words or phrases) and documents (which often represent the people or groups who authored them). We create the textnet from the processed corpus using the Textnet class.

tn = Textnet(np)

Textnet takes a few optional arguments. The most important one is min_docs. It determines how many documents a term must appear in to be included in the textnet. A term that appears only in a single document creates no link, so the default value is 2. However, this can lead to a very noisy graph, and usually only terms that appear in a significant proportion of documents really indicate latent topics, so it is common to pass a higher value.

A boolean argument, sublinear, decides whether to use sublinear (logarithmic) scaling when calculating tf-idf for edge weights. The default is True because sublinear scaling is considered good practice in the information retrieval literature ([MRSchutze08]), but there may be good reason to turn it off.
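For instance, both settings can be adjusted in a single call (the threshold of 5 below is an arbitrary choice for illustration):

# keep only terms that appear in at least 5 documents and use
# plain (linear) term frequencies when computing tf-idf edge weights
tn = Textnet(corpus.tokenized(), min_docs=5, sublinear=False)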

doc_attrs allows setting additional attributes for documents that become node attributes in the resulting network graph. For instance, if texts represent views of members of different parties, we can set a party attribute.

tn = Textnet(corpus.tokenized(), doc_attrs=df[['party']].to_dict())

Seeing Results

You are now ready to see the first results. Textnet comes with a utility method, plot, which allows you to quickly visualize the bipartite graph.

For bipartite graphs, it can be helpful to use a layout option, such as bipartite_layout, circular_layout, or sugiyama_layout, which help to spatially separate the two node types.

You may want terms that are used in more documents to appear bigger in the graph. In that case, use the scale_nodes_by argument with the value degree. Other useful options include label_term_nodes, label_doc_nodes, and label_edges. These are all boolean options, so simply pass the value True to enable them.

Finally, enabling show_clusters will draw polygons around detected groups of nodes with a community structure.
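Putting these options together might look like the following sketch; it assumes that, as with the other settings, a layout is selected by passing a boolean keyword argument:

tn.plot(sugiyama_layout=True,      # spatially separate term and document nodes
        scale_nodes_by='degree',   # terms used in more documents appear bigger
        label_term_nodes=True,
        label_doc_nodes=True,
        label_edges=True,
        show_clusters=True)        # draw polygons around detected communities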

Projecting

Depending on your research question, you may be interested in how either the terms or the documents are connected. You can project the bipartite network into a single-mode network of either kind.

groups = tn.project(node_type='doc')
groups.summary()

The resulting network only contains nodes of the chosen type (doc or term). Edge weights are calculated, and node attributes are maintained.

Like the bipartite network, the projected textnet also has a plot method. This takes an optional argument, alpha, which can help “de-clutter” the resulting visualization by removing edges. The value of this argument is a significance level, and only edges significant at or below that level are kept. What remains in the pruned graph is called the “backbone” in the network science literature. Commonly chosen values for alpha range between 0.2 and 0.6 (with lower values resulting in more aggressive pruning).
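For example, the following prunes the projected document network at a moderate level (0.4 is just one choice within that range):

# keep only edges significant at alpha <= 0.4 -- the network's "backbone"
groups.plot(label_nodes=True, alpha=0.4)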

In visualizations of the projected network, you may want to scale nodes according to centrality. Pass the argument scale_nodes_by with a value of “betweenness,” “closeness,” “degree,” “strength,” or “eigenvector_centrality.”

Label nodes using the boolean argument label_nodes. As above, show_clusters will mark groups of nodes with a community structure.

Analysis

The tutorial above gives some examples of using centrality measures to analyze your corpus. Aside from top_betweenness, the package also provides the methods top_closeness, top_degree (for unweighted degree), top_strength (for weighted degree), and top_ev (for eigenvector centrality). By default, they each output the ten top nodes for each centrality measure.
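For example, the other rankings can be requested in the same way as top_betweenness above:

words.top_closeness()  # ten terms with the highest closeness centrality
words.top_degree()     # ten terms with the highest unweighted degree
words.top_strength()   # ten terms with the highest weighted degree
words.top_ev()         # ten terms with the highest eigenvector centrality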

In addition, you can use top_cluster_nodes to help interpret the community structure of your textnet. Clusters can either be interpreted as latent themes (in the word graph) or as groupings of documents using similar words or phrases (in the document graph).
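For instance, applied to the term graph from the tutorial above:

# list the top-ranked nodes within each detected cluster to help
# interpret the themes they represent
words.top_cluster_nodes()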