This tutorial walks you through all the steps required to analyze and visualize your data using textnets. The tutorial first presents a self-contained example before addressing miscellaneous other issues related to using textnets.
To use textnets in a project, you typically need the following imports:
from textnets import Corpus, Textnet
For the purposes of demonstration, we also import the bundled example data:
from textnets import examples
Construct the corpus from the example data:
corpus = Corpus(examples.moon_landing)
What is this moon_landing example all about?
| Newspaper | Headline |
|---|---|
| The Guardian | 3:56 am: Man Steps On to the Moon |
| New York Times | Men Walk on Moon -- Astronauts Land on Plain, Collect Rocks, Plant Flag |
| Boston Globe | Man Walks on Moon |
| Houston Chronicle | Armstrong and Aldrin "Take One Small Step for Man" on the Moon |
| Washington Post | The Eagle Has Landed -- Two Men Walk on the Moon |
| Chicago Tribune | Giant Leap for Mankind -- Armstrong Takes 1st Step on Moon |
| Los Angeles Times | Walk on Moon -- That's One Small Step for Man, One Giant Leap for Mankind |
For reference, the Corpus docstring summarizes the parameters we used above:

Corpus of labeled documents.

Parameters
----------
data : Series
    Series containing the documents. The index must contain document labels.
lang : str, optional
    The language model to use (default: ``en_core_web_sm``).
Hat tip to Chris Bail for this example data!
Next, we create the textnet:
tn = Textnet(corpus.tokenized(), min_docs=1)
We're calling the corpus's tokenized method with all defaults, so textnets is removing stop words, applying stemming, and removing punctuation marks, numbers, URLs and the like. However, we're overriding the default setting for min_docs, opting to keep even words that appear in only one document (that is, a single newspaper headline).
Let’s take a look:
tn.plot(label_term_nodes=True, label_doc_nodes=True, show_clusters=True)
The show_clusters option marks the partitions found by the Leiden community detection algorithm. It identifies document–term groups that appear to form part of the same theme in the texts.
You may be wondering: why is the moon drifting off by itself in the network plot? That’s because the word moon appears exactly once in each document, so its tf-idf value for each document is 0.
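To see why, consider how the inverse document frequency behaves for a term that occurs in every document. Here is a minimal sketch of the arithmetic (textnets' exact weighting may differ in details such as smoothing or sublinear scaling):

import math

n_docs = 7          # headlines in the example corpus
docs_with_term = 7  # "moon" occurs in every headline
idf = math.log(n_docs / docs_with_term)  # log(7/7) = log(1) = 0
print(idf)  # 0.0, so every tf-idf weight for "moon" is zero

Since edges in the textnet are weighted by tf-idf, "moon" receives no ties to any document.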
We can also visualize the projected networks.
First, the network of newspapers:
papers = tn.project(node_type='doc')
papers.plot(label_nodes=True)
As in the bipartite network before, we can see that the Houston Chronicle, Chicago Tribune, and Los Angeles Times cluster more closely together.
Next, the term network:
words = tn.project(node_type='term')
words.plot(label_nodes=True, show_clusters=True)
Aside from visualization, we can also analyze our corpus using network metrics. For instance, documents with high betweenness centrality (or “cultural betweenness”; [Bai16]) might link together themes, thereby stimulating exchange across symbolic divides.
papers.top_betweenness()

Los Angeles Times    7.0
Boston Globe         0.0
Chicago Tribune      0.0
Houston Chronicle    0.0
New York Times       0.0
The Guardian         0.0
Washington Post      0.0
dtype: float64
As we can see, the Los Angeles Times is a cultural bridge linking the headline themes of the East Coast newspapers to the others.
words.top_betweenness()

walk         72.00
man          18.00
step         16.00
small        12.75
land          6.00
giant         6.00
leap          6.00
mankind       6.00
armstrong     3.25
plain         0.00
dtype: float64
It’s because the Times uses the word “walk” in its headline, linking the “One Small Step” cluster to the “Man on Moon” cluster.
We can produce the term graph plot again, this time scaling nodes according to their betweenness centrality, and pruning edges from the graph using “backbone extraction” ([SBogunaV09]).
We can also use color_clusters (instead of show_clusters) to color nodes according to their partition.
And we can filter node labels, labeling only those nodes that have a betweenness centrality score above the median. This is particularly useful in high-order graphs where labeling every single node would cause too much visual clutter.
words.plot(label_nodes=True,
           scale_nodes_by='betweenness',
           color_clusters=True,
           alpha=0.5,
           node_label_filter=lambda n: n.betweenness() > words.betweenness.median())
Wrangling Text & Mangling Data
How to go from this admittedly contrived example to working with your own data? The following snippets are meant to help you get started. The first thing is to get your data in the right shape.
A textnet is built from a collection, or corpus, of texts, so we use the Corpus class to get our data ready. Each of the following snippets assumes that you have imported Corpus and Textnet as in the preceding example.
You may already have your texts in a Python data structure. Corpus can read documents directly from a pandas Series or DataFrame; mangling your data into the appropriate format should only take one or two easy steps. The important thing is to have the texts in one column and the document labels as the index.
corpus = Corpus(series, lang='nl')
# or alternately:
corpus = Corpus.from_df(df, doc_col='tekst', lang='nl')
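For instance, if your texts start out in plain Python lists, one pandas call gets them into shape. The column name and document labels below are made up for illustration:

import pandas as pd

# document labels as the index, texts in a single column
df = pd.DataFrame(
    {'tekst': ['De eerste tekst.', 'Nog een tekst.']},
    index=['Document A', 'Document B'],
)
series = df['tekst']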
If you do not specify doc_col, textnets assumes that the first column containing strings is the one you meant.
You can specify which language model you would like to use with the lang argument. The default is English, but you don't have to be monolingual to use textnets. (Languages in LANGS are fully supported since we can use spaCy's statistical language models. Other languages are only partially supported, so noun_phrases will likely not function.)
From a database or CSV file
import sqlite3

with sqlite3.connect('documents.db') as conn:
    articles = Corpus.from_sql('SELECT title, text FROM articles', conn)
As before, you can specify a doc_col to indicate which column contains your texts. You can also specify a label_col containing document labels. By default, from_sql uses the first column as the label_col and the first column after that containing strings as the doc_col.
blog = Corpus.from_csv('blog-posts.csv', label_col='slug', doc_col='summary', sep=';')
Perhaps you have each document you want to include in your textnet stored on disk in a separate text file. For such cases, Corpus comes with a utility, from_files(). You can simply pass a path to it using a globbing pattern:
corpus = Corpus.from_files('/path/to/texts/*.txt')
You can also pass it a list of paths:
corpus = Corpus.from_files(['kohl.txt', 'schroeder.txt', 'merkel.txt'],
                           doc_labels=['Kohl', 'Schröder', 'Merkel'],
                           lang='de')
You can optionally pass explicit labels for your documents using the argument doc_labels. Without this, labels are inferred from file names by stripping off the file suffix.
Break It Up
The textnet is built from chunks of texts. Corpus offers three methods for breaking your texts into chunks: tokenized, ngrams, and noun_phrases. The first breaks your texts up into individual words, the second into n-grams of the desired size, while the third looks for noun phrases such as "my husband," "our prime minister," or "the virus."
trigrams = corpus.ngrams(3)
np = corpus.noun_phrases(remove=['Lilongwe', 'Mzuzu', 'Blantyre'])
For large corpora, some of these operations can be computationally intense. Use your friendly neighborhood HPC cluster or be prepared for your laptop to get hot.
Calling these methods results in another data frame, which we can feed to Textnet to make our textnet.
A textnet is a bipartite network of terms (words or phrases) and documents (which often represent the people or groups who authored them). We create the textnet from the processed corpus using the Textnet class:
tn = Textnet(np)
Textnet takes a few optional arguments. The most important one is min_docs. It determines how many documents a term must appear in to be included in the textnet. A term that appears only in a single document creates no link, so the default value is 2. However, this can lead to a very noisy graph, and usually only terms that appear in a significant proportion of documents really indicate latent topics, so it is common to pass a higher value.
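For example, to require that a term appear in at least five documents (a threshold picked purely for illustration; choose one to suit your corpus):

tn = Textnet(corpus.tokenized(), min_docs=5)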
A boolean argument, sublinear, decides whether to use sublinear (logarithmic) scaling when calculating tf-idf for edge weights. The default is True because sublinear scaling is considered good practice in the information retrieval literature ([MRSchutze08]), but there may be good reason to turn it off.
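If you would rather weight edges by raw term frequencies, disable it:

tn = Textnet(corpus.tokenized(), sublinear=False)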
doc_attrs allows setting additional attributes for documents that become node attributes in the resulting network graph. For instance, if texts represent views of members of different parties, we can set a party attribute.
tn = Textnet(corpus.tokenized(), doc_attrs=df[['party']].to_dict())
For bipartite graphs, it can be helpful to use a layout option, such as sugiyama_layout, which helps to spatially separate the two node types.
You may want terms that are used in more documents to appear bigger in the graph. In that case, use the scale_nodes_by argument with the value degree. Other useful options include label_term_nodes, label_doc_nodes, and label_edges. These are all boolean options, so simply pass the value True to enable them.
show_clusters will draw polygons around detected groups of nodes with a community structure.
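Combining several of these options might look as follows. Treat this as a sketch: in particular, it assumes that sugiyama_layout is passed to plot as a boolean flag, like the other options mentioned above:

tn.plot(sugiyama_layout=True,
        scale_nodes_by='degree',
        label_term_nodes=True,
        label_doc_nodes=True,
        label_edges=True,
        show_clusters=True)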
Depending on your research question, you may be interested in how either terms or documents are connected. You can project the bipartite network into a single-mode network of either kind.
groups = tn.project(node_type='doc')
groups.summary()
The resulting network only contains nodes of the chosen type (doc or term). Edge weights are calculated, and node attributes are maintained.
Like the bipartite network, the projected textnet also has a plot method. This takes an optional argument, alpha, which can help "de-clutter" the resulting visualization by removing edges. The value for this argument is a significance value, and only edges with a significance value at or below the chosen value are kept. What remains in the pruned graph is called the "backbone" in the network science literature.
Commonly chosen values for alpha are in the range between 0.2 and 0.6 (with lower values resulting in more aggressive pruning).
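For example, to prune the document network from above fairly aggressively (0.3 is a hypothetical choice within that range):

groups.plot(label_nodes=True, alpha=0.3)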
In visualizations of the projected network, you may want to scale nodes according to centrality. Pass the argument scale_nodes_by with a value of "betweenness," "closeness," "degree," "strength," or "eigenvector_centrality." Label nodes using the boolean argument label_nodes. As above, show_clusters will mark groups of nodes with a community structure.
The tutorial above gives some examples of using centrality measures to analyze your corpus. Aside from top_betweenness, the package also provides top_degree (for unweighted degree), top_strength (for weighted degree), and top_ev (for eigenvector centrality). By default, they each output the ten top nodes for each centrality measure.
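For instance, to list the ten best-connected terms in the term graph from the tutorial:

words.top_degree()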
In addition, you can use top_cluster_nodes to help interpret the community structure of your textnet. Clusters can either be interpreted as latent themes (in the word graph) or as groupings of documents using similar words or phrases (in the document graph).
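For the term graph above, that might look like this (assuming, as a sketch, that the method requires no arguments):

words.top_cluster_nodes()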