Tutorial¶
This tutorial walks you through all the steps required to analyze and visualize your data using textnets. The tutorial first presents a self-contained example before addressing miscellaneous other issues related to using textnets.
Example¶
Tip
Download this example as a Jupyter notebook so you can follow along:
tutorial.ipynb.
You can also make this tutorial “live” so you can adjust the example code and re-run it.
To use textnets in a project, you typically start with the following import:
import textnets as tn
You can set a fixed seed to ensure that results are reproducible across runs of your script (see Sandve et al. [2013]):
tn.params["seed"] = 42
Construct the corpus from the example data:
corpus = tn.Corpus(tn.examples.moon_landing)
What is this moon_landing example all about? (Hint: Click on the output below
to see what’s in our corpus.)
corpus
Corpus
Docs: 7
Lang: en_core_web_sm
| label | |
|---|---|
| The Guardian | 3:56 am: Man Steps On to the Moon |
| New York Times | Men Walk on Moon -- Astronauts Land on Plain, Collect Rocks, Plant Flag |
| Boston Globe | Man Walks on Moon |
| Houston Chronicle | Armstrong and Aldrich "Take One Small Step for Man" on the Moon |
| Washington Post | The Eagle Has Landed -- Two Men Walk on the Moon |
| Chicago Tribune | Giant Leap for Mankind -- Armstrong Takes 1st Step on Moon |
| Los Angeles Times | Walk on Moon -- That's One Small Step for Man, One Giant Leap for Mankind |
Note
Hat tip to Chris Bail for this example data!
Next, we create the textnet:
t = tn.Textnet(corpus.tokenized(), min_docs=1)
We’re using tokenized with all defaults, so textnets is removing stop
words, applying stemming, and removing punctuation marks, numbers, URLs and the
like. However, we’re overriding the default setting for min_docs, opting to
keep even words that appear in only one document (that is, a single newspaper
headline).
When dealing with large corpora, you may also want to supply the argument
remove_weak_edges=True to remove edges with a weight far below the average.
This will result in a sparser graph.
Let’s take a look:
t.plot(label_nodes=True,
show_clusters=True)
<igraph.drawing.matplotlib.graph.GraphArtist at 0x7dea7c7dea20>
The show_clusters options marks the partitions found by the Leiden
community detection algorithm (see here). It identifies
document–term groups that appear to form part of the same theme in the texts.
You may be wondering: why is the moon drifting off by itself in the network plot? That’s because the word moon appears exactly once in each document, so its tf-idf value for each document is 0.
Let’s visualize the same thing again, but this time scale the nodes according to their BiRank (a centrality measure for bipartite networks) and the edges according to their weights.
t.plot(label_nodes=True,
show_clusters=True,
scale_nodes_by="birank",
scale_edges_by="weight")
<igraph.drawing.matplotlib.graph.GraphArtist at 0x7dea7bd0b980>
We can also visualize the projected networks.
First, the network of newspapers:
papers = t.project(node_type=tn.DOC)
papers.plot(label_nodes=True)
<igraph.drawing.matplotlib.graph.GraphArtist at 0x7dea7cfeb0b0>
As before in the bipartite network, we can see the Houston Chronicle, Chicago Tribune and Los Angeles Times cluster more closely together.
Next, the term network:
words = t.project(node_type=tn.TERM)
words.plot(label_nodes=True,
show_clusters=True)
<igraph.drawing.matplotlib.graph.GraphArtist at 0x7dea78353890>
Aside from visualization, we can also analyze our corpus using network metrics. For instance, documents with high betweenness centrality (or “cultural betweenness”; Bail [2016]) might link together themes, thereby stimulating exchange across symbolic divides.
papers.top_betweenness()
Los Angeles Times 10.0
Chicago Tribune 0.0
Boston Globe 0.0
Houston Chronicle 0.0
New York Times 0.0
The Guardian 0.0
Washington Post 0.0
dtype: float64
As we can see, the Los Angeles Times is a cultural bridge linking the headline themes of the East Coast newspapers to the others.
words.top_betweenness()
walk 110.000000
man 26.500000
small 15.250000
men 14.000000
step 11.500000
mankind 7.166667
giant 7.166667
leap 7.166667
armstrong 3.250000
flag 0.000000
dtype: float64
It’s because the Times uses the word “walk” in its headline, linking the “One Small Step” cluster to the “Man on Moon” cluster.
We can produce the term network plot again, this time scaling nodes according to their betweenness centrality, and pruning edges from the network using “backbone extraction” [Serrano et al., 2009].
We can also use color_clusters (instead of show_clusters) to color
nodes according to their partition.
And we can filter node labels, labeling only those nodes that have a betweenness centrality score above the median. This is particularly useful in high-order networks where labeling every single node would cause too much visual clutter.
words.plot(label_nodes=True,
scale_nodes_by="betweenness",
color_clusters=True,
alpha=0.5,
edge_width=[10*w for w in words.edges["weight"]],
edge_opacity=0.4,
node_label_filter=lambda n: n.betweenness() > words.betweenness.median())
<igraph.drawing.matplotlib.graph.GraphArtist at 0x7dea783ea990>
Another measure we can use is the textual spanning measure introduced by Stoltz and Taylor [2019], which can help identify “discursive holes” in the document-to-document network.
papers.plot(label_nodes=True,
scale_nodes_by="spanning")
<igraph.drawing.matplotlib.graph.GraphArtist at 0x7dea783e8dd0>
Larger document nodes are similar to nodes that are dissimilar from one another, so they can be thought of as spanning a wider “distance” in the discursive field than the smaller ones.
Wrangling Text & Mangling Data¶
How to go from this admittedly contrived example to working with your own data? The following snippets are meant to help you get started. The first thing is to get your data in the right shape.
A textnet is built from a collection—or corpus—of texts, so we use the
Corpus class to get our data ready. The following snippets assume that you
have imported textnets as above.
From a Dictionary¶
You may already have your texts in a Python data structure, such as a
dictionary mapping document labels (keys) to documents (values). In that case,
you can use the from_dict method to construct your
Corpus.
data = {f"Documento {label+1}": doc for label, doc in enumerate(docs)}
corpus = tn.Corpus.from_dict(data, lang="it")
You can specify which language model you would
like to use using the lang argument. The default is English, but you don’t
have to be monolingual to use textnets. (Languages in LANGS are fully
supported since we can use spacy’s statistical language models. Other languages
are only partially supported, so noun_phrases will likely not function.)
From Pandas¶
Corpus can read documents directly from pandas’ Series
or DataFrame; mangling your data into the appropriate
format should only take one or two steps. The important thing is to
have the texts in one column, and the document labels as the index.
corpus = tn.Corpus(series, lang="nl")
# or alternately:
corpus = tn.Corpus.from_df(df, doc_col="tekst", lang="nl")
If you do not specify doc_col, textnets assumes that the first column
containing strings is the one you meant.
From a database or CSV file¶
You can also use Corpus to load your documents from a database or
comma-separated value file using from_sql and from_csv respectively.
import sqlite3
with sqlite3.connect("documents.db") as conn:
articles = tn.Corpus.from_sql("SELECT title, text FROM articles", conn)
As before, you do can specify a doc_col to specify which column contains
your texts. You can also specify a label_col containing document labels. By
default, from_sql uses the first column as the label_col and the first
column after that containing strings as the doc_col.
blog = tn.Corpus.from_csv("blog-posts.csv",
label_col="slug",
doc_col="summary"
sep=";")
Both from_sql and from_csv accept additional keyword arguments that are
passed to pandas.read_sql and pandas.read_csv respectively.
From Files¶
Perhaps you have each document you want to include in your textnet stored on
disk in a separate text file. For such cases, Corpus comes with a utility,
from_files. You can pass it a path using a globbing pattern:
corpus = tn.Corpus.from_files("/path/to/texts/*.txt")
You can also pass it a list of paths:
corpus = tn.Corpus.from_files(["kohl.txt", "schroeder.txt", "merkel.txt"],
doc_labels=["Kohl", "Schröder", "Merkel"],
lang="de")
You can optionally pass explicit labels for your documents using the argument
doc_labels. Without this, labels are inferred from file names by stripping
off the file suffix.
Break It Up¶
The textnet is built from chunks of texts. Corpus offers three methods for
breaking your texts into chunks: tokenized, ngrams, and noun_phrases. The
first breaks your texts up into individual words, the second into n-grams of
desired size, while the third looks for noun phrases such as “my husband,” “our prime
minister,” or “the virus.”
trigrams = corpus.ngrams(3)
np = corpus.noun_phrases(remove=["Lilongwe", "Mzuzu", "Blantyre"])
Warning
For large corpora, some of these operations can be computationally intense. Use your friendly neighborhood HPC cluster or be prepared for your laptop to get hot.
An optional boolean argument, sublinear, can be passed to tokenized,
ngrams, and noun_phrases. It decides whether to use sublinear (logarithmic)
scaling when calculating tf-idf term weights. The default is True,
because sublinear scaling is considered good practice in the information
retrieval literature [Manning et al., 2008], but there may be good reason to
turn it off.
Calling these methods results in another data frame, which we can feed to
Textnet to make our textnet.
Make Connections¶
A textnet is a bipartite network of terms (words or
phrases) and documents (which often represent the people or groups who
authored them). We create the textnet from the processed corpus using the
Textnet class.
t = tn.Textnet(np)
Textnet takes a few optional arguments. The most important one is
min_docs. It determines how many documents a term must appear in to be
included in the textnet. A term that appears only in a single document creates
no link, so the default value is 2. However, this can lead to a very noisy
graph, and usually only terms that appear in a significant proportion of
documents really indicate latent topics, so it is common to pass a higher
value.
max_docs determines the maximum number of documents a term can appear in
while still being included in the textnet. By default, a term can appear in any
number of documents, as a term that appears frequently is likely to play an
important role in the corpus. However, some terms that appear in a large
proportion of documents may be irrelevant. In such cases it may be useful to
pass a lower value to exclude very common terms.
connected is a boolean argument that decides whether only the largest
connected component of the resulting network should be kept. It defaults to
False.
doc_attrs allows setting additional attributes for documents that become
node attributes in the resulting network. For instance, if texts represent
views of members of different parties, we can set a party attribute.
t = tn.Textnet(corpus.tokenized(), doc_attr=df[["party"]].to_dict())
Seeing Results¶
You are now ready to see the first results. Textnet comes with a utility
method, plot, which allows you to quickly visualize the
bipartite network.
For bipartite network, it can be helpful to use a layout option, such as
bipartite_layout, circular_layout, or sugiyama_layout, which help
to spatially separate the two node types.
You may want terms that are used in more documents to appear bigger in the
plot. In that case, use the scale_nodes_by argument with the value
degree. Other useful options include label_term_nodes,
label_doc_nodes, and label_edges. These are all boolean options, so
your can enable them by passing the value True.
Finally, enabling show_clusters will draw polygons around detected groups
of nodes with a community structure.
Projecting¶
Depending on your research question, you may be interested either in how terms or documents are connected. You can project the bipartite network into a single-mode network of either kind.
groups = t.project(node_type=tn.DOC)
print(groups.summary)
The resulting network only contains nodes of the chosen type (DOC or TERM).
Edge weights are calculated, and node attributes are maintained. The m property gives you access to the projected graph’s
weighted adjacency matrix.
Like the bipartite network, the projected textnet also has a plot method. This takes an optional argument, alpha,
which can help “de-clutter” the resulting visualization by removing edges. The
value for this argument is a significance value, and only edges with a
significance value at or below the chosen value are kept. What remains in the
pruned network is called the “backbone” in the network science literature.
Commonly chosen values for alpha are in the range between 0.2 and 0.6 (with
lower values resulting in more aggressive pruning).
In visualizations of the projected network, you may want to scale nodes
according to centrality. Pass the argument scale_nodes_by with a value of
“betweenness,” “closeness,” “harmonic,” “degree,” “strength,” or
“eigenvector_centrality.” In the document network, you can also use textual
spanning, as demonstrated above.
Label nodes using the boolean argument label_nodes. As above,
show_clusters will mark groups of nodes with a community structure.
Analysis¶
Use top_cluster_nodes to interpret the
community structure of your textnet. Clusters in the bipartite or word network
can be interpreted as discursive categories or latent themes in the corpus
[Gerlach et al., 2018]. Clusters in the document network can be interpreted as
groupings.
The tutorial above gives some examples of using centrality measures to analyze
your corpus. Aside from top_betweenness, the package also provides the
methods top_closeness (weighted closeness), top_harmonic (weighted harmonic
centrality), top_degree (unweighted degree), top_strength (weighted
degree), top_ev (eigenvector centrality), top_pagerank (PageRank
centrality), and top_spanning (textual spanning). In the bipartite network,
you can use top_birank, top_hits and top_cohits to see nodes ranked by
variations of a bipartite centrality measure [He et al., 2017]. By default, they
each output the ten top nodes for each centrality measure.
Saving¶
You can save both the network that underlies a textnet as well as
visualizations. Assuming you want to save the projected term network, called
words, that we created above, you can do so as follows:
words.save_graph("term_network.gml")
This will create a file in the current directory in Graph Modeling Language
(GML) format. This can then be opened by Pajek, yEd, Gephi and other programs.
Consult the docs for Textnet.save_graph for a list of supported formats.
If you want to save a plot of a network, use savefig.
words.plot(label_nodes=True, color_clusters=True)
tn.savefig("term_network.svg")
Supported file formats include SVG, PNG, TIFF, EPS and PDF.