Advanced Topics¶
Saving and loading your project¶
In this example, we define a project_file to store the configuration parameters, corpus, and textnet. If the file exists, they are loaded from it; otherwise they are created and saved to file.
from pathlib import Path

import textnets as tn

working_dir = Path(".")
project_file = working_dir / "my_project.db"

if project_file.exists():
    tn.params.load(project_file)
    corpus = tn.load_corpus(project_file)
    net = tn.load_textnet(project_file)
else:
    my_params = {"seed": 42, "autodownload": True}
    tn.params.update(my_params)
    corpus = tn.Corpus(tn.examples.digitalisierung, lang="de")
    net = tn.Textnet(corpus.noun_phrases(normalize=True))
    tn.params.save(project_file)
    corpus.save(project_file)
    net.save(project_file)
This code would only require the corpus and textnet to be created once. Subsequent runs of the script could skip ahead to visualization or analysis. This saves time, but also helps ensure the reproducibility of results.
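The same load-or-create pattern generalizes beyond textnets. Here is a minimal sketch using only the standard library, where pickle stands in for textnets' save and load methods and expensive_result is a hypothetical placeholder for building the corpus and textnet:

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for an expensive step, such as building a
# corpus and textnet from raw documents.
def expensive_result():
    return {"terms": ["netz", "daten"], "weights": [0.7, 0.3]}

cache_file = Path(tempfile.mkdtemp()) / "my_project.pickle"

def load_or_create():
    if cache_file.exists():
        # Subsequent runs skip straight to the cached result.
        return pickle.loads(cache_file.read_bytes())
    # First run: compute once, then persist for later runs.
    result = expensive_result()
    cache_file.write_bytes(pickle.dumps(result))
    return result

first = load_or_create()   # computes and saves
second = load_or_create()  # loads from the cache file
```

Both calls return equal results, but only the first one pays the cost of the computation.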
Using alternate community detection algorithms¶
By default, textnets will use the Leiden algorithm to find communities in bipartite and projected networks. You can, however, also use other algorithms.
(These examples assume that you have already created a bipartite Textnet called net.)
Implemented in igraph¶
When plotting a textnet, you can supply the arguments show_clusters or color_clusters. These accept a boolean value, but you can also pass a VertexClustering, which is the data structure used by igraph.
If you want to use Blondel et al.’s multilevel algorithm to color the nodes of a projected textnet, you can do so as follows:
terms = net.project(node_type="term")
# initialize the random seed before running community detection
tn.init_seed()
part = terms.graph.community_multilevel(weights="weight")
print("Modularity: ", terms.graph.modularity(part, weights="weight"))
terms.plot(label_nodes=True, color_clusters=part)
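Under the hood, a partition such as the one returned above boils down to a membership list: position i holds the community id of vertex i (igraph exposes it as part.membership). Community sizes can then be tallied with the standard library alone; the membership list below is made up for illustration:

```python
from collections import Counter

# Hypothetical membership list: vertex i belongs to community membership[i].
membership = [0, 0, 1, 2, 1, 0]

# Count how many vertices fall into each community.
sizes = Counter(membership)
print(sizes.most_common())  # [(0, 3), (1, 2), (2, 1)]
```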
Alternatively, we can also overwrite the textnet's clusters property:
terms.clusters = part
To return to the default (clusters detected by the Leiden algorithm), delete the clusters property:
del terms.clusters
Implemented in leidenalg¶
The leidenalg package is installed as a dependency of textnets. It can produce a variety of different partition types, and in some cases, you may want to use a different one than the default. In this example, leidenalg is instructed to use the method of “asymptotic surprise” to determine the graph partition.
import leidenalg as la

terms.clusters = la.find_partition(
    terms.graph,
    partition_type=la.SurpriseVertexPartition,
    weights="weight",
    n_iterations=-1,
    seed=tn.params["seed"],
)
After setting the clusters like this, you can plot the network as before. You can also output a list of nodes per partition:
terms.top_cluster_nodes()
Implemented in cdlib¶
The Community Discovery Library (cdlib) implements a wide range of algorithms for community detection that aren’t available in igraph. Some of them are also able to perform community detection on the bipartite network.

In order to run this example, you first have to install cdlib.
from cdlib.algorithms import infomap_bipartite, paris
The first example applies the Infomap community detection algorithm to the bipartite network:
# initialize the random seed before running community detection
tn.init_seed()
bi_node_community_map = infomap_bipartite(net.graph.to_networkx()).to_node_community_map()
# overwrite clusters detected by Leiden algorithm
net.clusters = bi_node_community_map
print("Modularity: ", net.modularity)
net.plot(label_nodes=True, show_clusters=True)
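textnets consumes the node-community map directly, but some other tools expect a flat membership list instead (one community id per node, in vertex order). Assuming, as cdlib documents, that to_node_community_map yields a mapping from each node to the list of communities containing it, and that each node here belongs to exactly one community, the conversion can be sketched in plain Python (the node names are invented):

```python
# Hypothetical node-community map of the shape cdlib produces:
# each node maps to the list of communities it belongs to.
node_community_map = {
    "doc_a": [0],
    "doc_b": [0],
    "term_x": [1],
    "term_y": [1],
}

# Fix a node order (e.g. the graph's vertex order), then take each
# node's first community to obtain a flat membership list.
node_order = ["doc_a", "doc_b", "term_x", "term_y"]
membership = [node_community_map[node][0] for node in node_order]

print(membership)  # [0, 0, 1, 1]
```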
This example applies the Paris hierarchical clustering algorithm to the projected network:
docs = net.project(node_type="doc")
# initialize the random seed before running community detection
tn.init_seed()
docs_node_community_map = paris(docs.graph.to_networkx()).to_node_community_map()
# overwrite clusters detected by Leiden algorithm
docs.clusters = docs_node_community_map
print("Modularity: ", docs.modularity)
docs.plot(label_nodes=True, color_clusters=True)
Implemented in karateclub¶
Karate Club is a library of machine-learning methods to apply to networks. Among other things, it also implements community detection algorithms. Here’s an example for using community detection from karateclub with textnets.

This example requires you to first have installed karateclub.
from karateclub import SCD
cd = SCD(seed=tn.params["seed"])
cd.fit(net.graph.to_networkx())
net.clusters = list(cd.get_memberships().values())
print("Modularity: ", net.modularity)
net.plot(color_clusters=True, label_nodes=True)
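The list(...) conversion above relies on the memberships dict iterating in node order. A more defensive sketch sorts by node index first (this assumes, as in karateclub, that the dict is keyed by integer node indices; the values below are made up):

```python
# Hypothetical memberships dict keyed by node index, as returned by a
# karateclub-style get_memberships(); insertion order is not guaranteed
# to follow node order.
memberships = {2: 1, 0: 0, 1: 0, 3: 1}

# Sort by node index so that position i of the list is node i's community.
membership_list = [memberships[i] for i in range(len(memberships))]

print(membership_list)  # [0, 0, 1, 1]
```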
Additional measures for centrality analysis¶
The Tutorial provides examples of using BiRank, betweenness, closeness, and (weighted and unweighted) degree to analyze a textnet. The NetworkX library implements a large variety of other centrality measures that may also prove helpful but aren’t available in igraph, the library that textnets builds on, including additional centrality measures for bipartite networks.

This example requires networkx to be installed.
import networkx as nx
bi_btwn = nx.algorithms.bipartite.betweenness_centrality(net.graph.to_networkx())
net.nodes["btwn"] = list(bi_btwn.values())
net.plot(scale_nodes_by="btwn")
katz_centrality = nx.katz_centrality(docs.graph.to_networkx(), weight="weight")
docs.nodes["katz"] = list(katz_centrality.values())
docs.plot(scale_nodes_by="katz")
Alternative methods of term extraction and weighing¶
By default, textnets leverages spaCy language models to break up your corpus when you call noun_phrases, ngrams, or tokenized, and it uses tf-idf term weights. There are many alternative ways of extracting terms and weighing them, and by defining a custom function, you can use them with textnets.
This example uses YAKE!, a popular library for keyword extraction, to extract keywords from a corpus and weigh them according to their significance.

This example requires yake to be installed.
import textnets as tn
from yake import KeywordExtractor

def yake(
    corpus: tn.Corpus,
    lang: str = "en",
    ngram_size: int = 3,
    top: int = 50,
    window: int = 2,
) -> tn.corpus.TidyText:
    """Use YAKE keyword extraction to break up corpus."""
    kw = KeywordExtractor(
        lan=lang,
        n=ngram_size,
        top=top,
        windowsSize=window,
    )
    tt = []
    for label, doc in corpus.documents.items():
        for term, sig in kw.extract_keywords(doc):
            tt.append({"label": label, "term": term, "term_weight": 1 - sig, "n": 1})
    return tn.corpus.TidyText(tt).set_index("label")
The result of calling yake on an instance of Corpus can be passed to Textnet (for instance, tn.Textnet(yake(corpus))).