textnets.corpus.Corpus

class textnets.corpus.Corpus(data: Series, lang: str | None = None)[source]

Bases: object

Corpus of labeled documents.

Parameters:
  • data (Series) – Series containing the documents. The index must contain document labels.

  • lang (str, optional) – The language model to use (default set by the “lang” parameter).

Raises:

ValueError – If the supplied data is empty.

documents

The corpus documents.

Type:

Series

lang

The language model used (ISO code or spaCy model name).

Type:

str

Methods

from_csv

Read corpus from comma-separated value file.

from_df

Create corpus from data frame.

from_dict

Create corpus from dictionary.

from_files

Construct corpus from files.

from_sql

Read corpus from SQL database.

load

Load a corpus from file.

ngrams

Return n-grams of length n from corpus in tidy format.

noun_phrases

Return noun phrases from corpus in tidy format.

save

Save a corpus to file.

tokenized

Return tokenized version of corpus in tidy format.

Attributes

nlp

Corpus documents with NLP applied.

classmethod from_csv(path: str, label_col: str | None = None, doc_col: str | None = None, lang: str | None = None, **kwargs) Corpus[source]

Read corpus from comma-separated value file.

Parameters:
  • path (str) – Path to CSV file.

  • label_col (str, optional) – Column that contains document labels (default: None, in which case the first column is used).

  • doc_col (str, optional) – Column that contains document text (default: None, in which case the first text column is used).

  • lang (str, optional) – The language model to use (default set by the “lang” parameter).

  • kwargs – Arguments to pass to pandas.read_csv.

Return type:

Corpus

classmethod from_df(data: DataFrame, doc_col: str | None = None, lang: str | None = None) Corpus[source]

Create corpus from data frame.

Parameters:
  • data (DataFrame) – DataFrame containing documents. The index must contain document labels.

  • doc_col (str, optional) – Indicates which column of data contains the document texts. If none is specified, the first column with strings is used.

  • lang (str, optional) – The language model to use (default set by the “lang” parameter).

Raises:

NoDocumentColumnException – If no document column can be detected.

Return type:

Corpus

classmethod from_dict(data: dict[Any, str], lang: str | None = None) Corpus[source]

Create corpus from dictionary.

Parameters:
  • data (dict) – Dictionary containing the documents as values and document labels as keys.

  • lang (str, optional) – The language model to use (default set by the “lang” parameter).

Return type:

Corpus

classmethod from_files(files: str | list[str] | list[Path], doc_labels: list[str] | None = None, lang: str | None = None) Corpus[source]

Construct corpus from files.

Parameters:
  • files (str or list of str or list of Path) – Path to files (with globbing pattern) or list of file paths.

  • doc_labels (list of str, optional) – Labels for documents (default: file name without suffix).

  • lang (str, optional) – The language model to use (default set by the “lang” parameter).

Return type:

Corpus

classmethod from_sql(qry: str, conn: str | object, label_col: str | None = None, doc_col: str | None = None, lang: str | None = None, **kwargs) Corpus[source]

Read corpus from SQL database.

Parameters:
  • qry (str) – SQL query.

  • conn (str or object) – Database URI or connection object.

  • label_col (str, optional) – Column that contains document labels (default: None, in which case the first column is used).

  • doc_col (str, optional) – Column that contains document text (default: None, in which case the first text column is used).

  • lang (str, optional) – The language model to use (default set by the “lang” parameter).

  • kwargs – Arguments to pass to pandas.read_sql.

Return type:

Corpus

classmethod load(source: PathLike[Any] | str) Corpus[source]

Load a corpus from file.

Parameters:

source (str or path) – File to read the corpus from. This should be a file created by Corpus.save.

Raises:

FileNotFoundError – If the specified path does not exist.

Return type:

Corpus

ngrams(size: int, remove: list[str] | None = None, stem: bool = False, remove_stop_words: bool = False, remove_urls: bool = False, remove_numbers: bool = False, remove_punctuation: bool = False, lower: bool = False, sublinear: bool = True) TidyText[source]

Return n-grams of length n from corpus in tidy format.

Parameters:
  • size (int) – Size of n-grams to return.

  • remove (list of str, optional) – Additional tokens to remove.

  • stem (bool, optional) – Return token stems (default: False).

  • remove_stop_words (bool, optional) – Remove stop words (default: False).

  • remove_urls (bool, optional) – Remove URL and email address tokens (default: False).

  • remove_numbers (bool, optional) – Remove number tokens (default: False).

  • remove_punctuation (bool, optional) – Remove punctuation marks, brackets, and quotation marks (default: False).

  • lower (bool, optional) – Make lower-case (default: False).

  • sublinear (bool, optional) – Apply sublinear scaling when calculating tf-idf term weights (default: True).

Returns:

A data frame with document labels (index), n-grams (term), and per-document counts (n).

Return type:

pandas.DataFrame

property nlp: Series

Corpus documents with NLP applied.

noun_phrases(normalize: bool = False, remove: list[str] | None = None, sublinear: bool = True) TidyText[source]

Return noun phrases from corpus in tidy format.

Parameters:
  • normalize (bool, optional) – Return lemmas of noun phrases (default: False).

  • remove (list of str, optional) – Additional tokens to remove.

  • sublinear (bool, optional) – Apply sublinear scaling when calculating tf-idf term weights (default: True).

Returns:

A data frame with document labels (index), noun phrases (term), and per-document counts (n).

Return type:

pandas.DataFrame

save(target: PathLike[Any] | str) None[source]

Save a corpus to file.

Parameters:

target (str or path) – File to save the corpus to. If the file exists, it will be overwritten.

tokenized(remove: list[str] | None = None, stem: bool = True, remove_stop_words: bool = True, remove_urls: bool = True, remove_numbers: bool = True, remove_punctuation: bool = True, lower: bool = True, sublinear: bool = True) TidyText[source]

Return tokenized version of corpus in tidy format.

Parameters:
  • remove (list of str, optional) – Additional tokens to remove.

  • stem (bool, optional) – Return token stems (default: True).

  • remove_stop_words (bool, optional) – Remove stop words (default: True).

  • remove_urls (bool, optional) – Remove URL and email address tokens (default: True).

  • remove_numbers (bool, optional) – Remove number tokens (default: True).

  • remove_punctuation (bool, optional) – Remove punctuation marks, brackets, and quotation marks (default: True).

  • lower (bool, optional) – Make lower-case (default: True).

  • sublinear (bool, optional) – Apply sublinear scaling when calculating tf-idf term weights (default: True).

Returns:

A data frame with document labels (index), tokens (term), and per-document counts (n).

Return type:

pandas.DataFrame