textnets.corpus.Corpus

class textnets.corpus.Corpus(data: Series, lang: str | None = None)[source]

Bases: object

Corpus of labeled documents.

Parameters:
  • data (Series) – Series containing the documents. The index must contain document labels.

  • lang (str, optional) – The language model to use (default set by the “lang” parameter).

Raises:

ValueError – If the supplied data is empty.

documents

The corpus documents.

Type:

Series

lang

The language model used (ISO code or spaCy model name).

Type:

str

Methods

from_csv

Read corpus from comma-separated value file.

from_df

Create corpus from data frame.

from_dict

Create corpus from dictionary.

from_files

Construct corpus from files.

from_sql

Read corpus from SQL database.

load

Load a corpus from file.

ngrams

Return n-grams of length n from corpus in tidy format.

noun_phrases

Return noun phrases from corpus in tidy format.

save

Save a corpus to file.

tokenized

Return tokenized version of corpus in tidy format.

Attributes

nlp

Corpus documents with NLP applied.

classmethod from_csv(path: str, label_col: str | None = None, doc_col: str | None = None, lang: str | None = None, **kwargs) Corpus[source]

Read corpus from comma-separated value file.

Parameters:
  • path (str) – Path to CSV file.

  • label_col (str, optional) – Column that contains document labels (default: None, in which case the first column is used).

  • doc_col (str, optional) – Column that contains document text (default: None, in which case the first text column is used).

  • lang (str, optional) – The language model to use (default set by the “lang” parameter).

  • kwargs – Arguments to pass to pandas.read_csv.

Return type:

Corpus

classmethod from_df(data: DataFrame, doc_col: str | None = None, lang: str | None = None) Corpus[source]

Create corpus from data frame.

Parameters:
  • data (DataFrame) – DataFrame containing documents. The index must contain document labels.

  • doc_col (str, optional) – Indicates which column of data contains the document texts. If none is specified, the first column with strings is used.

  • lang (str, optional) – The language model to use (default set by the “lang” parameter).

Raises:

NoDocumentColumnException – If no document column can be detected.

Return type:

Corpus

classmethod from_dict(data: dict[Any, str], lang: str | None = None) Corpus[source]

Create corpus from dictionary.

Parameters:
  • data (dict) – Dictionary containing the documents as values and document labels as keys.

  • lang (str, optional) – The language model to use (default set by the “lang” parameter).

Return type:

Corpus

classmethod from_files(files: str | list[str] | list[Path], doc_labels: list[str] | None = None, lang: str | None = None) Corpus[source]

Construct corpus from files.

Parameters:
  • files (str or list of str or list of Path) – Path to files (with globbing pattern) or list of file paths.

  • doc_labels (list of str, optional) – Labels for documents (default: file name without suffix).

  • lang (str, optional) – The language model to use (default set by the “lang” parameter).

Return type:

Corpus

classmethod from_sql(qry: str, conn: str | object, label_col: str | None = None, doc_col: str | None = None, lang: str | None = None, **kwargs) Corpus[source]

Read corpus from SQL database.

Parameters:
  • qry (str) – SQL query.

  • conn (str or object) – Database URI or connection object.

  • label_col (str, optional) – Column that contains document labels (default: None, in which case the first column is used).

  • doc_col (str, optional) – Column that contains document text (default: None, in which case the first text column is used).

  • lang (str, optional) – The language model to use (default set by the “lang” parameter).

  • kwargs – Arguments to pass to pandas.read_sql.

Return type:

Corpus

classmethod load(source: PathLike[Any] | str) Corpus[source]

Load a corpus from file.

Parameters:

source (str or path) – File to read the corpus from. This should be a file created by Corpus.save.

Raises:

FileNotFoundError – If the specified path does not exist.

Return type:

Corpus

ngrams(size: int, remove: list[str] | None = None, stem: bool = False, remove_stop_words: bool = False, remove_urls: bool = False, remove_numbers: bool = False, remove_punctuation: bool = False, lower: bool = False, sublinear: bool = True) TidyText[source]

Return n-grams of length n from corpus in tidy format.

Parameters:
  • size (int) – Size of n-grams to return.

  • remove (list of str, optional) – Additional tokens to remove.

  • stem (bool, optional) – Return token stems (default: False).

  • remove_stop_words (bool, optional) – Remove stop words (default: False).

  • remove_urls (bool, optional) – Remove URL and email address tokens (default: False).

  • remove_numbers (bool, optional) – Remove number tokens (default: False).

  • remove_punctuation (bool, optional) – Remove punctuation marks, brackets, and quotation marks (default: False).

  • lower (bool, optional) – Make lower-case (default: False).

  • sublinear (bool, optional) – Apply sublinear scaling when calculating tf-idf term weights (default: True).

Returns:

A data frame with document labels (index), n-grams (term), and per-document counts (n).

Return type:

pandas.DataFrame

property nlp: Series

Corpus documents with NLP applied.

noun_phrases(normalize: bool = False, remove: list[str] | None = None, sublinear: bool = True) TidyText[source]

Return noun phrases from corpus in tidy format.

Parameters:
  • normalize (bool, optional) – Return lemmas of noun phrases (default: False).

  • remove (list of str, optional) – Additional tokens to remove.

  • sublinear (bool, optional) – Apply sublinear scaling when calculating tf-idf term weights (default: True).

Returns:

A data frame with document labels (index), noun phrases (term), and per-document counts (n).

Return type:

pandas.DataFrame

save(target: PathLike[Any] | str) None[source]

Save a corpus to file.

Parameters:

target (str or path) – File to save the corpus to. If the file exists, it will be overwritten.

tokenized(remove: list[str] | None = None, stem: bool = True, remove_stop_words: bool = True, remove_urls: bool = True, remove_numbers: bool = True, remove_punctuation: bool = True, lower: bool = True, sublinear: bool = True) TidyText[source]

Return tokenized version of corpus in tidy format.

Parameters:
  • remove (list of str, optional) – Additional tokens to remove.

  • stem (bool, optional) – Return token stems (default: True).

  • remove_stop_words (bool, optional) – Remove stop words (default: True).

  • remove_urls (bool, optional) – Remove URL and email address tokens (default: True).

  • remove_numbers (bool, optional) – Remove number tokens (default: True).

  • remove_punctuation (bool, optional) – Remove punctuation marks, brackets, and quotation marks (default: True).

  • lower (bool, optional) – Make lower-case (default: True).

  • sublinear (bool, optional) – Apply sublinear scaling when calculating tf-idf term weights (default: True).

Returns:

A data frame with document labels (index), tokens (term), and per-document counts (n).

Return type:

pandas.DataFrame