textnets.corpus.Corpus
- class textnets.corpus.Corpus(data: Series, lang: str | None = None)
Bases: object
Corpus of labeled documents.
- Parameters:
data (Series) – Series containing the documents. The index must contain document labels.
lang (str, optional) – The language model to use (default set by “lang” parameter).
- Raises:
ValueError – If the supplied data is empty.
- documents
The corpus documents.
- Type:
Series
Methods
from_csv – Read corpus from comma-separated value file.
from_df – Create corpus from data frame.
from_dict – Create corpus from dictionary.
from_files – Construct corpus from files.
from_sql – Read corpus from SQL database.
load – Load a corpus from file.
ngrams – Return n-grams of length n from corpus in tidy format.
noun_phrases – Return noun phrases from corpus in tidy format.
save – Save a corpus to file.
tokenized – Return tokenized version of corpus in tidy format.
Attributes
Corpus documents with NLP applied.
- classmethod from_csv(path: str, label_col: str | None = None, doc_col: str | None = None, lang: str | None = None, **kwargs) → Corpus
Read corpus from comma-separated value file.
- Parameters:
path (str) – Path to CSV file.
label_col (str, optional) – Column that contains document labels (default: None, in which case the first column is used).
doc_col (str, optional) – Column that contains document text (default: None, in which case the first text column is used).
lang (str, optional) – The language model to use (default set by “lang” parameter).
kwargs – Arguments to pass to pandas.read_csv.
- Return type:
Corpus
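The label and document column options can be pictured with a small standard-library sketch. This is not how textnets reads CSV files (it passes the work to pandas.read_csv). The helper name `read_labeled_docs` and the sample data are hypothetical, and falling back to the second column for the text is a simplification of the "first text column" rule.

```python
import csv
import io


def read_labeled_docs(handle, label_col=None, doc_col=None):
    """Hypothetical helper: map document labels to document texts."""
    reader = csv.DictReader(handle)
    fields = reader.fieldnames or []
    if len(fields) < 2:
        raise ValueError("need at least a label column and a text column")
    # Assumption: default to the first column for labels, the second for text.
    label_col = label_col or fields[0]
    doc_col = doc_col or fields[1]
    return {row[label_col]: row[doc_col] for row in reader}


sample = io.StringIO("title,year,body\nA,1999,first text\nB,2004,second text\n")
# The second column holds years, so the text column is named explicitly.
docs = read_labeled_docs(sample, doc_col="body")
print(docs)
```

A dictionary of this shape, with labels as keys and texts as values, matches the `data` argument that `from_dict` takes.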
- classmethod from_df(data: DataFrame, doc_col: str | None = None, lang: str | None = None) → Corpus
Create corpus from data frame.
- Parameters:
data (DataFrame) – DataFrame containing documents. The index must contain document labels.
doc_col (str, optional) – Indicates which column of data contains the document texts. If none is specified, the first column with strings is used.
lang (str, optional) – The language model to use (default set by “lang” parameter).
- Raises:
NoDocumentColumnException – If no document column can be detected.
- Return type:
Corpus
- classmethod from_dict(data: dict[Any, str], lang: str | None = None) → Corpus
Create corpus from dictionary.
- classmethod from_files(files: str | list[str] | list[Path], doc_labels: list[str] | None = None, lang: str | None = None) → Corpus
Construct corpus from files.
- Parameters:
- Raises:
IsADirectoryError – If the provided path is a directory. (Use globbing.)
FileNotFoundError – If the provided path does not exist.
- Return type:
Corpus
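Because a directory path raises IsADirectoryError, the directory has to be expanded into a list of files first. The following is a minimal sketch of that globbing step using only pathlib. The temporary folder, the file names, and the choice of file stems as labels are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Create two throwaway text files to stand in for a real corpus folder.
    for name in ("beta", "alpha"):
        (folder / f"{name}.txt").write_text(f"text of {name}", encoding="utf-8")

    # Expand the directory into a sorted list of Path objects.
    files = sorted(folder.glob("*.txt"))
    # Assumption: use each file's stem as its document label.
    doc_labels = [path.stem for path in files]
    file_names = [path.name for path in files]

print(doc_labels)
```

Lists built this way correspond to the `files` and `doc_labels` arguments in the signature above. The files must still exist when the corpus is constructed, so in practice the glob would run against a persistent folder.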
- classmethod from_sql(qry: str, conn: str | object, label_col: str | None = None, doc_col: str | None = None, lang: str | None = None, **kwargs) → Corpus
Read corpus from SQL database.
- Parameters:
qry (str) – SQL query
label_col (str, optional) – Column that contains document labels (default: None, in which case the first column is used).
doc_col (str, optional) – Column that contains document text (default: None, in which case the first text column is used).
lang (str, optional) – The language model to use (default set by “lang” parameter).
kwargs – Arguments to pass to pandas.read_sql.
- Return type:
Corpus
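To illustrate the kind of query that fits the defaults (labels in the first column, text in the next), here is a sketch against an in-memory SQLite database using only the standard library. The table and column names are hypothetical, and the sketch does not call textnets or pandas.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE speeches (speaker TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO speeches VALUES (?, ?)",
    [("ada", "first speech"), ("bob", "second speech")],
)

# Select the label column first and the document text second.
qry = "SELECT speaker, body FROM speeches ORDER BY speaker"
rows = conn.execute(qry).fetchall()
conn.close()

# Each row is a (label, text) pair.
docs = dict(rows)
print(docs)
```

Which connection objects or connection strings are accepted for `conn` depends on pandas.read_sql, so it is worth checking that function's documentation for the database in use.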
- classmethod load(source: PathLike[Any] | str) → Corpus
Load a corpus from file.
- Parameters:
source (str or path) – File to read the corpus from. This should be a file created by Corpus.save.
- Raises:
FileNotFoundError – If the specified path does not exist.
- Return type:
Corpus
- ngrams(size: int, remove: list[str] | None = None, stem: bool = False, remove_stop_words: bool = False, remove_urls: bool = False, remove_numbers: bool = False, remove_punctuation: bool = False, lower: bool = False, sublinear: bool = True) → TidyText
Return n-grams of length n from corpus in tidy format.
- Parameters:
size (int) – Size of n-grams to return.
remove (list of str, optional) – Additional tokens to remove.
stem (bool, optional) – Return token stems (default: False).
remove_stop_words (bool, optional) – Remove stop words (default: False).
remove_urls (bool, optional) – Remove URL and email address tokens (default: False).
remove_numbers (bool, optional) – Remove number tokens (default: False).
remove_punctuation (bool, optional) – Remove punctuation marks, brackets, and quotation marks (default: False).
lower (bool, optional) – Make lower-case (default: False).
sublinear (bool, optional) – Apply sublinear scaling when calculating tf-idf term weights (default: True).
- Returns:
A data frame with document labels (index), n-grams (term), and per-document counts (n).
- Return type:
TidyText
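The tidy layout described under Returns (one row per document and term, with a count) can be sketched in plain Python. This is only an illustration of the output shape. The helper `tidy_ngrams` is hypothetical, and whitespace splitting stands in for the language model's tokenizer.

```python
from collections import Counter


def tidy_ngrams(docs, size):
    """Hypothetical helper: (label, term, n) rows for word n-grams."""
    rows = []
    for label, text in docs.items():
        tokens = text.split()  # assumption: naive whitespace tokenization
        grams = [
            " ".join(tokens[i:i + size])
            for i in range(len(tokens) - size + 1)
        ]
        # Count how often each n-gram occurs within this document.
        for term, n in Counter(grams).items():
            rows.append((label, term, n))
    return rows


docs = {"d1": "the cat sat on the cat", "d2": "the dog"}
rows = tidy_ngrams(docs, size=2)
for row in rows:
    print(row)
```

Documents shorter than `size` tokens contribute no rows in this sketch.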
- noun_phrases(normalize: bool = False, remove: list[str] | None = None, sublinear: bool = True) → TidyText
Return noun phrases from corpus in tidy format.
- Parameters:
- Returns:
A data frame with document labels (index), noun phrases (term), and per-document counts (n).
- Return type:
TidyText
- save(target: PathLike[Any] | str) → None
Save a corpus to file.
- Parameters:
target (str or path) – File to save the corpus to. If the file exists, it will be overwritten.
- tokenized(remove: list[str] | None = None, stem: bool = True, remove_stop_words: bool = True, remove_urls: bool = True, remove_numbers: bool = True, remove_punctuation: bool = True, lower: bool = True, sublinear: bool = True) → TidyText
Return tokenized version of corpus in tidy format.
- Parameters:
remove (list of str, optional) – Additional tokens to remove.
stem (bool, optional) – Return token stems (default: True).
remove_stop_words (bool, optional) – Remove stop words (default: True).
remove_urls (bool, optional) – Remove URL and email address tokens (default: True).
remove_numbers (bool, optional) – Remove number tokens (default: True).
remove_punctuation (bool, optional) – Remove punctuation marks, brackets, and quotation marks (default: True).
lower (bool, optional) – Make lower-case (default: True).
sublinear (bool, optional) – Apply sublinear scaling when calculating tf-idf term weights (default: True).
- Returns:
A data frame with document labels (index), tokens (term), and per-document counts (n).
- Return type:
TidyText
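The sublinear option dampens the effect of very frequent terms on tf-idf weights. A common textbook form replaces a raw count n with 1 + log(n). The sketch below uses that form with the natural logarithm. The exact formula and log base textnets applies are not stated here, so treat both, along with the helper name `term_frequency`, as assumptions.

```python
import math


def term_frequency(n, sublinear=True):
    """Hypothetical helper: raw count, or 1 + ln(count) when sublinear."""
    if n <= 0:
        return 0.0
    return 1.0 + math.log(n) if sublinear else float(n)


counts = [1, 10, 100]
scaled = [term_frequency(n) for n in counts]
raw = [term_frequency(n, sublinear=False) for n in counts]

for n, s in zip(counts, scaled):
    print(n, round(s, 3))
```

With scaling on, a term that occurs 100 times weighs only a few times more than one that occurs once, rather than 100 times more.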