
Takes plain text as input and returns a tibble/data frame of word co-occurrences.

Usage

cooccur_words(
  text,
  sw = "",
  token_by = "sentence",
  lower = TRUE,
  loop = FALSE,
  output = "df",
  count = TRUE
)

Arguments

text

The input text

sw

A vector of stopwords to be removed

token_by

Tokenize by sentence or paragraph

lower

Convert words to lowercase (default TRUE). If the text is already all lowercase when it is passed in, sentence and paragraph tokenization can be wrong. It is advised to pass the text in its original case and let this argument do the lowercasing.

loop

If FALSE, self-referential nodes (e.g. n1 = x and n2 = x) are excluded. Default FALSE.

output

Output format (default "df"): 1) a single tibble/data frame ("tlb", "df", "tibble", "datafame"); 2) a list of data frames with the co-occurrences per vector element ("lst" or "list"); 3) a raw list, the least processed output of this function; or 4) "df2", a tibble/data frame with the doc numbers.

count

Return count of words (default TRUE)
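
To illustrate what the loop argument describes, here is a minimal base R sketch. The edge list is made-up data (not output of cooccur_words()), and the filter only shows what dropping self-referential rows means. It is not the function's own implementation.

```r
# Hypothetical edge list for illustration only; not produced by cooccur_words().
edges <- data.frame(
  n1 = c("ipsum", "lorem", "dolor"),
  n2 = c("lorem", "lorem", "est"),
  stringsAsFactors = FALSE
)

# A self-referential row has the same word in n1 and n2 ("lorem"-"lorem" here).
# Keeping only rows where n1 differs from n2 mirrors the idea behind loop = FALSE.
no_loops <- edges[edges$n1 != edges$n2, ]
no_loops
```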

Examples

txt <- "Lorem Ipsum. The Ipsum John. Dolor est. Lorem Ipsum dolor."
txt |> cooccur_words()
#> tokenizing sentences...
#> tokenizing words...
#> # A tibble: 7 × 3
#>   n1    n2        n
#>   <chr> <chr> <int>
#> 1 ipsum lorem     2
#> 2 dolor est       1
#> 3 dolor ipsum     1
#> 4 dolor lorem     1
#> 5 ipsum john      1
#> 6 ipsum the       1
#> 7 john  the       1
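
The counts above can be checked by hand. The following base R sketch is only an approximation of the idea, under stated assumptions: sentences are split at periods, words at whitespace, each word counts once per sentence, and each pair is stored in alphabetical order. The actual tokenization inside cooccur_words() may differ.

```r
txt <- "Lorem Ipsum. The Ipsum John. Dolor est. Lorem Ipsum dolor."

# Assumption: a period followed by optional whitespace ends a sentence.
sentences <- unlist(strsplit(txt, "\\.\\s*"))

# For each sentence, build every unordered pair of distinct words.
pairs_per_sentence <- lapply(sentences, function(s) {
  words <- unique(strsplit(tolower(s), "\\s+")[[1]])
  words <- words[nzchar(words)]
  if (length(words) < 2) return(NULL)
  p <- t(combn(sort(words), 2))  # sorting puts n1 alphabetically before n2
  data.frame(n1 = p[, 1], n2 = p[, 2], stringsAsFactors = FALSE)
})
pairs <- do.call(rbind, Filter(Negate(is.null), pairs_per_sentence))

# Count how many sentences contain each pair, most frequent first.
counts <- aggregate(n ~ n1 + n2, data = transform(pairs, n = 1L), FUN = sum)
counts <- counts[order(-counts$n, counts$n1, counts$n2), ]
counts
```

On this input the sketch yields the same seven pairs as the example, with ipsum/lorem counted twice because the two words share both the first and the last sentence.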