Takes plain text as input and returns a tibble/data frame of word co-occurrences.
Usage
cooccur_words(
text,
sw = "",
token_by = "sentence",
lower = TRUE,
loop = FALSE,
output = "df",
count = TRUE
)
Arguments
- text
The input text
- sw
A vector of stopwords to be removed
- token_by
Tokenize by sentence or paragraph
- lower
Convert words to lowercase. If the text is already all lowercase when it is passed in, sentence and paragraph tokenization may be incorrect. It is advised to pass the text with its original capitalization and let lower = TRUE do the conversion.
- loop
If FALSE, self-referential pairs (e.g. n1 = x and also n2 = x) are excluded. Default FALSE.
- output
Output format: 1) a single tibble/data frame ("tlb", "df", "tibble", "datafame"); 2) a list of data frames with co-occurrences per vector element ("lst" or "list"); 3) a raw list, the least processed output of this function; or 4) "df2", a tibble/data frame that includes the document numbers.
- count
Return count of words (default TRUE)
Examples
txt <- "Lorem Ipsum. The Ipsum John. Dolor est. Lorem Ipsum dolor."
txt |> cooccur_words()
#> tokenizing sentences...
#> tokenizing words...
#> # A tibble: 7 × 3
#>   n1    n2        n
#>   <chr> <chr> <int>
#> 1 ipsum lorem     2
#> 2 dolor est       1
#> 3 dolor ipsum     1
#> 4 dolor lorem     1
#> 5 ipsum john      1
#> 6 ipsum the       1
#> 7 john  the       1
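The counting shown in the example output can be illustrated with a minimal base R sketch. This is not the implementation of cooccur_words(); the helper name sketch_cooccur and its naive sentence splitting are assumptions made only for illustration. It treats each sentence as one unit, lowercases the words, drops stopwords, and counts every unordered pair of distinct words that share a sentence (so self-referential pairs never appear, roughly as with loop = FALSE).

```r
# Hypothetical helper, NOT part of the package: a rough sketch of
# sentence-level word co-occurrence counting in base R.
sketch_cooccur <- function(text, sw = character(0)) {
  # Naive sentence split: one or more of . ! ? plus any trailing whitespace
  sentences <- unlist(strsplit(text, "[.!?]+\\s*"))

  pairs <- lapply(sentences, function(s) {
    # Lowercase, split on non-letters, drop empty strings and stopwords
    words <- unlist(strsplit(tolower(s), "[^[:alpha:]]+"))
    words <- sort(unique(words[nzchar(words) & !(words %in% sw)]))
    if (length(words) < 2) return(NULL)
    # All unordered pairs; sorting first puts each pair in alphabetical order
    m <- utils::combn(words, 2)
    data.frame(n1 = m[1, ], n2 = m[2, ], stringsAsFactors = FALSE)
  })
  pairs <- do.call(rbind, Filter(Negate(is.null), pairs))

  # Count how many sentences contain each pair, most frequent first
  out <- aggregate(n ~ n1 + n2, data = cbind(pairs, n = 1L), FUN = sum)
  out <- out[order(-out$n, out$n1, out$n2), ]
  rownames(out) <- NULL
  out
}

txt <- "Lorem Ipsum. The Ipsum John. Dolor est. Lorem Ipsum dolor."
sketch_cooccur(txt)
sketch_cooccur(txt, sw = "the")
```

For the example text, the first call yields the same seven pairs and counts as the tibble above. The second call drops the two pairs that involve "the". The sketch assumes the text contains at least one sentence with two distinct words.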