A rule-based entity extractor that pulls entities out of a text with a regular expression. The regex captures words written entirely in uppercase and words that begin with an uppercase letter. When several such words appear in sequence, the function captures them together as a single entity. For proper names joined by common lowercase connectors, such as "Wwwww of Wwwww", it also captures the connector and the uppercase words that follow it.

extract_entity(text, connect = connectors("misc"), sw = "the")

Arguments

text

an input text

connect

a vector of lowercase connectors. Supply your own, or use the function "connectors" to obtain some predefined patterns.

sw

a vector of stopwords
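
To make the matching rule and the two optional arguments more concrete, here is a minimal base R sketch of the idea. It is not the package's implementation: the name `sketch_extract_entity`, the exact pattern, and the way `sw` is applied (dropping matches that are exactly a stopword, ignoring case) are assumptions made for illustration. It also assumes the connectors contain no regex metacharacters and that words contain letters only.

```r
# Hypothetical sketch, NOT the code behind extract_entity().
sketch_extract_entity <- function(text, connect = "of", sw = "the") {
  # A capitalized word: one uppercase letter followed by any letters.
  word <- "[[:upper:]][[:alpha:]]*"
  # Alternation of the lowercase connectors, e.g. "of|da|do".
  conn <- paste(connect, collapse = "|")
  # A capitalized word, then any run of further capitalized words,
  # each optionally preceded by one connector.
  pattern <- sprintf("%s(?:\\s+(?:(?:%s)\\s+)?%s)*", word, conn, word)
  hits <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
  # Assumption: a match that is exactly a stopword is discarded.
  hits[!tolower(hits) %in% tolower(sw)]
}

sketch_extract_entity("John Does lives in New York in United States of America.")
#> [1] "John Does"                "New York"
#> [3] "United States of America"
```

Because the connector is only accepted between two capitalized words, a lowercase "of" in ordinary prose does not start or end a match in this sketch.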

Examples

"John Does lives in New York in United States of America." |> extract_entity()
#> [1] "John Does"                "New York"                
#> [3] "United States of America"
"João Ninguém mora em São José do Rio Preto. Ele esteve antes em Sergipe" |> extract_entity(connect = connectors("pt"))
#> [1] "João Ninguém"          "São José do Rio Preto" "Ele"                  
#> [4] "Sergipe"