vignettes/entities_and_relation_extraction.Rmd
entities_and_relation_extraction.Rmd
The package {sto} also provides a set of tools to extract entities and relations from text. It does so by one of the simples form: by using the rule based approach. It contains a rule based extraction algorithm and a rule based relation extraction algorithm. For more advanced NER, there is other packages such as {crfsuite}, {UDpipe} and {spacyR}. This functions do not work so well like the NER of the mentioned packages, but we think that, in some situations, they can be better job than joining unigram, bigrams, trigrams and so on. Because it is a rule based approach, it is very simple to use and also fast. It will also requires a lot of cleaning, but you have the absolute control over which words are extracted and what words are rejected.
In Natural Language Processing, to find proper names or terms that
frequently appears together is called “collocation”, e.g., to find
“United Kingdom”. You can learn more what is collocation and its
statistical details and r function in this article,
and it is also possible to functions like quanteda.textstats::textstat_collocations()
,
TextForecast::get_collocations()
to identify them, but it also will require a lot of data cleaning,
specially if you want is proper names.
How it works?
The function captures all words that:
- begins with uppercase,
- followed by other uppercases, lowercase or numbers, without white
space
- it can contain symbols like _
, -
or
.
.
In this way, words like “Covid-19” are also captured In languages such as English and Portuguese, it extracts proper names. In German, it also extracts nouns.
There is some trade-off, of course. It will capture a lot of
undesired words and will demand posterior cleaning, like:
- It does not contain any sort of built-in classification.
- “Obama Chief of Staff Rahm Emanuel” will be captured as one entity,
what is not wrong et al, but maybe not what was expect.
The downsides are
It will not capture: - entities that begin with lowercase
In my experience, this approach works better to certain types of text than others. Books, formal articles can be a good option that works well with this function. Text from social media, because it lacks formalities of the language and have a lot of types, it will not work so well.
So, let’s extract some proper names from a simple text:
"John Does lives in New York in United States of America." |> extract_entity()
#> [1] "John Does" "New York"
#> [3] "United States of America"
Or it is possible to use other languages, specifying the parameter
connectors
using the function
connectors(lang)
. Checking the connectors:
connectors("eng")
#> [1] "of" "of the"
connectors("pt")
#> [1] "da" "das" "de" "do" "dos"
connectors("port")
#> [1] "da" "das" "de" "do" "dos"
# by default, the functions uses the parameter "misc". meaning "miscellaneous".
connectors("misc")
#> [1] "of" "the" "of the" "von" "van" "del"
Using with other languages:
"João Ninguém mora em São José do Rio Preto. Ele esteve antes em Sergipe" |>
extract_entity(connect = connectors("pt"))
#> [1] "João Ninguém" "São José do Rio Preto" "Ele"
#> [4] "Sergipe"
vonNeumann_txt <- "John von Neumann (/vɒn ˈnɔɪmən/ von NOY-mən; Hungarian: Neumann János Lajos [ˈnɒjmɒn ˈjaːnoʃ ˈlɒjoʃ]; December 28, 1903 – February 8, 1957) was a Hungarian and American mathematician, physicist, computer scientist and engineer"
vonNeumann_txt |> extract_entity()
#> [1] "John von Neumann" "NOY-" "Hungarian"
#> [4] "Neumann János Lajos" "December" "February"
#> [7] "Hungarian" "American"
It is possible to extract a graph from the extracted entities. First,
happens the tokenization by sentence or paragraph. Than, the entities
are extracted using extract_entity()
. Than a data frame
with the co-occurrence of words in sentences or paragraph is build.
vonNeumann_txt |> extract_graph()
#> Tokenizing by sentences
#> # A tibble: 27 × 2
#> n1 n2
#> <chr> <chr>
#> 1 John von Neumann NOY-
#> 2 John von Neumann Hungarian
#> 3 John von Neumann Neumann János Lajos
#> 4 John von Neumann December
#> 5 John von Neumann February
#> 6 John von Neumann Hungarian
#> 7 John von Neumann American
#> 8 NOY- Hungarian
#> 9 NOY- Neumann János Lajos
#> 10 NOY- December
#> # ℹ 17 more rows
This process can take a while to run if the text/corpus is big. So, if you are interested only in some words, so first of all, filter the sentences/paragraphs with the desired words, and after that, extract the graph. Seeing another example, extracting from a wikipedia article:
page <- "https://en.wikipedia.org/wiki/GNU_General_Public_License" |> rvest::read_html()
text <- page |>
rvest::html_nodes("p") |>
rvest::html_text()
# looking at the scraped text:
text[1:2] # seeing the head of the text
#> [1] "\n\n"
#> [2] "The GNU General Public Licenses (GNU GPL or simply GPL) are a series of widely used free software licenses, or copyleft licenses, that guarantee end users the freedoms to run, study, share, and modify the software.[7] The license was the first copyleft for general use and was originally written by Richard Stallman, the founder of the Free Software Foundation (FSF), for the GNU Project. The license grants the recipients of a computer program the rights of the Free Software Definition.[8] The licenses in the GPL series are all copyleft licenses, which means that any derivative work must be distributed under the same or equivalent license terms. It is more restrictive than the Lesser General Public License and even further distinct from the more widely-used permissive software licenses such as BSD, MIT, and Apache.\n"
And now extracting the graphs:
g <- text |> extract_graph(sw = gen_stopwords("en", include = "The This It"))
#> Tokenizing by sentences
g
#> # A tibble: 3,669 × 2
#> n1 n2
#> <chr> <chr>
#> 1 The GNU General Public Licenses GNU GPL
#> 2 The GNU General Public Licenses GPL
#> 3 The GNU General Public Licenses Richard Stallman
#> 4 The GNU General Public Licenses Free Software Foundation
#> 5 The GNU General Public Licenses FSF
#> 6 The GNU General Public Licenses GNU Project
#> 7 The GNU General Public Licenses Free Software Definition
#> 8 The GNU General Public Licenses GPL
#> 9 The GNU General Public Licenses Lesser General Public License
#> 10 The GNU General Public Licenses BSD
#> # ℹ 3,659 more rows
g_N <- g |> dplyr::count(n1, n2, sort = T)
g_N
#> # A tibble: 2,412 × 3
#> n1 n2 n
#> <chr> <chr> <int>
#> 1 GPLv3 GPLv2 24
#> 2 Artifex Hancom 19
#> 3 Ghostscript Hancom 19
#> 4 Linux GPL 16
#> 5 GPL FSF 14
#> 6 GPL-licensed GPL 14
#> 7 GPL GPLv3 13
#> 8 Artifex Ghostscript 12
#> 9 However GPL 12
#> 10 FSF GPL 11
#> # ℹ 2,402 more rows
plot_graph(text, g_N)
To plot an interactive graph, it is possible to use {networkD3}:
g_N |>
head(100) |> # to reduce the amount of nodes and edges in the graph
networkD3::simpleNetwork(
height = "10px", width = "30px",
linkDistance = 50,
fontSize = 16
)
Another text example.
page <- "https://en.wikipedia.org/wiki/Hurricane_Milton" |> rvest::read_html()
text <- page |>
rvest::html_nodes("p") |>
rvest::html_text()
text[1:2] # seeing the head of the tex
#> [1] "\n" "\n\n"
g <- text |> extract_graph(sw = gen_stopwords("en", include = "The This It"))
#> Tokenizing by sentences
g_N <- g |> dplyr::count(n1, n2, sort = T)
plot_graph(text, g_N, head_n = 50)
To plot an interactive graph, it is possible to use {networkD3}:
g_N |>
head(100) |> # to reduce the amount of nodes and edges in the graph
networkD3::simpleNetwork(
height = "10px", width = "30px",
linkDistance = 50,
fontSize = 16
)