02 Rule-based entity extraction with regular expressions • Networds

Entity extraction and its relations

The package {networds} provides a set of tools to extract entities and relations from text. The tool described in this article uses one of the simplest forms: the rule-based approach. It contains a rule-based entity extraction algorithm and a rule-based relation extraction algorithm. A similar package is textnets, by Chris Bail, which captures proper names using UDPipe, plots word networks and calculates centrality measures for the words in the network. To extract entities based on part-of-speech tagging, see the other functions of this package.

The functions of {networds} based on the co-occurrence of words and word patterns do not work as well as NER, but we think that, in some situations, they can do a better job than the traditional approach of joining unigrams, bigrams, trigrams and so on. Because it is a rule-based approach, it is very simple to use, needs fewer dependencies and runs fast (or at least less slowly). It also requires a lot of post-cleaning, but you have absolute control over which words are extracted and which are rejected.

In Natural Language Processing, finding proper names or terms that frequently appear together is called “collocation”, e.g., finding “United Kingdom”. You can learn more about collocation, its statistical details and R functions in this article. It is also possible to use functions like quanteda.textstats::textstat_collocations() or TextForecast::get_collocations() to identify collocations, but these also require a lot of data cleaning, especially if what you want is proper names.
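
To make the idea of collocation concrete, here is a minimal base R sketch that simply counts adjacent word pairs (bigrams) in a toy text. The sentences and object names are invented for illustration, and raw counting is only a rough stand-in for the statistical scoring done by functions such as textstat_collocations().

```r
# Toy corpus (invented sentences, for illustration only)
txt <- c(
  "The United Kingdom left the union.",
  "She moved to the United Kingdom last year.",
  "The kingdom was united once."
)

# Remove punctuation and split each sentence into words
tokens_per_doc <- strsplit(gsub("[[:punct:]]", "", txt), "\\s+")

# Build the adjacent word pairs of each sentence
bigrams <- unlist(lapply(tokens_per_doc, function(tok) {
  if (length(tok) < 2) return(character(0))
  paste(head(tok, -1), tail(tok, -1))
}))

# Count the pairs, most frequent first
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
head(bigram_freq, 3)
```

Because case is preserved, “United Kingdom” is counted separately from “kingdom was” or “was united”, which hints at why capitalization is a useful signal for proper names.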

How does it work?

The function captures all words that:

begin with an uppercase letter,
are followed by other uppercase letters, lowercase letters or numbers, without white space,
may contain symbols like _, - or . (in this way, words like “Covid-19” are captured),
may be joined by a connector specified by the user, like “of” or “of the”, so that words like “United States of America” are also captured as one word.
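
The rules above can be approximated with a single regular expression. The sketch below is only an illustration written in base R: the helper extract_capitalized() and its pattern are hypothetical and are not the regex used inside extract_entity_rb(), so results may differ from the package in edge cases.

```r
# Hypothetical helper: NOT the implementation of extract_entity_rb()
extract_capitalized <- function(text, connectors = c("of the", "of")) {
  # One word: an uppercase letter, then letters/digits,
  # optionally joined by "_", "-" or "." to more letters/digits
  word <- "\\p{Lu}[\\p{L}\\p{N}]*(?:[_.-][\\p{L}\\p{N}]+)*"
  # A connector is optional glue between two capitalized words
  glue <- paste0("(?:\\s+(?:", paste(connectors, collapse = "|"), "))?")
  pattern <- paste0(word, "(?:", glue, "\\s+", word, ")*")
  found <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
  # Join the parts of a compound name with "_"
  gsub("\\s+", "_", found)
}

extract_capitalized("John Does lives in New York.")
extract_capitalized("Covid-19 reached the United States of America.")
```

The connectors are only accepted between two capitalized words, which is why “of” joins “United States” and “America” but a lowercase word such as “in” ends the match.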

In languages such as English and Portuguese, the function extracts proper names. In German, it also extracts nouns. There are some trade-offs, of course. It will capture a lot of undesired words and will demand later cleaning. For example:
- It does not contain any sort of built-in classification.
- “Obama Chief of Staff Rahm Emanuel” will be captured as one entity, which is not wrong at all, but maybe not what was expected.

The downsides

The approach has two main limitations:

It will not capture entities that begin with a lowercase letter.
It does not deal with word ambiguity. For example, does “WHO” refer to a question or to the World Health Organization? Is Washington a person or a place?
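
Both limitations are easy to reproduce with a plain capitalization regex. The snippet below is a hedged illustration in base R; find_caps() is a hypothetical helper, not a function of the package.

```r
# Hypothetical helper: returns every token that starts with an uppercase letter
capitalized_rx <- "\\b\\p{Lu}[\\p{L}\\p{N}]*"
find_caps <- function(x) {
  regmatches(x, gregexpr(capitalized_rx, x, perl = TRUE))[[1]]
}

# 1) An entity written in lowercase is invisible to the rule
find_caps("the world health organization met in geneva")

# 2) The rule sees the same token in both sentences and cannot tell
#    the question word from the organization
find_caps("WHO knows?")
find_caps("The WHO issued a statement")
```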

To overcome these problems, networds has a set of other functions that work with part-of-speech tagging and Named Entity Recognition. If the problems listed above are a big issue in your case, take a look at the next section, “Extract entity co-ocurrences with POS”.

Conclusion

In my experience, the regex approach of this section works better for certain types of text than for others. Well-formatted text, like books and formal articles, works well with this function. Text from social media works less well, because it lacks the formalities of the language, has a lot of typos, and is sometimes missing uppercase letters or written entirely in uppercase.

Using networds: extracting entities with regex

After installing the package (see Readme), load it:

library(networds)

So, let’s extract some proper names from a simple text:

"John Does lives in New York." |> extract_entity_rb()
#> [1] "John_Does" "New_York"

It is also possible to use other languages by passing the output of the function connectors(lang) to the parameter connect. Checking the connectors:

connectors("eng")
#> $en
#> [1] "of"     "of the"
# for English, you can also use this to get the same result:
connectors("en")
#> $en
#> [1] "of"     "of the"
# for Portuguese
connectors("pt")
#> $pt
#> [1] "da"  "das" "de"  "do"  "dos"
# to get the same result:
connectors("port")
#> $pt
#> [1] "da"  "das" "de"  "do"  "dos"

# by default, the functions use the parameter "misc", meaning "miscellaneous".
connectors("misc")
#> $misc
#> [1] "of"     "the"    "of the" "von"    "van"    "del"    "de"
connectors("all")
#>  [1] "da"     "das"    "de"     "del"    "di"     "do"     "dos"    "of"    
#>  [9] "of the" "the"    "van"    "von"

Using with other languages:

"João Ninguém mora em São José do Rio Preto. Ele esteve antes em Sergipe" |>
  extract_entity_rb(connect = connectors("pt"))
#> [1] "João_Ninguém" "São_José"     "Rio_Preto"    "Ele"          "Sergipe"

vonNeumann_txt <- "John von Neumann (/vɒn ˈnɔɪmən/ von NOY-mən; Hungarian: Neumann János Lajos [ˈnɒjmɒn ˈjaːnoʃ ˈlɒjoʃ]; December 28, 1903 – February 8, 1957) was a Hungarian and American mathematician, physicist, computer scientist and engineer"
vonNeumann_txt |> extract_entity_rb()
#> [1] "John"                "Neumann"             "NOY-"               
#> [4] "Hungarian"           "Neumann_János_Lajos" "December"           
#> [7] "February"            "Hungarian"           "American"

Extracting a graph

It is possible to extract a graph from the extracted entities. First, the text is tokenized by sentence or paragraph. Then, the entities are extracted using extract_entity_rb(). Finally, a data frame with the co-occurrence of words in sentences or paragraphs is built.

vonNeumann_txt |> extract_graph_rb()
#> Tokenizing by sentences
#> # A tibble: 31 × 3
#>    n1        n2            n
#>    <chr>     <chr>     <int>
#>  1 Hungarian American      2
#>  2 John      Hungarian     2
#>  3 NOY-      Hungarian     2
#>  4 Neumann   Hungarian     2
#>  5 December  American      1
#>  6 December  February      1
#>  7 December  Hungarian     1
#>  8 February  American      1
#>  9 February  Hungarian     1
#> 10 Hungarian December      1
#> # ℹ 21 more rows
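
The three steps (tokenize, extract, count co-occurrences) can be sketched in base R as follows. This is a simplified illustration under our own assumptions, not the code of extract_graph_rb(): the regex is a reduced version of the rule, the toy text is invented, and each pair is counted once per sentence in alphabetical order, whereas the package output above also lists pairs in both directions.

```r
# Toy text (invented, for illustration only)
text_example <- paste(
  "Ada Lovelace worked with Charles Babbage.",
  "Charles Babbage designed the Analytical Engine.",
  "Ada Lovelace wrote about the Analytical Engine.",
  "Charles Babbage praised Ada Lovelace."
)

# Step 1: tokenize by sentence
sentences <- strsplit(text_example, "(?<=[.!?])\\s+", perl = TRUE)[[1]]

# Step 2: extract capitalized sequences in each sentence (simplified rule)
entity_rx <- "\\p{Lu}[\\p{L}\\p{N}]*(?:\\s+\\p{Lu}[\\p{L}\\p{N}]*)*"
pairs_per_sentence <- lapply(sentences, function(s) {
  ents <- regmatches(s, gregexpr(entity_rx, s, perl = TRUE))[[1]]
  ents <- sort(unique(gsub("\\s+", "_", ents)))
  if (length(ents) < 2) return(NULL)
  pairs <- t(combn(ents, 2)) # every unordered pair in the sentence
  data.frame(n1 = pairs[, 1], n2 = pairs[, 2], stringsAsFactors = FALSE)
})

# Step 3: count how often each pair co-occurs
all_pairs <- do.call(rbind, Filter(Negate(is.null), pairs_per_sentence))
all_pairs$n <- 1L
edges <- aggregate(n ~ n1 + n2, data = all_pairs, FUN = sum)
edges <- edges[order(-edges$n), ]
edges
```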

One of the parameters of extract_graph_rb() is sw, which stands for “stopwords”. It accepts a vector of stopwords.

my_sw <- c(stopwords::stopwords(
  language = "en",
  source = "snowball", simplify = TRUE
), "lol")

vonNeumann_txt |> extract_graph_rb(sw = my_sw)
#> Tokenizing by sentences
#> # A tibble: 31 × 3
#>    n1        n2            n
#>    <chr>     <chr>     <int>
#>  1 Hungarian American      2
#>  2 John      Hungarian     2
#>  3 NOY-      Hungarian     2
#>  4 Neumann   Hungarian     2
#>  5 December  American      1
#>  6 December  February      1
#>  7 December  Hungarian     1
#>  8 February  American      1
#>  9 February  Hungarian     1
#> 10 Hungarian December      1
#> # ℹ 21 more rows
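
In the von Neumann example the two outputs are identical, presumably because none of the extracted entities is in the stopword list. The sketch below shows one plausible way such a filter could act on extracted entities; the case-insensitive comparison is our assumption, not a description of how the sw parameter is implemented, and the vectors are invented.

```r
# Entities as a capitalization rule might return them (invented example)
entities <- c("Now", "Congress", "However", "United_States", "Well", "Iraq")

# A small stopword vector (hypothetical)
my_stopwords <- c("now", "however", "well", "the")

# Drop entities whose lowercase form is a stopword
is_stopword <- tolower(entities) %in% tolower(my_stopwords)
kept <- entities[!is_stopword]
kept
```

Sentence-initial words such as “Now” or “However” are capitalized only because of their position, so a stopword list is a cheap way to remove many of them.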

Extracting a graph can take a while to run if the text/corpus is big. So, if you are interested only in some words, first filter the sentences/paragraphs that contain the desired words and then extract the graph. Here is another example, extracting from a Wikipedia article:

page <- "https://en.wikipedia.org/wiki/GNU_General_Public_License" |> rvest::read_html()

text <- page |>
  rvest::html_nodes("p") |>
  rvest::html_text()

# looking at the scraped text:
text[2] # seeing the head of the text
#> [1] "The GNU General Public Licenses (GNU GPL or simply GPL) are a series of widely used free software licenses, or copyleft licenses, that guarantee end users the freedoms to run, study, share, and modify the software.[7] The GPL was the first copyleft license for general use. It was originally written by Richard Stallman, the founder of the Free Software Foundation (FSF), for the GNU Project. The license grants the recipients of a computer program the rights of the Free Software Definition.[8] The licenses in the GPL series are all copyleft licenses, which means that any derivative work must be distributed under the same or equivalent license terms. It is more restrictive than the Lesser General Public License and even further distinct from the more widely-used permissive software licenses such as BSD, MIT, and Apache.\n"
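
Before extracting the graph from a large corpus, the filtering step mentioned earlier can be done with grepl(). This is a minimal sketch on an invented toy vector; the object names paragraphs, keywords and selected are hypothetical.

```r
# Toy paragraphs (invented, for illustration only)
paragraphs <- c(
  "The GPL was written by Richard Stallman.",
  "Copyleft licenses require derivative works to keep the same terms.",
  "The Free Software Foundation published GPLv3 in 2007."
)

# Keep only the paragraphs that mention at least one word of interest
keywords <- c("GPL", "Stallman")
keyword_rx <- paste(keywords, collapse = "|")
selected <- paragraphs[grepl(keyword_rx, paragraphs)]
length(selected)
```

The reduced vector, here selected, can then be piped into extract_graph_rb() in the same way as the full text.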

And now extracting the graphs:

g <- text |> extract_graph_rb(sw = my_sw)
#> Tokenizing by sentences
g
#> # A tibble: 2,136 × 3
#>    n1           n2              n
#>    <chr>        <chr>       <int>
#>  1 GPLv3        GPLv2          26
#>  2 Artifex      Hancom         19
#>  3 Ghostscript  Hancom         19
#>  4 Linux        GPL            16
#>  5 GPL          FSF            14
#>  6 GPL-licensed GPL            14
#>  7 GPL          GPLv3          13
#>  8 Artifex      Ghostscript    12
#>  9 GPL          GPLv2          12
#> 10 However      GPL            12
#> # ℹ 2,126 more rows

To make a quick plot

plot_graph(g)
#> Only one value passed to edge_width. All edges will have the same width.

It is possible to use the frequency of word-pair co-occurrence as the weight in the network plot, shown as the thickness of the line between the nodes. We need at least two arguments: 1) the data frame with the frequencies (in this case, as the words are joined with “_”, we must first replace it with white space); 2) the original text, to count the frequency of the words.

g |>
  dplyr::mutate(
    n1 = gsub(x = n1, "_", " "),
    n2 = gsub(x = n2, "_", " ")
  ) |>
  plot_graph2(text)
#> You provided a vector of 125 elements instead of one. No problem, but these will be collapsed into a single element, with a final punctuation mark added to each.
#> Using node_size proportional to word frequency as no node_size was provided in parameters

To plot the wordcloud network, two parameters are necessary: the original text and the dataframe/tibble as returned by dplyr::count(), with three columns: node 1, node 2 and the weight/frequency.

g |>
  dplyr::mutate(
    n1 = gsub(x = n1, "_", " "),
    n2 = gsub(x = n2, "_", " ")
  ) |>
  net_wordcloud(text)
#> You provided a vector of 125 elements instead of one. These will be collapsed into  a single element, with a final punctuation mark added to each.

The graph conveys two kinds of information:

The size of words and compound words shows the individual frequency of each word/compound word.
The thickness of the links indicates how often the pair occurs together.

We opted for this approach of two parameters, the data frame with weights and the original text, because plotting such networks readably often requires selecting only the most frequent pairs. A word’s individual frequency is not necessarily correlated with its frequency in the graph, so the function calculates the individual frequency of each word from the text. Looking at the frequency dataframe, the user may want to remove some pairs and then plot the graph.
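
To illustrate what “individual frequency” means here, the sketch below counts how many times each node label appears in a text. It is a hypothetical base R helper on an invented toy text, not the counting code used by the plotting functions.

```r
# Hypothetical helper: counts literal occurrences of a term in a text vector
count_term <- function(term, text) {
  full_text <- paste(text, collapse = " ")
  hits <- gregexpr(term, full_text, fixed = TRUE)[[1]]
  if (hits[1] == -1L) 0L else length(hits)
}

# Toy text (invented, for illustration only)
toy_text <- c(
  "The GNU Project wrote the GPL.",
  "The GPL is a copyleft license.",
  "GNU is not Unix."
)

terms <- c("GPL", "GNU", "Linux")
term_freq <- vapply(terms, count_term, integer(1), text = toy_text)
term_freq
```

This is also why the underscores are replaced by white space before plotting: a label such as “New_York” would not be found literally in the original text. Note that this simple substring match would also count “GPL” inside “GPLv3”.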

net_wordcloud uses {ggraph} and ggplot2, so you can change some ggplot settings or add other layers afterwards.

To plot an interactive graph, use plot_graph_i:

g |> plot_graph_i()

It is possible to use {networkD3} package:

g |>
  head(100) |> # to reduce the amount of nodes and edges in the graph
  networkD3::simpleNetwork(
    height = "10px", width = "30px",
    linkDistance = 50,
    fontSize = 16
  )

Another text example.

page <- "https://en.wikipedia.org/wiki/Hurricane_Milton" |> rvest::read_html()

text <- page |>
  rvest::html_nodes("p") |>
  rvest::html_text()

text[2] # seeing the head of the text
#> [1] "Hurricane Milton was an extremely powerful and destructive tropical cyclone which became the second-most intense Atlantic hurricane ever recorded over the Gulf of Mexico, behind only Hurricane Rita in 2005. Milton made landfall on the west coast of the U.S. state of Florida, less than two weeks after Hurricane Helene devastated the state's Big Bend region.[2] The thirteenth named storm, ninth hurricane, fourth major hurricane, and second Category 5 hurricane of the 2024 Atlantic hurricane season, Milton is the strongest tropical cyclone to occur worldwide in 2024 thus far.[3]"

g <- text |> extract_graph_rb(sw = my_sw)
#> Tokenizing by sentences

g |>
  dplyr::mutate(
    n1 = gsub(x = n1, "_", " "),
    n2 = gsub(x = n2, "_", " ")
  ) |>
  net_wordcloud(text, head_n = 50)
#> You provided a vector of 48 elements instead of one. These will be collapsed into  a single element, with a final punctuation mark added to each.

To plot an interactive graph, it is possible to use {networkD3}:

g |>
  head(100) |> # to reduce the amount of nodes and edges in the graph
  networkD3::simpleNetwork(
    height = "10px", width = "30px",
    linkDistance = 50,
    fontSize = 16
  )

Substitutions: replacing node text

When extracting graphs from unstructured text, some synonyms will appear. To replace the text of nodes in the dataframe, use the function graph_subs:

# a test tibble
test_graph <- tibble::tibble(
  n1 = c("A", "B", "A", "C", "B", "Ab", "A", "D"),
  n2 = c("B", "Ab", "B", "D", "C", "A", "C", "D") # Includes a loop (D-D)
)

# dataframe with substitutions
DF_substitution <- tibble::tribble(
  ~col1, ~col2,
  "B", "blah",
  "C", "Capybara"
)

# Doing the substitutions
test_graph |>
  graph_subs(DF_substitution)
#> # A tibble: 8 × 2
#>   n1       n2      
#>   <chr>    <chr>   
#> 1 A        blah    
#> 2 blah     Ab      
#> 3 A        blah    
#> 4 Capybara D       
#> 5 blah     Capybara
#> 6 Ab       A       
#> 7 A        Capybara
#> 8 D        D
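
For readers curious about what such a substitution involves, here is a base R sketch that gives the same result on the same toy data. The helper substitute_nodes() is hypothetical and is not the implementation of graph_subs(); it assumes exact, whole-label matching, which is consistent with “Ab” staying unchanged above.

```r
# Hypothetical helper: replaces node labels by exact match
substitute_nodes <- function(graph, subs) {
  # Named lookup vector: names are the old labels, values are the new ones
  lookup <- setNames(subs[[2]], subs[[1]])
  for (col in c("n1", "n2")) {
    hit <- graph[[col]] %in% names(lookup)
    graph[[col]][hit] <- unname(lookup[graph[[col]][hit]])
  }
  graph
}

toy_graph <- data.frame(
  n1 = c("A", "B", "A", "C", "B", "Ab", "A", "D"),
  n2 = c("B", "Ab", "B", "D", "C", "A", "C", "D"),
  stringsAsFactors = FALSE
)
toy_subs <- data.frame(
  col1 = c("B", "C"),
  col2 = c("blah", "Capybara"),
  stringsAsFactors = FALSE
)

toy_result <- substitute_nodes(toy_graph, toy_subs)
toy_result
```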

Example: using State of the Union data

This example uses the package sotu, which contains the State of the Union addresses. The State of the Union address:

"is an annual message delivered by the president of the United States to a joint session of the United States Congress near the beginning of most calendar years on the current condition of the nation. The speech generally includes reports on the nation's budget, economy, news, agenda, progress, achievements and the president's priorities and legislative proposals."

library(sotu) #  text examples of US presidents speeches

# checking the DF with the speeches
tibble::as_tibble(sotu_meta)
#> # A tibble: 240 × 6
#>        X president          year years_active party       sotu_type
#>    <int> <chr>             <int> <chr>        <chr>       <chr>    
#>  1     1 George Washington  1790 1789-1793    Nonpartisan speech   
#>  2     2 George Washington  1790 1789-1793    Nonpartisan speech   
#>  3     3 George Washington  1791 1789-1793    Nonpartisan speech   
#>  4     4 George Washington  1792 1789-1793    Nonpartisan speech   
#>  5     5 George Washington  1793 1793-1797    Nonpartisan speech   
#>  6     6 George Washington  1794 1793-1797    Nonpartisan speech   
#>  7     7 George Washington  1795 1793-1797    Nonpartisan speech   
#>  8     8 George Washington  1796 1793-1797    Nonpartisan speech   
#>  9     9 John Adams         1797 1797-1801    Federalist  speech   
#> 10    10 John Adams         1798 1797-1801    Federalist  speech   
#> # ℹ 230 more rows

Checking Obama’s speech from the first year of his first term:

# checking which speeches are Obama's
sotu_meta |>
  dplyr::filter(
    grepl("Obama", president, ignore.case = TRUE),
    grepl("2009", years_active)
  )
#>     X    president year years_active      party sotu_type
#> 1 229 Barack Obama 2009    2009-2013 Democratic    speech
#> 2 230 Barack Obama 2010    2009-2013 Democratic    speech
#> 3 231 Barack Obama 2011    2009-2013 Democratic    speech
#> 4 232 Barack Obama 2012    2009-2013 Democratic    speech

# Picking the speech of his first year
text_sotu <- sotu_text[229]

str(text_sotu) # first lines of the text
#>  chr "Madam Speaker, Mr. Vice President, Members of Congress, the First Lady of the United States--she's around here "| __truncated__

# Just as a matter of curiosity, checking the most frequent entities
text_sotu |>
  extract_entity_rb(sw = my_sw) |>
  plyr::count() |>
  dplyr::arrange(-freq) |>
  head(25)
#>                  x freq
#> 1         American   25
#> 2          America   19
#> 3        Americans   14
#> 4              Now   13
#> 5         Congress   10
#> 6           Nation    6
#> 7          Chamber    5
#> 8       Government    4
#> 9             Iraq    4
#> 10       Democrats    3
#> 11        Medicare    3
#> 12       President    3
#> 13     Republicans    3
#> 14   United_States    3
#> 15     Afghanistan    2
#> 16         Already    2
#> 17           Given    2
#> 18             God    2
#> 19        Laughter    2
#> 20 Social_Security    2
#> 21  South_Carolina    2
#> 22           Thank    2
#> 23     Wall_Street    2
#> 24            Well    2
#> 25        Al_Qaida    1

sotu_g_Ob <- extract_graph_rb(text_sotu, sw = my_sw)
#> Tokenizing by sentences

sotu_g_Ob
#> # A tibble: 5,412 × 3
#>    n1        n2            n
#>    <chr>     <chr>     <int>
#>  1 American  America     267
#>  2 America   American    208
#>  3 Now       American    193
#>  4 American  Americans   176
#>  5 Americans American    174
#>  6 Congress  American    171
#>  7 Now       America     157
#>  8 Congress  America     144
#>  9 Americans America     143
#> 10 American  Now         132
#> # ℹ 5,402 more rows

To plot, we’ll use another function, plot_graph2. It works differently from net_wordcloud. Because word frequencies can vary significantly, differences in text size can be substantial. Therefore, instead of adjusting text size, we vary the dot/node size, so the text keeps a consistent size and remains readable.

plot_graph2(
  sotu_g_Ob,
  text_sotu,
  head_n = 70,
  text_size = 1.5, text_contour_color = "white",
  edge_color = "blue", edge_alpha = 0.1,
  # scale_graph = "log2"
) +
  ggplot2::labs(
    title = "Obama SOTU - First Year",
    caption = "The denser the link between two nodes, the more frequently both words occur together. The node size shows how frequent the word is on its own in the text"
  )
#> Using node_size proportional to word frequency as no node_size was provided in parameters
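
To see why frequency differences matter for sizing, the sketch below maps four of the entity counts from the table above to a size range, once linearly and once after a log2 transform. The helper rescale_to() and the size range are our own assumptions for illustration; this is not necessarily how plot_graph2 or its scale_graph option computes sizes.

```r
# Counts taken from the frequency table of the Obama speech above
freq <- c(America = 19, Congress = 10, Iraq = 4, Al_Qaida = 1)

# Hypothetical helper: linear rescaling of x to the interval `to`
rescale_to <- function(x, to = c(1, 10)) {
  rng <- range(x)
  to[1] + (x - rng[1]) / (rng[2] - rng[1]) * (to[2] - to[1])
}

linear_size <- rescale_to(freq)    # proportional to the raw counts
log_size <- rescale_to(log2(freq)) # compresses large differences

round(rbind(linear_size, log_size), 2)
```

With the linear mapping, infrequent words end up close to the minimum size, while the log transform spreads them out. Applying such differences to node size rather than text size keeps every label legible.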

Checking Trump’s speech from the first year of his first term:

# Trump, first term
sotu_meta |>
  dplyr::filter(grepl("Trump", president, ignore.case = TRUE))
#>     X    president year years_active      party sotu_type
#> 1 237 Donald Trump 2017    2016-2020 Republican    speech
#> 2 238 Donald Trump 2018    2016-2020 Republican    speech
#> 3 239 Donald Trump 2019    2016-2020 Republican    speech
#> 4 240 Donald Trump 2020    2016-2020 Republican    speech

sotu_g_Tr <- sotu_text[237] |>
  paste(collapse = " ") |>
  extract_graph_rb(sw = my_sw)
#> Tokenizing by sentences

# the most frequent entities
# (the input here is the graph data frame rather than raw text, hence the coercion warning below)
sotu_g_Tr |>
  extract_entity_rb(sw = my_sw) |>
  plyr::count() |>
  dplyr::arrange(-freq) |>
  head(30)
#> Warning in stri_extract_all_regex(string, pattern, simplify = simplify, :
#> argument is not an atomic vector; coercing
#>                      x freq
#> 1                Thank  378
#> 2              America  374
#> 3             Congress  374
#> 4              Members  374
#> 5             American  368
#> 6               Nation  363
#> 7             Applause  346
#> 8              Finally  342
#> 9            Americans  338
#> 10          Government  312
#> 11          Department  304
#> 12             Tonight  295
#> 13            Laughter  288
#> 14               State  278
#> 15              Canada  254
#> 16           According  249
#> 17           Democrats  228
#> 18         Republicans  228
#> 19             Justice  227
#> 20               Today  219
#> 21           Obamacare  215
#> 22             Joining  214
#> 23               Megan  211
#> 24              Action  208
#> 25             Arizona  208
#> 26 Government-approved  208
#> 27            Kentucky  208
#> 28           Mandating  208
#> 29           One-third  208
#> 30            Remember  208

plot_graph2(
  dplyr::count(sotu_g_Tr, n1, n2, sort = TRUE),
  sotu_text[237],
  head_n = 50,
  edge_color = "red",
  edge_alpha = 0.3,
  # scale_graph = "log2",
  text_size = 2
) +
  ggplot2::labs(
    title = "Trump SOTU - First Year",
    caption = "The denser the link between two nodes, the more often the two words occur together. Node size shows how frequent each word is in the text."
  )
#> Warning: There was 1 warning in `dplyr::mutate()`.
#> ℹ In argument: `n = scale_to_range(n, 0.2, 9)`.
#> Caused by warning in `scale_to_range()`:
#> ! All values are identical. Returning all values as new_min
#> Using node_size proportional to word frequency as no node_size was provided in parameters

Now, comparing speeches on a certain topic

# a regex to capture some words/patterns
term <- "\\bChin|Beijing|Shanghai|\\bXi\\b|Jinping"
term_ <- "China"

# Get the Obama speeches at indices 229 to 234, keeping only the passages that match the query
text_sotu_Ob <- sotu_text[229:234] |>
  filter_by_query(term)

sotu_g_Ob <- text_sotu_Ob |>
  paste(collapse = " ") |>
  extract_graph_rb(sw = my_sw) |>
  filter_graph("China", invert = TRUE) # to avoid a star-shaped graph
#> Tokenizing by sentences

g_Ob <- plot_graph2(
  dplyr::count(sotu_g_Ob, n1, n2, sort = TRUE),
  filter_by_query(sotu_text[229:234], term, unlist = TRUE),
  head_n = 50,
  edge_color = "blue",
  edge_alpha = 0.1,
  text_size = 1.5,
  # scale_graph = "log2"
  layout = "nicely"
) +
  # ggplot2::labs(title= paste("Obama about", term))
  ggplot2::labs(title = "Obama")
#> You provided a vector of 15 elements instead of one. No problem, but these will be collapsed into a single element, with a final punctuation mark added to each.
#> Warning: There was 1 warning in `dplyr::mutate()`.
#> ℹ In argument: `n = scale_to_range(n, 0.2, 9)`.
#> Caused by warning in `scale_to_range()`:
#> ! All values are identical. Returning all values as new_min
#> Using node_size proportional to word frequency as no node_size was provided in parameters


# Trump
text_sotu_Tr <- sotu_text[237:240] |>
  filter_by_query(term, unlist = TRUE) |>
  paste(collapse = " ") |>
  stringr::str_replace_all("(?i)u\\.s\\.a", "USA")

df_count <- text_sotu_Tr |>
  extract_graph_rb(sw = my_sw) |>
  filter_graph("China", invert = TRUE) # to avoid a star-shaped graph
#> Tokenizing by sentences

g_Tr <- plot_graph2(
  df_count,
  text_sotu_Tr,
  head_n = 50,
  edge_color = "red",
  edge_alpha = 0.2,
  text_size = 1.5,
  layout = "nicely",
  scale_graph = "log2"
) +
  # ggplot2::labs(title= paste("Trump about", term))
  ggplot2::labs(title = "Trump")
#> Using node_size proportional to word frequency as no node_size was provided in parameters

# Joining the graphs
library(patchwork)
(g_Ob + g_Tr) +
  plot_annotation(
    title = "Co-occurrence of terms related to China",
    caption = "The denser the link between two nodes, the more often the two words occur together. Node size shows how frequent each word is in the text."
  )
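
The query pattern used above is case sensitive and relies on word boundaries, so it is worth checking what it does and does not match before filtering the speeches. The sketch below uses only base R (`grepl()` with `perl = TRUE`) on a few made-up sentences; it does not call `filter_by_query()`, whose internals may differ, and the example sentences are hypothetical.

```r
# Same pattern as in the chunk above
term <- "\\bChin|Beijing|Shanghai|\\bXi\\b|Jinping"

# Made-up sentences, for illustration only
examples <- c(
  "Trade with China grew.",
  "Chinese steel is cheap.",
  "President Xi spoke.",
  "The Xinjiang region",   # "Xi" is followed by a letter, so \\bXi\\b fails
  "A new machine arrived.", # lowercase "chin" is not matched
  "Shanghai is a port."
)

# TRUE where the pattern is found in the sentence
hits <- grepl(term, examples, perl = TRUE)
data.frame(examples, hits)
```

Only the fourth and fifth sentences are rejected. The fourth one shows a limitation of the pattern: a China-related term that is not listed explicitly is missed.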

Final remarks

As mentioned, this approach based on text patterns (regex) has its limitations. We advise testing different values of head_n: increase it to show more nodes, and decrease it if the graph becomes too cluttered. The graph can help you decide which parts of the text to read more closely in order to infer meaning. It is also possible to use grammatical classification (part-of-speech tagging) to get more precise results. See the next section, “Extract entity co occurrences with POS”.
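
To get a feeling for how the number of nodes grows with head_n, the following minimal sketch uses a toy edge list and base R only. It assumes that head_n keeps the most frequent word pairs, which is our reading of the argument and not a description of how `plot_graph2()` is implemented; the data and the helper `nodes_kept()` are hypothetical.

```r
# Toy edge list (hypothetical data, not taken from the speeches)
edges <- data.frame(
  n1 = c("America", "America", "Congress", "America", "Canada", "Congress"),
  n2 = c("Congress", "Congress", "Nation", "Nation", "Mexico", "Nation"),
  stringsAsFactors = FALSE
)

# Count how often each pair occurs, most frequent first
edge_counts <- aggregate(n ~ n1 + n2, data = transform(edges, n = 1L), FUN = sum)
edge_counts <- edge_counts[order(-edge_counts$n), ]

# Number of distinct nodes left when only the top k pairs are kept
# (assumption: this mimics the effect of head_n)
nodes_kept <- function(k) {
  top <- head(edge_counts, k)
  length(unique(c(top$n1, top$n2)))
}

node_counts <- sapply(c(1, 2, 4), nodes_kept)
node_counts
```

With this toy data, keeping 1, 2 and 4 pairs leaves 2, 3 and 5 nodes. The same kind of check on a real edge list can help choose a head_n that is informative without being cluttered.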