1 History of Text Analysis

“May we hope that when things come to such a crisis, human labor of the literary sort may be in part superseded by machinery? Machinery has done wonders, and when we think of what literature is becoming, it is certainly to be wished that we could read it by machinery, and by machinery digest it” (Andrew Stauffer, in London’s Daily News, 15 September 1869, as cited by Catherine DeRose.)

Text analysis spans a wide range of fields, including psychology, computer science, information science, linguistics, political science, and sociology. This section presents a timeline of some of the main events in text analysis and in its use with computers. In short, three milestones stand out in this chronology: the first developments at the end of the 19th century, the introduction of the computer, and, more recently, the introduction of artificial intelligence. The chronology is brief and is not meant to be exhaustive; it only marks some points of interest, to give newcomers to the digital humanities and digital sociology a sense of the field.

1.1 Timeline of the history of text analysis

17th century - The Catholic Church analyzes the proportion of printed texts with non-religious content.
1887 Mendenhall analyzes word length:
- MENDENHALL, T. C. The characteristic curves of composition. Science, vol. ns-9, issue 214S, 11 March 1887.
1888 Benjamin Bourdon (1860-1943, psychologist and professor at the Université de Rennes): while researching the expression of emotions through words, he analyzed the Book of Exodus, calculated word frequencies, classified the words, and eliminated the stop words.
- “In 1888, in a research on the expression of emotions through words, Benjamin Bourdon analysed the Exodus of the Bible and calculated the frequencies by rearranging and classifying them, eliminating the stop words” (source; link to the book online).
1888 Friedrich Kaeding (1855-1929) creates frequency indexes for designing stenographic systems (systems of abbreviated writing that allow writing to keep pace with speech).
1897 Wincenty Lutoslawski, who coined the term “stylometry”, publishes his analysis of rare words in Plato’s works.
- LUTOSLAWSKI, W. Origin and Growth of Plato’s Logic, with an account of Plato’s style and of the chronology of his writings. New York: Longmans, Green. 1897. p. 613. link.
1934 Harold Lasswell (1902-1978, political scientist) produces the first keyword count.
1935 The American linguist George Kingsley Zipf (1902–1950) publishes “The psycho-biology of language”, in which he proposes the “principle of relative frequency”, later known as “Zipf’s Law”.
1934 Vygotsky produces the first quantitative analysis of narrative.
1949 Roberto Busa (Jesuit priest) begins the Index Thomisticus project with IBM. Devoted to the works of St. Thomas Aquinas, the project took 34 years, involved about 70 people, and indexed more than 10 million words, covering word indexing, lemmatization, and term frequencies. Techniques developed there were later applied to the Dead Sea Scrolls, in an attempt to fill in missing portions of the text. “The IBM… considered this first enterprise of using a computer for linguistic and lexicographic goals as a pilot-project” (source).
- Busa, R. (1980). “The Annals of Humanities Computing: The Index Thomisticus”. Computers and the Humanities. 14 (2): 83–90. doi:10.1007/BF02403798. ISSN 0010-4817.
- ROCKWELL, Geoffrey; PASSAROTTI, Marco (2019-05-27). “The Index Thomisticus as a Big Data Project”. Umanistica Digitale (5). doi:10.6092/issn.2532-8816/8575. “The Index Thomisticus, itself, is divided into two parts - the indexes and the concordances. The index alphabetically notes each word along with a reference to its distribution and frequency. Besides a general index for the entire study, there is also one for each work. The concordances, on the other hand, list alphabetically all the words and cite every passage in which a word appears.” “Jesuit Father Uses Computer to Analyze Works of St. Thomas Aquinas”. Modern Data, 1973, pp. 41-2.
1950 Gottschalk uses content analysis to trace Freudian themes.
- GOTTSCHALK, Louis A. The Measurement of Psychological States Through the Content Analysis of Verbal Behavior. University of California Press. 1969. 317p.
- Gottschalk-Gleser Content Analysis Method of Measuring the Magnitude of Psychological Dimensions
1950 Alan Turing applies artificial intelligence to texts.
1952 Berelson publishes the first content analysis handbook.
- BERELSON, B. (1952). Content analysis in communication research. New York: Hafner.
1954 First machine translation of text, from Russian into English (Georgetown–IBM experiment).
- HUTCHINS, John. The first public demonstration of machine translation: the Georgetown-IBM system, 7th January 1954. 2006.
- IBM press release
1963 Mosteller and Wallace analyze the authorship of the Federalist Papers.
- MOSTELLER, F.; WALLACE, D. L. 1963. Inference in an authorship problem. Journal of the American Statistical Association 58:275–309.
1965 Tomashevsky formalizes the quantitative analysis of narrative.
- TOMASHEVSKY, B. (1965). Thematics. In L. T. Lemon & M. J. Reis (Eds. & Trans.), Russian formalist criticism: Four essays (pp. 61-95). Lincoln: University of Nebraska Press. (Original work published 1925)
1966 Stone and Bales use a computer to measure psychometric properties of texts at RAND.
1980 Decline of Chomskyan formalism; birth of Natural Language Processing (NLP).
1980 Machine learning is applied to Natural Language Processing.
1981 Walter Weintraub counts parts of speech.
- WEINTRAUB, Walter. Verbal Behavior: adaptation and psychopathology. Springer:NY. 1981.
- SOBEL, Dava. Language patterns reveal problems in personality. NYT. Oct. 27, 1981.
1985 Schrodt introduces automated event coding.
- SCHRODT, Philip A. Automated Coding Of International Event Data Using Sparse Parsing Techniques. 2000.
1986 James W. Pennebaker develops LIWC (Linguistic Inquiry and Word Count).
1989 Roberto Franzosi (sociologist; ResearchGate profile) brings Quantitative Narrative Analysis to the social sciences.
1998 First development of topic models.
1998 John W. Mohr conducts the first quantitative analysis of worldviews.
1999 Peter Bearman (sociologist) et al. apply network methods to narratives (“narrative networks”).
2001 David M. Blei et al. develop LDA (Latent Dirichlet Allocation).
- David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research 3 (2003) 993-1022.
- Blei Lab GitHub.
2003 MALLET (MAchine Learning for LanguagE Toolkit), one of the first topic modeling systems, is created.
- “MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text. MALLET includes sophisticated tools for document classification: efficient routines for converting text to “features”, a wide variety of algorithms (including Naïve Bayes, Maximum Entropy, and Decision Trees), and code for evaluating classifier performance using several commonly used metrics.”
2005 Quinn et al. analyze political speeches using topic models.
- Kevin M. Quinn, Burt L. Monroe, Michael Colaresi, Michael H. Crespin, and Dragomir R. Radev. 2010. How to analyze political attention with minimal assumptions and costs. American Journal of Political Science 54(1): 209–228.
2010 Gary King and Daniel Hopkins bring topic models into the mainstream.
- Hopkins, Daniel, and Gary King. 2010. “A Method of Automated Nonparametric Content Analysis for Social Science.” American Journal of Political Science 54(1): 229–47.
2014 Margaret Roberts et al. develop “Structural Topic Models”.
2014 First workshop on argument mining (also called “argumentation mining”).
- Proceedings of previous editions are available here and here (the “Past Workshops” section gives access to papers from earlier editions of the event, going back to 2014).

Source: expanded version, based in part on the slides and video of “SICSS 2018 - History of Quantitative Text Analysis” and on BROWN, Taylor W., Workshop on automated text analysis, Summer Institute in Computational Social Science, University of Oxford, 2019 (in English, no subtitles; Part 1).

This section is to be expanded later, with a more detailed explanation of some of the examples above.