Cool Analytics: Using Word Analysis to Forecast World Events
Posted: February 26, 2012 Filed under: Measurement and Analytics | Tags: Black Belt, Culturomics, Egypt uprising, geographic landmark, geographic reference, global news tone, Kalev Leetaru, Lean Six Sigma, Regressive Imagery Dictionary, sentiment analysis, statistical thinking, Summary of World Broadcasts, WordNet, world broadcasts, Z score
Global geocoded tone of all Summary of World Broadcasts content 1979-2010. Click on the image to view the GIF animation.
Above is an animated GIF. Clicking the image runs an animation depicting a spatial analysis of key words in the Summary of World Broadcasts between 1979 and 2010.
In a previous post (Sentiment Analysis), I mentioned Kalev Leetaru as one of the leaders in sentiment analysis. Here he shows how analysts track what is said around the world and which places are mentioned most often.
According to Leetaru, “location is a critical component of the news, with a typical news article averaging one location mention every 200–300 words. The Summary of World Broadcasts, with 1.2 billion words from 1979–2010, has 201,000 unique locations mentioned roughly 5.81 million times, or around one geographic reference every 215 words.
Each city or other geographic landmark (such as islands, oceans, mountains, rivers, etc) is color–coded on a 400–point scale from bright green (high positivity) to bright red (high negativity), based on the average tone of all articles mentioning that city. Each article mentioning two or more cities together results in a link being drawn between those cities, and the average tone of all articles mentioning both cities is used to color–code that link on the same color scale as the cities.”
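To make the color coding and linking described above more concrete, here is a minimal sketch in Python. It is not Leetaru's code. The tone range of -1 to 1, the function names, and the sample data layout are all my own assumptions for illustration; only the 400-point red-to-green scale and the idea of averaging tone per city and per city pair come from the quote.

```python
from collections import defaultdict
from itertools import combinations


def tone_to_rgb(tone, lo=-1.0, hi=1.0, steps=400):
    """Map a tone value to (r, g, b) on a red-to-green scale with `steps` levels.

    The [lo, hi] tone range is an assumption; the source only gives the scale size.
    """
    # Clamp into range, then quantize to one of `steps` discrete levels.
    clamped = max(lo, min(hi, tone))
    level = round((clamped - lo) / (hi - lo) * (steps - 1))
    frac = level / (steps - 1)  # 0.0 = most negative, 1.0 = most positive
    return (round(255 * (1 - frac)), round(255 * frac), 0)


def _mean(values):
    return sum(values) / len(values)


def city_and_link_tones(articles):
    """articles: iterable of (tone, set_of_cities) pairs (hypothetical layout).

    Returns (average tone per city, average tone per pair of co-mentioned cities).
    """
    city_tones = defaultdict(list)
    link_tones = defaultdict(list)
    for tone, cities in articles:
        for city in cities:
            city_tones[city].append(tone)
        # Every pair of cities in the same article contributes to one link.
        for pair in combinations(sorted(cities), 2):
            link_tones[pair].append(tone)
    return (
        {city: _mean(tones) for city, tones in city_tones.items()},
        {pair: _mean(tones) for pair, tones in link_tones.items()},
    )
```

Each city's average tone, and each link's average tone, would then be passed through `tone_to_rgb` to pick its color on the map.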
In an article published on the site First Monday, Leetaru describes what he terms Culturomics, an interesting blend of history, big data analysis, and societal observation:
The emerging field of “Culturomics” seeks to explore broad cultural trends through the computerized analysis of vast digital book archives, offering novel insights into the functioning of human society. Yet, books represent the “digested history” of humanity, written with the benefit of hindsight. People take action based on the imperfect information available to them at the time, and the news media captures a snapshot of the real–time public information environment.
News contains far more than just factual details: an array of cultural and contextual influences strongly impact how events are framed for an outlet’s audience, offering a window into national consciousness. A growing body of work has shown that measuring the “tone” of this real–time consciousness can accurately forecast many broad social behaviors, ranging from box office sales to the stock market itself.
Can the public tone of global news data forecast even broader behaviors, such as the stability of nations, the location of terrorist leaders, or even offer new insight on conflict and cooperation among countries, as accurately as it predicts movie sales or stock movements? This study makes use of a 30–year translated archive of news reports from nearly every country of the world, applying a range of computational content analysis approaches including tone mining, geocoding, and network analysis, to present “Culturomics 2.0.”
The traditional Culturomics approach treats every word or phrase as a generic object with no associated meaning and measures only the change in the frequency of its usage over time. The Culturomics 2.0 approach introduced in this paper focuses on extending this model by imbuing the system with higher–level knowledge about each word, specifically focusing on “news tone” and geographic location, given their importance to the understanding of news coverage. Translating textual geographic references into mappable coordinates and quantifying the latent “tone” of news into computable numeric data permits an entirely new class of research questions to be explored via the news media not possible through the traditional frequency count approach.
This study will explore how the latent tone of a large digital news archive can be visualized to understand macro–level changes in global society in both time and space. Measuring the tone of news coverage about a single geography over time, a fundamentally new approach to conflict early warning is developed that “passively crowdsources” the global mood about each country in the world. This is found to offer highly accurate short–term forecasts of national stability.
One of the key data sources for this analysis is a news gathering service formed in the days leading up to World War II called the Summary of World Broadcasts (SWB):
Recognizing the need for on–the–ground insights into the reaction of local media around the world in the lead-up to World War II, the U.S. and British intelligence communities formed the Foreign Broadcast Information Service (FBIS — now the Open Source Center) and Summary of World Broadcasts (SWB) global news monitoring services, respectively. Tasked with monitoring how media coverage “varied between countries, as well as from one show to another within the same country … the way in which specific incidents were reported … [and] attitudes toward various countries,” the services transcribe and translate a sample of all news globally each day.
The services work together to capture the “full text and summaries of newspaper articles, conference proceedings, television and radio broadcasts, periodicals, and non–classified technical reports” in their native languages in over 130 countries…In fact, news monitoring, or “open source intelligence,” now forms such a critical component of the intelligence apparatus that a 2001 Washington Post article noted “so much of what the CIA learns is collected from newspaper clippings that the director of the agency ought to be called the Pastemaster General.”
Leetaru makes an important distinction in his paper between straight word count and what he terms tone:
Sentiment mining counts up the number of words in a document that appear in precompiled dictionaries of “positive” and “negative” words to determine the density of emotional language and its overall “tone.” A document with many words like “terrible”, “awful”, and “horrific”, and few words like “good” or “nice” would be given a highly negative score by the algorithm, while one with more positive language would be given a more positive score. While not as accurate as humans, automated sentiment mining systems are robust enough that they are now used by most large companies to monitor the online discourse about their products and learn which areas consumers like and dislike.
It is important to note that computer–based tone scores capture only the overall language used in a news article, which is a combination of both factual events and their framing by the reporter. A classic example of this is a college football game: the hometown papers of both teams will report the same facts about the game, but the winning team’s paper will likely cast the game as a positive outcome, while the losing team’s paper will have a more negative take on the game, yielding insight into their respective views towards it. Capturing the global reaction to a political event requires precisely this type of composite tonal measure that emphasizes the editorial framing of the event’s significance and tone.
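The dictionary-counting idea in the passage above can be sketched in a few lines of Python. This is a toy illustration, not the scoring used in the paper: the two tiny word lists reuse the example words from the quote, and the formula (positive matches minus negative matches, divided by total words) is my own assumption about one reasonable way to express "density of emotional language."

```python
import re

# Toy dictionaries built from the example words in the quote; real
# sentiment dictionaries contain thousands of entries.
POSITIVE = {"good", "nice"}
NEGATIVE = {"terrible", "awful", "horrific"}


def tone_score(text):
    """Return (positive matches - negative matches) / total words.

    The result is below zero for mostly negative language, above zero for
    mostly positive language, and 0.0 for empty text.
    """
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    positive = sum(word in POSITIVE for word in words)
    negative = sum(word in NEGATIVE for word in words)
    return (positive - negative) / len(words)
```

In the football example, the losing team's paper would score lower than the winning team's paper on the same game, even though both report the same facts.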
One of the fascinating aspects of this area of work is the methods used to determine “tone.” Leetaru indicates that several instruments were evaluated including:
Regressive Imagery Dictionary: The Regressive Imagery Dictionary is a content analysis coding scheme designed to measure primordial vs. conceptual thinking. Conceptual thought is abstract, logical, reality oriented, and aimed at problem solving. Primordial thought is associative, concrete, and takes little account of reality. It is the type of thinking found in fantasy, reverie, and dreams. A running tally of category occurrences is kept, and final output consists of the categories and the percentage of words in a document that were assigned to each category. This output can then be subjected to statistical analyses.
WordNet: WordNet is a lexical database of English nouns, verbs, adjectives and adverbs grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. WordNet’s structure makes it a useful tool for computational linguistics and natural language processing. WordNet superficially resembles a thesaurus, in that it groups words together based on their meanings. However, there are some important distinctions. First, WordNet interlinks not just word forms—strings of letters—but specific senses of words.
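The tallying step described for the Regressive Imagery Dictionary (count category hits, then report each category as a percentage of the words in the document) might look something like the sketch below. The two category word lists are invented placeholders, not entries from the actual dictionary, and the function name is hypothetical.

```python
import re
from collections import Counter

# Placeholder categories for illustration only; these word lists are
# NOT taken from the real Regressive Imagery Dictionary.
TOY_CATEGORIES = {
    "primordial": {"dream", "sea", "fire", "sleep"},
    "conceptual": {"plan", "solve", "reason", "measure"},
}


def category_percentages(text, categories=TOY_CATEGORIES):
    """Return the percentage of words in `text` assigned to each category."""
    words = re.findall(r"[a-z']+", text.lower())
    tally = Counter()
    # Keep a running tally of category occurrences, one word at a time.
    for word in words:
        for name, vocabulary in categories.items():
            if word in vocabulary:
                tally[name] += 1
    total = len(words)
    return {
        name: (100.0 * tally[name] / total if total else 0.0)
        for name in categories
    }
```

The resulting percentages are the kind of per-document output that can then be fed into statistical analysis.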
One thing that caught my eye, especially since I was recently coaching some Black Belts on basic statistical concepts including the Z-score, was how Leetaru converted his analysis into deviations on a Z-score run chart. Below is his analysis of Egypt.
On 25 January 2011, popular dissent with the Egyptian state culminated in mass protests that continued through President Mubarak’s resignation on 11 February. [The figure below] shows the average tone by month from January 1979 to March 2011 of all 52,438 articles captured by SWB mentioning an Egyptian city anywhere in the article. Only articles explicitly mentioning an Egyptian city were included to filter out casual references to Egypt to return only articles reporting on the country in more detail.
To normalize the data, the Y axis reports the number of standard deviations from the mean, with higher numbers indicating greater positivity and lower numbers indicating greater negativity. January 2011 reports only the tone for 1 January through 24 January, capturing the period immediately preceding the protests. Only twice in the last 30 years has the global tone about Egypt dropped more than three standard deviations below average: January 1991 (the U.S. aerial bombardment of Iraqi troops in Kuwait) and 1–24 January 2011, ahead of the mass uprising. The only other period of sharp negative moment was March 2003, the launch of the U.S. invasion of neighboring Iraq.

Tone of coverage mentioning Egypt, Summary of World Broadcasts January 1979–March 2011 (January 2011 is 1–24 January). Y axis is Z–scores (standard deviations from mean).
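For anyone who wants to try the same normalization on their own data, here is a minimal Python sketch of the steps described above: average tone by month, convert each month to a Z-score (z = (x - mean) / standard deviation), and flag months more than three standard deviations below average. The data layout and function names are my assumptions, and I use the population standard deviation; the paper does not say which form was used.

```python
from collections import defaultdict
from statistics import mean, pstdev


def monthly_average(articles):
    """articles: iterable of (month, tone) pairs, e.g. ("2011-01", -0.3)."""
    by_month = defaultdict(list)
    for month, tone in articles:
        by_month[month].append(tone)
    return {month: mean(tones) for month, tones in by_month.items()}


def z_scores(monthly_tone):
    """Convert each month's average tone to standard deviations from the mean."""
    values = list(monthly_tone.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return {month: 0.0 for month in monthly_tone}
    return {month: (tone - mu) / sigma for month, tone in monthly_tone.items()}


def flag_months(scores, threshold=-3.0):
    """Return the months whose Z-score falls below `threshold`, sorted."""
    return sorted(month for month, z in scores.items() if z < threshold)
```

Plotting the output of `z_scores` in date order gives a run chart like the one above, and `flag_months` picks out the unusually negative months.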