Culturomics: What We Can Learn from 5 Million Books


How to put your “beft” foot forward, or what the algorithm of censorship has to do with 1950.

We’ve already established that we could learn a remarkable amount about language from these 5 essential books, but imagine what we could learn from 5 million books. In this excellent talk from TEDxBoston, Harvard scientists Jean-Baptiste Michel and Erez Lieberman Aiden reveal fascinating insights from the computational tool that inspired Google Labs’ addictive Ngram Viewer. The tool draws on a database of 500 billion words culled from 5 million books spanning many centuries, about 4% of the books that have ever been published.

They call their approach Culturomics — “the application of massive scale data collection and analysis to the study of human culture.” From advising you on the best career choices for early success to figuring out when an artist is being censored to proving that we’re forgetting the past exponentially more quickly than ever before, the data speaks volumes when queried with intelligence and curiosity.

“[The database pulls from] a collection of 5 million books. 500 billion words. A string of characters a thousand times longer than the human genome. A text which, when written out, would stretch from here to the moon and back ten times over. A veritable shard of our cultural genome.”

