2011.03.01. 1:47

The largest database ever created lets us look back at least 200 years to see who and what was written about in books, and when. The searchable "cultural genome" of mankind was unveiled on Thursday; we used it to look for traces of Puskas, Neumann, the Rubik's Cube and the Treaty of Trianon, among others.

Harvard University, Google and Encyclopaedia Britannica have launched an unprecedented scientific project. The data mining and data analysis methods customary in the natural sciences have been extended to a field that tells us about human culture and language.

Searching in more than 5 million books

Text available on the Internet is easy to search for words, names and idioms and, more recently, for deeper contexts and trends. However, most of the content produced by mankind has not been digitized so far: that is, books, the Gutenberg Galaxy. It hardly needs explaining why it would be a fantastic opportunity to search this body of text just as easily.

Researchers at Harvard University, together with experts from Google and Encyclopaedia Britannica, have taken the first step in this direction. More than 5 million books published since the beginning of the 1800s (approximately 500 billion words) have been made searchable. This is about 4% of all books ever published and, according to the researchers, a database large enough to be analyzed with methods like those applied to the human genome. Hence the terms "cultural genome" and "culturomics", coined by analogy with "genomics".
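The figures in this paragraph can be checked with a few lines of arithmetic. The sketch below is only an illustration: the total of 129 million published books is taken from the caption at the end of this article, and the variable names are our own.

```python
# Figures quoted in this article.
BOOKS_ANALYZED = 5_000_000          # books made searchable
WORDS_IN_DATABASE = 500_000_000_000  # approx. 500 billion words
BOOKS_EVER_PUBLISHED = 129_000_000   # from the caption at the end of the article

# Share of all published books that the database covers.
share_of_all_books = BOOKS_ANALYZED / BOOKS_EVER_PUBLISHED

# Average length of a book implied by the two database figures.
words_per_book = WORDS_IN_DATABASE // BOOKS_ANALYZED

print(f"Share of all books: {share_of_all_books:.1%}")
print(f"Average words per book: {words_per_book:,}")
```

The share comes out just under 4%, in line with the figure quoted by the researchers, and the two database figures imply an average of 100,000 words per book.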

In this huge "cultural footprint" of mankind it is easy to trace when and how intensively names and phrases occur, that is, who (or what) was being written about on the pages of books. "The importance of this program lies in the fact that many questions that matter to mankind can now be examined with a quantitative approach" - Erez Lieberman-Aiden, mathematician and biological engineer at Harvard University, told [origo]. The expert, who is of Hungarian origin through one of his grandparents, leads the group together with Jean-Baptiste Michel; the group turned the initial idea into a huge database and a sophisticated method of analysis. In doing so they had to make sure that the time-stamped database of 500 billion words does not infringe copyright, and they also had to filter out books with faulty metadata (e.g. year of publication), for instance in Google Books.

We searched some Hungarian names

Work began in 2008, and the developers present the program together with the first analyses in the latest issue of Science. Simultaneously with the publication of the article, the website www.culturomics.org went public, where anyone can freely start researching. Using preliminary data received from Harvard, we have already checked some names and phrases that are of interest to us Hungarians too, from Ferenc Puskas to the forced Treaty of Trianon. Let's look at Ferenc Puskas first:

The frequency of occurrence of the name "Ferenc Puskas" in the past decades. The first two spikes on the chart may reflect the Hungarian national soccer team known as the "Golden Eleven" and his years at Real Madrid; the later ones follow his career as a coach.

How can the fourth peak be explained? About 72% of the books on which the database is built were written in English (the rest are in German, French, Spanish, Russian and Hebrew). The charts received from Harvard are based on searches of English-language books. Evidently the books published earlier in the USA (where European football only started to become popular in the early 1980s) "watered down" the frequency of the name Puskas.

American influence also explains why the name of another world-famous Hungarian, Janos (John von) Neumann, occurs about two orders of magnitude more often, and more continuously.

A note on reading the chart: the "X" axis shows time, and the "Y" axis shows how frequently the name occurs. In the case of Neumann the value at the top of the "Y" axis is 1e-7, that is 10^-7, or one ten-millionth. If Neumann's curve reaches the value 1 on the "Y" axis, it means that in that year one in every ten million printed words was "Neumann". Scrolling back up the page, you can see that in the case of Puskas the value 1 means only one word in a billion, although the two men were known to almost the same degree.
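The way the "Y" axis should be read can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function names and the word counts are hypothetical, chosen to reproduce the two axis scales mentioned above, and do not come from the database.

```python
def relative_frequency(term_count: int, total_words: int) -> float:
    """Share of all words printed in one year that are the search term."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return term_count / total_words


def one_in_n(frequency: float) -> int:
    """Express a frequency as 'one word in every N words'."""
    if frequency <= 0:
        raise ValueError("frequency must be positive")
    return round(1 / frequency)


# Hypothetical counts (assumption): a year with one billion printed words.
neumann = relative_frequency(term_count=100, total_words=1_000_000_000)
puskas = relative_frequency(term_count=1, total_words=1_000_000_000)

print(f"Neumann: one word in every {one_in_n(neumann):,}")
print(f"Puskas:  one word in every {one_in_n(puskas):,}")
print(f"Ratio:   {round(neumann / puskas)}x")
```

With these made-up counts the two names differ by a factor of 100, which is the gap between a "Y" axis scaled to one ten-millionth and one scaled to one billionth.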

Another Hungarian example: the forced Treaty of Trianon, analyzed by the editors of "Mult-kor"

The trauma of Trianon entered common language in the 1920s and remained there until 1945, when it was erased from the "official language". The end of World War 2 marks a break in the frequency curve of the word Trianon: it was pushed out of publications and back into private conversation. It is surprising and thought-provoking, however, that the changes of the 1990s did not bring a significant rise or change in the occurrence of this word, which has become a concept in its own right. The survey suggests that the international public is indifferent to Trianon, which means that for them it is not a historic milestone.

More Hungarian examples can be seen on page 2 of this article.

Celebrities in past and present

The dominance of English is also explained by the fact that the program's first aim was to follow the changes in the English language over the past 200 years (it turned out, for example, that the language grows by approx. 8500 words a year), but its creators would like to keep enlarging the database and the research possibilities. "We have to enlarge the database with more books and by adding more languages" - said Erez Lieberman-Aiden. "We also have to work out similar methods for other sources, e.g. newspapers and periodicals."

But even the data available now bring interesting social and cultural connections and changes to light. According to the data, mankind is forgetting its past faster and faster: for example, the number of references to the year 1880 fell by 50% in 32 years (by 1912), while references to the year 1973 fell by 50% in a decade, by 1983. The development of technical civilization can also be followed well: technological innovations spread twice as fast in the 20th century as in the century before.
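To get a feel for what these two "half-lives" mean, they can be converted into a yearly rate of forgetting. The sketch below assumes a steady exponential decline, which is our simplification and not a model stated by the researchers; the function names are hypothetical.

```python
def annual_decline(half_life_years: float) -> float:
    """Yearly fractional drop in references, assuming exponential decay (our assumption)."""
    if half_life_years <= 0:
        raise ValueError("half_life_years must be positive")
    return 1 - 0.5 ** (1 / half_life_years)


def remaining_share(years_elapsed: float, half_life_years: float) -> float:
    """Fraction of the original reference level left after a given number of years."""
    if half_life_years <= 0:
        raise ValueError("half_life_years must be positive")
    return 0.5 ** (years_elapsed / half_life_years)


# Half-lives quoted in the article: 32 years for the earlier date, 10 years for 1973.
for label, half_life in (("earlier date", 32), ("1973", 10)):
    print(f"{label}: {annual_decline(half_life):.1%} fewer references per year")
```

Under this assumption the earlier date loses roughly 2% of its references each year, while 1973 loses close to 7%, about three times as fast.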

Today's celebrities are younger and better known than their predecessors in the 19th century, but their radiance is shorter lived (this analysis leaves out other media: remember that for now the database contains only books). Celebrities born in the 1800s were on average 43 years old when they became famous, while those born in the 1950s were only 29.

The website www.culturomics.org can be a useful tool for following the effects of propaganda and censorship. It is worth studying how often the name of Marc Chagall is mentioned in German and in English books between 1936 and 1944. Likewise, for a period the name of Trotsky was erased from Russian texts, and Tiananmen Square from Chinese writings.

Books form an essential part of our "cultural genome" and carry information that is passed from one generation to the next, like our genes. Culturomics means a new, in-depth study of the development of human culture, in which books are the "fossils" of previous generations. According to its creators, future research may reveal hidden connections between diseases, wars, science and religions.

129 million books published, 12 million books scanned, 5 million books analyzed. The frequency of the word "apple" (Science/AAAS)

