A nice long post from ReadWriteWeb about how the National Library of Finland has created some microtasking games to help fix the OCR errors in its massive digitization project. Relatedly, a panel at NYPL from a few days ago sounds interesting.
Tag: ocr
Google Books ngrams – on Hegel and Hitler and OCR
So hey, this is interesting. I’ve skipped a lot of the Google Books ebookstore stuff lately because I’m honestly not sure what to make of it. And I don’t buy books anyhow. But a friend mentioned this Google Labs Ngram viewer, a fun tool that lets you search the full corpus of the Google Books databases. Here’s a New York Times article about it, and data geeks should read the article Quantitative Analysis of Culture Using Millions of Digitized Books (free reg. required; click for PDF ILL) or nose around in the datasets. I did my own dopey search, pictured above: Hegel vs. Hitler. And here’s what’s interesting. The big jump in the late 1940s is fairly predictable, but who was talking about Hitler in 1620?
I clicked through and poked around some and here’s what I found. No one was talking about Hitler. OCR is, as you know, imperfect. So the words that Google Books’ optical character recognition thought of as “Hitler” were actually words like “Ruler” and “bitter” and “herbe.” How about that?
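If you want a rough feel for how "close" those misread words are to “Hitler,” here is a minimal sketch in Python that counts single-character edits (Levenshtein distance). To be clear, this is my own illustration, not how Google's OCR works; the function name `edit_distance` is made up for this example. Real OCR confusion comes from letter shapes in old typefaces, so spelling distance is only a crude stand-in.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the fewest single-character insertions,
    deletions, or substitutions needed to turn a into b (case-insensitive)."""
    a, b = a.lower(), b.lower()
    # prev[j] holds the distance between the first i-1 chars of a and the first j chars of b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning i chars of a into an empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep if the letters match
            ))
        prev = curr
    return prev[-1]


# The three misreadings mentioned above (hypothetical usage)
for word in ["Ruler", "bitter", "herbe"]:
    print(word, edit_distance("Hitler", word))
```

By this measure “bitter” is only two letters off, while “herbe” is further away in spelling, which suggests the page image and typeface mattered more than the letters themselves.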