ocr – librarian.net

So hey, this is interesting. I’ve skipped a lot of the Google Books ebookstore stuff lately because I’m honestly not sure what to make of it. And I don’t buy books anyhow. But a friend mentioned this Google Labs Ngram viewer, a fun tool that lets you search the full corpus of the Google Books databases. Here’s a New York Times article about it, and data geeks should read the article Quantitative Analysis of Culture Using Millions of Digitized Books (free reg. required – click for PDF ILL) or nose around in the datasets. I did my own dopey search, pictured above: Hegel vs. Hitler. And here’s what’s interesting. The big jump in the late 1940s is fairly predictable, but who was talking about Hitler in 1620?

I clicked through and poked around some, and here’s what I found. No one was talking about Hitler. OCR is, as you know, imperfect. So the words that Google Books’ optical character recognition read as “Hitler” were actually words like “Ruler” and “bitter” and “herbe.” How about that?
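If you want a rough number for how far those misread words are from “Hitler,” here’s a minimal sketch in Python. It uses plain Levenshtein edit distance, which is my own stand-in and not anything Google Books is known to use. Real OCR mistakes come from letter shapes and old typefaces, so character-level distance is only a crude proxy. The function name `edit_distance` is made up for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character inserts,
    deletes, or substitutions needed to turn a into b."""
    # Compare case-insensitively (an assumption for this sketch).
    a, b = a.lower(), b.lower()
    # prev[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from the first i chars of a to an empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep on a match
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    target = "Hitler"
    # The three misread words mentioned in the post.
    for word in ["Ruler", "bitter", "herbe"]:
        print(word, edit_distance(target, word))
```

By this measure “bitter” is the nearest of the three, only two substitutions away, while “herbe” is further off, which suggests the typeface did more of the damage there than the spelling.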

Tag: ocr

Angry Birds for the Thinking Person – National Library of Finland crowdsources fixing OCR errors

Google Books ngrams – on Hegel and Hitler and OCR