Trees? Meet forest

4 October 2007

Beech-maple forest

In "Google Books: Is It Good for History?" in the latest Perspectives–the journal of the American Historical Association (AHA)–Robert Townsend reprises his April AHA blog post:

The Google Books project promises to open up a vast amount of older literature, but a closer look at the material on the site raises real worries about how well it can fulfill that promise and the potential costs to history scholarship and teaching.

I think he misses the point, and yes, the Google Books project is good for History …

Mr Townsend’s complaints fall into three areas.

Scan quality: In his noodling around in Google Books he’s found that pages are missing, flipped, chopped, or blurred. He’s right, of course: there are problems. I’ve spent way too much time in those books, and I can testify. Perhaps he could do a more scientific study, though, and give us an error rate (a sampling sketch follows below). If it’s less than, say, 10% (and I’ll bet it is), it’s time to give this a rest.

There’s also grumbling about the quality of the OCR’d text behind the page images. Again, if you need better than 90% accuracy, download and correct the text yourself or scan your own copy and see if you can improve on Google’s.
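If anyone wanted to put a real number on it, a back-of-the-envelope sample would do: pull a few hundred pages at random, count the bad ones, and report a rate with a margin of error. A minimal sketch in Python, assuming you supply the list of page IDs and the (entirely human) judgment call on each sampled page:

```python
import math
import random

def estimate_error_rate(page_ids, is_defective, sample_size=400):
    """Estimate the fraction of bad pages (missing, flipped, chopped,
    blurred) from a simple random sample, with a 95% margin of error.

    `is_defective` stands in for a human eyeballing each sampled page;
    there is no API for that judgment call.
    """
    sample = random.sample(page_ids, min(sample_size, len(page_ids)))
    p = sum(1 for page in sample if is_defective(page)) / len(sample)
    margin = 1.96 * math.sqrt(p * (1 - p) / len(sample))  # normal approx.
    return p, margin

# e.g., 32 bad pages out of 400 sampled gives p = 0.08 and a margin of
# about 0.027: an error rate of 8% +/- 2.7%, comfortably under 10%.
```

A few hundred pages pins the rate down to within a few points either way, which is all this argument needs.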

Metadata reliability: It’s noted that cataloging and other ‘information about the information’ for some books is inaccurate, incomplete, or missing altogether. Right again. But … metadata (like content descriptors, tags, or subject categories) is not the best way to find specific information outside of a tightly controlled data environment, anyway. Don’t believe me? Ask a librarian. Or a database administrator.

It’s the data itself. Most of us find online information these days with text searches. Indeed, the best text mining and analysis tools work with character and word patterns and data objects directly. Simple searches–not subject browsing or tag linking–are tremendously effective at returning appropriate chunks from Google Books. Searches across the whole web by author or title frequently return Google Books entries in the top 10 results.
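The "data itself" point is easy to demonstrate: even a toy inverted index built over raw, imperfect text answers queries with no subject headings or tags in sight. A minimal sketch, with invented sample texts:

```python
import re
from collections import defaultdict

def build_index(books):
    """Toy inverted index: word -> set of book ids.
    `books` maps a book id to its raw (even noisy, OCR'd) text."""
    index = defaultdict(set)
    for book_id, text in books.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(book_id)
    return index

def search(index, query):
    """Return the books containing every query word: no subject
    headings, no tags, just the text itself."""
    hits = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*hits) if hits else set()

books = {
    "perspectives_1907": "the american historical association met in...",
    "dunes_survey": "beech and maple forest along the lakeshore dunes",
}
index = build_index(books)
print(search(index, "maple forest"))  # -> {'dunes_survey'}
```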

Blocked public domain material: I’m unhappy with the way Google does this, too. I think I understand it, though. Keeping in mind the number of volumes digitized daily, there’s no time for human exception processing to make these determinations case by case. I accept this, grudgingly, as a cost of doing business.

Perhaps some clever IP guru could identify a rule-based model for Google to improve accuracy here. I expect, in the meantime, they’ll play it conservatively: blocking access when in doubt.

If this really is a show-stopper, how about someone talks Google into a Mechanical Turk or a multi-hit reporting system, like the one craigslist uses, to review copyright issues?
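For what it’s worth, the craigslist-style mechanism is simple enough to sketch: count distinct flags per blocked title and route it to a human once the count crosses a threshold. A toy version (the threshold and the names are invented):

```python
from collections import defaultdict

REVIEW_THRESHOLD = 3  # distinct flags before a human looks; invented number

class CopyrightFlagQueue:
    """Toy craigslist-style multi-hit reporter: a blocked book reaches
    human review only after several distinct readers flag it."""

    def __init__(self):
        self.flaggers = defaultdict(set)  # book id -> reporter ids
        self.review_queue = []            # books awaiting a human decision

    def flag(self, book_id, reporter_id):
        self.flaggers[book_id].add(reporter_id)
        if len(self.flaggers[book_id]) == REVIEW_THRESHOLD:
            self.review_queue.append(book_id)

# Three distinct readers flag a wrongly blocked public-domain title:
queue = CopyrightFlagQueue()
for reader in ("alice", "bob", "carol"):
    queue.flag("old_public_domain_title", reader)
assert queue.review_queue == ["old_public_domain_title"]
```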

Finally, says Mr Townsend, “these problems will just be compounded over time.”

I agree that all these things would be nice to fix, and higher quality ought to be everyone’s goal (what’s worth doing, do well). But a 10% error rate is a 10% error rate, no matter how many volumes are scanned. I don’t see a snowball effect here.

____________________

I don’t believe that any of these are really significant problems, though, given the following points:

The big one? Google Books does open up a vast (freaking huge, OMG, can’t-see-the-end vast) amount of older literature. Done and done. You want 100% accuracy? Go see NASA. The rest of us are ecstatic with 90% (or whatever the number is).

After this, all is quibbling.

Google’s digitized books are not archival. Don’t expect authoritative versions of historical documents. Use what you can of the digital versions online. Find a better copy in a library if Google’s isn’t good enough. Any trained historian ought to know how to do that.

Also, if you must have a persistent and reliable hyperlink, you’ll have to host the document yourself and pay for perpetual care. Otherwise, get used to the idea that links age and die.

The Google Books project is not scholarly. It is commercial. It is neither driven nor funded by academic historians. If academic needs or standards are important enough, academic institutions will pay for and implement them. Until then? Sit down.

It’s unfair to complain about Google Books not meeting archival preservation standards, academic quality requirements, or metadata conventions. It is only fair to ask: does Google Books meet the business need Google intended for it? I don’t know for sure what that is, but if it has to do with market share, I’ll bet the answer is yes. The beauty is, we data consumers are along for the ride.

Lastly, what are the “potential costs to history scholarship and teaching” Mr Townsend mentions? I don’t know. Is there a risk that Google Books will become the de facto research standard at the expense of real books? The end of libraries as we know them? The corruption of our youth?

Which raises another question: what is Google’s responsibility to history scholarship and teaching, exactly? As opposed to that of scholars and teachers? I hope that answer is obvious.

Consider, in contrast, the benefits of having Google Books now versus doing without it while waiting for perfect.*

So here’s my point: while these complaints about Google Books have theoretical interest, and it’s fun to jaw about “what if” or “wish I had”, Google is out there getting the job done well enough, providing a huge, useful, ubiquitous information resource from otherwise invisible or extinct books–for the price of looking at advertising.**

I’ll take that.

___________________
Notes

Thanks to Dan Cohen at GMU for the pointer, on his blog, to the Perspectives piece.

The greenery above is a beech-maple forest at Sleeping Bear Dunes National Lakeshore.

* A librarian at the University of Michigan, a partner in the Google project, notes:

Although we have engaged in large-scale, preservation-based conversion of materials in the Library’s collection for several years, and have been a leader in digital preservation efforts among research libraries, we know that only through partnerships of this sort can conversion of this scale be achieved. Our program is strong, and we have been able to digitize approximately 5,000 volumes/year; nevertheless, at this rate, it would take us more than a thousand years to digitize our entire collection.
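That arithmetic is worth a beat: 5,000 volumes a year for a thousand years is five million volumes (5,000 × 1,000 = 5,000,000), which says less about Michigan’s program than about the sheer size of a research library’s collection.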

** And maybe our very souls to the devil, but that’s not been clearly established yet.

2 Responses to “Trees? Meet forest”

  1. Mark Stoneman says:

    Thank you for drawing my attention to this post.

    Regarding the permanent URL issue, scholars can use WebCite.

  2. Brian says:

    Thank you for another chance to flog the post! Good luck with your collection of Google commentary. WebCite looks very cool – I’m going to play with that, myself.
