Google Knols

Google has announced a new initiative that threatens to seriously disturb the precarious knowledge ecosystem of the web: Google Knols. The project is still shrouded in mystery, with only one screenshot released so far and only this Google blog entry, by VP of Engineering Udi Manber, to work with. Predictably, the blogosphere and the tech commentary press have gone into overdrive, but this time with good reason. At its core, the Knol project is the first realistic challenge to Wikipedia as THE knowledge portal on the web. In brief: Google will allow anybody to contribute articles (Knols) to a public database on any topic. In contrast to Wikipedia, each Knol will have a single author, with popular articles rising to the top of the rankings for a given topic. Knols will also carry advertising, as Google Search does, with ad revenue shared between Google and the author of the Knol hosting the ads; the idea is that useful or particularly good Knols will generate revenue and so be rewarded.

This has people understandably worried, for a number of reasons. TechCrunch questioned whether this might be a step too far for Google, as it moves into space previously occupied by websites like Wikipedia, Squidoo and About, and of course by traditional publishers. Ben Vershbow on if:book offers a thorough and meaty dissection of the concept, accusing Google of a myopic, insular and self-serving attitude that threatens to turn the web into little more than a directory of Google products.

Potentially, Knols could be the first step on a journey towards all content being provided free on Google, or perhaps on some as-yet-unforeseen site, a prospect that has publishers concerned, and rightly so. If people can read a book-quality entry on Gordon of Khartoum, they are less likely to buy the book.

Part of the fear surrounding Knols keys into the recent rejigging of the Google algorithm, which saw many prominent websites slip down the crucial PageRank system. Google argues that the change is designed to counter some of the effects of black-hat SEO, but many suspect that Knols could, in the long term, be given a search advantage over, say, Wikipedia through such PageRank re-evaluations. Given that Wikipedia's traffic is driven by its uniformly high search positions, such a move could seriously damage the Wikimedia Foundation's flagship. This conflict of interest between content hosting and content indexing is an old chestnut for anyone familiar with Google Book Search.

Unlike many commentators, I actually think Knols really could work: if the information is useful, then people really will go there. Everyone has a lot of affection for Wikipedia's model, even if it is not 100% reliable, but only hardcore web activists will refuse to use Knols should the information found there prove more useful. For most people, getting the best results quickly and simply is the main priority, and in this area Google's track record is second to none. Plus, as Open Access points out, Knols look set to come with a CC licence, meaning it will be easy to distribute and re-use the articles.

Yet there is a lingering sense that Google are moving into, and trying to dominate, too many areas of the web, becoming, as if:book puts it, "the alpha and omega" of the internet. They have recently launched the open application platform OpenSocial and the mobile platform Android, which, while laudable for their commitment to open development, show a willingness on Google's part to occupy and own the central ground of virtually everything.

There are also question marks over the proximity of nominally objective knowledge to the quick-buck mechanism of click-through ads: might not the temptation to maximise ad revenue prove too much for some authors, and, without the self-correcting mechanism of a wiki, might such bias not simply go unchallenged?

Whatever happens, it can be confidently stated that a new front has been opened in the dissemination of knowledge and entertainment.

‘Digitizing the British Library’

The February 2008 issue of PC Pro reports on the British Library’s plan to digitize 100,000 books published in the nineteenth century, some 25,000,000 pages in all. The digitizing partner chosen is Microsoft, with the actual work being done by a German firm, Content Conversion Specialists (CCS); the library ‘retains the rights to all the data being collected’, but Microsoft has the right to host the collection on its Live Search Books site, for a duration not revealed by the library. A team of five scans 50,000 pages a day, a rate that should complete the project in two years. Only books smaller than 28 x 35.5 cm can be scanned automatically, so 20-30% must be scanned manually. All books are visually checked for loose or torn pages, then placed under a lectern with two Canon 16.6-megapixel cameras; the operator turns the first few pages, then the machine uses suction to turn the remainder, at one page every two or three seconds. The operator at the station sees all the pages as thumbnails on a PC and can fix errors there. Fold-outs that can’t be scanned by the machine, around 1% of the total, are scanned separately and integrated later by software. The project runs on a 12-CPU blade server with 40TB of storage.
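As a back-of-the-envelope check on the schedule (the page total and daily rate are from the article; the working-days-per-year figure and the arithmetic itself are my own rough assumptions):

```python
# Schedule check for the British Library digitization project, using the
# figures reported in the article: 25,000,000 pages total, scanned at
# 50,000 pages per working day by the five-person team.

total_pages = 25_000_000
pages_per_day = 50_000

working_days = total_pages / pages_per_day  # 500 working days
# Assuming roughly 250 working days per year, that comes to about
# two years, matching the stated timetable.
years = working_days / 250

print(working_days)  # 500.0
print(years)         # 2.0
```

The figures line up neatly, which suggests the two-year estimate is built directly on the 50,000-pages-a-day rate rather than allowing much slack.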

Resolution is 300 dpi for both text and images, which the library says is ideal for reading online but also suitable for print-on-demand should it be required in the future. Output formats are JPEG 2000, PDF and plain text; the plain text is captured by OCR and ‘specially processed’ to deal with antique orthography and typography. A secondary check takes place in Romania, and the library batch-samples the files delivered by CCS to ISO 2859-1.
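It is worth sanity-checking these numbers against each other. The maximum page size, the 300 dpi resolution, the 16.6-megapixel cameras and the 40TB of storage are all from the article; the division is my own rough sketch:

```python
# Sanity checks on the reported resolution and storage figures.

CM_PER_INCH = 2.54
dpi = 300

# A maximum-size (28 x 35.5 cm) page captured at 300 dpi:
width_px = 28.0 / CM_PER_INCH * dpi    # ~3307 pixels
height_px = 35.5 / CM_PER_INCH * dpi   # ~4193 pixels
megapixels = width_px * height_px / 1e6

# Comfortably within the 16.6-megapixel cameras described.
print(round(megapixels, 1))  # 13.9

# Average storage budget per page if all 25M pages share 40 TB:
bytes_per_page = 40e12 / 25e6
print(bytes_per_page / 1e6)  # 1.6 (MB per page)
```

So the largest auto-scannable page at 300 dpi needs about 13.9 megapixels, within the cameras' capacity, and the 40TB store allows roughly 1.6 MB per page on average, a plausible budget for JPEG 2000 plus text.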

Scanning takes place underground with no natural daylight, to ensure colour consistency, and the scanning room is air-conditioned: 'Just one degree in temperature changes the light tuning and requires colour adjustments.'

To deal with copyright issues the library is using ‘a database of authors’: works still in copyright (less than 1%) won’t be digitized, while orphan works (about 40%) will be, subject to a ‘notice and takedown’ procedure on the website.

Note: the article uses ‘scan’ throughout but it’s clear from the diagram that a static photograph of each page is used.