Posted by Michael Bhaskar
on 19 December, 2007
Google has announced a new initiative that threatens to seriously disturb the precarious knowledge ecosystem of the web: Google Knols. The project is still shrouded in mystery, with only one screenshot released so far and only this Google blog entry, by VP Engineering Udi Manber, to work with. Predictably, the blogosphere and tech commentary press have gone into overdrive, but this time with good reason.
The core of the Knol project can be described as the first realistic challenge to Wikipedia as THE knowledge portal on the web. In brief, Google will allow anybody to contribute articles to a public database on any topic. In contrast to Wikipedia, articles (Knols) will be single-authored, with popular articles rising to the top of the rankings in a given topic. Knols will also feature advertising, like Google Search, with ad revenue shared between Google and the author of the Knol carrying the ads; the idea is that useful or particularly good Knols will generate revenue and hence be rewarded.
Posted by Nicholas Blake
on 18 December, 2007
The February 2008 issue of PC Pro reports on the British Library’s plan to digitize 100,000 books published in the nineteenth century – 25,000,000 pages.
The digitizing partner chosen is Microsoft, with the actual work being done by a German firm, Content Conversion Specialists (CCS); the library ‘retains the rights to all the data being collected’ but Microsoft has the right to host the collection on its Live Search Books site, for a duration not revealed by the library. A team of five people scans 50,000 pages a day, with the aim of completing the project in two years. Books smaller than 28 x 35.5 cm can be scanned automatically; the remaining 20-30% must be scanned manually. All books are visually checked for loose or torn pages, then placed under a lectern with two Canon 16.6 megapixel lenses; the operator turns the first few pages, then the machine uses suction to turn the remainder, at one page every two or three seconds. The operator at the station sees all the pages as thumbnails on a PC and can fix errors there. Fold-outs that can’t be scanned by the machine make up around 1% of the total; they’re scanned separately and integrated later by software. The project has a 12 CPU blade server with 40TB of storage.
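As a rough sanity check on those figures, the arithmetic can be sketched in a few lines of Python. The page counts and daily rate come from the article; the 250 working days a year, the 2.5 seconds per page (the midpoint of ‘two or three’) and the eight-hour shift are my own assumptions, and the sketch ignores the slower manual scanning altogether.

```python
import math

# Figures quoted in the article
TOTAL_PAGES = 25_000_000
TOTAL_BOOKS = 100_000
PAGES_PER_DAY = 50_000

# Assumptions (not from the article)
WORKING_DAYS_PER_YEAR = 250   # assumed: weekdays only, minus holidays
SECONDS_PER_PAGE = 2.5        # assumed: midpoint of "two or three seconds"
SHIFT_HOURS = 8               # assumed: one eight-hour shift per station


def scanning_days(total_pages: int, pages_per_day: int) -> int:
    """Whole working days needed to scan every page at the quoted rate."""
    return math.ceil(total_pages / pages_per_day)


def machine_hours_per_day(pages_per_day: int, seconds_per_page: float) -> float:
    """Hours of page-turning needed each day if every page were machine-scanned."""
    return pages_per_day * seconds_per_page / 3600


days = scanning_days(TOTAL_PAGES, PAGES_PER_DAY)
years = days / WORKING_DAYS_PER_YEAR
pages_per_book = TOTAL_PAGES / TOTAL_BOOKS
hours = machine_hours_per_day(PAGES_PER_DAY, SECONDS_PER_PAGE)
stations = math.ceil(hours / SHIFT_HOURS)

print(f"{days} working days, about {years:.1f} years")
print(f"{pages_per_book:.0f} pages per book on average")
print(f"{hours:.1f} machine-hours a day, so at least {stations} stations")
```

Under these assumptions the two-year target is consistent with the quoted daily rate, and the daily rate needs several scanning stations running in parallel rather than one.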
Resolution is 300dpi for both text and images, which the library says is ideal for reading online but also suitable for print on demand if required in the future. Output formats are JPEG 2000, PDF and plain text; OCR is used to capture plain text which is ‘specially processed’ to deal with antique orthography and typography. A secondary check takes place in Romania, and the library batch-samples files delivered by CCS to ISO 2859-1.
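The 300dpi figure can be checked against the other numbers in the report. The sketch below works out how many pixels the largest automatically scanned page (28 x 35.5 cm) needs at 300dpi, compares that with the quoted 16.6 megapixels, and divides the quoted 40TB by the page count. It assumes decimal terabytes and that all pages are held in storage at once, neither of which the article states, and it compares pixel counts only, ignoring sensor aspect ratio and margins.

```python
CM_PER_INCH = 2.54

# Figures quoted in the article
DPI = 300
MAX_AUTO_PAGE_CM = (28.0, 35.5)   # largest page the machine handles
SENSOR_MEGAPIXELS = 16.6
TOTAL_PAGES = 25_000_000
STORAGE_TB = 40


def pixels_at_dpi(length_cm: float, dpi: int) -> int:
    """Pixels needed along one edge of the page at the given resolution."""
    return round(length_cm / CM_PER_INCH * dpi)


width_px = pixels_at_dpi(MAX_AUTO_PAGE_CM[0], DPI)
height_px = pixels_at_dpi(MAX_AUTO_PAGE_CM[1], DPI)
page_megapixels = width_px * height_px / 1_000_000

# Necessary but not sufficient: pixel count only, no aspect ratio or margins
fits_sensor = page_megapixels <= SENSOR_MEGAPIXELS

# Assumption: 1 TB = 10**12 bytes and every page is stored simultaneously
bytes_per_page = STORAGE_TB * 10**12 / TOTAL_PAGES

print(f"{width_px} x {height_px} px = {page_megapixels:.1f} MP")
print(f"fits within {SENSOR_MEGAPIXELS} MP: {fits_sensor}")
print(f"storage budget: {bytes_per_page / 1_000_000:.1f} MB per page")
```

On these assumptions the largest machine-scanned page sits within the quoted sensor resolution, and the storage works out to a budget of under 2 MB a page across all output formats.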
Scanning takes place underground with no natural daylight, to ensure colour consistency, and the scanning room is air-conditioned: ‘Just one degree in temperature changes the light tuning and requires colour adjustments.’
To deal with copyright issues the library is using ‘a database of authors’; works still in copyright (less than 1%) won’t be digitized, while orphan works (about 40%) will be, subject to a ‘notice and takedown’ procedure on the website.
Note: the article uses ‘scan’ throughout but it’s clear from the diagram that a static photograph of each page is used.