Following on from James' post about fan fiction, it seems that some of the issues apply not just to content as such, but also to the wider concept of data. The idea behind scraping is simple: a program extracts information from one web page and republishes it on another. It means that websites can, in theory, take data and then use it in new ways. Popular scraping services like Dapper make the process easy and efficient, while a whole sub-industry has built up around translating information from one site to another, with tools such as this Ruby on Rails kit being widely available.
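To make the idea concrete, here is a minimal sketch of a scraper using only Python's standard library. It is an illustration rather than how Dapper or any particular toolkit works: the sample page, the "listing" class name and the example.org address are all invented for the purpose, and a real scraper would fetch the page over HTTP instead of reading it from a string.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical source page, invented for this sketch.
SOURCE_PAGE = """
<html><body>
  <ul>
    <li class="listing"><a href="/items/1">Road bike, good condition</a></li>
    <li class="listing"><a href="/items/2">Two-bed flat &amp; garden</a></li>
    <li class="ad"><a href="/promo">Sponsored link</a></li>
  </ul>
</body></html>
"""


class ListingScraper(HTMLParser):
    """Collects (href, title) pairs from links inside <li class="listing">."""

    def __init__(self):
        super().__init__()
        self.listings = []
        self._in_listing = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "listing":
            self._in_listing = True
        elif tag == "a" and self._in_listing:
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside a listing link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.listings.append((self._href, "".join(self._text).strip()))
            self._href = None
        elif tag == "li":
            self._in_listing = False


def scrape(html):
    """Extract the listings from the source page."""
    parser = ListingScraper()
    parser.feed(html)
    parser.close()
    return parser.listings


def repackage(listings, base_url):
    """Render the scraped listings as a new page that links back to the source."""
    items = "\n".join(
        f'  <li><a href="{escape(base_url + href)}">{escape(title)}</a></li>'
        for href, title in listings
    )
    return f"<ul>\n{items}\n</ul>"


if __name__ == "__main__":
    print(repackage(scrape(SOURCE_PAGE), "https://example.org"))
```

The scraper keeps only the parts of the page it wants, here the two listings, and drops the rest, which is exactly why the original site loses control over how its data is presented.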

However, as this feature points out, the whole concept is increasingly problematic. Scraping essentially relies on the co-operation of the sites being scraped, and those tend to be the most popular: Google, eBay, Amazon etc. Most of the time sites are happy to be scraped, as it raises the profile of the site and of the data it displays. Plus, scraping can be difficult to stop.
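One conventional way a site signals how far that co-operation extends is a robots.txt file, which well-behaved scrapers check before fetching anything. The feature itself does not discuss robots.txt, so treat this as an assumed illustration: the rules, the "example-scraper" user agent and the URLs below are invented, and the sketch parses the rules from a string rather than downloading them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a site that does not want its listings scraped.
ROBOTS_TXT = """\
User-agent: *
Disallow: /listings/
"""


def may_scrape(robots_txt, user_agent, url):
    """Return True if the site's robots.txt permits this agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.org/about", "https://example.org/listings/123"):
        print(url, may_scrape(ROBOTS_TXT, "example-scraper", url))
```

The file is only a request, not an enforcement mechanism, which is part of why scraping is hard to stop when a scraper chooses to ignore it.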

As the article makes clear, though, an increasing number of sites are not willing to let their data be scraped. For example, the listing site Craigslist has cracked down on sites that scraped its listings and repackaged them, as has Alexa, the Amazon-owned web information service, which clamped down on sites using Alexa data.

There is an obvious parallel between fan fiction and scraping sites. Both take a proprietary piece of information and convert it for a new and altered form of consumption that in some way augments or transforms the original. In both there is a delicate balance between the risks of copyright theft and of damage to revenue and reputation (e.g. Warner Bros' argument in the protracted Harry Potter Wars) and the massive benefits of visibility for the original owner or producer of the data. This is the critical faultline of the web. What is better: control or visibility?

In a perceptive post, Tim O'Reilly draws an analogy between banks trading for their own accounts and websites that formerly directed traffic away from their sites now keeping it within them, trading for their own screen views rather than those of others. This is a shift from, say, visibility aid to producer of proprietary content. It suggests a trend towards control even as DRM and fan fic lawsuits begin to look more and more anachronistic.

Perhaps the best way of balancing these competing demands is through Creative Commons licences: these protect the integrity of the original work and its revenue stream whilst also increasing the much-needed visibility of the product or data. In the case of Craigslist, for instance, one of their major problems was with the Google ads that were being displayed next to the scraped listings. If there were no ads on the original, they argued, why should someone else profit from the data they had aggregated?

It makes sense, though, for Craigslist entries to be scraped, as anything that increases views of those listings is by definition improving the listings; a CC licence obviates the issue of who is monetizing the content. Likewise, the ugly spectacle of media companies hounding their fans could be eased in a similar way. Seeing as a distinction between data and content is fairly meaningless on the web, some kind of Creative Commons licence could be built into future works of HP magnitude, which would allow people to build on them provided they acknowledged the source. Some kind of royalty arrangement could even be built in (if a fan fic became profitable or if a company was willing to take the risk).

Creative Commons licences come with the flexibility to allow alteration, a key sticking point in the debate over fan fiction.

A perfect illustration of this movement, which straddles the transformative, free culture of the web, the need for visibility and the demand for more proprietary content, is Google's Knol project, at least as far as it is currently possible to tell. In the screenshot, the Knol is released under a Creative Commons 3.0 licence. The content is thus hosted and displayed by Google, but is available for use elsewhere.

Scraping and fan fiction are raising new questions for viewer-hungry websites and media producers, questions that require a new approach to concepts of ownership and data usage.