Use of Online Data in the Big Data Era: Legal Issues Raised by the Use of Web Crawling and Scraping Tools For Analytics Purposes
By Jim Snell and Derek Care, Bingham McCutchen LLP
Automated Web Content Gatherers
With the advent of Big Data – the increasingly widespread practice of using advanced data analytics to identify trends and patterns in extremely large datasets collected from a variety of sources – the potential applications for scraped data, and the benefits associated with analysis of that data, have increased exponentially. Whereas past cases involving unauthorized web crawling and scraping often involved simple copying and republication of website content in direct competition with the scraped website, the growing use of advanced data analytics is giving rise to instances where the connection between the data analytics service and the scraped website is attenuated and not directly competitive. Nevertheless, the online content of websites that may be scraped is among such businesses' most valuable data, and website owners understandably go to great lengths to protect such content.
Given both the tremendous value and Big Data-driven demand for Internet-based information, and the relative ease by which such information can be compiled using automated data collection tools such as that deployed by Mr. Warden, it is likely that future cases relating to web crawling and scraping will focus on the legal issues raised by automated data gathering for analytics purposes – what theories a website owner may assert to protect any factual data so collected, and what theories a data collector may invoke to justify such collection. Few courts, however, have directly addressed the legal issues raised by Big Data or the collection of data for related purposes, leaving uncertain the legal environment faced by website owners wishing to protect the data on their websites, and by those who would gather such data for analytics purposes. Without taking sides – and while recognizing that the legal landscape relating to the Internet is constantly evolving, with previously challenged technologies such as search engines now recognized as nearly per se legitimate while others such as peer-to-peer networks have continually been subject to scrutiny – this article seeks to outline the legal issues such parties may face. In doing so, this article will consider the legal theories that have been applied in prior cases relating to the use of web crawling and scraping tools in other contexts, and will identify issues relating to whether claims under these theories are likely to succeed in connection with disputes relating to automated data collection for Big Data and analytics purposes.
Legal Theories Related to Automated Online Data Collection
A. Copyright Infringement.
The Copyright Act protects original expressions that are fixed in a tangible medium, including mediums such as computer memory or a web server.3 These protections extend not only to original expressions such as images contained on a website, but also the underlying code that enables the display of any content on the website – including facts displayed on a website that are not otherwise entitled to copyright protection. Accordingly, because web crawling and scraping tools generally index information on a targeted webpage regardless of whether the tool seeks to obtain copyrighted content or unprotected facts4, courts have recognized claims for copyright infringement in connection with the use of web crawling and scraping tools.5
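As technical background for why scraping tools sweep in protected and unprotected material alike, a crawler's indexing step typically extracts all text from a page without distinguishing copyrightable expression from bare facts. A minimal sketch using Python's standard library, with a hypothetical webpage supplied purely for illustration:

```python
from html.parser import HTMLParser

# Hypothetical page for illustration: it mixes unprotected facts
# (concert times, prices) with references to copyrighted expression.
page_html = """
<html><body>
  <h1>Concert Listings</h1>
  <p>Band X plays Venue Y on June 1. Tickets: $40.</p>
  <img src="promo-photo.jpg" alt="copyrighted promotional photo">
</body></html>
"""

class TextExtractor(HTMLParser):
    """Collects every text node on the page, fact or expression alike."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

extractor = TextExtractor()
extractor.feed(page_html)
print(extractor.chunks)
# → ['Concert Listings', 'Band X plays Venue Y on June 1. Tickets: $40.']
```

The sketch makes the legal point concrete: the extractor has no way to know which of the captured strings are protected expression and which are mere facts, so any copying it performs is indiscriminate.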
Because some courts have recognized that such activities may infringe a website owner’s copyrights, the focus in such cases is generally on whether the web crawling or scraping at issue is a fair use of the copyrighted content. For example, in Kelly v. Arriba Soft Corp., the defendant search engine conceded that its display of low-resolution “thumbnail” copies of high-resolution photographs constituted reproduction of those photographs, but argued that such display was a transformative, fair use of the copied photographs. The Ninth Circuit agreed – holding that the search engine’s display of low-resolution photographs to facilitate the general public’s access to information on the Internet was highly transformative of, and did not provide a substitute for, the plaintiff’s high-resolution photographs whose purpose was primarily artistic.6 Notably, the fact that such use was for a commercial purpose did not bar the court’s finding that the search engine made a fair use of plaintiff’s copyrighted photograph.7 In contrast, in Associated Press v. Meltwater U.S. Holdings, Inc., the court found that an online news aggregator that provided its subscribers with nearly 500-character excerpts of copyrighted articles scraped from the websites of the Associated Press’s licensees did not engage in a fair use of those articles.
The court distinguished the news aggregator’s services from those at issue in Kelly on the grounds that the news aggregator did not facilitate the general public’s access to information on the Internet, but instead only provided word-for-word excerpts of the copied articles to the aggregator’s paying customers without transforming that content in any way.8 The court further held that the aggregator’s use of that content to generate analytics relating to the online news sources it covered, while potentially transformative in and of itself, did not render the aggregator’s excerpting transformative insofar as the analytics and excerpting were separate and distinct services.9
While even incidental reproduction of copyrighted webpage material may give rise to copyright liability, courts have also recognized that such reproduction may constitute a fair use of the protected content. For example, in Ticketmaster Corp. v. Tickets.com, Inc., the defendant argued that the momentary copying of Ticketmaster’s webpages by its spiders for the purpose of extracting factual information concerning concert times, ticket prices, and venues that defendant then posted to its website constituted a fair use. The court agreed. In so finding, the court emphasized that the copying was momentary, the effect on the market value of the copyrighted material was “nil”, and that the “amount and substantiality” of the material used was negligible insofar as defendant did not reproduce the copyrighted material on its webpage. Further, the court observed that the central purpose of the Copyright Act – i.e., “to secure a fair return for an author’s creative labor and to stimulate artistic creativity for the general good” – would not be served by restricting defendant from momentarily copying Ticketmaster’s webpages for the purpose of obtaining non-protected, factual information.10
In addition to the fair use defense, courts have also considered whether a plaintiff’s copyright claims are subject to implied license or estoppel defenses based on its failure to deploy the “robots.txt” protocol to deter unwanted web crawling or scraping. The robots.txt protocol is an industry-standard convention by which a website may instruct cooperating web crawlers generally, or certain web crawlers specifically, to voluntarily refrain from accessing all or part of the website.11 In Parker v. Yahoo, Inc., the court held that the plaintiff’s failure to deploy the protocol granted Yahoo an implied license to create cache copies of his website where plaintiff was aware that Yahoo – which has a policy of not creating cache copies of websites that deploy the protocol – would do so in the absence of the protocol.12 Conversely, in Meltwater, the court rejected the defendants’ implied license and estoppel defenses based on the Associated Press’s purported failure to require its licensees to deploy the protocol. The court distinguished Parker on several grounds, including that the defendants reserved the right to ignore the protocol if deployed. The court further emphasized that the defendants’ arguments, if accepted, would shift the burden of preventing infringement to the copyright owner, and threatened the “openness of the Internet” by forcing copyright owners to choose between deploying the protocol and deterring all web crawlers (including search engines which may help users locate the website), and refraining from doing so and losing the right to prevent unauthorized use of their protected content.13
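To illustrate how the protocol operates in practice, a cooperating crawler reads the site's robots.txt file and voluntarily skips the listed paths; nothing in the protocol technically blocks a non-cooperating crawler, which is why compliance policies mattered in Parker and Meltwater. A minimal sketch using Python's standard-library parser, with a hypothetical robots.txt supplied for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; a real crawler would fetch this from
# the site's /robots.txt URL before crawling.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A cooperating crawler checks each URL before fetching it.
print(parser.can_fetch("AnyBot", "https://example.com/private/report"))   # False
print(parser.can_fetch("AnyBot", "https://example.com/articles/today"))   # True
```

Note that the check is entirely self-imposed: the crawler, not the website, decides whether to honor the result, which is why the Meltwater court weighed the defendants' reservation of the right to ignore the protocol.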
With respect to future cases involving use of scraped content for analytics purposes, courts are likely to follow a similar analysis driven by the facts of the specific case. Issues regarding whether the copying is momentary, whether the information extracted is factual, the effect on the market value of the copyrighted material, and the amount and substantiality of the material used are likely to be key issues in these cases. Courts are further likely to focus on whether the object of the Copyright Act – “to secure a fair return for an author’s creative labor and to stimulate artistic creativity for the general good” – would be served by prohibiting the challenged conduct. Courts are also likely to consider, in the context of defenses to copyright claims, the specific circumstances relating to a website’s deployment of the robots.txt protocol, including whether the defendant has a practice or policy of complying with the protocol if deployed.
B. Breach of Contract.
The enforceability of such contracts is likely to continue to be addressed by courts in this area, with content providers citing clickwrap agreements and actual knowledge of terms, and those using crawling and scraping tools arguing a lack of mutual assent to such terms.
3. Damages Relating to Unauthorized Data Collection.
The issue of damages is, of course, an intensely factual determination, but it should be noted that this issue is likely to play a key role in these cases in the future – with content owners trying to either quantify actual damages or establish the applicability of liquidated damages provisions, and those who use crawling and scraping tools arguing the impossibility of establishing such amounts. Based on the difficulty in establishing damages, content owners may also seek injunctive relief in such cases.
C. Computer Fraud and Abuse Act.
Courts have also considered whether web crawling or scraping in breach of a website’s terms of service constitutes a violation of the Computer Fraud and Abuse Act (“CFAA”), which prohibits access to a computer, website, server or database either “without authorization” or in a way that “exceeds authorized access” of the computer.33 While these terms have been variously defined, in essence, a person who accesses a computer “without authorization” does so without any permission at all, while a person “exceeds authorized access” where she “has permission to access the computer, but accesses information on the computer that the person is not entitled to access.”34 So long as a computer is publicly accessible, and not protected by password or other security measures, courts have declined to find any access of the website to be “without authorization.”35 Conversely, a CFAA claim may lie where a computer or website is protected from unauthorized access, either by technical measures or even explicit warnings in a cease-and-desist letter.36
D. Hot News Misappropriation.
In addition to asserting copyright claims based on incidental reproduction of copyrighted webpage material, numerous plaintiffs have asserted claims for hot news misappropriation relating to scraping of purely factual information. “Hot news” misappropriation – once a claim that existed under the federal common law, but which now exists only under the laws of five states43 – provides a cause of action where a party reproduces factual, time-sensitive information that was gathered at the effort and expense of another party, and thereby deprives the gathering party of the commercial value of that information. Thus, for example, in Int’l News Serv. v. Associated Press, the Supreme Court in 1918 recognized a claim under federal common law for hot news misappropriation in connection with a wire service’s re-publication of breaking news gathered by the Associated Press, which thereby deprived the Associated Press of the news value of its reporting.44 The court justified its decision as protecting the “quasi-property” rights of profit-seeking entrepreneurs who gathered time-sensitive information from those who would free-ride on the efforts of those entrepreneurs.45
Since hot news misappropriation generally concerns factual information rather than content that is subject to copyright protection, it is generally found not to be preempted by the Copyright Act.46 However, courts have recognized hot news misappropriation as an extremely narrow claim that survives preemption only in circumstances that mirror those of Int’l News Serv. For example, in Barclays Capital Inc. v. Theflyonthewall.com, Inc., financial services firms alleged claims for copyright infringement and hot news misappropriation against a news aggregation website that reported on investment recommendations issued by the firms to their clients who paid to receive those recommendations before they became generally known to the investment community. On appeal from a denial of the defendant’s motion to dismiss the hot news claim, the court found that plaintiffs’ claim was preempted by the Copyright Act. In so finding, the court emphasized that the plaintiffs’ claim lacked an “indispensable element of an INS ‘hot news’ claim,” i.e., “free-riding by a defendant on a plaintiff’s product, enabling the defendant to produce a directly competitive product for less money because it has lower costs.”47 Rather, though the defendant’s conduct potentially threatened plaintiffs’ businesses, the defendant was actually breaking news generated by the plaintiffs’ recommendations (and attributing the recommendations to plaintiffs), rather than merely repackaging news that had been reported by plaintiffs.48
The Barclays case suggests the difficulty of stating a valid hot news misappropriation claim against a party engaged in automated data collection for purposes of data analytics. In many factual scenarios, scraping of information would not appear to qualify as “free-riding” within the meaning of INS so long as the scraper did not attempt to pass the information off as his own without attribution to the content provider. Indeed, many factual circumstances would appear similar to the recommendations at issue in Barclays, where the information is only valuable because it was attributed to the source. The fact that data analytics often involves the use of information to create entirely new insights (including in combination with information from other sources) suggests further difficulties in establishing the requisite “free-riding,” which under Barclays involves demonstrating that the underlying information was used to produce a directly competitive product.
E. Trespass to Chattels.
Courts have also recognized, in certain narrow circumstances, that unauthorized use of web crawling or scraping tools can give rise to a trespass to chattels claim, which “lies where an intentional interference with the possession of personal property has proximately caused injury.”49 For example, in eBay, Inc. v. Bidder’s Edge, Inc., eBay brought a trespass to chattels claim against the defendant, an online auction aggregation service that scraped auction information from eBay’s website using spiders that accessed the website approximately 100,000 times per day in violation of eBay’s terms of service and in defiance of cease-and-desist demands from eBay. eBay also moved to preliminarily enjoin the defendant from accessing its website. In granting that motion, and finding that eBay was likely to prevail on its trespass to chattels claim, the court relied on the fact that defendant’s spiders consumed a portion – albeit very small – of eBay’s bandwidth and server capacity, and thereby “deprived eBay of the ability to use that portion of its personal property for its own purposes.”50
In contrast, where tangible interference is absent, or is no more than theoretical or de minimis, courts have declined to recognize claims for trespass to chattels relating to the use of web crawling or scraping tools. For example, in Tickets.com, the court granted summary judgment dismissing Ticketmaster’s trespass to chattels claim because Ticketmaster failed to present any evidence that its competitor’s scraping of its website either caused physical harm to Ticketmaster’s servers or otherwise impeded Ticketmaster’s use or utility of its servers. In so holding, the court criticized the decision of the eBay court, and required a showing of “some tangible interference with the use or operation of the computer being invaded by the spider.”51 Later courts have generally agreed with the holding in Tickets.com.52
To the extent that Tickets.com presents the prevailing statement of law, and evidence of a tangible interference with a computer or server is necessary to state a claim for trespass to chattels based on unauthorized web crawling or scraping, courts are likely in the future to focus on evidence of tangible interference with systems.53
As indicated above, the legal landscape relating to web crawling and scraping is still taking shape—particularly insofar as few courts have considered claims based on crawling or scraping for analytics purposes. Further, because most cases involving the use of web crawling and scraping tools in other contexts have been highly fact specific, it is difficult to identify bright-line rules for determining when use of such tools for analytics purposes is likely to give rise to liability. Nonetheless, the cases discussed above suggest a number of issues that should be considered both by website owners and by those who seek to perform analytics using data gathered from web-based sources.
Ultimately, while the claims and theories that may be advanced in connection with the use of web crawling and scraping tools for analytics purposes have yet to be deeply explored by courts, this is likely a temporary state of affairs. Given the increasing number and availability of tools for aggregation and analysis of content in the Big Data era, courts will ultimately be required to address these complicated issues.
Jim Snell is co-chair of Bingham’s Intellectual Property Group and co-chair of the firm’s Privacy and Security Group. He represents clients in complex commercial matters, including patent litigation, Internet and privacy issues, trade secret matters and matters involving unfair competition claims under California Business and Professions Code section 17200.
Derek Care is counsel in Bingham’s New York office, where he is a member of the firm’s Intellectual Property, Privacy and Security, and Entertainment, Media and Communications groups. He represents clients in a wide range of complex commercial matters, including trademark, copyright, Internet and privacy-related matters.
©2013 Bloomberg Finance L.P. All rights reserved. Bloomberg Law Reports ® is a registered trademark and service mark of Bloomberg Finance L.P.
This document and any discussions set forth herein are for informational purposes only, and should not be construed as legal advice, which has to be addressed to particular facts and circumstances involved in any given situation. Review or use of the document and any discussions does not create an attorney-client relationship with the author or publisher. To the extent that this document may contain suggested provisions, they will require modification to suit a particular transaction, jurisdiction or situation. Please consult with an attorney with the appropriate level of experience if you have any questions. Any tax information contained in the document or discussions is not intended to be used, and cannot be used, for purposes of avoiding penalties imposed under the United States Internal Revenue Code. Any opinions expressed are those of the author. Bloomberg Finance L.P. and its affiliated entities do not take responsibility for the content in this document or discussions and do not make any representation or warranty as to their completeness or accuracy.