Boston University School of Law
"Web scraping" is a ubiquitous technique for extracting data from the World Wide Web, done through a computer script that will send tailored queries to websites to retrieve specific pieces of content. The technique has proliferated under the ever-expanding shadow of the Computer Fraud and Abuse Act (CFAA), which, among other things, prohibits obtaining information from a computer by accessing the computer without authorization or exceeding one's authorized access.
Unsurprisingly, many litigants have now turned to the CFAA in attempt to police against unwanted web scraping. Yet despite the rise in both web scraping and lawsuits about web scraping, practical advice about the legality of web scraping is hard to come by, and rarely extends beyond a rough combination of "try not to get caught" and "talk to a lawyer." Most often the legal status of scraping is characterized as something just shy of unknowable, or a matter entirely left to the whims of courts, plaintiffs, or prosecutors.
Uncertainty does indeed exist in the caselaw, and may stem in part from how courts approach the act of web scraping on a technical level. In the way that courts describe the act of web scraping, they misstate some of the qualities of scraping to suggest that the technique is inherently more invasive or burdensome. The first goal of this piece is to clarify how web scrapers operate, and explain why one should not think of web scraping as being inherently more burdensome or invasive than humans browsing the web.
The second goal of this piece is to more fully articulate how courts approach the all-important question of whether a web scraper accesses a website without authorization under the CFAA. I aim to suggest here that there is a fair amount of madness in the caselaw, but not without some method. Specifically, this piece breaks down the twenty years of web scraping litigation (and the sixty-one opinions that this litigation has generated) into four rough phases of thinking around the critical access question.
The first runs through the first decade of scraping litigation, and is marked with cases that adopt an expansive interpretation of the CFAA, with the potential to extend to all scrapers so long as a website can point to some mechanism that signaled access was unauthorized. The second, starting in the late 2000s, was marked by a narrowing of the CFAA and a focus more on the code-based controls of scraping, a move that tended to benefit scrapers. In the third phase courts have receded back to a broad view of the CFAA, brought about by the development of a "revocation" theory of unauthorized access. And most recently, spurred in part by the same policy concerns that led courts to initially constrain the CFAA in the first place, courts have begun to rethink this result.
The conclusion of this piece identifies the broader questions about the CFAA and web scraping that courts must contend with in order to bring more harmony and comprehension to this area of law. They include how to deal with conflicting instructions on authorization coming different channels on the same website, how the analysis should interact with existing technical protocols that regulate web scraping, including the Robots Exclusion Standard, and what other factors beyond the wishes of the website host should govern application of the CFAA to unwanted web scraping.
Twenty Years of Web Scraping and the Computer Fraud and Abuse Act,
Boston University Journal of Science & Technology Law
Available at: https://scholarship.law.bu.edu/faculty_scholarship/465