Legal Issues Involved in Attack and Defense of Web Crawlers

Lin Hua

As crawlers will exist side by side with the Internet in the foreseeable future, any connected website will unavoidably encounter them. To properly address our co-existence with crawlers, effective technical measures and appropriate legal weapons should be put in place to prevent malicious crawlers comprehensively.
Supreme Court's website compromised by crawlers
Recently it has been reported that myriad judgment documents published on China Judgments Online by the Supreme People's Court of the People's Republic of China were being offered for sale by third parties. According to the website's own figures, over 70 million judgment documents have been published there, while a number of sellers claim that they can provide more than 60 million entries. If these numbers are accurate, the overwhelming majority of the judgments published online free of charge are being sold publicly.
According to the sellers' descriptions, their goods were obtained through web crawlers. Under a commonly accepted definition, a "web crawler" is any program or script that automatically captures information from the World Wide Web according to certain rules. Its core is code configured to automatically search for and capture website information.
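By way of illustration, the core of such a program can be sketched in a few lines of Python using only the standard library: parse a fetched page, collect its links, and treat them as the next pages to visit. The HTML below is invented for the example; a real crawler would fetch pages over the network and loop.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag -- the 'certain rules'
    by which a crawler decides which pages to capture next."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical page content; a real crawler would download this.
page = ('<html><body>'
        '<a href="/doc/1.html">Judgment 1</a>'
        '<a href="/doc/2.html">Judgment 2</a>'
        '</body></html>')

parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # the frontier of URLs to crawl next
```

A full crawler simply repeats this fetch-parse-enqueue cycle at scale, which is also why uncontrolled crawlers can generate enormous request volumes.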
Beyond the public sale of judgment documents, the media's attention has also been drawn to the latency and crashes of China Judgments Online under the excessive workload imposed by numerous crawlers. In its reply regarding the slow access and frequent breakdowns of the website, the Supreme People's Court stated that "... A large number of technology companies use web crawlers to initiate unlimited concurrent accesses and obtain judgment document data from the website illegally. They create huge workloads and congest the traffic of many other normal users, so that access becomes very slow and some web pages cannot be displayed, among others."
Does it constitute an infringement?
Some law experts believe that it constitutes infringement to use web crawlers to capture judgment-related data. Zhang Xinnian, a lawyer and deputy chairman of Beijing Social Organization & Law Mediation Center, pointed out that "these judgment documents are published online for the purpose of judicial transparency. They are free public resources. Anybody's selling them for a profit without the authorization of the Supreme People's Court constitutes an infringement."1
But do web crawlers necessarily constitute infringement? In terms of the crawler technology itself and the copyright-related attributes of judgment documents, it is highly debatable whether the technology is infringing in and of itself. First of all, crawlers are among the most widely adopted technologies in the Internet sector, which could not have grown to its present scale without them. All search engines must rely on crawlers, as they have no other means of effectively searching the entire Internet. Fundamentally, crawlers form the basis for the circulation of information on the Internet, and play a huge role in its effective collection, integration and propagation. The technology neutrality principle can be applied to crawlers.
To prevent crawlers from crossing the line and obtaining nonpublic information from its server, a crawled website may adopt the Internet's widely recognized Robots Exclusion Protocol, placing directives in its robots.txt file that instruct crawlers to keep out.
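A minimal sketch of how this works in practice, using Python's standard urllib.robotparser: the site (here a hypothetical example.com with a /judgments/ directory) publishes a few lines of robots.txt, and a compliant crawler consults them before fetching any URL.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, served at the site root,
# barring every crawler ("User-agent: *") from /judgments/.
robots_lines = [
    "User-agent: *",
    "Disallow: /judgments/",
]

rp = RobotFileParser()
rp.parse(robots_lines)

# A compliant crawler checks the protocol before each fetch.
print(rp.can_fetch("*", "https://example.com/judgments/doc1.html"))  # False
print(rp.can_fetch("*", "https://example.com/index.html"))           # True
```

Note that robots.txt is a convention, not an enforcement mechanism: it binds only crawlers that choose to honor it, which is precisely why the legal questions discussed below arise.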
Second, from the perspective of copyright, crawlers do not always infringe the rights of the websites they crawl. There have been cases where actors used crawlers to grab copyrighted information and spread it illegally, some of them serious enough to meet the threshold for criminal penalties.2 However, in view of the nature of the contents on China Judgments Online, what the crawlers captured does not constitute infringement under the Copyright Law.
Article 5 of the Copyright Law of the People's Republic of China provides that "This law shall not be applicable to: (1) laws; regulations; resolutions, decisions and orders of state organs; other documents of legislative, administrative and judicial nature; and their official translations." Judgment documents, which are issued by the judicial organs and are of a judicial nature, are not protected under the Copyright Law. It therefore cannot be deduced in law that a court's authorization is needed to use its judgment documents. Provided that the means of obtaining the information is itself lawful, and that no legal restriction prevents public-domain information from being used for profit, the act is lawful, just as anyone may lawfully publish and sell ancient books whose copyright protection period has expired.
Collaborative governance against crawlers
Well-developed crawler applications affect the Internet in two ways. Properly and reasonably employed, they facilitate the effective collection and propagation of information on the Internet. Once misused, they impose heavy workloads on the crawled websites, adversely affecting their operation and normal access. As an extremely common application, crawlers are regarded in technical circles as one of the leading generators of Internet traffic. Although the technology neutrality principle applies, preventing the misuse of crawlers still requires collaborative governance in law, technology and other fields.
1.      Legal countermeasures against crawler misuse
Article 48(6) of the Copyright Law lists, as one of the infringing acts, "without the permission from the copyright owner or obligee related to the copyright, intentionally avoiding or destroying the technical measures taken by the obligee on his works, sound recordings or video recordings, etc. to protect the copyright or the rights related to the copyright..." When website obligees use the "Disallow" directive to restrict crawlers in a robots.txt file deployed on their servers, they essentially rely on the technical standard to delineate a forbidden zone for external crawlers. It is a technical measure adopted to protect contents. As long as no antitrust violation or copyright misuse is involved, crawlers that disobey robots.txt and grab copyrighted contents have avoided or destroyed the technical measures taken by the website's obligee within the meaning of the Copyright Law.
In addition, the Anti-Unfair Competition Law provides another legal basis for restraining infringing crawlers. Article 12 of the Anti-Unfair Competition Law of the People's Republic of China provides that "No business may, by technical means to affect users' options, among others, commit the following acts of interfering with or sabotaging the normal operation of online products or services legally provided by another business: ...(4) Other acts of interfering with or sabotaging the normal operation of online products or services legally provided by another business." Moreover, Article 24 of the same law provides legal consequences for violations of Article 12. The misuse of crawlers costs the operator little but imposes huge workloads on the crawled websites, and their unrestricted, relentless grabbing of contents often results in website crashes and breakdowns. In such cases, Article 12 of the Anti-Unfair Competition Law can be cited reasonably and justifiably.
2.      Technical countermeasures against crawlers
As crawlers are no more than a technology, using one technology against another is naturally the first choice of website obligees for self-defense. The Supreme People's Court said that it had adopted anti-crawling measures such as limiting the number of pages that can be turned in list views and requiring CAPTCHA verification once a specific number of views is reached within a given period of time. It also stated that it will update its anti-crawler technology from time to time to strengthen the website and increase its efficiency and robustness.
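The view-counting defense described above can be sketched as a sliding-window counter. The thresholds and client identifier below are hypothetical, and a production system would store counters in shared state (e.g. a cache server) rather than process memory; the sketch only shows the triggering logic.

```python
import time
from collections import defaultdict, deque

# Hypothetical policy: demand a CAPTCHA after more than
# 5 views from one client within a 60-second window.
WINDOW_SECONDS = 60
MAX_VIEWS = 5

_hits = defaultdict(deque)  # client id -> timestamps of recent views

def needs_captcha(client_id, now=None):
    """Record one page view and report whether this client has
    exceeded the view limit and must pass a CAPTCHA."""
    now = time.monotonic() if now is None else now
    hits = _hits[client_id]
    hits.append(now)
    # Discard views that have fallen out of the sliding window.
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()
    return len(hits) > MAX_VIEWS

# Six rapid views from one client: the sixth trips the CAPTCHA.
results = [needs_captcha("203.0.113.7", now=t) for t in range(6)]
print(results)  # [False, False, False, False, False, True]
```

Human readers rarely exceed such limits, while a bulk crawler trips them almost immediately, which is why this simple mechanism is so widely deployed.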
In addition, new defensive techniques will continue to emerge in the contest between misused crawlers and their targets. For example, to protect data from being parsed and grabbed by crawlers, essential information can be rendered as images in the front end, so that it remains readable to the naked eye but cannot be read as text by machines.
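A dependency-free sketch of the idea: wrap the sensitive string in an SVG image delivered as a data URI, so the page's HTML carries a picture rather than selectable text. This is only illustrative; real deployments typically rasterize to a bitmap with an imaging library so the characters survive only as pixels (the SVG here still contains the text internally), and determined scrapers can always fall back to OCR, so the measure raises the cost of bulk extraction rather than preventing it. The case number is invented for the example.

```python
import base64

def text_as_svg_data_uri(text):
    """Render text as an inline SVG image for an <img src="...">:
    the browser draws it for the human eye, but the surrounding
    HTML no longer exposes the text as ordinary markup."""
    svg = (
        '<svg xmlns="http://www.w3.org/2000/svg" width="400" height="30">'
        f'<text x="0" y="20" font-size="16">{text}</text></svg>'
    )
    payload = base64.b64encode(svg.encode("utf-8")).decode("ascii")
    return f"data:image/svg+xml;base64,{payload}"

uri = text_as_svg_data_uri("Case No. (2018) Jing 01 Min Zhong 1234")
print(uri[:40])  # a data URI usable directly as an image source
```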
The wide adoption of crawler technology is a necessity for the existence and growth of the Internet. Under the technology neutrality principle, it is neither advisable nor feasible to ban crawlers outright, which would deal a deadly blow to the normal propagation and collection of information on the Internet. However, the misuse of crawlers to forcibly grab data has long been a public hazard in the Internet industry; virtually all Internet platforms maintain dedicated anti-crawler teams and regularly update their anti-crawler measures.
(Translated by Ren Qingtao)
