Optimizing E-commerce efficiency through bot mitigation

The client, a global leader in the e-commerce industry, operates across more than 20 countries and processes an immense volume of over 1,000,000 orders per second.
As a result of this high-profile presence, the platform attracted numerous web crawlers, creating significant challenges in maintaining platform efficiency and delivering a seamless customer experience.
Client Needs
The client urgently needed to mitigate the impact of robotic crawlers, which were significantly increasing server loads and scaling costs on their platform. This activity also added latency for genuine users, compromising the customer experience.
The key requirements included:
- Identifying and blocking harmful bots while ensuring no false positives.
- Avoiding additional latency on page loads to preserve performance standards.
Challenges
The project began with two key constraints: ensuring no false positives, meaning uncertain requests should be let through rather than risk blocking legitimate customers, and avoiding any additional latency on page loads to maintain platform performance.
During the implementation phase, the team encountered several technical challenges. The main challenge was finding a data structure that could quickly answer the question, “Does this IP belong to a robot?” with a latency of less than 10 milliseconds, measured on the server side. Additionally, they needed to establish seamless communication protocols between the Java and Python systems that were exchanging gigabytes of data, and address a business requirement to whitelist specific bots for platform access.

Solutions Provided
A team of six experts, divided into engineering and machine learning (ML) divisions, collaborated to develop an efficient, scalable, and high-performance solution. The machine learning division had a critical objective: every hour, to generate and upload to the cloud a comprehensive file containing the IP addresses definitively identified as belonging to automated bots actively crawling the e-commerce platform. This file was essential for ongoing analysis and proactive measures against bot-driven activity. The engineering division's goal was to provide a set of APIs that could classify, in under 10 milliseconds, whether an IP belonged to a robot, and that would accept data for the continuous retraining of the classification models by the machine learning division.
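To make this division of responsibilities concrete, the following is a minimal sketch of the kind of classification API the engineering division could expose, assuming Spring Boot. The endpoint paths, class names, and the BotLookup abstraction are illustrative assumptions, not the client's actual interfaces.

```java
// Sketch only: endpoint paths, class names and the BotLookup abstraction are
// assumptions for illustration, not the client's actual API.
import java.util.List;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class BotClassificationController {

    // Abstraction over the in-memory IP lookup structure described under
    // "Technology Stack"; a bean implementing it must be provided elsewhere.
    public interface BotLookup {
        boolean isRobot(String ip);
    }

    private final BotLookup lookup;

    public BotClassificationController(BotLookup lookup) {
        this.lookup = lookup;
    }

    // Classify a single IP. The lookup is a local, in-memory traversal, which
    // keeps the server-side work well inside the 10 ms budget.
    @GetMapping("/classify")
    public boolean classify(@RequestParam String ip) {
        return lookup.isRobot(ip);
    }

    // Accept batches of request metadata that are handed over to the ML
    // division for continuous retraining of the classification models.
    @PostMapping("/training-data")
    public void ingestTrainingData(@RequestBody List<String> requestLogLines) {
        // In the real system this data would be persisted (e.g. to S3) for the
        // ML pipeline; here it is only a placeholder.
    }
}
```

Answering the classification from a local data structure rather than an external call is what makes single-digit-millisecond server-side latency feasible.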
Technology Stack
In terms of technology stack, the machine learning division mainly used Python and Pandas, while the engineering division used Java with Spring Boot. The applications ran in the AWS cloud, and Amazon S3 was used to store the exchanged files. To store the IP ranges, the engineering division found a very efficient data structure: a sorted tree in which every octet of an IP address becomes a new descendant, so that a quick traversal either finds the IP (meaning it is a robot) or does not (meaning it belongs to a “good” visitor).
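Below is a minimal sketch of how such an octet-based tree could look, assuming IPv4 addresses and prefix-style ranges (a shorter path such as "203.0.113" marking a whole block as robotic). The class and method names are illustrative, not taken from the client's codebase.

```java
// Minimal sketch of an octet-based IP lookup tree; names and the prefix-range
// handling are assumptions for illustration.
import java.util.Map;
import java.util.TreeMap;

public class IpTree {

    private static final class Node {
        // Sorted map from octet value (0-255) to the next level of the tree.
        final Map<Integer, Node> children = new TreeMap<>();
        boolean blocked; // true when a bot IP or range ends at this node
    }

    private final Node root = new Node();

    // Insert a full IP ("203.0.113.42") or a prefix ("203.0.113") flagged as a bot.
    public void add(String ipOrPrefix) {
        Node current = root;
        for (String part : ipOrPrefix.split("\\.")) {
            current = current.children.computeIfAbsent(Integer.parseInt(part), k -> new Node());
        }
        current.blocked = true;
    }

    // Walk at most four levels; hitting a blocked node means the IP is a robot,
    // falling off the tree means it belongs to a "good" visitor.
    public boolean isRobot(String ip) {
        Node current = root;
        for (String part : ip.split("\\.")) {
            current = current.children.get(Integer.parseInt(part));
            if (current == null) {
                return false;
            }
            if (current.blocked) {
                return true;
            }
        }
        return false;
    }
}
```

Because a lookup touches at most four nodes, its cost is effectively constant no matter how many IP ranges are loaded, which is what keeps the server-side latency well under the 10 millisecond target.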
Results Achieved
- Bot mitigation: Successfully blocked 78% of crawling robots, exceeding the initial goal of 60%.
- Performance: Achieved server-side latency of 6ms, well below the 10ms target.
- Scalability: The solution handled a peak load of 780,000 requests per second without compromising performance.