Automatic Text Data Gathering and Processing using AI
In 2021, approximately 80 zettabytes of data were generated worldwide, a figure projected to exceed 150 zettabytes by 2025. With such exponential growth, the challenge lies in efficiently searching for and surfacing relevant information. To address this, we proposed a technical solution combining automatic web searching and scraping with AI models and NLP for data processing.
Our focus was on news articles related to specific topics or entities, but the methods can be applied to various other use cases such as market research, sentiment analysis, business intelligence, and machine learning dataset production.
The Challenges
Some of the challenges we faced during the project included:
- addressing web server restrictions to avoid IP bans caused by high request volumes;
- ensuring the accuracy of data extraction from diverse HTML documents;
- devising effective methods to handle unstructured data from news articles and blogs;
- scaling the automatic web searching and scraping processes to handle the growing data volume;
- ensuring compliance with web policies and regulations to avoid potential legal and ethical issues.
Solutions Provided
The automatic processing of news articles involves several stages: web searching, web scraping, content extraction, and content processing.
1. Web Searching
For automatic web searching, we identified two options:
- develop an in-house solution using Python, Scrapy, and Selenium;
- integrate an external API such as SerpWow to quickly retrieve search engine results pages (SERPs) in JSON format (see the sketch below).
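To illustrate the second option, here is a minimal sketch of querying a SERP API from Python. The endpoint, parameter names, and response layout ("news_results", "title", "link") are assumptions based on SerpWow-style APIs, not a verified integration.

```python
import requests

SERPWOW_ENDPOINT = "https://api.serpwow.com/search"  # assumed endpoint

def search_news(query: str, api_key: str, pages: int = 1) -> list[dict]:
    """Fetch news results for a query via a SERP API (sketch)."""
    results = []
    for page in range(1, pages + 1):
        response = requests.get(
            SERPWOW_ENDPOINT,
            params={
                "api_key": api_key,
                "q": query,
                "search_type": "news",  # assumed parameter name
                "page": page,
            },
            timeout=30,
        )
        response.raise_for_status()
        # The JSON shape below is an assumption about the API's output.
        for item in response.json().get("news_results", []):
            results.append({"title": item.get("title"), "url": item.get("link")})
    return results
```

The returned title/URL pairs feed the next stage of the pipeline, which downloads the actual article pages.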
2. Web Scraping
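The scraping stage downloads the article pages found during web searching. Given the challenges above (rate limits and IP bans from high request volumes), a scraper has to throttle itself and back off when a server pushes back. The sketch below is a minimal illustration; the delay values, user agent string, and retry policy are assumptions, not the production configuration.

```python
import time
import requests

HEADERS = {"User-Agent": "news-pipeline-bot/1.0"}  # illustrative user agent

def fetch_page(url: str, retries: int = 3, delay: float = 2.0) -> str | None:
    """Download raw HTML with throttling and simple retries (sketch)."""
    for attempt in range(retries):
        try:
            response = requests.get(url, headers=HEADERS, timeout=30)
            if response.status_code == 200:
                return response.text
            if response.status_code == 429:  # rate limited: back off and retry
                time.sleep(delay * (attempt + 1))
                continue
            return None  # other errors: give up on this URL
        except requests.RequestException:
            time.sleep(delay)
    return None
```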
3. Content Extraction
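Pulling the article body out of diverse HTML documents is the accuracy-sensitive step named in the challenges: every site structures its markup differently. Below is a minimal extraction sketch using BeautifulSoup; the heuristics (the page title tag, paragraphs inside an article element) are assumptions that real-world pages often violate, so a production extractor would need more robust rules.

```python
from bs4 import BeautifulSoup

def extract_article(html: str) -> dict:
    """Pull a title and body text out of raw article HTML (sketch)."""
    soup = BeautifulSoup(html, "html.parser")

    # Drop non-content elements before extracting text.
    for tag in soup(["script", "style", "nav", "footer", "aside"]):
        tag.decompose()

    title = soup.title.get_text(strip=True) if soup.title else ""

    # Prefer paragraphs inside an <article> element; fall back to all <p>.
    container = soup.find("article") or soup
    paragraphs = [p.get_text(strip=True) for p in container.find_all("p")]
    body = "\n".join(p for p in paragraphs if p)

    return {"title": title, "body": body}
```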
Solution Architecture
We used an architecture based on microservices and message queues.
The data pipeline contains the stages presented in the earlier sections: web searching, web scraping, content extraction, and processing (e.g. clustering, summarization). Message queues facilitate communication between the pipeline stages and modules, and each stage is served by multiple workers that process requests and data in parallel.
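To make the queue-based design concrete, here is a minimal worker sketch for one pipeline stage. The original does not name a broker, so RabbitMQ (via the pika client) and the queue names are assumptions; any message queue with acknowledgements would serve the same role.

```python
import json
import pika  # RabbitMQ client; the broker choice is an assumption

IN_QUEUE = "scraped_pages"       # hypothetical queue names
OUT_QUEUE = "extracted_articles"

def main() -> None:
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue=IN_QUEUE, durable=True)
    channel.queue_declare(queue=OUT_QUEUE, durable=True)

    def handle(ch, method, properties, body):
        """Consume one message, run the stage's work, publish the result."""
        page = json.loads(body)
        article = extract_article(page["html"])  # stage logic from section 3
        ch.basic_publish(
            exchange="",
            routing_key=OUT_QUEUE,
            body=json.dumps({"url": page["url"], **article}),
        )
        ch.basic_ack(delivery_tag=method.delivery_tag)

    # One unacknowledged message at a time; run several copies of this
    # process to scale a stage horizontally, as described above.
    channel.basic_qos(prefetch_count=1)
    channel.basic_consume(queue=IN_QUEUE, on_message_callback=handle)
    channel.start_consuming()

if __name__ == "__main__":
    main()
```

Because each worker acknowledges a message only after publishing its result, a crashed worker's in-flight message is redelivered to another worker, which is what lets each stage scale and recover independently.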