WE AS WEB


Automatic Text Data Gathering and Processing using AI


In 2021, approximately 80 zettabytes of data were generated worldwide, and this is projected to grow to over 150 zettabytes by 2025. With such exponential growth, the challenge lies in efficiently searching and finding relevant information. To address this, we proposed a technical solution utilizing automatic web searching and scraping, as well as AI models and NLP for data processing.

Our focus was on news articles related to specific topics or entities, but the methods can be applied to various other use cases such as market research, sentiment analysis, business intelligence, and machine learning dataset production. 

The Challenges

Some of the challenges we faced during the project included:

- addressing web server restrictions to avoid IP bans caused by the high request volume;
- ensuring the accuracy of data extraction from diverse HTML documents;
- devising effective methods to handle unstructured data from news articles and blogs;
- scaling the automatic web searching and scraping processes to handle the growing data volume;
- ensuring compliance with web policies and regulations to avoid legal and ethical issues.

Solutions Provided

The automatic processing of news articles involves several stages: web searching, web scraping, content extraction, and data processing.


1. Web Searching

For automatic web searching, we identified two options: develop an in-house solution using Python, Scrapy, and Selenium, or integrate an external API such as SerpWow to quickly retrieve search engine results pages (SERPs) in JSON format.
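As an illustration of the second option, the sketch below queries SerpWow's search endpoint with Python's requests library. The endpoint path, the search_type and num parameters, and the news_results response field follow SerpWow's public documentation, but the API key and query here are placeholders, and the exact response shape should be verified against the current docs.

```python
import requests

# Placeholder credentials; a real key comes from a SerpWow account.
SERPWOW_API_KEY = "YOUR_API_KEY"

def search_news(query, num_results=20):
    """Fetch news results for a query from the SerpWow search API."""
    params = {
        "api_key": SERPWOW_API_KEY,
        "q": query,
        "search_type": "news",  # restrict results to news articles
        "num": num_results,
    }
    response = requests.get("https://api.serpwow.com/search",
                            params=params, timeout=30)
    response.raise_for_status()
    payload = response.json()
    # Each news result carries at least a title and a link to scrape later.
    return [{"title": r.get("title"), "url": r.get("link")}
            for r in payload.get("news_results", [])]

if __name__ == "__main__":
    for item in search_news("renewable energy policy"):
        print(item["title"], "->", item["url"])
```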


2. Web Scraping

When scraping web pages, it was important to manage the time gaps between requests and to work around web server restrictions. We used Python with Scrapy and a proxy service to scrape efficiently; in our tests, processing thousands of URLs took only a couple of minutes.
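A Scrapy spider along these lines is sketched below. The throttling options (DOWNLOAD_DELAY, AUTOTHROTTLE_ENABLED, per-domain concurrency caps) are Scrapy's built-in politeness controls; the proxy URL, the delay values, and the example start URL are placeholders rather than our production configuration.

```python
import scrapy

class NewsPageSpider(scrapy.Spider):
    """Fetches raw HTML for a batch of article URLs (run with `scrapy runspider`)."""
    name = "news_pages"

    # Illustrative target; in production the URLs come from the search stage.
    start_urls = ["https://example.com/some-article"]

    # Politeness settings: spread requests over time and cap concurrency
    # so that target servers do not ban our IPs.
    custom_settings = {
        "DOWNLOAD_DELAY": 0.5,             # base gap between requests
        "RANDOMIZE_DOWNLOAD_DELAY": True,  # add jitter to the gap
        "AUTOTHROTTLE_ENABLED": True,      # adapt the delay to server latency
        "CONCURRENT_REQUESTS_PER_DOMAIN": 4,
        "RETRY_TIMES": 2,
    }

    # Placeholder endpoint for the rotating proxy service mentioned above.
    PROXY = "http://user:pass@proxy.example.com:8000"

    def start_requests(self):
        for url in self.start_urls:
            # Route every request through the proxy pool.
            yield scrapy.Request(url, meta={"proxy": self.PROXY})

    def parse(self, response):
        # Hand the raw HTML to the content-extraction stage (next section).
        yield {"url": response.url, "html": response.text}
```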
 
3. Content Extraction

After scraping the HTML documents, the main content of each news or blog article had to be extracted along with important metadata such as the article title, author, source, and date. We used dragnet, an AI/ML-based extraction library, and integrated Python libraries such as newspaper and goose3 to extract content and metadata. Extracting article dates was particularly challenging due to their varied formats. By combining these libraries with custom logic and algorithms, we achieved over 93% accuracy in extracting content and metadata on the tested scraped pages.
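As a minimal sketch of this stage, the snippet below relies on newspaper3k's Article API for content and metadata, with a simplified dateutil-based fallback standing in for the custom date-parsing logic described above; the _flatten helper is illustrative and not part of any library.

```python
from dateutil import parser as date_parser  # pip install python-dateutil
from newspaper import Article               # pip install newspaper3k

def _flatten(meta):
    """Yield leaf values from newspaper's nested meta-tag dictionary."""
    for value in meta.values():
        if isinstance(value, dict):
            yield from _flatten(value)
        else:
            yield value

def extract_article(url, html):
    """Extract the main content and metadata from a scraped HTML page."""
    article = Article(url)
    article.download(input_html=html)  # reuse the HTML fetched by the scraper
    article.parse()

    publish_date = article.publish_date
    if publish_date is None and article.meta_data:
        # Fallback for the varied date formats: try to parse any meta-tag
        # value that looks like a date (a simplified stand-in for the
        # custom logic described above).
        for value in _flatten(article.meta_data):
            try:
                publish_date = date_parser.parse(str(value))
                break
            except (ValueError, OverflowError):
                continue

    return {
        "url": url,
        "title": article.title,
        "authors": article.authors,
        "date": publish_date,
        "text": article.text,
    }
```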
 
Solution Architecture

We used an architecture based on microservices and message queues.

The data pipeline comprises multiple stages, as presented in the earlier sections: web searching, web scraping, content extraction, and processing (e.g. clustering, summarization). We used multiple message queues to facilitate communication between the pipeline stages and modules, and each stage was served by multiple workers that processed requests and data in parallel, as shown in the sketch below.
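The case study does not name a specific broker, so the worker sketch below assumes RabbitMQ via the pika client; the queue names and the process function are placeholders for a real stage such as content extraction. Each stage would run several copies of a worker like this to obtain the parallelism described above.

```python
import json
import pika  # pip install pika

# Illustrative queue names; each stage reads from one queue and
# publishes its output to the next stage's queue.
IN_QUEUE = "scraped_pages"
OUT_QUEUE = "extracted_articles"

def process(page):
    """Placeholder for the stage's real work (e.g. content extraction)."""
    return {"url": page["url"], "html_length": len(page["html"])}

def handle_message(channel, method, properties, body):
    """Process one message and forward the result to the next stage."""
    result = process(json.loads(body))
    channel.basic_publish(exchange="", routing_key=OUT_QUEUE,
                          body=json.dumps(result))
    # Acknowledge only after publishing, so a crashed worker's message
    # is redelivered to another worker instead of being lost.
    channel.basic_ack(delivery_tag=method.delivery_tag)

def run_worker():
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue=IN_QUEUE, durable=True)
    channel.queue_declare(queue=OUT_QUEUE, durable=True)
    # One unacknowledged message per worker at a time; parallelism comes
    # from running several worker processes per stage.
    channel.basic_qos(prefetch_count=1)
    channel.basic_consume(queue=IN_QUEUE, on_message_callback=handle_message)
    channel.start_consuming()

if __name__ == "__main__":
    run_worker()
```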

Conclusions

The exponential growth of data makes manual processing infeasible across many business sectors. Text data holds significant importance for businesses, governments, and other organizations, and NLP solutions, combined with other AI algorithms, are essential for managing this vast volume of information.
 
The proposed solution for automatic data gathering and processing is applicable to almost any business domain, such as financial and stock market analysis, business intelligence, and market research. We developed a use case for searching, fetching, and processing news articles, but the same solution can also serve brand or company research, as well as comprehensive online information searching, classification, and processing.