
System to Extract Data from Multiformat Documents

One of the biggest problems in document management is the diversity of paperwork formats, structures and origins. As a rule, dealing with the data stored in non-standardized PDFs and on paper requires time and effort.  

There are tools that help optimize document workflow. For example, systems based on optical character recognition (OCR) can extract data from digitized documents. However, no one-size-fits-all tool can process any document in any format, and costly manual verification is often required to guarantee accurate data extraction.

An agriculture analytics and research provider in the UK was facing exactly this problem. The company asked Digiteum to build a custom system that automates the processing of PDF invoices with different structures and extracts meaningful information from them, such as the invoice number, date, company name and total amount.

Efficient document management is essential in any industry
Variety of formats, structures and origins often makes it hard to find a one-size-fits-all solution to capture, read and understand documents

Tesseract OCR for Smart PDF Text Recognition

The Digiteum team analyzed the company’s business process and suggested building an automated PDF processing platform based on two major modules: a data extractor and an OCR service.

First, the data extractor performs a preliminary analysis of a PDF invoice to determine the basic characteristics of the document: whether it is a scanned image, an original PDF or plain text. If the system detects images, it engages the OCR service for PDF text recognition and extraction.
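The routing step described above can be illustrated with a minimal Python sketch. It is not the project's actual code: the `PageInfo` structure, the `min_chars` threshold and the category labels are assumptions made for illustration, and the per-page text and image counts are assumed to come from some upstream PDF parser.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageInfo:
    """Hypothetical summary of one PDF page produced by an upstream parser."""
    text: str          # text layer found on the page (empty for pure scans)
    image_count: int   # number of embedded raster images on the page


def needs_ocr(page: PageInfo, min_chars: int = 20) -> bool:
    """A page goes to OCR when it has images but almost no text layer.

    The 20-character threshold is an arbitrary illustrative value.
    """
    return page.image_count > 0 and len(page.text.strip()) < min_chars


def classify_document(pages: List[PageInfo]) -> str:
    """Label a document as 'scanned', 'text', 'mixed' or 'empty'."""
    if not pages:
        return "empty"
    flags = [needs_ocr(page) for page in pages]
    if all(flags):
        return "scanned"
    if not any(flags):
        return "text"
    return "mixed"
```

In such a design, only the pages flagged by `needs_ocr` would be sent to the OCR service, while pages with a usable text layer would go straight to the data extractor.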

To choose a cost-efficient and reliable OCR service for this purpose, the Digiteum team tested major cloud-based OCR services (ABBYY, Google, Azure and OCR Space) and the open-source offline engine Tesseract. After the analysis, the team selected version 4 of Tesseract as the most advanced option, which showed the highest precision in PDF text recognition.
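The article does not say how recognition precision was measured. One common way to compare OCR engines is character-level accuracy against a manually verified reference text, sketched below in Python. The function names and the engine labels in the usage note are hypothetical, and this metric is an assumption rather than the team's documented method.

```python
from typing import Dict, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Share of reference characters recognized correctly, clamped to [0, 1]."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


def rank_engines(reference: str, outputs: Dict[str, str]) -> List[Tuple[str, float]]:
    """Sort engines by accuracy on one reference text, best first."""
    scores = [(name, character_accuracy(reference, text)) for name, text in outputs.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

For example, `rank_engines("Invoice 1001", {"engine_a": "Invoice 1001", "engine_b": "Invoice l00l"})` places `engine_a` first, because `engine_b` confuses the digit 1 with the letter l twice.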

Apart from its already strong computer vision algorithms, library and configuration capabilities, version 4 of Tesseract offers Deep Learning methods for image understanding. These methods make it possible to experiment with and train neural networks, improve symbol recognition, enhance accuracy and, for example, teach the system to understand handwriting. These benefits allowed the Tesseract OCR service to meet the objectives of the project and perform PDF text recognition of multiformat invoices with a high level of accuracy.
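As a rough illustration of how the neural network engine in Tesseract 4 can be invoked, the Python sketch below builds a command line for the `tesseract` CLI with `--oem 1` (LSTM engine only) and a page segmentation mode. The file name, the language and the choice of `--psm 6` are assumptions for the example, not settings reported for this project. Tesseract reads images rather than PDFs, so a scanned PDF page would first need to be rendered to an image by a separate tool.

```python
import subprocess
from typing import List


def build_tesseract_command(image_path: str, lang: str = "eng", psm: int = 6) -> List[str]:
    """Build a Tesseract 4 CLI call that prints recognized text to stdout.

    --oem 1 selects the LSTM (neural network) engine;
    --psm 6 assumes a single uniform block of text (illustrative choice).
    """
    return ["tesseract", image_path, "stdout",
            "-l", lang, "--oem", "1", "--psm", str(psm)]


def run_ocr(image_path: str) -> str:
    """Run Tesseract on one page image; requires the tesseract binary on PATH."""
    result = subprocess.run(build_tesseract_command(image_path),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Calling `run_ocr("invoice_page_1.png")` on a hypothetical page image would return the recognized text, which could then be passed on to the data extractor.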

Automated recognition and processing of the data stored in multiformat PDF invoices
Custom web system to automate PDF invoice workflow and reduce time, cost and effort spent on paperwork

Data Extraction Algorithms Provide up to 80% Accuracy

The other part of the system, the data extractor engine, uses custom algorithms to perform detailed analysis and extract information from readable PDFs and from documents prepared by the OCR service. The Digiteum team tested 15 algorithms. Some of them reach up to 80% accuracy in PDF text recognition, while others provide 60% recognition quality and can be improved in the future. These algorithms allow the system to identify font style and geometry, recognize tables and their structures, and parse the data against a number of validation rules to find certain classes of data such as account number, product name and contact information.
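The idea of parsing text against validation rules can be sketched in Python as follows. The patterns, field names and the sample invoice are invented for illustration and assume UK-style dates and amounts; the project's real rules are not described in this article and are likely more elaborate.

```python
import re
from datetime import datetime
from typing import Dict, Optional

# Illustrative patterns only; a real invoice layout would need its own rules.
INVOICE_NUMBER = re.compile(
    r"Invoice\s*(?:No\.?|Number|#)\s*:?\s*([A-Z0-9-]*\d[A-Z0-9-]*)", re.IGNORECASE)
INVOICE_DATE = re.compile(r"\b(\d{2}/\d{2}/\d{4})\b")
TOTAL_AMOUNT = re.compile(
    r"Total\s*(?:due|amount)?\s*:?\s*(?:£|GBP)?\s*([\d,]+\.\d{2})", re.IGNORECASE)


def parse_date(raw: str) -> Optional[str]:
    """Validate a DD/MM/YYYY date and return it in ISO format, else None."""
    try:
        return datetime.strptime(raw, "%d/%m/%Y").date().isoformat()
    except ValueError:
        return None


def extract_fields(text: str) -> Dict[str, Optional[object]]:
    """Apply each rule to the text; a field is None when no rule matches."""
    number = INVOICE_NUMBER.search(text)
    date = INVOICE_DATE.search(text)
    total = TOTAL_AMOUNT.search(text)
    return {
        "invoice_number": number.group(1) if number else None,
        "date": parse_date(date.group(1)) if date else None,
        "total": float(total.group(1).replace(",", "")) if total else None,
    }


SAMPLE = "ACME Farm Supplies Ltd\nInvoice No: INV-2041\nDate: 14/03/2019\nTotal due: £1,234.50\n"
fields = extract_fields(SAMPLE)
```

Fields that come back as `None` are natural candidates for manual verification or for a fallback rule, which is one way new validation rules could be added over time.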

Finally, the information extracted by the system is sent to an AWS cloud service, which provides secure and reliable data storage.

In the future, this project will grow into a full-fledged document management system. By training the OCR service with Deep Learning technology, the team can teach the system to better recognize text in scanned or other image-based PDF invoices, improve data validation rules, introduce new pattern-based algorithms and, as a result, reduce the number of errors.

Custom algorithms to perform the detailed analysis and extract information from PDFs
Broad space for the development and scalability of the system

Highlights

  • Custom web system for automated text recognition in multiformat PDF invoices of various quality.
  • Deployment on cloud-based AWS for scalability and strong data storage security.
  • Advanced OCR service for text recognition based on proven computer vision technology and smart Deep Learning methods.
  • Custom data extraction algorithms that enable up to 80% accuracy.
  • Data validation rules that classify data against certain requirements.
  • Broad space for system development and scalability: improving algorithms, enhancing text recognition precision, introducing new data validation rules and training the OCR service.

PROJECT DETAILS

RESULT: Complete MVP with up to 80% data extraction success
CLIENT: Agriculture analytics and research provider (UK)
