Say goodbye to manual table extraction!
DOC² guarantees more accurate, faster, and less expensive invoice operations through artificial intelligence.
Check out our demo version by uploading an invoice!
Introduction to Table Extraction
The amount of data being collected grows every day as the number of applications, software products, and online platforms increases.
To handle and access this volume of data productively, effective information extraction tools are needed.
One sub-area of information extraction that demands attention is the extraction of tables from images, or more generally the detection of tabular data in forms, PDFs, and other documents.
Table Extraction is the task of detecting and decomposing table information in a document.
Imagine you have lots of documents with tabular data that you need to extract for further processing. Conventionally, you would copy the data by hand (onto paper) or type it into Excel sheets.
However, with table OCR software, you can automatically detect tables and extract all tabular data from documents in one go. This saves a lot of time and rework.
In this article, we’ll first look at how POLYDOCS can automatically extract tables from images or documents. We’ll then cover some popular deep learning (DL) techniques to detect and extract tables in documents.
Want to extract tabular data from invoices, receipts or any other type of document? Check out POLYDOCS’ DOC² table extractor to extract tabular data. Schedule a demo to learn more about POLYDOCS’ table extraction feature.
Table Extraction can be so simple
With DOC², you can extract tables from PDF files using the “Line Items” functionality. It extracts tables from all types of documents (invoices, contracts, forms, medical prescriptions, etc.).
You will end up in the table extraction view.
If the document contains very simple tables, DOC² detects and extracts them automatically:
In practice, tables in documents are often much more complex and come in a wide variety of formats and arrangements. For example, text may extend across several columns, or a single line item may contain several lines of text, as is the case with long item descriptions:
And this is where the advantage of DOC² and its table extraction functionality comes into play. There are several ways to train the table extraction functionality and to achieve the best possible result, even with demanding tables.
Define tables and columns
To define tables and columns, import a document, open it, and go to the table extraction view as before (via “Line Items”).
You will land on the following screen, where you can activate the Training Mode:
The “Edit” button activates table selection mode, which lets you edit the document shown on the left side:
You can now use the autodetect tables functionality, and the system will automatically define the tables on the document:
Once the tables are defined you can manually define the columns via the following button:
If a column entry in a table spans several rows, you can group those rows by clicking on the column in the extraction on the right and selecting group.
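To make the idea of grouping rows more concrete, here is a minimal Python sketch of what such a grouping step could look like on already extracted data. It is not DOC²'s implementation: the function name `group_continuation_rows`, the column names `description` and `quantity`, and the rule "a row with an empty quantity continues the previous row" are all assumptions chosen for illustration.

```python
from typing import Dict, List

Row = Dict[str, str]


def group_continuation_rows(
    rows: List[Row],
    key_column: str = "quantity",      # hypothetical column that every real line item fills
    text_column: str = "description",  # hypothetical column whose text may wrap over rows
) -> List[Row]:
    """Merge rows with an empty key column into the row before them."""
    grouped: List[Row] = []
    for row in rows:
        is_continuation = not row.get(key_column, "").strip()
        if is_continuation and grouped:
            # Append the wrapped text to the previous line item.
            extra = row.get(text_column, "").strip()
            if extra:
                previous = grouped[-1].get(text_column, "").strip()
                grouped[-1][text_column] = (previous + " " + extra).strip()
        else:
            # A new line item starts here; copy it so the input stays unchanged.
            grouped.append(dict(row))
    return grouped


# Made-up example: the first item's description wraps onto a second row.
extracted_rows = [
    {"description": "Laptop stand,", "quantity": "2", "price": "40.00"},
    {"description": "adjustable, aluminium", "quantity": "", "price": ""},
    {"description": "USB cable", "quantity": "5", "price": "3.50"},
]

for item in group_continuation_rows(extracted_rows):
    print(item)
```

The sketch relies on a simple heuristic, so it only works when one column is reliably filled for every real line item. A trained extraction model, as described above, is meant to handle the cases where no such simple rule exists.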