Document classification starts with identifying the text in a document, tagging it, and categorizing the document based on the insights derived from text classification. Automated document classification is possible using algorithms that combine NLP and AutoML and are built on neural networks (deep learning), Naive Bayes classifiers, or, if the data set is not too large, a simple logistic regression (this is not an exhaustive list).
In an intelligent document processing (IDP) workflow, both supervised and unsupervised ML techniques are used to classify documents automatically. A supervised model learns from a labeled training data set and is widely used because of the accuracy it can produce. Depending on the algorithm used, the model may also provide a confidence score and related metrics to convey how confident it is in a given classification.
So, what is document classification? Who may find it useful? What are the different techniques for performing document classification? What are the limitations and benefits of the different deep learning algorithms and machine learning models used to automate document classification? All of these questions are answered in this article.
In this article, we’ll go over what Document Classification means, and discuss different aspects of it, including:-
A) Who is document classification for
B) What are the different ways to classify documents
C) Approach taken to classify documents
D) How to hard code the automation of document classifiers with NLP (Natural Language Processing) in Python with an example
By the end of the article, you will have a thorough understanding of Automation in Document Classification. For the scope of this article, we won’t be discussing the manual way of classifying documents.
So, let’s jump right into it:-
What is document classification?
Document Classification, or Document Categorization, is the process of assigning classes or categories to documents as required, which helps with the storage, management, and analysis of those documents. It has become an important part of computer science and of the daily functioning of many companies today.
Document classification has been a long-due development in the world of automation and data, with documents of every kind (structured and unstructured) being produced across all industries. Every document passes through multiple entities and teams before it is analyzed, and manually classifying these documents so that they reach the right stream of analysis is a considerable task.
Think of 10,000 large documents that you need to classify. It's next to impossible to grow rapidly while being slowed down by such a repetitive task.
Types of document classification methods
There are essentially two approaches to classifying and categorizing documents: manual classification and automatic classification.
Most companies employ the manual classification approach in their workflow. Smaller organizations with a limited number of documents in their processing queue may manage it in-house, whereas organizations with large numbers of documents may outsource it. Besides taking a great deal of time, manual classification is error-prone, costly, and inefficient.
Manual document classification suffers from two fatal constraints -
Excessive time consumption - The time required to classify and process a massive heap of documents can be substantial.
Subjectiveness - Humans hold biases and different approaches to reality which can cloud their judgment when classifying documents, leading to subjective and erroneous classification.
It takes about 20-40% of an employee's time to locate a document manually, and another 50% to search for information.
However, with a document processing technology, you can replace manual classification, data capture, and document routing with automation, reducing the total expenses involved in a traditional document processing workflow.
Auto-classification of documents
The solution to manual classification is automatic document classification, which is much faster and more accurate. As documents are ingested into an IDP system, they are identified, classified, sorted, split, assembled, and processed according to their document type, which enables you to:-
Scan documents without pre-sorting or inserting separator pages.
Automatically route documents to the appropriate department as per their content.
Auto-categorize single-page and multi-page documents.
Mark any documents with erroneous or missing pages.
Automate verification of scanned document batches.
Assign classified documents to respective team members.
How does automated document classification work?
In an IDP workflow, irrespective of supervised or unsupervised learning technique adopted, document classification works on 3 levels:-
Level 1 - Identifying the file format
Since IDP solutions deal with multiple document formats, the first step is to determine whether the file is a jpeg/png/pdf/tiff or any other format. Whether a PDF is scanned or non-scanned is also determined at this level.
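One common way to identify a file format is to look at the first few bytes of the file (its "magic bytes") rather than trusting the extension. The sketch below is only an illustration of that idea for the four formats named above; the function name `detect_format` is hypothetical and is not part of any IDP product.

```python
# Leading byte signatures for the formats mentioned in the text.
SIGNATURES = [
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"%PDF-", "pdf"),
    (b"II*\x00", "tiff"),  # little-endian TIFF
    (b"MM\x00*", "tiff"),  # big-endian TIFF
]


def detect_format(data: bytes) -> str:
    """Return a format label based on the leading bytes, or 'unknown'."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"


# Usage: in practice you would pass the first bytes read from the uploaded file.
header = b"%PDF-1.7\n%..."
file_format = detect_format(header)
```

Telling a scanned PDF from a non-scanned one needs a further check (for example, whether the pages contain extractable text), which this sketch does not attempt.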
Level 2 - Identifying the document structure
Based on the structure, documents come in 3 categories:-
Structured documents
These documents have fixed templates, layouts, key-value pairs, and tables. Tax return forms and mortgage applications are the best examples of structured documents.
Semi structured documents
These documents may have a fixed set of key-value pairs and tables but they vary in terms of layouts and templates. They may often have similar information at different places in the different documents. Invoices are the best example of semi structured documents.
Unstructured documents
These documents have no structure at all. There are no key-value pairs, formatting, or tables. Documents are textual in nature and carry information embedded in paragraphs. Contracts are the best examples of unstructured documents.
Level 3 - Identifying the document type
Documents are classified into respective categories at this level. This process has certain steps:-
Pre-processing
In some IDP workflows, this step comes before identifying the document structure. The aim of this step is to distinguish the text from the background. Techniques such as binarization, deskewing, and noise reduction are used to improve the quality of the document to be processed.
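To make binarization concrete, here is a minimal sketch that separates dark text from a light background using Otsu's method, one well-known way to pick a global threshold. It is an assumption that an IDP system would use this particular method; the image is represented as a plain list of rows of grey levels (0 to 255), and the function names are illustrative.

```python
def otsu_threshold(pixels):
    """Pick the grey level that maximizes between-class variance (Otsu's method)."""
    hist = [0] * 256
    for row in pixels:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_threshold, best_variance = 0, -1.0
    weight_bg = 0  # number of pixels at or below the candidate level
    sum_bg = 0     # sum of the grey levels of those pixels
    for level in range(256):
        weight_bg += hist[level]
        sum_bg += level * hist[level]
        if weight_bg == 0:
            continue  # no dark pixels yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is already in the dark class
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(pixels, threshold):
    """Map pixels at or below the threshold to 0 (ink) and the rest to 255 (paper)."""
    return [[0 if value <= threshold else 255 for value in row] for row in pixels]


# A tiny made-up 3x3 "scan": values near 0 are ink, values near 255 are paper.
page = [
    [250, 245, 30],
    [20, 240, 25],
    [248, 35, 252],
]
clean_page = binarize(page, otsu_threshold(page))
```

Deskewing and noise reduction are separate steps and are not covered by this sketch.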
Tagged data set
The quality of the tagged dataset is the most important component of a statistical Natural Language Processing (NLP) classifier. The dataset needs to be large enough and of high quality so that the model has sufficient information to clearly distinguish one document type from the others.
Classification methods are of two types:-
i) Visual Approach
In this approach, computer vision analyzes the visual structure of the document without reading its text. This approach works well for structured documents, and in some cases for semi-structured documents as well. It works on the idea that different document types have information laid out at definite places and in definite patterns. If the model is able to identify those patterns and distinguish them from the patterns of other document types, it classifies the document accordingly. The advantage of this approach is that it happens during the scanning phase and thus saves a lot of time.
ii) Text classification approach
In this approach, OCR reads the text from the documents, classifies the text, and moves on to classifying the document based on the information derived. With text classification, text can be analyzed at different levels:-
1. Document level - All the text in a document is read.
2. Paragraph level - Text in a particular paragraph is read.
3. Sentence level - Reads text from a particular sentence.
4. Sub-sentence level - Specific phrases are read.
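As a rough illustration of these four levels, the sketch below splits a piece of OCR output into paragraphs, sentences, and sub-sentence phrases using regular expressions. The splitting rules (blank lines for paragraphs, end punctuation for sentences, commas/semicolons/colons for phrases) are simplifying assumptions; real systems typically rely on an NLP library for this.

```python
import re


def split_levels(document: str) -> dict:
    """Split text into the four levels of granularity described above."""
    # Paragraphs are assumed to be separated by blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]
    # Sentences are assumed to end with '.', '!' or '?' followed by whitespace.
    sentences = [
        s.strip()
        for p in paragraphs
        for s in re.split(r"(?<=[.!?])\s+", p)
        if s.strip()
    ]
    # Sub-sentence phrases are assumed to be separated by ',', ';' or ':'.
    phrases = [
        c.strip()
        for s in sentences
        for c in re.split(r"[,;:]", s)
        if c.strip()
    ]
    return {
        "document": document,
        "paragraphs": paragraphs,
        "sentences": sentences,
        "sub_sentences": phrases,
    }


sample = "Invoice number: 1042. Payment is due in 30 days.\n\nThank you, and see you soon!"
levels = split_levels(sample)
```

A classifier can then be applied to whichever level suits the task, for example to each sentence separately or to the document as a whole.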
Let’s discuss both the visual and textual recognition techniques in detail.
Automated document classification techniques
Document classification algorithms rely on different recognition methods, which are based on either text classification or visual classification. The types of recognition involved in classifying documents are given below:-
1. Computer Vision features recognition
At times, documents in question are so different from each other that there is no need to read their text to classify them - they can be classified by just looking at their structure and style. For example - an invoice and a tax form are so different from each other that you don’t have to read and analyze their entire text to classify them. They can be classified solely based on their structure.
With the capabilities of computer vision, a document is broken down into pixels to learn about its structure, style, and layout. The pixels are analyzed as an image, groups of pixels are identified as objects, and these objects are subsequently classified.
Computer vision has grown into a branch of computer science in which computers are taught to make sense of an image. From self-driving cars to AI recognition in your smartphone, it all involves computer vision. The possibilities are growing every year, with a wide range of applications such as facial recognition, character recognition, and pattern recognition.
The back-end CV algorithm is complex and must be tailored to the use case to work precisely. It requires a lot of data; with enough of it, even relatively simple CV algorithms can be trained to recognize things like hand gestures. More modern applications, such as self-driving cars, use deep learning models that involve CNNs, LSTMs, Transformers, etc.
In computer vision and image processing, a feature is simply a piece of information about the image being processed. This information is used to classify the different building blocks in documents. Based on the format of a particular document type, the CV algorithm recognizes different blocks of information and eventually uses them to classify the document.
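As a toy illustration of a layout feature, the sketch below divides a page image into a grid, measures the fraction of dark pixels in each cell, and assigns the page to the closest known layout. This is only a stand-in for what a trained computer vision model learns; the grid size, the `templates` values, and the function names are all hypothetical.

```python
def grid_features(pixels, rows=2, cols=2, dark_below=128):
    """Fraction of dark pixels in each cell of a rows x cols grid.

    Assumes the image is at least rows x cols pixels in size.
    """
    height, width = len(pixels), len(pixels[0])
    features = []
    for r in range(rows):
        for c in range(cols):
            cell = [
                pixels[y][x]
                for y in range(r * height // rows, (r + 1) * height // rows)
                for x in range(c * width // cols, (c + 1) * width // cols)
            ]
            dark = sum(1 for value in cell if value < dark_below)
            features.append(dark / len(cell))
    return features


def nearest_layout(features, templates):
    """Return the label of the template closest to the features (squared Euclidean)."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(templates, key=lambda label: distance(features, templates[label]))


# Hypothetical reference layouts: ink density per grid cell for each document type.
templates = {
    "invoice": [1.0, 0.0, 0.0, 0.0],   # dense header block in the top-left corner
    "tax_form": [0.0, 0.0, 1.0, 1.0],  # dense table across the bottom half
}

# A made-up 4x4 page with a dark block in the top-left corner.
page = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [255, 255, 255, 255],
    [255, 255, 255, 255],
]
predicted_type = nearest_layout(grid_features(page), templates)
```

Note that no text is read at any point: the decision is based purely on where the ink sits on the page.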
2. Textual Recognition
Textual recognition works by recognizing text together with the context associated with it. Lexical processing is then used to understand the underlying genre, theme, and emotion of a sentence, which helps the system pick the class that the document most likely belongs to (to a certain level of accuracy).
There are 3 ways in which textual recognition works:-
Optical Character Recognition
In a simple OCR scanner that is added to a system as hardware, light and dark areas are identified. The dark areas are then processed one character or word at a time and recognized as letters or numbers. Taking this to the next level, computer vision algorithms use pattern recognition: the system is fed examples of text in different forms so that it can recognize text in a scanned document or an image.
Feature detection then applies rules to identify the features of each character. These features can include the number of lines, curves, crosses, etc. in a particular character. Once identified, the character is stored as an ASCII code within the system so that it can be manipulated further. The OCR program can do this on a series of blocks, texts, tables, images, formats, pages, etc., breaking documents down to the character level and then making sense of the characters together to present you with the recognized text and its classification, as required.
Essentially, optical character recognition (OCR) technology makes data entry effortless and classification simpler by producing text classifications that would otherwise take hours to do manually.
One step beyond the OCR is to perform Document Classification with an NLP algorithm using a programming language like Python.
Rule-based text recognition
Rule-based text recognition recognizes words in a document in different ways such as isolation, explicit word segmentation, simultaneous recognition, etc. It can also be based on searching certain terms in a document to understand where they might belong.
‘Rules’ in a rule-based text recognition system guide the system to identify semantically relevant elements of a text and classify it into relevant categories based on its content. Each rule consists of an antecedent (a pattern) and the category or classification it predicts.
For example, suppose you want to classify topics into two groups: Food and Careers. First, define the words that fall into each category (for example, dark chocolate, lettuce, and fries fall into Food, and engineers, doctors, and accountants fall into Careers).
Counting the instances of these words in an incoming text shows which category's words occur more often, and the text is classified accordingly.
For example, given the sentence “Careers in the industry of engineers and doctors are seeing a massive trend of eating more dark chocolate.”, the classifier finds two Careers words (engineers, doctors) and one Food term (dark chocolate), so it classifies the document containing this text into the ‘careers’ category.
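The Food versus Careers example can be sketched in a few lines. The keyword lists below come from the example above; the function name `classify_by_rules` and the tie-breaking behavior (the first category listed wins a tie, and a text with no matches gets no category) are assumptions made for illustration.

```python
import re

# Each rule maps a category to the terms (antecedent patterns) that indicate it.
RULES = {
    "food": ["dark chocolate", "lettuce", "fries"],
    "careers": ["engineers", "doctors", "accountants"],
}


def classify_by_rules(text, rules=RULES):
    """Count whole-word matches per category and return (best_category, counts)."""
    lowered = text.lower()
    counts = {
        category: sum(
            len(re.findall(r"\b" + re.escape(term) + r"\b", lowered))
            for term in terms
        )
        for category, terms in rules.items()
    }
    best = max(counts, key=counts.get)
    # If no term matched at all, decline to classify.
    return (best if counts[best] > 0 else None), counts


sentence = (
    "Careers in the industry of engineers and doctors are seeing "
    "a massive trend of eating more dark chocolate."
)
category, counts = classify_by_rules(sentence)
# category is "careers": two Careers terms versus one Food term.
```

Because every decision traces back to a visible keyword count, the result is easy to explain, which is the main appeal of rule-based systems.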
Rule-based systems are not black-box algorithms and can be developed easily. However, they have certain disadvantages: they require domain knowledge and are time-consuming to build.
Generating rules for a complex system can be quite challenging and needs a lot of data too. Rule-based systems are also difficult to maintain, because their upkeep requires many new rules that don’t always scale well with the existing ones.
Document classification with NLP
Natural Language Processing algorithms differentiate between documents using lexical and semantic processing. These can be combined with techniques such as tokenization, stop word removal, word stemming, and a bag-of-words representation to arrive at an algorithm that can distinguish between different classes of documents based on the words they contain.
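The sketch below strings these techniques together in plain Python: tokenization, stop word removal, stemming, and a bag-of-words count. The stop word list is a tiny illustrative sample and the "stemmer" is deliberately crude suffix stripping, both assumptions made to keep the example dependency-free; a library such as NLTK or spaCy would normally provide these pieces.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (real lists contain hundreds of words).
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to", "was"}
# Crude suffix stripping, standing in for a real stemmer such as Snowball.
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def bag_of_words(text):
    """Return stem counts after stop word removal; word order is discarded."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    return Counter(stem(t) for t in tokens)


bow = bag_of_words("The scanned documents are classified and the document is routed.")
# "documents" and "document" share the stem "document", so they are counted together.
```

The resulting counts are the kind of input that a downstream classifier compares across document classes.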
It is easy to find a platform that performs document auto-classification for you, which lets you skip the hassle of coding a feature recognition engine or an NLP text classifier in a language like Python. Both options are covered in the next sections of this article.
Different methods may read text on different levels based on the training model adopted. Based on the information retrieved, there are 2 classification models that data scientists use for document classification:-
Supervised learning
In this learning method, the user needs to define a set of tags for different documents. For example, if ‘invoice number’, ‘vendor’s name’, ‘invoice owner’s name’, ‘purchase order number’, and other related fields are tagged and identified in a document, the document can be classified as an invoice. The accuracy of this model depends on the text fields that are classified and tagged.
Unsupervised learning
In this learning method, sets of words, sentences, or phrases are grouped together without any prior training. These grouped sets are then used to classify the document type.
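To illustrate grouping without labels, the sketch below clusters documents by word overlap (Jaccard similarity) in a single greedy pass. This is one simple, assumed way to group similar texts, not the method of any particular IDP product; the threshold of 0.3 and the sample texts are made up.

```python
def jaccard(a, b):
    """Share of words two sets have in common (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_documents(docs, threshold=0.3):
    """Greedy grouping: each document joins the first group whose seed is similar enough."""
    groups = []  # list of (seed_word_set, [document indices])
    for index, text in enumerate(docs):
        words = set(text.lower().split())
        for seed, members in groups:
            if jaccard(words, seed) >= threshold:
                members.append(index)
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append((words, [index]))
    return [members for _, members in groups]


docs = [
    "invoice number vendor name total amount",
    "invoice number purchase order total amount",
    "employment contract between employer and employee",
    "contract terms between employer and contractor",
]
clusters = group_documents(docs)
# clusters is [[0, 1], [2, 3]]: invoice-like texts and contract-like texts.
```

No tags were supplied at any point; the groups emerge from the vocabulary alone, and a person (or a later step) still has to decide what each group represents.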
Document auto-classification using Python
In this section, we’ll go over some code and break it down to understand how you can make your auto-classification algorithm with Python.
First, we start with importing the following libraries:
# Data handling
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

# NLP libraries (wordnet lives in nltk.corpus, not nltk.text)
from nltk.corpus import wordnet
from nltk import SnowballStemmer
import spacy

# Features, model selection, model, and evaluation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import (
    confusion_matrix, accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, roc_curve, precision_recall_curve,
)

# Visualization
from wordcloud import WordCloud
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')
NUM_OF_THREADS = 10
Our data, let’s say, consists of comments from different users that we are going to try to label as toxic or not toxic. For this particular example, we’ll work on the data provided on Kaggle for the Toxic Comment Classification Challenge.
The imports above cover utilities, models, visualizations, evaluation metrics, and NLP libraries.
We’ll now import our data into the notebook: (this format helps to import data into a Kaggle Notebook)
data = pd.read_csv("../input/traindata/train.csv")
data.dropna(inplace=True)
data.reset_index(inplace=True, drop=True)
Text = "comment_text"
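On Stack Overflow, one can easily find regex patterns for removing unwanted text such as numbers, emails, and URLs.
import re

def apply_regex(text):
    # Remove tokens that contain digits
    text = text.apply(lambda x: re.sub(r"\S*\d\S*", " ", x))
    # Remove tokens containing '@' (emails, mentions)
    text = text.apply(lambda x: re.sub(r"\S*@\S*\s?", " ", x))
    # Remove tokens containing '#' (hashtags)
    text = text.apply(lambda x: re.sub(r"\S*#\S*\s?", " ", x))
    # Remove URLs
    text = text.apply(lambda x: re.sub(r"http\S+", " ", x))
    # Replace anything that is not a letter, digit, or space
    text = text.apply(lambda x: re.sub(r"[^a-zA-Z0-9 ]", " ", x))
    # Kept from the original; a no-op here, since the previous step already removed U+FFFD
    text = text.apply(lambda x: x.replace(u"\ufffd", "8"))
    # Collapse repeated spaces
    text = text.apply(lambda x: re.sub(" +", " ", x))
    return text

data[Text] = apply_regex(data[Text])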
Essentially this now removes text such as numbers and words that are concatenated with numbers, emails, hashtags, URLs, multiple spaces, etc. All of these are not necessary for our model to verify if a comment is toxic or not.
Next, preprocess the data using an NLP library, spaCy, to remove the stop words that are predefined in the library.
You can also build visualizations, such as a word cloud, to check the most common non-toxic words. (Note: the visuals for toxic text contain a lot of unprofessional words, so it's best not to generate them. That code is intended only for demonstration and is adapted from one of the other competition participants at Kaggle. Best to stick to the non-toxic word cloud, thanks!)
Next, the data is converted into TF-IDF embeddings, which is a basic frequency-based approach.
Term Frequency–Inverse Document Frequency (TF-IDF) is a very common method that is used to compute word importance across documents in data. The assumption is that the more times a word appears in a document, the more important that word is for that document compared to the rest.
TF-IDF accordingly assigns a weight to each word based on how frequently it occurs. In the end, the words assigned a lower weight are the words that occur in all documents in general. A bag-of-words representation is used: each document becomes an array containing the score of each word. Word order is lost and context is not considered, which keeps the computation fast on a local machine.
The data is then fit and transformed into an array of TF-IDF scores, for example with the TfidfVectorizer imported earlier.
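To show what such a transform computes, here is a dependency-free sketch of the textbook TF-IDF formula: term frequency (count divided by document length) multiplied by the inverse document frequency, log(N / df). It is an illustration only, and the comments used are made up. Library implementations such as scikit-learn's TfidfVectorizer use a smoothed IDF and normalize each row by default, so their numbers will differ from these.

```python
import math
from collections import Counter


def tfidf_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i][j] is the TF-IDF of word j in doc i."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    idf = {token: math.log(n_docs / doc_freq[token]) for token in vocabulary}

    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        length = len(tokens) or 1  # avoid dividing by zero for empty documents
        matrix.append([counts[token] / length * idf[token] for token in vocabulary])
    return vocabulary, matrix


comments = ["you are great", "you are awful awful", "you did great work"]
vocabulary, matrix = tfidf_matrix(comments)
you = vocabulary.index("you")
awful = vocabulary.index("awful")
# "you" appears in every comment, so its weight is 0 everywhere, while
# "awful" appears twice in a single comment and gets the largest weight there.
```

Each row of `matrix` is the bag-of-words array described above: one score per vocabulary word, with no information about word order.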
Starting with cleaning the data and running a common NLP pipeline, the embedding step produces a basic frequency-based TF-IDF representation on which a baseline model can be trained. A different baseline model would give a different outcome, and that might also change with the hyperparameters chosen for the best-performing model. The goal here is just to show how you can take a document classifier, or a comment classifier in this case, and turn it into a working model.
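The imports earlier include a multinomial Naive Bayes model, so as an illustration of what such a baseline does, here is a minimal pure-Python version trained on word counts with add-one (Laplace) smoothing. It is a sketch under stated assumptions, not the article's model: the class name `TinyMultinomialNB` and the four training comments are made up, and raw counts are used instead of TF-IDF scores.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Multinomial Naive Bayes on word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            # Start from the log prior, then add the log likelihood of each word.
            score = math.log(doc_count / total_docs)
            for word in text.lower().split():
                if word in self.vocabulary:  # words never seen in training are ignored
                    score += math.log((counts[word] + 1) / denominator)
            scores[label] = score
        return max(scores, key=scores.get)


# Made-up training comments standing in for the Kaggle data.
train_texts = [
    "you are an idiot",
    "what a stupid idea",
    "thanks for the helpful edit",
    "great article well written",
]
train_labels = ["toxic", "toxic", "not_toxic", "not_toxic"]
model = TinyMultinomialNB().fit(train_texts, train_labels)

test_texts = ["such a stupid idiot", "helpful and well written"]
test_labels = ["toxic", "not_toxic"]
predictions = [model.predict(text) for text in test_texts]
accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)
```

On real data you would hold out a test split and look at precision, recall, and F1 as well as accuracy, since toxic comments are usually a small minority of the data.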
Note: This is only an example demonstration. To put the classifier to use in production, a much more advanced algorithm will be required, which might not be possible for everyone to develop. This is where alternative ways of using a document classifier come into the picture, leading to optimization of the process as a whole.
Document auto-classification - benefits and perks
Document classification goes beyond algorithmically classifying documents with advanced ML and offers the following perks -
1. Adaptability to highly variable content
With advanced Machine Learning technology and AI augmentation, document categorization automatically categorizes scanned and digital documents as per their content, even when the content is variable.
2. Employee time savings
Automating document classification eliminates the requirement for human intervention and manual classification of documents, which is time-consuming and potentially repetitive.
Implementing auto-classification saves employee time, improves job satisfaction, and reduces staff turnover.
3. Prevent data breaches
Automated document classification helps enterprises efficiently gather and centralize data. This helps identify PII (Personally Identifiable Information), reducing the risk of a data breach.
The classification of sensitive data improves organizations’ ability to evaluate and address sources of PII, delete redundant documents that contain sensitive information, and retain critical PII.
Document classification with Docsumo
Coming to document auto-classification, here is how you can classify different document types in Docsumo:-
Step 1: Open 'API and Services'
Visit ‘API and Services’ on Docsumo's interface
Step 2: Enable document types
Under 'Actions' enable the document types you wish to categorize. After enabling the required document types, their status type will change from ‘disabled’ to ‘enabled’ for that specific document type.
Step 3: Enable ‘Auto-classification’
To enable the ‘auto-classification’ feature, make sure that each document type that you selected in Step 2 has been trained against at least 20 documents.
Step 4: Upload your documents
Go back to the ‘Document Types’ and upload the documents collectively in the auto-classification section.
Step 5: Receive classified document types
Get intelligently classified outputs according to their respective document types, which are visible under ‘Types’.
If you wish to have different document types evaluated by different team members, you can select the ‘Auto-Assign’ option by following these steps:-
A. Visit 'Document Types'
Navigate to the ‘Document Types’ option.
B. Open Settings
Select the Settings icon for a particular document type.
C. Choose a member
Pick a suitable member from your team from the 'General Settings' option.
After following the above three steps, you can auto-classify different document types and delegate them to individual team members and obtain validation and approval.
What are the differences between hard coding an algorithm and using a service like Docsumo?
1. Hard coding an algorithm can cost your organization a huge sum: you need to set up a server, hire developers, and prepare the data for the algorithm to work on, all of which is also time-consuming. None of these costs are incurred when using a service like Docsumo instead.
2. Manually entering millions of rows of data from millions of documents is almost impossible to consider, and hard coding an algorithm to do that can be expected to produce some errors. With Docsumo, you can always go back and review the outcome that the algorithm delivers to ensure that accuracy is not compromised.
3. With a hard-coded algorithm, it is difficult to define a function that directly verifies relationships such as Gross Income minus tax being equal to Net Income. You can locate the values at specific positions in the document, but double-checking them requires very complicated semantic analysis on the back end. In Docsumo, you can add multiple such checks using custom settings.
Data protection and integration with Docsumo
At Docsumo, we take data protection and security very seriously. Docsumo is a GDPR compliant and SOC-2 certified company. All requests are transferred over HTTPS only, and data transfer is encrypted with AES-256. All the data stored on S3 and MongoDB is also encrypted.
You remain in control by choosing to delete the data from our servers promptly or periodically after you have completed document processing. You can monitor which individuals have access to different data types in your organization via advanced user management.
We realize that no platform exists in a vacuum, which is why we have built our solutions to integrate with other software and solutions. By employing plug-in APIs and out-of-the-box input and output connectors, our platform can conveniently get integrated into any workflow.
If you’re curious about how Docsumo operates and simplifies document processing for different industries, accurately extracts data, and safely stores and organizes it, all in real time, sign up for a 14-day free trial. We’d love to hear from you about your business use-case and figure out how we can help!