Everything you need to know about Document Classification [Complete Guide]

Document classification starts with identifying the text in a document, tagging it, and categorizing the document based on the insights derived from text classification. Automated document classification is possible using algorithms that combine NLP and AutoML and are built on neural networks (deep learning), Naive Bayes classifiers, or, if the data is not too large, a simple logistic regression model (this is not an exhaustive list).

In an intelligent document processing (IDP) workflow, both supervised and unsupervised ML techniques are used to classify documents automatically. A supervised model learns from a labeled training data set and is widely used because of the accuracy it can achieve. Depending on the algorithm, the model may also give the user a confidence score and related metrics that convey how confident it is in each classification.

So, what is document classification? Who may find it useful? What are the different techniques for performing it? What are the benefits and limitations of the deep learning algorithms and machine learning models used to automate it? This article answers all of these questions.

In this article, we’ll go over what document classification means and discuss different aspects of it, including:-

A) Who is document classification for

B) What are the different ways to classify documents

C) Approach taken to classify documents

D) How to hard code the automation of document classifiers with NLP (Natural Language Processing) in Python with an example

By the end of the article, you will have a thorough understanding of Automation in Document Classification. For the scope of this article, we won’t be discussing the manual way of classifying documents.

So, let’s jump right into it:-

What is document classification?

Document classification, or document categorization, is the process of assigning classes or categories to documents as required, which helps with their storage, management, and analysis. It has become an important part of computer science and of the daily functioning of many companies today.

Document classification has been a long-overdue development in the world of automation and data, with documents of every kind (structured and unstructured) being produced across all industries. Every document changes hands between multiple entities and teams before it is analyzed, and manually sorting these documents into the right stream of analysis is a sizeable task.

Think of 10,000 large documents that you need to classify. It is next to impossible to grow rapidly while being slowed down by such a repetitive task.

Types of document classification methods

There are essentially two approaches to classifying and categorizing documents:-

  • Manual Classification
  • Automated Classification

Most companies employ the manual classification approach in their workflow. Smaller organizations with a limited number of documents in their processing queue may manage it in-house, whereas organizations with large numbers of documents may outsource it. Manual classification takes a great deal of time and is also error-prone, costly, and inefficient.

Manual vs automated document classification

Manual document classification suffers from two fatal constraints -

  1. Excessive time consumption - The time required to classify and process a massive heap of documents can be substantial.
  2. Subjectiveness - Humans hold biases and different approaches to reality which can cloud their judgment when classifying documents, leading to subjective and erroneous classification.

It takes about 20-40% of an employee's time to locate a document manually, and another 50% to search for information.

However, using a document processing technology, you can swap out the manual classification process, data capture, and document routing with automation, alleviating the total expenses involved in a traditional document processing workflow.

Auto-classification of documents

The solution to manual classification is automatic document classification, which is much faster and more accurate. As documents are ingested into an intelligent document processing (IDP) system, they are identified, classified, sorted, split, assembled, and processed according to their document type, which enables you to:-

  • Scan documents without pre-sorting or inserting separator pages.
  • Automatically route documents to the appropriate department as per their content.
  • Auto-categorize single-page and multi-page documents.
  • Mark any documents with erroneous or missing pages.
  • Automate verification of relevant batch documents scanning.
  • Assign classified documents to respective team members.

How does automated document classification work?

In an IDP workflow, irrespective of whether a supervised or unsupervised learning technique is adopted, document classification works on 3 levels:-

Level 1 - Identifying the file format

Since IDP solutions deal with multiple document formats, the first step is to determine whether the file is a JPEG, PNG, PDF, TIFF, or some other format. Whether a PDF is scanned or non-scanned (digitally generated) is also determined at this level.
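
One way to make this first level concrete is to inspect a file's leading "magic" bytes rather than trust its extension. The sketch below is a minimal illustration, not how any particular IDP product does it. The names detect_format and detect_file are hypothetical, and the code does not try to tell a scanned PDF from a digitally generated one (that would need a check for an embedded text layer).

```python
# Well-known leading byte signatures for the formats named above.
SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"%PDF-": "pdf",
    b"II*\x00": "tiff",  # little-endian TIFF
    b"MM\x00*": "tiff",  # big-endian TIFF
}


def detect_format(header: bytes) -> str:
    """Return a format label from the first bytes of a file, or 'unknown'."""
    for magic, label in SIGNATURES.items():
        if header.startswith(magic):
            return label
    return "unknown"


def detect_file(path: str) -> str:
    """Read just enough of the file at `path` to identify its format."""
    with open(path, "rb") as fh:
        return detect_format(fh.read(16))


print(detect_format(b"%PDF-1.7\n"))  # pdf
```

Checking content instead of the file name keeps a mislabeled upload (for example, a PNG saved with a .pdf extension) from being routed to the wrong parser.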

Level 2 - Identifying the document structure

Based on the structure, documents come in 3 categories:-

Structured documents

These documents have fixed templates, layouts, key-value pairs, and tables. Tax return forms and mortgage applications are typical examples of structured documents.

Semi-structured documents

These documents may have a fixed set of key-value pairs and tables, but they vary in layout and template. Similar information often appears at different places in different documents. Invoices are a typical example of semi-structured documents.

Unstructured documents

These documents have no fixed structure at all. There are no key-value pairs, formatting, or tables. They are textual in nature and carry information embedded in paragraphs. Contracts are a typical example of unstructured documents.

Level 3 - Identifying the document type

Documents are classified into their respective categories at this level. This process has certain steps:-

Pre-processing

In some IDP workflows, this step comes before identifying the document structure. Its aim is to distinguish the text from the background. Techniques such as binarization, deskewing, and noise reduction are used to improve the quality of the document to be processed.
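
As an illustration of binarization, the sketch below separates dark text from a light background with Otsu's method. Otsu's method is a common thresholding technique chosen here as an assumption, since the article does not name a specific algorithm. The function names are hypothetical, the image is a plain list of rows of 0-255 grey levels, and deskewing and noise reduction are not covered.

```python
def otsu_threshold(pixels):
    """Pick the grey level (0-255) that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels at or below the candidate threshold
    sum_dark = 0  # sum of their grey levels
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        w_light = total - w_dark
        if w_dark == 0:
            continue
        if w_light == 0:
            break
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        between_var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(image):
    """Map each pixel to 1 (ink) if it is at or below the threshold, else 0."""
    threshold = otsu_threshold([p for row in image for p in row])
    return [[1 if p <= threshold else 0 for p in row] for row in image]


# Hypothetical 2x3 grey-level patch: light paper with a few dark strokes.
patch = [[250, 245, 20],
         [240, 15, 10]]
print(binarize(patch))  # [[0, 0, 1], [0, 1, 1]]
```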

Tagged data set

The quality of the tagged dataset is the most important component of a statistical Natural Language Processing (NLP) classifier. The dataset needs to be large enough and of high quality so that the model has sufficient information to clearly distinguish each document type from the others.

Classification methods

Classification methods are of two types:-

i) Visual Approach

In this approach, computer vision analyzes the visual structure of the document without reading its text. This approach works well for structured documents, and in some cases for semi-structured documents as well. It relies on the idea that different document types lay out their information at definite places and in definite patterns. If the model can identify those patterns and distinguish them from the patterns of other document types, it classifies the document accordingly. The advantage of this approach is that it happens during the scanning phase and thus saves a lot of time.

ii) Text classification approach

In this approach, OCR reads the text from the documents, classifies the text, and moves on to classifying the document based on the information derived. With text classification, text can be analyzed at different levels:-

1. Document level - All the text in a document is read.

2. Paragraph level - The text in a particular paragraph is read.

3. Sentence level - The text in a particular sentence is read.

4. Sub-sentence level - Specific phrases are read.

Let’s discuss both the visual and textual recognition techniques in detail.

Automated document classification techniques

Document classification algorithms rely on different recognition methods, based on either text classification or visual classification. The types of recognition involved are as follows:-

1. Computer Vision features recognition

At times, documents in question are so different from each other that there is no need to read their text to classify them - they can be classified by just looking at their structure and style. For example - an invoice and a tax form are so different from each other that you don’t have to read and analyze their entire text to classify them. They can be classified solely based on their structure.

With computer vision, a document is broken down into pixels to learn about its structure, style, and layout. Groups of pixels are analyzed together, identified as objects, and subsequently classified.

Computer vision has grown into a branch of computer science in which computers are taught to make sense of images. From self-driving cars to the AI recognition in your smartphone, it all involves computer vision. Its possibilities keep growing, and it has a wide range of applications such as facial recognition, character recognition, and pattern recognition.

The back-end CV algorithm is complex and depends on the use case to work precisely. It requires a lot of data; even comparatively simple CV algorithms can be trained to recognize things such as hand gestures. More modern applications, such as self-driving cars, use deep learning models that involve CNNs, LSTMs, Transformers, etc.

In computer vision and image processing, a feature is simply a piece of information about the image being processed. This information is used to classify the different building blocks in documents. Based on the format of a particular document type, the CV algorithm recognizes different blocks of information and eventually uses them to classify documents.
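
To make the idea of a layout feature concrete, here is a deliberately tiny sketch: it measures how much "ink" falls in each horizontal band of a binarized page and assigns the page to the closest known layout. Everything in it is an assumption for illustration (the function names, the two template vectors, and the 4x4 page). Production systems typically learn far richer features with models such as CNNs.

```python
def band_densities(page, bands=2):
    """Fraction of ink pixels (1s) in each horizontal band of a binary page."""
    rows_per_band = len(page) // bands
    features = []
    for b in range(bands):
        rows = page[b * rows_per_band:(b + 1) * rows_per_band]
        ink = sum(sum(row) for row in rows)
        features.append(ink / (len(rows) * len(rows[0])))
    return features


def nearest_template(features, templates):
    """Return the label whose template vector is closest (squared distance)."""
    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(features, templates[label]))
    return min(templates, key=distance)


# Hypothetical layouts: invoices are dense at the top, tax forms at the bottom.
TEMPLATES = {"invoice": [0.75, 0.25], "tax_form": [0.25, 0.75]}

page = [[1, 1, 1, 1],
        [1, 1, 0, 0],
        [0, 0, 0, 1],
        [0, 0, 0, 0]]

features = band_densities(page)
print(features)                               # [0.75, 0.125]
print(nearest_template(features, TEMPLATES))  # invoice
```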

2. Textual Recognition

Textual recognition works by recognizing text together with the context associated with it. Lexical processing is then used to understand the underlying genre, theme, and emotion of each sentence, which lets the system pick the class that the document most likely belongs to (to a certain level of accuracy).

There are 3 ways in which textual recognition works:-

Optical Character Recognition

A simple OCR scanner, added as hardware to a system, identifies light and dark areas. The dark areas are then processed as alphabetical characters or numbers, and are recognized one character or word at a time. Taking this to the next level with computer vision, pattern recognition is used: the system is fed examples of text in different forms so that it can recognize them in a scanned document or an image.

Feature detection then applies rules to identify the features of a character, such as the number of lines, curves, and crossings it contains. The character is identified from these features and stored as a character code (such as ASCII) so that the system can handle any further manipulation. The OCR program can do this across a series of blocks, texts, tables, images, formats, pages, etc., breaking documents down to the character level and then making sense of the characters together to present you with the recognized text and its classification, as required.

Essentially, optical character recognition (OCR) technology makes data entry effortless and simplifies classification by extracting text that would otherwise take hours to process manually.

One step beyond OCR is to perform document classification with an NLP algorithm using a programming language like Python.

Rule-based text recognition

Rule-based text recognition recognizes words in a document in different ways, such as in isolation, through explicit word segmentation, or through simultaneous recognition. It can also be based on searching for certain terms in a document to understand where the document might belong.

‘Rules’ in a rule-based text recognition system guide the system to identify semantically relevant elements of a text and classify it into relevant categories based on its content. Each rule consists of an antecedent, i.e. a pattern, and the category or classification it predicts.

For example, suppose you want to classify topics into two groups: Food and Careers. First, define the words that fall into each category (for example, dark chocolate, lettuce, fries, etc. fall into Food, and engineers, doctors, accountants, etc. fall into Careers).

The classifier then counts the instances of these words in an incoming text, sees which type of word occurs more often, and classifies the text accordingly.

For example, take a sentence that says, “Careers in the industry of engineers and doctors are seeing a massive trend of eating more dark chocolate.” It contains two Careers words (engineers, doctors) and one Food term (dark chocolate), so the classifier puts the text into the ‘careers’ category.
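
The Food and Careers example can be sketched in a few lines of Python. This is only a minimal illustration of keyword counting, with the word lists taken from the example above; the names RULES and classify are hypothetical, and ties go to whichever category is listed first.

```python
import re

# Keyword lists from the Food/Careers example.
RULES = {
    "food": ["dark chocolate", "lettuce", "fries"],
    "careers": ["engineers", "doctors", "accountants"],
}


def classify(text):
    """Count whole-word keyword matches per category and return the winner."""
    lowered = text.lower()
    counts = {}
    for category, terms in RULES.items():
        counts[category] = sum(
            len(re.findall(r"\b" + re.escape(term) + r"\b", lowered))
            for term in terms
        )
    best = max(counts, key=counts.get)
    return best, counts


sentence = ("Careers in the industry of engineers and doctors are seeing "
            "a massive trend of eating more dark chocolate.")
print(classify(sentence))  # ('careers', {'food': 1, 'careers': 2})
```

Because the rules are plain word lists, it is easy to see why a text received its label, which is the transparency that rule-based systems are valued for.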

Rule-based systems are not black-box algorithms and can be developed easily. However, they have certain disadvantages: they require domain knowledge and are time-consuming to build.

Generating rules for a complex system can be quite challenging and requires a lot of data. Rule-based systems are also difficult to maintain, because they need many new rules over time, and these do not always scale well with the existing ones.

Document classification with NLP

Natural Language Processing algorithms differentiate between documents using lexical and semantic processing. This can be combined with techniques such as tokenization, stop word removal, word stemming, and a bag-of-words representation to arrive at an algorithm that distinguishes between classes of documents based on the words they contain.

Platforms for document auto-classification are easy to find, and they let you skip the hassle of coding a feature recognition engine or a textual classifier with NLP in a language like Python. Both routes are covered in the next sections of this article.
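
The chain of tokenization, stop word removal, stemming, and bag of words can be sketched without any NLP library. The stop word list and the suffix-stripping "stemmer" below are crude stand-ins invented for illustration; a real pipeline would use a library's stop word list and a proper stemmer such as Snowball.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption, not a library's list).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "for", "to", "and", "in"}


def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Strip one common suffix; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def bag_of_words(text):
    """Return stem counts for the text, ignoring stop words and word order."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    return Counter(crude_stem(t) for t in tokens)


bag = bag_of_words("The invoices are listing the invoiced amounts")
print(bag["invoic"], bag["list"], bag["amount"])  # 2 1 1
```

Note how "invoices" and "invoiced" collapse to the same stem, so both count toward one feature.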

Different methods may read text on different levels based on the training model adopted. Based on the information retrieved, there are 2 classification models that data scientists use for document classification:-

i) Supervised

In this learning method, the user needs to define a set of tags for different documents. For example, if ‘invoice number’, ‘vendor’s name’, ‘invoice owner’s name’, ‘purchase order number’, and other related fields are tagged and identified in a document, the document can be classified as an invoice. The accuracy of this model depends on the text fields classified and tagged.

ii) Unsupervised 

In this learning method, a set of words/sentences/phrases are grouped together without any prior training. These grouped sets are then used to classify the document type.

Document auto-classification using Python

In this section, we’ll go over some code and break it down to understand how you can make your auto-classification algorithm with Python.

First, we start with importing the following libraries:

import re
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import spacy
from nltk.corpus import wordnet
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_recall_curve, precision_score,
                             recall_score, roc_auc_score, roc_curve)
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler, Normalizer
from wordcloud import WordCloud

We have imported utility, modeling, visualization, evaluation, and NLP libraries.

Our data, let’s say, consists of comments from different users that we are going to try to label as toxic or not toxic. For this particular example, we’ll work on the data provided on Kaggle for the Toxic Comment Classification Challenge.

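We’ll now import our data into the notebook (this path format is the one used to import data into a Kaggle Notebook):

data = pd.read_csv("../input/traindata/train.csv")
data.reset_index(inplace=True, drop=True)
Text = "comment_text"
# Assumption: the Kaggle file stores the toxic/non-toxic flag in a column named "toxic";
# the later snippets refer to it as "label".
data["label"] = data["toxic"]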

The regular expressions below are common patterns (easy to find on Stack Overflow) for stripping unwanted tokens from text.

def apply_regex(text):
    # text is a pandas Series of strings; each step rewrites every comment
    text = text.apply(lambda x: re.sub(r"\S*\d\S*", " ", x))       # words containing digits
    text = text.apply(lambda x: re.sub(r"\S*@\S*\s?", " ", x))     # emails and mentions
    text = text.apply(lambda x: re.sub(r"\S*#\S*\s?", " ", x))     # hashtags
    text = text.apply(lambda x: re.sub(r"http\S+", " ", x))        # URLs
    text = text.apply(lambda x: re.sub(r"[^a-zA-Z0-9 ]", " ", x))  # non-alphanumeric characters
    text = text.apply(lambda x: x.replace("\ufffd", "8"))          # Unicode replacement character
    text = text.apply(lambda x: re.sub(r" +", " ", x))             # repeated spaces
    return text

data[Text] = apply_regex(data[Text])

Essentially, this removes numbers and words that are concatenated with numbers, emails, hashtags, URLs, multiple spaces, etc. None of these is necessary for our model to decide whether a comment is toxic or not.

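Preprocess the data using an NLP library, spaCy, to remove the stop words that are predefined in the library:

preprocess = spacy.load("en_core_web_sm")
stop_words = preprocess.Defaults.stop_words
# Lowercase, split on whitespace, and drop stop words (the 'tokenized' column is used below)
data["tokenized"] = data[Text].apply(lambda x: [t for t in x.lower().split() if t not in stop_words])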

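Apply stemming using the following code:

stemmer = SnowballStemmer(language="english")
def applyStemming(listOfTokens):
    return [stemmer.stem(token) for token in listOfTokens]
# Join the stems back into one string per comment, since TfidfVectorizer expects strings
data['stemmed'] = data['tokenized'].apply(lambda tokens: " ".join(applyStemming(tokens)))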

Check a sample of the data, for example with data.sample(5) or data.head().

You can also visualize the most common words in the non-toxic comments, as shown below. (Note: the word cloud for toxic text contains a lot of unprofessional words, so it is best not to generate it. This code is here only for demonstration and is adapted from another participant in the Kaggle competition. Best to stick to the non-toxic word cloud, thanks!)

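# Assumption: label 0 marks non-toxic comments
non_toxic_text = " ".join(data.loc[data["label"] == 0, Text])
wordcloud_pos = WordCloud(collocations=False).generate(non_toxic_text)
plt.figure(figsize=(15, 10))
plt.imshow(wordcloud_pos, interpolation="bilinear")
plt.axis("off")
plt.title("Most common words associated with non-toxic comments", size=20)
plt.show()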

Splitting the data into train and test sets:

X_train, X_test, y_train, y_test = train_test_split(data["stemmed"], data["label"])

Next, we convert the text into TF-IDF vectors, a basic frequency-based approach.

Term Frequency–Inverse Document Frequency (TF-IDF) is a very common method for computing word importance across the documents in a data set. The assumption is that the more often a word appears in a document, the more important it is for that document, while a word that appears in many documents is less useful for telling them apart.

TF-IDF accordingly assigns each word a weight based on how frequently it occurs. Words that occur in almost all documents end up with a lower weight. The result is a bag-of-words representation: an array containing the score of each word, in which word order is lost and context is not considered. This keeps the computation fast enough for a local machine.
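
To see where the weights come from, here is a hand-rolled version of the textbook formula, tf × idf with tf = count / document length and idf = ln(N / document frequency). It is a sketch for intuition only: scikit-learn's TfidfVectorizer uses a smoothed idf and normalizes each row by default, so its numbers will differ. The three toy documents are made up.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    ["invoice", "total", "due"],
    ["invoice", "tax", "form"],
    ["invoice", "contract", "terms"],
]
weights = tf_idf(docs)

# "invoice" occurs in every document, so ln(3/3) = 0 and its weight is 0.
# "tax" occurs in one of three documents, so its weight is (1/3) * ln(3).
print(weights[1]["invoice"], round(weights[1]["tax"], 4))
```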

Use the following code to fit and transform the data into an array as required.

tfid = TfidfVectorizer(lowercase=False, max_features=500)
train_vectors_tfidf = tfid.fit_transform(X_train).toarray()
test_vectors_tfidf = tfid.transform(X_test).toarray()

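Use the following code to normalize the TF-IDF vectors (with its default settings, TfidfVectorizer already L2-normalizes each row, so this step mainly makes the normalization explicit):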

norm_TFIDF = Normalizer(copy=False)
norm_train_tfidf = norm_TFIDF.fit_transform(train_vectors_tfidf)
norm_test_tfidf = norm_TFIDF.transform(test_vectors_tfidf)

For the model itself, we’ll use a Naive Bayes classifier.

model = MultinomialNB()
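
For intuition about what a multinomial Naive Bayes classifier computes, and where a confidence score can come from, here is a dependency-free sketch that works on raw token counts with add-one smoothing. It is not scikit-learn's implementation (which operates on the feature matrix), and the class name TinyMultinomialNB and the four training documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Multinomial Naive Bayes over token lists, with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.term_counts[label].update(tokens)
        self.vocab = {t for c in self.term_counts.values() for t in c}
        return self

    def log_scores(self, tokens):
        """Unnormalized log posterior for each class."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                if t in self.vocab:  # terms never seen in training are ignored
                    score += math.log((counts[t] + 1) / denom)
            scores[label] = score
        return scores

    def predict_proba(self, tokens):
        """Normalize the log scores into probabilities (confidence scores)."""
        scores = self.log_scores(tokens)
        top = max(scores.values())
        exp = {label: math.exp(s - top) for label, s in scores.items()}
        norm = sum(exp.values())
        return {label: v / norm for label, v in exp.items()}


train_docs = [
    ["invoice", "total", "due"],
    ["amount", "due", "invoice"],
    ["agreement", "party", "terms"],
    ["party", "terms", "liability"],
]
train_labels = ["invoice", "invoice", "contract", "contract"]

nb = TinyMultinomialNB().fit(train_docs, train_labels)
proba = nb.predict_proba(["invoice", "due"])
print({label: round(p, 3) for label, p in proba.items()})
# {'invoice': 0.9, 'contract': 0.1}
```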

A custom function that returns a dataframe with all our evaluation metrics -

def classifier(y_test, predictions, modelName):
    # Confusion-matrix counts for the binary case
    tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
    prec = precision_score(y_test, predictions)
    rec = recall_score(y_test, predictions)
    f1 = f1_score(y_test, predictions)
    acc = accuracy_score(y_test, predictions)
    spec = tn / (tn + fp)  # specificity (true negative rate)
    score = {'Model': [modelName], 'acc': [acc], 'f1': [f1], 'rec': [rec], 'Prec': [prec], 'Specificity': [spec], 'TP': [tp], 'TN': [tn], 'FP': [fp], 'FN': [fn], 'y_test size': [len(y_test)]}
    df_score = pd.DataFrame(data=score)
    return df_score
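
Each of these metrics follows directly from the four confusion-matrix counts. The sketch below derives them by hand for a made-up, imbalanced result (the counts are hypothetical, not from the Kaggle data), which shows why accuracy alone can be misleading when one class dominates.

```python
def metrics_from_counts(tp, tn, fp, fn):
    """Derive the usual binary-classification metrics from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical test set of 100 comments, 18 of them toxic.
result = metrics_from_counts(tp=8, tn=80, fp=2, fn=10)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Here accuracy is 0.88 even though recall is only about 0.44, meaning more than half of the toxic comments were missed.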

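To train the model and test it:

# Scale the normalized TF-IDF features, then fit and evaluate the Naive Bayes model
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(norm_train_tfidf)
X_test_scaled = scaler.transform(norm_test_tfidf)
model.fit(X_train_scaled, y_train)
preds = model.predict(X_test_scaled)
scores = classifier(y_test, preds, "MultinomialNB")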

To check the model scores, display the scores dataframe.

To recap: we started by cleaning the data and running a common NLP pipeline, then used a basic frequency-based TF-IDF approach to turn the text into features. A different baseline model would give a different outcome, and results would also change with the hyperparameters chosen for the best-performing model. The goal here is simply to show how you can build a working document classifier (a comment classifier, in this case).

Note: This is only an example demonstration. To put the classifier into production, a much more advanced algorithm will be required, which might not be possible for everyone to develop. This is where alternative ways of using a document classifier come into the picture, leading to optimization of the process as a whole.

Document auto-classification - benefits and perks

Document auto-classification goes beyond algorithmically classifying documents with advanced ML and offers the following perks -

1. Adaptability to highly variable content

With advanced Machine Learning technology and AI augmentation, document categorization automatically categorizes scanned and digital documents as per their content, even when the content is variable.

2. Employee time savings

Automating document classification eliminates the requirement for human intervention and manual classification of documents, which is time-consuming and potentially repetitive.

Implementing auto-classification saves employee time, improves job satisfaction, and reduces staff turnover.

3. Prevent data breaches

Automated document classification helps enterprises efficiently gather and centralize data. This helps identify PII (Personally Identifiable Information), reducing the risk of a data breach.

The classification of sensitive data improves organizations’ ability to evaluate and address sources of PII, delete redundant documents that contain sensitive information, and retain critical PII.

Document classification with Docsumo

Coming to document auto-classification, here is how you can classify different document types in Docsumo:-

Step 1: Open ‘API and Services’

Visit ‘API and Services’ on Docsumo's interface

Step 2: Enable document types

Under 'Actions', enable the document types you wish to categorize. Once enabled, the status of each of those document types changes from ‘disabled’ to ‘enabled’.

Step 3: Enable ‘Auto-classification’

To enable the ‘auto-classification’ feature, make sure that each document type that you’ve selected in step 2 has been trained on at least 20 documents.

Step 4: Upload your documents

Go back to the ‘Document Types’ and upload the documents collectively in the auto-classification section.

Step 5: Receive classified document types

Get intelligently classified outputs according to their respective document types, which are visible under ‘Types’.

If you wish to have different document types evaluated by different team members, you can select the ‘Auto-Assign’ option by following these steps:-

A. Visit 'Document Types'

Navigate to the ‘Document Types’ option.

B. Open Settings

Select the Settings icon for a particular document type.

C. Choose a member

Pick a suitable member from your team from the 'General Settings' option.

After following the above three steps, you can auto-classify different document types, delegate them to individual team members, and obtain validation and approval.

What are the differences between hard coding an algorithm and using a service like Docsumo?

1. Hard coding an algorithm can cost your organization a huge sum: you need to set up a server, hire developers, and prepare the data for the algorithm to work on, all of which is also time-consuming. None of these costs are incurred when using a service like Docsumo instead.

2. Manually entering millions of rows of data from millions of documents is almost impossible to consider, and a hard-coded algorithm that does it can be expected to make some errors. With Docsumo, you can always go back and review the outcome that the algorithm delivers to ensure that accuracy is not compromised.

3. With a hard-coded algorithm, it is difficult to define a function that directly verifies relationships between values, such as net income being the difference between gross income and tax. You can identify the values at specific locations in the document, but double-checking them requires very complicated semantic analysis on the back end. In Docsumo, you can add multiple such checks using custom settings.

Data protection and integration with Docsumo

At Docsumo, we take data protection and security very seriously. Docsumo is a GDPR-compliant and SOC 2 certified company. All requests are transferred over HTTPS only, and data in transit is encrypted with AES-256. All data stored on S3 and MongoDB is also encrypted.

You remain in control by choosing to delete the data from our servers promptly or periodically after you have completed document processing. You can monitor which individuals have access to different data types in your organization via advanced user management.

We realize that no platform exists in a vacuum, which is why we have built our solutions to integrate with other software and solutions. By employing plug-in APIs and out-of-the-box input and output connectors, our platform can conveniently get integrated into any workflow.

If you’re curious about how Docsumo operates and simplifies document processing for different industries, accurately extracts data, and safely stores and organizes it, all in real time, sign up for a 14-day free trial. We’d love to hear from you about your business use case and figure out how we can help!

Written by
Pankaj Tripathi

Helping enterprises capture data for analytics and decisioning
