Raghav Sethi, 2427030127

Optical Handwritten Digit Recognition Using Deep Learning Models

Modern deep learning architectures such as VGG16, ResNet18, and Vision Transformers can effectively classify handwritten digits from MNIST. Residual connections and attention mechanisms are hypothesized to outperform traditional CNN pipelines.

Abstract

This project evaluates VGG16, ResNet18, and Vision Transformer models on the MNIST handwritten digit dataset using identical preprocessing and training pipelines. Results demonstrate superior performance from Vision Transformers, highlighting the benefit of global attention in visual classification tasks.

Problem Statement

Handwritten digit recognition suffers from variability in writing styles, stroke thickness, and digit morphology, and traditional hand-crafted feature methods generalize poorly across these variations. This project benchmarks three pretrained architectures to determine their stability and accuracy under controlled conditions.

Research Gap

Direct comparison of VGG16, ResNet18, and Vision Transformers under identical transfer-learning pipelines is limited. This work closes that gap.

MNIST Dataset

The Modified National Institute of Standards and Technology (MNIST) dataset is a widely used benchmark in computer vision and machine learning, particularly for evaluating handwritten digit recognition systems. It consists of 70,000 grayscale images of handwritten digits ranging from 0 to 9, collected from various writers to ensure diversity in writing styles, stroke thickness, curvature, and digit structure.

The dataset is divided into 60,000 training samples and 10,000 testing samples. Each original image has a resolution of 28×28 pixels and contains a single digit centered within the frame. Despite its apparent simplicity, MNIST presents several challenges such as overlapping digit shapes (for example, 4 vs 9 or 5 vs 6), inconsistent handwriting patterns, and variations in scale and alignment.

For compatibility with pretrained deep learning architectures originally designed for ImageNet, all MNIST images were resized to 224×224 pixels, and the single grayscale channel was replicated to produce three-channel inputs. Pixel values were normalized with mean 0.5 and standard deviation 0.5 per channel, mapping them to the range [-1, 1]. This transformation enables effective transfer learning by allowing pretrained convolutional and transformer-based models to extract meaningful visual features from digit images.
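The resizing, channel replication, and normalization steps above can be sketched in plain NumPy (the project itself uses torchvision transforms; the nearest-neighbour upscaling here is a simplification of the library's bilinear resize):

```python
import numpy as np

def preprocess_mnist(img):
    """Sketch of the MNIST preprocessing: resize a 28x28 grayscale
    digit to 224x224, replicate it to three channels, and normalize
    pixel values to [-1, 1]."""
    # Nearest-neighbour upscaling by a factor of 8 (28 * 8 = 224).
    up = np.repeat(np.repeat(img, 8, axis=0), 8, axis=1)
    # Replicate the single grayscale channel into (C, H, W) layout.
    rgb = np.stack([up, up, up], axis=0).astype(np.float32)
    # Scale to [0, 1], then normalize with mean 0.5, std 0.5 -> [-1, 1].
    return (rgb / 255.0 - 0.5) / 0.5

digit = np.zeros((28, 28), dtype=np.uint8)  # placeholder blank image
x = preprocess_mnist(digit)
print(x.shape)  # (3, 224, 224)
```

A black pixel (0) maps to -1.0 and a white pixel (255) maps to 1.0, which is the [-1, 1] input range described above.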

Each image represents one of ten digit classes (0–9), providing a balanced multiclass classification problem. MNIST serves as an ideal experimental platform due to its standardized splits, extensive prior research, and reproducibility across studies. In this project, the dataset is used to comparatively evaluate traditional convolutional networks, residual architectures, and transformer-based vision models under identical training conditions.

Dataset Summary: 70,000 images • 10 classes • 28×28 grayscale • resized to 224×224 • RGB conversion • normalized input

Sample Digits from MNIST

The MNIST dataset contains significant variation in handwriting styles. Below are representative samples illustrating intra-class diversity in stroke thickness, curvature, and digit structure.

Digit: 7

Digit: 4

Additional digit samples may be added to further illustrate intra-class diversity.

Model Architecture Comparison

This project evaluates three fundamentally different deep learning paradigms: a traditional convolutional network (VGG16), a residual convolutional network (ResNet18), and an attention-based Vision Transformer. The table below summarizes their architectural characteristics and practical trade-offs.

Aspect             | VGG16                    | ResNet18                  | Vision Transformer
Feature learning   | Local spatial features   | Local + residual features | Global context features
Typical parameters | ~138M                    | ~11M                      | ~86M (ViT-Base)
Training stability | Moderate                 | High                      | Moderate on small datasets
Training time      | High                     | Medium                    | High
Compute cost       | High                     | Medium                    | High
Data efficiency    | Strong on small datasets | Strong on small datasets  | Weak without large pretraining
Primary limitation | Gradient degradation     | Limited global context    | Heavy data & compute demand

ResNet18 provides the best stability–efficiency trade-off, while Vision Transformers benefit from global attention at the cost of higher computational requirements.

Methodology

Dataset

MNIST (70,000 images). Resized to 224×224, converted to RGB, normalized to [-1,1].

Models

VGG16, ResNet18, and Vision Transformer (ViT-Base), each with its final classification head replaced by a 10-class output layer.

System Architecture

Training Analysis & Experimental Evidence

This section presents training loss curves, test accuracy plots, confusion matrices, and implementation snapshots for VGG16, ResNet18, and Vision Transformer models trained on MNIST.
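As a concrete illustration of how the confusion matrices in this section are computed, a minimal NumPy sketch (the actual plots were produced with standard PyTorch/matplotlib tooling; the 4-vs-9 example mirrors the overlapping digit shapes noted earlier):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes=10):
    """Count how often each true digit (rows) is predicted as each
    digit (columns); the diagonal holds correct classifications."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Tiny example: one 4 misclassified as a 9.
true = [4, 4, 9, 7]
pred = [4, 9, 9, 7]
cm = confusion_matrix(true, pred)
print(cm[4, 9])  # 1 -> a single 4/9 confusion
accuracy = cm.trace() / cm.sum()  # 3 of 4 correct
```

Overall test accuracy is the trace of this matrix divided by the total sample count, which is how the per-model accuracies in the Results section are obtained.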

ResNet18

VGG16

Vision Transformer

Experimental Environment

Model training, evaluation, and visualization were conducted using PyTorch in a Jupyter/VS Code environment.

Results

Final test accuracies: VGG16 – 98.53%, ResNet18 – 99.22%, Vision Transformer – 99.33%.

Comparison with Existing Methods

Table II: Comparison of MNIST classification accuracy with existing methods and results obtained in this study.

Research Paper & Publication

Based on the experimental findings and comparative analysis presented in this project, a formal research paper has been authored in collaboration with Dr. Tarun Jain. The paper documents the methodology, architectural comparisons, training behavior, and performance evaluation of VGG16, ResNet18, and Vision Transformer models on the MNIST dataset.

The manuscript is currently undergoing the publication process, with Dr. Tarun Jain overseeing submission and review. This work aims to contribute empirical insights into transfer learning and attention-based vision models for handwritten digit recognition.

This project therefore extends beyond academic coursework and represents an ongoing research effort toward peer-reviewed publication.

Conclusion & Future Work

Vision Transformers achieved the highest accuracy, demonstrating the value of attention-based architectures. Future work includes larger datasets, hyperparameter optimization, and deployment on edge devices.

References

  1. V. Agrawal, J. Jagtap, and M. P. Kantipudi, “Exploration of advancements in handwritten document recognition techniques,” Intelligent Systems with Applications, vol. 22, p. 200358, 2024.
  2. L. Deng, “The MNIST database of handwritten digit images for machine learning research,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 141–142, 2012.
  3. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  4. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” Proceedings of CVPR, pp. 770–778, 2016.
  5. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556, 2014.
  6. A. Dosovitskiy et al., “An image is worth 16×16 words: Transformers for image recognition at scale,” arXiv:2010.11929, 2020.
  7. S. S. Kadam, A. C. Adamuthe, and A. B. Patil, “CNN model for image classification on MNIST and Fashion-MNIST dataset,” Journal of Scientific Research, vol. 64, no. 2, pp. 374–384, 2020.
  8. S. Tabik et al., “A snapshot of image preprocessing for CNNs: Case study of MNIST,” International Journal of Computational Intelligence Systems, vol. 10, no. 1, pp. 555–568, 2017.
  9. L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
  10. K. Taghandiki, “Handwritten digit recognition on MNIST using transfer learning with VGG16,” Journal of Skill Sciences and Creativity, vol. 1, no. 2, pp. 45–68, 2024.
  11. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, 2nd ed., Springer, 2009.
  12. J. Yosinski et al., “How transferable are features in deep neural networks?” NeurIPS, pp. 3320–3328, 2014.

Project Guide

Dr. Tarun Jain

Team Member

Raghav Sethi

2427030127