A simple Python script that parses SIR (Special Intensive Revision) electoral roll PDFs published by the Election Commission of India, extracts voter details from the scanned Malayalam pages using OCR, and writes them to a clean CSV with Malayalam-to-English translation.
Special Intensive Revision (SIR) is a comprehensive verification process of electoral rolls conducted by the Election Commission of India to update, correct, add, or remove voter entries.
- Malayalam OCR using Tesseract
- Parses scanned Election Commission SIR PDFs
- Extracts voter details into structured CSV
- Malayalam → English translation
- UTF-8 encoded output
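The OCR step listed above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of sir_parser.py: the function names `ocr_pdf` and `clean_lines`, the 300 DPI setting, and the whitespace clean-up are assumptions made for the example. It relies on `convert_from_path` from pdf2image (which needs Poppler) and `image_to_string` from pytesseract (which needs Tesseract with the `mal` language data).

```python
from typing import List


def ocr_pdf(pdf_path: str, lang: str = "mal", dpi: int = 300) -> List[str]:
    """Render each PDF page to an image and OCR it; returns one string per page.

    Hypothetical helper: the imports are kept local so the rest of this
    sketch can be used without the OCR dependencies installed.
    """
    from pdf2image import convert_from_path  # requires Poppler
    import pytesseract  # requires the Tesseract binary

    pages = convert_from_path(pdf_path, dpi=dpi)
    return [pytesseract.image_to_string(page, lang=lang) for page in pages]


def clean_lines(text: str) -> List[str]:
    """Drop blank lines and collapse runs of whitespace left behind by OCR."""
    return [" ".join(line.split()) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    # Pure-Python part only; the sample text stands in for OCR output.
    sample = "  പേര് :   രാജൻ \n\n പ്രായം : 45\n"
    for line in clean_lines(sample):
        print(line)
```

Higher DPI values tend to help with small printed text at the cost of speed and memory, so the value is worth tuning per document.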
- Python 3.9+
- Tesseract OCR with Malayalam language data
- Poppler
brew install tesseract tesseract-lang poppler
pip install pytesseract pdf2image pandas pillow opencv-python tqdm googletrans==4.0.0-rc1
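After installing, a quick check like the one below can confirm that the external tools are reachable before running the parser. It is a sketch under assumptions: the helper names are invented for this example, and it looks for the `tesseract` executable and for `pdftoppm`, one of the Poppler utilities that pdf2image calls. The language check uses `pytesseract.get_languages`, which is available in recent pytesseract releases.

```python
import shutil
from typing import Iterable, List

# Executables the pipeline depends on; pdftoppm ships with Poppler.
REQUIRED_BINARIES = ("tesseract", "pdftoppm")


def missing_binaries(names: Iterable[str] = REQUIRED_BINARIES) -> List[str]:
    """Return the executables from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


def has_tesseract_language(code: str = "mal") -> bool:
    """Check whether Tesseract has language data for `code` installed."""
    import pytesseract  # imported lazily; only needed for this check

    return code in pytesseract.get_languages(config="")


if __name__ == "__main__":
    missing = missing_binaries()
    if missing:
        print("Missing executables:", ", ".join(missing))
    elif not has_tesseract_language("mal"):
        print("Tesseract is installed, but Malayalam (mal) data is missing.")
    else:
        print("All external dependencies look available.")
```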
python sir_parser.py
The script is structured so that it can be adapted to other Indian languages.
To adapt it:
- Replace Malayalam field keywords (e.g. name, age, gender) with equivalents from the target language
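- Change the OCR language code passed to Tesseract (for example, mal for Malayalam, tam for Tamil, hin for Hindi)
- Update the translation source language code if required (for example, ml for Malayalam, ta for Tamil, hi for Hindi)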
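One way to keep these three settings together is a small per-language profile, sketched below. Everything here is illustrative: `LanguageProfile`, `extract_field`, and `parse_block` are hypothetical names, and the Malayalam and Hindi keywords are plausible examples rather than the exact labels used in the script or in any particular roll. Check the printed labels in your own PDFs before relying on them.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class LanguageProfile:
    tesseract_lang: str       # OCR language code, e.g. "mal"
    translate_src: str        # translation source code, e.g. "ml"
    keywords: Dict[str, str]  # field name -> label as printed in the roll


# Illustrative keyword choices (assumptions, not taken from the script).
MALAYALAM = LanguageProfile(
    "mal", "ml", {"name": "പേര്", "age": "പ്രായം", "gender": "ലിംഗം"}
)
HINDI = LanguageProfile(
    "hin", "hi", {"name": "नाम", "age": "आयु", "gender": "लिंग"}
)


def extract_field(line: str, keyword: str) -> Optional[str]:
    """Return the text following `keyword` on the line, or None if absent."""
    if keyword not in line:
        return None
    value = line.split(keyword, 1)[1].strip(" :\t-")
    return value or None


def parse_block(lines: Iterable[str], profile: LanguageProfile) -> Dict[str, str]:
    """Collect the first value found for each configured field."""
    record: Dict[str, str] = {}
    for line in lines:
        for field, keyword in profile.keywords.items():
            if field in record:
                continue
            value = extract_field(line, keyword)
            if value is not None:
                record[field] = value
    return record


if __name__ == "__main__":
    sample = ["പേര് : രാജൻ", "പ്രായം : 45", "ലിംഗം : പുരുഷൻ"]
    print(parse_block(sample, MALAYALAM))
```

With this layout, switching languages means passing a different profile rather than editing strings scattered through the code. Real OCR output is noisier than the sample, so exact keyword matching may need to be loosened.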
A CSV file containing extracted voter information in both Malayalam and English.
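Writing such a file with the standard library might look like the sketch below. The column names (`name_ml`, `name_en`, and so on) and the `write_voters_csv` helper are assumptions for illustration; the actual columns produced by the script may differ.

```python
import csv
from typing import Dict, Iterable, Sequence

# Hypothetical column layout: original Malayalam text next to its translation.
COLUMNS = ("name_ml", "name_en", "age", "gender_ml", "gender_en")


def write_voters_csv(
    records: Iterable[Dict[str, str]],
    path: str,
    fieldnames: Sequence[str] = COLUMNS,
) -> None:
    """Write records to a UTF-8 CSV; missing fields are left empty."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(records)


if __name__ == "__main__":
    rows = [
        {
            "name_ml": "രാജൻ",
            "name_en": "Rajan",
            "age": "45",
            "gender_ml": "പുരുഷൻ",
            "gender_en": "Male",
        }
    ]
    write_voters_csv(rows, "voters_sample.csv")
```

If the file will be opened in Excel, `encoding="utf-8-sig"` is a common alternative, since the byte order mark helps Excel detect UTF-8 and display Malayalam text correctly.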
This is a simple educational script provided as-is for learning and experimentation purposes only.
It is not intended for official, legal, political, or administrative use, and should not be relied upon for decisions related to electoral processes or voter data accuracy.
There is always room for improvement. If you find issues or edge cases, or have ideas to improve accuracy, performance, or language support, please open an issue or contribute. Community feedback is welcome.
Open for educational and personal use.