@nuhmanpk
Created December 25, 2025 11:52
SIR Electoral Roll PDF Parser (Malayalam → CSV)

A simple Python script that parses SIR (Special Intensive Revision) electoral roll PDFs published by the Election Commission of India. It runs OCR on the scanned Malayalam pages, extracts voter details, and writes them to a clean CSV with each Malayalam text field alongside its English translation.

What is SIR?

Special Intensive Revision (SIR) is a comprehensive verification process of electoral rolls conducted by the Election Commission of India to update, correct, add, or remove voter entries.

Features

  • Malayalam OCR using Tesseract
  • Parses scanned Election Commission SIR PDFs
  • Extracts voter details into structured CSV
  • Malayalam → English translation
  • UTF-8 encoded output

Requirements

  • Python 3.9+
  • Tesseract OCR with Malayalam language data
  • Poppler
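
Before running the script, it can help to confirm that both external tools are reachable. The following is a minimal sketch and not part of the original script: the helper names parse_list_langs and check_environment are hypothetical, and it assumes the tesseract binary and Poppler's pdftoppm (which pdf2image calls) are on PATH.

```python
import shutil
import subprocess


def parse_list_langs(output):
    """Return the language codes from `tesseract --list-langs` output."""
    codes = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blanks and the "List of available languages ..." header line
        if not line or line.lower().startswith("list of"):
            continue
        codes.append(line)
    return codes


def check_environment(lang="mal"):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if shutil.which("pdftoppm") is None:
        problems.append("Poppler (pdftoppm) not found on PATH")
    if shutil.which("tesseract") is None:
        problems.append("tesseract not found on PATH")
        return problems
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True, text=True, check=False,
    )
    # Some Tesseract versions print the list to stderr, so read both streams
    if lang not in parse_list_langs(result.stdout + "\n" + result.stderr):
        problems.append(f"Tesseract language data '{lang}' is not installed")
    return problems
```

Calling check_environment() and printing the returned list gives a quick answer on whether the Malayalam data ("mal") was picked up by the install step.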

Installation (macOS)

brew install tesseract tesseract-lang poppler  
pip install pytesseract pdf2image pandas pillow opencv-python tqdm googletrans==4.0.0-rc1

Usage

Set pdf_path (and, if needed, csv_path) near the top of sir_parser.py, then run:

python sir_parser.py

Language Customization

The script is written to allow easy customization for other Indian languages.
To adapt it:

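  • Replace the Malayalam field keywords that the parser matches on (the labels for name, father's or husband's name, house number, age, and gender) with their equivalents in the target language
  • Change the Tesseract OCR language code (lang="mal" in the image_to_string call)
  • Update the translation source language code (src="ml" in the translate call) if required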
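
One way to keep these three changes in a single place is to gather them into a per-language profile. This is only a sketch under assumptions: LanguageProfile, MALAYALAM, and parse_age_gender are hypothetical names that do not exist in the script, and the Malayalam labels are copied from the script as written. A profile for another language would supply its own labels along with the matching Tesseract code (for example "hin" or "tam") and translation code ("hi" or "ta").

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageProfile:
    ocr_lang: str            # Tesseract language code
    translate_src: str       # translation source language code
    name: str                # label that starts a voter's name line
    relative_names: tuple    # labels for father's / husband's name
    house_no: str            # label for house number
    age: str                 # label for age
    gender: str              # label for gender


# Values taken from the existing script
MALAYALAM = LanguageProfile(
    ocr_lang="mal",
    translate_src="ml",
    name="പേര്",
    relative_names=("അച്ഛന്റെ പേര്", "ഭര്‍ത്താവിന്റെ പേര്"),
    house_no="വീട്ടു നമ്പര്‍",
    age="പ്രായം",
    gender="ലിംഗം",
)


def parse_age_gender(line, profile):
    """Extract (age, gender) from a line, using the labels in the profile."""
    age = re.search(re.escape(profile.age) + r"\s*:\s*(\d+)", line)
    gender = re.search(re.escape(profile.gender) + r"\s*:\s*(\S+)", line)
    return (
        age.group(1) if age else None,
        gender.group(1) if gender else None,
    )
```

With this shape, the parsing loop would read labels from the profile instead of hard-coded strings, so switching language means passing a different profile rather than editing the loop.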

Output

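A CSV file (output_clean_translated.csv by default) with one row per voter. Name, relative's name, and gender appear in both Malayalam and English (name_ml / name_en, relative_name_ml / relative_name_en, gender_ml / gender_en), alongside house_no and age.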
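
To read the result back for inspection, something like the following may be enough. It is a sketch that uses only the standard library; load_voters is a hypothetical helper, and it assumes the column names the script writes. The file is opened as utf-8-sig because the script saves it with a byte-order mark.

```python
import csv


def load_voters(path):
    """Load the CSV written by the script into a list of dicts."""
    # utf-8-sig strips the byte-order mark from the first header name
    with open(path, newline="", encoding="utf-8-sig") as fh:
        rows = list(csv.DictReader(fh))
    for row in rows:
        # Age is stored as text; convert it when OCR produced clean digits
        age = (row.get("age") or "").strip()
        row["age"] = int(age) if age.isdecimal() else None
    return rows
```

Rows where OCR missed the age come back with age set to None, which makes incomplete records easy to filter out.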

Disclaimer

This is a simple educational script provided as-is for learning and experimentation purposes only.
It is not intended for official, legal, political, or administrative use, and should not be relied upon for decisions related to electoral processes or voter data accuracy.

Contribution

There is always room for improvement. If you find any issues, edge cases, or have ideas to improve accuracy, performance, or language support, please feel free to contribute or open an issue. Community feedback and improvements are welcome.

License

Open for educational and personal use.

from pdf2image import convert_from_path
import pytesseract
import pandas as pd
import cv2
import numpy as np
import logging
from tqdm import tqdm
import re
from googletrans import Translator

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(message)s"
)

pdf_path = "../path_to_sir.pdf"
csv_path = "output_clean_translated.csv"

translator = Translator()

logging.info("Starting PDF to image conversion")
pages = convert_from_path(pdf_path, dpi=300)
logging.info(f"Total pages detected: {len(pages)}")

records = []   # completed voter records
current = {}   # voter record currently being assembled


def translate(text):
    """Translate Malayalam text to English; return "" if the request fails."""
    try:
        return translator.translate(text, src="ml", dest="en").text
    except Exception as exc:
        logging.warning(f"Translation failed: {exc}")
        return ""


def flush():
    """Append the record being built to records and start a new one."""
    global current
    if current:
        records.append(current)
        current = {}


for page in tqdm(pages, desc="Processing pages"):
    # pdf2image returns PIL images, which convert to RGB arrays (not BGR)
    img = np.array(page)
    gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
    # Otsu thresholding binarizes the scan before OCR
    gray = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
    text = pytesseract.image_to_string(
        gray,
        lang="mal",
        config="--oem 3 --psm 6"
    )
    lines = [l.strip() for l in text.split("\n") if l.strip()]
    for line in lines:
        if line.startswith("പേര്"):
            # A line starting with the name label begins a new voter record
            flush()
            value = re.sub(r".*?:", "", line).strip()
            current["name_ml"] = value
            current["name_en"] = translate(value)
        elif "അച്ഛന്റെ പേര്" in line or "ഭര്‍ത്താവിന്റെ പേര്" in line:
            # Father's or husband's name
            value = re.sub(r".*?:", "", line).strip()
            current["relative_name_ml"] = value
            current["relative_name_en"] = translate(value)
        elif "വീട്ടു നമ്പര്‍" in line:
            # House number
            value = re.sub(r".*?:", "", line).strip()
            current["house_no"] = value
        elif "പ്രായം" in line:
            # Age and gender are matched on the same line
            age = re.search(r"പ്രായം\s*:\s*(\d+)", line)
            gender = re.search(r"ലിംഗം\s*:\s*(\S+)", line)
            if age:
                current["age"] = age.group(1)
            if gender:
                current["gender_ml"] = gender.group(1)
                current["gender_en"] = translate(gender.group(1))

# Save the last record, which no following name line will flush
flush()

logging.info(f"Total voters extracted: {len(records)}")
df = pd.DataFrame(records)
# utf-8-sig adds a byte-order mark so spreadsheet tools detect the encoding
df.to_csv(csv_path, index=False, encoding="utf-8-sig")
logging.info("Structured + translated CSV saved successfully")
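
The script calls translate() once for every name and gender value, and gender values in particular repeat across thousands of rows. A small memoizing wrapper could avoid the repeated network requests. This is a sketch only: make_cached_translator is a hypothetical helper, and it assumes the wrapped function returns an empty string on failure, as translate() does.

```python
def make_cached_translator(translate_fn):
    """Wrap a text -> text translation function with an in-memory cache."""
    cache = {}

    def cached(text):
        if not text:
            return ""
        if text in cache:
            return cache[text]
        result = translate_fn(text)
        # Failed translations come back empty; leave them uncached so a
        # later call can retry
        if result:
            cache[text] = result
        return result

    return cached
```

In the script this would be applied once, right after translate is defined, with translate = make_cached_translator(translate). The parsing loop then stays unchanged.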