IIT Kharagpur / reinforcement learning / reasoning systems

I study how agents turn experience into reusable internal structure.

I am a final-year student at IIT Kharagpur. My work spans reinforcement learning for reasoning, robust memory and state compression for long-horizon agents, mechanistic interpretability, and geometry-aware adaptation.

The question I keep coming back to is simple: when an agent succeeds, did it preserve and use the right internal state, or did a shortcut happen to work? I like problems where behavior, representation, and causal interventions all have to tell the same story.

CV · Email · IIT KGP mail · LinkedIn · GitHub · X · Scholar · Hugging Face

+4.70pp: LaViDA nearest-expert gain over GRPO on MATH-500 n=8, seed 0

R² ≈ 0.99: input and transducer beliefs decoded from 4-layer causal transformers

262K: learned memory tokens scaled with Mixture of Chapters routing

25–80%: trainable-parameter reduction targeted by GRIT-style PEFT

Research

Current threads

My research asks how useful internal structure forms and survives change: hidden states as Bayesian belief states under coarse-graining, RL as structured credit assignment, LoRA updates as geometry, and agent memory that remains useful as policies evolve.

BTP / Complex Networks Research Lab, IIT Kharagpur / Prof. Pawan Goyal

LaViDA: Latent Visitation Distribution Alignment for Mathematical Reasoning

LaViDA asks whether outcome-only GRPO leaves useful reasoning structure on the table. Correct rollouts are projected through a frozen latent encoder, then aligned toward verified expert traces. The current result is mixed, and I report it as such: simple nearest-expert alignment looks promising under sampling, while the learned chi-square critic is not yet reward-aligned.

Built Qwen2.5-Math-7B GRPO training with LoRA-r64, vLLM, FlashAttention, and single-H100 rollout scaling.
Constructed self and Oracle-augmented expert pools: 8,963 self traces plus 3,354 filtered 72B traces embedded on the 7B manifold.
D_OracleAug ties GRPO on greedy MATH-500 but improves n=8 mean correctness by +4.70pp, p=0.0069.
On the harder MATH-500 L4-5 subset, the same branch improves n=8 mean correctness by +5.77pp, p=0.0429.
Chi-square density-ratio matching gave a null result, which sharpened my interest in representations that remain useful as models and policies change.
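
For readers unfamiliar with the metric, "n=8 mean correctness" can be read as: sample eight completions per problem, mark each as correct or not, average within each problem, then average across problems. The sketch below assumes that reading; the function name `mean_correctness_at_n` and the toy data are hypothetical and do not come from the LaViDA codebase.

```python
from statistics import mean

def mean_correctness_at_n(results):
    """results: one list of booleans per problem, one boolean per sampled completion."""
    # Fraction of correct samples for each problem.
    per_problem = [mean(1.0 if ok else 0.0 for ok in samples) for samples in results]
    # Unweighted average over problems.
    return mean(per_problem)

# Toy data (made up): 3 problems, 8 samples each.
toy = [
    [True] * 8,                 # always solved
    [True] * 4 + [False] * 4,   # solved half the time
    [False] * 8,                # never solved
]
score = mean_correctness_at_n(toy)
print(f"{100 * score:.2f}%")  # prints "50.00%"
```

Unlike greedy accuracy, this score rewards a policy for being right on more of its samples, which is why two systems can tie under greedy decoding and still differ at n=8.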

BTP slides

MATH-500 n=8 mean correctness

A_2000 GRPO

74.88%

D_OracleAug

79.57%

SFT_Oracle

79.45%

Seed-0 reading: SFT_Oracle wins under greedy decoding; D_OracleAug and SFT_Oracle are statistically indistinguishable under n=8 sampling.

MARS 4.0 / Cambridge AI Safety Hub / Prof. Fernando Rosas

Belief-state geometry in transformer ε-transducers

Research Fellow · Dec 2025–present · hybrid

I study how a causal transformer represents the Bayesian state of an input process driving an ε-transducer, and what changes when the intermediate input is hidden. Joint and output-only observation make coarse-graining a controlled intervention on which latent state remains recoverable.

Devised hierarchical-HMM and ε-transducer pipelines with exact joint and coarse-grained Bayesian belief operators.
Probed 4-layer causal transformers: input and transducer beliefs decode linearly at R² ≈ 0.99, while next-token loss reaches the computed entropy-rate floor.
Measured information erasure under coarse-graining: about 93% of the fully observable input belief disappears in cross-fit, while most transducer belief remains recoverable.
Used shuffled, untrained-model, temporal, and sequence-level cross-validation controls to separate learned geometry from probe leakage.
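
The belief states being probed are the outputs of a Bayesian filter over hidden states. The sketch below shows one such update for a small hidden Markov model. It assumes labelled transition matrices `T[x][i][j] = P(emit x, move to state j | state i)`; the two-state process, its numbers, and the name `update_belief` are illustrative assumptions, not the project's pipeline.

```python
def update_belief(belief, T_x):
    """One filtering step: b'_j is proportional to sum_i b_i * T_x[i][j].

    Returns the normalized belief and the probability of the observed symbol.
    """
    n = len(belief)
    unnorm = [sum(belief[i] * T_x[i][j] for i in range(n)) for j in range(n)]
    z = sum(unnorm)  # P(symbol | current belief)
    if z == 0.0:
        raise ValueError("observed symbol has zero probability under this belief")
    return [u / z for u in unnorm], z

# Hypothetical 2-state process over symbols {0, 1}.
# For each source state i, the entries T[0][i][:] and T[1][i][:] sum to 1.
T = {
    0: [[0.5, 0.1], [0.2, 0.1]],
    1: [[0.1, 0.3], [0.2, 0.5]],
}

prior = [0.5, 0.5]
b1, p0 = update_belief(prior, T[0])  # belief after observing symbol 0

# Filter along a short observed sequence and record every belief visited.
belief, trajectory = prior, []
for symbol in [0, 1, 1]:
    belief, _ = update_belief(belief, T[symbol])
    trajectory.append(belief)
```

Each entry of `trajectory` is a point on the probability simplex. A linear probe of the kind described above would regress from transformer activations onto such vectors.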

Analytical belief manifolds under fully observed, hidden-intermediate, and output-coarse-grained regimes — Observation changes the analytical belief geometry: the hidden-intermediate regime produces an entangled cascade.

Transducer belief probe errors for trained, untrained, sequence cross-validation, temporal, and shuffled controls — The trained readout survives sequence-level cross-validation; untrained, temporal, and shuffled controls break the signal.

Project link

RAAPID INC / Prof. Amitava Das / first-author paper

GRIT: geometry-aware PEFT

GRIT treats adapter updates as a geometric object rather than just a bag of low-rank parameters. It uses rank-space K-FAC, Fisher-guided reprojection, dynamic rank adaptation, and high-rank-to-low-rank compression to make LoRA-style updates more sample-efficient and less drift-prone.

GRIT pipeline from LoRA update through K-FAC natural gradient and neural reprojection — GRIT pipeline: estimate rank-space curvature, precondition the LoRA update, then reproject into the top Fisher eigenspace.

Uses a train-high, compress-low setup: train at r=64, then ship a Fisher-compressed adapter at roughly r=12–20.
Mid-training compaction is a basis change, not freezing: it preserves B@A up to a reconstruction check and only fires under conservative guards.
Authored fused Triton kernels for covariance fusion, GPU-side Cholesky inversion, and batched preconditioning.
Built a dual-stream CUDA pipeline that overlaps stale-by-one K-FAC inversion/eigendecomposition with the next training step.
Targets competitive generative/NLU performance while reducing trainable parameters by 25–80%.
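
The statement that compaction is a basis change can be illustrated with plain linear algebra: for any invertible R, replacing B with B R and A with R⁻¹ A leaves the product B@A unchanged. The sketch below uses a two-dimensional rank space and a rotation as R. It shows only this generic identity, not GRIT's Fisher-guided reprojection, and every name and number in it is hypothetical.

```python
import math

def matmul(X, Y):
    """Dense matrix product for small list-of-lists matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

def rotation(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def max_abs_diff(X, Y):
    return max(abs(x - y) for rx, ry in zip(X, Y) for x, y in zip(rx, ry))

# Toy LoRA factors: B is d_out x r, A is r x d_in, with r = 2.
B = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
A = [[0.5, -1.0, 2.0, 0.0], [1.5, 0.0, -0.5, 1.0]]

R = rotation(0.7)                 # new basis for the rank space
B_new = matmul(B, R)              # B R
A_new = matmul(transpose(R), A)   # R^-1 A, since R^-1 = R^T for a rotation

# Reconstruction check: the effective update B@A should be unchanged.
err = max_abs_diff(matmul(B, A), matmul(B_new, A_new))
assert err < 1e-9
```

The factors themselves change while their product does not, which is why a guarded reconstruction check on B@A is a natural safety condition for this kind of step.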

GRIT parameter update geometry compared with LoRA — GRIT concentrates LoRA updates into a tighter curvature-aligned subspace.

GRIT layerwise parameter update ablation — Layerwise update footprint comparison: LoRA, LoRA plus K-FAC, and GRIT.

Paper

ICLR 2026 NFAM Workshop

Mixture of Chapters: learned memory inside transformers

MoC adds a learned internal memory bank that transformer layers query through cross-attention. To scale beyond dense memory access, the bank is partitioned into chapters and a router selects a sparse subset per input sequence. The point is to give models an explicit, trainable memory substrate rather than only implicit parametric storage or external text retrieval.

Scales to 262,208 learned memory tokens with 4,097 chapters and sparse top-k routing.
Outperforms iso-FLOP vanilla transformer baselines during pretraining.
Shows stronger retention under heavy instruction fine-tuning, suggesting memory can reduce interference across phases.
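
The routing step can be sketched as follows: score each chapter against a pooled representation of the input sequence, keep the top-k chapters, and expose only their memory tokens to cross-attention. The dot-product router, the pooled `seq_repr`, the function names, and all sizes are assumptions for illustration; the paper's router may differ.

```python
import random

def select_chapters(seq_repr, chapter_keys, k):
    """Indices of the k chapters whose keys score highest against seq_repr."""
    scores = [sum(q * w for q, w in zip(seq_repr, key)) for key in chapter_keys]
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return sorted(ranked[:k])

def memory_token_ids(chapters, tokens_per_chapter):
    """Flat indices of the memory tokens that belong to the selected chapters."""
    return [c * tokens_per_chapter + t
            for c in chapters
            for t in range(tokens_per_chapter)]

random.seed(0)
num_chapters, tokens_per_chapter, dim, k = 16, 4, 8, 2  # toy sizes

# Hypothetical router keys (one per chapter) and a pooled sequence vector.
chapter_keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_chapters)]
seq_repr = [random.gauss(0, 1) for _ in range(dim)]

chosen = select_chapters(seq_repr, chapter_keys, k)
active = memory_token_ids(chosen, tokens_per_chapter)
# Cross-attention would now read only len(active) tokens
# instead of all num_chapters * tokens_per_chapter.
```

In this sketch the attention cost depends on k times the chapter size rather than on the full bank. If chapters are equal-sized, 262,208 tokens over 4,097 chapters corresponds to 64 tokens per chapter (4,097 × 64 = 262,208); that per-chapter size is inferred from the arithmetic, not stated above.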

Paper · OpenReview · Code

Mixture of Chapters architecture diagram — Chapter-routed memory cross-attention.

Mixture of Chapters pretraining loss curve — Pretraining validation loss under iso-FLOP comparison.

Instruction fine-tuning validation loss comparison — Instruction fine-tuning retention comparison.

Publications

Papers

Mixture of Chapters: Scaling Learnt Memory in Transformers

ICLR 2026 NFAM Workshop. Co-author.

Sparse learned memory banks, chapter routing, 262K latent memory tokens, and improved retention under instruction fine-tuning.

OpenReview · Code

GRIT: Geometry-Aware PEFT with K-FAC Preconditioning, Fisher-Guided Reprojection, and Dynamic Rank Adaptation

arXiv preprint. First author.

Rank-space natural-gradient proxy for LoRA, Fisher-spectrum rank allocation, train-high/compress-low adapters, guarded compaction, and Triton/CUDA acceleration.

Paper

Industry systems

Graph analytics for onboarding-fraud detection

A production-minded, CPU-only graph pipeline built inside Axis Bank's Business Intelligence Unit: from raw savings-account transfers to leak-free features, ranked review queues, and analyst-readable evidence.

Axis Bank / Business Intelligence Unit / Mumbai, on-site

Data Science Intern

May – Jul 2026

I asked whether pre-disbursal money flow could expose coordinated loan fraud that an application-only scorecard cannot see. The result was an explainable precision overlay for fraud review, built across 224.8 million accounts and 1.31 billion transfer edges.

Cast fraud proximity as bidirectional, three-hop, time-respecting BFS and pruned the working graph roughly 20× while retaining about 80% applicant coverage.
Implemented shared-mule and cash-out signals, weighted PageRank, label propagation, three-cycle motifs, and camouflage-resistant Fraudar blocks in PySpark.
Hand-rolled Bahmani (2 + 2ε) greedy peeling with 1 / log₂(fan-in + 5) collector weights, using reliable HDFS checkpoints to truncate the iterative RDD lineage.
Delivered a ranked investigation queue with plain-English reasons and ring diagrams for each high-risk case.
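
The peeling step can be sketched in plain Python. This is an unweighted, single-machine version of the batch-peeling idea of Bahmani et al.: in each pass, drop every node whose degree is at most 2(1 + ε) times the current density |E(S)| / |S|, and remember the densest intermediate subgraph. The pipeline described above ran in PySpark with collector weights; both are omitted here, and the function name and toy graph are hypothetical.

```python
def densest_subgraph_peeling(edges, eps=0.1):
    """Return (best_nodes, best_density), where density = |E(S)| / |S|."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    alive = set(adj)
    num_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    best_nodes, best_density = set(alive), 0.0

    while alive:
        density = num_edges / len(alive)
        if density > best_density:
            best_nodes, best_density = set(alive), density

        # The average degree is 2 * density, so at least one node is at or
        # below this threshold and every pass removes something.
        threshold = 2.0 * (1.0 + eps) * density
        to_remove = {u for u in alive if len(adj[u]) <= threshold}

        for u in to_remove:
            for v in adj[u]:
                adj[v].discard(u)
                num_edges -= 1  # each edge is counted once, when its first endpoint goes
            adj[u].clear()
        alive -= to_remove

    return best_nodes, best_density

# Toy graph: a 4-clique {a, b, c, d} with a two-edge tail d-e-f.
toy_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"),
             ("c", "d"), ("d", "e"), ("e", "f")]
nodes, density = densest_subgraph_peeling(toy_edges, eps=0.1)
print(sorted(nodes), density)  # prints ['a', 'b', 'c', 'd'] 1.5
```

Removing nodes in batches keeps the number of passes logarithmic in the graph size for fixed ε, which suits an iterative job on a cluster. The cost is the looser (2 + 2ε) approximation guarantee.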

Diagram showing a new loan applicant and a known fraud paying the same rare collector account — A collector-ring signal: the applicant and a confirmed fraud pay the same rare account before the loan outcome is known.

01 Build the monthly money-flow graph
02 Anchor every feature before disbursal
03 Traverse rare, time-valid paths
04 Score proximity, entities, and motifs
05 Rank and explain the review queue

Scale & systems: 224.8M nodes · 1.31B transfers

Ran on Cloudera Spark 3.3 / Hadoop with HDFS checkpointing, hub controls, degree gates, and end-to-end QC.

Signal quality: 12.1% vs 0.155% fraud rate

The closest one-hop band reached roughly 81× the book rate; the shipped queue's top 1% delivered 5.49× mean lift.

Validation: feature ceiling, not model ceiling

A separate WOE–IRLS scorecard over a larger feature set found no reliable LOAO gain over the transparent blend.

The important negative result: about 83% of confirmed frauds had no seed-anchored money-flow signal. The graph is therefore a high-precision overlay, not a replacement for the existing behavioural model; the next gain must come from new identity, device, or sourcing edges rather than a more complex ranker.

Final presentation

Selected builds and competitions

Applied projects

I keep these on the page because they reflect how I work: build the pipeline, make the evaluation honest, then optimize the bottleneck until the system actually runs.

GenAI analytics dashboard

Runner-up, General Championship Data Analytics, IIT Kharagpur

Captained a full-stack NLQ analytics dashboard for Frammer AI with LangGraph, self-healing SQL, KPI labs, and Gaussian-anchored synthetic star-schema evaluation.

Presentation

Amazon ML Challenge 2025

40.8 SMAPE

Stacked Qwen2.5-VL-3B SFT with LightGBM over CLIP/text features; used offline tensorization, WebDataset, 4-bit QLoRA, Pseudo-Huber loss, and monotonic constraints.

Code

American Express Campus Challenge

National Finalist, Decision Science Track

Built a 3-stage GBDT-Transformer ranking ensemble with 3k+ leakage-free temporal features and a listwise Transformer trained on GBDT residuals; final MAP 0.59.

Code

Background

Education

Indian Institute of Technology, Kharagpur
B.Tech. (Hons.) in Manufacturing Science and Engineering and M.Tech. in Industrial Engineering and Management, 2022–2027.

Coursework and self-study include Safety Fundamentals of Generative AI, Operations Research, Probability and Statistics, Linear Algebra, Stanford CS229, Stanford CS230, LLM Agents MOOC, Algozenith, and Summer Analytics. Selected Safe Gen-AI assignments are public.

Research taste. I am drawn to interactive agent learning, RL for reasoning, state abstractions, and memory systems that preserve decision-relevant information as models and tasks evolve.

Stack

Tools I use

I am open to research collaborations around RL for reasoning, robust memory for long-horizon agents, learned state representations, mechanistic supervision, efficient adaptation, and agents that preserve and reuse useful structure as policies evolve.

References and letters from Prof. Pawan Goyal and Prof. Amitava Das are available privately on request. Outside research: Codeforces Pupil, interhall football and water polo, karate black belt, and NSS volunteering.