Experience
4-6 years in ML/NLP, preferably in document-heavy domains (finance, legal, policy)
Key Responsibilities
- Data Ingestion and Preprocessing: Ability to build and maintain data pipelines that ingest unstructured data from PDFs, gazettes, HTML circulars, etc., and handle data extraction, parsing, and normalization.
- NLP & LLM Modeling: Ability to fine-tune or prompt-tune LLMs for summarization, classification, and change detection in regulations. Ability to develop embeddings for semantic similarity.
- Knowledge Graph Engineering: Ability to design entity relationships (regulation, control, policy) and implement retrieval over Neo4j or similar graph DBs.
- Information Retrieval (RAG): Ability to build RAG pipelines for natural language querying of regulations.
- Annotation and Validation: Ability to annotate training data in collaboration with SMEs and validate model outputs.
- MLOps: Ability to build CI/CD for model retraining, versioning, and evaluation (precision, recall, BLEU, etc.)
- API and Integration: Ability to expose ML models as REST APIs (FastAPI) for integration with the product frontend.
Skills
Languages: Python, SQL
AI/ML/NLP: Hugging Face Transformers, OpenAI API, spaCy, scikit-learn, LangChain, RAG, LLM prompt-tuning, LLM fine-tuning
Vector Search: Pinecone, Weaviate, FAISS
Data Engineering: Airflow, Kafka, OCR and PDF text extraction (Tesseract, pdfminer)
MLOps: MLflow, Docker