Role Overview
You will architect, implement, and iterate on reliable, fast, and cost‑aware RAG pipelines and agentic workflows across text and vision. You'll own the full stack: document/image processing and embeddings, retrieval and re‑ranking, function/tool calling and planning loops, safety guardrails, and online/offline evaluation.
What You’ll Do
- Design, implement, and productionize multimodal RAG systems (text + vision) that ground LLM/VLM outputs in enterprise knowledge (see the grounding sketch after this list).
- Build agentic workflows that can perceive (images, PDFs, screenshots), reason, and act: function/tool calling, JSON‑schema actions, planning/reflection loops, and short‑ and long‑term memory (tool‑calling sketch below).
- Stand up and operate vector/search infrastructure for text and image embeddings (e.g., Postgres/pgvector, Elasticsearch/OpenSearch, Pinecone/Qdrant/FAISS) with hybrid retrieval and re‑rankers (hybrid‑retrieval sketch below).
- Build robust ingestion for large, messy corpora: PDFs (scanned and born‑digital), images, HTML, logs, and tables; apply OCR, layout analysis, chunking, and metadata enrichment (chunking sketch below).
- Implement online/offline evals for multimodal tasks: groundedness/faithfulness, retrieval precision/recall, VQA accuracy, OCR word error rate (WER), table/diagram extraction quality, and latency/cost; wire them into CI (metrics sketch below).
- Add guardrails and safety filters: policy checks, prompt hardening, schema/output validation, image moderation, PII redaction (redaction sketch below), and defenses against prompt injection and data exfiltration.
- Optimize throughput and reliability: batching, caching (request/result/embedding), retries/timeouts/fallbacks, concurrency control, and GPU utilization (retry/cache sketch below).
- Run rapid experiments (A/B and canaries) to iterate on retrieval, prompts, re‑rankers, tools, routing, and multimodal prompting strategies.
- Instrument systems for observability (telemetry, tracing, cost/latency dashboards) and maintain SLOs.
- Collaborate closely with product/design/engineering to ship user‑visible impact.
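
A few minimal Python sketches of the work above follow. Every function, table, and tool name in them is an illustrative placeholder, not a reference to our internal stack. First, the basic shape of grounded generation: retrieve, then constrain the model to the retrieved context (`retrieve` and `call_vlm` are stand-ins).

```python
# Grounded multimodal RAG, minimal shape: retrieve, then pin the model to context.
def retrieve(query: str, top_k: int = 5) -> list[str]:
    return ["example context chunk"]  # stand-in for vector/hybrid search

def call_vlm(prompt: str, image: bytes | None = None) -> str:
    return "model answer"  # stand-in for an LLM/VLM client call

def answer_grounded(question: str, image: bytes | None = None) -> str:
    chunks = retrieve(question, top_k=5)
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks))
    prompt = (
        "Answer only from the numbered context below and cite chunk ids.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return call_vlm(prompt, image=image)
```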
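
Schema‑validated tool calling: the agent's proposed action is checked against a schema (expressed here as a Pydantic v2 model) before anything executes. The `search_docs` tool, its argument schema, and `run_search` are hypothetical.

```python
import json
from pydantic import BaseModel, Field, ValidationError

class SearchDocsArgs(BaseModel):
    query: str = Field(min_length=1)
    top_k: int = Field(5, ge=1, le=50)

def run_search(query: str, top_k: int) -> list[str]:
    return [f"doc matching {query!r}"][:top_k]  # stand-in retriever

def dispatch_tool_call(name: str, raw_args: str) -> str:
    if name != "search_docs":
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        # Validate the model's proposed arguments before executing anything.
        args = SearchDocsArgs.model_validate_json(raw_args)
    except ValidationError as e:
        # Return the error to the model so it can repair its call.
        return json.dumps({"error": str(e)})
    return json.dumps({"results": run_search(args.query, args.top_k)})
```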
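
Hybrid retrieval: one common pattern fuses pgvector similarity with Postgres full‑text rank via reciprocal rank fusion (RRF). The `chunks` table and its columns are assumptions for illustration; parameters use psycopg style.

```python
# Assumed schema: chunks(id, body, embedding vector(...), tsv tsvector).
HYBRID_SQL = """
WITH vec AS (
    SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> %(qvec)s) AS r
    FROM chunks ORDER BY r LIMIT 50
),
txt AS (
    SELECT id, ROW_NUMBER() OVER (
        ORDER BY ts_rank(tsv, plainto_tsquery(%(qtext)s)) DESC) AS r
    FROM chunks WHERE tsv @@ plainto_tsquery(%(qtext)s)
    ORDER BY r LIMIT 50
)
SELECT id, SUM(1.0 / (60 + r)) AS rrf_score  -- 60 is the conventional RRF constant
FROM (SELECT * FROM vec UNION ALL SELECT * FROM txt) fused
GROUP BY id
ORDER BY rrf_score DESC
LIMIT 10;
"""
```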
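
Chunking: a fixed‑size window with overlap is the simplest baseline; production ingestion typically splits on layout boundaries (headings, tables) first and falls back to windows.

```python
def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    # Overlap preserves context that would otherwise be cut at chunk edges.
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```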
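
Retrieval metrics: offline evals often start with precision@k and recall@k per query against labeled relevant documents, aggregated over an eval set.

```python
def precision_recall_at_k(
    retrieved: list[str], relevant: set[str], k: int
) -> tuple[float, float]:
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: the top-3 contains one of two relevant docs.
p, r = precision_recall_at_k(["d3", "d1", "d9"], {"d1", "d2"}, k=3)
assert (p, r) == (1 / 3, 1 / 2)
```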
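
PII redaction: a regex pass over retrieved context before it reaches the model. The patterns below are illustrative only; production systems usually combine regexes with a dedicated PII/NER service.

```python
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
}

def redact_pii(text: str) -> str:
    # Replace each match with its label so downstream prompts stay readable.
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```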
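
Reliability: two primitives from the list above, a content‑hash embedding cache and retries with exponential backoff plus jitter. `embed_remote` stands in for a real embedding endpoint.

```python
import hashlib
import random
import time

_EMBED_CACHE: dict[str, list[float]] = {}

def embed_remote(text: str) -> list[float]:
    # Placeholder for a real embedding API call; may raise TimeoutError.
    return [0.0] * 8

def embed_with_cache_and_retries(text: str, attempts: int = 4) -> list[float]:
    key = hashlib.sha256(text.encode()).hexdigest()  # cache on content hash
    if key in _EMBED_CACHE:
        return _EMBED_CACHE[key]
    for i in range(attempts):
        try:
            vec = embed_remote(text)
            _EMBED_CACHE[key] = vec
            return vec
        except (TimeoutError, ConnectionError):
            if i == attempts - 1:
                raise
            time.sleep(0.5 * (2 ** i) + random.uniform(0, 0.1))  # backoff + jitter
    raise RuntimeError("unreachable")
```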
Minimum Qualifications
- 2+ years building production ML/LLM systems. Strong Python; TypeScript familiarity is a plus.
- Shipped features using embeddings, retrieval, and re‑ranking; comfortable with structured outputs and function/tool calling.
- Hands‑on experience with vision‑language models (VLMs) and vision pipelines (e.g., OpenAI Vision/GPT‑4o‑class, Gemini‑Vision‑class, Claude‑with‑vision, LLaVA/BLIP/CLIP, LayoutLMv3/Donut).
- Practical understanding of multimodal RAG design trade‑offs (latency, recall, cost, context limits) and evaluation beyond "vibes".
- Proficiency with one or more vector/search stacks: Postgres/pgvector, Elasticsearch/OpenSearch, Pinecone, Qdrant, FAISS.
- Familiarity with orchestration/tooling (LangChain, LangGraph, LlamaIndex) and building chains/agents for text + vision.
- Experience with OCR & document understanding (e.g., Tesseract/PaddleOCR, layout‑parser/DocAI‑style tooling) and PDF/image preprocessing.
- Solid engineering practices: Git/CI, testing, code review, observability, secure deployment (Docker/Kubernetes familiarity helpful).
- Clear, pragmatic communication in English; ability to work with non‑ML stakeholders.
Nice to Have
- Experience with graph‑based agent platforms, tool discovery, and multi‑step planning for multimodal tasks.
- Knowledge of re‑rankers (e.g., cross‑encoders such as bge‑reranker, or ColBERT‑style late interaction), hybrid search, and large‑scale search systems.
- Practical GPU inference for VLMs (quantization, LoRA/adapter fine‑tuning, distillation) and performance tuning.
- Exposure to LLM/VLM security reviews (prompt‑injection defenses, data‑exfiltration prevention, watermark/safety checks).
- Background in data quality pipelines, synthetic data for vision tasks, and feedback/label loops.
- Experience with video understanding or streaming‑frame pipelines.
Core Tech Stack (indicative)
- Languages: Python
- LLM & VLM: GPT‑4o‑class vision models, Gemini‑class multimodal models, Claude with vision; Hugging Face Transformers; vLLM (or similar)
- Orchestration: LangChain, LangGraph, LlamaIndex
- Vector & Search: Postgres/pgvector, Elasticsearch/OpenSearch, Pinecone, Qdrant, FAISS
- Ingestion & Vision: Unstructured/PyMuPDF, OCR (Tesseract/PaddleOCR), layout‑parser/Detectron2/SAM (as applicable)
- Evals & Observability: custom eval harnesses (text + vision), A/B testing, tracing/telemetry dashboards
- MLOps: Git/CI, Docker/K8s (or equivalent), experiment tracking tools