This post presents a pipeline to fine-tune a domain-specific embedding model in under a day on a single GPU without manual labeling.
- An LLM generates synthetic QA pairs from domain documents at 1-3 hop complexity, filtered by quality scoring
- Hard negative mining finds near-miss passages using a 95% margin filter to strengthen contrastive training
- Multi-hop questions spanning 2-3 documents teach the model complex cross-document retrieval
- Results: 10%+ gain in Recall@10 and NDCG@10; Atlassian achieved a 26% Recall@60 improvement on JIRA data
- Pipeline integrates NeMo Data Designer, NeMo Automodel, BEIR evaluation, and NVIDIA NIM
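The margin filter mentioned above can be sketched as follows. This is a minimal illustrative implementation, not the article's actual code: a candidate passage counts as a hard negative only if its query similarity falls below 95% of the positive passage's score, which discards near-duplicates of the positive while keeping challenging near-misses. Function and parameter names here are hypothetical.

```python
import numpy as np

def mine_hard_negatives(query_emb, pos_emb, passage_embs, margin=0.95, top_k=4):
    """Select near-miss passages as hard negatives for contrastive training.

    A candidate is kept only if its cosine similarity to the query is below
    `margin` times the positive's similarity (the "95% margin filter"),
    filtering out passages that are effectively duplicate positives.
    """
    def norm(x):
        # L2-normalize so dot products are cosine similarities.
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    q, p, c = norm(query_emb), norm(pos_emb), norm(passage_embs)
    pos_score = float(q @ p)
    cand_scores = c @ q
    # Exclude candidates scoring too close to (or above) the positive.
    mask = cand_scores < margin * pos_score
    # Of the remaining candidates, keep the hardest (highest-scoring) ones.
    ranked = np.argsort(-cand_scores)
    return [int(i) for i in ranked if mask[i]][:top_k]
```

For example, a passage nearly identical to the positive is rejected, while a topically close but wrong passage survives as a hard negative; easy negatives rank last and are dropped once `top_k` is filled.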
This summary was automatically generated by AI based on the original article and may not be fully accurate.