A multimodal question-answering system built in collaboration with Bosch Research, focused on technical documents containing graphs, charts, and tables.
I fine-tuned three vision-language models (Phi-3-vision, Idefics2, and LLaVA-NeXT) using LoRA for parameter-efficient training. The resulting system improved multi-hop QA performance on technical graphs and tables by 12.4%.
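
To illustrate why LoRA counts as parameter-efficient, here is a minimal pure-Python sketch of the general technique. It is not the project's training code. LoRA freezes a layer's weight matrix W and trains only two small low-rank matrices A and B, so the effective weight becomes W + (alpha / rank) * B A. The layer size (4096 x 4096), rank (16), and the helper names `lora_param_counts`, `matvec`, and `lora_forward` are assumptions chosen for illustration, not values from the project.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one linear layer."""
    full = d_in * d_out            # parameters if W itself were updated
    lora = rank * (d_in + d_out)   # A is rank x d_in, B is d_out x rank
    return full, lora


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(W, A, B, x, alpha: float, rank: int):
    """Compute y = W x + (alpha / rank) * B (A x).

    W stays frozen during training; only A and B would receive gradients.
    """
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * d for b, d in zip(base, delta)]


# Hypothetical layer size and rank, for illustration only.
full, lora = lora_param_counts(d_in=4096, d_out=4096, rank=16)
print(f"full: {full:,}  lora: {lora:,}  fraction trained: {lora / full:.2%}")
```

B is conventionally initialized to zero, so the adapted layer starts out identical to the frozen base layer and only diverges as A and B are trained. In practice a library such as Hugging Face PEFT would typically handle this wiring.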