My doctoral research addresses the question of training data influence in large-scale generative models. Rather than assuming these systems are transparent architectures, we treat them as black boxes and rely on data-centric and information retrieval (IR) techniques to estimate how specific training samples contribute to generated outputs. Instead of depending on opaque internal parameters or dense embeddings, our work develops structured representations that enable a more transparent and interpretable comparison between training data and generated content. This allows us to trace influence at a conceptual level, identifying shared objects, motifs, and semantic relationships, and to provide human-understandable justifications for efficient data attribution.
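
As a purely illustrative sketch (not the actual pipeline), the comparison can be pictured as extracting a set of concepts from a generated output and from each training sample, then ranking training samples by conceptual overlap; the extraction function, similarity measure (Jaccard overlap here), and sample identifiers below are hypothetical placeholders standing in for richer structured representations.

    from typing import Dict, Set

    def extract_concepts(text: str) -> Set[str]:
        # Placeholder extractor: a real system might use entity, motif, or
        # scene-graph extraction. Here we just lowercase, split on whitespace,
        # and keep tokens longer than three characters.
        return {tok for tok in text.lower().split() if len(tok) > 3}

    def attribution_scores(generated: str, training_corpus: Dict[str, str]) -> Dict[str, float]:
        # Rank training samples by Jaccard overlap between their concept set
        # and the concepts found in the generated output.
        gen_concepts = extract_concepts(generated)
        scores = {}
        for sample_id, sample_text in training_corpus.items():
            sample_concepts = extract_concepts(sample_text)
            union = gen_concepts | sample_concepts
            overlap = gen_concepts & sample_concepts
            scores[sample_id] = len(overlap) / len(union) if union else 0.0
        return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))

    # Hypothetical usage: the shared concepts themselves double as a
    # human-readable justification for why a sample is deemed influential.
    corpus = {"sample_001": "a red bicycle leaning against a brick wall",
              "sample_002": "two dogs playing on a sandy beach"}
    print(attribution_scores("a red bicycle parked near a wall", corpus))

The point of the sketch is only that attribution is computed over explicit, inspectable structures (here, concept sets) rather than over opaque parameter-space or embedding-space quantities, which is what makes the resulting justifications human-understandable.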