We’re ramping up to pilot a few AI-based sourcing and risk scenarios in procurement, but our supplier master data is all over the place. We have duplicates across three ERP instances, inconsistent naming conventions, missing contact info, and incomplete categorization. When we ran a quick proof-of-concept with an external vendor tool for supplier risk modeling, the results were pretty much garbage because the underlying data was so messy.
Our CPO is pushing hard to get something live in the next six months, but I’m worried we’re setting ourselves up for failure if we don’t clean up the data first. We’ve started some manual deduplication work in one region, but it’s slow and not scalable. A few people have mentioned using AI itself to help with data cleansing and standardization, which sounds promising but also feels like a catch-22 if our data isn’t good enough to train those models.
Has anyone tackled supplier master data quality as a prerequisite for procurement AI? What approach did you take—manual cleanup first, or did you use AI-powered data management tools to accelerate the process? And how did you convince leadership to invest in data quality before the flashy AI pilot?
I’m curious how you’re handling the catch-22 of needing clean data to train AI models that clean data. Did you bootstrap with any pre-trained models or external datasets, or did you have enough semi-clean data to get started? We’re looking at a similar project and trying to figure out if we need to do a first manual pass or if we can jump straight to AI-assisted cleansing.
Good question. The vendor tool we used came with pre-trained models for supplier name matching and address standardization, so we didn’t have to train from scratch. We fed it our messy data and it gave us match confidence scores. Anything above 90% confidence we auto-accepted, anything below 70% went to manual review, and the middle zone was semi-automated with suggested matches. Over time the model learned from our steward decisions and got better. So you don’t need perfect data to start, but you do need some human-in-the-loop validation to keep the quality high.
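To make the triage concrete, here is roughly what the routing logic looked like. This is simplified Python, not the vendor's actual API; the thresholds are the ones I mentioned, and the function and field names are just illustrative.

```python
# Illustrative sketch only: the vendor tool's API was proprietary, so the
# names here are made up. What it shows is the triage we applied to the
# match confidence scores it returned.

AUTO_ACCEPT_THRESHOLD = 0.90    # above this, merge automatically
MANUAL_REVIEW_THRESHOLD = 0.70  # below this, a steward decides from scratch

def triage_match(candidate: dict) -> str:
    """Route a proposed supplier match based on its confidence score."""
    score = candidate["confidence"]
    if score >= AUTO_ACCEPT_THRESHOLD:
        return "auto_accept"      # merged into the golden record, no human touch
    elif score >= MANUAL_REVIEW_THRESHOLD:
        return "suggested_match"  # steward sees the pre-filled pairing and confirms or rejects
    else:
        return "manual_review"    # steward researches the records and decides from scratch

# Steward decisions on the middle and low bands get fed back as labeled
# examples, which is how the matching improved over time.
```

The exact thresholds matter less than having the three bands at all; we tuned them once we could see how often auto-accepted matches turned out to be wrong.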
We went through this exact situation two years ago. Manual cleanup is a dead end if you have any scale. What worked for us was standing up a lightweight master data hub with embedded machine learning for deduplication and standardization. We started with one high-value domain—supplier records tied to our top 200 vendors by spend—and got a clean golden record for that subset in about eight weeks. The ML matching was something like 85% accurate out of the box, and we had data stewards review the edge cases. That gave us enough clean data to run a meaningful pilot for supplier risk assessment. Once leadership saw the pilot work, they funded the broader data quality effort. So my advice: don’t try to boil the ocean. Pick a narrow, high-impact slice, use AI-powered tooling to accelerate it, and prove the value before scaling.
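For anyone wondering what the "ML matching" amounts to in practice, the core of it is fuzzy name and address matching plus steward review of the borderline cases. This isn't our hub's code (that was the vendor's), just a minimal sketch of the kind of pairwise comparison involved, assuming a rapidfuzz-style similarity library and made-up field names:

```python
# Rough illustration of the dedup matching step, not the hub we actually bought.
# Assumes a pandas DataFrame of supplier records with a made-up column name.
import itertools

import pandas as pd
from rapidfuzz import fuzz

def candidate_duplicates(suppliers: pd.DataFrame, threshold: int = 85):
    """Return pairs of supplier rows whose normalized names look like duplicates."""
    def normalize(name: str) -> str:
        # Strip the obvious noise first: case, whitespace, common legal suffixes.
        name = name.lower().strip()
        for suffix in (" inc", " inc.", " llc", " ltd", " gmbh", " co."):
            name = name.removesuffix(suffix)
        return name

    names = suppliers["supplier_name"].map(normalize)
    pairs = []
    for (i, a), (j, b) in itertools.combinations(names.items(), 2):
        score = fuzz.token_sort_ratio(a, b)  # order-insensitive fuzzy match, 0-100
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs

# Starting with the top 200 vendors by spend keeps this O(n^2) comparison
# tolerable; at full scale you would block on country or postal code before
# comparing anything.
```

That narrowing is another reason starting with a high-spend slice works: the matching problem stays small enough that stewards can actually review the edge cases.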
We’re in the same boat. Our finance team has been complaining about duplicate vendor records for years, but it was never a priority until procurement started talking about AI. We hired a third-party data services firm to do an initial audit and cleansing pass, and they’re using a mix of automated matching and manual review. It’s not cheap, but it’s faster than trying to do it all in-house. The key thing we learned is that you need clear data governance policies up front—who owns the supplier record, what the golden source is, and how you handle exceptions. Otherwise you clean it up once and it gets messy again in six months.