Challenge
The client’s team maintained constant contact with manufacturers and vendors, passing parts lists back and forth. They needed to match these up against internal records to make sure everything was accounted for.
Frustratingly, the Manufacturer and Model would often come in slightly different.
For example, these two should be the same:
OLYMPUS AMERICA INC
OLYMPUS
And these models are the same:
CF-10 COMPUTER
CF-10
Another example:
DC 1200
DC-1200
So if you see the part DC 1200
, which isn’t an exact match from your master part list, you’d have to do a lot of manual eye-checking or subjective Ctrl + F searches. Now imagine doing this hundreds of times every day.
Solution
In data cleaning work, we usually perform multiple passes.
The easiest first pass is to remove unexpected whitespaces, and remove common suffixes (such as INC
, LLC
, etc. in manufacturer names).
From there, because a given part is always tied to a manufacturer, the next step was to concatenate the manufacturer with the model to create a unique ID/key for each entry.
Then, I created several LAMBDA functions that calculated “closeness” scores based on dynamic parameters, such as the source string length and how strict the matching should be.
Finally, a comparison sheet would output an aggregated score (based on combining various matching techniques and parameters) with a list of the system’s best guesses.
Outcome
The client now has a simple Google Sheets template where they can paste incoming data, and see the fuzzy matches highlighted within a matter of seconds.
No Google Apps Script or fancy Python scripting required!
Everything is good so far and I don’t have any changes to make on the file.