There is a commonly held belief regarding address standardization, and it goes like this: “If I run data through address standardization software such as USPS CASS Certified address correction, then the addresses are verified, standardized and correct, and by doing so the data can absolutely be relied on for data matching.” This is WRONG! It is probably one of the most pervasive and damaging beliefs in data quality and data matching!
The challenge is that nearly every textbook, tech article, YouTube video and conventional data quality vendor will tell you the same thing: to create a properly formed and consistent matchkey for the address, you must run your address through address correction software to correct, append, transform and standardize your data before creating a conventional matchkey.
Standardize [stan-der-dahyz]
- Verb: to make things the same; to make standard or uniform; to cause to be without variations or irregularities. Forms: standardized, standardizing, standardization.
This notion of address standardization in 'conventional' matchkey creation is so prevalent that nearly every data quality, data integration and analytics software vendor advises that matchkeys should ‘only’ be built from USPS standardized/corrected addresses, fully parsed down to the individual components (e.g. Street Number, Street Pre-directional, Street Name, Street Suffix, Street Post-Directional, PO Box, Street Secondary, City, State, Zip).
Take, for example, the following address: 3500 N CAPITAL OF TEXAS HWY, APT 121, AUSTIN TX 78746-3378. For 'conventional matching' processes to work, the address must be standardized into:
Contrary to what the term “address standardization” suggests, there is no such thing as achieving standardization with address data. This is a bold statement, and it will probably leave even the most entrenched data quality practitioners scratching their heads - or even disagreeing.
But if you doubt it, consider this... According to the State of Texas - my wife and I live in the same house - but we live in different cities.
Yes - our driver licenses say we live in different cities. That’s because the document that I brought to obtain my driver license (an electric bill) to prove residency listed my address as Austin, and her document (a water bill) listed Lakeway.
Think about that - two municipal utility companies providing services to the same home differ as to what city the home is in. Now, from this point forward, the State of Texas and every document that references our driver licenses, whether that be voter registration, passports, financial/lending, etc., consider my wife and me to be in two different cities.
Example:
Why is this belief damaging to data matching? Because 'conventional' matchkeys are intolerant of any deviation from one input to another. Unfortunately, developers, analysts and DBAs build data quality routines ‘expecting’ address correction to fully correct and align their data for matching.
The USPS's primary concern is ‘not’ standardization! They care about one thing, and that one thing is 'deliverability'.
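To make that intolerance concrete, here is a minimal Python sketch of an exact-match key built from parsed components. The field names, the `conventional_matchkey` helper and both sample records are hypothetical (the street and ZIP are placeholders, not a real address); only the Austin/Lakeway disagreement comes from the story above.

```python
# Hypothetical component order for an exact-match ("conventional") key.
KEY_FIELDS = ("street_number", "street_name", "street_suffix", "city", "state", "zip5")


def conventional_matchkey(record: dict) -> str:
    """Concatenate normalized address components into one exact-match key."""
    return "|".join(record[field].strip().upper() for field in KEY_FIELDS)


# Same house, two source documents, two different city names (placeholder address).
electric_bill = {
    "street_number": "101", "street_name": "Example", "street_suffix": "Dr",
    "city": "Austin", "state": "TX", "zip5": "00000",
}
water_bill = dict(electric_bill, city="Lakeway")

key_a = conventional_matchkey(electric_bill)
key_b = conventional_matchkey(water_bill)

# A single differing component is enough to make the exact comparison fail.
print(key_a)
print(key_b)
print("match:", key_a == key_b)
```

Every other component agrees, yet the keys are unequal, so an exact-key join would treat the two records as different households.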
What you have to realize is that address accuracy is so important to the United States Postal Service, that USPS developed a method to evaluate the accuracy of commercially available address correction software called the Coding Accuracy Support System (CASS). Software that meets the USPS standard is deemed CASS Certified and can validate addresses down to the delivery point and verify that an address is deliverable.
Pay attention to that statement - CASS certified software can “validate addresses to the delivery point and verify that an address is deliverable.” It says nothing about the address being standardized or coming out the same every time.
200 Park Ave New York NY 10166 in the postal database returns a staggering 99 possible address matches. Each of these 99 addresses has different ZIP+4 and mailing industry codes.
These are not unique or one-off issues. They are just the top-level ones, and we’ve not even touched on ZIP+4 code accuracy, mail stops, suite numbers, commercial mail receiving agencies, hyphenated addresses, geocoding, college addresses, military APOs and FPOs, or government, foreign and diplomatic addresses.
These issues are prevalent in every database and are what contribute to the complexity of Name Matching in the 21st century.
FACTS
The USPS designed ZIP Codes to increase mail delivery efficiency, not data quality. Here are a few tips.
- With USPS CASS address validation, the city name is whatever USPS says it is, even if that city name isn't the city in which your property is actually located.
- 9-digit ZIP+4 codes do not uniquely identify an address. The ZIP+4 represents a postal delivery area. A ZIP+4 can be as small as a room or a floor, or cover 5-10 houses, a building, a company, a military base, a military unit or command, and sometimes even a ship.
- USPS CASS address standardization will only validate addresses that receive mail. If the postal service doesn't service an area directly, the address won’t be in the database (common for people in rural areas).
- If a physical address does not receive mail, it won't be registered in the USPS database (e.g. college dorms, cemeteries). The reverse also happens: people get their mail at PO Boxes, but they don’t live in them.
- CASS address standardization (DPV) won’t add a street number, but it will tell you if it can deliver mail to it.
- It won’t add a Suite number or even tell you if it’s correct (but it will verify that it exists).
- A ZIP Code isn’t a ‘boundary’ but rather a collection of lines that represent delivery routes that define where the delivery trucks go.
- And, if you’re wondering... Yes - ZIP Codes can and do cross cities, towns, county boundaries, and even state lines.
To be clear - address validation is a necessary component of data quality - but you have to understand its limitations, especially as they relate to data matching.
How does 360Science solve the inherent problems in address matching?
The 360 Matching Engine treats the address as an object - and doesn’t require addresses to be validated, standardized or corrected prior to matching. Additionally, because the Engine has a scoring engine separate from the matching engine, it is tolerant of significant variations in the data and can score the granular differences in address inputs.
This makes the 360Science Matching Engine ideal for applications involving fraud detection or international data where address data is expected to be poorly formatted and uncorrectable.
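The engine's internals are not described here, so the following is not 360Science's method. It is only a generic Python sketch of the broader idea of scoring field-level differences instead of demanding identical keys. The weights, the field names and the second (reformatted) version of the address are assumptions made up for illustration, and the standard-library `difflib.SequenceMatcher` stands in for whatever comparison a real engine would use.

```python
from difflib import SequenceMatcher

# Hypothetical weights: how much each field contributes to the overall score.
WEIGHTS = {"street": 0.6, "unit": 0.2, "city": 0.1, "zip": 0.1}


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in the range 0.0 to 1.0."""
    return SequenceMatcher(None, a.strip().upper(), b.strip().upper()).ratio()


def score_addresses(left: dict, right: dict) -> tuple[float, dict]:
    """Return a weighted overall score plus the per-field similarities."""
    per_field = {f: similarity(left.get(f, ""), right.get(f, "")) for f in WEIGHTS}
    total = sum(WEIGHTS[f] * per_field[f] for f in WEIGHTS)
    return total, per_field


# The article's example address, and an invented, differently formatted variant.
as_keyed = {"street": "3500 N CAPITAL OF TEXAS HWY", "unit": "APT 121",
            "city": "AUSTIN", "zip": "78746"}
as_typed = {"street": "3500 North Capital Of Texas Highway", "unit": "#121",
            "city": "Austin", "zip": "78746"}

total, per_field = score_addresses(as_keyed, as_typed)
print(f"overall score: {total:.2f}")
for field, value in per_field.items():
    print(f"  {field}: {value:.2f}")
```

An exact key would reject this pair outright; a score leaves the accept/review/reject decision to a threshold that can be tuned per application.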
An Important Warning about FRAUD
People who set out to defraud corporations, financial institutions and government agencies understand these nuances of address data - and they exploit them to submit fraudulent claims and collect payments. You CANNOT rely on “address standardization” as a form of data matching.