According to the National Institute of Standards and Technology (NIST), approximate matching is a technology that can be used in a variety of settings, including digital forensics, security monitoring and data filtering. It involves locating similarities among pieces of digital data to match objects that are alike or to find objects that contain other objects. Such technology will likely become more valuable in the coming years as the volume of information collected and used continues to grow.
NIST has published new documentation on approximate matching that explains the technology, describes use cases, and provides testing scenarios. The paper, titled “Approximate Matching: Definition and Terminology,” can be found in our IT Downloads area.
According to the document’s table of contents, the sections include:
- Purpose and scope
- Essential requirements
- Reliability of results
- Motivation
- Test data
- Reference test methodology for bytewise approximate matching
The definition of approximate matching as given by NIST is as follows:
Approximate matching is a generic term describing any technique designed to identify similarities between two digital artifacts. In this context, an artifact (or an object) is defined as an arbitrary byte sequence, such as a file, which has some meaningful interpretation.
Different approximate matching methods may operate at different levels of abstraction. At the lowest level, generic techniques may detect the presence of common byte sequences (substrings) without any attempt to interpret the artifacts. At higher levels, approximate matching can incorporate more abstract analysis. In general lower level methods are expected to be faster and more generic in their applicability, whereas higher level methods are typically more targeted and require more processing.
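To make the lowest level of abstraction concrete, the following is a minimal sketch of detecting a common byte sequence between two artifacts without interpreting them. It is an illustration only, not a method taken from the NIST paper: the function name `longest_common_bytes` is hypothetical, and Python's standard `difflib.SequenceMatcher` stands in for whatever technique a real bytewise tool would use.

```python
from difflib import SequenceMatcher


def longest_common_bytes(a: bytes, b: bytes) -> bytes:
    """Return the longest run of bytes that appears in both a and b.

    Hypothetical helper for illustration; the artifacts are treated as raw
    byte sequences with no attempt to interpret their contents.
    """
    # autojunk=False disables difflib's heuristic that ignores very
    # frequent elements, which could otherwise hide matches in long inputs.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


if __name__ == "__main__":
    first = b"xxHELLOyy"
    second = b"zzHELLOww"
    print(longest_common_bytes(first, second))
```

This approach compares every position and would be slow on large files; it is meant only to show what "common byte sequences" means in practice.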
The paper describes two types of similarity queries: resemblance and containment. A resemblance query takes two or more objects of similar size and compares them for common pieces. When two pieces of data share similarities but differ greatly in size, the query is one of containment, which goes further and asks whether the smaller object resides within the larger one.
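The difference between the two queries can be sketched with a simple set comparison. The example below is an assumption-laden illustration rather than the paper's reference methodology: it breaks each object into overlapping 4-byte pieces (the piece size and the helper names `ngrams`, `resemblance` and `containment` are hypothetical), scores resemblance as the share of pieces the two objects have in common out of all pieces in either, and scores containment as the share of the smaller object's pieces found in the larger one.

```python
def ngrams(data: bytes, n: int = 4) -> set:
    """Split a byte sequence into the set of its overlapping n-byte pieces."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def resemblance(a: bytes, b: bytes, n: int = 4) -> float:
    """Shared pieces divided by all pieces in either object (Jaccard index)."""
    pieces_a, pieces_b = ngrams(a, n), ngrams(b, n)
    union = pieces_a | pieces_b
    return len(pieces_a & pieces_b) / len(union) if union else 0.0


def containment(small: bytes, large: bytes, n: int = 4) -> float:
    """Fraction of the smaller object's pieces that also occur in the larger."""
    pieces_small, pieces_large = ngrams(small, n), ngrams(large, n)
    if not pieces_small:
        return 0.0
    return len(pieces_small & pieces_large) / len(pieces_small)


if __name__ == "__main__":
    small = b"approximate matching"
    large = b"NIST defines approximate matching as a generic term"
    print("resemblance:", resemblance(small, large))
    print("containment:", containment(small, large))
```

Because the short object sits entirely inside the long one, its containment score is high even though the resemblance score stays low, which is why objects of very different sizes call for a containment query.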
The NIST documentation includes a section on terminology, with definitions that help readers understand the characteristics compared among data and the types of methods used to compare them. These terms are used throughout the paper in explaining the technology and its uses.
IT staff dedicated to digital forensics, data scientists, programmers or anyone charged with building or using technology to find similarities among data would benefit from reading this paper. It provides a detailed view of approximate matching and can help readers use the technology to search through and filter similar data within their organization.