Data Mining for Fraud (Part 2)

Number of Transactions

There is a direct correlation between the degree of sophistication of the concealment strategy and the number of transactions meeting the data profile requirements. Highly sophisticated concealment strategies tend to have a larger number of transactions required, whereas data mining routines that search for direct matches use a smaller number of transactions. The characteristics of the relationship between the number of fraudulent transactions and the sophistication level of the concealment strategy include the following.

Low sophistication:

  • Direct matches of the fraudulent entity structure to another entity structure.
  • Entity identifying information links to the perpetrator’s known identifying information, for example, a specific street address.
  • Fraudulent activity is linked to one or a few entity structures.
  • Data mining routine searches key on data matching.
  • Overall sample size is determined by the number of transactions that match the data profile. The sample size can range from zero to a large number.

Medium sophistication:

  • Direct matching routines are less effective. Data interpretation skills are more crucial.
  • Filtering techniques like drill-down analysis are effective in reducing the number of transactions fitting the data profile, thus allowing data interpretation to be more effective.
  • Entity-identifying information relates to some aspect of the perpetrator’s known identifying information, for example, a zip code location versus a physical street address.
  • Sample selection relies on data interpretation skills and scenario-specific data mining routines.
  • Sample size tends to be judgmentally determined versus the use of all transactions meeting the matching criteria.

High sophistication:

  • Direct matches seldom occur.
  • Fraudulent activity may be linked to multiple entities or smaller dollar transactions.
  • Entity-identifying information has no relationship with the perpetrator’s known identifying information.
  • Entity-identifying information may relate to a mailbox service or an out-of area address that has a mail-back feature.
  • Sample selection relies on data interpretation skills.
  • The population for deriving a sample is larger because the selection criterion identifies all transactions in a group versus a specific transaction.
  • Sample size tends to be judgmentally determined versus the use of all transactions meeting the matching criteria.

The Fundamentals

There are certain fundamentals or ground rules to follow in using data mining to locate fraud. They may be considered as the eight steps needed for effective data mining, although following them in a sequence is not necessary:

  1. Understanding the ‘‘what,’’ ‘‘where,’’ and ‘‘how much’’ of data.
  2. Mapping the data fields to the fraud scenario.
  3. Understanding the integrity of the data.
  4. Applying inclusion/exclusion theory.
  5. Understanding false positives.
  6. Understanding the ‘‘norm’’ of the data.
  7. Data correlations.
  8. Entity structures and search routines.

Understanding the ‘‘What,’’ ‘‘Where,’’ and ‘‘How Much’’ of Data

The auditor needs to know ‘‘what’’ data fields exist in the database, then determine if a field is to be used, which may be accomplished by building a listing of the data required and working with the data administrator to identify the name of the data element in the table. Start with a generic listing of the type of data that will be necessary for the audit objectives. For example, in a payroll audit, consider name, employee number, gross compensation, and so on. Once the auditor has determined the type of data to be searched, the generic list should be referenced to the data names in the tables. Reviewing the tables identifies additional data that will assist in the audit. Such a review can also cause confusion as to the existence of data or what data exists in a particular heading. A good example is the general ledger code. The expenditure table may list the accounts payable or inventory code, whereas the purchase order file may indicate the expenditure or budget code. Most likely both tables will use a table heading of general ledger code.

The auditor needs to know ‘‘where’’ the tables with the needed data field are located and how the primary and foreign keys link the tables. This determination requires locating the data element in the applicable table. Care must be taken in ensuring that the extracted data is in fact the data that relates to the fraud scenario. Data tables capture information for a variety of purposes—in some cases, calculations, and in other cases, reporting. The field may be an accumulation field, total overtime for the year, or a rate field. Once the data is extracted, select a few transactions and verify that the extracted data is the actual data needed for interpretation. The auditor needs to know ‘‘how much’’ data is being worked with through counting, by data element, the number of records that are actually populated. This is an important procedure because it will determine whether the selected fields are populated or, in essence, the availability of the data. It is common for columns not to be used or to be sparingly used.

Mapping the Data Fields to the Fraud Scenario

Data mapping is the process of connecting the data to the fraud scenario. The process starts with the fraud scenario, then each data element is linked to the fraud scenario. The mapping process includes the identified patterns that would exist in each data element and what type of data exists in each field.

Understanding the Integrity of the Data

In the real world, data in a database contains errors, blanks, and inconsistency caused by data input or system changes, and data, although accurate, may cause a red flag. So, part of the prep for a data mining routine is cleaning up or scrubbing the data. The auditor needs to identify the data inconsistencies, then develop strategies to fix, group, isolate, or eliminate them. If the problems with the data cannot be fixed, then a decision should be made on how to resolve them within the context of the search routine. For example, a problem occurs when a transaction is entered incorrectly, then it is reversed, and later entered correctly. From a frequency perspective, the one event will have a count of three transactions. Since the number of transactions is inflated, the average will be distorted. To clean this mess, a direct match on the reversals needs to be performed for the purpose of excluding the reversal or place all the reversals into one grouping for purposes of pattern, frequency, and analysis.

Frequently, errors occur due to a data input error. Informational fields, such as order number or date are prone to these types of errors. For date errors, the search routine would count the number of records that contain a date outside of the expected date range. If the date field is critical to the search routine, then transactions should be grouped into two groups, one group having dates within accepted range and the other group having dates out of expected range. If the number of transactions with the error is low, the date could be corrected in the database. Most likely, the number of errors will be too large to fix. Therefore, analyzing the date error transactions in a separate group will not distort the primary purpose of the search routine.

Applying Inclusion/Exclusion Theory

The inclusion portion of the theory starts with a database of transactions where the data is categorized into like groups. The purpose of doing this categorization is twofold. For one, examining a smaller database is easier, and for another, an anomaly is easier to spot when all the transactions are in common. The grouping of data is dependent on the fraud scenario. Some logical groupings are the following:

  • Geographical business divisions or territories.
  • Dollar value of the account or transaction.
  • Transactional codes.
  • Major category of revenue or expenditure.
  • Those with or without control documents.
  • Company anomalies, house accounts, overrides, or one account number used by multientities are all often temporary vendors or one-time vendors.
  • Active entity versus inactive entity.
  • False entity or real entity.
  • Class of transactions.
  • Specific to a person, entity, or account.


  • 2021. Google Image – Data Mining
  • Vona, Leonard. 2011. The Fraud Audit: Responding to the Risk of Fraud in Core Business Systems. John Wiley & Sons, Inc.
Ignatius Edward Riantono, S.E., M.Ak., CCFA, CertDA., CHCM., CPHCM., CHCBP.