Data Mining for Fraud (Part 1)


Fraud auditing is about locating and recognizing fraud, and data mining is the tool used to locate it. Data mining for fraud auditing purposes can be thought of as both a science and an art. It is a science because there is a discovery and exploration aspect to it. It is an art because it depends on the auditor’s ability to analyze the data from many perspectives and arrive at a summary of targeted information. Think of a painter who uses his tools, a brush and paint, to turn numerous strokes into a recognizable pattern. If that analogy is a bit of a stretch for you, think of it as the auditor’s ability to interpret the data to find the proverbial needle in the haystack. From a Las Vegas point of view, the odds are against an auditor who tries to detect a fraudulent transaction through visual, judgmental, or random selection. Databases are very large: business systems process millions of transactions daily, with a total value in the billions of dollars. What can an auditor do? As with any tool, the auditor needs to develop the skills to use data mining effectively.

Simply put, auditors need the ability to sort data into homogeneous groups so that any anomalies become apparent. Granted, the world’s best audit program will not detect fraud if the sample, no matter how well organized for analysis, does not include a fraudulent transaction, and no method of searching for fraudulent transactions is useful if none exist. Then again, do auditors, as professionals, want to ignore the possibility of fraud when we know it exists, and in many varieties of schemes? So we have data mining, which, when used to its fullest, becomes the heart and soul of a good fraud audit program.


As with any science, there is a nomenclature, or terminology, that is used specifically in discussing it. There are also theories, definitions, assumptions, fundamentals, steps, and so on. So it is with our discussion of the science of data mining. The rest is art.

Data Mining Defined

What is data mining? Generally speaking, data mining is the process of analyzing selected data by finding patterns or anomalies in patterns, then organizing those resulting patterns or anomalies for interpretation. Going back to painting, for just a second, where does your eye focus when looking at a painting that is all in shades of gray except for one red stroke? That stroke can be thought of as our anomaly. Data mining, although first used by researchers, has become extremely useful in the area of marketing. With the ever-evolving capacity to centralize data in data warehouses, companies were given the ability to use large amounts of data to analyze an almost endless supply of things like customer demographics, pricing, product sales, web site browsing, and many other parameters.

The end result is that development of customer profiles has become a commonplace occurrence today. In fraud auditing there is profile development as well, making the use of data mining a natural fit. The fraud auditing definition of data mining can be thought of as the process of organizing and analyzing transactional data and descriptive data that are consistent with a fraud scenario. The actual organizing and analyzing can be performed by visually reviewing a journal for transactions that appear suspicious or by using audit software to scrutinize the entire data file. In either case, the auditor has to define attributes for the data that match the fraud scenario.

The result of developing these attributes is the fraud data profile. Going back to the easel, you can think of it as the drawing of a picture of a fraud scenario using data rather than paint. The clarity of the picture depends on the availability and integrity of the information in the database. The fraud data profile will focus on the data in the master file description, the transaction description, or both. Given the size of databases, it helps to key on data that will identify fraud, much like the marketing director who will key on data that shows customer spending habits. So, in building the profile, it is useful to key in on the following data types in detecting fraud:

  • Data that tends to conceal identity, for example, common names or lack of physical address.
  • Data that controls access to the information, for example, no listed telephone number or contact name.
  • Data designed to limit visibility (transparency) of a transaction, for example, structuring transactions below a control threshold or processing updates at off-time periods.
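
The three data types above can be turned into simple attribute tests. The following is a minimal sketch in Python, not a prescribed method: the record layouts (`Vendor`, `Invoice`), the 5,000 approval threshold, the 95 percent "near threshold" band, the business-hours window, and the treatment of a PO box as a missing physical address are all assumptions made for illustration. The "common names" attribute is not covered here.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Vendor:
    """Hypothetical master file record."""
    vendor_id: str
    name: str
    street_address: str
    phone: str
    contact_name: str


@dataclass
class Invoice:
    """Hypothetical transaction record."""
    vendor_id: str
    amount: float
    posted_at: datetime


APPROVAL_THRESHOLD = 5000.00   # assumed control threshold
NEAR_THRESHOLD_RATIO = 0.95    # assumed width of the "just below" band
BUSINESS_HOURS = range(8, 18)  # assumed normal posting hours, 08:00 to 17:59


def vendor_flags(vendor):
    """Return master file red flags tied to identity and access."""
    flags = []
    address = vendor.street_address.strip().lower()
    # Assumption: an empty address or a PO box counts as no physical address.
    if not address or address.startswith(("po box", "p.o. box")):
        flags.append("no_physical_address")
    if not vendor.phone.strip():
        flags.append("no_phone")
    if not vendor.contact_name.strip():
        flags.append("no_contact_name")
    return flags


def invoice_flags(invoice):
    """Return transaction red flags tied to limited visibility."""
    flags = []
    lower_bound = APPROVAL_THRESHOLD * NEAR_THRESHOLD_RATIO
    if lower_bound <= invoice.amount < APPROVAL_THRESHOLD:
        flags.append("just_below_threshold")
    # weekday() returns 5 for Saturday and 6 for Sunday.
    weekend = invoice.posted_at.weekday() >= 5
    if weekend or invoice.posted_at.hour not in BUSINESS_HOURS:
        flags.append("off_hours_posting")
    return flags
```

A flag from either function is only a reason to look closer. A vendor with a PO box and an invoice just under the threshold may be entirely legitimate, so the output is a list of candidates for review rather than a finding.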


To understand the process of data mining, an understanding of the terminology used in the process is helpful. The following is a list of often-used terms.

  • Master file data: All data associated with the entity structure, for example, name, address, customer type, etc.
  • Transactional data: All data associated with an event or action, for example, vendor invoice number, invoice date, invoice amount, etc.
  • Pattern of data: A combination of qualities, acts, and tendencies forming a consistent or characteristic arrangement.
  • Frequency of data: The number of occurrences of an event.
  • Drill-down analysis: A concept analogous to data mining. To work with a large database, it has to be broken down, not only into manageable parts but perhaps down to the lowest level, that of the raw data, to get useful information.
  • Transactional anomaly: A transaction containing the red flags of a fraud scenario.
  • Entities: Can be vendors, customers, employees, or inanimate objects like inventory numbers.
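
Two of the terms above, frequency of data and drill-down analysis, are easy to illustrate with a few lines of Python. The sketch below assumes transactions are held as a list of dictionaries with hypothetical field names (`vendor_id`, `invoice_no`, `amount`); the sample rows are invented for the example.

```python
from collections import Counter, defaultdict

# Invented sample of transactional data.
transactions = [
    {"vendor_id": "V100", "invoice_no": "1001", "amount": 4900.00},
    {"vendor_id": "V100", "invoice_no": "1002", "amount": 4850.00},
    {"vendor_id": "V100", "invoice_no": "1003", "amount": 4990.00},
    {"vendor_id": "V200", "invoice_no": "88-17", "amount": 1320.50},
    {"vendor_id": "V300", "invoice_no": "A-553", "amount": 760.00},
]


def frequency_by(rows, key):
    """Frequency of data: number of occurrences per value of `key`."""
    return Counter(row[key] for row in rows)


def total_by(rows, key):
    """Summary level: total amount per value of `key`."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["amount"]
    return dict(totals)


def drill_down(rows, key, value):
    """Drill-down: return the raw transactions behind one summary line."""
    return [row for row in rows if row[key] == value]


# Start at the summary level, then drill into the most frequent entity.
counts = frequency_by(transactions, "vendor_id")
top_vendor, _ = counts.most_common(1)[0]
detail = drill_down(transactions, "vendor_id", top_vendor)
```

The summary functions reduce the file to one line per entity, and `drill_down` goes back to the raw data for whichever entity stands out.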


Data mining is one of the key elements within the fraud auditing process. However, data mining also has logical limitations that need to be considered in the fraud audit plan. Data integrity will affect data mining effectiveness. The sophistication of the concealment strategy will determine the appropriate search routine and the interpretation of its results. The discussions within this chapter are scenario-based, as opposed to searches for the simple data anomalies that exist in any data file.

The Certainty Principle

We all know that nothing is certain in this world. True to this adage, nothing is certain in the results produced by data mining. To be certain of the results would imply that the data mining was predictive. While we want to scrub the data for problems and reduce the data to key in on fraud, we do not want to be predictive, because that renders the results useless. Data mining allows the production of many informational reports, so concentrating on one perfect report that is guaranteed to find fraud is foolish. Remember, producing many reports allows a breadth of analysis whereby what is missed in one report may be identified in another.
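
The point about breadth can be sketched in code. The example below runs three independent reports over the same invented transactions and records which report flagged each one. The three tests (round amounts, repeated amounts per vendor, weekend dates) and all field names are assumptions chosen for illustration, not a list taken from the source.

```python
from collections import Counter, defaultdict
from datetime import date

# Invented sample data.
rows = [
    {"id": "T1", "vendor_id": "V1", "amount": 3000.00, "date": date(2021, 3, 2)},
    {"id": "T2", "vendor_id": "V2", "amount": 1234.56, "date": date(2021, 3, 3)},
    {"id": "T3", "vendor_id": "V2", "amount": 1234.56, "date": date(2021, 3, 4)},
    {"id": "T4", "vendor_id": "V3", "amount": 780.10, "date": date(2021, 3, 6)},
    {"id": "T5", "vendor_id": "V4", "amount": 412.35, "date": date(2021, 3, 5)},
]


def report_round_amounts(rows):
    """Flag amounts that are an exact multiple of 1,000."""
    return {r["id"] for r in rows if r["amount"] % 1000 == 0}


def report_repeated_amounts(rows):
    """Flag amounts that occur more than once for the same vendor."""
    seen = Counter((r["vendor_id"], r["amount"]) for r in rows)
    return {r["id"] for r in rows if seen[(r["vendor_id"], r["amount"])] > 1}


def report_weekend_dates(rows):
    """Flag transactions dated on a Saturday or Sunday."""
    return {r["id"] for r in rows if r["date"].weekday() >= 5}


def run_reports(rows, reports):
    """Map each flagged transaction id to the names of the reports that hit it."""
    hits = defaultdict(list)
    for name, report in reports.items():
        for txn_id in sorted(report(rows)):
            hits[txn_id].append(name)
    return dict(hits)


hits = run_reports(rows, {
    "round_amounts": report_round_amounts,
    "repeated_amounts": report_repeated_amounts,
    "weekend_dates": report_weekend_dates,
})
```

In this sample each flagged transaction is caught by exactly one report, so any single report on its own would have missed the others.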

Data Mining Routines

Data mining routines must be derived from the fraud scenario identified in the fraud risk assessment. A fraud risk assessment addresses the likelihood of fraud occurring, that is, whether the design of internal control is sufficient to mitigate the fraud risk. Within the fraud audit, however, data mining routines are developed to measure the likelihood that fraud is occurring. The difference between the likelihood of fraud occurring and the likelihood that it is occurring isn’t just semantics. The likelihood that fraud is occurring depends on identifying events that are consistent with a fraud scenario and, therefore, with a fraud data profile.

Data Mining Effectiveness

For data mining to be effective, audit software search features need to be adapted to match the fraud scenario. Therefore, the starting point is the identification of a fraud scenario, followed by the building of a fraud data profile. The data mining routine is then constructed by scrubbing the data of problems and reducing it to a homogeneous population of transactions. A data mining search routine is then performed on this selected data to uncover any red flags associated with the fraud scenario. Data mining effectiveness is directly correlated to the integrity and availability of the data residing in the database, which makes the preparation of the data an imperative step. The details of this preparation are interspersed throughout the sections that follow.
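
The scrub, reduce, and search sequence described above can be sketched as three small functions. This is an illustration under stated assumptions: the field names, the choice of "invoice" as the homogeneous population, and the red-flag test (an amount more than three times the vendor's median) are all hypothetical stand-ins for whatever the fraud scenario actually calls for.

```python
from collections import defaultdict
from statistics import median


def scrub(rows):
    """Normalize vendor ids and amounts; set aside records that cannot be used."""
    clean, rejected = [], []
    for row in rows:
        vendor = (row.get("vendor_id") or "").strip().upper()
        try:
            amount = float(row.get("amount"))
        except (TypeError, ValueError):
            amount = None
        if not vendor or amount is None:
            rejected.append(row)  # kept for follow-up, not silently discarded
            continue
        clean.append({**row, "vendor_id": vendor, "amount": amount})
    return clean, rejected


def reduce_population(rows, txn_type):
    """Keep one transaction type and group it by vendor."""
    groups = defaultdict(list)
    for row in rows:
        if row.get("type") == txn_type:
            groups[row["vendor_id"]].append(row)
    return dict(groups)


def search(groups, factor=3.0, min_count=4):
    """Flag amounts far above the vendor's own median (assumed red flag)."""
    flagged = []
    for vendor_rows in groups.values():
        if len(vendor_rows) < min_count:
            continue  # too few transactions to call anything typical
        typical = median(r["amount"] for r in vendor_rows)
        flagged.extend(
            r for r in vendor_rows
            if typical > 0 and r["amount"] > factor * typical
        )
    return flagged


# Invented sample data, including two records with integrity problems.
raw = [
    {"id": 1, "vendor_id": " v1 ", "type": "invoice", "amount": "100"},
    {"id": 2, "vendor_id": "V1", "type": "invoice", "amount": 110},
    {"id": 3, "vendor_id": "V1", "type": "invoice", "amount": 95},
    {"id": 4, "vendor_id": "V1", "type": "invoice", "amount": 900},
    {"id": 5, "vendor_id": "V1", "type": "credit_memo", "amount": 5000},
    {"id": 6, "vendor_id": "", "type": "invoice", "amount": 50},
    {"id": 7, "vendor_id": "V2", "type": "invoice", "amount": None},
]

clean, rejected = scrub(raw)
flagged = search(reduce_population(clean, "invoice"))
```

Returning the rejected records separately is a deliberate choice in this sketch: records that fail scrubbing say something about data integrity, which the text identifies as a limit on data mining effectiveness.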


  • Vona, Leonard. 2011. The Fraud Audit: Responding to the Risk of Fraud in Core Business Systems. John Wiley & Sons, Inc.
Ignatius Edward Riantono, S.E., M.Ak., CCFA, CertDA., CHCM., CPHCM., CHCBP.