Controlling big data quality: automated quantitative approach

The issuance of BCBS 239, “Principles for effective risk data aggregation and risk reporting”, has marked the growing attention paid to data management processes in the banking industry. As the importance of big data management rises, so does the question of quality control. Around 72% of data miners experience an increase in data volumes, according to the Rexer Analytics Survey (2013). Most respondents mention the time and effort required as the leading challenge brought about by the increasing size of datasets. Furthermore, according to the survey results, the most popular ways to handle such issues are better software, algorithms or code. A range of business and scientific areas in the banking and financial industry work with very large data sets that cannot be checked and tested with traditional tools, so they need a specific approach to handling data quality. In this article we describe a concept of a component-based system that aims to check for existing errors, detect unknown issues and react to those issues in order to minimise risks when using large-scale financial data.

We define the following key tasks related to big scale financial data:

  • controlling and correcting known errors and issues;
  • detecting unknown issues or signalling a potential deterioration in data quality;
  • reducing the time interval between error detection and response.

An example of a related business case from the industry is provided below:

The institution is a hedge fund whose primary business is automated trading on financial markets, with price feeds collected and stored for further use by algorithms. Even minor gaps or errors in the data flow may have an unpredictable effect that escalates through the trading algorithms and affects the volumes traded. Analysts should also use only reliable data, free of systematic errors, to develop algorithms and procedures. With access to multiple exchanges offering thousands of feeds and several updates per second, the institution needs an automatic control mechanism that reduces possible losses stemming from data errors.

Therefore, three main components of the big data quality control process can be defined. Each of these components can function independently; however, only the combination of all three can ensure an effective control environment:

  1. Reports provide snapshots of data quality and controls for known issues.
  2. Statistical process control can signal potentially unknown issues.
  3. An early warning system reacts to signals from the components mentioned above to automate decision making in case data quality is not deemed appropriate.

Diagnostics reports

These reports contain a list of formal rules and tests to be executed to check data for accuracy and completeness. The rules should cover all critical aspects of the data in the light of the strategic business objectives of the institution. The key feature is that each test is easy and straightforward to interpret. Depending on the nature of the business, the diagnostics can be invoked actively (triggered by an event such as new transactions or actions) or passively (on a schedule or by an ad hoc manual request).

The results of diagnostics reports should be carefully reviewed and analysed for violations. If any of the tests breaches its pre-specified threshold, an in-depth investigation should follow. Regular diagnostics are a necessary condition for effective risk management.

For example, a credit risk department may employ a set of 50-100 simple tests to check credit portfolio data for errors on a daily basis, so that all analyses produced by the analytical department and all decisions made by management rely on correct data. A list of payment status tests may contain very simple but critically important logical checks (the number of loans with a negative or zero balance, the current number of clients in the database compared with the previous closing, and so on) as well as more complex tests that can potentially uncover violations in source systems (the total amount of payments received on loans on the previous day cannot be less than the total reduction in portfolio volume). Other sets of rules may test application, behavioural or operational data. As these examples show, a violation of any test is easy to interpret and gives a straightforward signal about data quality (in the cases described above: missing data about loan agreements, missing client records in the database, or missing or incorrect transactions).
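The payment status tests described above can be sketched in a few lines of Python. This is a minimal illustration rather than a production rule set: the record layout (`client_id`, `balance`), the function name `run_payment_checks` and the 5% tolerance on the change in client count are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    value: float      # measured value of the check
    threshold: float  # largest value still considered acceptable

    @property
    def passed(self) -> bool:
        return self.value <= self.threshold


def run_payment_checks(loans, prev_client_count, prev_volume, payments_received,
                       max_client_change=0.05):
    """Run three illustrative payment status checks on a daily loan snapshot."""
    # Check 1: loans with a negative or zero balance should not exist.
    bad_balances = sum(1 for loan in loans if loan["balance"] <= 0)

    # Check 2: the client count should stay close to the previous closing.
    # The 5% default tolerance is an assumed value.
    clients = len({loan["client_id"] for loan in loans})
    client_change = abs(clients - prev_client_count) / prev_client_count

    # Check 3: payments received must cover the reduction in portfolio volume.
    volume = sum(loan["balance"] for loan in loans)
    reduction = max(0.0, prev_volume - volume)
    shortfall = max(0.0, reduction - payments_received)

    return [
        CheckResult("non_positive_balance", bad_balances, 0),
        CheckResult("client_count_change", client_change, max_client_change),
        CheckResult("payment_shortfall", shortfall, 0.0),
    ]


# Hypothetical snapshot: one loan has a zero balance, and the portfolio
# shrank by 500 while only 300 of payments were recorded.
loans = [
    {"client_id": 1, "balance": 1200.0},
    {"client_id": 2, "balance": 0.0},
    {"client_id": 3, "balance": 800.0},
]
for result in run_payment_checks(loans, prev_client_count=3,
                                 prev_volume=2500.0, payments_received=300.0):
    print(result.name, result.value, "OK" if result.passed else "VIOLATION")
```

Each check returns a measured value together with its threshold, so a violation can be read directly from the report without further interpretation.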

Different tests can be combined into scores to provide a single scalable measure of data quality. The scores can further allow formal comparison of data sources, as well as checking the influence of aggregated quality on operational or financial results.
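One simple way to build such a score, shown here only as an assumed example, is the weighted share of passed tests. The function name `quality_score`, the test names and the weights below are hypothetical; an institution would choose weights that reflect its own priorities.

```python
def quality_score(results, weights=None):
    """Return the weighted share of passed tests, a number between 0 and 1.

    results: dict mapping test name -> True if the test passed.
    weights: optional dict mapping test name -> importance (default 1.0 each).
    """
    if not results:
        raise ValueError("at least one test result is required")
    weights = weights or {}
    total = sum(weights.get(name, 1.0) for name in results)
    passed = sum(weights.get(name, 1.0) for name, ok in results.items() if ok)
    return passed / total


# Hypothetical results of the same test set applied to two data sources.
source_a = {"non_positive_balance": True, "client_count": True, "payment_shortfall": False}
source_b = {"non_positive_balance": True, "client_count": False, "payment_shortfall": False}
weights = {"payment_shortfall": 2.0}  # assumed: this test counts twice as much

print(quality_score(source_a, weights))  # 0.5
print(quality_score(source_b, weights))  # 0.25
```

Because both sources are scored on the same scale, the two numbers can be compared directly or tracked against operational results over time.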

Statistical process control

One disadvantage of diagnostic reports as a stand-alone tool is that they usually contain tests that cover only a narrow area or task at any single moment in time. As such, those tests cannot distinguish stochastic deviations in data quality from systematic changes at an early stage, and the test set has to be extended whenever a new issue becomes known. Statistical Process Control (SPC) is another useful tool, which can be applied to analyse data quality within an ongoing process and helps to detect and signal unknown issues. In general, the purpose of such a process is quantifying the accuracy, precision, and stability of a measurement system. However, it can also be applied to monitor data quality and provide statistical signals when extraordinary events or variations in data are detected.

The claim that SPC can help to detect new data quality issues using existing data checks, which do not themselves cover unknown problems, requires further clarification.

There are two alternatives for checking data quality when an institution onboards a new data format or data source. The costly and relatively time-consuming way involves allocating resources to investigate cases manually (using Excel or SQL, for example) or creating a new list of formal checks. Alternatively, to save costs and effort, existing checks can be used with no modifications. Neither way can guarantee that new issues will be detected in a timely manner.

Although existing tests may not directly explain the nature and the source of a problem, they can signal that a problem may exist. The idea of data quality SPC is that any violation in data quality may be reflected in the results of simple tests over time. To ensure that a set of tests provides this signal, it needs to cover all observable aspects of data quality. We provide a simple example of a related business case below.

The institution is a debt collection company whose business relies on data on retail credit portfolios. The company is growing rapidly and purchases new portfolios from other institutions, each of which typically has its own data model, quality, and formats. A single portfolio may contain up to several million records with more than 100 data fields. The company is considering the purchase of a new portfolio, obtains a sample of the data and executes simple tests on the debtor’s residential address field. The results immediately show the following:

  1. Although there are no records with an empty address, the length of the field is on average significantly shorter than for all previous portfolios.
  2. On average, the share of digits in the address string is lower than for all previous portfolios.

The company then screens examples of records with extreme values of the characteristics mentioned above and finds that some records are missing the 6-digit postal code in the address field. Note that those two characteristics are uninformative unless they are analysed cross-sectionally (against other portfolios) or over time.
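The two address characteristics and their cross-sectional comparison can be sketched as follows. The sample addresses are invented for illustration, and the z-score against the profiles of previous portfolios is an assumed way to express "significantly shorter" or "lower"; the article does not prescribe a particular statistic.

```python
from statistics import mean, stdev


def address_profile(addresses):
    """Average length and average share of digits for a list of address strings."""
    filled = [a for a in addresses if a]  # guard against empty strings
    return {
        "avg_length": mean([len(a) for a in filled]),
        "avg_digit_share": mean([sum(c.isdigit() for c in a) / len(a) for a in filled]),
    }


def z_score(value, history):
    """Distance of a new value from the historical mean, in standard deviations."""
    return (value - mean(history)) / stdev(history)


# Hypothetical samples from three previously purchased portfolios.
previous_portfolios = [
    ["12 Main St, Springfield, 123456", "7 Oak Ave, Shelbyville, 654321"],
    ["101 Pine Rd, Ogdenville, 112233", "5 Elm St, Springfield, 445566"],
    ["9 Lake Dr, Capital City, 778899", "33 Hill Ln, Shelbyville, 990011"],
]
# Hypothetical sample from the new portfolio: one record lacks the 6-digit code.
new_sample = ["14 Main St, Springfield", "8 Oak Ave, Shelbyville, 654321"]

history = [address_profile(p) for p in previous_portfolios]
candidate = address_profile(new_sample)
for metric in ("avg_length", "avg_digit_share"):
    z = z_score(candidate[metric], [h[metric] for h in history])
    print(metric, round(z, 1))  # strongly negative values flag the new portfolio

# Screen the records with the lowest share of digits first.
suspects = sorted(new_sample, key=lambda a: sum(c.isdigit() for c in a) / len(a))
print(suspects[0])
```

Neither metric means much for a single portfolio in isolation; the signal comes from comparing the new sample with the history.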

This example shows how running two simple tests can help to detect a data quality violation that is unobservable from standard checks. An SPC process for the described case can easily be constructed on top of regular diagnostics reports. If individual tests contained in a diagnostics report fail predefined statistical criteria, even while staying within their thresholds, this may signal systematic changes in the data and call for a reconsideration of applied practices.
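A minimal sketch of such an SPC check on a series of daily test results is given below. It assumes a Shewhart-style chart with three-sigma control limits estimated from a baseline period, plus a common run rule (eight consecutive points on the same side of the centre line). The sigma estimate uses the plain sample standard deviation as a simplification, and the data and the 2% hard threshold are invented for the example.

```python
from statistics import mean, stdev


def control_limits(baseline, k=3.0):
    """Centre line and k-sigma control limits estimated from a baseline period."""
    center = mean(baseline)
    sigma = stdev(baseline)  # simplification: plain sample standard deviation
    return center - k * sigma, center, center + k * sigma


def spc_signals(values, baseline, run_length=8):
    """Return (index, reason) pairs for points that break the SPC criteria."""
    lower, center, upper = control_limits(baseline)
    signals = []
    run_side, run_count = 0, 0
    for i, v in enumerate(values):
        if v < lower or v > upper:
            signals.append((i, "outside control limits"))
        # Track how many consecutive points lie on the same side of the centre.
        side = (v > center) - (v < center)
        if side != 0 and side == run_side:
            run_count += 1
        else:
            run_side, run_count = side, (1 if side != 0 else 0)
        if run_count == run_length:  # signal once per run
            signals.append((i, "run on one side of centre line"))
    return signals


# Hypothetical daily share of records failing one diagnostic test.
baseline = [0.010, 0.011, 0.009, 0.010, 0.012, 0.008, 0.010, 0.011, 0.009, 0.010]
recent = [0.011, 0.012, 0.011, 0.013, 0.012, 0.011, 0.012, 0.013, 0.015]
HARD_THRESHOLD = 0.02  # assumed threshold of the diagnostics report

print(all(v < HARD_THRESHOLD for v in recent))  # True: the report shows no violation
print(spc_signals(recent, baseline))
```

In this example every recent value stays below the report threshold, yet the chart raises two signals: a sustained run above the centre line and, on the last day, a point outside the control limits.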

Early warning system

If data quality does not satisfy the requirements of the institution’s operational risk framework, proactive measures should be applied. The concept of an Early Warning System (EWS) can best be explained in the context of automated financial systems. Reducing the time gap between the detection of a data error and the reaction to it may be critical for businesses with a high frequency of transactions. Note that this logic can be applied to both externally and internally generated data.

As a best practice, every automated system should have “natural” self-checks to ensure that the algorithms applied work properly. For example, for a trading system, the number of positions opened plus the number of orders rejected should equal the number of orders sent. However, it is crucial to handle the case in which the system unexpectedly makes incorrect decisions (as illustrated by the loss of over $400 million incurred by Knight Capital in 2012). Therefore, an additional data management component (EWS) is required to address such cases.
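The order-flow identity mentioned above is the kind of self-check that takes one line to express. The function and argument names in this sketch are hypothetical.

```python
def order_flow_consistent(orders_sent, positions_opened, orders_rejected):
    """Every order sent must end up either as an opened position or as a rejection."""
    return positions_opened + orders_rejected == orders_sent


print(order_flow_consistent(1000, 940, 60))  # True
print(order_flow_consistent(1000, 940, 55))  # False: 5 orders are unaccounted for
```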

An Early Warning System typically relies on the results of statistical tests produced by SPC systems. In the example above, the system could signal unusual trading results compared with historical data and trigger corresponding actions immediately, thus significantly reducing losses.
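How an EWS maps incoming signals to actions depends on the institution's risk framework. The sketch below uses an assumed three-level policy (continue, alert, halt) purely to illustrate the idea of an automated reaction; the `Action` levels, the function `early_warning` and its inputs are hypothetical.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    ALERT = "alert analysts"
    HALT = "halt automated processing"


def early_warning(diagnostic_violations, spc_signal_count, self_checks_ok):
    """Map the state of the other components to an action (assumed policy)."""
    # A failed self-check or a breached diagnostic threshold stops processing.
    if not self_checks_ok or diagnostic_violations > 0:
        return Action.HALT
    # Statistical signals without a hard violation call for human review.
    if spc_signal_count > 0:
        return Action.ALERT
    return Action.CONTINUE


print(early_warning(0, 0, True).value)   # continue
print(early_warning(0, 2, True).value)   # alert analysts
print(early_warning(0, 0, False).value)  # halt automated processing
```

Because the decision is a pure function of the signals, it can run on every update, which keeps the interval between detection and response short.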

We have presented a concept of a component-based system for large-scale data quality management based on diagnostic reporting, Statistical Process Control and an Early Warning System.


Alexey Malashonok is a risk manager and quantitative analyst with over 7 years of experience in the financial and banking industries. His practice includes working for large international banks, an algorithmic trading fund, a consulting and software development company, and a debt collection agency. His primary areas of expertise include credit and market risks, valuation and forecasting models, Basel and IFRS requirements and enterprise-wide risk management solutions. He is a certified Financial Risk Manager and a Level II Candidate in the CFA Program.


Anna Bielenka has been working in finance and banking for over 7 years and holds a PhD degree in Finance. Having started her career in an ALM and liquidity risk management function with one of the major CEE banks, she worked on topics such as stress testing, liquidity ratio management, and management reporting. She later managed a software development project in the Middle East, covering counterparty credit risk, market risk, and Basel III capital charge calculation for OTC derivatives. Currently Bielenka is responsible for the functional support of the pre-sales process in EMEA at SunGard.