Medical research is one area where it can be said, with little equivocation, that reams of data are collected. To give a sense of scale, an average clinical research study managed by the Applied Health Research Center (AHRC), which collects data on patients who match the protocol’s inclusion and exclusion criteria in a specific therapeutic area, will enroll about 650 patients and collect about 1,000 pieces of data on each patient – about 650,000 pieces of data in total. And that’s only one study! As you collect data from more studies, the numbers add up quickly. You may ask, why bother collecting volumes of data?

The simple answer – the larger and richer a data source, the more population-level inferences can be drawn from that data.

Complex analyses, such as identifying genetic markers of disease, can only be performed on data from hundreds of thousands of patients or more. In even more complex therapeutic areas such as oncology or neurology, data sources must scale into the millions of patients for clinically meaningful answers to be generated. The challenge with clinical research is not limited to the sheer volume of data; it extends to the inherent heterogeneity of the clinical research studies being performed, and to the aggregation, centralization, management, and analysis of the data itself. The purpose of this article is to identify some of the common challenges that projects with grand visions for data collection can face. Let’s take a few moments to consider some of these challenges.


Data aggregation involves gathering data from potentially hundreds or even thousands of different sources, then summarizing and presenting it in a format that can be analyzed effectively. How do you begin combining that data?

Let’s look at an illustrative example, involving data from three different studies where ethnicity is being collected:

  • Study #1 – collects important data such as ethnicity, and classifies ethnicity using the US National Institutes of Health (NIH) classification system (e.g. ethnicity = Asian)
  • Study #2 – collects the same data on ethnicity, but instead uses the US Food & Drug Administration (US FDA) classification system (e.g. ethnicity = Pacific Islander)
  • Study #3 – also collects data on ethnicity, but uses one of Statistics Canada’s classification systems (e.g. ethnicity = East Asian)

So, how does a group thinking about data warehouses combine and aggregate the above data? Practically speaking, aggregation requires a common definition for each variable, which in this case is “ethnicity = Asian”. In addition to common data structure definitions, a comprehensive and rigorous process needs to be in place to map the clinical research data into those “common” definitions. The warehouse will need to create a common definition of “Asian” that allows each response to be stored properly and accurately, regardless of the heterogeneity of the source data. Many companies, to make this step seamless, sync their customer data to online tools such as Marketo (companies like Grouparoo can help with that), which can help them aggregate big data efficiently, particularly when the data needs to be divided into different fields. The common definition is vital to ensure that aggregate data can be easily analyzed to answer large, population-level questions.
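The mapping step described above can be sketched in a few lines of Python. This is a minimal illustration only: the source labels (`NIH`, `FDA`, `STATCAN`), the field names, and the mapping table are hypothetical, and the choice to map all three example values to “Asian” simply follows the example in this article rather than any agency’s official code list. In practice the mapping would be defined and reviewed by the warehouse’s curation team.

```python
from collections import Counter

# Hypothetical lookup: (source classification system, source value) -> common value.
# These entries mirror the three-study example in the text; they are not official codes.
COMMON_ETHNICITY = {
    ("NIH", "Asian"): "Asian",
    ("FDA", "Pacific Islander"): "Asian",
    ("STATCAN", "East Asian"): "Asian",
}


def harmonize(record):
    """Return a copy of the record with a common-definition ethnicity added.

    Values with no mapping are flagged for manual review instead of being guessed.
    """
    key = (record["source"], record["ethnicity"])
    common = COMMON_ETHNICITY.get(key)  # None when no mapping exists
    return {**record, "ethnicity_common": common, "needs_review": common is None}


records = [
    {"study": 1, "source": "NIH", "ethnicity": "Asian"},
    {"study": 2, "source": "FDA", "ethnicity": "Pacific Islander"},
    {"study": 3, "source": "STATCAN", "ethnicity": "East Asian"},
    {"study": 3, "source": "STATCAN", "ethnicity": "Unrecognized value"},
]

harmonized = [harmonize(r) for r in records]

# Aggregate only the records that were mapped successfully.
counts = Counter(r["ethnicity_common"] for r in harmonized if not r["needs_review"])
print(counts)
```

Keeping the original value alongside the common one means a mapping decision can be revisited later without going back to the source study.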


Some research questions can only be answered effectively by using data from large repositories, and a data warehouse of this size needs to be centralized. This implies a series of requirements, including space, power, back-up power, computer hardware, database software, back-ups, antivirus software, network connections, technical support staff, software/programming specialists, database specialists, and more.

Some key technical challenges to take into account include:

  • availability – how much down-time are you willing to tolerate, and what level of access/connectivity do users require?
  • scalability – as your data sources and volume of data grow, can your infrastructure handle it?
  • integration – can data be easily retrieved and combined from your potentially proprietary data capture systems?
  • redundancy – are there backups, failovers, and other safety-nets in place?
  • resiliency – what happens if something catastrophic happens, or if the power goes out, or you have a database crash?
  • infrastructure – can the infrastructure needs of the data warehouse be easily met?
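To make the availability question concrete, the short Python sketch below converts an availability target into the down-time it permits per year. The targets shown are illustrative and are not AHRC requirements; the function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(availability_pct):
    """Hours of down-time per year permitted by an availability target (in percent)."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


# Illustrative targets only.
for target in (99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(target)
    print(f"{target}% availability allows about {hours:.2f} hours of down-time per year")
```

Each additional “nine” cuts the permitted down-time by a factor of ten, which is one reason higher availability targets tend to drive up the cost of redundancy and support.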


Data is never going to be 100% perfect, reliable, or accurate. Given this, how does a data warehouse confirm and manage the accuracy and reliability of the data collected? How does it ensure that, as new clinical research study data is added to the database, the data complies with the common data structures that have been put in place so that it can be used and analyzed? These are non-trivial challenges, and they require significant investments in human capital. Large health-oriented data warehouses, such as the STRIDE project, have archivists on staff whose primary responsibility is to ensure data is managed and curated responsibly.
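One way to picture the compliance check described above is a small validation routine that runs before new study data is loaded. The Python sketch below is an assumption-laden illustration: the field names (`patient_id`, `ethnicity_common`, `age`), the allowed values, and the age range are all hypothetical and do not describe the AHRC’s or STRIDE’s actual data structures.

```python
# Hypothetical common definition for one variable.
ALLOWED_ETHNICITY = {"Asian", "Black", "White", "Other"}


def validate(record):
    """Return a list of problems found in one incoming record (empty if none)."""
    problems = []

    patient_id = record.get("patient_id")
    if not isinstance(patient_id, str) or not patient_id:
        problems.append("patient_id is missing or not text")

    if record.get("ethnicity_common") not in ALLOWED_ETHNICITY:
        problems.append("ethnicity_common is not in the common definition")

    age = record.get("age")
    if not isinstance(age, int) or not 0 <= age <= 120:
        problems.append("age is missing or out of range")

    return problems


incoming = [
    {"patient_id": "P-001", "ethnicity_common": "Asian", "age": 54},
    {"patient_id": "", "ethnicity_common": "East Asian", "age": 54},
]

# Records with problems are held back for a curator to review rather than loaded.
for record in incoming:
    issues = validate(record)
    status = "accepted" if not issues else "held for review: " + "; ".join(issues)
    print(record.get("patient_id") or "<no id>", "->", status)
```

Automated checks like this catch only structural problems; judging whether a value is clinically plausible still depends on the trained staff mentioned above.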


The purpose of a data warehouse or a data analytics platform is to have sufficient data to ask and answer the tough questions that change the healthcare and health outcomes of populations. Examples of these types of questions include determining drug-drug interactions in large populations, understanding the links between genetic variations and disease, assessing the cost-effectiveness of treatments, and more. Perhaps surprisingly, then, the greatest challenges with data warehouses arise not from the operational issues outlined above, but from the difficulty of actually performing the methodologically, statistically, and scientifically valid analyses that will help to answer complex questions in a reproducible way. Population-level analyses require expertise in complex statistical methodologies, such as cluster analysis, genetic algorithms, predictive modeling, pattern recognition, time-series analyses, and more. Dedicated, trained, and extremely capable statisticians are essential: people who understand not only how to generate results, but also how to prepare data and carry out the steps needed to ensure its accuracy, reliability, and validity before the actual analyses begin.

While this article may be a useful overview of challenges to consider when working with big data, it is recommended that interested readers review the successes, failures, and obstacles that the following large and successful projects have experienced, to gain a deeper understanding of big data and data warehouses within clinical research.

  1. STRIDE (Stanford Translational Research Integrated Database Environment):
  2. University of California at San Diego – Standards in the Use of Collaborative or Distributed Data Networks In Patient Centered Outcomes Research:

To find out more about the data management services available at the AHRC, contact Khalid Sabihuddin at 416-864-6060 ext. 3925.


KHALID SABIHUDDIN is the Manager of Business Operations

NATASCHA KOZLOWSKI is the Manager of Research Quality & Process

CHRISTOPHER DUCHARME is the Manager of Research Informatics

SARAH GRANT ALVARADO is a Research Coordinator III and responsible for communications and marketing at the AHRC