The Importance of Data Standards to Biospecimen Collection
Martin Ferguson, Ph.D.
Pharmaceutical and Life Sciences Consultant
Research tissue banks, quality controlled and with enough samples to have statistical power, are becoming an invaluable resource in this age of high-throughput genomic technologies. A well-designed biorepository can support significant contributions to biomarker discovery and validation, target identification, and other critical points along the pathway to molecular medicine. However, many hurdles to establishing high-quality biospecimen collections remain, including a significant burden created by a broad lack of data standards. Any biospecimen collection is only as good as the clinical data annotating those biospecimens. Unfortunately, the continuing diversity of ethical and legal strictures for clinical data access, use, and distribution, coupled with a diversity of technical formats for data definitions, continues to impede the development of good biobanks.
This essay will appear to focus on the technical issues, describing some of the hurdles and giving examples of projects currently clearing those hurdles. Indeed, the literature and trade press frequently describe informatics research successes and new products aimed at solving biobank data standardization problems. However, the main goal of this piece is to bring the reader to an understanding that “policy” ambiguity is a major factor delaying resolution of many gaps in the technical data standards that are ultimately required to support a well-designed biorepository. In other words, to use a phrase common among software development managers, many key obstacles impacting adoption of data standards in biobanking are “human problems masquerading as technical ones.”
As biorepositories become more important, larger, and organized as parts of networks, it becomes even more critical that the data annotating the inventory are “machine readable” – a term used to mean that the data can be operated on by computer software. Biobank data must support everything from the logistics of sample inventory management (e.g., freezer locations), to patient confidentiality protection (e.g., controlled data access based on a user’s level of authority), to the ultimate scientific analysis (e.g., correlation between clinical data and molecular profiles) the biobank was developed to support. The following list summarizes the main domains of data which must be maintained by a high quality biorepository:
- Protocol information that covers the ethical and legal characteristics of the collection, such as permissible use of sample and data granted during an informed consent
- Donor annotation that describes the historical, current, and on-going clinical status of the patient, such as diagnosis or treatment and response
- Biospecimen annotation that describes the sample, such as histopathology, collection protocol, and physical formats (e.g., embedded in paraffin)
- Biospecimen tracking/logistics/material management data that describe physical location, tracking, shipping/receiving, and control of access
- Quality control data collected during various processes, covering the physical sample, clinical data validation, and the quality of molecular analytes extracted from tissues
To underscore the importance of having such data understandable by computer software, consider a relatively typical and simple query that a researcher might ask of an inventory: “identify all frozen samples of untreated lung adenocarcinoma from still living, 40- to 50-year-old donors for which the patient has consented to being re-contacted or to having their samples and data used in a genetic profiling study different from the original collection protocol (i.e., the patient granted non-specific consent).” An individual researcher working with his own small biobank could manually review medical charts, informed consents, and an inventory maintained in a spreadsheet to answer that question. By contrast, asking that question of a large collection generated by a multi-site clinical trial absolutely requires that the key concepts in the query (preservation format, treatment status, diagnosis, vital status, donor age, and scope of consent) be computer readable.
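To make the point concrete, the researcher's query can be sketched as a filter over structured records. This is only an illustrative sketch: the `Specimen` fields, the `Consent` values, and the sample inventory are hypothetical and are not drawn from any real biobank schema or standard.

```python
from dataclasses import dataclass
from enum import Enum


class Consent(Enum):
    """Hypothetical coding of the scope of informed consent."""
    SPECIFIC = "specific"          # use limited to the original protocol
    NON_SPECIFIC = "non_specific"  # use permitted beyond the original protocol


@dataclass(frozen=True)
class Specimen:
    """Hypothetical, minimal annotation for one specimen in inventory."""
    specimen_id: str
    preservation: str          # e.g. "frozen" or "paraffin"
    diagnosis: str             # a coded diagnosis term
    treated: bool
    donor_alive: bool
    donor_age: int
    recontact_permitted: bool
    consent: Consent


def matches_query(s: Specimen) -> bool:
    """Return True if the specimen satisfies every clause of the example query."""
    return (
        s.preservation == "frozen"
        and s.diagnosis == "lung adenocarcinoma"
        and not s.treated
        and s.donor_alive
        and 40 <= s.donor_age <= 50
        and (s.recontact_permitted or s.consent is Consent.NON_SPECIFIC)
    )


# Made-up inventory used only to exercise the filter.
inventory = [
    Specimen("S-001", "frozen", "lung adenocarcinoma", False, True, 45, True, Consent.SPECIFIC),
    Specimen("S-002", "paraffin", "lung adenocarcinoma", False, True, 47, True, Consent.NON_SPECIFIC),
    Specimen("S-003", "frozen", "lung adenocarcinoma", False, True, 62, False, Consent.NON_SPECIFIC),
    Specimen("S-004", "frozen", "lung adenocarcinoma", False, True, 41, False, Consent.NON_SPECIFIC),
]

selected_ids = [s.specimen_id for s in inventory if matches_query(s)]
print(selected_ids)
```

The filter itself is trivial. What makes it possible is that every clause of the query maps to a field with an agreed, unambiguous meaning, which is exactly what loose prose in a chart or consent form does not provide.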
To continue this analysis in the context of the researcher’s question, consider two biospecimen characterization terms that fall under the first data domain listed above (protocol information): “de-identified” and “anonymized.” Various human subject protocols, federal regulations, policy documents of many academic medical centers, and treatises by bioethicists routinely interchange these terms to variously mean clinical data that are:
- Stripped of specific identifiers such as name, social security number, or medical record number.
- Stripped of specific identifiers (as above) and stripped of non-specific identifiers, such as date and time of procedure, geography, age, or name of care providers, which can be used to “forensically” infer the identity of the donor.
- Linked or not linked by a secret code, regardless of the level of identifier stripping, back to the patient’s specific name or other ID.
- Stripped of all information and permanently unlinked from the donor.
Clearly, aspects of these definitions are very different, even mutually exclusive. Nevertheless, for software to answer the researcher’s question, the query algorithm needs access to demographic information and must establish that the donor’s sample is part of a collection for which ongoing data collection is permitted; that is, the sample must be “linked” back to the patient to enable re-contacting the patient and collecting vital status data. To achieve this computer readability, the data describing the above characteristics of specimens in inventory cannot be in the form of loose prose, nor can they be defined in a multitude of incomparable computer readable standards. From a computer’s perspective, the terms “anonymized” and “de-identified” must be semantically unambiguous and then specifically codified for use in software.
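One way to picture what “specifically codified” could mean is to split the overloaded terms into two independent, enumerated properties: how much identifying information has been stripped, and whether a link back to the donor survives. The sketch below is an assumption made for illustration only; the names `Stripping`, `Linkage`, and `IdentifiabilityStatus` are hypothetical and do not come from any regulation or published vocabulary.

```python
from dataclasses import dataclass
from enum import Enum


class Stripping(Enum):
    """Hypothetical levels of identifier removal, ordered from least to most."""
    NONE = 0
    SPECIFIC_IDENTIFIERS = 1    # name, social security number, medical record number removed
    SPECIFIC_AND_INDIRECT = 2   # dates, geography, age, care provider names also removed
    ALL_INFORMATION = 3         # no annotation retained at all


class Linkage(Enum):
    """Hypothetical coding of whether a route back to the donor exists."""
    CODED = "coded"        # a secret code links the record to the donor
    UNLINKED = "unlinked"  # permanently unlinked from the donor


@dataclass(frozen=True)
class IdentifiabilityStatus:
    stripping: Stripping
    linkage: Linkage

    def supports_followup(self) -> bool:
        """Re-contact and vital status updates require a surviving link."""
        return self.linkage is Linkage.CODED

    def retains_age(self) -> bool:
        """In this sketch, age is removed together with the indirect identifiers."""
        return self.stripping.value < Stripping.SPECIFIC_AND_INDIRECT.value


# Two records that loose prose might both call "anonymized".
coded = IdentifiabilityStatus(Stripping.SPECIFIC_IDENTIFIERS, Linkage.CODED)
unlinked = IdentifiabilityStatus(Stripping.ALL_INFORMATION, Linkage.UNLINKED)

print(coded.supports_followup(), coded.retains_age())
print(unlinked.supports_followup(), unlinked.retains_age())
```

Under this coding the two records can no longer be confused: only the first can answer a query that depends on donor age and on permission to collect follow-up data. Choosing the categories is the policy question; encoding them is the easy part.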
Projects exist to create such meta-databases of terminologies that are of interest to the biomedical informatics community; in fact, there are specific projects to support the computing and data standards infrastructure required by the biobanking community. An excellent example can be found in the US National Cancer Institute’s (NCI) cancer Biomedical Informatics Grid (caBIG™) project. Two activities within the caBIG project, one on tissue banks and another on common data elements, are collaborating closely to develop biospecimen management software applications that characterize their inventory in a common language. The NCI Center for Bioinformatics supports centralized metadata repositories called the cancer Data Standards Registry (caDSR) and an Enterprise Vocabulary Service (EVS), to which software on the internet can refer for standardized definitions of concepts such as the term “anonymized.” These tools are designed to support central metadata services for all of the domains described above. For the NCI, the goal of such work is to ensure that the entire federally funded cancer biomedical research effort (and, it is hoped, eventually efforts beyond it) is capable of unambiguously sharing data across institutions and projects.
Nevertheless, despite these technology-rich efforts, the community (writ large) must come together to disambiguate meanings before software can hope to truly support large, high quality, nationwide biobanking efforts. In the central metadata repositories, the terms must be unique and unambiguous before software applications around the internet can meaningfully point to them as a central authority on definition. Frankly, this coming together of people’s mindsets, across many different entities, is the hard part in our de-centralized system.
The example of “anonymized” vs. “de-identified” described above is repeated hundreds of times across all the domains relevant to biorepositories. The ethical and legal issues alone that affect the development of human biobanks are extremely complex, are encountered in almost every biospecimen-based research effort, and warrant a significant treatise in their own right. Researchers collecting biospecimens face duplicative processes and mutually exclusive definitions of many terms because, for example, HIPAA and “The Common Rule” (the set of regulations governing human subjects research) were legislated independently. There also remains ambiguity about access to and use of clinical data from an ethical perspective, in large part because ethical best practices are considering issues far ahead of where the regulations currently apply. For example, is a genetic profile incorporated into the clinical record identifying data? Such a profile is clearly unique, yet it is not a data element that HIPAA defines as identifying.
Some progress has been made, sometimes from surprising directions such as legislation. For example, the HIPAA “privacy rule,” while perhaps painfully laborious to implement, has significantly clarified the requirements for working with clinical data. The “privacy rule” specifies a list of data elements that must be removed from clinical information to categorize it as “de-identified” (for HIPAA purposes), thus permitting that data to be used freely in research. The list is specified with enough precision that it can readily be incorporated into software systems that protect access to clinical data based on different tiers of authorization. Such important details were added to the “privacy rule” after significant input by the research community during the comment period.
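As a rough sketch of how a precisely specified list can be built into tiered access control, consider the following. The field names, tier names, and identifier lists are all hypothetical and deliberately abbreviated; they are not the list in the “privacy rule,” which should be taken from the regulation itself.

```python
# Illustrative subsets only, NOT the regulatory list of data elements.
DIRECT_IDENTIFIERS = frozenset({"name", "social_security_number", "medical_record_number"})
INDIRECT_IDENTIFIERS = frozenset({"procedure_date", "zip_code", "care_provider"})

# Hypothetical authorization tiers mapped to the fields each tier may not see.
HIDDEN_FIELDS_BY_TIER = {
    "clinical_care": frozenset(),
    "coded_research": DIRECT_IDENTIFIERS,
    "de_identified_research": DIRECT_IDENTIFIERS | INDIRECT_IDENTIFIERS,
}


def view_for_tier(record: dict, tier: str) -> dict:
    """Return a copy of the record with the fields hidden from this tier removed."""
    hidden = HIDDEN_FIELDS_BY_TIER[tier]  # raises KeyError for an unknown tier
    return {field: value for field, value in record.items() if field not in hidden}


# Made-up record used only to exercise the function.
record = {
    "name": "Jane Doe",
    "medical_record_number": "MRN-0000",
    "procedure_date": "2005-03-14",
    "zip_code": "00000",
    "diagnosis": "lung adenocarcinoma",
    "preservation": "frozen",
}

research_view = view_for_tier(record, "de_identified_research")
print(sorted(research_view))
```

Because the rule is expressed as an explicit list, the software reduces to a lookup and a filter. The same approach fails for terms whose meaning has not been agreed upon, since there is then no list to encode.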
In conclusion, the “take-home” point is that the technical hurdles regarding software and data standards for annotating high-quality human biobanks are not “rocket science.” While the technology may be expensive and time-consuming, it is not intractable, and many of the software building blocks have been developed over the last 10 years as part of the first dot-com boom and the current “Web 2.0.” Technology hurdles for data standards in biospecimen banking are soluble with dedicated application of time, people, and money. However, these solutions depend on the broader community coming together on policy through the judicious application of regulation, best practices, voluntary standards, public/private partnerships, and other mechanisms of cohesion.


