Biomedical Data Management Systems / Workshop at VLDB 2026

Community of biomedical informatics and data management researchers and practitioners who engage in collaborative efforts to identify emerging problem areas, develop novel solutions and help accelerate the pace of innovation in healthcare.

Program

Where: VLDB 2026 in Boston, USA.
When: September 4th (afternoon).

The workshop program will feature a selection of invited and proposed talks, and networking opportunities. Details TBD.

Contributing

As an inaugural workshop, our goal is to hear as many voices as possible, providing the attendees with a thorough overview of the current state of the field. To complement our selection of esteemed speakers, we encourage the community to submit proposals for talks that will (1) highlight key problems and/or potential solutions, and, optionally, (2) outline visions of collaborative projects. Hence, we will accept two types of submissions:

Insight Talk Proposal (1-2 pages, ~5-10 minutes): This will serve as an initiation talk for community members, allowing them to share their expertise and spark ideas for future projects. This talk should ideally focus either on presenting problems (nails) or solutions (hammers) related to biomedical data management. From biomedical researchers, we expect talks about nails, i.e., problem statements that describe a current data management or processing workflow, and an overview of some painful challenges associated with that workflow. From data management researchers, we expect hammers, i.e., expositions of a data management system or methodologies that they have expertise in and visions (not necessarily realized) of how they can be applied to specific problems in biomedical data management.
Project Talk Proposal (up to 4 pages, ~15-20 minutes): This talk should either outline a work-in-progress project or present a viable project vision. An ideal project should (a) involve folks from both communities, (b) identify an important nail matched with the appropriate hammer, and (c) meaningfully advance both biomedical research and data management research. As such, upon completion, a successfully executed project should ideally be able to result in publications at both a top-tier data management conference (e.g., VLDB, SIGMOD) and a top-tier biomedical journal (e.g., Nature, Cell).

Important dates

May 15th, 2026 (11:59 PM AoE): Submission deadline for all talk proposals. Acceptance notifications will arrive by June 15th, 2026.
July 1st, 2026 (11:59 PM AoE): Submission deadline for camera-ready papers and associated blog posts.
August 15th, 2026 (11:59 PM AoE): Deadline for talk merger requests.
August 31st or September 4th, 2026 (TBD): Day of the workshop.

Review process

All submissions will receive single-blind peer reviews from a program committee consisting of both biomedical researchers and data management researchers. Each submission will be assigned to at least two reviewers with relevant expertise. The review criteria will focus on the clarity of the problem statement, the relevance to biomedical data management, the potential for cross-disciplinary collaboration, and the feasibility of the proposed project (for Project Talk Proposals). We will also consider the novelty and potential impact of the proposed ideas. The final meta-review and acceptance decision will be made by the workshop chairs.

Authors of accepted submissions will be invited to submit a camera-ready version as well as an associated blog post. The camera-ready version will be included in the workshop proceedings and hosted on OpenReview. The blog post will be hosted on this community website and will serve as an easy-to-digest artifact.

Submission guidelines

Talk proposals should be submitted to OpenReview as a PDF formatted according to the official PVLDB volume 19 formatting guidelines. Submissions that breach the formatting guidelines will not be considered. Below, we provide a few general guidelines, followed by some guidelines specific to different submission types:

Author introductions: To properly highlight all community members, submissions should contain a final “Authors” section with a list of authors, their affiliation, a brief bio (2-3 sentences), and whether they are coming from the biomedical or the data management community.
Mind the (language) gap: Communicating across different scientific domains is no easy task. In order to maximize understanding between communities, we encourage authors to minimize the usage of domain-specific jargon in all submitted materials, or to provide gentle and clear introductions before usage.

Insight Talk Proposal

Length: Up to 2 pages (excluding references).
Novelty: This type of submission does not have to represent novel work, but it has to draw a clear connection to the topics of interest and be aligned with the overall workshop vision.

Project Talk Proposal

Length: Up to 4 pages (excluding references).
Novelty: This submission should present either novel work-in-progress projects or visions of novel future projects. Existing projects can be presented as insight talks.

Camera Ready

After a submission is accepted, the authors will be given a chance to submit a camera-ready version that integrates reviewer feedback and adds any final touches or supplementary details that the authors want to include (up to 2 additional pages).

Blog Post

Invited speakers and authors of accepted submissions are invited to contribute a blog post to the workshop website highlighting key points of their talk. This will allow workshop attendees and other interested parties to easily get acquainted with the author’s work, even if they miss their talk. Some specific guidelines:

Length: Flexible, but suggested length is 1000-2000 words.
Novelty: The text is free to be copied from the original submission or any other content that the authors produced in the past. That said, a light reflow is always beneficial to make the post accessible to a casual reader.
Formatting: The blog post should be submitted as a single .md text file written using Markdown syntax. Additional JPEG, PNG, or GIF images can be submitted alongside, as well as author info. Blog posts should be submitted as a pull request to the workshop website.

Collaborating

One of our key goals is to organize a friendly scientific forum that encourages cross-disciplinary collaborations. Apart from the in-person interactions that will take place during the workshop day, we are planning several online initiatives aimed at sparking productive joint efforts.

Discord Server

We started a Discord server that will serve as the main forum for online interactions. We encourage interested participants to join the server and:

Go to the #introductions channel and write a few sentences about their background and their motivation to join the community. It is helpful to include links to online profiles (e.g., personal homepage, Google Scholar) both in the introduction and in the “About me” section of their profile.
Ask the chairs a question or offer some feedback on the #questions-and-feedback channel.
Pitch their talk ideas on the #talk-ideas channel and ask for feedback.
Propose a collaborative project on the #project-ideas channel and invite other members to join.

Talk Mergers

Community members who have had their insight talk proposals accepted will be given a chance to join forces with authors of other accepted insight talks and request a merger into a single project talk. After the camera-ready deadline, we will release all submitted proposals, associated blog posts, and author information on the workshop website. The authors will be able to review each other’s submissions and identify complementary groups with whom they can get in touch and come up with an idea for a joint project.

Submitting a merger request: The two author groups who have agreed to merge their talks should send an email to the workshop chairs with a formal request, a statement of motivation, a title and abstract of the new talk, and a brief outline of its structure. The proposed structure can still be based on the individual insight talks, but should connect them into a cohesive, synergistic project vision. The chairs will review the request and respond with a final decision based on the main criteria for project talk proposals (i.e., relevance, collaborative potential, and feasibility).

Accepted merger requests: Authors of accepted merger requests will be given a regular project talk slot. They will also be invited to submit a blog post about their proposed project vision, which will be published on the workshop website.

Vision Paper

Assuming the workshop proves successful at achieving its main goals, the co-chairs will start a working group for producing a vision paper to be submitted to PVLDB that will outline the conclusions of the workshop and lay out a vision for a future comprehensive biomedical data management system. All speakers and authors of accepted submissions will be invited to join this working group and contribute to this paper as co-authors, leaving a lasting artifact that will hopefully inspire more future work in this space.

About

Vision

This interdisciplinary workshop will focus on data management techniques, tools, and systems with direct applications to the unique challenges in biomedical research and healthcare. Our goal is to build a lasting community centered around these topics and spark fruitful collaborations.

Biomedical research and healthcare are increasingly data-driven, yet practitioners regularly struggle with data management challenges.

We are witnessing a proliferation of data collection technologies (e.g., electronic health records, high-throughput sequencing, medical imaging), the growing adoption of computational methods for data analysis, and the recognition that data is crucial for unlocking new scientific insights and improving patient care. Furthermore, the scale of data that is being collected is growing exponentially, driven by the decreasing costs of data acquisition technologies and the increasing digitization of healthcare systems. However, the people who produce, analyze, and interpret biomedical data (e.g., clinicians, digital health specialists, bioinformaticians) often lack the expertise and tools to effectively manage and analyze these multi-modal datasets at scale. This means that scientific progress is often limited by overwhelming data management challenges.

Data management researchers are well-equipped to tackle many of these challenges, and are actively seeking new research directions.

Data management researchers have spent decades developing techniques, tools, and systems for managing large-scale data. They possess valuable expertise in areas such as data integration, data quality, data governance, and scalable analytics. At the same time, as evidenced by points raised at panels at SIGMOD and VLDB 2025, there is a growing push for data management researchers to actively pursue new, high-impact application domains whose requirements can inspire fundamentally new system designs, benchmarks, and end-to-end deployments. Biomedical data management is one such domain, presenting unique research opportunities that can directly impact and improve human lives.

Bringing these two communities together can unlock the potential for novel solutions and rapidly accelerate scientific progress.

However, this is by no means a trivial task. One noteworthy challenge is that data management researchers often lack exposure to real-world biomedical datasets (due to privacy restrictions), as well as the domain knowledge necessary to understand the specific challenges and requirements of biomedical data management. Conversely, biomedical researchers often lack awareness of the latest advances in data management research and how these techniques can be applied to their specific problems. As a result, there is a gap between the capabilities of existing data management systems and the needs of biomedical researchers and clinicians. Our goal is to bridge this gap by bringing together experts from both communities to identify pressing biomedical data management challenges and explore opportunities for collaboration that can lead to the development of novel data management solutions tailored to the biomedical domain.

Target audience

This workshop brings together two key groups, each one bringing a deep understanding of their own domain of expertise and an interest in collaborating with the other side on projects that have the potential to advance both fields:

Bioinformatics and Medical Informatics Experts: Researchers and industry practitioners who understand specific biomedical data domains (e.g., genomic data), the relevant questions that need to be asked, and the computational steps that need to take place to answer those questions.
Data Management and Data Analysis Experts: People with expertise in relevant subdomains of data management (e.g., scalable analytics, data integration, data quality, streaming data management) and large-scale data analysis, who have a desire to make an impact on the biomedical field, or even a proven track record of doing so.

Motivating scenario

To provide an illustrative example of the kinds of challenges that this workshop aims to address, consider the scenario of a molecular tumor board (MTB). It is a meeting in which experts from multiple disciplines (e.g., oncologists, molecular biologists, pathologists, surgeons, genetic counselors) discuss a complex patient case and converge on a treatment plan. Decisions must be made quickly, transparently, and with a clear trail of supporting evidence. Such evidence is usually assembled by integrating highly multimodal data: clinical data (diagnoses, medications, adverse events, patient and family history), high-throughput omics data (genome, transcriptome, proteome, genetic variations), histopathological lab results, and tissue imaging, all of which often originate from different institutions.

Given the number of patients, the uniqueness of each case, and the limited availability of the practitioners involved, the board operates under tight time constraints and high stakes. The time spent discussing a typical patient’s case is measured in minutes, while the preparation phase can take several hours of manual data assembly and analysis.

In practice, tumor board preparation often turns into an ad-hoc integration exercise across siloed systems and heterogeneous formats. This setting exposes a set of recurring data management challenges: data assembly across scales, statistical data quality assessments, integration across multiple external data sources, intuitive data access, and scalability. A well-designed biomedical data management system could substantially reduce these frictions, enabling a secure, patient-centric, multimodal view that is assembled reliably and quickly, transforming tumor board preparation from a tedious, error-prone integration task into a reproducible workflow.
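
As a purely illustrative sketch of the "patient-centric, multimodal view" mentioned above (not a system proposed by the workshop), the Python snippet below groups toy records from three hypothetical sources by a shared patient identifier and flags which modalities are missing for each patient. All record fields, identifiers, and function names are assumptions invented for this example. A real setting would additionally involve record linkage, access control, and far larger and messier data.

```python
from collections import defaultdict

# Hypothetical toy records from three siloed sources, keyed by a shared patient ID.
clinical = [
    {"patient_id": "P1", "diagnosis": "lung adenocarcinoma"},
    {"patient_id": "P2", "diagnosis": "melanoma"},
]
variants = [
    {"patient_id": "P1", "gene": "EGFR", "variant": "L858R"},
    {"patient_id": "P1", "gene": "TP53", "variant": "R273H"},
]
imaging = [
    {"patient_id": "P2", "modality": "histopathology", "uri": "slide-001"},
]


def assemble_patient_views(sources):
    """Group the records of every source under their patient ID."""
    # Each patient starts with an empty list per source, so gaps stay visible.
    views = defaultdict(lambda: {name: [] for name in sources})
    for name, records in sources.items():
        for record in records:
            payload = {k: v for k, v in record.items() if k != "patient_id"}
            views[record["patient_id"]][name].append(payload)
    return dict(views)


def missing_modalities(view):
    """Return the names of sources that hold no records for this patient."""
    return sorted(name for name, records in view.items() if not records)


sources = {"clinical": clinical, "variants": variants, "imaging": imaging}
views = assemble_patient_views(sources)
for pid in sorted(views):
    print(pid, "missing:", missing_modalities(views[pid]))
```

Even this toy version surfaces two of the challenges listed above: assembling data across sources and making basic quality gaps (here, missing modalities) explicit before the board meets.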

Topics of interest

This workshop will feature presentations and discussions related to the following topics:

Domain-specific Data Models and Interoperability
Challenge: Biomedical data spans many modalities with some existing standards, but interoperability remains bespoke and fragile.
Related DM topics: data modeling, data integration, schema evolution
Details: Interoperability between different standards across semantic and temporal scales, e.g., from ICU surveillance to patient histories and from single molecules to public health. This encompasses highly heterogeneous data types, including raw genomic sequences, time series, omics measurements, high-content imaging, clinical records, biomedical knowledge graphs, and scientific publications.
Reproducible Data Preparation and Quality Control
Challenge: Biomedical analyses depend on complex, iterative data preparation and quality control steps that are rarely reproducible, auditable, or reusable.
Related DM topics: data lineage, data cleaning, uncertain data models
Details: Methods to ensure reproducibility and quality control of biomedical data processing and analysis despite regional, temporal, and population-based differences.
Biomedical Analytics Pipelines and Platforms
Challenge: End-to-end biomedical analytics pipelines are difficult to scale, evolve, and reproduce using existing ad hoc workflow and data platform solutions.
Related DM topics: multi-modal data analytics, materialized views, data indexing
Details: Optimization, scalability, and reusability of end-to-end analysis pipelines over large, heterogeneous, streamed, and distributed biomedical data sets, possibly facing a variety of privacy regulations. Knowledge graph construction and management for biomedical applications, including drug discovery and repurposing, omics data analysis, and clinical decision-making.
Privacy, Governance, and Compliance
Challenge: Biomedical data must be analyzed under stringent privacy, governance, and regulatory constraints that complicate access and processing.
Related DM topics: access control, policy-aware query execution, federated workload management
Details: Anonymization algorithms and procedures to ensure governance and regulation compliance for patient-related data. Systems for federated learning and privacy-preserving machine learning.
Human-centric and Organizational Challenges
Challenge: Deployments often fail because of mismatches between system design, user workflows, institutional incentives, and long-term maintenance realities.
Related DM topics: human-in-the-loop data management, incentive-aware system design, data visualization
Details: Intuitive and powerful user interfaces and human-in-the-loop systems for biomedical practitioners.

Specific goals

Introduction: Exposure of data management experts to biomedical research questions, datasets, analytics pipelines, and domain-specific challenges. Conversely, exposure of bioinformatics experts to recent and advanced data management, data analysis, and data engineering techniques that can help overcome their challenges.
Scoping: Identification of the most pressing challenges related to biomedical data management and analysis, i.e., those with the greatest potential for fruitful synergistic efforts.
Collaboration: Establishing lasting connections, determining modalities of collaboration, funding strategies, and publication venues.
Contribution: Release of datasets, benchmarks, research project proposals, and a vision paper laying a foundation for future work in this space.

Organizers

Bojan Karlaš
Postdoc / Harvard University (affiliated with HMS, MGB, DFCI, and the Broad)
Works on developing interpretable deep learning pipelines for extracting clinically meaningful insights from pathology images. He obtained his PhD at ETH Zurich, working on data management systems for ML with a particular focus on data debugging.

Gerardo Vitagliano
Postdoc / Data Systems Group / MIT CSAIL
Builds interactive and user-friendly data systems, allowing domain experts to analyze large-scale multimodal datasets. His research involves active collaborations with clinicians and biomedical researchers to impact real-world healthcare.

Benjamin M. Gyori
Associate professor / Northeastern University
Works on large-scale data integration and knowledge assembly in biomedicine. His research combines computational systems modeling, ML, NLP, and human–machine interaction to improve our understanding of complex human biology.

Ulf Leser
Full professor / Humboldt-Universität zu Berlin
Developed new tools for management, integration, and analysis of biomedical data. Interested in biomedical data management, text mining, infrastructures for large-scale scientific data analysis, and statistical bioinformatics, with a focus on cancer research.