Presentation

Overview

DKM is a department of the IRISA laboratory, hosting researchers from Univ Rennes, Inria, CNRS, and INSA, located in Rennes and Lannion. To foster interactions with other laboratories in Rennes, the department also hosts, as associated members, researchers and engineers from INRAE, INSERM, Institut Agro Rennes Angers, and ENSAI.

Scientific focus: Enrich data management methods with knowledge-based and reasoning-based approaches.

  • Originality: rely on a priori knowledge of the expert's/user's domain
    • Structure and homogenize the data, then facilitate their exploration,
    • Semantic Web technologies and dedicated data structures.
  • Goal: provide reliability indicators to the users
    • Elucidate how and why data have been analyzed,
    • Explainability (automated reasoning, formal concept analysis, trust functions, pattern mining, etc.).
  • Four levels of data-management techniques
    • Structuring: Structuring and analyzing data according to domain knowledge.
    • Managing: Enhancing the performance and expressiveness of data management and representation frameworks.
    • Exploring: Interactively guiding the data exploration process.
    • Analyzing: Extracting complex features and knowledge from generic data.
  • Applications
    • Bioinformatics
    • Environment
    • Health

Research Teams

The DKM department includes the following teams.

DRUID: Declarative & Reliable management of Uncertain, user-generated Interlinked Data

Team leader: Zoltan MIKLOS (Associate Professor, Univ Rennes)

Website URL: https://www-druid.irisa.fr

Description:

Recently, there has been a growing interest in data management methods. Statistical machine learning techniques, fueled by the availability of pay-as-you-go distributed computing power, can now extract useful information from data. Across industries and scientific fields, organizations strive to build massive datasets to capitalize on the benefits of the AI revolution.

The DRUID team focuses on addressing questions surrounding data management techniques, specifically to store, query, and organize data efficiently. Additionally, the team investigates artificial intelligence techniques aimed at extracting knowledge and generating insights from data, particularly in situations involving uncertainty. The ultimate goal is to ensure that this knowledge is actionable, enabling the provision of services such as recommendations.

Unfortunately, data management and machine learning are often viewed as separate tasks. Current data management systems lack elegant support for machine learning primitives. For instance, machine learning operators are still treated as external procedures outside the query language, and query optimizers scarcely consider them. Furthermore, knowledge extraction tasks are difficult to design without comprehensively understanding the available data. This suggests that knowledge extraction should be an interactive process where users actively guide and influence the analysis.
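
To make the point concrete, here is a minimal, hypothetical sketch (not a DRUID system): a toy scoring model is registered as a SQLite user-defined function, so it can be called from SQL, but only as an opaque external procedure that the query optimizer knows nothing about. The table, the score_user function, and its coefficients are invented for illustration.

    import sqlite3

    # Toy "model": a hand-written linear scorer standing in for a trained ML model.
    # The coefficients are made up for this example.
    def score_user(age, activity):
        return 0.02 * age + 0.5 * activity

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, age INTEGER, activity REAL)")
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)",
                     [("ada", 36, 0.9), ("bob", 52, 0.2), ("eve", 24, 0.7)])

    # The ML operator enters the query only as an external UDF: the engine treats
    # score_user as a black box and cannot reason about or optimize around it.
    conn.create_function("score_user", 2, score_user)

    for row in conn.execute("SELECT name, score_user(age, activity) AS score "
                            "FROM users ORDER BY score DESC"):
        print(row)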

With these challenges in mind, the DRUID team has defined the following objectives:

  • Propose new query mechanisms, particularly for network-oriented data, and better integrate machine learning methods with database logic and engines.
  • Propose interactive, human-in-the-loop data analysis and knowledge extraction methods, even in the presence of uncertain data.

Last activity report: 2023 – PDF

DyLiSS: Dynamics, Logics and Inference for biological Systems and Sequences

Team leader: Olivier DAMERON (Professor, Univ Rennes)

Website URL: http://www.irisa.fr/dyliss

Description:

  • Bioinformatics context: from life-science data to functional information about biological systems and unconventional species.

    Sequence analysis and systems biology both focus on interpreting biological information at the molecular level, primarily concerning intracellular compounds. The analysis of genome-level information is a major issue in sequence analysis; the ultimate goal is to create a comprehensive catalog of bio-products and their functions, while also providing efficient methods to characterize such bio-products in genomic sequences. In parallel, systems biology seeks to analyze contextual physiological information, that is, all observable cell events when a perturbation affects a living system.

    Historically, computational methods developed for sequence analysis and dynamical modeling exhibited limited interaction. However, with the rise and democratization of new sequencing technologies (NGS, metagenomics), there is now greater capacity to link systems and genomic sequences. In this area, the DyLiSS team focuses on connecting genomic sequence analysis with systems biology. The main biological goal is to characterize groups of genetic actors that control phenotypic responses of species under environmental stress. Computational goals focus on developing methods to analyze the dynamic responses of biological systems, model and classify families of gene products using sensitive and expressive languages, and identify the main actors of biological systems within static interaction maps.

    This research formalizes and integrates a set of logical or grammatical constraints, including both generic knowledge (literature-based regulatory pathways, molecular function diversity, DNA patterns associated with molecular mechanisms) and species-specific information (physiological response to perturbations, sequencing, etc.). It relies on symbolic methods (e.g., Semantic Web technologies for data integration and querying, reasoning with bio-ontologies, solving combinatorial optimization problems, and formal classification) to compute the main features of the space of admissible models.

  • Computational challenges.

    The primary computational challenges stem from data incompleteness and heterogeneity, leading to non-identifiability. Biological systems, as currently understood, cannot be uniquely identified. While “omics” technologies have dramatically increased the number of measurable compounds in systems, the theoretical number of different measurements required to integrate these compounds into a single discriminative model has increased exponentially. Therefore, with the current state of knowledge, it is not possible to explain the data with a single model, and the working assumption is that biological systems will remain non-identifiable for a long time.

    To address this, the research team advocates for the construction and study of a space of feasible models or hypotheses. This space integrates known constraints and facts related to a living system, rather than focusing on a single discriminative optimized model. The development of methods enabling precise and exhaustive investigation of this hypothesis space is critical. This approach positions the team to develop experimental strategies that progressively shrink the hypothesis space and increase the overall understanding of the system (a minimal illustration of such a model space is sketched after the list below).

  • Bioinformatics challenges.

    The research objectives in computer science align with three main bioinformatics challenges:

    1. Data science and knowledge science for life sciences.
    2. Understanding metabolism.
    3. Characterizing regulatory and signaling phenotypes.
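
As a minimal illustration of the "space of admissible models" mentioned in the computational challenges above (a toy sketch under invented assumptions, not a DyLiSS tool), the following enumerates every signed influence assignment on a three-gene network that is consistent with one hypothetical observation, rather than fitting a single optimized model. Genes, edges, and the observation are all illustrative.

    from itertools import product

    # Candidate regulatory edges and their possible signs:
    # +1 = activation, -1 = inhibition, 0 = edge absent. All names are illustrative.
    edges = [("A", "B"), ("B", "C"), ("A", "C")]
    signs = (+1, -1, 0)

    def consistent(model):
        """Toy constraint: a (hypothetical) experiment shows that knocking out A
        decreases C, so A must exert a net positive influence on C, either
        directly or through B."""
        ab, bc, ac = model
        return ac == +1 or ab * bc == +1

    # The space of admissible models: every assignment compatible with the
    # observation, instead of a single optimized model.
    admissible = [dict(zip(edges, m))
                  for m in product(signs, repeat=len(edges)) if consistent(m)]

    print(f"{len(admissible)} admissible models out of {len(signs) ** len(edges)}")
    for model in admissible[:3]:
        print(model)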

Last activity report: 2023 – PDF

GenScale: Scalable, Optimized and Parallel Algorithms for Genomics

Team leader: Pierre PETERLONGO (Research Director, Inria)

Website URL: https://team.inria.fr/genscale

Description:

The primary goal of the GenScale project is to develop scalable methods and software programs for processing genomic data. The research is motivated by the rapid advancement of sequencing technologies, particularly next-generation sequencing (NGS) and third-generation sequencing (TGS). NGS provides up to billions of very short (a few hundred base pairs, bp), high-quality DNA fragments called short reads, while TGS offers millions of long (thousands to millions of bp) DNA fragments of lower quality, referred to as long reads. Synthetic long-read, or linked-read, technologies combine high-quality, low-cost short-read sequencing with long-range information by adding barcodes that trace reads originating from the same long DNA fragment. These sequencing data present significant challenges for bioinformatics and computer science, as recent sequencing machines generate terabytes of DNA sequences that require time-consuming processing to extract useful and relevant information.

A wide range of biological questions can be investigated using genomic data. A complete project may include DNA extraction from one or more organisms, sequencing using high-throughput machines, and designing methods and bioinformatics pipelines to address the initial question. Such pipelines involve preprocessing steps (quality control and data cleaning) and core functions, where GenScale's expertise lies, that transform data into genomic objects (e.g., genome assembly, variant discovery such as SNPs and structural variations, sequence annotation, sequence comparison). Further integration steps are often required to interpret and gain insights from data by incorporating other semantic information sources.

The challenge for GenScale is to develop scaling algorithms capable of managing the daily volume of sequenced DNA data, which tends to overwhelm bioinformatics computing centers. To meet this challenge, the strategy focuses on both space and time scalability. Space scalability relates to designing optimized, low-memory data structures that can capture essential information from sequencing datasets, compressing terabytes of raw data into a form that can fit within computer memory. Time scalability ensures that algorithms operate efficiently, with their execution time being proportional to the size of the problem or, at the very least, within a reasonable duration. In this regard, parallelism is a complementary technique to enhance scalability.
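
As an illustration of what such a low-memory structure can look like (a generic sketch under simplifying assumptions, not one of GenScale's actual tools), a Bloom filter records the k-mers of a read set in a fixed-size bit array and answers membership queries with a controlled false-positive rate:

    import hashlib

    class KmerBloomFilter:
        """Fixed-size probabilistic membership structure for k-mers.
        Sizes and hash count are arbitrary choices for this example."""

        def __init__(self, n_bits=1 << 20, n_hashes=3):
            self.n_bits = n_bits
            self.n_hashes = n_hashes
            self.bits = bytearray(n_bits // 8)

        def _positions(self, kmer):
            for i in range(self.n_hashes):
                h = hashlib.blake2b(kmer.encode(), salt=bytes([i])).digest()
                yield int.from_bytes(h[:8], "little") % self.n_bits

        def add(self, kmer):
            for p in self._positions(kmer):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, kmer):
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(kmer))

    # Index every k-mer of a (toy) read set, then query membership.
    reads, k = ["ACGTACGTGA", "TTGACGTACC"], 5
    bf = KmerBloomFilter()
    for read in reads:
        for i in range(len(read) - k + 1):
            bf.add(read[i:i + k])

    print("ACGTA" in bf)  # True: present in the first read
    print("GGGGG" in bf)  # almost surely False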

Another key objective of GenScale is fostering and maintaining partnerships with life science research groups. Collaborations with genomics research teams are crucial for validating tools and staying updated in this fast-evolving field. The approach is to actively contribute to solving biological problems with partners and engage in a few select, challenging genomic projects.

GenScale’s research is structured into four main axes:

  • Axis 1: Data structures & indexing algorithms
  • Axis 2: Sequence analysis algorithms
  • Axis 3: Parallelism
  • Axis 4: Applications

Last activity report: 2023 – PDF

LACODAM: Large scale Collaborative Data Mining

Team leader: Alexandre TERMIER (Professor, Univ Rennes)

Website URL: https://team.inria.fr/lacodam

Description:

Data collection is ubiquitous today, providing society with tremendous volumes of data about human, environmental, and industrial activities. This ever-growing stream of data holds the key to new discoveries in both industrial and scientific fields. However, these keys are only accessible to those who can make sense of the data, which is a challenging task. It requires a solid understanding of the data at hand, proficiency with available analysis tools and methods, and strong deductive skills. These abilities have been grouped under the broad term “Data Science,” and universities have invested considerable effort in training professionals in this domain. The role of “Data Scientist” has become highly sought after, with demand far outstripping the supply of qualified professionals.

Despite the data boom, data science remains largely a “manual” process, with current analysis tools requiring significant human effort and expertise. This makes data analysis a time-consuming and error-prone process, even for experienced data scientists. Additionally, most current approaches are inaccessible to non-specialists.

The objective of the LACODAM team is to facilitate the process of making sense of large amounts of data, enabling the derivation of knowledge and insights that support better decision-making. The team's approaches focus on providing innovative tools for data scientists that either tackle tasks not addressed by other tools or improve on existing tasks, for example by reducing execution time, improving accuracy, or better handling imbalanced data.

Last activity report: 2023 – PDF

MALT: Machine Learning with Temporal Constraints

Team leader: Elisa FROMONT (Professor, Univ Rennes)

Website URL: https://team.inria.fr/malt/ 

Description:

MALT is a joint project-team between IRISA (via Université de Rennes and Université Rennes 2) and Inria (Rennes). Its work spans three key areas:

  • Machine learning models on temporal data (sequences and time series), with questions around embeddings, generation of such data, and early predictions.
  • Adapting machine learning models to evolving data and class distributions over time, with questions around domain adaptation, continual learning, and sequential learning.
  • Trustworthiness of temporal models, with questions around explanations, privacy, fairness, robustness, and efficiency of such models.

SHAMAN: A Symbolic and Human-centric view of dAta MANagement

Team leader: François GOASDOUÉ (Professor, Univ Rennes)

Website URL: https://www-shaman.irisa.fr

Description:

The overall goal of Shaman is to improve data management methods currently used in commercial systems, which suffer from a significant lack of flexibility. Specifically, with the current techniques available, it is difficult for users to: i) understand the data they have access to, and ii) specify their information needs in an intuitive yet sufficiently expressive manner. Additionally, these systems have limited capabilities when handling imperfect data, particularly when data comes from multiple sources. Shaman seeks to address these shortcomings by developing new tools designed to assist end users and database designers in the following ways:

  • Model and integrate the data—potentially heterogeneous and/or imperfect—that are relevant in a given application context.
  • Understand the data (structure and semantics) in a way that is accessible to them.
  • Query and analyze the data, considering user preferences, using as cooperative a mechanism as possible.

Shaman favors symbolic approaches for better intelligibility and ease of use, aiming to define human-centric data management methods. Fuzzy set theory (and possibility theory) offers a natural and intuitive symbolic/numerical interface, bridging the symbolic aspects of linguistic variables and the numerical nature of the corresponding functions valued in the unit interval. This theory can be applied to model preference queries, data summaries, and cooperative answering strategies, while also defining new data models based on clusters instead of tables. Furthermore, possibility theory supports the modeling of uncertain databases where uncertainty is qualitative rather than frequentist.
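
For instance, a preference query such as "find cheap hotels" can be modeled with a membership function valued in the unit interval, and answers can then be ranked by degree of satisfaction instead of being filtered by a crisp cut-off. The sketch below is a minimal, hypothetical illustration (thresholds and data are invented), not Shaman's actual query operators:

    def cheap(price, full_until=60, zero_from=120):
        """Membership degree of the linguistic term 'cheap' for a price in euros.
        The thresholds are invented; in practice they come from the user or domain."""
        if price <= full_until:
            return 1.0
        if price >= zero_from:
            return 0.0
        return (zero_from - price) / (zero_from - full_until)

    hotels = [("Hotel A", 55), ("Hotel B", 80), ("Hotel C", 110), ("Hotel D", 140)]

    # Fuzzy preference query: rank answers by satisfaction degree instead of
    # returning a crisp (and possibly empty) result set.
    ranked = sorted(((name, price, cheap(price)) for name, price in hotels),
                    key=lambda t: t[2], reverse=True)

    for name, price, degree in ranked:
        print(f"{name}: {price} EUR -> satisfaction {degree:.2f}")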

Another key focus of Shaman is ontology-based data management. Ontologies: i) help make data more intelligible to users and mediate between sources with different schemas; ii) enhance data management systems with reasoning capabilities, allowing for more “intelligent” data handling.
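
The toy sketch below (class names and facts are invented) illustrates the reasoning side: a small subclass hierarchy lets a query about the general class Publication also retrieve items that individual sources typed only with more specific classes:

    # Toy ontology: subclass axioms, plus facts from two sources that use
    # different, more specific classes. All names are invented.
    subclass_of = {"Article": "Publication", "Thesis": "Publication",
                   "JournalArticle": "Article"}
    facts = [("doc1", "JournalArticle"), ("doc2", "Thesis"), ("img1", "Photo")]

    def is_a(cls, target):
        """Follow subclass links upward (simple RDFS-style subsumption)."""
        while cls is not None:
            if cls == target:
                return True
            cls = subclass_of.get(cls)
        return False

    # Query: all instances of Publication, including those typed only with
    # a subclass by their source.
    print([inst for inst, cls in facts if is_a(cls, "Publication")])  # ['doc1', 'doc2']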

A strength of Shaman is its positioning at the intersection of the databases and artificial intelligence domains. Historically, these two research communities have remained separate. However, Shaman believes data management will benefit from cross-fertilization between database technologies and AI approaches. The team has long focused on this challenge, leveraging theoretical tools such as fuzzy logic to make database querying more flexible. This approach aligns with changes in the data management landscape, particularly with the rise of the internet and the need to manage open and linked data using methods that involve reasoning capabilities, such as those of the Semantic Web.

Last activity report: 2023 – PDF

Platform

Organization Team

  • Romain Tavenard (Head)
  • Marie Le Roïc
  • Johan Le Boursicaud
  • Dorra Sassi
  • Adel Aly
  • Nicolas Maurice
  • Gwladys Kelodjou
  • Cécile Beust

Contact

Email: dept-dkm@irisa.fr
