Seminar – DKM department – IRISA

Current Organizers

Romain Tavenard (Head)
Marie Le Roïc (AER)
Johan Le Boursicaud (Druid)
Dorra Sassi (Shaman)
Adel Aly (Shaman)
Nicolas Maurice (Genscale)
Gwladys Kelodjou (Lacodam)
Cécile Beust (Dyliss)

April, 2025

Speaker: Anne-Cécile Orgerie (CNRS Research Scientist)
Title: Energy consumption and environmental impacts of distributed systems
Abstract: Distributed systems, such as Clouds or data centers, are increasingly spanning worldwide, with digital services hosted all around the globe and often belonging to complex systems, utilizing many other services and hardware resources themselves. Along with this increase comes an alarming growth of Cloud devices and their related energy consumption. Despite the distributed systems’ complexity, understanding how they consume energy is important in order to hunt wasted Joules and reduce their environmental impact. This talk will deal with measuring the energy consumption of distributed systems and deriving models from these measurements to evaluate their energy consumption and their environmental impact.

February 6, 2025

Speaker: Pavel Vesely
Title: Theory meets practice at the median: a highly accurate quantile summary with error guarantees
Abstract: Data summaries provide a scalable way to extract key features from large datasets, enabling real-time analysis at streaming speed or massively parallel processing of distributed datasets, in applications such as network monitoring, machine learning, privacy, or bioinformatics. In this talk, we focus on summaries for characterizing the data distribution by approximating ranks and quantiles. We highlight the theoretical advancements that have led to optimal or near-optimal algorithms, such as the KLL sketch. However, practitioners often use other summaries that perform significantly better in practice despite lacking error guarantees. Most notably, t-digest has been widely adopted by major tech companies and open source projects, although its accuracy can degrade arbitrarily on adversarial data. Drawing inspiration from both theoretical and practical approaches, we present a new quantile summary, called SplineSketch, that achieves the best of both worlds: It provides near-optimal error guarantees in both streaming and distributed settings, while significantly outperforming all other quantile sketches on synthetic and real-world datasets.

November 28, 2024

Speaker: Laurent Miclet (Professor Emeritus University of Rennes)
Title: Computing in very high dimensions
Abstract: see the presentation details here

September 26, 2024

Speaker: Angela Bonifati (Professor of Computer Science at Lyon 1 University)
Title: Constraints and Transformations for Property Graphs
Abstract: Property graphs are widely known as expressive data models for representing interconnected multi-labeled data enhanced with properties as key/value pairs. They are used in a wide range of domains, such as social and transportation networks, biological networks, finance, cybersecurity, logistics and planning, to name a few. Property graphs are the building blocks of future graph ecosystems, in which OLTP and OLAP processes are intertwined with complex advanced processes, such as learning, scientific computing and business intelligence. While property graphs are currently used in a variety of graph databases, a rather fragmented landscape emerges in terms of the supported query and schema languages. In particular, the coverage of schema and constraints is limited if not completely lacking in these systems. In this talk, I will present recent advances in terms of schemas and constraints for property graphs, as part of our work within the LDBC community. I will also focus on property graph transformations leveraging graph pattern calculus abstractions and leading to a declarative framework having practical utility and efficiency. Finally, I will pinpoint future directions of research in this new exciting area of data management.

July 11, 2024

Speaker: Nathan Fradet (postdoctoral researcher at the Metacreation Lab)
Title: miditok: A Python package for MIDI file tokenization
Abstract: Recent progress in natural language processing has been adapted to the symbolic music modality. Language models, such as Transformers, have been used with symbolic music for a variety of tasks, including music generation, modeling and transcription, with state-of-the-art performance. These models are beginning to be used in production products. To encode and decode music for the backbone model, they need to rely on tokenizers, whose role is to serialize music into sequences of distinct elements called tokens. MidiTok is an open-source library for tokenizing symbolic music with great flexibility and extended features. It features the most popular music tokenizations under a unified API. It is designed to be easy to use and extensible for everyone.
Google Scholar, LinkedIn, https://www.lip6.fr/actualite/personnes-fiche.php?ident=D2488

April 9, 2024

Speaker: Nicolas Courty (Professor, Université Bretagne Sud)
Title: Tutorial on Optimal Transport (Part 2 & Part 1)
Abstract: Optimal Transport theory has been successfully applied to a variety of tasks where comparing or matching data distributions is of interest. Successful applications include domain adaptation, learning over graphs and generative modeling. In this talk I will give a gentle introduction to the underlying concepts and computational strategies, which I will illustrate with related applications in machine learning problems.

March 21, 2024

Speaker: Eric Tannier (Research Director, Inria Lyon)
Title: Rebound effects in the environmental footprint of Bioinformatics.

January 25, 2024

Speaker: Lê Nguyên Hoang (founder of the YouTube channel Science4All)
Title: The Science of AI Governance
Abstract: The research, development, deployment, and regulation of AI have become critical societal challenges, especially in the era of ChatGPT and TikTok, and within a context of global geopolitical tensions. In this presentation, I will discuss the field of critical digital sciences, which aims to equip democracies with the necessary tools to govern AI effectively. Specifically, I will focus on two key aspects: first, the implications of these sciences for ensuring that products like ChatGPT comply with existing laws; second, I will introduce Tournesol, a collaborative and secure platform for governing recommendation AIs, which integrates cutting-edge research in the field. I will conclude by highlighting several fascinating new research directions prompted by Tournesol, such as the integration of secure liquid democracy. Tournesol is accessible at https://tournesol.app.

April 6, 2023

Speaker: Colin de la Higuera (Univ Nantes, LS2N)

February 2, 2023

Speaker: Meghyn Bienvenu (CNRS, Labri)

December 1st, 2022

Speaker: Damien Eveillard (Univ Nantes)

November 17, 2022

Speaker: Sihem Amer-Yahia (CNRS, Univ. Grenoble Alpes)
Title: Fairness on Online Labor Markets
Abstract: Online labor markets are increasingly becoming a destination for work. These marketplaces include freelancing platforms such as Qapa and MisterTemp’ in France, and TaskRabbit and Fiverr in the USA. On those platforms, workers can find temporary jobs in the physical world such as moving furniture, or in the form of virtual micro-gigs such as helping with designing a website. I will present the results of a study of fairness on those platforms, and discuss the design of a model to enforce fairness in the Future of Work.
Bio: Sihem Amer-Yahia is a Silver Medal CNRS Research Director and Deputy Director of the Lab of Informatics of Grenoble. She works on exploratory data analysis and fairness in job marketplaces. Before joining CNRS, she was Principal Scientist at QCRI, Senior Scientist at Yahoo! Research and Member of Technical Staff at AT&T Labs. Sihem is PC chair for SIGMOD 2023 and vice president of the VLDB Endowment. She currently leads the Diversity & Inclusion initiative for the database community.

September 08, 2022

Speaker: Rémy Eyraud (LabHC CNRS, University Jean Monnet, Saint-Etienne)
Title: Hankel Matrix & Weighted Automata: From Spectral Learning to Spectral Distillation
Abstract: This talk focuses on the Hankel matrix and Weighted Automata. After introducing these two notions and the main algorithm that links them, I will detail the spectral learning algorithm, a machine learning approach that comes with theoretical guarantees and interesting practical results. Extensions will be discussed, with a particular focus on the extraction of weighted automata from any black box trained on (categorical) sequential data. If time allows, a use case of an easy-to-use toolbox will be shown.

June 09, 2022

Speaker: Marie-Christine Rousset (SLIDE, LIG, Grenoble)
Title: Some ongoing work on building interpretable explanations for AI algorithms on tabular or graph data
Abstract: AI systems use sophisticated algorithms that apply to personal data for developing more and more decision-making applications that directly impact humans. Both for social acceptability and for ethical purposes, it is of utmost importance to make the decisions of AI systems interpretable by humans and also to provide guarantees of privacy protection. In this talk, I will present some ongoing work that we are conducting in the Grenoble MIAI chair “Explainable and Responsible AI” for building interpretable explanations of AI algorithms. In the first part, I will summarize experimental results that we have obtained on building local and global explanations for predictions of microcredit default learned by black-box models from a tabular dataset. In the second part, I will present our ongoing work on explaining privacy risks detected by a graph-based reasoning algorithm used to check incompatibility between privacy and utility policies expressed as queries. In this setting, queries are interpreted as logical formulas over a common schema and the explanation is based on the construction of a small synthetic graph dataset illustrating a possible entailment between graph patterns.

April 20 – 22, 2022: IDA

Speakers
- Dominique Lavenier (CNRS, Fr) data storage on DNA
- Cynthia Liem (TU Delft, NL) validation and validity in data processing pipelines
- Michèle Sebag (CNRS, Fr) causal explanations
- Julia Stoyanovich (NYU Center for Data Science, USA) building data equity systems

March 24, 2022

Speaker: Charlotte Truchet (TASC, University of Nantes)
Title: Randomization of solutions in constraint programming
Abstract: Constraint programming provides generic techniques to efficiently solve combinatorial problems. In this talk, I will present ongoing work on constraint sampling. The question is: is it possible to sample combinatorial problems in a generic way, using a constraint solver, and without making the computation time explode? I will present an algorithm, inspired by Meel’s method on SAT, that adds randomly chosen hashing constraints to the problem in order to split the search space into small cells of solutions. By sampling the solutions in a cell, it randomly generates solutions without revamping the model of the problem. We ensure the randomness by introducing a new family of hashing constraints: randomly generated tables. We implemented this solving method using the constraint solver Choco-solver. The quality of the randomness and the running time of our approach are experimentally compared to a random branching strategy. We show that our approach improves the randomness while being in the same order of magnitude in terms of running time.

January 29, 2022

Speaker: Farouk Toumani (LIMOS, Clermont Ferrand)
Title: Reasoning in description logics with variables: beyond matching and unification
Abstract: Description logics (DLs) are a family of logic-based knowledge representation languages used to represent knowledge of an application domain in a structured and well-understood way while providing inference procedures for reasoning about the represented knowledge. DLs form a very active research area, spanning the last 40 years, which has led to a rich variety of description formalisms with differing expressive power, employed in various application domains (e.g., information systems and databases, natural language processing, semantic web, etc.). Standard inference problems (e.g., subsumption, satisfiability and the instance problem) as well as their computational properties (decidability, connection between complexity and expressivity) are now well understood. After a brief introduction to nonstandard inferences in description logics, this talk will be devoted to presenting description logics augmented with (concept or role) variables. Concepts with variables (also called patterns) were introduced in description logics in the mid-nineties and led to a highly interesting research stream on so-called nonstandard reasoning, specifically matching and unification. The talk will discuss new semantics for variables and highlight new inference problems which can be viewed as generalizations of matching and unification.

December 16, 2021

Speaker: Franck Varennes (ERIAC, Rouen)
Title: Machine learning and explainability: a view from the comparative epistemology of models. https://perso.limos.fr/~fatouman/
Abstract: Some so-called machine learning techniques are increasingly used to design mathematical and computational models intended either for prediction or for decision-making. In contexts such as medical diagnosis or legal decisions in particular, these models (or the associated algorithms, or their results) are required to be, as the phrase goes, explainable. Meanwhile, in the subfield of philosophy of science called “epistemology of models”, it has become customary to distinguish the models used in the sciences by their different epistemic functions, according to whether they describe, predict, help to understand, explain, etc. In this talk, which will present part of the work I have begun in collaboration with Christophe Denis (LIP6, ACASA team), I will show that it is possible and useful to apply some analyses from this classificatory epistemology of models to the particular case of machine learning models. By revisiting what should be understood by this demand for explanation, the difference between explaining the model and explaining by means of the model, and finally the relationship between explanation and causality, I will try to shed some new light on a controversial topic. The topic is controversial because it brings mathematical and computational thinking together not only with AI, but also with the epistemology of models and the epistemology of applied mathematics.

November 18, 2021

Speaker: William Ritchie (IGH, Montpellier)
Title: Sequencing in Health, a decade of disappointment. Possible AI solutions. https://www.igh.cnrs.fr/en/research/departments/genome-dynamics/intelligence-artificielle
Abstract: The recent increase in the amount of sequencing data has produced a remarkably low number of clinically actionable discoveries. The main issues in translating these data into outcomes come from the fact that many diseases are multifactorial requiring complex models and large sample sizes to understand them. However, sequencing data are large, sensitive and heterogeneous. In this talk, I will propose some solutions to address multiple issues pertaining to the use and interpretation of health-related sequencing data. Specifically, how can we explore sequencing without a reference genome? How can we realistically preserve patient identity? My talk will also introduce some basic deep learning architectures for an audience without a background in machine learning.

November 12, 2021

Speaker: Thomas Schiex (INRAE, Toulouse)
Title: AI for Computational Protein Design: bridging symbolic and numerical AI https://www.igh.cnrs.fr/en/research/departments/genome-dynamics/intelligence-artificielle
Abstract: As a person working on automated reasoning, logic and constraints in Artificial Intelligence but also as a computational biologist at INRAE, I quickly realized that logic alone was not ideal to model and solve the computational problems I faced as an INRAE researcher. For this reason, our team quickly worked on the AI side to extend the “propositional automated reasoning” techniques that reason on absolutely certain statements, to more flexible formulations using “weighted” information. This led to the birth of several dedicated fundamental algorithms and their implementation in our dedicated solver “toulbar2”. Toulbar2 is now one of the most efficient solvers in its area, and is capable of mixing logical and numerically weighted information rigorously, solving complex puzzles that combine pure logical knowledge with more vague information, including probabilistic statements. I will show how, in the last 8 years, we have put toulbar2 to the task of designing new proteins targeting a predefined biochemical/biological function, or more precisely a predefined 3-dimensional structure, standing on the shoulders of giants that designed biophysical force-fields and rotamer libraries, capturing slowly acquired and permanently improved bio-physical and statistical knowledge on proteins. This protein design problem, using a rigid backbone target, a catalog of discrete conformations for amino acid side-chains and a pairwise decomposable force-field such as AMBER, CHARMM or Rosetta score functions, is known to be NP-hard. For a reason we do not really understand, toulbar2 shines on these problems. Contrary to the usual Monte Carlo based methods, it is able to find and prove that the solution it has found is optimal for the force-field used on problems of non-trivial sizes. It also makes it possible to rigorously satisfy non-trivial designer constraints.
It outperforms other guaranteed optimization tools we have tried and was even able to show the limits of an optimized Monte-Carlo method in Rosetta (IPD, Univ. Washington). Recent comparisons of toulbar2 with D-Wave quantum annealing hardware (by Rosetta team members) also show its good relative performance. Thanks to this, we have shown it is also capable of dealing with backbone flexibility, at least when the aim is to design a protein sequence that should fit several “states”. We have put all these techniques to work in practice with structural biologist colleagues, designing new self-assembling proteins, antibodies or enzymes. As I will show, for a few years now, we have upgraded this pure force-field and design target constraints based approach with Machine Learned information extracted from multiple sequence alignments, which allows refining the force-field for a given suitable design structure, bridging the gap between data, machine learning information, thermodynamic and logic information, which perfectly fits the usual “designer” situation. Going back to AI, we have shown that this approach is also able to learn how to play Sudoku, without knowing the rules, just from images of solved grids, better than “Deep Neural Net”-friendly approaches, while providing understandable and customizable learned rules.
Slides: view slides

October 22, 2021

Speaker: François Charoy (LORIA, Nancy)
Title: Trust and large-scale information sharing.
Abstract: Large-scale cooperation between organizations and people is now facilitated by tools and platforms that enable the sharing and production of information. Since the early work of Ellis in the 1980s, various techniques have been developed to enable this sharing at Web scale. However, the existence of such a capability is not sufficient to ensure that it is used in a way consistent with the constraints of the domains concerned. Through a presentation of various works carried out in the Coast team at LORIA, we will show the factors that can affect trust between partners when it comes to sharing information. The first factor is the human factor, which we will introduce through a study conducted in the field of civil protection. The second factor concerns the safety of sharing, and in particular the platforms and algorithms that ensure safe and efficient large-scale peer-to-peer data sharing. The last factor concerns the security of sharing: we will explain the risks of attack against a peer-to-peer sharing platform and how to protect against them.

October 14, 2021

Speaker: Blaise Hanczar (IBISC, Université Paris-Saclay)
Title: Deep learning for phenotype prediction from transcriptomic data.
Abstract: Transcriptomic data represent an estimate of the quantity of RNA “produced” by each gene; they can be seen as the activity of the genes in a sample. The analysis of these data plays a major role in understanding molecular biology and in developing personalized medicine. Indeed, it is theoretically possible to predict many phenotypes from a patient’s expression profile. While many classical machine learning methods have already been applied to these data, there is still little work on the use of deep learning. We are interested in deep learning for phenotype prediction from transcriptomic data, and in particular in two problems. The first is the small size of transcriptomic datasets, whereas training deep models requires a lot of data. To overcome this, we use transfer learning approaches. The second problem stems from the lack of transparency of neural networks. It is very difficult to understand and interpret the predictions of a deep model, yet this is essential for medical applications. We propose two approaches to interpret these models. The first is a method that identifies the important parts of an already trained model; the second proposes a new architecture that is interpretable by construction. In both cases, interpretation is done by identifying the biological functions most used by the model.
Slides: view slides

September 23, 2021

Speaker: Olivier Collin (IRISA, Rennes)
Title: GenOuest Platform Overview
Abstract: Operational since 2001, the GenOuest platform offers a complete environment initially specialized for bioinformatics.
It offers its services to a large community (more than 850 open accounts). Thanks to a strategy of successive integration into various regional and national networks, it is now part of the Institut Français de Bioinformatique (IFB), which allows it to take part in the European Elixir infrastructure.
Over the years, the service offering has expanded to provide a range of environments suited to the various needs of users, with virtualized environments and a cloud in addition to a compute cluster. In recent years, GenOuest has also focused on scientific data management through various projects.
This talk will present the GenOuest platform, with a general overview of the resources and services offered.

May 20, 2021

Speaker: Tias Guns (Vrije Universiteit Brussel, Belgium)
Title: Learning from user and environment in combinatorial optimisation
Abstract: Industry and society are increasingly automating processes, which requires solving constrained optimisation problems. This includes vehicle routing problems in transportation, scheduling of tasks in electricity demand-response programs, rostering of personnel and more. However, the solutions found by state-of-the-art constraint programming solvers often do not match the expectations of domain experts, which reduces the potential of the technology. One key direction to overcome this is to automatically learn from the user and environment over time. This includes learning preferences, implicit constraints and impacts of external factors. It requires combining machine learning techniques with constraint solving techniques, and hence hybrid learning and reasoning systems. In this talk I will provide an overview of three different ways in which part of the problem specification can be learned from data. This includes preference learning of objective functions, perception-based reasoning and end-to-end decision-focussed learning, where I will highlight recent evolutions and advances we have been making in our lab. I will end by sharing how I used early results in these directions as a motivation for an ERC grant, due to start in 2021, including the research challenges identified and how they range from feasible to potentially high impact.
Slides: View slides

April 29, 2021

Speaker: Bruno Crémilleux (Greyc, University of Caen)
Title: Pattern mining: from exhaustive search to user-centered pattern discovery.
Abstract: Pattern-oriented data mining, or pattern mining, is now a well-established field of data mining. Methods have long favored complete approaches, with major efforts devoted to algorithmic aspects. In this talk, we show the successes and practical limits of these approaches, which motivated the field’s evolution toward the use of preferences and user-centered methods. We illustrate this evolution by showing how to integrate explicit preferences into pattern mining. Since users often have only a vague idea of the information they are looking for, we introduce interactive mining and its challenges; interactive mining has the advantage of making user preferences explicit in order to provide more useful patterns.

March 16, 2021

Speaker: Ioana Manolescu (Inria, Saclay)
Title: What do the Sources Say? Exploring Heterogeneous Journalistic Data As a Graph
Abstract: Professional journalism is of utmost importance nowadays. It is a main feature distinguishing dictatorships from democracies, and a mirror sorely needed by society to look upon itself and understand its functioning. In turn, understanding is necessary for making informed decisions, such as political choices. With the world turning increasingly digital, journalists need to analyze very large amounts of data, while having no control over the structure, organization, and format of the data. Since 2013, my team has been working to understand data journalism and computational fact-checking use cases, to identify and develop tools adapted for this challenging setting. I will describe our SourcesSay project (2020-2024), in which extremely heterogeneous data sources are integrated as graphs, on top of which journalistic applications can be supported through flexible graph queries. I will explain the data source integration module, the role played by Information Extraction and Entity Disambiguation, as well as novel techniques to explore and simplify these graphs. This work is joint with Angelos Anadiotis, Oana Balalau, Helena Galhardas, Tayeb Merabti, Emmanuel Pietriga, and many other colleagues. Project Web site: https://sourcessay.inria.fr
Slides: View slides

January 21, 2021

Speaker: Michel Dumontier (Maastricht University)
Title: Accelerating biomedical discovery science with an Internet of FAIR data and services
Abstract: Biomedicine has always been a fertile and challenging domain for computational discovery science. Indeed, the existence of millions of scientific articles, thousands of databases, and hundreds of ontologies offers exciting opportunities to reuse our collective knowledge, were we not stymied by incompatible formats, overlapping and incomplete vocabularies, unclear licensing, and heterogeneous access points. In this talk, I will discuss our work to create computational standards, platforms, and methods to wrangle knowledge into simple, but effective representations based on semantic web technologies that are maximally FAIR (Findable, Accessible, Interoperable, and Reusable) and to further use these for biomedical knowledge discovery. But only with additional crucial developments will this emerging Internet of FAIR data and services enable automated scientific discovery on a global scale.
Bio: Dr. Michel Dumontier is the Distinguished Professor of Data Science at Maastricht University and co-founder of the FAIR (Findable, Accessible, Interoperable and Reusable) data principles. His research focuses on the development of computational methods for scalable and responsible discovery science. Dr. Dumontier obtained his BSc (Biochemistry) in 1998 from the University of Manitoba, and his PhD (Bioinformatics) in 2005 from the University of Toronto. Previously a faculty member at Carleton University in Ottawa and Stanford University in Palo Alto, Dr. Dumontier founded and directs the interfaculty Institute of Data Science at Maastricht University to develop sociotechnological systems for responsible data science by design. His work is supported through the Dutch National Research Agenda, the Netherlands Organisation for Scientific Research, Horizon 2020, the European Open Science Cloud, the US National Institutes of Health and a Marie-Curie Innovative Training Network. He is the editor-in-chief for the journal Data Science and is internationally recognized for his contributions in bioinformatics, biomedical informatics, and semantic technologies including ontologies and linked data.
Slides: View slides

November 28, 2020

Speaker: Laurent Simon (LABRI, Bordeaux)
Title: SAT: solving one hard problem to solve them all
Abstract: Progress in the practical solving of the SAT problem, the canonical NP-complete problem, has been spectacular in some application domains. Even though strong limits still exist on some highly combinatorial problems, in this talk we will present a few key applications that have benefited from this progress. We will also present how propositional logic, at the heart of SAT, makes it possible to model and solve reasoning problems well beyond this initial formalism. The talk will conclude with a presentation of recent progress in knowledge compilation, a powerful, general, and elegant formalism for reasoning.
Slides: view slides

November 12, 2020

Speaker: Thomas Schiex (Inrae, Toulouse)
Title: AI for Computational Protein Design: bridging symbolic and numerical AI
Abstract: As a person working on automated reasoning, logic and constraints in Artificial Intelligence but also as a computational biologist at INRAE, I quickly realized that logic alone was not ideal to model and solve the computational problems I faced as an INRAE researcher. For this reason, our team quickly worked on the AI side to extend the “propositional automated reasoning” techniques that reason on absolutely certain statements, to more flexible formulations using “weighted” information. This lead to the birth of several dedicated fundamental algorithms and their implementation in our dedicated solver “toulbar2”. Toulbar2 is now one of the most efficient solver in his area, and is capable of mixing logical and numerically weighted information rigorously, solving complex puzzles that combine pure logical knowledge with more vague information, including probabilistic statements. I will show how, in the last 8 years, we have put toulbar2 to the task of designing new proteins targeting a predefined biochemical/biological function, or more precisely a predefined 3-dimensional structure, standing of the shoulders of giants that designed biophysical force-fields and rotamer libraries, capturing slowly acquired and permanently improved bio-physical and statistical knowledge on proteins. This protein design problem, using a rigid backbone target, a catalog of discrete conformations for amino acid side-chains and a pairwise decomposable force-field such as AMBER, CHARMM or Rosetta score functions, is known to be NP-hard. For a reason we do not really understand, toulbar2 shines on these problems. Contrarily to the usual Monte Carlo based methods, it is able to find and prove that the solution it has found is optimal for the force-field used on problems of non trivial sizes. It also allows to rigorously satisfy non-trivial designer constraints. 
It outperforms other guaranteed optimization tools we have tried and was even able to show the limits of an optimized Monte Carlo method in Rosetta (IPD, Univ. Washington). Recent comparisons of toulbar2 with D-Wave quantum annealing hardware (by Rosetta team members) also show its good relative performance. Thanks to this, we have shown it is also capable of dealing with backbone flexibility, at least when the aim is to design a protein sequence that should fit several “states”. We have put all these techniques to work in practice with structural biologist colleagues, designing new self-assembling proteins, antibodies or enzymes. As I will show, for a few years now, we have upgraded this approach, based purely on force-fields and design target constraints, with machine-learned information extracted from multiple sequence alignments. This allows us to refine the force-field for a given suitable design structure, bridging the gap between data, machine learning information, thermodynamic and logic information, which perfectly fits the usual “designer” situation. Going back to AI, we have shown that this approach is also able to learn how to play Sudoku, without knowing the rules, just from images of solved grids, better than “Deep Neural Net”-friendly approaches, while providing understandable and customizable learned rules.

October 24, 2020

Speaker: Antoine Amarilli (Telecom Paris)
Title: Enumerating pattern matches in texts and trees
Abstract: We study the data management task of extracting structured information from unstructured documents, e.g., raw text or HTML pages. We use the framework of document spanners, where the pattern to extract is specified declaratively by the user (as a regular expression with capture variables) and is translated to an automaton that is then evaluated on the document to compute the pattern matches. We focus on the last step of this pipeline: our goal is to efficiently find the matches of the automaton on the document, even when there can be many of them. We do this by proposing an enumeration algorithm, which first preprocesses the automaton and document, and then enumerates all matches with a small delay between consecutive matches. Unlike previous work, our algorithm achieves the best possible bounds in the input document (namely, linear preprocessing and constant delay), while remaining tractable in the automaton. The guiding principle of the algorithm is to compute a factorized representation of all matches as a product of the automaton and document, and to design efficient indexes based on the structure of this representation. We also present our ongoing follow-up work, e.g., how to extend our algorithm to the case of tree-shaped documents by efficiently enumerating the matches of tree automata, how to efficiently update the enumeration results when the input document changes, and other open research directions.

October 22, 2020

Speaker: François Charoy (Loria, Nancy)
Title: Trust and large-scale information sharing
Abstract: Large-scale cooperation between organizations and people is facilitated today by tools and platforms that enable information to be shared and produced. Since the early work of Ellis in the 1980s, various techniques have been developed to enable this sharing at Web scale. However, the existence of such a capability is not sufficient to ensure that it is used in a way consistent with the constraints of the domains concerned. Through a presentation of several studies carried out in the Coast team at LORIA, we will show the factors that can affect trust between partners when it comes to sharing information. The first factor is the human factor, which we will introduce through a study conducted in the field of civil protection. The second factor concerns the safety of sharing, in particular the platforms and algorithms that ensure safe and efficient large-scale peer-to-peer data sharing. The last factor concerns the security of sharing: we will explain the risks of attack against a peer-to-peer sharing platform and how to protect against them.

January 23, 2020

Speaker: Sarah Cohen (LRI)
Title: Computational reproducibility in the life sciences and research in computer science: round trip
Abstract: With the development of new experimental technologies, biologists are faced with an avalanche of data to be computationally analyzed for scientific advancements and discoveries to emerge. Faced with the complexity of analysis pipelines, the large number of computational tools, and the enormous amount of data to manage, there is compelling evidence that many (if not most) scientific discoveries will not stand the test of time. Increasing the reproducibility of computed results is of paramount importance. While several elements of solutions are currently available, ensuring reproducible analyses relies on progress made in several areas of research in computer science, including fundamental aspects: analysis pipelines may form very complex graphs to be compared and mined, they are also pieces of software with dependencies on their computational environment, maintaining them is highly challenging… After an introduction to the problem of computational reproducibility, we go on to discuss the challenges posed by this domain and describe the remaining opportunities for research in computer science.