Monthly Seminars – 2020 – DKM department – IRISA

January 23, 2020

Speaker: Sarah Cohen (LRI)
Title: Computational reproducibility in the life sciences and research in computer science: round trip
Abstract: With the development of new experimental technologies, biologists are faced with an avalanche of data to be computationally analyzed for scientific advancements and discoveries to emerge. Given the complexity of analysis pipelines, the large number of computational tools, and the enormous amount of data to manage, there is compelling evidence that many (if not most) scientific discoveries will not stand the test of time. Increasing the reproducibility of computed results is of paramount importance. While several elements of a solution are currently available, ensuring reproducible analyses relies on progress in several areas of computer science research, including fundamental aspects: analysis pipelines may form very complex graphs to be compared and mined, they are also pieces of software with dependencies on their computational environment, maintaining them is highly challenging… After an introduction to the problem of computational reproducibility, we go on to discuss the challenges posed by this domain and describe the remaining research opportunities in computer science.

October 22, 2020

Speaker: François Charoy (Loria, Nancy)
Title: Trust and large-scale information sharing
Abstract: Large-scale cooperation between organizations and people is now facilitated by tools and platforms that enable information to be shared and produced. Since the early work of Ellis in the 1980s, various techniques have been developed to enable this sharing at Web scale. However, the existence of such a capability is not enough to ensure that it is used in a way consistent with the constraints of the domains concerned. Through a presentation of several projects carried out in the Coast team at LORIA, we will show the factors that can affect trust between partners when it comes to sharing information. The first is the human factor, which we will introduce through a study conducted in the field of civil protection. The second factor concerns the safety of sharing, in particular the platforms and algorithms that ensure safe and efficient large-scale peer-to-peer data sharing. The last factor concerns the security of sharing: we will explain the risks of attack against a peer-to-peer sharing platform and how to protect against them.

October 24, 2020

Speaker: Antoine Amarilli (Telecom Paris)
Title: Enumerating pattern matches in texts and trees
Abstract: We study the data management task of extracting structured information from unstructured documents, e.g., raw text or HTML pages. We use the framework of document spanners, where the pattern to extract is specified declaratively by the user (as a regular expression with capture variables) and is translated to an automaton that then is evaluated on the document to compute the pattern matches. We focus on the last step of this pipeline: our goal is to efficiently find the matches of the automaton on the document, even when there can be many of them. We do this by proposing an enumeration algorithm, which first preprocesses the automaton and document, and then enumerates all matches with a small delay between consecutive matches. Unlike previous work, our algorithm achieves the best possible bounds in the input document (namely, linear preprocessing and constant delay), while remaining tractable in the automaton. The guiding principle of the algorithm is to compute a factorized representation of all matches as a product of the automaton and document, and design efficient indexes based on the structure of this representation. We also present our ongoing follow-up work, e.g., how to extend our algorithm to the case of tree-shaped documents by efficiently enumerating the matches of tree automata, how to efficiently update the enumeration results when the input document changes, and other open research directions.
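Illustration (not from the talk): the sketch below shows what "enumerating all matches" means for a pattern with a single capture variable, where the variable may bind to any span of the document that matches an inner regular expression. It is a naive baseline that tests every span, so it has none of the linear-preprocessing or constant-delay guarantees described in the abstract. The function name `enumerate_spans` and the example pattern are assumptions made for this illustration.

```python
import re
from typing import Iterator, Tuple


def enumerate_spans(inner_pattern: str, document: str) -> Iterator[Tuple[int, int]]:
    """Lazily yield every non-empty span [i, j) whose text fully matches inner_pattern.

    Naive baseline: tries all O(n^2) spans, each with its own regex test.
    """
    compiled = re.compile(inner_pattern)
    n = len(document)
    for i in range(n + 1):
        for j in range(i + 1, n + 1):
            # Pattern.fullmatch(string, pos, endpos) restricts the match to [pos, endpos).
            if compiled.fullmatch(document, i, j):
                yield (i, j)


# All spans of "baab" that consist only of the letter "a".
spans = list(enumerate_spans("a+", "baab"))
print(spans)
```

Note that, unlike `re.finditer`, this returns overlapping spans as well, which is the behaviour a spanner with one capture variable would be expected to have.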

November 12, 2020

Speaker: Thomas Schiex (Inrae, Toulouse)
Title: AI for Computational Protein Design: bridging symbolic and numerical AI
Abstract: As a person working on automated reasoning, logic and constraints in Artificial Intelligence but also as a computational biologist at INRAE, I quickly realized that logic alone was not ideal to model and solve the computational problems I faced as an INRAE researcher. For this reason, our team quickly worked on the AI side to extend the “propositional automated reasoning” techniques that reason on absolutely certain statements to more flexible formulations using “weighted” information. This led to the birth of several fundamental algorithms and their implementation in our dedicated solver “toulbar2”. Toulbar2 is now one of the most efficient solvers in its area, and is capable of mixing logical and numerically weighted information rigorously, solving complex puzzles that combine pure logical knowledge with more vague information, including probabilistic statements. I will show how, in the last 8 years, we have put toulbar2 to the task of designing new proteins targeting a predefined biochemical/biological function, or more precisely a predefined 3-dimensional structure, standing on the shoulders of the giants who designed biophysical force-fields and rotamer libraries, capturing slowly acquired and continually improved biophysical and statistical knowledge on proteins. This protein design problem, using a rigid backbone target, a catalog of discrete conformations for amino acid side-chains and a pairwise decomposable force-field such as AMBER, CHARMM or Rosetta score functions, is known to be NP-hard. For reasons we do not really understand, toulbar2 shines on these problems. Contrary to the usual Monte Carlo based methods, it is able to find a solution and prove that it is optimal for the force-field used, on problems of non-trivial size. It also makes it possible to rigorously satisfy non-trivial designer constraints.
It outperforms other guaranteed optimization tools we have tried and was even able to show the limits of an optimized Monte Carlo method in Rosetta (IPD, Univ. Washington). Recent comparisons of toulbar2 with D-Wave quantum annealing hardware (by Rosetta team members) also show its good relative performance. Thanks to this, we have shown it is also capable of dealing with backbone flexibility, at least when the aim is to design a protein sequence that should fit several “states”. We have put all these techniques to work in practice with structural biologist colleagues, designing new self-assembling proteins, antibodies or enzymes. As I will show, for a few years now, we have upgraded this approach, based purely on force-fields and design target constraints, with machine-learned information extracted from multiple sequence alignments. This allows us to refine the force-field for a given suitable design structure, bridging the gap between data, machine learning information, thermodynamic and logic information, which perfectly fits the usual “designer” situation. Going back to AI, we have shown that this approach is also able to learn how to play Sudoku, without knowing the rules, just from images of solved grids, better than “Deep Neural Net”-friendly approaches, while providing understandable and customizable learned rules.
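Illustration (not from the talk): the sketch below shows the shape of the optimization problem described in the abstract, namely choosing one discrete conformation per position so as to minimize a pairwise decomposable energy. It uses exhaustive enumeration, which proves optimality only because the instance is tiny; it is not toulbar2's algorithm and does not use its API. The function name `min_energy_assignment` and all energy values are invented toy data.

```python
from itertools import product
from typing import Dict, List, Tuple


def min_energy_assignment(
    unary: List[List[float]],
    pairwise: Dict[Tuple[int, int], List[List[float]]],
) -> Tuple[Tuple[int, ...], float]:
    """Return the assignment of one choice per position with minimum total energy.

    unary[i][a] is the energy of choice a at position i;
    pairwise[(i, j)][a][b] is the interaction energy of choices a and b at positions i and j.
    """
    best_assignment: Tuple[int, ...] = ()
    best_energy = float("inf")
    # Exhaustive search: exponential in the number of positions, toy sizes only.
    for assignment in product(*(range(len(choices)) for choices in unary)):
        energy = sum(unary[i][a] for i, a in enumerate(assignment))
        energy += sum(
            table[assignment[i]][assignment[j]] for (i, j), table in pairwise.items()
        )
        if energy < best_energy:
            best_assignment, best_energy = assignment, energy
    return best_assignment, best_energy


# Toy instance (invented numbers): two positions, two candidate conformations each.
toy_unary = [[0.0, 1.0], [0.5, 0.0]]
toy_pairwise = {(0, 1): [[2.0, 0.0], [0.0, 3.0]]}
print(min_energy_assignment(toy_unary, toy_pairwise))
```

On realistic instances the search space is far too large for enumeration, which is where guaranteed solvers of the kind described in the abstract come in.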

November 28, 2020

Speaker: Laurent Simon (LABRI, Bordeaux)
Title: SAT: solving one hard problem to solve them all
Abstract: Progress in the practical solving of SAT, the canonical NP-complete problem, has been spectacular in some application domains. Although strong limits remain on some highly combinatorial problems, in this talk we will present a few key applications that have benefited from this progress. We will also show how propositional logic, at the heart of SAT, makes it possible to model and solve reasoning problems well beyond this initial formalism. The talk will conclude with a presentation of recent progress in knowledge compilation, a powerful, general and elegant formalism for reasoning.
Slides: view slides
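Illustration (not from the talk): for readers unfamiliar with SAT, the sketch below is a minimal DPLL-style procedure (unit propagation plus branching) on formulas in conjunctive normal form. Clauses are lists of non-zero integers, where `-v` denotes the negation of variable `v`. This is a textbook baseline, far from the modern solvers the abstract refers to; the names `simplify` and `dpll` are chosen for this illustration.

```python
from typing import Dict, List, Optional

Clause = List[int]


def simplify(clauses: List[Clause], literal: int) -> List[Clause]:
    """Assume `literal` is true: drop satisfied clauses, remove the falsified literal."""
    result = []
    for clause in clauses:
        if literal in clause:
            continue  # clause already satisfied
        result.append([lit for lit in clause if lit != -literal])
    return result


def dpll(clauses: List[Clause], assignment: Optional[Dict[int, bool]] = None) -> Optional[Dict[int, bool]]:
    """Return a satisfying (possibly partial) assignment, or None if unsatisfiable."""
    assignment = dict(assignment or {})
    # Unit propagation: a one-literal clause forces the value of its variable.
    while True:
        if any(len(clause) == 0 for clause in clauses):
            return None  # an empty clause cannot be satisfied
        units = [clause[0] for clause in clauses if len(clause) == 1]
        if not units:
            break
        literal = units[0]
        assignment[abs(literal)] = literal > 0
        clauses = simplify(clauses, literal)
    if not clauses:
        return assignment  # every clause is satisfied
    # Branch on the first literal of the first remaining clause.
    literal = clauses[0][0]
    for choice in (literal, -literal):
        extended = {**assignment, abs(choice): choice > 0}
        result = dpll(simplify(clauses, choice), extended)
        if result is not None:
            return result
    return None


# (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
cnf = [[1, 2], [-1, 3], [-2, -3]]
print(dpll(cnf))
# x1 and not x1 is unsatisfiable.
print(dpll([[1], [-1]]))
```

Variables missing from the returned dictionary can take either value.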