Rayan Chikhi
rayan.chikhi@pasteur.fr
@RayanChikhi

I am a researcher in bioinformatics and group leader at Institut Pasteur in Paris.

In non-technical terms, my work consists of analyzing DNA using computers. Scientists can study humans, plants, and animals using DNA sequencing instruments. This has transformed biology in the last decade, e.g. to identify mutations in viruses, to study evolution, and so much more. We would like to have a complete and precise understanding of DNA, but this is not straightforward: sequencing data is challenging to process. So, people like me develop methods to do analyses faster and better.

In technical terms, my interests range from fundamental data structures and algorithms to their implementation and execution in the context of DNA and RNA sequencing. Part of my expertise is on the de novo assembly of genomes. Recently, I contributed to the analysis of all previously sequenced RNA data to find novel viruses (serratus.io).

Short bio

I studied Computer Science at ENS Rennes and obtained a PhD in 2012 under the supervision of Dominique Lavenier. After a postdoc at Penn State in Paul Medvedev's lab, CNRS hired me as a junior researcher in 2014 and I was part of the Bonsai bioinformatics team. In 2019 I started the Sequence Bioinformatics research group in the Department of Computational Biology at Institut Pasteur, partly funded by the Inception program.

Research topics

Genome analysis
Algorithms and data structures
De novo assembly

Group members

Yoann Dufresne (research scientist)
Camila Duitama (PhD student)
Francesco Andreace (PhD student)
Yoshihiro Shibuya (postdoc)

Formerly

Téo Lemane (PhD student, now postdoc at Genoscope)
Riccardo Vicedomini (postdoc, now CNRS researcher)
Luc Blassel (PhD student, now postdoc in Univ Lyon/Paris)
Luca Denti (postdoc, now postdoc at Univ Milan)
Mael Kerbiriou (engineer)
Camille Marchet (postdoc, now CNRS permanent researcher)
Pierre Marijon (PhD student, now bioinformatics staff at Sequoia, Paris)
Daria Martchenko (visiting PhD student, 2018)
Samarth Rangavittal (visiting PhD student, 2015)

Interns' hall of fame: Loik Le Dreau (2010), Antoine Limasset (2013), Jordan Piorum (2015), Remi Godbille (2015), Marion Tommasi (2015), Alexis Dupuis (2017), Manon Curaudeau (2019), Leila Kechache (2019), Louis Mockly (2019), Augustin Giros (2019)

Software

Minia assembler: Whole genome de novo assembler with very low memory usage, described in [11].
Kmergenie: Automatic detection of the k-mer size for de novo assembly, described in [14].
DSK: K-mer counting software, low-memory, low disk usage, supports large values of k, described in [13].
BCALM 2: Very scalable de Bruijn graph compaction, described in [24].
GATB Library: C++ library for the development of reference-free Illumina data analysis software, described in [17].

Publications

[66] G. Benoit et al., High-quality metagenome assembly from long accurate reads with metaMDBG, Nature Biotechnology (2024) [PDF]
[65] B. Willink et al., The genomics and evolution of inter-sexual mimicry and female-limited polymorphisms in damselflies, Nature Ecology & Evolution (2024) [PDF]
[64] F. Andreace, P. Lechat, Y. Dufresne, R. Chikhi, Comparing methods for constructing and representing human pangenome graphs, Genome Biology (2023) [PDF]
[63] C. Duitama González, S. Rangavittal, R. Vicedomini, R. Chikhi, H. Richard, aKmerBroom: Ancient oral DNA decontamination using Bloom filters on k-mer sets, iScience (2023) [PDF]
[62] L. Ayad, R. Chikhi, S. Pissis, Seedability: optimizing alignment parameters for sensitive sequence comparison, Bioinformatics Advances (2023) [PDF]
[61] M. Forgia et al., Hybrids of RNA viruses and viroid-like elements replicate in fungi, Nature Communications (2023) [PDF]
[60] B. Ekim, K. Sahlin, P. Medvedev, B. Berger, R. Chikhi, Efficient mapping of accurate long reads in minimizer space with mapquik, Genome Research (2023) [PDF]
[59] L. Denti, P. Khorsand, P. Bonizzoni, F. Hormozdiari, R. Chikhi, SVDSS: structural variation discovery in hard-to-call genomic regions using sample-specific strings from accurate long reads, Nature Methods (2022) [PDF]
[58] F. Meyer et al., Critical Assessment of Metagenome Interpretation: the second round of challenges, Nature Methods (2022) [PDF]
[57] Y. Dufresne et al., The K-mer File Format: a standardized and compact disk representation of sets of k-mers, Bioinformatics (2022) [PDF]
[56] L. Blassel, P. Medvedev, R. Chikhi, Mapping-friendly sequence reductions: going beyond homopolymer compression, iScience (2022) [PDF]
[55] S. Porrelli et al., Draft genome of the lowland anoa (Bubalus depressicornis) and comparison with buffalo genome assemblies (Bovidae, Bubalina), G3 (2022) [PDF]
[54] T. Lemane, R. Chikhi, P. Peterlongo, kmdiff, large-scale and user-friendly differential k-mer analyses, Bioinformatics (2022) [PDF]
[53] T. Lemane, P. Medvedev, R. Chikhi, P. Peterlongo, kmtricks: Efficient and flexible construction of Bloom filters for large sequencing data collections, Bioinformatics Advances (2022) [PDF]
[52] R. C. Edgar et al., Petabase-scale sequence alignment catalyses viral discovery, Nature (2022) [PDF]
[51] B. Ekim, B. Berger, R. Chikhi, Minimizer-space de Bruijn graphs, RECOMB, Cell Systems (2021) Best Student Paper Award [PDF]
[50] R. Vicedomini, C. Quince, A. Darling, R. Chikhi, Strainberry: automated strain separation in low-complexity metagenomes using long reads, Nature Communications (2021) [PDF]
[49] P. Khorsand, L. Denti, HGSV Consortium, P. Bonizzoni, R. Chikhi, F. Hormozdiari, Comparative genome analysis using sample-specific string detection in accurate long reads, Bioinformatics Advances (2021) [PDF]
[48] L. Yang et al., Recombination Marks the Evolutionary Dynamics of a Recently Endogenized Retrovirus, Molecular Biology and Evolution (2021) [PDF]
[47] G. Lasaviciute et al., Human Bone Marrow Mesenchymal Stromal Cell-Derived CXCL12, IL-6 and GDF-15 and Their Capacity to Support IgG-Secreting Cells in Culture Are Divergently Affected by Doxorubicin, Hemato (2021) [PDF]
[46] C. Quince et al., STRONG: metagenomics strain resolution on assembly graphs, Genome Biology (2021) [PDF]
[45] R. Chikhi, J. Holub, P. Medvedev, Data Structures to Represent a Set of k-long DNA Sequences, ACM Computing Surveys (2021) [PDF]
[44] R. Chikhi, A tale of optimizing the space taken by de Bruijn graphs, Computability in Europe (2021) [PDF]
[43] C. Marchet, C. Boucher, S. Puglisi, P. Medvedev, M. Salson, R. Chikhi, Data structures based on k-mers for querying large collections of sequencing data sets, Genome Research (2020) [PDF]
[42] A. Rahman, R. Chikhi, P. Medvedev, Disk Compression of k-mer Sets, WABI (2020) [PDF]
[41] Y. Dufresne, C. Sun, P. Marijon, D. Lavenier, C. Chauve, R. Chikhi, A Graph-Theoretic Barcode Ordering Model for Linked-Reads, WABI (2020) [PDF]
[40] C. Marchet, Z. Iqbal, D. Gautheret, M. Salson, R. Chikhi, REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets, ISMB (2020) [PDF]
[39] P. Marijon, R. Chikhi, J-S. Varré, yacrd and fpa: upstream tools for long-read genome assembly, Bioinformatics (2020) [PDF]
[38] D. Martchenko, R. Chikhi, A. Shafer, Genome Assembly and Analysis of the North American Mountain Goat (Oreamnos americanus) Reveals Species-Level Responses to Extreme Environments, G3 (2019) [PDF]
[37] L. Lima et al., Comparative assessment of long-read error-correction software applied to Nanopore RNA-sequencing data, Briefings in Bioinformatics (2019) [PDF]
[36] M. Kerbiriou, R. Chikhi, Parallel decompression of gzip-compressed files and random access to DNA sequences, HiCOMB (2019) [PDF]
[35] P. Marijon, R. Chikhi, J-S. Varré, Graph analysis of fragmented long-read bacterial genome assemblies, Bioinformatics (2019) [PDF]
[34] V. Crawford, A. Kuhnle, C. Boucher, R. Chikhi, T. Gagie, Practical Dynamic de Bruijn Graphs, Bioinformatics (2018) [PDF]
[33] R. Chikhi, V. Jovicic, S. Kratsch, P. Medvedev, M. Milanic, S. Raskhodnikova, N. Varma, Bipartite Graphs of Small Readability, COCOON (2018) [PDF]
[32] R. Chikhi, A. Schönhuth, Dualities in Tree Representations, CPM (2018) [PDF]
[31] A. Kuosmanen et al., Using Minimum Path Cover to Boost Dynamic Programming on DAGs: Co-Linear Chaining Extended, RECOMB (2018) [Conference PDF] TALG 2019 [Journal PDF]
[30] J. Audoux et al., DE-kupl: exhaustive capture of biological variation in RNA-seq data through k-mer decomposition, Genome Biology (2017) [Open-access]
[29] S. Rangavittal et al., RecoverY: k-mer-based read classification for Y-chromosome-specific sequencing and assembly, Bioinformatics (2017) [PDF]
[28] A. Sczyrba et al., Critical Assessment of Metagenome Interpretation-A Benchmark of Metagenomics Software, Nature Methods (2017) [PDF]
[27] A. Limasset, G. Rizk, R. Chikhi, P. Peterlongo, Fast and scalable minimal perfect hashing for massive key sets, SEA (2017) [PDF]
[26] C. Sun, R. S. Harris, R. Chikhi, P. Medvedev, AllSome Sequence Bloom Trees, RECOMB (2017) [PDF]
[25] The Computational Pan-Genomics Consortium, Computational pan-genomics: status, promises and challenges, Briefings in Bioinformatics (2016) [PDF]
[24] R. Chikhi, A. Limasset, P. Medvedev, Compacting de Bruijn graphs from sequencing data quickly and in low memory, ISMB (2016) [PDF]
[23] M. Agaba et al., Giraffe genome sequence reveals clues to its unique morphology and physiology, Nature Communications (2016) [PDF]
[22] M. Tomaszkiewicz et al., A time- and cost-effective strategy to sequence mammalian Y Chromosomes: an application to the de novo assembly of gorilla Y, Genome Research (2016) [PDF]
[21] K. Sahlin, R. Chikhi, L. Arvestad, Genome scaffolding with PE-contaminated mate-pair libraries, WABI (2015) [Open-access]
[20] R. Chikhi, P. Medvedev, M. Milanic, S. Raskhodnikova, On the readability of overlap digraphs, CPM (2015) and Discrete Applied Mathematics (2016) [Open-access]
[19] R. Uricaru et al., Reference-free detection of isolated SNPs, Nucleic Acids Research (2014) [Open-access] [Webpage]
[18] G. Rizk, A. Gouin, R. Chikhi, C. Lemaitre, MindTheGap: integrated detection and assembly of short and long insertions, Bioinformatics (2014) [Open-access] [Webpage]
[17] E. Drezen et al., GATB: Genome Assembly & Analysis Tool Box, Bioinformatics (2014) [Open-access] [Webpage]
[16] R. Chikhi, A. Limasset, S. Jackman, J. Simpson, P. Medvedev, On the representation of de Bruijn graphs, RECOMB (2014) [PDF]
[15] K. R. Bradnam et al., Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species, GigaScience (2013) [PDF]
[14] R. Chikhi, P. Medvedev, Informed and Automated k-Mer Size Selection for Genome Assembly, Bioinformatics (2013), HiTSeq (2013) Best Paper Award [PDF] [Webpage]
[13] G. Rizk, D. Lavenier, R. Chikhi, DSK: k-mer counting with very low memory usage, Bioinformatics (2013) [PDF] [Webpage]
[12] N. Maillet, C. Lemaitre, R. Chikhi, D. Lavenier, P. Peterlongo, Compareads: comparing huge metagenomic experiments, RECOMB Comparative Genomics (2012) [PDF] [Webpage]
[11] R. Chikhi, G. Rizk, Space-efficient and exact de Bruijn graph representation based on a Bloom filter, WABI (2012) [PDF] [Webpage]
[10] P. Peterlongo, R. Chikhi, Mapsembler, targeted and micro assembly of large NGS datasets on a desktop computer, BMC Bioinformatics (2012) [PDF] [Webpage]
[9] G. Sacomoto et al., KisSplice: de novo calling alternative splicing events from RNA-seq data, RECOMB-seq, BMC Bioinformatics (2012) [PDF] [Webpage]
[8] D. A. Earl et al., Assemblathon 1: A competitive assessment of de novo short read assembly methods, Genome Research (2011) [PDF]
[7] G. Chapuis, R. Chikhi, D. Lavenier, Parallel and memory-efficient reads indexing for genome assembly, PPAM Parallel Bio-Computing Workshop (2011) [PDF]
[6] R. Chikhi, D. Lavenier, Localized genome assembly from reads to scaffolds: practical traversal of the paired string graph, WABI (2011) [PDF]
[5] R. Chikhi, L. Sael, D. Kihara, Protein binding ligand prediction using moment-based methods, Protein function prediction for omics era, D. Kihara ed., Springer (2011) [PDF]
[4] D. Kihara, L. Sael, R. Chikhi, J. Esquivel-Rodriguez, Molecular surface representation using 3D Zernike descriptors for protein shape comparison and docking, Curr. Protein and Peptide Science (2010) [PDF]
[3] R. Chikhi, L. Sael, D. Kihara, Real-time ligand binding pocket database search using local surface descriptors, Proteins: Structure, Function, and Bioinformatics (2010) [PDF]
[2] R. Chikhi, D. Lavenier, Paired-end read length lower bounds for genome re-sequencing (Meeting Abstract), BMC Bioinformatics (2009) [PDF]
[1] R. Chikhi, S. Derrien, A. Noumsi, P. Quinton, Combining flash memory and FPGAs to efficiently implement a massively parallel algorithm for content-based image retrieval, International Journal of Electronics (2008) [PDF]

Talks

Seqbim Keynote, 2022, The tumultuous fate of sequence bioinformatics ideas [PDF]
ERGA Workshop, ECCB, 2022, The wonderful world of long-read genome assembly [PDF]
Pangenomics summer school in Como, 2022, Introduction to sequence graph representations [Slides] [Data and code]
Evomics Workshop on Genomics, 2022, We're gonna need a bigger instance [PDF]
Tudastic, 2022, Big Biological Data [PDF]
Compression+Computation, 2022, Minimizer-space de Bruijn graphs for pangenomics [PDF]
Pangenomics Bio Hacking, 2021, Minimizer-space de Bruijn graphs for pangenomics [PDF]
EBAME, 2021, Short-read metagenome assembly [PDF]
CiE, 2021, A tale of optimizing the space usage of de Bruijn graphs [PDF]
VanBUG, 2020, Efficient indexing of k-mer presence and abundance in sequencing datasets [PDF] [YouTube]
CGSI, 2019, Q: Is de novo genome assembly a solved problem with long reads, yet? A: No [YouTube] [PDF] [Benchmark]
CGSI, 2019, Recent advances in data structures for storing sets of k-mer sets [PDF]
Helsinki Bioinformatics Day, 2019, Genome assembly with either short reads or long reads [PDF]
HiCOMB, 2019, Parallel decompression of gzip-compressed files and random access to DNA sequences [PDF]
Evomics Workshop on Genomics, 2019, de novo assembly & reference-free analysis [PDF] [Lab]
BiG seminar, 2018, Large genome assembly [YouTube] [PDF]
CGSI, 2018, k-mer data structures [YouTube] [PDF]
CGSI, 2018, Metagenome assembly methods [YouTube] [PDF]
CPM, 2018, Dualities in tree representations [PDF]
Mosaic Webinar, 2018, Minia's entry at Mosaic Strains1 assembly challenge [PDF]
Evomics Workshop on Genomics, 2018, de novo assembly & k-mers [PDF] [Lab]
RNA-Seq Nanopore @ Evry, 2017, A review of RNA-seq nanopore read correction [PDF]
BiATA, 2017, Ingredients for de novo (meta)genome assembly [PDF]
Evomics Workshop on Genomics, 2017, de novo assembly [PDF] [Lab]
Colib'Read Workshop, 2016, Graph representations of reference-free sequencing data [PDF]
ISMB, 2016, Compacting de Bruijn graphs from sequencing data quickly and in low memory [PDF]
ALEA, 2016, On the representation of de Bruijn graphs (focusing on navigational data structures) [PDF]
SMPGD keynote, 2016, de Bruijn graphs of sequencing data [PDF]
Evomics Workshop on Genomics, 2016, de novo assembly [PDF] [Lab]
RECOMB, 2014, On the representation of de Bruijn graphs [PDF]
Evomics Workshop on Genomics, 2014, de novo assembly [PDF] [Blog post] [Lab]
ISMB/HiTSeq, 2013, Informed and Automated k-Mer Size Selection for Genome Assembly [PDF]
Evomics Workshop on Genomics, 2013, de novo assembly (introduction) [PDF]
WABI, 2012, Space-efficient and exact de Bruijn graph representation based on a Bloom filter [PDF]
Thesis slides, 2012, Computational methods for de novo assembly of NGS data [PDF]
WABI, 2011, Localized genome assembly from reads to scaffolds: practical traversal of the paired string graph [PDF]
IBL, 2011, de novo assembly tools, Monument, Mapsembler [PDF]
ISCBsc, 2009, Paired-end read length lower bounds for genome re-sequencing [PDF]

Reports

R. Chikhi, k-mer data structures in sequence bioinformatics, HDR Thesis, 2021 [PDF] [slides]
Contains the entire "A tale of optimizing the space taken by de Bruijn graphs" article, and a summary of several methods: REINDEER, BCALM2, pugz and Minia.

R. Chikhi, Computational Methods for de novo Assembly of Next-Generation Genome Sequencing Data, PhD Thesis, 2008-2012 [PDF]
Summary: We discuss computational methods (theoretical models and algorithms) to perform the reconstruction (de novo assembly) of DNA sequences produced by high-throughput sequencers. This thesis introduces the following contributions:
- quantification of the maximum theoretical genome coverage achievable by recent sequencing data (Chapter 2)
- theoretical models for paired-end assembly (Chapter 3)
- two concepts for practical assembly: localized assembly and memory-efficient paired reads indexing (Chapter 4)
- implementation details of a de novo assembly software, the Monument assembler (Chapter 5)
- an algorithm that enumerates variants in sequencing data, implemented in the Mapsembler software (Chapter 6)

R. Chikhi, Study of Unentanglement in Quantum Computing, Manuscript, research internship at MIT, Spring 2008 [PDF]
Summary: We investigate the conjecture that one cannot simulate QMA(2) protocols in QMA using a quantum operation called a disentangler. Our results show that, when exponential precision is required, this conjecture holds unless P = NP. Moreover, also in the exponential precision case, we show that one only needs a stronger hypothesis to prove the conjecture.

R. Chikhi, Protein surface descriptors for binding sites comparison and ligand prediction, Manuscript, research internship at Purdue University, Summer 2007 [PDF]
Summary: We present a model for two-dimensional ligand binding pocket representation and we apply it to pocket-pocket matching and binding ligand prediction.

Retired software

Mapsembler: Targeted assembly on a desktop computer, see reference [10].
Paired reads repetitions: Software package for computing the ratio of single and paired (as in paired NGS reads) exact repetitions within a genome. Useful for obtaining re-sequencing lower bounds inspired by [Whiteford 05]. See [2] and the corresponding talk for sample results and details.
Monument: Whole genome de novo assembler, described in [6], [7] and [PhD Thesis]. (recommended instead: Minia)
de Bruijn graph construction: Hash table-free implementation of the de Bruijn graph for a set of reads. Also includes a tool that computes the union of two de Bruijn graphs and the Cartesian product of abundances, useful for constructing a multi-dataset de Bruijn graph. (recommended instead: BCALM 2)
Pocket-Surfer: Protein ligand binding pocket type prediction using a database of known binding sites. See [3] for more details. (recommended instead: 3D-Surfer)