A Gold Standard for Transcription Factor-Gene Regulatory Interactions in Escherichia coli K-12: Architecture of the Evidence Types.
Paloma Lara, Socorro Gama-Castro, Heladia Salgado, Claire Rioualen, Luis J Muñiz-Rascado, Jair S García-Sotelo, Víctor H Tierrafría, Julio Collado-Vides
Published in: bioRxiv: the preprint server for biology (2023)
Genomics has greatly expanded the experimental strategies available to identify the elements involved in regulatory networks, particularly those that regulate transcription initiation. As new methodologies emerge, a natural step is to compare their results with the knowledge obtained by previously established methodologies, such as the classic methods of molecular biology used to characterize transcription factor binding and regulatory sites, promoters, and transcription units. Such previous knowledge is dispersed across the scientific literature, which limits its accessibility. Fortunately, in the case of Escherichia coli K-12, the best-studied microorganism, we have been continuously gathering this knowledge from original scientific publications for the last 30 years, and have made it available to the public in two databases, RegulonDB and EcoCyc. More recently, we have also gathered knowledge generated by genomic high-throughput (HT) methodologies, as can be seen in the latest RegulonDB version 11.0, where, in addition to the objects and interactions identified by methods of molecular biology, users can find datasets of binding sites from the ChIP-seq, ChIP-exo, gSELEX and DAP-seq HT technologies. This has motivated us to improve our evidence codes to facilitate their use as gold standards. To enhance the precision of knowledge representation, we implemented three alternatives to curate regulatory interactions, based on the level of detail provided by the experiments. We present them so that users can select different versions of objects based on properties such as the methods used to identify them (classic, computational predictions, HT methods); in vivo vs in vitro approaches; and the degree of confidence (weak, strong, confirmed) of the evidence supporting them.
The collection of regulatory interactions is analyzed based on their supporting evidence, showing that classic methods still account for the largest fraction of strong and confirmed sites, a distribution likely to change in the years to come as HT methods may become the dominant strategies to identify regulatory genomic components. We plan to keep updating and expanding these gold-standard datasets as part of future RegulonDB releases.
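
As a rough illustration of selecting a gold-standard subset by evidence properties (method class, in vivo vs in vitro, and confidence level), the following minimal Python sketch may help. It is not the RegulonDB data model or API: the record type `RegulatoryInteraction`, the function `select_interactions`, the field names, and the toy records (`TF_A`, `gene_1`, and so on) are all hypothetical placeholders introduced here, and only the three confidence labels and three method classes are taken from the abstract.

```python
from dataclasses import dataclass

# Ordering of the confidence levels named in the abstract (weak < strong < confirmed).
CONFIDENCE_RANK = {"weak": 0, "strong": 1, "confirmed": 2}


@dataclass(frozen=True)
class RegulatoryInteraction:
    """Hypothetical record for one TF-gene regulatory interaction."""
    tf: str             # transcription factor (placeholder name)
    target: str         # regulated gene (placeholder name)
    method_class: str   # "classic", "computational", or "HT"
    in_vivo: bool       # True for in vivo evidence, False for in vitro
    confidence: str     # "weak", "strong", or "confirmed"


def select_interactions(interactions, method_classes=None, in_vivo=None,
                        min_confidence="weak"):
    """Return the interactions that satisfy every requested evidence filter.

    A filter left as None is not applied.
    """
    threshold = CONFIDENCE_RANK[min_confidence]
    selected = []
    for ri in interactions:
        if method_classes is not None and ri.method_class not in method_classes:
            continue
        if in_vivo is not None and ri.in_vivo != in_vivo:
            continue
        if CONFIDENCE_RANK[ri.confidence] < threshold:
            continue
        selected.append(ri)
    return selected


# Toy records for illustration only; they do not correspond to real data.
toy = [
    RegulatoryInteraction("TF_A", "gene_1", "classic", True, "confirmed"),
    RegulatoryInteraction("TF_A", "gene_2", "HT", True, "weak"),
    RegulatoryInteraction("TF_B", "gene_3", "HT", False, "strong"),
    RegulatoryInteraction("TF_B", "gene_4", "computational", False, "weak"),
]

# Keep only interactions with at least strong evidence, from any method class.
gold = select_interactions(toy, min_confidence="strong")

# Keep only HT interactions supported by in vivo evidence.
ht_in_vivo = select_interactions(toy, method_classes={"HT"}, in_vivo=True)
```

Treating each property as an independent, optional filter mirrors the idea that different users may want different versions of the gold standard, for example a strict set for benchmarking and a broader set for exploration.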