
#Cheminformatics

6 posts · 4 participants · 0 posts today

Oscar4 5.3.0 has been released: github.com/BlueObelisk/oscar4/

"OSCAR (Open Source Chemistry Analysis Routines) is an open source extensible system for the automated annotation of chemistry in scientific articles. It can be used to identify chemical names, reaction names, ontology terms, enzymes and chemical prefixes and adjectives, and chemical data such as state, yield, IR, NMR and mass spectra and elemental analyses."


The preprint argues that a flood of mediocre articles is (partly) caused by open data making it easy to run all sorts of analyses on those data.

It seems to me that the observation is correct. I have had this impression in #cheminformatics too, and, indeed, running an analysis on a data set like Tox21, ChEMBL, etc., has been getting increasingly easy.

But I do not think that #openscience and #FAIR are the problem, nor restricting access the solution.

I have some ideas about the underlying problem:

new BridgeDb Datasources release: github.com/bridgedb/datasource

"Uses UniProtKB as name, removes EcoGene, and multiple small updates"

BridgeDb Datasources is a dataset with metadata about data sources and organisms used by BridgeDb Java and downstream tools like @wikipathways, PathVisio, and others

GitHub release: Release 20250728 · bridgedb/datasources
Full Changelog: 2024092...2027072

New Preprint Alert!

We're excited to share our latest work on #ChemRxiv! MARCUS (Molecular Annotation and Recognition for Curating Unravelled Structures) is a web-based platform for extracting chemical information from scientific papers.

📄 Preprint: doi.org/10.26434/chemrxiv-2025

🔗 Try it out: marcus.decimer.ai

ChemRxiv: MARCUS: Molecular Annotation and Recognition for Curating Unravelled Structures

The exponential growth of chemical literature necessitates the development of automated tools for extracting and curating molecular information from unstructured scientific publications into open-access chemical databases. Current optical chemical structure recognition (OCSR) and named entity recognition solutions operate in isolation, which limits their scalability for comprehensive literature curation. Here we present MARCUS (Molecular Annotation and Recognition for Curating Unravelled Structures), a tool to aid curators in performing literature curation in the field of natural products. This integrated web-based platform combines automated text annotation, multi-engine OCSR, and direct submission capabilities to the COCONUT database. MARCUS employs a fine-tuned GPT-4 model to extract chemical entities and utilises an ensemble approach integrating DECIMER, MolNexTR, and MolScribe for structure recognition. The platform aims to streamline the data extraction workflow from PDF upload to database submission, significantly reducing curation time. MARCUS bridges the gap between unstructured chemical literature and machine-actionable databases, enabling FAIR data principles and facilitating AI-driven chemical discovery. Through open-source code, accessible models, and comprehensive documentation, the web application enhances accessibility and promotes community-driven development. This approach facilitates unrestricted use and encourages the collaborative advancement of automated chemical literature curation tools. We dedicate MARCUS to Dr Marcus Ennis, the longest-serving curator of the ChEBI database, on the occasion of his 75th birthday.

# Qleverfile for PubChem, use with the
# QLever CLI (`pip install qlever`)
#
# qlever get-data # ~2 hours, ~120 GB, ~19 billion triples
# qlever index # ~6 hours, ~20 GB RAM, ~350 GB disk space (for the index)
# qlever start # a few seconds

Source: github.com/ad-freiburg/qlever-

GitHub: qlever-control/src/qlever/Qleverfiles/Qleverfile.pubchem at main · ad-freiburg/qlever-control

I think I am going to try to recover a bit of #cheminformatics / #chemistry #history, and make the index of the Internet Journal of Chemistry (IJC) FAIR in @wikidata

While the journal no longer exists, many articles are cited quite a few times.

I did some exploration some time ago, and for some I found full text "self-archiving" versions online.

And, TIL that Web of Science has entries for the articles too, which I just added for the 9 articles already in #Wikidata: w.wiki/Eide

Anyone have views or references on the effectiveness of count/sparse #cheminformatics fingerprints compared to binary/dense fingerprints?

What about comparing methods to turn count/sparse fingerprints into binary ones? I know of several approaches, but nothing methodical or published.

I'm trying to understand how I might add sparse/count fingerprints to chemfp.
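
To make the question above concrete, here is a minimal sketch of two ways a count fingerprint might be turned into a binary one. It is only an illustration, not chemfp's method or any published recipe: the function names, the string feature keys, the 2048-bit size, and the thresholds `(1, 2, 4, 8)` are all assumptions chosen for the example.

```python
import hashlib
from typing import Mapping, Sequence, Set


def _bit_index(key: str, num_bits: int) -> int:
    """Map a string key to a stable bit position in [0, num_bits)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_bits


def binarize_presence(counts: Mapping[str, int], num_bits: int = 2048) -> Set[int]:
    """Presence/absence: set one bit per feature whose count is above zero."""
    return {_bit_index(feature, num_bits)
            for feature, count in counts.items() if count > 0}


def binarize_thresholds(counts: Mapping[str, int],
                        num_bits: int = 2048,
                        thresholds: Sequence[int] = (1, 2, 4, 8)) -> Set[int]:
    """Threshold encoding: set one bit per (feature, threshold) pair
    for every threshold the count reaches, so larger counts set more bits."""
    bits: Set[int] = set()
    for feature, count in counts.items():
        for threshold in thresholds:
            if count >= threshold:
                bits.add(_bit_index(f"{feature}#{threshold}", num_bits))
    return bits


if __name__ == "__main__":
    # Hypothetical substructure counts for one molecule.
    counts = {"C-C": 5, "C=O": 1, "N-H": 0}
    print(sorted(binarize_presence(counts)))
    print(sorted(binarize_thresholds(counts)))
```

The presence variant throws the counts away entirely, while the threshold variant keeps a coarse, monotone view of them at the cost of more set bits and more hash collisions at a fixed fingerprint length.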

Version 5.0b1 of chemfp - the comprehensive package for binary #cheminformatics fingerprints - is out and ready for curious beta testers.

Here are the highlights. For more info and Linux install info see chemfp.com/chemfp-50b1-availab

- shardsearch to search multiple files

- simhistogram for a histogram of pairwise Tanimoto comparisons

- the FPB file size limit increased from ~250M fingerprints to well over a billion

- new Klekota-Roth fingerprint type for RDKit and OpenEye

chemfp.com/

chemfp.com: chemfp 5.0b1 available for beta testing
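
For readers unfamiliar with the idea behind a histogram of pairwise Tanimoto comparisons, the following is a rough pure-Python sketch. It does not use chemfp and does not reflect how chemfp's simhistogram is implemented or invoked. The function names, the use of Python integers as bit fingerprints, the ten equal-width bins, and returning 0.0 when both fingerprints are empty are all assumptions made for this illustration.

```python
from itertools import combinations
from typing import List, Sequence


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two binary fingerprints stored as integers."""
    union = bin(a | b).count("1")
    if union == 0:
        # Assumption for this sketch: two empty fingerprints score 0.0.
        return 0.0
    return bin(a & b).count("1") / union


def similarity_histogram(fingerprints: Sequence[int], num_bins: int = 10) -> List[int]:
    """Count all unordered pairs by Tanimoto score in equal-width bins over [0, 1]."""
    histogram = [0] * num_bins
    for a, b in combinations(fingerprints, 2):
        score = tanimoto(a, b)
        # A score of exactly 1.0 goes into the last bin.
        index = min(int(score * num_bins), num_bins - 1)
        histogram[index] += 1
    return histogram


if __name__ == "__main__":
    # Three toy 4-bit fingerprints.
    fingerprints = [0b1111, 0b1100, 0b0011]
    print(similarity_histogram(fingerprints))
```

The all-pairs loop is quadratic in the number of fingerprints, so this sketch is only suitable for small toy sets.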