Preprint of a manuscript under review at the 22nd IEEE International Conference on eScience (eScience 2026), Naples, Italy. This is the author-submitted version and may differ from the final published article.
The rapid growth of protein sequence and structure databases has created a significant annotation gap that manual curation alone cannot bridge. Two recent computational biology methods, DPCfam and DPCstruct, address this by applying the Density Peak Clustering (DPC) algorithm to automatically group millions of protein domains into evolutionary families called metaclusters. Without any manual curation, this process yielded 81,384 sequence-based and 28,246 structure-based families, many of which have no counterpart in Pfam, CATH, or SCOP and form the “dark matter” of the protein universe. We present DPCexplorer, a unified Django web application that gives interactive access to both collections through a single interface. The raw archives were preprocessed on the ORFEO HPC cluster and loaded into a PostgreSQL database optimised with B-tree, GIN trigram, and functional indexes. Built on the Model-View-Template pattern, the platform accepts four query types (DPCfam MCID, DPCstruct MCID, Pfam ID, and UniProt accession) and returns paginated metadata tables, per-protein domain-architecture diagrams, and downloadable biological files (FASTA, MSA, HMM, PDB). For DPCstruct metaclusters, it integrates the PDBe-Mol* 3D viewer to render AlphaFold2-predicted structures coloured by per-residue pLDDT confidence directly in the browser. DPCexplorer is fully reproducible from its GitHub repository and the associated Zenodo data deposit.
Publication Date: 2026-06-19
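To make the four query types named in the abstract concrete, the following is a minimal sketch of how a search box might route an input string to one of them. It is not taken from the DPCexplorer code base. The Pfam pattern (PF plus five digits) and the UniProt accession pattern follow the formats published by those resources, whereas the two metacluster ID patterns, the function name classify_query, and the returned labels are hypothetical and chosen purely for illustration.

```python
import re
from typing import Optional

# Pfam accessions are "PF" followed by five digits (e.g. PF00069).
PFAM_RE = re.compile(r"PF\d{5}")

# Accession pattern as published in the UniProt help pages.
UNIPROT_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)

# ASSUMPTION: hypothetical metacluster ID formats, used only for illustration.
# The real DPCfam and DPCstruct MCID formats may differ.
DPCFAM_MCID_RE = re.compile(r"MC\d+")
DPCSTRUCT_MCID_RE = re.compile(r"SMC\d+")

# Patterns are tried in order; the first full match decides the query type.
QUERY_PATTERNS = (
    ("pfam", PFAM_RE),
    ("dpcfam_mcid", DPCFAM_MCID_RE),
    ("dpcstruct_mcid", DPCSTRUCT_MCID_RE),
    ("uniprot", UNIPROT_RE),
)


def classify_query(text: str) -> Optional[str]:
    """Return a label for the query type, or None if nothing matches."""
    query = text.strip().upper()  # tolerate stray whitespace and lower case
    for kind, pattern in QUERY_PATTERNS:
        if pattern.fullmatch(query):
            return kind
    return None
```

In a Django view, the returned label could then select which model to query and which template to render. Using fullmatch rather than search keeps a partial hit (for example, a Pfam ID embedded in a longer string) from being misrouted.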