CRAG-dual-encoder-base
CRAG: Causal Reasoning for Adversomics Graphs
This is the base model in the CRAG dual-encoder family for extracting relations between drugs and adverse drug reactions (ADRs). It uses a dual-encoder architecture built on PubMedBERT to score drug-ADR pairs for causal pharmacovigilance graph construction.
Model Description
CRAG-dual-encoder-base is designed to identify causal relationships between drugs and adverse drug reactions from biomedical text. Given a drug mention and an ADR mention in context, the model predicts whether they share a causal relationship.
Architecture
┌─────────────────────────────────────────────────────────────┐
│                   CRAG Dual-Encoder Base                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│      Drug Context          ADR Context                      │
│           │                     │                           │
│           ▼                     ▼                           │
│      ┌──────────┐          ┌──────────┐                     │
│      │PubMedBERT│          │PubMedBERT│  (separate weights) │
│      │   Drug   │          │   ADR    │                     │
│      │ Encoder  │          │ Encoder  │                     │
│      └────┬─────┘          └────┬─────┘                     │
│           │                     │                           │
│           ▼                     ▼                           │
│       [CLS] Pool            [CLS] Pool                      │
│           │                     │                           │
│           └──────────┬──────────┘                           │
│                      │                                      │
│                      ▼                                      │
│               ┌──────────────┐                              │
│               │   Bilinear   │                              │
│               │    Fusion    │                              │
│               └──────┬───────┘                              │
│                      │                                      │
│                      ▼                                      │
│               ┌──────────────┐                              │
│               │   MLP Head   │                              │
│               │   (256→1)    │                              │
│               └──────┬───────┘                              │
│                      │                                      │
│                      ▼                                      │
│                  P(causal)                                  │
└─────────────────────────────────────────────────────────────┘
- Base Model: microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext
- Hidden Dimension: 768
- Fusion Dimension: 256
- Parameters: ~220M (two separate BERT encoders)
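The diagram and dimensions above can be sketched in PyTorch as follows. This is a minimal illustration, not the released implementation: the class name `DualEncoderSketch`, the use of `nn.Bilinear` for the fusion step, the ReLU before the final linear layer, and the sigmoid output are all assumptions. The method names follow the pseudo-code in the Usage section.

```python
import torch
from torch import nn


class DualEncoderSketch(nn.Module):
    """Illustrative dual-encoder scorer; layer choices are assumptions."""

    def __init__(self, drug_encoder, adr_encoder, hidden_dim=768, fusion_dim=256):
        super().__init__()
        # Two encoders with separate weights, one per input side.
        self.drug_encoder = drug_encoder
        self.adr_encoder = adr_encoder
        # Assumed fusion: a bilinear map from two 768-d vectors to 256-d.
        self.fusion = nn.Bilinear(hidden_dim, hidden_dim, fusion_dim)
        # Assumed head: non-linearity followed by a 256 -> 1 projection.
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(fusion_dim, 1))

    @staticmethod
    def _cls_pool(encoder, **inputs):
        # [CLS] pooling: take the hidden state of the first token.
        return encoder(**inputs).last_hidden_state[:, 0]

    def encode_drug(self, **inputs):
        return self._cls_pool(self.drug_encoder, **inputs)

    def encode_adr(self, **inputs):
        return self._cls_pool(self.adr_encoder, **inputs)

    def classify(self, drug_repr, adr_repr):
        fused = self.fusion(drug_repr, adr_repr)
        # One probability per pair in the batch.
        return torch.sigmoid(self.head(fused)).squeeze(-1)
```

To try it with real encoders, one could pass two separate `AutoModel.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext")` instances as `drug_encoder` and `adr_encoder`. The fusion and head layers would then be randomly initialised; loading the released weights requires the architecture definition from the training script.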
Training Procedure
The model was trained in two phases:
Phase 1: Contrastive Pre-training (3 epochs)
- InfoNCE loss with temperature τ=0.07
- Learns to bring true drug-ADR pairs close in embedding space
- Random negative sampling (mismatched pairs)
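To make the contrastive objective concrete, here is a small pure-Python sketch of InfoNCE with τ=0.07. It assumes cosine similarity and treats the other ADR vectors in the same batch as the mismatched negatives; the card does not state either detail, and the function name `info_nce` is hypothetical.

```python
import math


def info_nce(drug_vecs, adr_vecs, temperature=0.07):
    """Mean InfoNCE loss where row i of each list forms a true pair."""

    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    drugs = [normalize(v) for v in drug_vecs]
    adrs = [normalize(v) for v in adr_vecs]

    losses = []
    for i, drug in enumerate(drugs):
        # Cosine similarity to every ADR in the batch, scaled by temperature.
        logits = [sum(x * y for x, y in zip(drug, adr)) / temperature for adr in adrs]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of the true partner (index i).
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

The loss is near zero when each drug vector is closest to its own ADR vector and grows when a mismatched ADR is closer, which is the behaviour the second bullet describes.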
Phase 2: Classification Fine-tuning (5 epochs)
- Binary cross-entropy loss
- Balanced positive/negative samples
- Learning rate: 2e-5 with linear warmup
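The learning-rate schedule can be sketched as a plain function. Only the 2e-5 peak and the linear warmup come from this card; the number of warmup steps and the linear decay to zero after warmup are assumptions, and `linear_warmup_lr` is a hypothetical helper name.

```python
def linear_warmup_lr(step, total_steps, warmup_steps, peak_lr=2e-5):
    """Learning rate at `step`: linear ramp to peak_lr, then linear decay to zero."""
    if step < warmup_steps:
        # Warmup: scale linearly from 0 up to the peak.
        return peak_lr * step / max(1, warmup_steps)
    # Assumed decay: scale linearly from the peak down to 0.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)
```

In a `transformers` training loop, `get_linear_schedule_with_warmup` produces the same shape of schedule when attached to an optimizer.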
Training Data
- Dataset: ADE Corpus V2
- Configuration: Ade_corpus_v2_drug_ade_relation
- Training Examples: ~6,800 positive pairs + ~6,800 negative pairs
- Validation Examples: ~850 pairs
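One way a balanced set of mismatched negatives could be built from the positive pairs is sketched below. The exact sampling procedure used for this model is not documented here, so treat the function `sample_mismatched_negatives`, its parameters, and the toy pairs in the usage note as illustrative assumptions.

```python
import random


def sample_mismatched_negatives(positives, seed=0, max_tries=100):
    """Return up to one negative (drug, effect) pair per positive pair."""
    rng = random.Random(seed)
    known = set(positives)
    effects = [effect for _, effect in positives]

    negatives = []
    for drug, _ in positives:
        for _ in range(max_tries):
            # Re-pair the drug with an effect drawn from the whole corpus.
            candidate = (drug, rng.choice(effects))
            # Reject candidates that are annotated positives.
            if candidate not in known:
                negatives.append(candidate)
                break
        # If no valid candidate is found, this drug gets no negative.
    return negatives
```

For example, calling it on `[("aspirin", "gastrointestinal bleeding"), ("warfarin", "bruising"), ("metformin", "lactic acidosis")]` yields three re-paired negatives. Random mismatches can occasionally produce a pair that is a real but unannotated reaction, which is one reason harder negative-selection strategies exist.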
Performance
| Metric | Value |
|---|---|
| F1 Score | 88.3% |
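For readers reproducing the metric, F1 can be computed from predicted probabilities as in the sketch below. The 0.5 decision threshold is an assumption (the card does not state the threshold behind the reported score), and `f1_at_threshold` is a hypothetical helper.

```python
def f1_at_threshold(probs, labels, threshold=0.5):
    """Return (precision, recall, f1) for binary labels at a given threshold."""
    predicted = [p >= threshold for p in probs]
    tp = sum(1 for hit, y in zip(predicted, labels) if hit and y == 1)
    fp = sum(1 for hit, y in zip(predicted, labels) if hit and y == 0)
    fn = sum(1 for hit, y in zip(predicted, labels) if not hit and y == 1)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```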
Comparison with CRAG Family
| Model | F1 | AUC | Key Features |
|---|---|---|---|
| CRAG-dual-encoder-base | 88.3% | - | PubMedBERT, random negatives |
| CRAG-dual-encoder-ade | 97.5% | 99.1% | BioLinkBERT, hard negatives, focal loss |
| CRAG-dual-encoder-mimicause | 98.9% | 99.8% | + MIMICause causal reasoning |
Usage
import torch
from transformers import AutoTokenizer, AutoModel
# Load model (custom architecture - need to define DualEncoderModel class)
# See training script for architecture definition
tokenizer = AutoTokenizer.from_pretrained("chrisvoncsefalvay/CRAG-dual-encoder-base")
# Example: Score a drug-ADR pair
drug_context = "Patient was prescribed aspirin for pain management."
adr_context = "The patient experienced gastrointestinal bleeding."
# Tokenize
drug_inputs = tokenizer(drug_context, return_tensors="pt", max_length=128, truncation=True, padding="max_length")
adr_inputs = tokenizer(adr_context, return_tensors="pt", max_length=128, truncation=True, padding="max_length")
# Forward pass (pseudo-code - requires loading custom model)
# drug_repr = model.encode_drug(**drug_inputs)
# adr_repr = model.encode_adr(**adr_inputs)
# score = model.classify(drug_repr, adr_repr)
Intended Uses
Primary Use Cases
- Pharmacovigilance: Automated extraction of drug-ADR relationships from literature
- Causal Graph Construction: Building drug-ADR knowledge graphs for safety analysis
- Literature Mining: Screening biomedical publications for adverse event reports
- Clinical Decision Support: Identifying potential drug safety signals
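As an illustration of the graph-construction use case, scored pairs can be turned into a simple drug-to-ADR adjacency map. The threshold value, the choice to keep the maximum score per edge, and the name `build_adr_graph` are assumptions made for this sketch.

```python
from collections import defaultdict


def build_adr_graph(scored_pairs, threshold=0.5):
    """Map each drug to {adr: best score} for pairs at or above threshold."""
    graph = defaultdict(dict)
    for drug, adr, score in scored_pairs:
        if score >= threshold:
            # Keep the strongest evidence seen for each drug-ADR edge.
            graph[drug][adr] = max(score, graph[drug].get(adr, 0.0))
    return dict(graph)
```

Each `(drug, adr, score)` triple would come from scoring one pair of contexts with the model; edges produced this way are candidates for expert review, not confirmed causal links.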
Out-of-Scope Uses
- Direct clinical decision-making without human review
- Diagnosis or treatment recommendations
- Processing non-English text
- Identifying drug-drug interactions (different task)
Limitations
- English Only: Trained exclusively on English biomedical text
- Domain Specific: Optimized for drug-ADR relationships; may not generalize to other biomedical relations
- Context Dependency: Requires both the drug and the ADR to be mentioned in related contexts
- Base Model Performance: This base version achieves 88.3% F1; consider using CRAG-dual-encoder-ade or CRAG-dual-encoder-mimicause for production use
Ethical Considerations
- Model predictions should be validated by domain experts before use in clinical or regulatory settings
- False negatives may miss important safety signals; false positives may trigger unnecessary reviews
- The model reflects biases present in the training data (ADE Corpus V2, sourced from MEDLINE)
Citation
@misc{crag-dual-encoder-2024,
title={CRAG: Causal Reasoning for Adversomics Graphs - Dual-Encoder Models for Drug-ADR Relation Extraction},
author={von Csefalvay, Chris},
year={2024},
publisher={Hugging Face},
url={https://huggingface.co/chrisvoncsefalvay/CRAG-dual-encoder-base}
}
Model Card Authors
Chris von Csefalvay (@chrisvoncsefalvay)
Model Card Contact
For questions or issues, please open a discussion on this model's repository or contact chris@chrisvoncsefalvay.com.