ChainWeave

An end-to-end Bitcoin blockchain analytics platform covering automated parsing, graph construction, and ML experimentation.

ChainWeave ingests raw Bitcoin blockchain data, transforms it into behavioral subgraphs via a distributed ETL pipeline, and classifies those subgraphs using 23 graph neural network architectures — enabling researchers and analysts to identify patterns like illicit exchanges, darknet activity, and mixing services at scale.

Vision

A future where every blockchain transaction is transparent, every illicit pattern is detectable, and graph intelligence bridges the gap between raw on-chain data and actionable insight — empowering enterprises to protect financial systems and researchers to advance the science of graph analytics.

Mission

ChainWeave transforms public blockchain data into structured graph representations and applies state-of-the-art graph neural networks to uncover behavioral patterns at scale. The project serves two complementary missions:

Research. Advance the frontier of graph machine learning on real-world financial networks by publishing reproducible studies, benchmarking novel GNN architectures against established baselines, and releasing open datasets and tooling to the academic community.
Application. Deliver production-grade graph AI capabilities for fraud detection, anti-money laundering (AML), sanctions compliance, and risk scoring — helping financial institutions, exchanges, and regulators identify illicit activity faster and with greater precision than traditional rule-based systems.

Why ChainWeave?

Blockchain analytics today is fragmented: parsing tools, graph databases, and ML frameworks live in separate silos, requiring brittle glue code and manual hand-offs. ChainWeave unifies the entire pipeline — from raw blk*.dat files to interactive experiment dashboards — in a single, reproducible platform.

Key Capabilities

Automated blockchain parsing — Ingest Bitcoin block files in parallel via Celery workers, writing parsed transactions, inputs, outputs, and SegWit witness data directly to columnar Parquet without an intermediate document store.
Distributed ETL — A 4-stage PySpark pipeline performs UTXO resolution, distributed entity-cluster discovery via GraphFrames connected components, multi-source label enrichment, and node-feature assembly into ML-ready PyTorch Geometric tensors.
23 GNN architectures — From classic (GCN, GIN, GAT) to state-of-the-art (GraphGPS, DiffPool, PNA, SAT), all accessible through a single model registry.
8 swappable readout heads — Subgraph-level pooling strategies (sum, mean, max, attention, sort, Set2Set, GMT, GIN-style) for graph-level classification on entity-resolved clusters.
Downstream ensembles — Plug XGBoost or Random Forest classifiers on top of GNN embeddings for additional classification power.
Full-stack web interface — React dashboard for monitoring infrastructure, launching experiments, and visualizing graph structures.
Async job orchestration — Celery workers handle long-running parsing, ETL, training, and inference jobs with real-time WebSocket progress updates.
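
As a rough illustration of the first step behind the parsing capability above, the sketch below splits one blk*.dat file into raw block payloads using the on-disk framing written by Bitcoin Core: a 4-byte network magic, a 4-byte little-endian length, then the serialized block. The helper name iter_raw_blocks is hypothetical and not taken from ChainWeave; transaction decoding, the Celery task wrapper, and the Parquet write are all omitted.

```python
import struct
from typing import BinaryIO, Iterator

MAINNET_MAGIC = bytes.fromhex("f9beb4d9")  # Bitcoin mainnet network magic


def iter_raw_blocks(stream: BinaryIO) -> Iterator[bytes]:
    """Yield raw serialized blocks from an open blk*.dat stream.

    Hypothetical helper for illustration only (not ChainWeave's parser).
    """
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return  # end of file, or a trailing partial record
        magic = header[:4]
        (size,) = struct.unpack("<I", header[4:])
        if magic != MAINNET_MAGIC:
            return  # zero padding or foreign data: stop reading this file
        payload = stream.read(size)
        if len(payload) < size:
            return  # truncated final record
        yield payload
```

Because each file can be framed independently like this, one task per file is a natural unit of parallel work.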

Architecture

Bitcoin blk*.dat files (5,500+ files)
             |
             v
+--------------------------+
| Celery Blockchain Parser |  Per-file Celery tasks -> row-grouped Parquet
+------------+-------------+  (btc_staging.parquet on shared NFS)
             v
+--------------------------+
| 4-stage PySpark ETL      |  (1) graph_creation: UTXO resolution
| (Spark / DuckDB cluster) |  (2) graph_components: GraphFrames CC entity resolution
|                          |  (3) label_enrichment: 8-source label join
|                          |  (4) feature_assembly: Per-cluster PyG tensors
+------------+-------------+
             v
+--------------------------+
| GNN Training             |  23 architectures x 8 readout heads
| + Downstream Ensembles   |  XGBoost / Random Forest on embeddings
+------------+-------------+
             v
+--------------------------+
| FastAPI + React          |  REST API, WebSocket progress, dashboards
+--------------------------+

A separate text-mining sidecar runs in parallel with the four ETL stages, extracting OP_RETURN, input, witness, and coinbase script text and joining it to wallet labels for behavior-stratified text analysis.
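
To make the OP_RETURN part of the text-mining sidecar concrete, here is a minimal sketch that pulls the first pushed payload out of an OP_RETURN output script and decodes it as text. The function name op_return_text is an assumption made for this example, and the sidecar's real extraction logic may differ. Only direct pushes, OP_PUSHDATA1, and OP_PUSHDATA2 are handled.

```python
from typing import Optional

OP_RETURN = 0x6A
OP_PUSHDATA1 = 0x4C
OP_PUSHDATA2 = 0x4D


def op_return_text(script_hex: str) -> Optional[str]:
    """Return the first pushed payload of an OP_RETURN script as text.

    Hypothetical helper; returns None for any script it cannot interpret.
    """
    script = bytes.fromhex(script_hex)
    if len(script) < 2 or script[0] != OP_RETURN:
        return None
    opcode = script[1]
    if 0x01 <= opcode <= 0x4B:
        # Direct push: the opcode itself is the payload length.
        length, pos = opcode, 2
    elif opcode == OP_PUSHDATA1 and len(script) >= 3:
        # The next byte holds the payload length.
        length, pos = script[2], 3
    elif opcode == OP_PUSHDATA2 and len(script) >= 4:
        # The next two bytes hold the length, little-endian.
        length, pos = int.from_bytes(script[2:4], "little"), 4
    else:
        return None
    payload = script[pos:pos + length]
    if len(payload) < length:
        return None  # truncated push
    # Payloads are arbitrary bytes, so replace undecodable ones instead of failing.
    return payload.decode("utf-8", errors="replace")
```

For example, op_return_text("6a0568656c6c6f") decodes the five pushed bytes to "hello", while a script that does not start with OP_RETURN yields None.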

ETL Pipeline

The PySpark / DuckDB ETL transforms raw blockchain data into ML-ready graph tensors in four stages:

Stage	Input	Output	Purpose
1. graph_creation	btc_staging.parquet	btc_resolved_inputs / btc_exploded_outputs	UTXO resolution: shuffle-hash join inputs to outputs on (tx_id, pos)
2. graph_components	Resolved inputs	btc_connected_components.parquet	Distributed entity resolution via GraphFrames connected components (common-input-ownership heuristic)
3. label_components	CC + 8 label sources	btc_labeled_components.parquet	Multi-source enrichment: WalletExplorer, OFAC SDN, BitcoinAbuse, BitcoinHeist, Real-CATS, BABD-13, Elliptic++, manual curation
4. build_node_features	Labeled CC + per-address features	per-cluster PyG Data objects	Subgraph-level training tensors
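
The first two stages can be sketched in plain Python on toy data. This is an in-memory stand-in, not the pipeline's code: a dictionary lookup replaces the Spark join on (tx_id, pos), and a union-find replaces GraphFrames connected components. The function names resolve_inputs and cluster_addresses are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Outpoint = Tuple[str, int]  # (tx_id, pos) of a previous output


def resolve_inputs(
    outputs: Dict[Outpoint, str],
    inputs: Iterable[Tuple[str, Outpoint]],
) -> Dict[str, List[str]]:
    """Map each spending tx_id to the addresses that funded its inputs.

    In-memory equivalent of joining inputs to outputs on (tx_id, pos).
    """
    spenders: Dict[str, List[str]] = defaultdict(list)
    for spending_tx, outpoint in inputs:
        address = outputs.get(outpoint)
        if address is not None:  # skip coinbase or unresolved outpoints
            spenders[spending_tx].append(address)
    return dict(spenders)


def cluster_addresses(spenders: Dict[str, List[str]]) -> Dict[str, str]:
    """Common-input-ownership clustering: address -> cluster representative."""
    parent: Dict[str, str] = {}

    def find(a: str) -> str:
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for addresses in spenders.values():
        root = find(addresses[0])
        for other in addresses[1:]:
            parent[find(other)] = root  # co-spent addresses share one entity
    return {a: find(a) for a in parent}
```

Addresses that ever appear together as inputs of one transaction end up with the same representative, which is the heuristic stage 2 applies at scale.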

GNN Model Zoo

ChainWeave includes 23 graph neural network architectures organized into four tiers, all accessible through a single build_model() factory:

Tier	Models	Description
Classic	GCN, GIN, GAT	Foundational 2-layer architectures
Advanced	DeepGCN, GraphSAGE, GATv2, EdgeGNN, GraphTransformer, ChebNet, DeepGIN, GatedGCN	3-4 layer models with residual connections, attention, spectral methods
Benchmark	EvolveGCN, GATResNet, MultiDistanceGCN, LayerWeightedGCN, WaveletGCN, DGI-GIN	Architectures from published Bitcoin / Elliptic classification papers
SOTA	GINE, PNA, GraphGPS, DiffPool, MinCutPool, GMT-GIN, SAT	State-of-the-art graph-level classification models
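
A single-factory design like the one named above is often built on a small registry. The sketch below shows one way that could look. Only the name build_model comes from this README: its signature, the register_model decorator, and the GCNStub class are assumptions made for illustration. A plain class stands in for a real torch.nn.Module so the example runs without PyTorch.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a lowercase model name to its constructor.
_MODEL_REGISTRY: Dict[str, Callable[..., object]] = {}


def register_model(name: str):
    """Class decorator that records a model under a case-insensitive name."""
    def decorator(cls):
        key = name.lower()
        if key in _MODEL_REGISTRY:
            raise ValueError(f"duplicate model name: {name}")
        _MODEL_REGISTRY[key] = cls
        return cls
    return decorator


def build_model(name: str, **kwargs):
    """Instantiate a registered model; this signature is an assumption."""
    try:
        cls = _MODEL_REGISTRY[name.lower()]
    except KeyError:
        known = ", ".join(sorted(_MODEL_REGISTRY))
        raise ValueError(f"unknown model {name!r}; registered: {known}") from None
    return cls(**kwargs)


@register_model("GCN")
class GCNStub:
    """Placeholder standing in for a real torch.nn.Module subclass."""

    def __init__(self, hidden_dim: int = 64, num_layers: int = 2):
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
```

With this pattern, adding an architecture means decorating one class, and experiment configs can refer to models by name alone, for example build_model("gcn", hidden_dim=32).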

Eight swappable graph-level readout heads — sum, mean, max, global attention, sort pool, Set2Set, GMT (Graph Multiset Transformer), and GIN-style concatenation — can be paired with any architecture, and downstream XGBoost or Random Forest classifiers can be trained on the resulting GNN embeddings.
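
The three simplest readouts reduce a cluster's per-node embeddings to one fixed-size vector, which is what lets subgraphs of different sizes share a classifier. The sketch below shows sum, mean, and max pooling in plain Python on lists of floats. In PyTorch Geometric the corresponding operations are global_add_pool, global_mean_pool, and global_max_pool. The READOUTS table and pool_graph function are hypothetical names, and the learned heads (attention, Set2Set, GMT) are not covered.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def sum_readout(nodes: Sequence[Vector]) -> Vector:
    """Element-wise sum over all node embeddings."""
    return [sum(col) for col in zip(*nodes)]


def mean_readout(nodes: Sequence[Vector]) -> Vector:
    """Element-wise mean over all node embeddings."""
    return [sum(col) / len(nodes) for col in zip(*nodes)]


def max_readout(nodes: Sequence[Vector]) -> Vector:
    """Element-wise maximum over all node embeddings."""
    return [max(col) for col in zip(*nodes)]


# Hypothetical lookup table that makes the readout swappable by name.
READOUTS: Dict[str, Callable[[Sequence[Vector]], Vector]] = {
    "sum": sum_readout,
    "mean": mean_readout,
    "max": max_readout,
}


def pool_graph(nodes: Sequence[Vector], readout: str = "mean") -> Vector:
    """Collapse one subgraph's node embeddings into a graph-level vector."""
    if not nodes:
        raise ValueError("cannot pool an empty subgraph")
    return READOUTS[readout](nodes)
```

Sum pooling preserves information about cluster size, while mean and max are size-invariant, which is one reason the choice of readout can affect graph-level classification.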

Research Agenda

Research Area	Key Questions
GNN architecture benchmarking	Which architectures best capture illicit transaction patterns? How do readout strategies affect graph-level classification?
Temporal graph learning	How do transaction graph structures evolve over time? Can temporal GNNs detect emerging laundering patterns before they fully materialize?
Adversarial robustness	How resilient are GNN-based detectors to adversarial manipulation (e.g., synthetic transactions designed to camouflage illicit flows)?
Cross-chain analytics	Can graph representations generalize across Bitcoin, Ethereum, and other UTXO / account-model chains?
Explainability	What subgraph structures drive a model's illicit classification? Can we produce human-interpretable evidence for investigators?
Scalability	How do GNN training and inference scale to billion-edge transaction graphs? What sampling, partitioning, and distributed strategies are effective?

Tech Stack

Layer	Technologies
ML / GNN	PyTorch, PyTorch Geometric, XGBoost, scikit-learn
Data	PySpark, GraphFrames, DuckDB, MongoDB, PostgreSQL, Parquet
Backend	FastAPI, Celery, Redis, Uvicorn
Frontend	React 18, TypeScript, Vite, React Query, Recharts, D3, react-force-graph-2d
Orchestration	Dagster (paper-side weekly snapshots), Docker Compose, Apache Spark
CI / CD	GitHub Actions, Playwright, Bandit, Semgrep, Trivy