ChainWeave


An all-inclusive Bitcoin blockchain analytics platform with automated parsing, graph construction, and ML experimentation.

ChainWeave ingests raw Bitcoin blockchain data, transforms it into behavioral subgraphs via a distributed ETL pipeline, and classifies those subgraphs using 23 graph neural network architectures — enabling researchers and analysts to identify patterns like illicit exchanges, darknet activity, and mixing services at scale.


Vision

A future where every blockchain transaction is transparent, every illicit pattern is detectable, and graph intelligence bridges the gap between raw on-chain data and actionable insight — empowering enterprises to protect financial systems and researchers to advance the science of graph analytics.

Mission

ChainWeave transforms open-source blockchain data into structured graph representations and applies state-of-the-art graph neural networks to uncover behavioral patterns at scale. The project serves two complementary missions:

  • Research. Advance the frontier of graph machine learning on real-world financial networks by publishing reproducible studies, benchmarking novel GNN architectures against established baselines, and releasing open datasets and tooling to the academic community.
  • Application. Deliver production-grade graph AI capabilities for fraud detection, anti-money laundering (AML), sanctions compliance, and risk scoring — helping financial institutions, exchanges, and regulators identify illicit activity faster and with greater precision than traditional rule-based systems.

Why ChainWeave?

Blockchain analytics today is fragmented: parsing tools, graph databases, and ML frameworks live in separate silos, requiring brittle glue code and manual hand-offs. ChainWeave unifies the entire pipeline — from raw blk*.dat files to interactive experiment dashboards — in a single, reproducible platform.

Key Capabilities

  • Automated blockchain parsing — Ingest Bitcoin block files in parallel via Celery workers, writing parsed transactions, inputs, outputs, and SegWit witness data directly to columnar Parquet without an intermediate document store.
  • Distributed ETL — A 4-stage PySpark pipeline performs UTXO resolution, distributed entity-cluster discovery via GraphFrames connected components, multi-source label enrichment, and node-feature assembly into ML-ready PyTorch Geometric tensors.
  • 23 GNN architectures — From classic (GCN, GIN, GAT) to state-of-the-art (GraphGPS, DiffPool, PNA, SAT), all accessible through a single model registry.
  • 8 swappable readout heads — Subgraph-level pooling strategies (sum, mean, max, attention, sort, Set2Set, GMT, GIN-style) for graph-level classification on entity-resolved clusters.
  • Downstream ensembles — Plug XGBoost or Random Forest classifiers on top of GNN embeddings for additional classification power.
  • Full-stack web interface — React dashboard for monitoring infrastructure, launching experiments, and visualizing graph structures.
  • Async job orchestration — Celery workers handle long-running parsing, ETL, training, and inference jobs with real-time WebSocket progress updates.
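
As a rough illustration of the parsing step, the sketch below walks the records of a single blk*.dat file and computes each block's hash. It relies only on the well-known on-disk framing (4 magic bytes, a 4-byte little-endian length, then the raw block, whose first 80 bytes are the header). The function names `iter_raw_blocks` and `block_hash` are hypothetical and are not taken from the ChainWeave codebase. The real parser also decodes transactions, inputs, outputs, and witness data, which is omitted here.

```python
import hashlib
import struct
from typing import Iterator

MAINNET_MAGIC = bytes.fromhex("f9beb4d9")  # Bitcoin mainnet message-start bytes
HEADER_SIZE = 80  # fixed size of a Bitcoin block header in bytes


def iter_raw_blocks(data: bytes, magic: bytes = MAINNET_MAGIC) -> Iterator[bytes]:
    """Yield each raw block payload found in the bytes of one blk*.dat file."""
    offset = 0
    while offset + 8 <= len(data):
        if data[offset:offset + 4] != magic:
            # Block files may contain zero padding between records;
            # advance one byte at a time until the next magic marker.
            offset += 1
            continue
        (size,) = struct.unpack_from("<I", data, offset + 4)
        start = offset + 8
        end = start + size
        if end > len(data):
            break  # truncated trailing record, stop here
        yield data[start:end]
        offset = end


def block_hash(raw_block: bytes) -> str:
    """Double SHA-256 of the 80-byte header, shown in the usual reversed hex form."""
    header = raw_block[:HEADER_SIZE]
    digest = hashlib.sha256(hashlib.sha256(header).digest()).digest()
    return digest[::-1].hex()
```

In a per-file Celery task, a function like `iter_raw_blocks` would be called on the contents of one block file, which is what makes the work easy to fan out across workers.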

Architecture

Bitcoin blk*.dat files (5,500+ files)
             |
             v
+--------------------------+
| Celery Blockchain Parser |  Per-file Celery tasks -> row-grouped Parquet
+------------+-------------+  (btc_staging.parquet on shared NFS)
             v
+--------------------------+
| 4-stage PySpark ETL      |  (1) graph_creation:   UTXO resolution
| (Spark / DuckDB cluster) |  (2) graph_components: GraphFrames CC entity res
|                          |  (3) label_enrichment: 8-source label join
|                          |  (4) feature_assembly: Per-cluster PyG tensors
+------------+-------------+
             v
+--------------------------+
| GNN Training             |  23 architectures x 8 readout heads
| + Downstream Ensembles   |  XGBoost / Random Forest on embeddings
+------------+-------------+
             v
+--------------------------+
| FastAPI + React          |  REST API, WebSocket progress, dashboards
+--------------------------+

A separate text-mining sidecar runs in parallel with the four ETL stages, extracting OP_RETURN, input, witness, and coinbase script text and joining it to wallet labels for behavior-stratified text analysis.
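
To make the OP_RETURN part of the sidecar concrete, here is a minimal sketch of pulling the pushed data out of an output script and keeping it only if it decodes to mostly printable text. The opcode values are standard Bitcoin Script constants. The function names and the 0.9 printable-ratio threshold are assumptions chosen for illustration, not ChainWeave's actual extraction rules.

```python
from typing import List, Optional

OP_RETURN = 0x6A
# Explicit push opcodes mapped to the width of their length field in bytes.
PUSHDATA_WIDTHS = {0x4C: 1, 0x4D: 2, 0x4E: 4}  # OP_PUSHDATA1 / 2 / 4


def op_return_payload(script: bytes) -> Optional[bytes]:
    """Return the concatenated pushed data of an OP_RETURN script, else None."""
    if not script or script[0] != OP_RETURN:
        return None
    chunks: List[bytes] = []
    i = 1
    while i < len(script):
        opcode = script[i]
        i += 1
        if 0x01 <= opcode <= 0x4B:
            length = opcode  # direct push: the opcode itself is the byte count
        elif opcode in PUSHDATA_WIDTHS:
            width = PUSHDATA_WIDTHS[opcode]
            if i + width > len(script):
                break  # malformed length field
            length = int.from_bytes(script[i:i + width], "little")
            i += width
        else:
            break  # any non-push opcode ends the data section in this sketch
        chunks.append(script[i:i + length])  # a truncated push yields fewer bytes
        i += length
    return b"".join(chunks)


def printable_text(payload: bytes, min_ratio: float = 0.9) -> Optional[str]:
    """Decode as UTF-8 and keep the text only if it is mostly printable."""
    try:
        text = payload.decode("utf-8")
    except UnicodeDecodeError:
        return None
    if not text:
        return None
    ratio = sum(ch.isprintable() for ch in text) / len(text)
    return text if ratio >= min_ratio else None
```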


ETL Pipeline

The PySpark / DuckDB ETL transforms raw blockchain data into ML-ready graph tensors in four stages:

Stage | Input | Output | Purpose
1. graph_creation | btc_staging.parquet | btc_resolved_inputs / btc_exploded_outputs | UTXO resolution: shuffle-hash join of inputs to outputs on (tx_id, pos)
2. graph_components | Resolved inputs | btc_connected_components.parquet | Distributed entity resolution via GraphFrames connected components (common-input-ownership heuristic)
3. label_components | CC + 8 label sources | btc_labeled_components.parquet | Multi-source enrichment: WalletExplorer, OFAC SDN, BitcoinAbuse, BitcoinHeist, Real-CATS, BABD-13, Elliptic++, manual curation
4. build_node_features | Labeled CC + per-address features | Per-cluster PyG Data objects | Subgraph-level training tensors
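
The first two stages can be pictured with a small single-machine stand-in. The sketch below joins inputs to the outputs they spend on (tx_id, pos), then merges addresses that co-occur as inputs of the same transaction (the common-input-ownership heuristic). It uses a plain dictionary lookup and a union-find structure in place of the Spark shuffle-hash join and GraphFrames connected components, so it shows the logic only, not the distributed implementation. All function names and tuple layouts here are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

OutPoint = Tuple[str, int]           # (tx_id, pos) of a previous output
InputRow = Tuple[str, str, int]      # (spending_tx_id, prev_tx_id, prev_pos)


def resolve_inputs(inputs: List[InputRow],
                   outputs: Dict[OutPoint, str]) -> List[Tuple[str, str]]:
    """Stage 1 stand-in: attach the spent output's address to each input."""
    resolved = []
    for spending_tx, prev_tx, prev_pos in inputs:
        address = outputs.get((prev_tx, prev_pos))
        if address is not None:  # inputs with no matching output are dropped
            resolved.append((spending_tx, address))
    return resolved


class UnionFind:
    """Disjoint-set forest with path halving."""

    def __init__(self) -> None:
        self.parent: Dict[str, str] = {}

    def find(self, x: str) -> str:
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a: str, b: str) -> None:
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def cluster_addresses(resolved: List[Tuple[str, str]]) -> List[Set[str]]:
    """Stage 2 stand-in: addresses spent together in one tx share a cluster."""
    by_tx: Dict[str, List[str]] = defaultdict(list)
    for tx_id, address in resolved:
        by_tx[tx_id].append(address)
    uf = UnionFind()
    for addresses in by_tx.values():
        first = addresses[0]
        uf.find(first)  # register single-input addresses as their own cluster
        for other in addresses[1:]:
            uf.union(first, other)
    clusters: Dict[str, Set[str]] = defaultdict(set)
    for address in list(uf.parent):
        clusters[uf.find(address)].add(address)
    return list(clusters.values())
```

Union-find is a convenient way to check a distributed connected-components result on a small sample, since both should produce the same partition of addresses.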

GNN Model Zoo

ChainWeave includes 23 graph neural network architectures organized into four tiers, all accessible through a single build_model() factory:

Tier | Models | Description
Classic | GCN, GIN, GAT | Foundational 2-layer architectures
Advanced | DeepGCN, GraphSAGE, GATv2, EdgeGNN, GraphTransformer, ChebNet, DeepGIN, GatedGCN | 3-4 layer models with residual connections, attention, spectral methods
Benchmark | EvolveGCN, GATResNet, MultiDistanceGCN, LayerWeightedGCN, WaveletGCN, DGI-GIN | Architectures from published Bitcoin / Elliptic classification papers
SOTA | GINE, PNA, GraphGPS, DiffPool, MinCutPool, GMT-GIN, SAT | State-of-the-art graph-level classification models

Eight swappable graph-level readout heads — sum, mean, max, global attention, sort pool, Set2Set, GMT (Graph Multiset Transformer), and GIN-style concatenation — can be paired with any architecture, and downstream XGBoost or Random Forest classifiers can be trained on the resulting GNN embeddings.
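
The idea of a swappable readout can be sketched without any ML framework. The example below pools per-node embeddings into one vector per graph using the three simplest strategies (sum, mean, max), selected by name from a small registry. It uses plain Python lists and a PyG-style `batch` vector that maps each node to its graph index. The `READOUTS` registry and the `readout` function are hypothetical names for illustration. The learned heads (attention, Set2Set, GMT) would need trainable parameters and are not shown.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def sum_readout(nodes: Sequence[Vector]) -> Vector:
    """Element-wise sum over all node embeddings of one graph."""
    return [sum(column) for column in zip(*nodes)]


def mean_readout(nodes: Sequence[Vector]) -> Vector:
    """Element-wise mean over all node embeddings of one graph."""
    return [sum(column) / len(nodes) for column in zip(*nodes)]


def max_readout(nodes: Sequence[Vector]) -> Vector:
    """Element-wise maximum over all node embeddings of one graph."""
    return [max(column) for column in zip(*nodes)]


# Name-to-function registry so the pooling strategy is a configuration choice.
READOUTS: Dict[str, Callable[[Sequence[Vector]], Vector]] = {
    "sum": sum_readout,
    "mean": mean_readout,
    "max": max_readout,
}


def readout(node_embeddings: Sequence[Vector],
            batch: Sequence[int],
            name: str) -> List[Vector]:
    """Pool node embeddings into one vector per graph.

    batch[i] is the index of the graph that node i belongs to.
    """
    if not batch:
        return []
    pool = READOUTS[name]
    groups: List[List[Vector]] = [[] for _ in range(max(batch) + 1)]
    for embedding, graph_index in zip(node_embeddings, batch):
        groups[graph_index].append(embedding)
    return [pool(nodes) for nodes in groups]
```

Because every head maps a variable number of node vectors to a fixed-size graph vector, the classifier that follows does not need to know which head produced its input.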


Research Agenda

Research Area | Key Questions
GNN architecture benchmarking | Which architectures best capture illicit transaction patterns? How do readout strategies affect graph-level classification?
Temporal graph learning | How do transaction graph structures evolve over time? Can temporal GNNs detect emerging laundering patterns before they fully materialize?
Adversarial robustness | How resilient are GNN-based detectors to adversarial manipulation (e.g., synthetic transactions designed to camouflage illicit flows)?
Cross-chain analytics | Can graph representations generalize across Bitcoin, Ethereum, and other UTXO / account-model chains?
Explainability | What subgraph structures drive a model's illicit classification? Can we produce human-interpretable evidence for investigators?
Scalability | How do GNN training and inference scale to billion-edge transaction graphs? What sampling, partitioning, and distributed strategies are effective?

Tech Stack

ML / GNN: PyTorch, PyTorch Geometric, XGBoost, scikit-learn
Data: PySpark, GraphFrames, DuckDB, MongoDB, PostgreSQL, Parquet
Backend: FastAPI, Celery, Redis, Uvicorn
Frontend: React 18, TypeScript, Vite, React Query, Recharts, D3, react-force-graph-2d
Orchestration: Dagster (paper-side weekly snapshots), Docker Compose, Apache Spark
CI / CD: GitHub Actions, Playwright, Bandit, Semgrep, Trivy