Training Dataset

The AI model is trained on a hybrid dataset of 35,000 examples, blending manual curation with synthetic generation. The dataset’s composition is as follows:

  • Manual Subset (10%): 3,500 examples labeled by domain experts, covering X posts from various crypto events (e.g., niche project launches such as small DeFi tokens, major token listings on exchanges, unexpected delistings, security hacks, governance voting controversies, and rare events such as Bitcoin halvings). Labels span all 36 parameters and 18 sentiments, with an inter-annotator agreement of 0.88 (Cohen’s kappa; see the sketch after this list).

  • Synthetic Subset (90%): 31,500 examples generated by Tier-1 AI models (Claude Opus, Claude Sonnet, and GPT-4o) fine-tuned on the manual subset. Synthetic data is validated against real-world X samples to ensure it stays faithful to the distribution of organic posts.
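
The agreement figure above is a Cohen’s kappa statistic, which corrects raw percent agreement for chance. The snippet below is a minimal sketch of how such a figure can be computed, assuming two annotators’ labels for one categorical field are available as parallel lists; the annotator names and label values are illustrative placeholders, not drawn from the actual dataset.

```python
# Minimal sketch: chance-corrected agreement between two annotators on one
# categorical label field. Labels below are illustrative placeholders.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["bullish", "bearish", "neutral", "bullish", "neutral", "bearish"]
annotator_b = ["bullish", "bearish", "bullish", "bullish", "neutral", "bearish"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```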


Training:

Fine-tuning: The model is fine-tuned on the 35,000-example dataset using a multi-task learning objective that predicts both precursor parameters and sentiment scores. The loss combines mean squared error (MSE) for the numerical scores with cross-entropy for the categorical labels, with the terms weighted to prioritize real-time prediction accuracy.
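
As a rough illustration of that objective, the sketch below combines an MSE term for the numerical sentiment scores with a cross-entropy term for the categorical parameter labels. The head names and weighting constants are assumptions for illustration, not published values.

```python
# Hedged sketch of a weighted multi-task loss: MSE on numerical sentiment
# scores plus cross-entropy on categorical precursor-parameter labels.
# Weights and tensor shapes are illustrative assumptions.
import torch
import torch.nn as nn

mse_loss = nn.MSELoss()
ce_loss = nn.CrossEntropyLoss()

W_SENTIMENT = 1.0  # assumed weight for the sentiment-regression head
W_PARAMS = 0.5     # assumed weight for the parameter-classification head

def multi_task_loss(sentiment_pred, sentiment_target, param_logits, param_labels):
    """Combine the regression and classification terms into one objective."""
    reg_term = mse_loss(sentiment_pred, sentiment_target)  # numerical scores
    cls_term = ce_loss(param_logits, param_labels)         # categorical labels
    return W_SENTIMENT * reg_term + W_PARAMS * cls_term
```

In practice the two weights would be tuned so that the sentiment head, which drives the real-time predictions, dominates the gradient signal.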

The training pipeline runs on the GPU cluster, with a batch size of 128 and an AdamW optimizer (learning rate 2e-5). Total training time is approximately 120 hours on 8xH100 GPUs, with continuous updates applied weekly based on new X data. The model achieves a mean absolute error (MAE) of 0.8 on sentiment predictions, validated against a holdout set of 5,000 posts.
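
The following is a condensed sketch of that configuration, assuming a PyTorch-style model with separate sentiment and parameter heads; the model interface, dataset objects, and epoch count are placeholders rather than details from the actual pipeline.

```python
# Hedged sketch of the fine-tuning configuration described above: AdamW at a
# 2e-5 learning rate, batch size 128, and MAE on a held-out sentiment set.
# The model interface and datasets are illustrative assumptions.
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader

def fine_tune(model, train_dataset, holdout_dataset, epochs=3, device="cuda"):
    model.to(device)
    optimizer = AdamW(model.parameters(), lr=2e-5)
    train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
    mse, ce = torch.nn.MSELoss(), torch.nn.CrossEntropyLoss()

    model.train()
    for _ in range(epochs):
        for inputs, sentiment_target, param_labels in train_loader:
            inputs = inputs.to(device)
            sentiment_target = sentiment_target.to(device)
            param_labels = param_labels.to(device)

            sentiment_pred, param_logits = model(inputs)
            loss = mse(sentiment_pred, sentiment_target) + ce(param_logits, param_labels)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # Mean absolute error of sentiment predictions on the holdout split.
    model.eval()
    abs_err, count = 0.0, 0
    with torch.no_grad():
        for inputs, sentiment_target, _ in DataLoader(holdout_dataset, batch_size=128):
            sentiment_pred, _ = model(inputs.to(device))
            abs_err += (sentiment_pred - sentiment_target.to(device)).abs().sum().item()
            count += sentiment_target.numel()
    return abs_err / count  # MAE over the holdout posts
```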
