Sharing our implementation of automated risk scoring for release decisions in cb-22. Manual risk assessment was taking 2-3 days per release, and inconsistent criteria were letting bad releases slip through.
We built a risk engine that combines defect severity, test coverage, and code churn into a weighted risk score. The system automatically blocks releases above a threshold and requires an executive override to proceed.
Core calculation logic:
// Pseudocode - Risk scoring implementation:
1. Query all open defects linked to release (severity weights: critical=10, major=5, minor=1)
2. Calculate test coverage percentage from test execution results
3. Count code commits in release branch (high churn = higher risk)
4. Apply weighted formula: risk = (defectScore * 0.4) + ((100 - coverage) * 0.35) + (churnScore * 0.25)
5. Compare against auto-approval threshold (score < 30) or escalation threshold (score > 70)
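For anyone who wants to play with the numbers, here's a minimal Python sketch of the steps above (function and constant names are mine, not from our production script):

```python
# Severity weights and thresholds from the steps above
SEVERITY_WEIGHTS = {"critical": 10, "major": 5, "minor": 1}
AUTO_APPROVE_MAX = 30   # below this: auto-approve
ESCALATION_MIN = 70     # above this: executive escalation

def risk_score(defect_severities, coverage_pct, churn_score):
    """Weighted release risk: defects 40%, coverage gap 35%, churn 25%."""
    defect_score = sum(SEVERITY_WEIGHTS[s] for s in defect_severities)
    return (defect_score * 0.4
            + (100 - coverage_pct) * 0.35
            + churn_score * 0.25)

def decision(score):
    if score < AUTO_APPROVE_MAX:
        return "auto-approve"
    if score > ESCALATION_MIN:
        return "escalate"
    return "manual-review"

# Two major defects, 90% coverage, churn score 20:
# 10*0.4 + 10*0.35 + 20*0.25 ~= 12.5
print(decision(risk_score(["major", "major"], 90, 20)))  # auto-approve
```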
This eliminated our manual bottleneck and reduced production incidents by 60% in six months. Anyone else implementing similar risk models?
Impressive results! How do you handle the scenario where test coverage is high but all tests are low-quality? We tried a similar approach but found that coverage percentage alone doesn’t capture test effectiveness. Did you factor in test assertion depth or historical test failure rates?
Excellent discussion - let me share the complete implementation details addressing all the focus areas:
Risk Model Architecture:
Our model evolved through three iterations based on 18 months of historical incident data. We analyzed 240 releases and correlated incident severity with various metrics. The final model includes five weighted components:
- Defect Risk (40%): Critical defects = 10pts, Major = 5pts, Minor = 1pt, weighted by age (older defects score higher)
- Test Coverage Quality (35%): Not just coverage % but adjusted for test effectiveness - assertion count, historical stability, execution time patterns
- Code Churn (15%): Commit count + lines changed, normalized by team velocity
- Historical Release Performance (7%): Team’s past 6-month incident rate
- Environmental Factors (3%): Day-of-week, time-of-day, deployment window penalties
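The five components reduce to a simple weighted sum. A hedged sketch (sub-score computation omitted; each component is assumed pre-normalized to a 0-100 scale):

```python
# Weights from our v3 model (sum to 1.0)
WEIGHTS = {
    "defect": 0.40,            # severity- and age-weighted defect score
    "coverage_quality": 0.35,  # effectiveness-adjusted coverage gap
    "churn": 0.15,             # commits + lines changed, velocity-normalized
    "history": 0.07,           # team's 6-month incident rate
    "environment": 0.03,       # deployment-window penalties
}

def release_risk(components):
    """components: dict of 0-100 sub-scores keyed like WEIGHTS."""
    missing = set(WEIGHTS) - set(components)
    if missing:
        raise ValueError(f"missing components: {missing}")
    return sum(WEIGHTS[k] * components[k] for k in WEIGHTS)
```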
Weighted Scoring Calibration:
We used logistic regression on the historical data to determine optimal weights. Initial weights followed the common industry split (defects=50%, tests=50%), but our data showed test quality was under-weighted. We ran 1,000+ simulations and validated against a holdout set. The current weights achieved 87% accuracy in predicting which releases would cause incidents.
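To illustrate the calibration idea (this is not our actual pipeline - we ran a proper stats package over the 240 releases - just a toy fit on made-up data showing how relative weights fall out of release outcomes):

```python
import math

def predict(w, b, x):
    """Logistic model probability of an incident."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Toy SGD logistic regression; stands in for the real fit
    we ran on 18 months of release/incident data."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = predict(w, b, xi) - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

# Illustrative data: [normalized defect score, coverage gap] -> incident?
X = [[0.1, 0.1], [0.2, 0.3], [0.3, 0.2], [0.7, 0.8], [0.8, 0.7], [0.9, 0.9]]
y = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(X, y)

# Normalized magnitudes suggest each factor's relative weight
total = sum(abs(wi) for wi in w)
relative = [abs(wi) / total for wi in w]
```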
Auto-approval thresholds:
- Score 0-30: Auto-approve (green)
- Score 31-70: Manager review required (yellow)
- Score 71-100: Executive escalation + mandatory risk mitigation plan (red)
We recalibrate quarterly using new incident data.
Auto-Threshold Implementation & Integration:
The automation runs as a codebeamer workflow script triggered on release status change to ‘Ready for Deployment’:
- Script queries linked items: defects (by severity), test cases (with execution results), commits (via git integration)
- Calculates component scores and weighted total
- Updates custom field ‘Risk Score’ on release item
- Workflow transition rules enforce thresholds:
  - Green: automatic transition to ‘Approved’
  - Yellow: requires ‘Release Manager’ role approval
  - Red: requires ‘VP Engineering’ approval + mandatory ‘Risk Mitigation Plan’ attachment
CI/CD integration via REST API: Jenkins polls the release status and risk score before deployment. If the score is above 70, the pipeline pauses and sends a Slack notification to the escalation channel. We use codebeamer’s webhook to notify Jenkins when a high-risk release receives executive override approval.
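Stripped of the REST plumbing, the Jenkins-side gate boils down to a small decision function (the codebeamer API call itself is elided; status and field names mirror our workflow but are otherwise illustrative):

```python
def pipeline_action(release_status, risk_score, has_executive_override=False):
    """Decide what the deployment pipeline does with a release.
    In our setup Jenkins polls codebeamer for the release item's
    status and its 'Risk Score' custom field before deploying."""
    if release_status != "Approved":
        return "wait"                 # still inside the workflow gates
    if risk_score > 70 and not has_executive_override:
        return "pause-and-notify"     # Slack message to escalation channel
    return "deploy"
```

The override flag flips via the webhook: when the VP approves a red release in codebeamer, the webhook tells Jenkins to resume.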
Test Quality Enhancements:
We exclude zero-assertion tests from coverage calculation and flag them for review. Tests with >30% historical failure rate get 0.5x weight in coverage score. We also implemented test execution time analysis - tests that suddenly run 3x slower indicate potential issues and add 5 points to risk score.
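A sketch of that coverage adjustment (field names are made up; whether the 5-point slowdown penalty applies per test or once in total is a judgment call - here it's per test):

```python
def adjusted_coverage(tests, raw_coverage_pct):
    """Adjust raw coverage % for test quality and compute a slowdown
    penalty to add to the risk score.
    tests: dicts with 'assertions', 'fail_rate', 'runtime', 'baseline'."""
    scored = [t for t in tests if t["assertions"] > 0]  # drop zero-assertion tests
    if not scored:
        return 0.0, 0
    # Flaky tests (>30% historical failure rate) count at half weight
    weight = sum(0.5 if t["fail_rate"] > 0.30 else 1.0 for t in scored)
    coverage = raw_coverage_pct * weight / len(tests)
    # Tests suddenly running 3x slower than baseline each add 5 risk points
    penalty = 5 * sum(1 for t in scored if t["runtime"] > 3 * t["baseline"])
    return coverage, penalty
```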
Results & Lessons Learned:
- 60% reduction in production incidents (from 2.5/month to 1.0/month average)
- Release approval time reduced from 2-3 days to 4 hours for low-risk releases
- 23% of releases now auto-approve (score <30)
- False positive rate: 12% (releases flagged high-risk but no incidents)
- Most important learning: Weight adjustments matter more than adding complexity. Our v1 model had 12 factors but performed worse than the streamlined 5-factor v3.
The system has paid for itself 4x in reduced incident costs and faster release cycles. Happy to share the workflow script template if anyone wants to implement similar automation.
Great question. We enhanced the model in phase 2 to include test quality metrics. We track assertion count per test case and historical pass/fail patterns. Tests with zero assertions get excluded from the coverage calculation, and tests with >30% historical failure rate get flagged for review. This improved our incident reduction by roughly another 8%. The weighted scoring adjusts for test quality, so high coverage built on low-quality tests doesn’t lower the risk score as much.