Here is a recovery and prevention plan covering all three critical areas: certificate distribution, device trust store updates, and TLS handshake troubleshooting.
Certificate Distribution (Immediate Recovery):
You need to restore connectivity first, then do the rotation properly. Use the gateway management emergency rollback feature:
POST /gateway/certificates/rollback
{
"targetVersion": "previous",
"scope": "gateway-only",
"preserveDeviceCerts": true
}
This reverts gateway certificates while keeping device certificates unchanged. Devices should reconnect within 5-10 minutes. Once connectivity is restored, follow the proper rotation sequence.
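If you would rather script the rollback than send it by hand, a minimal Python sketch follows. The endpoint path and JSON body are the ones shown above; the base URL, the bearer-token authentication, and the function name `build_rollback_request` are assumptions for illustration, so adjust them to however your gateway management API is actually exposed.

```python
import json
import urllib.request


def build_rollback_request(base_url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the emergency rollback request.

    base_url and token are placeholders: the host and the auth scheme
    are assumptions, not confirmed details of the gateway API.
    """
    body = {
        "targetVersion": "previous",
        "scope": "gateway-only",
        "preserveDeviceCerts": True,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/gateway/certificates/rollback",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + token,
        },
        method="POST",
    )
```

Sending it is then a single call, for example `urllib.request.urlopen(build_rollback_request(base_url, token), timeout=30)`. Keeping the build step separate makes it easy to log or review the exact payload before it goes out.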
Device Trust Store Update (Proper Rotation Sequence):
Now implement the correct multi-phase rotation:
Phase 1 - Add New CA to Trust Store (Days 1-2):
POST /devices/truststore/update
{
"operation": "ADD",
"certificate": "<new_CA_cert>",
"priority": "secondary"
}
Monitor trust store update status:
GET /devices/truststore/status
Wait until all 18 devices report the trust store update as complete before moving on to Phase 2.
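The waiting step in Phase 1 can be automated. The sketch below assumes that GET /devices/truststore/status returns a JSON object with a `devices` array whose entries carry `deviceId` and `status` fields, and that `"COMPLETE"` means the new CA is installed. That response shape is an assumption, and `fetch_status` is a hypothetical callable you would supply to perform the actual GET.

```python
import time
from typing import Callable, Dict, List


def pending_devices(status: Dict) -> List[str]:
    """Return IDs of devices that have not confirmed the trust store update.

    Assumed (hypothetical) response shape:
    {"devices": [{"deviceId": "dev-01", "status": "COMPLETE"}, ...]}
    """
    return [
        d["deviceId"]
        for d in status.get("devices", [])
        if d.get("status") != "COMPLETE"
    ]


def wait_for_truststore(
    fetch_status: Callable[[], Dict],
    expected_count: int,
    poll_seconds: float = 60.0,
    max_polls: int = 120,
) -> bool:
    """Poll until every expected device reports COMPLETE, or give up."""
    for _ in range(max_polls):
        status = fetch_status()
        reported = len(status.get("devices", []))
        # Require the full fleet to be present so a missing device
        # is never mistaken for a finished one.
        if reported >= expected_count and not pending_devices(status):
            return True
        time.sleep(poll_seconds)
    return False
```

For your fleet you would call `wait_for_truststore(fetch_status, expected_count=18)` and only start Phase 2 when it returns `True`.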
Phase 2 - Rotate Gateway Certificates (Day 3):
POST /gateway/certificates/rotate
{
"newCertificate": "<gateway_cert_signed_by_new_CA>",
"validationMode": "strict"
}
Devices now trust both old and new CAs, so they’ll accept the new gateway certificate.
Phase 3 - Remove Old CA (Day 10+):
POST /devices/truststore/update
{
"operation": "REMOVE",
"certificate": "<old_CA_cert>"
}
Only remove the old CA after verifying all devices are successfully connecting with the new certificates.
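One way to make the "verify before removing" rule concrete is a small gate that lists which devices still block Phase 3. This is a sketch under assumptions: the mapping of device ID to last successful handshake is something you would build from your own gateway logs or metrics, and the function name and the 7-day default (matching the Day 3 to Day 10 gap above) are illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def devices_blocking_removal(
    last_new_cert_handshake: Dict[str, Optional[datetime]],
    rotated_at: datetime,
    now: datetime,
    min_overlap: timedelta = timedelta(days=7),
) -> List[str]:
    """List devices not yet proven to work with the new gateway certificate.

    last_new_cert_handshake maps device ID to the time of its most recent
    successful handshake (None if it has never connected).
    """
    if now - rotated_at < min_overlap:
        # Overlap window is still open: treat every device as blocking.
        return sorted(last_new_cert_handshake)
    return sorted(
        device_id
        for device_id, seen in last_new_cert_handshake.items()
        # A handshake before the rotation only proves the old chain worked.
        if seen is None or seen < rotated_at
    )
```

Send the REMOVE request only when this returns an empty list. Any device it names needs attention first, since removing the old CA would not help it and removing it too early could strand others.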
TLS Handshake Troubleshooting:
To diagnose and prevent future handshake failures, implement comprehensive TLS monitoring:
- Enable detailed TLS logging:
GatewayConfig.tls.debugLevel=VERBOSE
GatewayConfig.tls.logHandshakes=true
- Set up handshake failure alerts:
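// Pseudocode: tlsError and deviceId are taken from the handshake failure event; alert() stands in for your notification hook.
if (tlsError.includes('certificate verify failed')) {
alert('Trust store mismatch on device: ' + deviceId);
}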
- Implement certificate chain validation testing:
POST /gateway/certificates/validate
{
"deviceId": "test_device",
"certificateChain": "<new_chain>",
"dryRun": true
}
Run this validation before actual rotation to catch trust store issues.
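For the handshake failure alerts, it helps to aggregate failures per device instead of alerting on every single event. The sketch below counts trust-related failures from log lines. The log format (`device=<id> tls_error=<message>`) is an assumption made for illustration, since the actual output of the verbose TLS logging is not shown here; adapt the regular expression to whatever your gateway really writes.

```python
import re
from collections import Counter
from typing import Iterable

# Hypothetical log line format, for example:
# 2024-05-01T10:00:00Z device=dev-01 tls_error=certificate verify failed
LINE_RE = re.compile(r"device=(?P<device>\S+)\s+tls_error=(?P<error>.+)$")


def trust_failures_by_device(lines: Iterable[str]) -> Counter:
    """Count 'certificate verify failed' handshake errors per device."""
    counts: Counter = Counter()
    for line in lines:
        match = LINE_RE.search(line.strip())
        # Substring match, because TLS libraries often wrap the
        # message with extra context before or after it.
        if match and "certificate verify failed" in match.group("error"):
            counts[match.group("device")] += 1
    return counts
```

A device that shows up repeatedly in this counter right after a rotation almost certainly has a stale trust store, which is exactly the situation you ran into.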
Best Practices for Certificate Rotation:
- Always maintain dual CA trust during rotation (overlap period: 7-14 days)
- Use certificate expiry monitoring with 30-day advance warnings
- Automate rotation for certificates with lifetimes under 90 days
- Test rotation on non-production devices first
- Maintain emergency rollback procedures
- Document CA trust chain for all device types
For Your Current Situation:
Since devices are offline, you have two recovery options:
Option A (Remote Recovery - if devices have fallback connectivity):
POST /devices/recovery/initiate
{
"deviceIds": ["list_of_18_devices"],
"method": "FALLBACK_CHANNEL",
"action": "UPDATE_TRUSTSTORE"
}
Option B (Manual Recovery - if no remote access):
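Generate a recovery gateway certificate signed by the old CA (this requires that you still hold the old CA's private key):
openssl x509 -req -in gateway.csr -CA old_ca.crt \
-CAkey old_ca.key -CAcreateserial \
-days 30 -sha256 -out recovery_gateway.crt
Deploy this temporarily to restore connectivity, then follow the proper rotation sequence. The 30-day validity covers the overlap period without leaving a long-lived certificate behind.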
Monitoring Setup:
Implement these monitors to prevent future incidents:
- Certificate expiry alerts (30, 14, 7 days before expiry)
- Trust store synchronization status
- TLS handshake success rate per device
- Certificate chain validation errors
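The tiered expiry alerts are simple to compute once you have each certificate's notAfter date. Here is a minimal sketch using only the Python standard library; the function names are illustrative, and how you collect the certificates and deliver the alerts is left to your monitoring stack.

```python
import ssl
from datetime import datetime, timedelta, timezone
from typing import Optional

ALERT_THRESHOLDS_DAYS = (7, 14, 30)  # checked tightest first


def parse_not_after(value: str) -> datetime:
    """Parse an OpenSSL-style date such as 'Jan  5 09:34:43 2018 GMT'.

    Accepts an optional 'notAfter=' prefix as printed by
    `openssl x509 -enddate -noout`.
    """
    if value.startswith("notAfter="):
        value = value[len("notAfter="):]
    seconds = ssl.cert_time_to_seconds(value.strip())
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def expiry_alert_tier(not_after: datetime, now: datetime) -> Optional[int]:
    """Return the tightest threshold (in days) the certificate has crossed.

    Returns 0 if the certificate has already expired and None if more
    than 30 days remain.
    """
    remaining = not_after - now
    if remaining <= timedelta(0):
        return 0
    for days in ALERT_THRESHOLDS_DAYS:
        if remaining <= timedelta(days=days):
            return days
    return None
```

Run this daily against the gateway certificate and both CA certificates, and escalate as the returned tier drops from 30 to 14 to 7.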
The key lesson: certificate rotation in IoT requires phased deployment with overlap periods. Never rotate certificates that devices depend on without first ensuring they trust the new CA.