After extensive debugging, here’s the complete solution covering all three focus areas:
Token Renewal Logic:
The intermittent failures were caused by stale app key references. Implement proper token validation before each API call:
var appKeyEntity = Things['ApplicationKeyThing'];
// Read the current key up front so applicationKey is defined even when the key has not expired
var applicationKey = appKeyEntity.keyId;
var keyInfo = appKeyEntity.GetKeyInfo();
if (keyInfo.expirationDate && keyInfo.expirationDate < new Date()) {
    logger.warn('App key expired, regenerating');
    appKeyEntity.RegenerateKey();
    // Pick up the regenerated key
    applicationKey = appKeyEntity.keyId;
}
var headers = {
    appKey: applicationKey,
    Accept: 'application/json'
};
Even if keys are set to never expire, validate them before critical operations. Also implement retry logic with exponential backoff for authentication failures.
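The retry advice above could look roughly like the following. This is only a sketch: backoffDelay, retryWithBackoff, isAuthFailure and the sleep callback are hypothetical names I made up for illustration, not ThingWorx APIs, and the base delay and cap are arbitrary. It is written synchronously because server-side services run synchronously.

```javascript
// Hypothetical helpers for illustration only; none of these names come from ThingWorx.

// attempt 0 -> baseMs, attempt 1 -> 2 * baseMs, ... capped at maxMs
function backoffDelay(attempt, baseMs, maxMs) {
    return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// Calls operation() until it succeeds, a non-auth error occurs,
// or maxAttempts is reached. options.sleep(ms) is supplied by the caller.
function retryWithBackoff(operation, options) {
    var lastError;
    for (var attempt = 0; attempt < options.maxAttempts; attempt++) {
        try {
            return operation(attempt);
        } catch (err) {
            lastError = err;
            // Only retry failures the caller classifies as authentication errors
            if (!options.isAuthFailure(err)) {
                throw err;
            }
            // Wait before the next attempt, but not after the last one
            if (attempt < options.maxAttempts - 1) {
                options.sleep(backoffDelay(attempt, options.baseMs, options.maxMs));
            }
        }
    }
    throw lastError;
}
```

You pass in whatever wait function your environment offers as sleep (in a ThingWorx service that is probably the built-in pause function, but check your version), and isAuthFailure decides which errors count as authentication failures, so permission problems are not retried pointlessly.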
Job Scheduling:
The root cause was overlapping job executions. Implement a distributed lock mechanism:
var lockName = 'DataSyncJob_Lock';
var lockAcquired = Resources['PersistenceManager'].AcquireLock({
    lockName: lockName,
    timeout: 60
});
if (!lockAcquired) {
    logger.warn('Previous job still running, skipping');
    return;
}
try {
    // Execute sync job
} finally {
    Resources['PersistenceManager'].ReleaseLock({lockName: lockName});
}
This prevents concurrent executions that can cause authentication conflicts. Also adjust your scheduler configuration to use a longer execution timeout and add monitoring for job duration trends.
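For the job duration monitoring, something along these lines may be enough to start with. It is a minimal sketch with hypothetical helper names (recordDuration, averageDuration, isAtRiskOfOverlap); where you store the history (a Thing property, a data table, etc.) and what threshold you alert on are up to you.

```javascript
// Hypothetical helpers for illustration only; not part of any ThingWorx API.

// Returns a new history array holding at most the last windowSize durations
function recordDuration(history, durationMs, windowSize) {
    var updated = history.concat([durationMs]);
    return updated.slice(Math.max(0, updated.length - windowSize));
}

// Mean of the recorded durations, 0 for an empty history
function averageDuration(history) {
    var sum = 0;
    for (var i = 0; i < history.length; i++) {
        sum += history[i];
    }
    return history.length ? sum / history.length : 0;
}

// True when the average run time exceeds the given fraction of the schedule interval
function isAtRiskOfOverlap(history, intervalMs, threshold) {
    return averageDuration(history) > intervalMs * threshold;
}
```

Record the elapsed time at the end of each run (inside the finally block is a natural place) and log a warning when isAtRiskOfOverlap returns true, for example with a threshold of 0.8 on a 60 second schedule. That gives you advance notice before runs actually start overlapping.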
Server Time Sync:
Clock skew was indeed a contributing factor. Here’s how we fixed it:
- Synchronized all servers using NTP with the same time source
- Verified time zones are configured correctly in platform-settings.json
- Added time drift monitoring to our health checks
- Implemented timestamp tolerance in our security validation:
var CLOCK_SKEW_TOLERANCE = 300000; // 5 minutes in ms
var serverTime = new Date().getTime();
var tokenTime = token.issuedAt; // assumed to be epoch milliseconds, same unit as serverTime
if (Math.abs(serverTime - tokenTime) > CLOCK_SKEW_TOLERANCE) {
    logger.error('Clock skew detected: ' + (serverTime - tokenTime) + 'ms');
    // Trigger time sync alert
}
Additionally, we discovered that our load balancer was caching authentication headers incorrectly. Disable caching for endpoints that use app key authentication, and make sure the load balancer passes authentication headers through unmodified.
Finally, enable detailed security logging in ThingWorx (set SecurityLogger to DEBUG level) to capture the exact reason for Access Denied errors. This helped us identify that some failures were due to permission changes on the user account, not token issues.