Our machine learning pipeline for network anomaly detection relies on VPC Flow Logs as the primary data source. We’ve noticed that certain traffic patterns are missing from the flow logs, which is reducing the effectiveness of our ML detection models. Specifically, we’re not seeing logs for traffic between instances in the same subnet, and some inter-subnet traffic appears to be incomplete.
The flow log collector is configured at the VPC level with default settings. We’re ingesting logs into Cloud Object Storage and processing them every 15 minutes for the ML pipeline. The anomaly detection model is trained to identify unusual network patterns, but with incomplete data, we’re getting false negatives. What VPC Flow Log coverage settings should we verify? Are there specific subnet-level logging configurations needed? Also, what are the recommended log retention settings for ML training datasets that need historical network behavior data?
That’s helpful context. I checked and our flow log collector is indeed set to capture only ‘accept’ traffic. I’ll change that to ‘all’. For the intra-subnet traffic issue, do I need to create separate flow log collectors for each subnet, or can I modify the VPC-level collector to include intra-subnet traffic? We have 6 subnets in the VPC and want to minimize management overhead.
Let me provide a comprehensive solution covering all three areas you need to address for complete ML pipeline coverage.
VPC Flow Log Coverage:
Your current VPC-level flow log collector has inherent limitations. VPC Flow Logs operate at three scope levels with different coverage:
- VPC Level: Captures only traffic crossing subnet boundaries, internet gateway traffic, and VPN traffic
- Subnet Level: Captures all traffic to/from instances in that subnet, including intra-subnet communication
- Network Interface Level: Most granular, captures all traffic for a specific instance
For comprehensive ML anomaly detection, you need subnet-level collectors. Create them programmatically:
# --target takes the ID of the resource to monitor, here the subnet
ibmcloud is flow-log-create \
  --bucket <cos-bucket-name> \
  --target <subnet-id> \
  --active true \
  --name ml-anomaly-subnet-<subnet-name>
Repeat for all 6 subnets. This ensures complete traffic visibility including intra-subnet flows that your ML model needs.
Subnet-Level Logging Configuration:
Key configuration parameters for ML data quality:
- Traffic Type: Set to ‘all’ to capture both accepted and rejected traffic
  - Rejected traffic is crucial for detecting security threats and misconfigurations
  - Your ML model can learn patterns of normal rejection rates vs. attack patterns
- Aggregation Interval: Align with your ML pipeline processing frequency
  - Default: 10 minutes
  - Your pipeline: 15 minutes
  - Recommendation: Set flow log interval to 5 minutes for finer granularity
  - Because 15 is a multiple of 5, each 15-minute processing window then covers exactly three complete aggregation windows instead of cutting one in half
- COS Bucket Structure: Organize logs by subnet and time for efficient ML processing
  - Use separate prefixes: /vpc-flows/subnet-1/, /vpc-flows/subnet-2/, etc.
  - Enable versioning for data integrity
  - Configure lifecycle policy for cost optimization
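As a rough illustration of the per-subnet, per-time layout described above, here is a minimal sketch of a key-prefix helper. It assumes your pipeline writes (or copies) normalized log objects under keys it controls; the function name, the `vpc-flows` root, and the year/month/day/hour layout are illustrative choices, not something the flow log service prescribes.

```python
from datetime import datetime, timezone


def flow_log_prefix(subnet_name: str, ts: datetime, root: str = "vpc-flows") -> str:
    """Build a key prefix partitioned by subnet and UTC hour.

    `ts` must be timezone-aware so the conversion to UTC is unambiguous.
    """
    if ts.tzinfo is None:
        raise ValueError("ts must be timezone-aware")
    ts = ts.astimezone(timezone.utc)
    # Example shape: vpc-flows/subnet-1/2024/05/17/13/
    return f"{root}/{subnet_name}/{ts:%Y/%m/%d/%H}/"
```

Partitioning by hour keeps each 15-minute processing run to a single prefix listing per subnet; a coarser or finer partition may suit your volume better.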
Log Retention Settings:
For ML training with network anomaly detection, implement tiered retention:
- Hot Storage (0-30 days): Standard COS class
  - Used for active ML model training and real-time inference
  - Quick access for investigating recent anomalies
  - Estimated cost: ~$0.023/GB/month
- Warm Storage (31-90 days): COS Vault class
  - Used for periodic model retraining
  - Historical baseline establishment
  - Estimated cost: ~$0.012/GB/month
- Cold Storage (90+ days): COS Cold Vault class
  - Long-term retention for compliance and trend analysis
  - Accessed infrequently for model validation
  - Estimated cost: ~$0.004/GB/month
Implement a lifecycle policy in COS:
- Transition to Vault: 30 days after object creation
- Transition to Cold Vault: 90 days after object creation
- Delete: 365 days after object creation (adjust based on compliance requirements)
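The schedule above can be written down as a small function, which is handy for unit-testing retention logic in the pipeline before touching any bucket configuration. This is only a model of the 30/90/365-day thresholds from this answer; the tier labels are illustrative strings, it makes no COS API calls, and whether your COS plan supports these exact transitions is something to confirm separately.

```python
def storage_action(age_days: int) -> str:
    """Return the tier (or 'delete') an object of the given age falls into."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days >= 365:
        return "delete"
    if age_days >= 90:
        return "cold-vault"
    if age_days >= 30:
        return "vault"
    return "standard"
```

For example, an object created 45 days ago maps to "vault", so a weekly retraining job that needs 90 days of history would read from both the Standard and Vault tiers.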
ML Pipeline Optimization:
With complete flow log coverage, optimize your anomaly detection pipeline:
- Data Preprocessing: Normalize flow records from multiple subnet collectors into a unified schema
- Feature Engineering: Extract features like bytes per flow, packets per flow, flow duration, unique source/destination counts
- Temporal Windowing: Use 15-minute windows aligned with your processing interval
- Baseline Calculation: Maintain rolling 30-day baseline for anomaly scoring
- Model Retraining: Schedule weekly retraining with full 90-day historical dataset
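The feature-engineering and windowing steps above might look like the following minimal sketch. The record fields (`start`, `end`, `src`, `dst`, `bytes`, `packets`) are hypothetical names for an already-normalized schema, not the field names the flow log service emits, and timestamps are assumed to be epoch seconds.

```python
from collections import defaultdict

WINDOW_SECONDS = 15 * 60  # matches the 15-minute processing interval


def window_features(records, window_seconds=WINDOW_SECONDS):
    """Group flow records into fixed windows and compute per-window features."""
    buckets = defaultdict(list)
    for rec in records:
        # Floor the flow start time to the beginning of its window.
        window_start = rec["start"] - rec["start"] % window_seconds
        buckets[window_start].append(rec)

    features = {}
    for window_start, recs in sorted(buckets.items()):
        n = len(recs)
        features[window_start] = {
            "flows": n,
            "bytes_per_flow": sum(r["bytes"] for r in recs) / n,
            "packets_per_flow": sum(r["packets"] for r in recs) / n,
            "mean_duration": sum(r["end"] - r["start"] for r in recs) / n,
            "unique_sources": len({r["src"] for r in recs}),
            "unique_destinations": len({r["dst"] for r in recs}),
        }
    return features
```

Each window's feature dict can then be compared against the rolling 30-day baseline; how that scoring is done depends on the model and is left out here.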
Validation Steps:
After implementing subnet-level collectors:
- Wait 30 minutes for initial flow logs to appear in COS
- Verify intra-subnet traffic is now visible in logs
- Check that rejected traffic is being captured
- Monitor COS bucket size growth (expect 3-5x increase with complete coverage)
- Validate ML model performance improves with complete dataset (measure false negative rate reduction)
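The second and third checks above can be automated with a quick count over a sample of parsed records. This is a sketch under assumptions: the `src`, `dst`, and `action` keys and the `"rejected"` value are hypothetical names for your normalized schema, and the subnet CIDR is supplied by you.

```python
import ipaddress


def coverage_summary(records, subnet_cidr):
    """Count intra-subnet flows and rejected flows in a sample of records."""
    net = ipaddress.ip_network(subnet_cidr)
    intra = 0
    rejected = 0
    for rec in records:
        src = ipaddress.ip_address(rec["src"])
        dst = ipaddress.ip_address(rec["dst"])
        # A flow is intra-subnet when both endpoints fall inside the CIDR.
        if src in net and dst in net:
            intra += 1
        if rec["action"] == "rejected":
            rejected += 1
    return {"total": len(records), "intra_subnet": intra, "rejected": rejected}
```

If `intra_subnet` or `rejected` is still zero for a subnet you know carries such traffic, the collector scope or traffic-type setting for that subnet is worth rechecking.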
Cost Considerations:
Subnet-level collectors generate significantly more data than VPC-level collectors. For 6 subnets with moderate traffic, expect:
- Daily log volume: 50-200 GB (depends on traffic patterns)
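- Monthly COS storage cost: $35-140 for the 30 days held in the Standard class (1,500-6,000 GB at ~$0.023/GB/month); older data held in Vault and Cold Vault adds to this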
- Flow log collector cost: $0.50 per collector = $3/month for 6 subnets
The improved ML model accuracy from complete data coverage should justify this cost increase through better threat detection and reduced false negatives.
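To sanity-check the storage estimate, the tier prices and the 30/90/365-day schedule quoted earlier in this answer can be combined into a steady-state calculation. This is back-of-the-envelope arithmetic only: it assumes a constant daily volume, no compression, and no retrieval or request charges, and the per-GB prices are the approximate figures from this thread rather than a current price list.

```python
# (days spent in tier, approximate USD per GB per month) from the tiers above
TIERS = [
    (30, 0.023),   # Standard: days 0-30
    (60, 0.012),   # Vault: days 31-90
    (275, 0.004),  # Cold Vault: days 91-365
]


def steady_state_monthly_cost(daily_gb: float, tiers=TIERS) -> float:
    """Monthly storage cost once every tier holds its full share of data."""
    return sum(daily_gb * days * price for days, price in tiers)
```

At 50 GB/day this comes to roughly $125 per month and at 200 GB/day roughly $500 per month once a full year of logs has accumulated, of which the Standard tier alone accounts for about $35 and $138 respectively.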
Also check your flow log collector configuration for the ‘traffic type’ setting. By default, it might be set to capture only accepted traffic. For ML anomaly detection, you probably want to capture all traffic including rejected connections, as those can be important indicators of malicious activity or misconfigurations. Set the traffic type to ‘all’ to get complete visibility.
For log retention settings with ML training, consider that anomaly detection models need significant historical data to establish baseline behavior. I’d recommend at least 90 days of retention in Cloud Object Storage, with lifecycle policies to archive older data to cheaper storage tiers. Also, make sure your flow logs are being written in a consistent format and time interval: the 15-minute processing interval should align with the flow log aggregation window to avoid data gaps.
VPC Flow Logs at the VPC level don’t capture all traffic by default. Intra-subnet traffic (traffic between instances in the same subnet) is typically not logged unless you explicitly enable it. You need to create flow log collectors at the subnet level for each subnet where you want complete traffic visibility. VPC-level collectors only capture traffic that crosses subnet boundaries or goes to/from the internet.
Don’t forget about the flow log aggregation interval setting. The default is 10 minutes, which means flows are aggregated and written every 10 minutes. If your ML pipeline processes every 15 minutes, you might be missing some flow records that are still being aggregated. Consider adjusting either the flow log aggregation interval or your ML pipeline processing frequency to ensure proper synchronization. Also verify that your COS bucket has proper lifecycle management to handle the volume of logs generated by subnet-level collectors.
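The synchronization concern can be made concrete with a small check. The sketch below assumes both the aggregation windows and the processing runs start counting from the same minute zero, which may not hold in practice; the function names are illustrative.

```python
def is_aligned(processing_min: int, aggregation_min: int) -> bool:
    """True when every processing run ends on an aggregation boundary."""
    return processing_min % aggregation_min == 0


def split_run_ends(processing_min: int, aggregation_min: int, runs: int) -> list:
    """Return the run end times (in minutes) that land inside an open window."""
    ends = (processing_min * i for i in range(1, runs + 1))
    return [t for t in ends if t % aggregation_min != 0]
```

With a 10-minute aggregation interval and 15-minute runs, `split_run_ends(15, 10, 4)` gives `[15, 45]`: every other run ends halfway through an aggregation window, so those flows only show up in the following run. With a 5-minute interval the list is empty.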