VPC Flow Logs missing traffic for ML anomaly detection pipeline in ic-2020 networking module

gregory_func · June 4, 2025, 4:47pm

Our machine learning pipeline for network anomaly detection relies on VPC Flow Logs as the primary data source. We’ve noticed that certain traffic patterns are missing from the flow logs, which is reducing the effectiveness of our ML detection models. Specifically, we’re not seeing logs for traffic between instances in the same subnet, and some inter-subnet traffic appears to be incomplete.

The flow log collector is configured at the VPC level with default settings. We’re ingesting logs into Cloud Object Storage and processing them every 15 minutes for the ML pipeline. The anomaly detection model is trained to identify unusual network patterns, but with incomplete data, we’re getting false negatives. What VPC Flow Log coverage settings should we verify? Are there specific subnet-level logging configurations needed? Also, what are the recommended log retention settings for ML training datasets that need historical network behavior data?

raymond_analyst · June 10, 2025, 6:30am

That’s helpful context. I checked and our flow log collector is indeed set to capture only ‘accept’ traffic. I’ll change that to ‘all’. For the intra-subnet traffic issue, do I need to create separate flow log collectors for each subnet, or can I modify the VPC-level collector to include intra-subnet traffic? We have 6 subnets in the VPC and want to minimize management overhead.

raymonddev · June 27, 2025, 10:05am

Let me provide a comprehensive solution covering all three areas you need to address for complete ML pipeline coverage.

VPC Flow Log Coverage: Your current VPC-level flow log collector has inherent limitations. VPC Flow Logs operate at three scope levels with different coverage:

VPC Level: Captures only traffic crossing subnet boundaries, internet gateway traffic, and VPN traffic
Subnet Level: Captures all traffic to/from instances in that subnet, including intra-subnet communication
Network Interface Level: Most granular, captures all traffic for a specific instance

For comprehensive ML anomaly detection, you need subnet-level collectors. Create them programmatically:


ibmcloud is flow-log-create \
  --bucket <cos-bucket-name> \
  --target <subnet-id> \
  --active true \
  --name ml-anomaly-subnet-<subnet-name>

Repeat for all 6 subnets. This ensures complete traffic visibility including intra-subnet flows that your ML model needs.
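Repeating the command for six subnets is easy to script. The following is a minimal Python sketch that only builds the argument lists, one per subnet, so they can be reviewed before anything is run. The subnet names and IDs are placeholders, and passing the subnet ID through `--target` is an assumption about the installed CLI plugin version; confirm the flags with `ibmcloud is flow-log-create --help` first.

```python
import shlex


def build_flow_log_commands(subnets, bucket):
    """Return one `ibmcloud is flow-log-create` argument list per subnet.

    `subnets` maps a human-readable subnet name to its subnet ID.
    Flag names are assumptions; verify them against the CLI help output.
    """
    commands = []
    for name, subnet_id in sorted(subnets.items()):
        commands.append([
            "ibmcloud", "is", "flow-log-create",
            "--bucket", bucket,
            "--target", subnet_id,
            "--active", "true",
            "--name", f"ml-anomaly-subnet-{name}",
        ])
    return commands


if __name__ == "__main__":
    # Placeholder values; substitute the real subnet IDs and bucket name.
    example_subnets = {"app": "<subnet-id-app>", "db": "<subnet-id-db>"}
    for cmd in build_flow_log_commands(example_subnets, "<cos-bucket-name>"):
        print(shlex.join(cmd))
```

Printing the commands rather than executing them keeps the sketch side-effect free; each list can be handed to `subprocess.run` once the output looks right.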

Subnet-Level Logging Configuration: Key configuration parameters for ML data quality:

Traffic Type: Set to ‘all’ to capture both accepted and rejected traffic
- Rejected traffic is crucial for detecting security threats and misconfigurations
- Your ML model can learn patterns of normal rejection rates vs. attack patterns
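Aggregation Interval: Align with your ML pipeline processing frequency
- Default: 10 minutes
- Your pipeline: 15 minutes
- Recommendation: Set flow log interval to 5 minutes for finer granularity
- Because 5 divides evenly into 15, each 15-minute processing window then holds exactly three complete aggregation periods; with a 10-minute interval, some periods straddle a window boundary and their flows land in the wrong window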
COS Bucket Structure: Organize logs by subnet and time for efficient ML processing
- Use separate prefixes: /vpc-flows/subnet-1/, /vpc-flows/subnet-2/, etc.
- Enable versioning for data integrity
- Configure lifecycle policy for cost optimization
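The interval alignment point can be checked in a few lines. This is a sketch only, assuming the pipeline's windows are aligned to the clock (00, 15, 30, 45 past the hour) and that timestamps are timezone-aware UTC; the function names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def intervals_align(aggregation_minutes, window_minutes=15):
    """True when whole aggregation periods tile the processing window exactly."""
    return window_minutes % aggregation_minutes == 0


def window_start(ts, window_minutes=15):
    """Floor a timezone-aware timestamp to the start of its processing window."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    window = timedelta(minutes=window_minutes)
    # Subtract the time elapsed since the last window boundary.
    return ts - ((ts - epoch) % window)
```

With these definitions, `intervals_align(5)` is true and `intervals_align(10)` is false, which is the reason for preferring an interval that divides the 15-minute window.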

Log Retention Settings: For ML training with network anomaly detection, implement tiered retention:

Hot Storage (0-30 days): Standard COS class
- Used for active ML model training and real-time inference
- Quick access for investigating recent anomalies
- Estimated cost: ~$0.023/GB/month
Warm Storage (31-90 days): COS Vault class
- Used for periodic model retraining
- Historical baseline establishment
- Estimated cost: ~$0.012/GB/month
Cold Storage (91+ days): COS Cold Vault class
- Long-term retention for compliance and trend analysis
- Accessed infrequently for model validation
- Estimated cost: ~$0.004/GB/month

Implement lifecycle policy in COS:


Transition to Vault: 30 days after object creation
Transition to Cold Vault: 90 days after object creation
Delete: 365 days after object creation (adjust based on compliance requirements)
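On the pipeline side it helps to know which tier a given log object should be in, for example to avoid pulling cold data into a routine retraining job. A minimal sketch using the thresholds listed above; the tier labels are illustrative strings, not COS API values, and how these transitions are actually configured should be checked against the COS documentation.

```python
from datetime import date

# Thresholds (in days) copied from the lifecycle rules described above.
VAULT_AFTER_DAYS = 30
COLD_VAULT_AFTER_DAYS = 90
DELETE_AFTER_DAYS = 365


def storage_tier(created, today):
    """Return the tier an object created on `created` should occupy on `today`."""
    age_days = (today - created).days
    if age_days >= DELETE_AFTER_DAYS:
        return "deleted"
    if age_days >= COLD_VAULT_AFTER_DAYS:
        return "cold_vault"
    if age_days >= VAULT_AFTER_DAYS:
        return "vault"
    return "standard"
```

The comparison treats each threshold as inclusive (an object exactly 30 days old counts as transitioned), which is an assumption about how the rule is evaluated.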

ML Pipeline Optimization: With complete flow log coverage, optimize your anomaly detection pipeline:

Data Preprocessing: Normalize flow records from multiple subnet collectors into a unified schema
Feature Engineering: Extract features like bytes per flow, packets per flow, flow duration, unique source/destination counts
Temporal Windowing: Use 15-minute windows aligned with your processing interval
Baseline Calculation: Maintain rolling 30-day baseline for anomaly scoring
Model Retraining: Schedule weekly retraining with full 90-day historical dataset
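The windowing, feature extraction, and baseline steps can be sketched roughly as follows. The record keys (`start`, `end`, `src`, `dst`, `bytes`, `packets`) are a hypothetical unified schema, not the raw flow log format, and the plain z-score is a stand-in for whatever scoring the real model uses.

```python
from collections import defaultdict
from statistics import mean, pstdev

WINDOW_SECONDS = 15 * 60  # matches the 15-minute processing interval


def window_features(records):
    """Group flow records into 15-minute windows and compute per-window features.

    Each record is a dict with hypothetical keys: start and end (epoch
    seconds), src, dst, bytes, packets.
    """
    buckets = defaultdict(list)
    for r in records:
        buckets[r["start"] - r["start"] % WINDOW_SECONDS].append(r)

    features = {}
    for start, flows in sorted(buckets.items()):
        n = len(flows)
        features[start] = {
            "flows": n,
            "bytes_per_flow": sum(f["bytes"] for f in flows) / n,
            "packets_per_flow": sum(f["packets"] for f in flows) / n,
            "mean_duration_s": sum(f["end"] - f["start"] for f in flows) / n,
            "unique_sources": len({f["src"] for f in flows}),
            "unique_destinations": len({f["dst"] for f in flows}),
        }
    return features


def anomaly_score(value, baseline):
    """Z-score of `value` against baseline values.

    Returns 0.0 when the baseline is too short or has no variance.
    """
    if len(baseline) < 2:
        return 0.0
    sigma = pstdev(baseline)
    if sigma == 0:
        return 0.0
    return (value - mean(baseline)) / sigma
```

For the rolling 30-day baseline, `baseline` would hold the same feature's values from the preceding 30 days of windows, so a score well above zero means the current window is unusually high relative to recent history.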

Validation Steps: After implementing subnet-level collectors:

Wait 30 minutes for initial flow logs to appear in COS
Verify intra-subnet traffic is now visible in logs
Check that rejected traffic is being captured
Monitor COS bucket size growth (expect 3-5x increase with complete coverage)
Validate ML model performance improves with complete dataset (measure false negative rate reduction)
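The second and third checks can be automated once a batch of flow records has been parsed. A rough sketch, assuming each record is a dict with `initiator_ip`, `target_ip`, and `action` keys and that rejected flows carry the value `"rejected"`; these field names are assumptions and should be matched to what actually appears in the log objects.

```python
from ipaddress import ip_address, ip_network


def coverage_report(records, subnet_cidrs):
    """Count intra-subnet and rejected flows in a batch of parsed flow records."""
    networks = [ip_network(cidr) for cidr in subnet_cidrs]
    report = {"total": 0, "intra_subnet": 0, "rejected": 0}
    for r in records:
        report["total"] += 1
        src = ip_address(r["initiator_ip"])
        dst = ip_address(r["target_ip"])
        # A flow is intra-subnet when both endpoints fall in the same CIDR.
        if any(src in net and dst in net for net in networks):
            report["intra_subnet"] += 1
        if r.get("action") == "rejected":
            report["rejected"] += 1
    return report
```

If `intra_subnet` or `rejected` stays at zero over a period in which such traffic is known to have occurred, the collector configuration is the first thing to recheck.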

Cost Considerations: Subnet-level collectors generate significantly more data than VPC-level collectors. For 6 subnets with moderate traffic, expect:

Daily log volume: 50-200 GB (depends on traffic patterns)
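Monthly COS storage cost: roughly $35-140 for the 30-day hot tier alone (50-200 GB/day x 30 days x ~$0.023/GB); data held in the Vault and Cold Vault tiers adds to this as history accumulates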
Flow log collector cost: $0.50 per collector = $3/month for 6 subnets

The improved ML model accuracy from complete data coverage should justify this cost increase through better threat detection and reduced false negatives.
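To see how the storage bill develops once a full year of logs has accumulated, the tier prices and retention periods quoted earlier in this post can be combined in a short calculation. This is a back-of-the-envelope sketch: it assumes a constant daily volume, uses the per-GB estimates from this thread rather than a current price list, and ignores request, retrieval, and collector charges.

```python
# (tier name, days spent in tier, estimated price per GB per month)
TIERS = [
    ("standard", 30, 0.023),     # first 30 days
    ("vault", 60, 0.012),        # days 31-90
    ("cold_vault", 275, 0.004),  # days 91-365
]


def steady_state_monthly_cost(gb_per_day):
    """Monthly storage cost per tier once 365 days of logs are retained."""
    breakdown = {name: gb_per_day * days * price for name, days, price in TIERS}
    breakdown["total"] = sum(breakdown.values())
    return breakdown
```

At 50 GB/day this gives about $34.50 for the hot tier and about $125.50 across all three tiers; at 200 GB/day the total is about $502. The long cold tail ends up costing more than the hot tier, so the 365-day deletion rule is the main lever for keeping storage spend down.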

dorothy_pro · June 7, 2025, 6:39am

Also check your flow log collector configuration for the ‘traffic type’ setting. By default, it might be set to capture only accepted traffic. For ML anomaly detection, you probably want to capture all traffic including rejected connections, as those can be important indicators of malicious activity or misconfigurations. Set the traffic type to ‘all’ to get complete visibility.

michelle_coder · June 20, 2025, 4:08pm

For log retention settings with ML training, consider that anomaly detection models need significant historical data to establish baseline behavior. I’d recommend at least 90 days of retention in Cloud Object Storage, with lifecycle policies to archive older data to cheaper storage tiers. Also, make sure your flow logs are being written in a consistent format and time interval - the 15-minute processing interval should align with the flow log aggregation window to avoid data gaps.

kevin_cloud · June 4, 2025, 7:07pm

VPC Flow Logs at the VPC level don’t capture all traffic by default. Intra-subnet traffic (traffic between instances in the same subnet) is typically not logged unless you explicitly enable it. You need to create flow log collectors at the subnet level for each subnet where you want complete traffic visibility. VPC-level collectors only capture traffic that crosses subnet boundaries or goes to/from the internet.

michelle_coder · June 25, 2025, 11:46am

Don’t forget about the flow log aggregation interval setting. The default is 10 minutes, which means flows are aggregated and written every 10 minutes. If your ML pipeline processes every 15 minutes, you might be missing some flow records that are still being aggregated. Consider adjusting either the flow log aggregation interval or your ML pipeline processing frequency to ensure proper synchronization. Also verify that your COS bucket has proper lifecycle management to handle the volume of logs generated by subnet-level collectors.
