Optimized production planning database for real-time MRP scheduling across six manufacturing plants

I wanted to share our experience optimizing the database architecture for real-time MRP scheduling across our 6 manufacturing plants. We were experiencing scheduling latency of 15-20 minutes for production plan generation, which was unacceptable for our just-in-time manufacturing model.

Our solution involved implementing materialized views for frequently accessed planning data, table partitioning by plant to improve query locality, and designing covering indexes that eliminated table lookups for common scheduling queries. We also leveraged an ION caching layer and set up a read replica architecture to offload reporting queries from the primary transactional database.

The results were dramatic - MRP scheduling now completes in under 90 seconds, and we can run what-if scenarios without impacting production transactions. Happy to share technical details if anyone’s facing similar challenges.

Our covering indexes focused on the three most common scheduling queries: 1) Available capacity by work center and date range, 2) Material availability by item and plant, 3) Work order status by priority and due date. Each index includes all columns needed by its target query to avoid table lookups. We use composite indexes with plant_code as the leading column for partition pruning. Yes, indexes add storage overhead, but the performance gain was worth it - query times dropped from 8-12 seconds to under 1 second.
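As an illustration, a query of the first type (available capacity by work center and date range) might look like the sketch below. The syntax is PostgreSQL, and the table name is an assumption; the columns come from the index definitions described later in the thread. Because plant_code leads the key and the remaining columns are carried in the index, the query can be answered by partition pruning plus an index-only scan:

```sql
-- Hypothetical capacity query served entirely by a covering index:
-- plant_code leads the key (partition pruning), and the included
-- columns avoid any table lookup.
SELECT work_center_id,
       schedule_date,
       available_hours - utilized_hours AS free_hours
FROM   work_center_capacity            -- table name assumed
WHERE  plant_code     = 'P01'
  AND  work_center_id = 'WC-100'
  AND  schedule_date BETWEEN DATE '2024-01-01' AND DATE '2024-01-14'
ORDER BY schedule_date;
```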

We partitioned the 5 highest-volume tables: work orders, material requirements, routing operations, capacity buckets, and item master. We used LIST partitioning on plant_code. The key was ensuring queries always include plant_code in WHERE clauses so partition pruning works. We saw 70% reduction in full table scans after implementation. Maintenance is minimal since plants don’t change frequently.
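In PostgreSQL-style declarative partitioning (the post doesn’t name the database engine, so treat the syntax and all table/column names as assumptions), the work order table might be declared like this:

```sql
-- Hypothetical sketch: LIST-partitioned work order table,
-- one partition per plant, keyed on plant_code.
CREATE TABLE work_orders (
    work_order_id  bigint       NOT NULL,
    plant_code     varchar(10)  NOT NULL,
    item_id        bigint       NOT NULL,
    due_date       date         NOT NULL,
    status_code    varchar(4)   NOT NULL,
    -- partition key must be part of the primary key
    PRIMARY KEY (plant_code, work_order_id)
) PARTITION BY LIST (plant_code);

CREATE TABLE work_orders_p01 PARTITION OF work_orders
    FOR VALUES IN ('P01');
CREATE TABLE work_orders_p02 PARTITION OF work_orders
    FOR VALUES IN ('P02');
-- ...one partition per remaining plant
```

Since plants change rarely, adding a partition is a one-time DDL operation rather than ongoing maintenance.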

The read replica architecture is interesting. Did you implement synchronous or asynchronous replication? We’re concerned about data consistency for scheduling queries if we go with async replication.

This is impressive. What was your approach to table partitioning by plant? Did you partition all planning tables or just specific high-volume ones? We’re considering a similar approach but concerned about maintenance overhead.

Can you elaborate on the covering indexes? What columns did you include? We’re struggling with index bloat from trying to optimize too many query patterns simultaneously.

Let me provide a comprehensive breakdown of our implementation:

Materialized View Strategy: We created three categories of materialized views. First, aggregated capacity views that roll up work center availability by day/week/month - these refresh every 30 minutes. Second, material availability views that pre-join item master, inventory balances, and open purchase orders - refresh every 15 minutes during production hours. Third, work order summary views that include routing steps and material requirements - refresh on-demand when scheduling runs complete. The key was identifying which data changes frequently versus relatively static master data.
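A minimal sketch of the first category, an aggregated daily capacity rollup, might look like this in PostgreSQL (view, table, and column names are assumptions):

```sql
-- Hypothetical daily capacity rollup, refreshed every 30 minutes.
CREATE MATERIALIZED VIEW mv_daily_capacity AS
SELECT plant_code,
       work_center_id,
       schedule_date,
       SUM(available_hours) AS available_hours,
       SUM(utilized_hours)  AS utilized_hours
FROM   work_center_capacity
GROUP  BY plant_code, work_center_id, schedule_date;

-- CONCURRENTLY lets readers keep querying during the refresh;
-- it requires a unique index on the view.
CREATE UNIQUE INDEX mv_daily_capacity_pk
    ON mv_daily_capacity (plant_code, work_center_id, schedule_date);

REFRESH MATERIALIZED VIEW CONCURRENTLY mv_daily_capacity;
```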

Table Partitioning by Plant: Implemented LIST partitioning on plant_code for work orders, material requirements, routing operations, capacity buckets, and shop floor transactions. This created physical data separation that dramatically improved query performance since most scheduling queries are plant-specific. We also added local indexes on each partition rather than global indexes, which reduced index maintenance overhead during data loads. Partition pruning now eliminates 80-85% of data from typical scheduling queries.
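The local-index choice maps directly onto PostgreSQL behavior (again assuming that engine): an index created on the partitioned parent is implemented as one local index per partition, or indexes can be built partition by partition to control timing during loads. Names below are hypothetical:

```sql
-- Parent-level index: cascades into one local index per partition.
CREATE INDEX work_orders_due_idx
    ON work_orders (plant_code, due_date);

-- Or build per partition, e.g. to stagger index builds around
-- large data loads on a single plant's partition.
CREATE INDEX work_orders_p01_due_idx
    ON work_orders_p01 (due_date);
```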

Covering Index Design: Built three critical covering indexes: 1) work_center_capacity_idx on (plant_code, work_center_id, schedule_date) INCLUDE (available_hours, utilized_hours, efficiency_factor), 2) material_availability_idx on (plant_code, item_id, required_date) INCLUDE (on_hand_qty, allocated_qty, on_order_qty), 3) work_order_priority_idx on (plant_code, priority_code, due_date) INCLUDE (status_code, order_qty, completed_qty). These indexes support 90% of our scheduling queries without table access.
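Written out as DDL, those three definitions look like this (PostgreSQL INCLUDE syntax; the base table names are assumptions, since only the index names and columns were given):

```sql
CREATE INDEX work_center_capacity_idx
    ON work_center_capacity (plant_code, work_center_id, schedule_date)
    INCLUDE (available_hours, utilized_hours, efficiency_factor);

CREATE INDEX material_availability_idx
    ON material_requirements (plant_code, item_id, required_date)
    INCLUDE (on_hand_qty, allocated_qty, on_order_qty);

CREATE INDEX work_order_priority_idx
    ON work_orders (plant_code, priority_code, due_date)
    INCLUDE (status_code, order_qty, completed_qty);
```

INCLUDE columns are stored only in leaf pages, so they make the query coverable without widening the key used for ordering and pruning.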

ION Caching Layer: Implemented Redis-based caching through ION for relatively static planning data like item master, BOMs, and routings. Cache entries have 4-hour TTL during production hours and 12-hour TTL overnight. When MRP scheduling runs, it pulls from cache rather than querying the database repeatedly. We also cache the results of complex availability calculations that don’t change frequently. This reduced database query volume by 60% during scheduling runs.

Read Replica Architecture: Set up asynchronous streaming replication to two read replicas. One replica handles all planning and scheduling queries, the other handles reporting and analytics. This completely offloaded read traffic from the primary database, which now only handles transactional writes. We accept 30-60 second replication lag because scheduling doesn’t require absolute real-time data - the slight lag is acceptable for planning horizon queries. For critical real-time checks like inventory allocation, queries still hit the primary database.
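If the stack is PostgreSQL 10 or later (an assumption, since the engine isn’t named), the 30-60 second lag budget can be watched from the primary via the built-in replication statistics view:

```sql
-- Per-replica replication status, queried on the primary.
-- replay_lag approximates how far behind each replica is in time.
SELECT application_name,
       state,
       replay_lag
FROM   pg_stat_replication;
```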

Performance Results: MRP scheduling time dropped from 15-20 minutes to 75-90 seconds. Database CPU utilization during scheduling runs decreased from 85% to 35%. We can now run multiple what-if scenarios in parallel without impacting production transactions. Query response times for planning dashboards improved from 8-15 seconds to under 2 seconds. The architecture also improved system reliability - read replica failures don’t impact production transactions, and we can perform maintenance on replicas without downtime.

Implementation Lessons: Start with materialized views and covering indexes for quick wins, then add partitioning once you understand query patterns. Test partition pruning thoroughly - queries that don’t include partition keys get no benefit. Size your ION cache appropriately - we started too small and had excessive cache evictions. Monitor replication lag closely and alert if it exceeds 2 minutes. Document which queries use which replicas to avoid confusion during troubleshooting.
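The partition-pruning test mentioned above can be done with EXPLAIN (PostgreSQL syntax assumed; table name is hypothetical): with plant_code in the WHERE clause the plan should touch a single partition, and without it every partition is scanned.

```sql
-- Pruning check: compare the plan with and without the
-- plant_code predicate.
EXPLAIN (COSTS OFF)
SELECT *
FROM   work_orders
WHERE  plant_code = 'P01'
  AND  due_date   < CURRENT_DATE + 7;
```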