Sani Rus
04-04-2007, 11:09 PM
Hello group!
When running our application at full load we encounter occasional delays
of 100 ms. Above 80% CPU load they appear very often, about once every few
seconds. At lower load the delays still appear, but less often.

Our investigation shows that the delays occur in semop() when acquiring a
semaphore. This semaphore protects a critical section that is executed
very frequently (~4200 times/s). Further investigation led us to conclude
that the delays are caused by the round-robin scheduler:
sched_rr_get_interval() returns a period of exactly 100 ms, and the
application processes run at RT priorities with the SCHED_RR policy.

When a process exhausts its time slice, the scheduler preempts it for one
RR interval. If this happens while the process holds the semaphore, nobody
can release it for those 100 ms. In the meantime, many other processes try
to enter the same critical section and block. As a consequence, all other
processes, including those with higher priority, block while running their
own transitions (they should be processing messages from the message
queue). This results in full queues (>1000 messages) for the blocked
processes, which makes them run for longer periods (to empty their queues)
and gives them a good chance of being preempted by the RR scheduler again.
The circle is closed.

When the scheduling policy is changed to SCHED_FIFO, none of this happens
any more, but the application runs more in bursts, which is not desirable.

Do you agree with our findings above? Do you have any suggestions for how
to prevent the described problem while retaining the SCHED_RR policy?
Best regards,
Sani