Possible hard to fix race in cmn_systimer_get(), ideas?

classic Classic list List threaded Threaded
15 messages Options
Mark Burton Mark Burton
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Possible hard to fix race in cmn_systimer_get(), ideas?

Hi,

I am seeing random occurrences of platform_timer_get_diff_us() finding start times after end times (this is with Lua not Lua long on the STM32F4) and I believe the problem is in cmn_systimer_get(). Imagine what will happen if the MCU timer rolls over between the first assignment to tempcnt and the first assignment to tempsys? tempcnt will have a high value just before the counter resets to zero but cmn_systimer_counter is already advanced by the time it is read on the next line. The following lines that check that cmn_systimer_counter hasn't changed don't help at all.

So, what to do? Disabling the system timer interrupt before assigning to tempcnt appears promising but then it may produce times that are in the past.

The only solution I can currently think of is to rework the system timer API and have a platform function that returns a single 64 bit uS counter value. The implementation of that function will have to take whatever precautions are necessary (disable interrupts, stop the counter, etc.) to ensure the resulting values are in sequence.

Does anyone have any good suggestions how to fix this?

Mark




_______________________________________________
eLua-dev mailing list
[hidden email]
https://lists.berlios.de/mailman/listinfo/elua-dev
Mark Burton Mark Burton
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Possible hard to fix race in cmn_systimer_get(), ideas?


Here's a possible fix:
_______________________________________________
eLua-dev mailing list
[hidden email]
https://lists.berlios.de/mailman/listinfo/elua-dev
Mark Burton Mark Burton
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Possible hard to fix race in cmn_systimer_get(), ideas?

In reply to this post by Mark Burton

Oops, hit send too soon, try again...

How about this...

Create a new platform function called something like
platform_timer_sys_wrapped() that returns non-zero if the
sys timer has wrapped since the function was last called. For ARM chips
this is trivial because the SysTick hardware provides exactly that
capability. Do the other supported chips have a similar capability?

Using that function, here's a new version of cmn_systimer_get():

timer_data_type cmn_systimer_get(void)
{
  u64 tempcnt, crtsys;

  platform_timer_sys_wrapped(); // reset wrapped state
  do {
    tempcnt = platform_timer_sys_raw_read();
    crtsys = cmn_systimer_counter;
  } while( platform_timer_sys_wrapped() );

  crtsys += tempcnt / cmn_systimer_ticks_for_us;
  if( crtsys > PLATFORM_TIMER_SYS_MAX ) // timer overflow
  {
    crtsys -= PLATFORM_TIMER_SYS_MAX;
    platform_timer_sys_disable_int();
    if( cmn_systimer_counter > PLATFORM_TIMER_SYS_MAX )
      cmn_systimer_counter -= PLATFORM_TIMER_SYS_MAX;
    platform_timer_sys_enable_int();
  }
  return ( timer_data_type )crtsys;
}

So when the do - while loop has exited you know that tempcnt and crtsys
are in sync because the MCU counter has not wrapped.

Any good?


_______________________________________________
eLua-dev mailing list
[hidden email]
https://lists.berlios.de/mailman/listinfo/elua-dev
Martin Guy Martin Guy
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Possible hard to fix race in cmn_systimer_get(), ideas?

If the value from the timer is a 32- or 64-bit value that wraps round,
you should just be able to subtract the old from the new and get a
positive value from 0 to (2^N)-1, and the unsigned arithmetic will
handle the wraparound case like all the others.

That's without looking at the code, of course. Just an idea.

    M

On 22/02/2014, Mark Burton <[hidden email]> wrote:

>
> Oops, hit send too soon, try again...
>
> How about this...
>
> Create a new platform function called something like
> platform_timer_sys_wrapped() that returns non-zero if the
> sys timer has wrapped since the function was last called. For ARM chips
> this is trivial because the SysTick hardware provides exactly that
> capability. Do the other supported chips have a similar capability?
>
> Using that function, here's a new version of cmn_systimer_get():
>
> timer_data_type cmn_systimer_get(void)
> {
>   u64 tempcnt, crtsys;
>
>   platform_timer_sys_wrapped(); // reset wrapped state
>   do {
>     tempcnt = platform_timer_sys_raw_read();
>     crtsys = cmn_systimer_counter;
>   } while( platform_timer_sys_wrapped() );
>
>   crtsys += tempcnt / cmn_systimer_ticks_for_us;
>   if( crtsys > PLATFORM_TIMER_SYS_MAX ) // timer overflow
>   {
>     crtsys -= PLATFORM_TIMER_SYS_MAX;
>     platform_timer_sys_disable_int();
>     if( cmn_systimer_counter > PLATFORM_TIMER_SYS_MAX )
>       cmn_systimer_counter -= PLATFORM_TIMER_SYS_MAX;
>     platform_timer_sys_enable_int();
>   }
>   return ( timer_data_type )crtsys;
> }
>
> So when the do - while loop has exited you know that tempcnt and crtsys
> are in sync because the MCU counter has not wrapped.
>
> Any good?
>
>
> _______________________________________________
> eLua-dev mailing list
> [hidden email]
> https://lists.berlios.de/mailman/listinfo/elua-dev
>
_______________________________________________
eLua-dev mailing list
[hidden email]
https://lists.berlios.de/mailman/listinfo/elua-dev
Mark Burton Mark Burton
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Possible hard to fix race in cmn_systimer_get(), ideas?


Hi Martin,

> If the value from the timer is a 32- or 64-bit value that wraps round,
> you should just be able to subtract the old from the new and get a
> positive value from 0 to (2^N)-1, and the unsigned arithmetic will
> handle the wraparound case like all the others.
>
> That's without looking at the code, of course. Just an idea.

Yes, but that's not what's happening.

At the moment, the sys timer value is made by adding together the value
of cmn_systimer_counter and the value returned by
platform_timer_sys_raw_read(). cmn_systimer_counter is incremented when
the platform timer wraps back to zero but the current code is flawed
because in the gap between reading the system timer value and adding it
to cmn_systimer_counter, every now and again, the system timer wraps
back to zero (which causes cmd_systimer_counter to increase) but the
pre-wrap system timer value is still added. Let me give you an example:

If you look at tstart and tend in platform_timer_get_diff_us() for a
typical situation where it's gone wrong, you get values like these:

(gdb) print tstart
$9 = 14374399999
(gdb) print tend
$10 = 14374352717

tstart is 14374399999 but it should be 14374349999, it's 50000 too high
(vtmr rate is 20Hz, aka 50mS).

Cheers,

Mark

_______________________________________________
eLua-dev mailing list
[hidden email]
https://lists.berlios.de/mailman/listinfo/elua-dev
Mark Burton Mark Burton
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Possible hard to fix race in cmn_systimer_get(), ideas?

In reply to this post by Martin Guy

Here's a little test program that shows the problem on the STM32F4:

print "Hello!"
while true do
  local start = tmr.read()
  math.sqrt(math.huge)
  local us = tmr.getdiffnow(nil, start)
  if(us > 1e6) then
    print(tmr.read() / 1e6 .. ": took " .. us)
  end
end

Typical output:

Hello!
50.450169: took 4.5035996273206e+15
294.900167: took 4.5035996273206e+15
301.650167: took 4.5035996273206e+15
350.550167: took 4.5035996273206e+15


_______________________________________________
eLua-dev mailing list
[hidden email]
https://lists.berlios.de/mailman/listinfo/elua-dev
Martin Guy Martin Guy
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Possible hard to fix race in cmn_systimer_get(), ideas?

In reply to this post by Mark Burton
On 23/02/2014, Mark Burton <[hidden email]> wrote:

>> If the value from the timer is a 32- or 64-bit value that wraps round,
>> you should just be able to subtract the old from the new and get a
>> positive value from 0 to (2^N)-1, and the unsigned arithmetic will
>> handle the wraparound case like all the others.
>> .
>> That's without looking at the code, of course. Just an idea.
>
> Yes, but that's not what's happening.
>
> At the moment, the sys timer value is made by adding together the value
> of cmn_systimer_counter and the value returned by
> platform_timer_sys_raw_read(). cmn_systimer_counter is incremented when
> the platform timer wraps back to zero but the current code is flawed
> because in the gap between reading the system timer value and adding it
> to cmn_systimer_counter, every now and again, the system timer wraps
> back to zero (which causes cmd_systimer_counter to increase) but the
> pre-wrap system timer value is still added.

Right. Nasty. Of the solutions you suggest, the 64-bit platform
function seems cleaner, handling the locking or wrap detection or
whatever in the hardware-specific way that the platform requires.

I think we should hear Bogdan, as the systimer is his invention and he
has written clear directions on how it is to be used at a platform
level:
http://www.eluaproject.net/doc/v0.9/en_arch_platform_timers.html#the_system_timer
so if you can resolve the bug within that framework, so much the better...

good luck!

   m
_______________________________________________
eLua-dev mailing list
[hidden email]
https://lists.berlios.de/mailman/listinfo/elua-dev
Mark Burton Mark Burton
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Possible hard to fix race in cmn_systimer_get(), ideas?


Hi,

> Right. Nasty. Of the solutions you suggest, the 64-bit platform
> function seems cleaner, handling the locking or wrap detection or
> whatever in the hardware-specific way that the platform requires.
>
> I think we should hear Bogdan, as the systimer is his invention and he
> has written clear directions on how it is to be used at a platform
> level:
> http://www.eluaproject.net/doc/v0.9/en_arch_platform_timers.html#the_system_timer
> so if you can resolve the bug within that framework, so much the
> better...
>
> good luck!

I have been testing a fix like I suggested a few emails back with this
as the implementation for Cortex M3/M4 chips:

int platform_timer_sys_wrapped(void)
{
  return (SysTick->CTRL & SysTick_CTRL_COUNTFLAG_Msk) != 0;
}

And this is how cmn_systimer_get() now looks:

timer_data_type cmn_systimer_get(void)
{
  u64 tempcnt, crtsys;

  platform_timer_sys_wrapped();
  do {
    tempcnt = platform_timer_sys_raw_read();
    crtsys = cmn_systimer_counter;
  } while( platform_timer_sys_wrapped() );

  crtsys += tempcnt / cmn_systimer_ticks_for_us;
  if( crtsys > PLATFORM_TIMER_SYS_MAX ) // timer overflow
  {
    crtsys -= PLATFORM_TIMER_SYS_MAX;
    platform_timer_sys_disable_int();
    if( cmn_systimer_counter > PLATFORM_TIMER_SYS_MAX )
      cmn_systimer_counter -= PLATFORM_TIMER_SYS_MAX;
    platform_timer_sys_enable_int();
  }
  return ( timer_data_type )crtsys;
}

It appears to fix the problem so I am currently working on a patch which
will implement platform_timer_sys_wrapped() for all platforms that need
it so that the fix doesn't break any existing builds.

Cheers,

Mark
_______________________________________________
eLua-dev mailing list
[hidden email]
https://lists.berlios.de/mailman/listinfo/elua-dev
Mark Burton Mark Burton
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Possible hard to fix race in cmn_systimer_get(), ideas?

Having thought about this some more, I realise that there's a much simpler fix. All it requires is to reorder the assignments in cmn_systimer_get() so they look like this:

  tempsys = cmn_systimer_counter;
  tempcnt = platform_timer_sys_raw_read();
  while( ( crtsys = cmn_systimer_counter ) != tempsys )
  {
    tempsys = crtsys;
    tempcnt = platform_timer_sys_raw_read();
  }

Now, when the loop finishes, you know that the value of tempcnt is consistent with the value of crtsys because the value of cmn_systimer_counter did not change while the platform timer value was being read and assigned to tempcnt.

This assumes that when the counter wraps, an interrupt is generated immediately and the interrupt handler calls cmn_systimer_periodic(). If interrupts are disabled then the counter could wrap and cmn_systimer_counter would not be advanced so the resulting time would be in the past but this has always been possible with this implementation.

Anyway, I have tested the above change and it appears to work OK so I shall submit a pull request.

Cheers,

Mark



On 23 February 2014 15:33, Mark Burton <[hidden email]> wrote:

Hi,

> Right. Nasty. Of the solutions you suggest, the 64-bit platform
> function seems cleaner, handling the locking or wrap detection or
> whatever in the hardware-specific way that the platform requires.
>
> I think we should hear Bogdan, as the systimer is his invention and he
> has written clear directions on how it is to be used at a platform
> level:
> http://www.eluaproject.net/doc/v0.9/en_arch_platform_timers.html#the_system_timer
> so if you can resolve the bug within that framework, so much the
> better...
>
> good luck!

I have been testing a fix like I suggested a few emails back with this
as the implementation for Cortex M3/M4 chips:

int platform_timer_sys_wrapped(void)
{
  return (SysTick->CTRL & SysTick_CTRL_COUNTFLAG_Msk) != 0;
}

And this is how cmn_systimer_get() now looks:

timer_data_type cmn_systimer_get(void)
{
  u64 tempcnt, crtsys;

  platform_timer_sys_wrapped();
  do {
    tempcnt = platform_timer_sys_raw_read();
    crtsys = cmn_systimer_counter;
  } while( platform_timer_sys_wrapped() );

  crtsys += tempcnt / cmn_systimer_ticks_for_us;
  if( crtsys > PLATFORM_TIMER_SYS_MAX ) // timer overflow
  {
    crtsys -= PLATFORM_TIMER_SYS_MAX;
    platform_timer_sys_disable_int();
    if( cmn_systimer_counter > PLATFORM_TIMER_SYS_MAX )
      cmn_systimer_counter -= PLATFORM_TIMER_SYS_MAX;
    platform_timer_sys_enable_int();
  }
  return ( timer_data_type )crtsys;
}

It appears to fix the problem so I am currently working on a patch which
will implement platform_timer_sys_wrapped() for all platforms that need
it so that the fix doesn't break any existing builds.

Cheers,

Mark


_______________________________________________
eLua-dev mailing list
[hidden email]
https://lists.berlios.de/mailman/listinfo/elua-dev
jbsnyder jbsnyder
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Possible hard to fix race in cmn_systimer_get(), ideas?

Thanks for this.  I would have joined in on this on the weekend but I was out of town.

I actually had done something about this some time ago and had a conversation with Bogdan about it back then, though I didn't run the final implementation by him.  Then.... I forgot to merge it.  I _believe_ it has been working pretty well on the branch but I haven't delved into the related issues any time recently or thought about all the ins and outs.

Yours is simpler, but if I recall this covers some other potential situations and the rather small toggler might be nice for architectures with smaller words?

Sorry for leaving this one outstanding for so long.

@@ -396,6 +403,7 @@ int platform_timer_set_match_int( unsigned id, timer_data_type period_us, int ty
 
 static u32 cmn_systimer_ticks_for_us;
 static volatile u64 cmn_systimer_counter;
+static volatile u8 cmn_systimer_toggle = 0;
 static u32 cmn_systimer_us_per_interrupt;
 
 void cmn_systimer_set_base_freq( u32 freq_hz )
@@ -413,30 +421,44 @@ void cmn_systimer_set_interrupt_period_us( u32 period ) void cmn_systimer_periodic(void)
 {
+  cmn_systimer_toggle ^= 1;
   cmn_systimer_counter += cmn_systimer_us_per_interrupt;
 }
 
+u64 cmn_systimer_last_crtsys = 0;
+u8 cmn_systimer_overflow = 0;
 timer_data_type cmn_systimer_get(void)
 {
-  u64 tempsys, tempcnt, crtsys;
-
-  tempcnt = platform_timer_sys_raw_read();
-  tempsys = cmn_systimer_counter;
-  while( ( crtsys = cmn_systimer_counter ) != tempsys )
+  u64 tempcnt, crtsys;
+  u8 tmptoggle;
+  do
   {
+    tmptoggle = cmn_systimer_toggle;
     tempcnt = platform_timer_sys_raw_read();
-    tempsys = crtsys;
-  }
+    crtsys = cmn_systimer_counter;
+  } while( platform_timer_sys_raw_read() < tempcnt || cmn_systimer_toggle != tmptoggle );
+
   crtsys += tempcnt / cmn_systimer_ticks_for_us;
+  if( crtsys < cmn_systimer_last_crtsys && cmn_systimer_overflow == 0 )
+  {
+    while( crtsys < cmn_systimer_last_crtsys )
+      crtsys += cmn_systimer_us_per_interrupt;
+  }
+
   if( crtsys > PLATFORM_TIMER_SYS_MAX ) // timer overflow
   {
     crtsys %= PLATFORM_TIMER_SYS_MAX;
     platform_timer_sys_disable_int();
     cmn_systimer_counter = 0;
     platform_timer_sys_enable_int();
+    cmn_systimer_overflow = 1;
   }
+  else
+    cmn_systimer_overflow = 0;
+
+  cmn_systimer_last_crtsys = crtsys;
   return ( timer_data_type )crtsys;
 }


On Sun, Feb 23, 2014 at 11:59 AM, Mark Burton <[hidden email]> wrote:
Having thought about this some more, I realise that there's a much simpler fix. All it requires is to reorder the assignments in cmn_systimer_get() so they look like this:

  tempsys = cmn_systimer_counter;
  tempcnt = platform_timer_sys_raw_read();
  while( ( crtsys = cmn_systimer_counter ) != tempsys )
  {
    tempsys = crtsys;
    tempcnt = platform_timer_sys_raw_read();
  }

Now, when the loop finishes, you know that the value of tempcnt is consistent with the value of crtsys because the value of cmn_systimer_counter did not change while the platform timer value was being read and assigned to tempcnt.

This assumes that when the counter wraps, an interrupt is generated immediately and the interrupt handler calls cmn_systimer_periodic(). If interrupts are disabled then the counter could wrap and cmn_systimer_counter would not be advanced so the resulting time would be in the past but this has always been possible with this implementation.

Anyway, I have tested the above change and it appears to work OK so I shall submit a pull request.

Cheers,

Mark



On 23 February 2014 15:33, Mark Burton <[hidden email]> wrote:

Hi,

> Right. Nasty. Of the solutions you suggest, the 64-bit platform
> function seems cleaner, handling the locking or wrap detection or
> whatever in the hardware-specific way that the platform requires.
>
> I think we should hear Bogdan, as the systimer is his invention and he
> has written clear directions on how it is to be used at a platform
> level:
> http://www.eluaproject.net/doc/v0.9/en_arch_platform_timers.html#the_system_timer
> so if you can resolve the bug within that framework, so much the
> better...
>
> good luck!

I have been testing a fix like I suggested a few emails back with this
as the implementation for Cortex M3/M4 chips:

int platform_timer_sys_wrapped(void)
{
  return (SysTick->CTRL & SysTick_CTRL_COUNTFLAG_Msk) != 0;
}

And this is how cmn_systimer_get() now looks:

timer_data_type cmn_systimer_get(void)
{
  u64 tempcnt, crtsys;

  platform_timer_sys_wrapped();
  do {
    tempcnt = platform_timer_sys_raw_read();
    crtsys = cmn_systimer_counter;
  } while( platform_timer_sys_wrapped() );

  crtsys += tempcnt / cmn_systimer_ticks_for_us;
  if( crtsys > PLATFORM_TIMER_SYS_MAX ) // timer overflow
  {
    crtsys -= PLATFORM_TIMER_SYS_MAX;
    platform_timer_sys_disable_int();
    if( cmn_systimer_counter > PLATFORM_TIMER_SYS_MAX )
      cmn_systimer_counter -= PLATFORM_TIMER_SYS_MAX;
    platform_timer_sys_enable_int();
  }
  return ( timer_data_type )crtsys;
}

It appears to fix the problem so I am currently working on a patch which
will implement platform_timer_sys_wrapped() for all platforms that need
it so that the fix doesn't break any existing builds.

Cheers,

Mark


_______________________________________________
eLua-dev mailing list
[hidden email]
https://lists.berlios.de/mailman/listinfo/elua-dev



--
James Snyder
ph: (847) 448-0386

_______________________________________________
eLua-dev mailing list
[hidden email]
https://lists.berlios.de/mailman/listinfo/elua-dev
jbsnyder jbsnyder
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Possible hard to fix race in cmn_systimer_get(), ideas?

OK, that one was diffed from an older version of master.  I cleaned it up a bit and made it mergeable here without hopefully undoing some of Mark's previous work:


On Mon, Feb 24, 2014 at 6:54 PM, James Snyder <[hidden email]> wrote:
Thanks for this.  I would have joined in on this on the weekend but I was out of town.

I actually had done something about this some time ago and had a conversation with Bogdan about it back then, though I didn't run the final implementation by him.  Then.... I forgot to merge it.  I _believe_ it has been working pretty well on the branch but I haven't delved into the related issues any time recently or thought about all the ins and outs.

Yours is simpler, but if I recall this covers some other potential situations and the rather small toggler might be nice for architectures with smaller words?

Sorry for leaving this one outstanding for so long.

@@ -396,6 +403,7 @@ int platform_timer_set_match_int( unsigned id, timer_data_type period_us, int ty
 
 static u32 cmn_systimer_ticks_for_us;
 static volatile u64 cmn_systimer_counter;
+static volatile u8 cmn_systimer_toggle = 0;
 static u32 cmn_systimer_us_per_interrupt;
 
 void cmn_systimer_set_base_freq( u32 freq_hz )
@@ -413,30 +421,44 @@ void cmn_systimer_set_interrupt_period_us( u32 period ) void cmn_systimer_periodic(void)
 {
+  cmn_systimer_toggle ^= 1;
   cmn_systimer_counter += cmn_systimer_us_per_interrupt;
 }
 
+u64 cmn_systimer_last_crtsys = 0;
+u8 cmn_systimer_overflow = 0;
 timer_data_type cmn_systimer_get(void)
 {
-  u64 tempsys, tempcnt, crtsys;
-
-  tempcnt = platform_timer_sys_raw_read();
-  tempsys = cmn_systimer_counter;
-  while( ( crtsys = cmn_systimer_counter ) != tempsys )
+  u64 tempcnt, crtsys;
+  u8 tmptoggle;
+  do
   {
+    tmptoggle = cmn_systimer_toggle;
     tempcnt = platform_timer_sys_raw_read();
-    tempsys = crtsys;
-  }
+    crtsys = cmn_systimer_counter;
+  } while( platform_timer_sys_raw_read() < tempcnt || cmn_systimer_toggle != tmptoggle );
+
   crtsys += tempcnt / cmn_systimer_ticks_for_us;
+  if( crtsys < cmn_systimer_last_crtsys && cmn_systimer_overflow == 0 )
+  {
+    while( crtsys < cmn_systimer_last_crtsys )
+      crtsys += cmn_systimer_us_per_interrupt;
+  }
+
   if( crtsys > PLATFORM_TIMER_SYS_MAX ) // timer overflow
   {
     crtsys %= PLATFORM_TIMER_SYS_MAX;
     platform_timer_sys_disable_int();
     cmn_systimer_counter = 0;
     platform_timer_sys_enable_int();
+    cmn_systimer_overflow = 1;
   }
+  else
+    cmn_systimer_overflow = 0;
+
+  cmn_systimer_last_crtsys = crtsys;
   return ( timer_data_type )crtsys;
 }


On Sun, Feb 23, 2014 at 11:59 AM, Mark Burton <[hidden email]> wrote:
Having thought about this some more, I realise that there's a much simpler fix. All it requires is to reorder the assignments in cmn_systimer_get() so they look like this:

  tempsys = cmn_systimer_counter;
  tempcnt = platform_timer_sys_raw_read();
  while( ( crtsys = cmn_systimer_counter ) != tempsys )
  {
    tempsys = crtsys;
    tempcnt = platform_timer_sys_raw_read();
  }

Now, when the loop finishes, you know that the value of tempcnt is consistent with the value of crtsys because the value of cmn_systimer_counter did not change while the platform timer value was being read and assigned to tempcnt.

This assumes that when the counter wraps, an interrupt is generated immediately and the interrupt handler calls cmn_systimer_periodic(). If interrupts are disabled then the counter could wrap and cmn_systimer_counter would not be advanced so the resulting time would be in the past but this has always been possible with this implementation.

Anyway, I have tested the above change and it appears to work OK so I shall submit a pull request.

Cheers,

Mark



On 23 February 2014 15:33, Mark Burton <[hidden email]> wrote:

Hi,

> Right. Nasty. Of the solutions you suggest, the 64-bit platform
> function seems cleaner, handling the locking or wrap detection or
> whatever in the hardware-specific way that the platform requires.
>
> I think we should hear Bogdan, as the systimer is his invention and he
> has written clear directions on how it is to be used at a platform
> level:
> http://www.eluaproject.net/doc/v0.9/en_arch_platform_timers.html#the_system_timer
> so if you can resolve the bug within that framework, so much the
> better...
>
> good luck!

I have been testing a fix like I suggested a few emails back with this
as the implementation for Cortex M3/M4 chips:

int platform_timer_sys_wrapped(void)
{
  return (SysTick->CTRL & SysTick_CTRL_COUNTFLAG_Msk) != 0;
}

And this is how cmn_systimer_get() now looks:

timer_data_type cmn_systimer_get(void)
{
  u64 tempcnt, crtsys;

  platform_timer_sys_wrapped();
  do {
    tempcnt = platform_timer_sys_raw_read();
    crtsys = cmn_systimer_counter;
  } while( platform_timer_sys_wrapped() );

  crtsys += tempcnt / cmn_systimer_ticks_for_us;
  if( crtsys > PLATFORM_TIMER_SYS_MAX ) // timer overflow
  {
    crtsys -= PLATFORM_TIMER_SYS_MAX;
    platform_timer_sys_disable_int();
    if( cmn_systimer_counter > PLATFORM_TIMER_SYS_MAX )
      cmn_systimer_counter -= PLATFORM_TIMER_SYS_MAX;
    platform_timer_sys_enable_int();
  }
  return ( timer_data_type )crtsys;
}

It appears to fix the problem so I am currently working on a patch which
will implement platform_timer_sys_wrapped() for all platforms that need
it so that the fix doesn't break any existing builds.

Cheers,

Mark


_______________________________________________
eLua-dev mailing list
[hidden email]
https://lists.berlios.de/mailman/listinfo/elua-dev



--
James Snyder
ph: <a href="tel:%28847%29%20448-0386" value="+18474480386" target="_blank">(847) 448-0386



--
James Snyder
ph: (847) 448-0386

_______________________________________________
eLua-dev mailing list
[hidden email]
https://lists.berlios.de/mailman/listinfo/elua-dev
Martin Guy Martin Guy
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Possible hard to fix race in cmn_systimer_get(), ideas?

Hi James!
   Mark's seems to be the smallest fix to the problem he found, and if
it does fix the problem, it seems to do so in a way that "just works"
without checking for the rare failing condition and compensating for
it when it happens. ...and smaller code size is always good in eLua.

   In both cases, it would be nice to have a comment before the
critical code sequence to say what problem it avoids and how, so that
people are more likely to be able to understand it in future and less
likely to break it by mistake

   Or is there an operational difference between the two solutions? If
they both achieve the same result, can we go for the smaller one.?

      M

On 25/02/2014, James Snyder <[hidden email]> wrote:

> OK, that one was diffed from an older version of master.  I cleaned it up a
> bit and made it mergeable here without hopefully undoing some of Mark's
> previous work:
> https://github.com/jsnyder/elua/compare/systimer_fix?expand=1
>
>
> On Mon, Feb 24, 2014 at 6:54 PM, James Snyder
> <[hidden email]>wrote:
>
>> Thanks for this.  I would have joined in on this on the weekend but I was
>> out of town.
>>
>> I actually had done something about this some time ago and had a
>> conversation with Bogdan about it back then, though I didn't run the
>> final
>> implementation by him.  Then.... I forgot to merge it.  I _believe_ it
>> has
>> been working pretty well on the branch but I haven't delved into the
>> related issues any time recently or thought about all the ins and outs.
>>
>> Yours is simpler, but if I recall this covers some other potential
>> situations and the rather small toggler might be nice for architectures
>> with smaller words?
>>
>> Sorry for leaving this one outstanding for so long.
>>
>> @@ -396,6 +403,7 @@ int platform_timer_set_match_int( unsigned id,
>> timer_data_type period_us, int ty
>>
>>  static u32 cmn_systimer_ticks_for_us;
>>  static volatile u64 cmn_systimer_counter;
>> +static volatile u8 cmn_systimer_toggle = 0;
>>  static u32 cmn_systimer_us_per_interrupt;
>>
>>  void cmn_systimer_set_base_freq( u32 freq_hz )
>> @@ -413,30 +421,44 @@ void cmn_systimer_set_interrupt_period_us( u32
>> period ) void cmn_systimer_periodic(void)
>>  {
>> +  cmn_systimer_toggle ^= 1;
>>    cmn_systimer_counter += cmn_systimer_us_per_interrupt;
>>  }
>>
>> +u64 cmn_systimer_last_crtsys = 0;
>> +u8 cmn_systimer_overflow = 0;
>>  timer_data_type cmn_systimer_get(void)
>>  {
>> -  u64 tempsys, tempcnt, crtsys;
>> -
>> -  tempcnt = platform_timer_sys_raw_read();
>> -  tempsys = cmn_systimer_counter;
>> -  while( ( crtsys = cmn_systimer_counter ) != tempsys )
>> +  u64 tempcnt, crtsys;
>> +  u8 tmptoggle;
>> +  do
>>    {
>> +    tmptoggle = cmn_systimer_toggle;
>>      tempcnt = platform_timer_sys_raw_read();
>> -    tempsys = crtsys;
>> -  }
>> +    crtsys = cmn_systimer_counter;
>> +  } while( platform_timer_sys_raw_read() < tempcnt ||
>> cmn_systimer_toggle
>> != tmptoggle );
>> +
>>    crtsys += tempcnt / cmn_systimer_ticks_for_us;
>> +  if( crtsys < cmn_systimer_last_crtsys && cmn_systimer_overflow == 0 )
>> +  {
>> +    while( crtsys < cmn_systimer_last_crtsys )
>> +      crtsys += cmn_systimer_us_per_interrupt;
>> +  }
>> +
>>    if( crtsys > PLATFORM_TIMER_SYS_MAX ) // timer overflow
>>    {
>>      crtsys %= PLATFORM_TIMER_SYS_MAX;
>>      platform_timer_sys_disable_int();
>>      cmn_systimer_counter = 0;
>>      platform_timer_sys_enable_int();
>> +    cmn_systimer_overflow = 1;
>>    }
>> +  else
>> +    cmn_systimer_overflow = 0;
>> +
>> +  cmn_systimer_last_crtsys = crtsys;
>>    return ( timer_data_type )crtsys;
>>  }
>>
>>
>> On Sun, Feb 23, 2014 at 11:59 AM, Mark Burton <[hidden email]>
>> wrote:
>>
>>> Having thought about this some more, I realise that there's a much
>>> simpler fix. All it requires is to reorder the assignments in
>>> cmn_systimer_get() so they look like this:
>>>
>>>   tempsys = cmn_systimer_counter;
>>>   tempcnt = platform_timer_sys_raw_read();
>>>   while( ( crtsys = cmn_systimer_counter ) != tempsys )
>>>   {
>>>     tempsys = crtsys;
>>>     tempcnt = platform_timer_sys_raw_read();
>>>    }
>>>
>>> Now, when the loop finishes, you know that the value of tempcnt is
>>> consistent with the value of crtsys because the value of
>>> cmn_systimer_counter did not change while the platform timer value was
>>> being read and assigned to tempcnt.
>>>
>>> This assumes that when the counter wraps, an interrupt is generated
>>> immediately and the interrupt handler calls cmn_systimer_periodic(). If
>>> interrupts are disabled then the counter could wrap and
>>> cmn_systimer_counter would not be advanced so the resulting time would
>>> be
>>> in the past but this has always been possible with this implementation.
>>>
>>> Anyway, I have tested the above change and it appears to work OK so I
>>> shall submit a pull request.
>>>
>>> Cheers,
>>>
>>> Mark
>>>
>>>
>>>
>>> On 23 February 2014 15:33, Mark Burton <[hidden email]> wrote:
>>>
>>>>
>>>> Hi,
>>>>
>>>> > Right. Nasty. Of the solutions you suggest, the 64-bit platform
>>>> > function seems cleaner, handling the locking or wrap detection or
>>>> > whatever in the hardware-specific way that the platform requires.
>>>> >
>>>> > I think we should hear Bogdan, as the systimer is his invention and
>>>> > he
>>>> > has written clear directions on how it is to be used at a platform
>>>> > level:
>>>> >
>>>> http://www.eluaproject.net/doc/v0.9/en_arch_platform_timers.html#the_system_timer
>>>> > so if you can resolve the bug within that framework, so much the
>>>> > better...
>>>> >
>>>> > good luck!
>>>>
>>>> I have been testing a fix like I suggested a few emails back with this
>>>> as the implementation for Cortex M3/M4 chips:
>>>>
>>>> int platform_timer_sys_wrapped(void)
>>>> {
>>>>   return (SysTick->CTRL & SysTick_CTRL_COUNTFLAG_Msk) != 0;
>>>> }
>>>>
>>>> And this is how cmn_systimer_get() now looks:
>>>>
>>>> timer_data_type cmn_systimer_get(void)
>>>> {
>>>>   u64 tempcnt, crtsys;
>>>>
>>>>   platform_timer_sys_wrapped();
>>>>   do {
>>>>     tempcnt = platform_timer_sys_raw_read();
>>>>     crtsys = cmn_systimer_counter;
>>>>   } while( platform_timer_sys_wrapped() );
>>>>
>>>>   crtsys += tempcnt / cmn_systimer_ticks_for_us;
>>>>   if( crtsys > PLATFORM_TIMER_SYS_MAX ) // timer overflow
>>>>   {
>>>>     crtsys -= PLATFORM_TIMER_SYS_MAX;
>>>>     platform_timer_sys_disable_int();
>>>>     if( cmn_systimer_counter > PLATFORM_TIMER_SYS_MAX )
>>>>       cmn_systimer_counter -= PLATFORM_TIMER_SYS_MAX;
>>>>     platform_timer_sys_enable_int();
>>>>   }
>>>>   return ( timer_data_type )crtsys;
>>>> }
>>>>
>>>> It appears to fix the problem so I am currently working on a patch
>>>> which
>>>> will implement platform_timer_sys_wrapped() for all platforms that need
>>>> it so that the fix doesn't break any existing builds.
>>>>
>>>> Cheers,
>>>>
>>>> Mark
>>>>
>>>
>>>
>>> _______________________________________________
>>> eLua-dev mailing list
>>> [hidden email]
>>> https://lists.berlios.de/mailman/listinfo/elua-dev
>>>
>>
>>
>>
>> --
>> James Snyder
>> ph: (847) 448-0386
>>
>
>
>
> --
> James Snyder
> ph: (847) 448-0386
>
_______________________________________________
eLua-dev mailing list
[hidden email]
https://lists.berlios.de/mailman/listinfo/elua-dev
Mark Burton Mark Burton
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Possible hard to fix race in cmn_systimer_get(), ideas?

In reply to this post by jbsnyder

Hi James, Martin,

> OK, that one was diffed from an older version of master.  I cleaned
> it up a bit and made it mergeable here without hopefully undoing some
> of Mark's previous work:
> https://github.com/jsnyder/elua/compare/systimer_fix?expand=1

I've studied that and have a couple of comments:

1 - the addition of the toggle variable only achieves the same goal as
reordering the assignments did in my last patch and, obviously, is more
complicated. Both changes detect that cmn_systimer_counter has changed
in the time that platform_sys_timer_raw_read() has read the system
timer counter.

2 - I can't see that testing (platform_system_timer_raw_read() <
tempcnt) helps, that just means that the timer has wrapped but you
would know that anyway because the value of cmn_sys_timer_counter would
have changed.

3 - I don't understand what lines 440-444 and the overflow flag are
achieving. Even without those additions, the timer doesn't lose any time
when it wraps at the PLATFORM_SYS_TIMER_MAX_BOUNDARY.

These changes add complication but I can't at the moment see the
benefit they bring. The current code in eula/eula appears to have fixed
the original problem I was seeing. Reminder, this little test program
produces output before the assignment reordering was done but doesn't
now:

while true do
  local start = tmr.read()
  math.sqrt(math.huge)
  local us = tmr.getdiffnow(nil, start)
  if(us > 1e6) then
    print(tmr.read() / 1e6 .. ": took " .. us)
  end
end



Cheers,

Mark



_______________________________________________
eLua-dev mailing list
[hidden email]
https://lists.berlios.de/mailman/listinfo/elua-dev
jbsnyder jbsnyder
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Possible hard to fix race in cmn_systimer_get(), ideas?




On Wed, Feb 26, 2014 at 2:02 AM, Mark Burton <[hidden email]> wrote:

Hi James, Martin,

> OK, that one was diffed from an older version of master.  I cleaned
> it up a bit and made it mergeable here without hopefully undoing some
> of Mark's previous work:
> https://github.com/jsnyder/elua/compare/systimer_fix?expand=1
I've studied that and have a couple of comments:

1 - the addition of the toggle variable only achieves the same goal as
reordering the assignments did in my last patch and, obviously, is more
complicated. Both changes detect that cmn_systimer_counter has changed
in the time that platform_sys_timer_raw_read() has read the system
timer counter.

I think what I was going for here was to use a type where the type would be smaller than the native word size so that the operation would be atomic rather than the 64-bit operation which is split across more instructions.  The toggle being u8 would work even on smaller architectures (not that we're actually trying to run on an 8-bit micro).

Each time you're still doing work on a 64-bit type on a 32-bit architecture so you can get reads of cmn_systimer_counter where an interrupt could modify cmn_systimer_counter in between instructions.

I think that yours does actually deal with this because there are two separate reads so if one is garbage it will repeat the read until you get one that wouldn't be split.
 
2 - I can't see that testing (platform_system_timer_raw_read() <
tempcnt) helps, that just means that the timer has wrapped but you
would know that anyway because the value of cmn_sys_timer_counter would
have changed.

Unless the interrupt was masked for some reason, but I see your point.  We're already going to be losing time if that's happening.
 

3 - I don't understand what lines 440-444 and the overflow flag are
achieving. Even without those additions, the timer doesn't lose any time
when it wraps at the PLATFORM_SYS_TIMER_MAX_BOUNDARY. 

I believe what I was trying to do here was handle situations where the interrupt that was being used to trigger updates of the systimer value would run too long and lose a cycle so short of an actual systimer wrap this was intended to make sure the timer would always advance and not go backwards.

I'm not sure why I have a while loop in there since it would handle a single lost cycle at best.  It's not intended to ensure accurate timer behavior, just to avoid timer diffs thinking a wrap had happened.
 

These changes add complication but I can't at the moment see the
benefit they bring. The current code in eula/eula appears to have fixed
the original problem I was seeing. Reminder, this little test program
produces output before the assignment reordering was done but doesn't
now:

Thanks for looking at it.  I think your approach should cover the issue you were concerned about (wrapping) as well as the word-size atomicity issue if I'm understanding things correctly now and it's simpler than the toggler approach.

The other item is more of a workaround for bad user behavior to make sure that doing timer diffs wouldn't assume a systimer wrap when an incrementation of the systimer got dropped.

Best.


while true do
  local start = tmr.read()
  math.sqrt(math.huge)
  local us = tmr.getdiffnow(nil, start)
  if(us > 1e6) then
    print(tmr.read() / 1e6 .. ": took " .. us)
  end
end



Cheers,

Mark



_______________________________________________
eLua-dev mailing list
[hidden email]
https://lists.berlios.de/mailman/listinfo/elua-dev



--
James Snyder
ph: (847) 448-0386

_______________________________________________
eLua-dev mailing list
[hidden email]
https://lists.berlios.de/mailman/listinfo/elua-dev
Mark Burton Mark Burton
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Possible hard to fix race in cmn_systimer_get(), ideas?


Hi James,

> > 1 - the addition of the toggle variable only achieves the same goal
> > as reordering the assignments did in my last patch and, obviously,
> > is more complicated. Both changes detect that cmn_systimer_counter
> > has changed in the time that platform_sys_timer_raw_read() has read
> > the system timer counter.
> >
>
> I think what I was going for here was to use a type where the type
> would be smaller than the native word size so that the operation
> would be atomic rather than the 64-bit operation which is split
> across more instructions. The toggle being u8 would work even on
> smaller architectures (not that we're actually trying to run on an
> 8-bit micro).
>
> Each time you're still doing work on a 64-bit type on a 32-bit
> architecture so you can get reads of cmn_systimer_counter where an
> interrupt could modify cmn_systimer_counter in between instructions.
>
> I think that yours does actually deal with this because there are two
> separate reads so if one is garbage it will repeat the read until you
> get one that wouldn't be split.

OK, I can understand the idea but I think the current code is safe as
long as there is no delay between the system timer counter wrapping and
the subsequent interrupt calling cmn_systimer_periodic().

If there was a delay then it would be possible that
cmn_systimer_counter had not yet been advanced and so the
resulting time would jump backwards. Anyway, I don't think adding the
toggle variable would make any difference in this situation.

>
>
> > 2 - I can't see that testing (platform_system_timer_raw_read() <
> > tempcnt) helps, that just means that the timer has wrapped but you
> > would know that anyway because the value of cmn_sys_timer_counter
> > would have changed.
> >
>
> Unless the interrupt was masked for some reason, but I see your point.
>  We're already going to be losing time if that's happening.

Yes, if the interrupt is masked, it breaks.

> >
> > 3 - I don't understand what lines 440-444 and the overflow flag are
> > achieving. Even without those additions, the timer doesn't lose any
> > time when it wraps at the PLATFORM_SYS_TIMER_MAX_BOUNDARY.
>
>
> I believe what I was trying to do here was handle situations where the
> interrupt that was being used to trigger updates of the systimer value
> would run too long and lose a cycle so short of an actual systimer
> wrap this was intended to make sure the timer would always advance
> and not go backwards.
>
> I'm not sure why I have a while loop in there since it would handle a
> single lost cycle at best.  It's not intended to ensure accurate timer
> behavior, just to avoid timer diffs thinking a wrap had happened.
>
>
> >
> > These changes add complication but I can't at the moment see the
> > benefit they bring. The current code in eula/eula appears to have
> > fixed the original problem I was seeing. Reminder, this little test
> > program produces output before the assignment reordering was done
> > but doesn't now:
> >
>
> Thanks for looking at it.  I think your approach should cover the
> issue you were concerned about (wrapping) as well as the word-size
> atomicity issue if I'm understanding things correctly now and it's
> simpler than the toggler approach.
>
> The other item is more of a workaround for bad user behavior to make
> sure that doing timer diffs wouldn't assume a systimer wrap when an
> incrementation of the systimer got dropped.
>
> Best.

Let's keep an eye on it and consider further changes if necessary.

Cheers,

Mark

_______________________________________________
eLua-dev mailing list
[hidden email]
https://lists.berlios.de/mailman/listinfo/elua-dev
Loading...