Pour obtenir une clé
d'essai remplissez le formulaire ci-dessous
Demandez des tariffs
Nouvelle licence
Renouvellement de licence
--Sélectionnez la devise--
USD
EUR
RUB
* En cliquant sur ce bouton, vous acceptez notre politique de confidentialité

Free PVS-Studio license for Microsoft MVP specialists
To get the licence for your open-source project, please fill out this form
** En cliquant sur ce bouton, vous acceptez notre politique de confidentialité.

I am interested to try it on the platforms:
** En cliquant sur ce bouton, vous acceptez notre politique de confidentialité.

Message submitted.

Your message has been sent. We will email you at


If you haven't received our response, please do the following:
check your Spam/Junk folder and click the "Not Spam" button for our message.
This way, you won't miss messages from our team in the future.

>
>
>
An eternal question of timing

An eternal question of timing

30 Mar 2011
Author:

It seemed that long forum debates about methods of measuring algorithm's running time, functions to use and precision that should be expected were over. Unfortunately, we have to return to this question once again. Today we will discuss the question how we should measure speed of a parallel algorithm.

I want to say right away that I will not give you a concrete recipe. I myself have faced the issue of measuring parallel algorithms' speed only recently, so I am not an expert in this question. So, this post is rather a research article. I will appreciate if you share your opinions and recommendations with me. I think we will manage the problem together and make out an optimal solution.

The task is to measure the running time of a fragment of user code. I would use the following class to solve this task earlier:

class Timing {
public:
  void StartTiming();
  void StopTiming();
  double GetUserSeconds() const {
    return double(m_userTime) / 10000000.0;
  }
private:
  __int64 GetUserTime() const;
  __int64 m_userTime;
};

__int64 Timing::GetUserTime() const {
  FILETIME creationTime;
  FILETIME exitTime;
  FILETIME kernelTime;
  FILETIME userTime;
  GetThreadTimes(GetCurrentThread(),
                 &creationTime, &exitTime,
                 &kernelTime, &userTime);
  __int64 curTime;
  curTime = userTime.dwHighDateTime;
  curTime <<= 32;
  curTime += userTime.dwLowDateTime;
  return curTime;
}

void Timing::StartTiming() {
  m_userTime = GetUserTime();
}

void Timing::StopTiming() {
  __int64 curUserTime = GetUserTime();
  m_userTime = curUserTime - m_userTime;
}

This class is based on the GetThreadTimes function that allows you to separate running time of user code from running time of system functions. The class is intended for estimate of running time of a thread in user mode, so we use only the returned parameter lpUserTime.

Now consider a code sample where some number is calculated. We will use the Timing class to measure the running time.

void UseTiming1()
{
  Timing t;
  t.StartTiming();
  unsigned sum = 0;

  for (int i = 0; i < 1000000; i++)
  {
    char str[1000];
    for (int j = 0; j < 999; j++)
      str[j] = char(((i + j) % 254) + 1);
    str[999] = 0;
    for (char c = 'a'; c <= 'z'; c++)
      if (strchr(str, c) != NULL)
        sum += 1;
  }

  t.StopTiming();
  printf("sum = %u\n", sum);
  printf("%.3G seconds.\n", t.GetUserSeconds());
}

Being presented in this form, the timing mechanism behaves as it was expected and gives, say, 7 seconds on my machine. The result is correct even for a multi-core machine since it does not matter which cores will be used while the algorithm is running (see Figure 1).

0097_Timing/image1.png

Figure 1 - Work of one thread on a multi-core computer

Now imagine that we want to use capabilities of multi-core processors in our program and estimate benefits we will get from parallelizing the algorithm relying on the OpenMP technology. Let's parallelize our code by adding one line:

#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < 1000000; i++)
{
  char str[1000];
  for (int j = 0; j < 999; j++)
    str[j] = char(((i + j) % 254) + 1);
  str[999] = 0;
  for (char c = 'a'; c <= 'z'; c++)
    if (strchr(str, c) != NULL)
      sum += 1;
}

The program now prints the running time 1.6 seconds. Since we use a 4-core computer, I feel like saying "Hurray! We've got a 4-time speed-up and timing confirms this".

But really it is not so good: we are not measuring the running time of the algorithm. We are measuring the running time of the main thread instead. In this case, the measuring seems reliable because the main thread was working for the same time as the secondary threads. Let's carry out a simple experiment: we will explicitly specify 10 threads to be used instead of 4:

#pragma omp parallel for reduction(+:sum) num_threads(10)

Logic says that this code must work for about the same time as the code parallelized into 4 threads. We have a four-core processor, so we should expect that a larger number of threads will cause only slowdown. Instead, we will see the result about 0.7 seconds.

This is an expected result although we wanted to get quite a different thing. We created 10 threads. Each of them was working for about 0.7 seconds. It is time of the main thread, whose running time is measured with the Timing class, ran for. As you can see, this method cannot be used to measure speed of programs with parallel code fragments. Let's make it clearer by presenting it graphically in Figure 2.

0097_Timing/image2.png

Figure 2 - This is how work of 10 threads might look on a four-core computer

Of course, we may well use the time() function but its resolution is low and it will not allow you to separate the running time of user code from that of system code. There may be other processes influencing the time, which may also significantly distort timing.

A favorite timing function of many developers is QueryPerformanceCounter. Let's measure speed using this function. In a simple form, the timing class looks this way:

class Timing2 {
public:
  void StartTiming();
  void StopTiming();
  double GetUserSeconds() const {
    return value;
  }
private:
  double value;
  LARGE_INTEGER time1;
};

void Timing2::StartTiming()
{         
  QueryPerformanceCounter(&time1);  
}  

void Timing2::StopTiming()
{  
  LARGE_INTEGER performance_frequency, time2;
  QueryPerformanceFrequency(&performance_frequency);
  QueryPerformanceCounter(&time2);  
  value = (double)(time2.QuadPart - time1.QuadPart);
  value /= performance_frequency.QuadPart;
}

Unfortunately, we cannot do so on a multi-core computer anymore. :) Let's read the description of this function in MSDN:

On a multiprocessor computer, it should not matter which processor is called. However, you can get different results on different processors due to bugs in the basic input/output system (BIOS) or the hardware abstraction layer (HAL). To specify processor affinity for a thread, use the SetThreadAffinityMask function.

Let's improve the code and tie the main thread to one core:

class Timing2 {
public:
  void StartTiming();
  void StopTiming();
  double GetUserSeconds() const {
    return value;
  }
private:
  DWORD_PTR oldmask;
  double value;
  LARGE_INTEGER time1;
};

void Timing2::StartTiming()
{         
  volatile int warmingUp = 1;
  #pragma omp parallel for
  for (int i=1; i<10000000; i++)
  {
    #pragma omp atomic
    warmingUp *= i;
  }

  oldmask = SetThreadAffinityMask(::GetCurrentThread(), 1);

  QueryPerformanceCounter(&time1);
}

void Timing2::StopTiming()
{  
  LARGE_INTEGER performance_frequency, time2;
  QueryPerformanceFrequency(&performance_frequency);
  QueryPerformanceCounter(&time2);  
  SetThreadAffinityMask(::GetCurrentThread(), oldmask);

  value = (double)(time2.QuadPart - time1.QuadPart);
  value /= performance_frequency.QuadPart;
}

Readers might ask what for we need a strange loop that does nothing. Contemporary processors reduce their frequency at low load. This loop preliminarily increases the processor's speed to the maximum and therefore increases the precision of speed measurement a bit. Additionally, we are warming up all the available cores.

The timing method we have shown involves the same drawback: we cannot separate the running time of user code from that of system code. If there are other tasks running on a core at the same time, the result might also be rather inaccurate. But it seems to me that this method still can be applied to a parallel algorithm unlike GetThreadTimes.

Let's measure the results of Timing and Timing2 classes at various numbers of threads. For this purpose, the OpenMP-directive num_threads(N) is used. Let's arrange the data in a table shown in Figure 3.

0097_Timing/image3.png

Figure 3 - Algorithm's running time in seconds measured with functions GetThreadTimes and QueryPerformanceCounter on a four-core machine

As you may see, as long as the number of threads does not exceed the number of cores, the GetThreadTimes function gives us a result similar to that of the QueryPerformanceCounter function, which makes you think that measurement is correct. But if there are more threads, you cannot rely on its result.

Unfortunately, the program prints varying values from launch to launch. I do not know how to make measuring more accurate and correct. So I am waiting for your feedback and methods of correct timing of parallel algorithms.

You may download the program text here (a project for Visual Studio 2005).

Popular related articles
"Our legacy of the past" or why we divided the V512

Date: 12 Aoû 2022

Author: Mikhail Gelvih

As the saying goes, the first step is always the hardest. That's exactly what happened in our case – after delaying it for so long, we have finally split the V512 diagnostic rule. You can read more a…
Why do arrays have to be deleted via delete[] in C++

Date: 27 Jul 2022

Author: Mikhail Gelvih

This note is for C++ beginner programmers who are wondering why everyone keeps telling them to use delete[] for arrays. But, instead of a clear explanation, senior developers just keep hiding behind …
Intermodular analysis of C and C++ projects in detail. Part 2

Date: 14 Jul 2022

Author: Oleg Lisiy

In part 1 we discussed the basics of C and C++ projects compiling. We also talked over linking and optimizations. In part 2 we are going to delve deeper into intermodular analysis and discuss its ano…
Intermodular analysis of C and C++ projects in detail. Part 1

Date: 08 Jul 2022

Author: Oleg Lisiy

Starting from PVS-Studio 7.14, the C and C++ analyzer has been supporting intermodular analysis. In this two-part article, we'll describe how similar mechanisms are arranged in compilers and reveal s…
Four reasons to check what the malloc function returned

Date: 20 Avr 2022

Author: Andrey Karpov

Some developers may be dismissive of checks: they deliberately do not check whether the malloc function allocated memory or not. Their reasoning is simple — they think that there will be enough memor…

Comments (0)

Next comments
Unicorn with delicious cookie
Nous utilisons des cookies pour améliorer votre expérience de navigation. En savoir plus
Accepter