April 2012

Volume 27 Number 04

C++ - A Code-Based Introduction to C++ AMP

By Daniel Moth | April 2012

This article covers a prerelease technology called C++ AMP that will ship with Visual Studio 11. All information is subject to change. For more information about C++ AMP, read "Introduction to Tiling in C++ AMP" in this issue of MSDN Magazine.

Visual Studio 11 brings support for heterogeneous computing to the mainstream through a technology called C++ Accelerated Massive Parallelism (C++ AMP). This enables you to take advantage of accelerators, such as GPUs, for speeding up data parallel algorithms.

C++ AMP delivers performance, in a hardware-portable manner, without compromising the productivity you’ve come to expect from modern C++ and the Visual Studio package. It can provide orders-of-magnitude speed gains compared with using the CPU alone. At conferences I typically demo a single process taking advantage of both NVIDIA and AMD GPUs at the same time, while still having a CPU fallback solution.

In this code-driven introduction to C++ AMP, I’ll assume that you’re reading every line of code in the article. The inline code is a core part of the article, and what’s in the C++ code won’t necessarily be repeated in the article’s text.

Setup and Example Algorithm

First, let’s look at the simple algorithm we’ll be working with, along with the necessary setup code, in preparation for converting it to C++ AMP later.

Create an empty C++ project, add a new empty C++ file (Source.cpp) and type in the following self-explanatory code (I’m using noncontiguous line numbers to facilitate explanation in the article text, and you’ll find the same line numbers in the accompanying downloadable project):

1 #include <amp.h>                // C++ AMP header file
3 #include <iostream>             // For std::cout etc
4 using namespace concurrency;    // Save some typing :)
5 using std::vector;     // Ditto. Comes from <vector> brought in by amp.h
6
79 int main()
80 {
81   do_it();
82
83   std::cout << "Hit any key to exit..." << std::endl;
84   std::cin.get();
85 }

C++ AMP introduces a number of types across a number of header files. As per lines 1 and 4 in the preceding code snippet, the main header file is amp.h and the main types are added to the existing concurrency namespace. No additional setup or compilation option is required for using C++ AMP. Now let’s add a do_it function above main (see Figure 1).

Figure 1 The do_it Function, Called from main

52 void do_it()
53 {
54   // Rows and columns for matrix
55   const int M = 1024;
56   const int N = 1024;
57
58   // Create storage for a matrix of above size
59   vector<int> vA(M * N);
60   vector<int> vB(M * N);
61
62   // Populate matrix objects
63   int i = 0;
64   std::generate(vA.begin(), vA.end(), [&i](){return i++;});
65   std::generate(vB.begin(), vB.end(), [&i](){return i--;});
66
67   // Output storage for matrix calculation
68   vector<int> vC(M * N);
69
70   perform_calculation(vA, vB, vC, M, N);
76 }

In lines 59, 60 and 68, the code uses std::vector objects as flat containers for each matrix, even though a two-dimensional type is what you’d really want to be dealing with—more on this later.

It’s important to understand the lambda expressions on lines 64 and 65 that are passed to the std::generate algorithm to populate the two vector objects. This article assumes you’re proficient with lambdas in C++. For example, you should instantly see that if the variable i were captured by value (by changing the capture list to [i] or [=] and adding the mutable keyword), std::generate would operate on its own copy of i, so the outer i would still be 0 when the second generate call runs and vB would be populated starting from 0 rather than from where vA left off. If you aren’t comfortable with using lambdas (a wonderful addition to the C++11 standard), please read the MSDN Library article “Lambda Expressions in C++” (msdn.microsoft.com/library/dd293608) and come back here when you’re done.

The do_it function introduced a call to perform_calculation, which is coded as follows:

7  void perform_calculation(
8    vector<int>& vA, vector<int>& vB, vector<int>& vC, int M, int N)
9  {
15   for (int i = 0; i < M; i++)
16   {
17     for (int j = 0; j < N; j++)
18     {
19       vC[i * N + j] = vA[i * N + j] + vB[i * N + j];
20     }
22   }
24 }

In this simplistic matrix addition example, one thing that sticks out is that the multidimensionality of the matrix is lost because of the linearized storage of the matrix in a vector object (which is why you had to pass in the matrix dimensions along with the vector objects). Furthermore, you have to perform funny arithmetic with the indices on line 19. This point would’ve been even more obvious if you wanted to add sub-matrices of these matrices together.

There’s been no C++ AMP code so far. Next, by changing the perform_calculation function, you’ll see how you can start introducing some of the C++ AMP types. In later sections you’ll learn how to fully leverage C++ AMP and speed up your data parallel algorithms.

array_view<T, N>, extent<N> and index<N>

C++ AMP introduces a concurrency::array_view type for wrapping containers of data—you can think of it as a smart pointer. It represents data in a rectangular manner, contiguous in the least-significant dimension. The reason for its existence will become obvious later, and next you’ll see some aspects of its usage. Let’s change the body of the perform_calculation function as follows:

11     array_view<int> a(M*N, vA), b(M*N, vB);
12     array_view<int> c(M*N, vC);
14
15     for (int i = 0; i < M; i++)
16     {
17       for (int j = 0; j < N; j++)
18       {
19         c(i * N + j) = a(i * N + j) + b(i * N + j);
20       }
22     }

This function, which compiles and runs on the CPU, has the same output as before. The only difference is the gratuitous use of the array_view objects that are introduced on lines 11 and 12. Line 19 still has the funny indexing (for now), but now it’s using the array_view objects (a, b, c) instead of the vector objects (vA, vB and vC) and it’s accessing the elements through the array_view function operator (in contrast with the prior usage of the vector subscript operator—more on this later).

You must tell the array_view the element type of the container it wraps through a template argument (int in this example), and you pass the container as the last constructor argument (for example, the vC variable of vector type on line 12). The first constructor argument is the number of elements.

You can also specify the number of elements with a concurrency::extent object, so you can change lines 11 and 12 as follows:

10     extent<1> e(M*N);
11     array_view<int, 1> a(e, vA), b(e, vB);
12     array_view<int, 1> c(e, vC);

The extent<N> object represents a multidimensional space, where the rank is passed as a template argument. The template argument is 1 in this example, but the rank can be any value greater than zero. The extent constructor accepts the size of each dimension that the extent object represents, as shown on line 10. The extent object can then be passed to the array_view object constructor to define its shape, as shown on lines 11 and 12. On those lines I also added a second template argument to the array_view indicating that it represents a one-dimensional space—as in the earlier code example, I could’ve safely omitted that because 1 is the default rank.

Now that you know of these types, you can make further modifications to the function so it can access the data in a more natural two-dimensional manner, which resembles the matrix world more closely:

10     extent<2> e(M, N);
11     array_view<int, 2> a(e, vA), b(e, vB);
12     array_view<int, 2> c(e, vC);
14
15     for (int i = 0; i < e[0]; i++)
16     {
17       for (int j = 0; j < e[1]; j++)
18       {
19         c(i, j) = a(i, j) + b(i, j);
20       }
22     }

The changes to lines 10-12 make the array_view objects two-dimensional, so we’ll require two indices to access an element. Lines 15 and 17 access the bounds of the extent through its subscript operator instead of directly using the variables M and N; once you’ve encapsulated the shape into the extent, you can now use that object throughout your code.

The important change is on line 19, where you no longer need funny arithmetic. The indexing is much more natural, making the entire algorithm itself much more readable and maintainable.

If the array_view was created with a three-dimensional extent, then the function operator would expect three integers in order to access an element, still from the most-significant to the least-significant dimension. As you might expect from a multidimensional API, there’s also a way to index into an array_view through a single object passed to its subscript operator. The object must be of type concurrency::index<N> where N matches the rank of the extent with which the array_view is created. You’ll see later how index objects can be passed to your code, but for now let’s create one manually in order to get a feel for them and see them in action, by modifying the function body as follows:

10     extent<2> e(M, N);
11     array_view<int, 2> a(e, vA), b(e, vB);
12     array_view<int, 2> c(e, vC);
13
14     index<2> idx(0, 0);
15     for (idx[0] = 0; idx[0] < e[0]; idx[0]++)
16     {
17       for (idx[1] = 0; idx[1] < e[1]; idx[1]++)
18       {
19         c[idx] = a[idx] + b[idx];
           // Alternatively, line 19 could be written as:
           // c(idx[0], idx[1]) = a(idx[0], idx[1]) + b(idx[0], idx[1]);
20       }
22     }

As you can see from lines 14, 15, 17 and 19, the concurrency::index<N> type has a very similar interface to the extent type, except index represents an N-dimensional point instead of an N-dimensional space. Both the extent and index types support a number of arithmetic operations through operator overloading—for example, the increment operation shown in the preceding example.

Previously the loop variables (i and j) were used to index into the array_view, and now they can be replaced with a single index object on line 19. This demonstrates how, by using the array_view subscript operator, you can index into it with a single variable (in this example, idx of type index<2>).

At this point you have a basic understanding of three new types introduced with C++ AMP: array_view<T,N>, extent<N> and index<N>. They have more to offer, as shown in the class diagrams in Figure 2.

Figure 2 array_view, extent and index Classes

The real power and motivation for using this multidimensional API is to execute your algorithms on a data parallel accelerator, such as the GPU. To do that you need an entry point in the API for executing your code on the accelerator, plus a way to check at compile time that you’re using a subset of the C++ language that can be executed efficiently on such an accelerator.

parallel_for_each and restrict(amp)

The API that instructs the C++ AMP runtime to take your function and execute it on the accelerator is a new overload to concurrency::parallel_for_each. It accepts two arguments: an extent object and a lambda.

The extent<N> object, which you’re already familiar with, is used to determine how many times the lambda will be called on the accelerator, and you should assume that each time it will be a separate thread calling your code, potentially concurrently, without any order guarantees. For example, an extent<1>(5) will result in five calls to the lambda you pass to the parallel_for_each, whereas an extent<2>(3,4) will result in 12 calls to the lambda. In real algorithms you’ll typically be scheduling thousands of calls to your lambda.

The lambda must accept an index<N> object, which you’re already familiar with. The index object must have the same rank as the extent object passed to parallel_for_each. The index value will be different each time your lambda is called, of course—that’s how you distinguish between two different invocations of your lambda. You could think of the index value as the thread ID.

Here’s a code representation of what I described so far with parallel_for_each:

89     extent<2> e(3, 2);
90     parallel_for_each(e,
91       [=](index<2> idx)
92       {
93         // Code that executes on the accelerator.
94         // It gets invoked in parallel by multiple threads
95         // once for each index "contained" in extent e
96         // and the index is passed in via idx.
97         // The following always hold true
98         //      e.rank == idx.rank
99         //      e.contains(idx) == true
100        //      the function gets called e.size() times
101        // For this two-dimensional case (.rank == 2)
102        //      e.size() == 3*2 = 6 threads calling this lambda
103        // The 6 values of idx passed to the lambda are:
104        //      { 0,0 } { 0,1 } { 1,0 } { 1,1 } { 2,0 } { 2,1 }
105      }
106    );
107    // Code that executes on the host CPU (like line 91 and earlier)

This simple code, without an important addition to line 91, won’t compile:

error C3577: Concurrency::details::_Parallel_for_each argument #3 is illegal: missing public member: 'void operator()(Concurrency::index<_Rank>) restrict(amp)'

As the code is written, nothing would stop you from using, in the lambda body (lines 92-105), anything that the full C++ language (as supported by the Visual C++ compiler) allows. However, certain aspects of the C++ language can’t be used on current GPU architectures, so you must indicate which parts of your code should comply with these restrictions (so you can find out at compile time whether you’re breaking any rules). The indication must be made on the lambda and on any other function signatures that you call from the lambda. So you must modify line 91 as follows:

91         [=](index<2> idx) restrict(amp)

This is a key new language feature of the C++ AMP specification that’s added to the Visual C++ compiler. Functions (including lambdas) can be annotated with restrict(cpu), which is the implicit default, or restrict(amp) as shown in the previous code sample, or with a combination—for example, restrict(cpu, amp). No other options exist. The annotation becomes part of the function signature so it participates in overloading, which was a key motivation when designing it. When a function is annotated with restrict(amp), it’s checked against a set of restrictions, and if one is violated you receive a compiler error. The full set of restrictions is documented in this blog post: bit.ly/vowVlV.

One of the restrict(amp) restrictions for lambdas is that they can’t capture variables by reference (see caveat near the end of this article), nor can they capture pointers. With that restriction in mind, when you look at the last code listing for parallel_for_each, you would be right in wondering: “If I can’t capture by reference and can’t capture pointers, how will I ever observe results—that is, desirable side effects—from the lambda? Any changes I make to variables that I capture by value won’t be available to the outer code once the lambda completes.”

The answer to that question is a type that you already know: array_view. The array_view object is allowed to be captured by value in the lambda. It’s your mechanism for passing data in and out. Simply use array_view objects to wrap the real containers, then capture the array_view objects in the lambda for accessing and populating, and then access the appropriate array_view objects after the call to parallel_for_each.

Bringing It All Together

With your new knowledge, you can now revisit the earlier serial CPU matrix addition (the one that used array_view, extent and index) and replace lines 15-22 as follows:

15     parallel_for_each(e, [=](index<2> idx) restrict(amp)
16     {
19       c[idx] = a[idx] + b[idx];
22     });

You see that line 19 remained the same, and the double nested loop with manual index object creation within the extent bounds is replaced with a call to the parallel_for_each function.

When working with discrete accelerators that have their own memory, the capture of the array_view objects in the lambda passed to parallel_for_each results in a copy of the underlying data to the accelerator’s global memory. Similarly, after the parallel_for_each call, when you access the data through the array_view object (in this example, through c) the data gets copied back to the host memory from the accelerator.

You should know that if you want to access the results from array_view c through the original container vC (and not through the array_view), then you should call the synchronize method of the array_view object. The code would work as it stands because the array_view destructor calls synchronize on your behalf, but any exceptions would get lost that way, so I advise you to call synchronize explicitly instead. So add a statement anywhere after the parallel_for_each call, like this:

23          c.synchronize();

The reverse (ensuring that the array_view has the latest data from its original container if it has changed) is achieved via the refresh method.

More importantly, copying data across (typically) a PCIe bus can be very expensive, so you only want to copy data in the direction necessary. In the earlier listing you can modify lines 11-13 to indicate that the underlying data of array_view objects a and b must be copied to the accelerator (but won’t be copied back), and also that the underlying data of array_view c need not be copied to the accelerator. The necessary changes are the const element type on line 11 and the new call on line 13 in the following snippet:

11          array_view<const int, 2> a(e, vA), b(e, vB);
12          array_view<int, 2> c(e, vC);
13          c.discard_data();

However, even with these modifications in place, the matrix addition algorithm isn’t sufficiently compute-intensive to offset the overhead of copying the data, so it’s in fact not a good candidate for parallelization with C++ AMP. I’ve been using it only to teach you the basics!

Having said that, by using this simple example throughout the article, you now have the skills to parallelize other algorithms that are compute-intensive enough to yield benefits. One such algorithm is matrix multiplication. Without any commentary from me, please ensure that you understand this simple serial implementation of the matrix multiplication algorithm:

void MatMul(vector<int>& vC, const vector<int>& vA,
  const vector<int>& vB, int M, int N, int W)
{
  for (int row = 0; row < M; row++)
  {
    for (int col = 0; col < N; col++)
    {
      int sum = 0;
      for(int i = 0; i < W; i++)
        sum += vA[row * W + i] * vB[i * N + col];
      vC[row * N + col] = sum;
    }
  }
}

… and the corresponding C++ AMP implementation:

array_view<const int, 2> a(M, W, vA), b(W, N, vB);
array_view<int, 2> c(M, N, vC);
c.discard_data();
parallel_for_each(c.extent, [=](index<2> idx) restrict(amp)
{
  int row = idx[0]; int col = idx[1];
  int sum = 0;
  for(int i = 0; i < b.extent[0]; i++)
    sum += a(row, i) * b(i, col);
  c[idx] = sum;
});
c.synchronize();

On my laptop, the C++ AMP matrix multiplication yields a performance improvement of more than 40 times compared with the serial CPU code for M=N=W=1024.

Now that you understand all the basics, you might be wondering how you can choose which accelerator to execute your algorithm on, once you’ve implemented it using C++ AMP. Let’s examine that next.

accelerator and accelerator_view

Part of the concurrency namespace is the new accelerator type. It represents a device on the system that the C++ AMP runtime can use, which for the first release is hardware with a DirectX 11 driver installed (or DirectX emulators).

When the C++ AMP runtime starts, it enumerates all accelerators and, based on internal heuristics, picks one as the default. That’s why you didn’t have to deal with accelerators directly in all the preceding code—there was a default picked for you. If you want to enumerate the accelerators and even choose the default yourself, it’s very easy to do, as the self-explanatory code in Figure 3 shows.

Figure 3 Picking an Accelerator

26 accelerator pick_accelerator()
27 {
28   // Get all accelerators known to the C++ AMP runtime
29   vector<accelerator> accs = accelerator::get_all();
30
31   // Empty ctor returns the one picked by the runtime by default
32   accelerator chosen_one;
33
34   // Choose one; one that isn't emulated, for example
35   auto result =
36     std::find_if(accs.begin(), accs.end(), [] (accelerator acc)
37   {
38     return !acc.is_emulated; //.supports_double_precision
39   });
40   if (result != accs.end())
41     chosen_one = *(result); // else not shown
42
43   // Output its description (tip: explore the other properties)
44   std::wcout << chosen_one.description << std::endl;
45
46   // Set it as default ... can only call this once per process
47   accelerator::set_default(chosen_one.device_path);
48
49   // ... or just return it
50   return chosen_one;
51 }

On line 38 you can see querying of one of the many accelerator properties, and others are shown in Figure 4.

Figure 4 accelerator and accelerator_view Classes

If you want to have different parallel_for_each calls that use different accelerators, or for whatever other reason you want to be more explicit than setting the default accelerator globally for a process, then you need to pass an accelerator_view object to the parallel_for_each. This is possible because parallel_for_each has an overload that accepts an accelerator_view as the first parameter. Obtaining an accelerator_view object is as easy as calling default_view on an accelerator object; for example:

accelerator_view acc_vw = pick_accelerator().default_view;

Beyond DirectX 11 hardware, there are three special accelerators that C++ AMP makes available:

  • direct3d_ref: useful for correctness debugging, but not useful in production because it’s far slower than any real hardware.
  • direct3d_warp: a fallback solution for executing your C++ AMP code on the CPU using multi-core and Streaming SIMD Extensions today.
  • cpu_accelerator: not capable of executing C++ AMP code in this release. It’s only useful for setting up staging arrays (an advanced optimization technique), which is beyond the scope of this article but is described in this blog post: bit.ly/vRksnn.

Tiling and Further Reading on Your Own

The most important topic not covered in this article is tiling.

From a scenario perspective, tiling takes the orders of magnitude performance gains that you see with the coding techniques explored so far and (potentially) makes the gains even larger. From an API perspective, tiling consists of tiled_index and tiled_extent types, as well as a tile_barrier type and a tile_static storage class. There’s also an overload to parallel_for_each that accepts a tiled_extent object, and whose lambda accepts a tiled_index object. Within that lambda you’re allowed to use tile_barrier objects and tile_static variables. I cover tiling in my second C++ AMP article on p. 40.

There are other topics you can explore on your own with the help of blog posts and the online MSDN documentation:

  • <amp_math.h> is a math library with two namespaces, one for high-precision math functions and the other for fast but less-accurate math functions. Opt in based on your hardware capabilities and your scenario requirements.
  • <amp_graphics.h> and <amp_short_vectors.h> plus some DirectX interop functions are available for working with graphics programming.
  • concurrency::array is a container data type that’s bound to an accelerator, with a nearly identical interface to array_view. This type is one of two types (the other being texture in the graphics namespace) that must be captured by reference in the lambda passed to parallel_for_each. This is the caveat I referred to earlier in the article.
  • Support for DirectX intrinsics such as atomics for cross-thread synchronization.
  • GPU debugging and profiling in Visual Studio 11.

Future-Proofing Your Investments

In this article I introduced you to a modern C++ data parallel API that allows you to express your algorithms in a manner that enables your app to utilize GPUs for performance gains. C++ AMP is designed so it can future-proof your investments for hardware that we’ve yet to see.

You learned how a handful of types (array_view, extent and index) help you work with multidimensional data, combined with a single global function (parallel_for_each) that allows you to execute your code—starting from a restrict(amp) lambda—on an accelerator (that you can specify through accelerator and accelerator_view objects).

Beyond the Microsoft Visual C++ implementation, C++ AMP is provided to the community as an open specification that anyone can implement, on any platform.


Daniel Moth is a principal program manager in the Microsoft Developer Division.

Thanks to the following technical experts for reviewing this article: Steve Deitz, Yossi Levanoni, Robin Reynolds-Haertle, Stephen Toub and Weirong Zhu

