- Add version of parallel_for with an ExecutionPolicy
- Consider abstraction for shared memory
- Make nbody example work without CUDA
- Consider launch_bounds support...
- Combine tests into a small number of binaries
- Add streams to ExecutionPolicy
- Add tests for cudaLaunch with and without nvcc
- Add tests for other APIs
- Provide portable utility functions for cudaDeviceReset, etc.
- Fix/rename index accessors
- Move accessors to device_api.h