Cranberry Lair

Walking the programming parameter space

Memory Alignment And Why You (Might) Care — December 2, 2020

Something interesting came up about memory alignment yesterday and I figured it could be neat to write up a quick synopsis.

What is memory alignment? 

Memory alignment is the requirement that the address of your object must be a multiple of its alignment. 

For example, a 4 byte integer has a memory alignment of 4 bytes. This means that you can find the integer at addresses 0, 4, 8, 12, 16, 20, 24, etc.
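To make the "multiple of its alignment" rule concrete, here is a minimal sketch of an alignment check. The helper name is_aligned is made up for this example; alignof is the standard C++ operator that reports a type's alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when `ptr` is a multiple of `alignment` bytes.
inline bool is_aligned(const void* ptr, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
}

// The compiler places ordinary variables at suitably aligned addresses,
// so this check should hold for any local variable of any type.
inline void alignment_example() {
    std::int32_t value = 0;
    assert(is_aligned(&value, alignof(std::int32_t)));
}
```

On mainstream platforms alignof(std::int32_t) is 4, which matches the address list above.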

Unaligned memory accesses used to be a pretty significant problem since CPUs could only load memory on particular alignment boundaries. (You could only load a 4 byte word at addresses that were a multiple of 4 bytes. I assume that compilers could handle the case where memory was not aligned at the price of some performance, but this needs a reference.)

However, x86-64’s load instruction (mov) supports unaligned accesses [1]. This is not necessarily the case for other architectures, so be careful! 

Why do we care? 

Usually we don’t!  

Because: 

  • Most memory will be aligned to at least alignof(max_align_t) if you are using a standard allocator (see max_align_t [2], which defines the minimum alignment of allocations made by the standard allocators). Depending on the platform, this is typically 8 or 16 bytes.
  • Most memory can be loaded from an unaligned address without any problems anyway (sort of).
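If you want to see what your own platform guarantees, a small sketch like the following can help. The function name malloc_meets_max_align is invented for this example; std::max_align_t and std::malloc are standard.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical check: does a fresh malloc block satisfy alignof(std::max_align_t)?
inline bool malloc_meets_max_align() {
    void* block = std::malloc(64);
    if (block == nullptr) {
        return false; // allocation failed, nothing to inspect
    }
    const bool ok =
        reinterpret_cast<std::uintptr_t>(block) % alignof(std::max_align_t) == 0;
    std::free(block);
    return ok;
}
```

Printing alignof(std::max_align_t) next to this check shows the value for your toolchain. It differs between compilers and targets, so treat any specific number as platform dependent.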

However there is a case that can cause problems. 

Welcome to SIMD 

SIMD stands for Single Instruction, Multiple Data. I’ll be glossing over it for now, feel free to reach out if you’d like to talk about it some more. 

The important bit about SIMD in x86-64 is that its primary type, __m128, is 16 bytes large and has an alignment requirement of 16 bytes.

This type is not loaded using the standard x86 load instruction. It uses a particular instruction that _requires_ the type to be aligned on a 16 byte boundary (movaps). If the address being loaded isn’t a multiple of 16 bytes, the CPU raises a fault and your application will crash.

This means that if your standard allocator allocates on an 8 byte boundary, you are likely to crash if you allocate __m128 without telling it to align on a 16 byte boundary. 

There is an instruction that allows you to load this type without alignment (movups) and the compiler _can_ emit this instruction. 

However, this is easy to break. 

Consider this small program. 

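#include <cstdint>
#include <xmmintrin.h> // __m128, _mm_add_ps

uint8_t bytes[40];
__m128& vector1 = *(__m128*)&bytes[8];  // offset 8: not guaranteed to be 16 byte aligned
__m128& vector2 = *(__m128*)&bytes[24]; // offset 24: not guaranteed either
__m128 vector = _mm_add_ps(vector1, vector2);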

MSVC will generate this (edited to be more readable):

movups xmm0,xmmword[vector1]
addps  xmm0,xmmword[vector2]

Notice that the first instruction is an unaligned load (yay!) HOWEVER! Notice that we’re also referencing vector2 in our addps instruction.

Das bad. 

addps (in its legacy SSE encoding) requires its memory source operand to be aligned to a 16 byte boundary. And our memory is very likely not. (24 is not a multiple of 16!)

This will crash. 

The worst part? Inconsistently. It all depends on the original address of “bytes”. If it was at address 8, then vector1 will be at address 8+8=16 and vector2 will be at address 8+24=32, no crash. However, if it was at address 0, well…….
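One way to sidestep this is to request unaligned loads explicitly, so the compiler cannot fold a possibly unaligned address into addps. The following is a minimal sketch, assuming an x86/x86-64 target with SSE. The intrinsics _mm_loadu_ps, _mm_storeu_ps and _mm_add_ps come from xmmintrin.h, while the function names add_unaligned and add_packed_example are made up for illustration.

```cpp
#include <cstdint>
#include <cstring>
#include <xmmintrin.h> // SSE intrinsics, x86/x86-64 only

// Adds two groups of four floats stored at arbitrary (possibly unaligned) byte addresses.
inline __m128 add_unaligned(const std::uint8_t* a, const std::uint8_t* b) {
    const __m128 va = _mm_loadu_ps(reinterpret_cast<const float*>(a)); // unaligned load
    const __m128 vb = _mm_loadu_ps(reinterpret_cast<const float*>(b)); // unaligned load
    return _mm_add_ps(va, vb); // both operands were loaded explicitly
}

// Mirrors the buffer layout from the text: operands live at byte offsets 8 and 24.
inline void add_packed_example(float out[4]) {
    std::uint8_t bytes[40] = {};
    const float lhs[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    const float rhs[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    std::memcpy(&bytes[8], lhs, sizeof lhs);
    std::memcpy(&bytes[24], rhs, sizeof rhs);
    _mm_storeu_ps(out, add_unaligned(&bytes[8], &bytes[24])); // unaligned store
}
```

This works regardless of where “bytes” lands in memory, since every load and store is an explicitly unaligned one.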

More things! 

How can we break this alignment? 

Packing into a buffer 

Let’s say I allocate a large block of memory, maybe 1000TB… or 32 bytes for simplicity. 

[................................] 

Maybe what I want to do is add a series of objects into that buffer to make them nice and packed.

First I add a byte into it. 

[1...............................] 

Then maybe I want to add a 4 byte integer into it 

[12222...........................] 

Then maybe I want to add our __m128 object 

[122223333333333333333...........] 

Uh oh! Now our __m128 object is at address 5, which definitely isn’t aligned to a 16 byte boundary. 
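A common fix is to round the write position up to the object's alignment before placing it. This is a sketch under stated assumptions: align_up and PackedCursor are hypothetical names, alignments are powers of two, and the buffer itself starts at a 16 byte aligned address.

```cpp
#include <cstddef>

// Rounds `offset` up to the next multiple of `alignment` (must be a power of two).
constexpr std::size_t align_up(std::size_t offset, std::size_t alignment) {
    return (offset + alignment - 1) & ~(alignment - 1);
}

// Hypothetical bump-style packer: returns the aligned offset for the next
// object and advances the cursor past it.
struct PackedCursor {
    std::size_t cursor = 0;

    std::size_t place(std::size_t size, std::size_t alignment) {
        const std::size_t offset = align_up(cursor, alignment);
        cursor = offset + size;
        return offset;
    }
};
```

With this scheme the byte lands at offset 0, the integer at offset 4 and the __m128 at offset 16, at the cost of a few padding bytes:

[1...2222........3333333333333333]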

Custom allocation 

Another possibility is to allocate our __m128 object using a standard allocator.

Say I do this:

__m128* myVector = new __m128{}; 

This can easily cause a crash later on, since before C++17 standard allocators are only guaranteed to align to max_align_t, which on some platforms is only 8 bytes. (C++17 added alignment-aware overloads of operator new that are used for over-aligned types [3].)
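One way to avoid depending on the allocator's default alignment is to ask for the alignment explicitly. A minimal sketch, assuming a C++17 compiler: Vec4 is a made-up stand-in for __m128 (so the example does not need x86 headers), and allocate_vec4 and free_vec4 are hypothetical helper names. std::align_val_t and the aligned operator new/delete overloads are standard C++17.

```cpp
#include <cstdint>
#include <new>

// Stand-in for __m128: 16 bytes large with a 16 byte alignment requirement.
struct alignas(16) Vec4 {
    float lanes[4];
};

// Explicitly request storage aligned to alignof(Vec4), then construct in place.
inline Vec4* allocate_vec4() {
    void* raw = ::operator new(sizeof(Vec4), std::align_val_t{alignof(Vec4)});
    return new (raw) Vec4{};
}

// Memory from the aligned operator new must go back through the aligned operator delete.
inline void free_vec4(Vec4* v) {
    v->~Vec4();
    ::operator delete(v, std::align_val_t{alignof(Vec4)});
}
```

On toolchains without C++17 support, platform-specific aligned allocation functions fill the same role.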

What are the benefits of alignment? 

There are a few that I can think of: 

CPU Architectural simplicity. 

Disclaimer: This is pure speculation. If I can guarantee that all my memory loads are aligned at particular boundaries, I can likely make my CPU much simpler which takes up less space, less power, less heat. More space is more good. 

Cross cache loads/cross page loads. 

If I have memory aligned at powers of 2 such as 2, 4, 8, 16, etc., then I can guarantee that a small data type (one no larger than a cache line and aligned to its own size) will never cross a cache line boundary.

Notice that with an 8 byte cache line (real ones are typically 64 bytes, 8 just keeps the diagram small), I can fit one 8 byte object. However, if the object is unaligned, the object could be in 2 cache lines!

[........][........] 

[11111111][........] // Good 

[....1111][1111....] // Bad 

Our CPUs load memory one whole cache line at a time. Requiring 2 cache lines to be loaded for a single object can introduce some undesired latency in a hot loop. [6]
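The diagram can be turned into a small check. This is a sketch with a made-up function name (crosses_cache_line); the line size is passed in as a parameter because it depends on the CPU, and size is assumed to be at least 1.

```cpp
#include <cstddef>
#include <cstdint>

// True when an object of `size` bytes starting at `address` touches more
// than one cache line of `line_size` bytes.
constexpr bool crosses_cache_line(std::uintptr_t address,
                                  std::size_t size,
                                  std::size_t line_size) {
    const std::uintptr_t first_line = address / line_size;
    const std::uintptr_t last_line = (address + size - 1) / line_size;
    return first_line != last_line;
}
```

With the 8 byte lines from the diagram, an 8 byte object at address 0 stays within one line, while the same object at address 4 straddles two.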

This gets even worse with memory pages, however, due to a bout of extreme laziness I will not be going into details about it. 

Conclusion 

And that’s it! Thanks for coming by, hope this was informative! Got any questions? Please reach out! 

Resources: 

[1] https://blog.quarkslab.com/unaligned-accesses-in-cc-what-why-and-solutions-to-do-it-properly.html 

[2] https://en.cppreference.com/w/cpp/types/max_align_t 

[3] https://stackoverflow.com/questions/56171482/why-does-the-c-standard-allow-stdmax-align-t-and-stdcpp-default-new-alignm 

[4] https://c9x.me/x86/html/file_module_x86_id_180.html 

[5] https://blog.ngzhian.com/sse-avx-memory-alignment.html 

[6] https://bits.houmus.org/2020-01-28/this-goes-to-eleven-pt1 

Managing C++ Compile Times — September 2, 2020

Preamble

Jumping into a large project that compiles in 30 minutes, you might be faced with 2 thoughts. Either “Wow! This project compiles in only 30 minutes!” Or “What?!? This project takes 30 minutes to compile?” Either way, you likely don’t want to be iterating on your changes at a rate of 30 minutes per compilation. So here’s a few things you can do to improve your iteration times.

Hack Your Prototypes – Short Iterations

This is likely where you spend 90% of your time when solving a problem. In this phase you can be searching for a bug, testing different solutions or even just investigating the behaviour of the program. In any of these scenarios, you likely want to iterate quickly without much down time. Here’s a few things you can do:

Don’t modify your headers

Modifying your headers is a surefire way to trigger multiple files to recompile (or, if you’re unlucky, the whole project). As a result, do anything in your power to avoid having to modify a header while you’re in this phase.

Work in a single cpp file

Don’t add static variables, private functions, or type declarations to your header. Add utility functions anywhere in your cpp file but not in the header.

Make ugly globals. (It’s fine, I won’t tell)

If you need some sort of static state for a class, consider making it a global in the cpp file instead. This holds even outside of the prototyping phase. Make sure to mark it static; we don’t need pesky external linkage for these variables.
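As an illustration, here is a minimal sketch of class state kept as a file-local global instead of a static member. Widget, liveCount and s_liveWidgets are hypothetical names, and the two "files" are shown back to back so the example compiles on its own.

```cpp
// ---- Widget.h (hypothetical) ----
// Note: no `static int s_liveWidgets;` member in the class declaration.
class Widget {
public:
    Widget();
    ~Widget();
    static int liveCount();
};

// ---- Widget.cpp (hypothetical) ----
// File-local state: `static` gives it internal linkage, so renaming it or
// changing its type never forces files that include Widget.h to recompile.
static int s_liveWidgets = 0;

Widget::Widget() { ++s_liveWidgets; }
Widget::~Widget() { --s_liveWidgets; }
int Widget::liveCount() { return s_liveWidgets; }
```

The header only exposes what callers need, and the counter remains an implementation detail of the cpp file.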

Manually add a function declaration to your cpp

If you need to have a function available to multiple cpp files, consider simply adding the declaration to the necessary cpp files. This wouldn’t fly for a final submission but we’re prototyping here.

Compile a single file

Out of habit, you might build the whole solution, or the specific project when working to see if you’ve introduced any compile errors. Instead, consider if you can afford to simply compile the single file you’re working in. You can typically afford this when making modifications local to a cpp file. This will allow you to avoid having to deal with the linker until you absolutely need to. It might even be worthwhile to make the shortcut convenient.

Use a smaller project

Is there a smaller project you can test in instead of the primary projects? If so, consider using that instead. Your iteration time will improve simply by the fact that your project has a lot less to handle.

In the same vein, consider if you can simply compile the project your files are in without compiling the larger projects that are likely to include a large number of files.

Disable optimizations locally

If you’re running with optimizations and optimizations are removing useful variables you’d like to investigate, consider disabling optimizations in your file instead of changing the project to a debug configuration. This will only require recompiling the single file instead of the whole project.
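One way to do this is with compiler pragmas placed in the file you are investigating. The sketch below assumes that is the mechanism meant here. The pragmas shown are the ones offered by Clang, MSVC and GCC, and sum_values is a made-up function standing in for the code under investigation.

```cpp
// ---- Top of the one .cpp file under investigation (remove before submitting) ----
#if defined(__clang__)
    #pragma clang optimize off
#elif defined(_MSC_VER)
    #pragma optimize("", off)
#elif defined(__GNUC__)
    #pragma GCC push_options
    #pragma GCC optimize("O0")
#endif

// Hypothetical function being debugged: with optimizations off, `total`
// and `i` should stay visible in the debugger.
int sum_values(const int* values, int count) {
    int total = 0;
    for (int i = 0; i < count; ++i) {
        total += values[i];
    }
    return total;
}

// ---- Bottom of the file: restore the project's normal settings ----
#if defined(__clang__)
    #pragma clang optimize on
#elif defined(_MSC_VER)
    #pragma optimize("", on)
#elif defined(__GNUC__)
    #pragma GCC pop_options
#endif
```

Only this translation unit needs to be recompiled, and the rest of the project keeps its optimized build.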

Be Nice To The Compiler And Yourself – Long Term Changes

At this point, we’re done iterating, we’ve solidified our solution and now we need to clean it up for submission. At this stage, you’re probably not compiling as frequently and a lot of your time is writing all the necessary bits to link it all together. Decisions made here will impact not only you, but all users of the project. We want our compile times to be short. Long compile times impact everyone, and that’s no fun. Here are some things to consider at this stage.

Does this change really need a new file?

If you’re considering creating a new file, consider if it’s really worth it. A new file implies a new header, a new header implies new includes, new includes means more dependencies, more dependencies means more compilation.

This is a balancing act. Adding a new cpp/h pair might actually allow you to reduce compile times by only including the relevant bits without needing to modify a highly used header. But it might also spawn into a monstrous chain of includes. It’s a matter of tradeoffs, consider them carefully.

New class != new file

Creating a new class does not imply that a new file should be created. Does your class really need to have a header? Is it used by any other class or is it more of a utility for a single file? If your class is only in use by a single cpp file, consider putting it in the cpp file. If it needs to become public, we can implement that change later. (Your coding standard might disagree here in some contexts, if it does, follow the standard)

Don’t add unnecessary includes to your headers

Headers are not your friend. Do everything you can to reduce the number of includes that you have to add to a header. The more headers required, the bigger the compilation chain grows. Consider if you can forward declare the required types, or if you’re simply including the header for convenience. A simple rule is to only include the header if you absolutely need it to compile. Avoid “general purpose” headers that include a large chunk of what you need. They’ll eat you alive (and your compile times).
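A forward declaration is often enough when a header only stores or passes a pointer or reference. Here is a minimal sketch with hypothetical Renderer and Texture classes, with the two "files" shown back to back so it compiles on its own.

```cpp
// ---- Renderer.h (hypothetical) ----
class Texture; // forward declaration instead of #include "Texture.h"

class Renderer {
public:
    explicit Renderer(const Texture* texture) : m_texture(texture) {}
    int textureWidth() const; // defined in Renderer.cpp, where Texture is complete

private:
    const Texture* m_texture; // a pointer only needs the declaration
};

// ---- Renderer.cpp (hypothetical) ----
// #include "Renderer.h"
// #include "Texture.h"  // the full definition is only needed here
class Texture {          // stands in for the contents of Texture.h
public:
    int width = 256;
};

int Renderer::textureWidth() const { return m_texture->width; }
```

Files that include Renderer.h no longer pull in Texture.h, so changes to Texture only recompile the files that actually use its definition.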

Do you really need this function to be a private function?

When adding a private function to an object, consider if it has better use as a utility function embedded in the cpp file. Adding private methods to a header means that other files have to recompile if you modify functionality that they had no access to in the first place. It’s bad encapsulation. This is especially true for private static functions and private static variables.
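For example, a helper that would otherwise be a private method can live in an unnamed namespace in the cpp file. This is a sketch with hypothetical names (Parser, countWords, isSeparator), again with header and source shown together.

```cpp
// ---- Parser.h (hypothetical) ----
#include <string>

class Parser {
public:
    int countWords(const std::string& text) const;
    // No private isSeparator() declared here.
};

// ---- Parser.cpp (hypothetical) ----
namespace {
// File-local helper: changing it never touches Parser.h.
bool isSeparator(char c) {
    return c == ' ' || c == '\t' || c == '\n';
}
} // namespace

int Parser::countWords(const std::string& text) const {
    int words = 0;
    bool inWord = false;
    for (char c : text) {
        if (isSeparator(c)) {
            inWord = false;
        } else if (!inWord) {
            inWord = true;
            ++words;
        }
    }
    return words;
}
```

If the helper needs access to private members, passing the required values in as arguments usually keeps it out of the header.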

Can you remove some includes from that header?

Are you working in a header and notice some unnecessary includes? See if you can remove them! Every little bit helps.

Standard headers are a good target to remove from inclusion. They tend to be heavily templated, so removing them can reduce compile times significantly.

If you can forward declare a type instead of including this file in your header, consider doing that.

Use the appropriate include type

Using a relative include such as #include "../../File.h" will reduce the load on the preprocessor, provided the path to the file is correct. Use this format when you reference a file from the same project. Use #include <Project/File.h> when the included file is part of a different project than the current one. Avoid #include "Project/File.h" or #include "../../File.h" if the file is not actually at that location relative to the including file. In that case the preprocessor has to search relative to the location of all the included files, which can be quite a bit of work.

Does this need to be a template?

Be very wary of making a template that is included in your header. Templates suffer from a number of problems. Is your template needed by anyone else but your cpp? If not, consider embedding it. Does your type really need to be a template? Consider if you have an actual use case for this template. If you absolutely need this template, consider de-templatifying the functionality that doesn’t require the generic type and putting that functionality either in a base class or a set of utility functions.

What’s problematic about templates? Templates generally require you to write their implementations in your header so that your compiler can instantiate them correctly. This means that any change to the internal functionality of your template will require a recompilation of all the files that include it. Templates can also be tremendously complex, causing a significant amount of work for your compiler. (Template metaprogramming anyone?) If you need your functionality to be expressed at compile time, consider constexpr: compilers are generally more efficient at evaluating constexpr functions than at processing template recursion. Also consider that templates require your compiler to create a new instance for every cpp that uses them. (If one cpp instantiates a vector type and another cpp instantiates the same one, the compiler has to do the work twice!) You might consider extern templates in this situation, but I would strongly recommend dropping the template altogether if you can.
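For reference, this is roughly what an extern template looks like. It is a sketch with a hypothetical Stack class: the header promises that Stack<int> is instantiated elsewhere, and exactly one cpp file provides that instantiation. Both parts are shown in one listing so it compiles on its own.

```cpp
// ---- Stack.h (hypothetical) ----
#include <vector>

template <typename T>
class Stack {
public:
    void push(const T& value) { m_items.push_back(value); }

    T pop() {
        T value = m_items.back();
        m_items.pop_back();
        return value;
    }

    bool empty() const { return m_items.empty(); }

private:
    std::vector<T> m_items;
};

// Explicit instantiation declaration: files including this header should not
// implicitly instantiate Stack<int> themselves.
extern template class Stack<int>;

// ---- Stack.cpp (hypothetical) ----
// Explicit instantiation definition: the one place where the work is done.
template class Stack<int>;
```

Other specializations, such as Stack<float>, are still instantiated wherever they are used unless they get the same treatment.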

Tools

There are some tools that can help you improve your compile times in a variety of ways. Here’s a few of them.

Ccache

Ccache is a compilation cache tool that maintains a cache of compilations of your cpp files. This works across full rebuilds, different project versions (maybe you have 2 similar branches) and even across users.

https://ccache.dev/

Compiler Time Tracking Tools

Compilers are starting to incorporate tools to track where time is being spent in compilation. Clang/GCC have “-ftime-report” and MSVC has build insights https://devblogs.microsoft.com/cppblog/introducing-c-build-insights/

An excellent blog post from Aras Pranckevičius details these more thoroughly https://aras-p.info/blog/2019/01/12/Investigating-compile-times-and-Clang-ftime-report/

Show Include List

MSVC has an option to show what’s included in a file. Consider using this to know what’s being included and what doesn’t need to be included.

https://docs.microsoft.com/en-us/cpp/build/reference/showincludes-list-include-files?view=vs-2019

Clang/GCC have the -M and -H options. https://gcc.gnu.org/onlinedocs/gcc-3.3.6/gcc/Preprocessor-Options.html#Preprocessor-Options

Include What You Use

There is also an interesting tool that determines what includes can be removed and what could be forward declared.

https://include-what-you-use.org/

Links

https://stackoverflow.com/questions/8130602/using-extern-template-c11

https://en.cppreference.com/w/cpp/preprocessor/include

http://virtuallyrandom.com/c-compilation-fixing-it/

https://aras-p.info/blog/2019/01/12/Investigating-compile-times-and-Clang-ftime-report/

https://docs.microsoft.com/en-us/cpp/build/reference/showincludes-list-include-files?view=vs-2019

https://stackoverflow.com/questions/5834778/how-to-tell-where-a-header-file-is-included-from

https://include-what-you-use.org/

https://www.goodreads.com/book/show/1370617.Large_Scale_C_Software_Design