If we backport to an older GLib release, we won't have the newer atomic
helpers available. It's really not too much burden to do that manually if
it means we can run on older systems.
If we're building with a GCC older than 4.9, then we won't have
stdatomic.h available. We can just use a full barrier instead, via
__sync_synchronize(), to get the same effect, albeit slower.
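
A minimal sketch of that fallback, assuming a compile-time check on the
GCC version is acceptable. The EXAMPLE_FULL_BARRIER macro, ExampleSlot
type, and both functions are hypothetical names for illustration and are
not taken from the Sysprof sources. The plain (non-atomic) accesses
mirror what the __sync fallback path has to work with.

```c
#include <assert.h>
#include <stddef.h>

/* GCC gained <stdatomic.h> in 4.9; older GCC only has the __sync builtins. */
#if defined(__GNUC__) && !defined(__clang__) && \
    (__GNUC__ < 4 || (__GNUC__ == 4 && __GNUC_MINOR__ < 9))
# define EXAMPLE_FULL_BARRIER() __sync_synchronize()
#else
# include <stdatomic.h>
# define EXAMPLE_FULL_BARRIER() atomic_thread_fence(memory_order_seq_cst)
#endif

/* Hypothetical shared slot: the payload is published before the flag. */
typedef struct {
  int          payload;
  volatile int ready;
} ExampleSlot;

void
example_publish (ExampleSlot *slot, int value)
{
  assert (slot != NULL);
  slot->payload = value;
  EXAMPLE_FULL_BARRIER ();   /* payload must be visible before ready */
  slot->ready = 1;
}

int
example_consume (ExampleSlot *slot, int *value)
{
  assert (slot != NULL && value != NULL);
  if (!slot->ready)
    return 0;
  EXAMPLE_FULL_BARRIER ();   /* do not read payload ahead of ready */
  *value = slot->payload;
  return 1;
}
```

The full barrier is stronger than the acquire/release pair the code
would ideally use, which is where the extra cost comes from.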
We want to be backtracing directly into the capture buffer, but also need
to skip a small number of frames.
If we call the backtrace before filling in information, we can capture to
the position *before* ev->addrs and then overwrite that data right after.
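
The trick above can be sketched roughly as follows. This assumes glibc's
backtrace() as a stand-in for whatever unwinder is really used, and a
hypothetical ExampleEvent layout whose header is exactly as large as the
frames to be skipped. None of these names come from the Sysprof sources.

```c
#include <assert.h>
#include <execinfo.h>   /* backtrace(): glibc extension, used as a stand-in */
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_SKIP_FRAMES 2    /* hypothetical number of frames to drop */
#define EXAMPLE_MAX_FRAMES  32   /* hypothetical depth limit */

/* Hypothetical event: a two-word header followed by the addresses. */
typedef struct {
  uintptr_t tid;
  uintptr_t n_addrs;
  void     *addrs[EXAMPLE_MAX_FRAMES];
} ExampleEvent;

/* The header must cover exactly the slots used by the skipped frames. */
_Static_assert (offsetof (ExampleEvent, addrs) ==
                EXAMPLE_SKIP_FRAMES * sizeof (void *),
                "header size must match the number of skipped frames");

int
example_capture (ExampleEvent *ev, uintptr_t tid)
{
  /* Start writing EXAMPLE_SKIP_FRAMES slots before ev->addrs, so the
   * unwanted innermost frames land on top of the (still empty) header. */
  void **start = (void **)(void *)ev;
  int n;

  assert (ev != NULL);

  n = backtrace (start, EXAMPLE_SKIP_FRAMES + EXAMPLE_MAX_FRAMES);
  n = n > EXAMPLE_SKIP_FRAMES ? n - EXAMPLE_SKIP_FRAMES : 0;

  /* Now fill in the header, overwriting the skipped frames. */
  ev->tid = tid;
  ev->n_addrs = (uintptr_t)n;

  return n;
}
```

Because the header is written last, no memmove() is needed to discard
the skipped frames.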
128 is a bit much and can slow us down considerably with user-space stack
traces. This can mess up the tree a bit, but we can alter how we view
things later on if we need to so that it is easier to read.
This ensures that the producer can produce as soon as the reader has moved
past that data. Now that the callback has a mutable consumption value, it
can read all of the data in one shot anyway.
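
As an illustration of a callback with a mutable consumption value, here
is a single-process sketch. ExampleRing, ExampleRingCallback, and the
functions below are hypothetical and leave out the atomics and shared
mapping a real cross-process buffer needs. The point is only that the
tail advances right after each callback, by however much the callback
reports it consumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_RING_SIZE 64u

typedef struct {
  uint8_t data[EXAMPLE_RING_SIZE];
  size_t  head;   /* total bytes written by the producer */
  size_t  tail;   /* total bytes released by the reader */
} ExampleRing;

/* The callback may lower *length to say how much it really consumed. */
typedef bool (*ExampleRingCallback) (const uint8_t *data,
                                     size_t        *length,
                                     void          *user_data);

bool
example_ring_write (ExampleRing *ring, const void *src, size_t len)
{
  const uint8_t *bytes = src;
  size_t space = EXAMPLE_RING_SIZE - (ring->head - ring->tail);

  if (len > space)
    return false;   /* reader has not moved past enough data yet */

  for (size_t i = 0; i < len; i++)
    ring->data[(ring->head + i) % EXAMPLE_RING_SIZE] = bytes[i];
  ring->head += len;
  return true;
}

void
example_ring_drain (ExampleRing *ring, ExampleRingCallback cb, void *user_data)
{
  while (ring->tail != ring->head)
    {
      size_t pos = ring->tail % EXAMPLE_RING_SIZE;
      size_t avail = ring->head - ring->tail;
      size_t len;
      bool keep_going;

      /* Only hand out the contiguous part up to the end of the array. */
      if (avail > EXAMPLE_RING_SIZE - pos)
        avail = EXAMPLE_RING_SIZE - pos;

      len = avail;
      keep_going = cb (&ring->data[pos], &len, user_data);
      assert (len <= avail);

      /* Release the space immediately so the producer can reuse it. */
      ring->tail += len;

      if (!keep_going || len == 0)
        break;
    }
}

/* Example callback: consumes everything it is offered. */
bool
example_consume_all (const uint8_t *data, size_t *length, void *user_data)
{
  size_t *total = user_data;
  (void)data;
  *total += *length;   /* *length left untouched: all of it was consumed */
  return true;
}
```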
This ideally should be dynamic in the future to copy out data at a rate
that keeps us around 33% usage rate so that we can still burst if we need
to but keep things empty enough to not lose data.
This frame type can be used to communicate with the peer over the mapped
ring buffer to denote that writing is finished and it can free any
resources for the mapping.
This removes the 8 bytes of framing data from the MappedRingBuffer which
means we can write more data without racing. But also this means that we
can eventually use the mapped ring buffer as our normal buffer for
capture writing (to be done later).
This is a simplified API for the inferior to use (such as from an
LD_PRELOAD) that will use the mmap()'d ring buffer created by Sysprof.
Doing so can reduce the amount of overhead in the inferior enough to make
some workloads useful. For example, collecting memory statistics and
backtraces is now fast enough to be useful.
This is the start of a ring buffer to coordinate between processes without
the overhead of writing directly to files within the inferior process.
Instead, the parent process can monitor the ring buffer for framing
information and pass that along to the capture writer.
This brings over some of the techniques from the old memprof design.
Sysprof and memprof shared a lot of code, so it is pretty natural to
bring back the same callgraph view based on memory allocations.
This reuses the StackStash just like it did in memprof. While it
would be nice to reuse some existing tools out there, the fit of
memprof with sysprof is so naturally aligned, it's not really a
big deal to bring back the LD_PRELOAD. The value really comes
from seeing all this stuff together instead of multiple apps.
There are plenty of things we can implement on top of this that
we are not doing yet such as temporary allocations, cross-thread
frees, graphing the heap, and graphing differences between the
heap at two points in time. I'd like all of these things, given
enough time to make them useful.
This is still a bit slow though due to the global lock we take
to access the writer. To improve the speed here we need to get
rid of that lock and head towards a design that allows a thread
to request a new writer from Sysprof and save it in TLS (to be
destroyed when the thread exits).
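
One possible shape for the per-thread writer, sketched with POSIX
thread-specific data. ExampleWriter and example_writer_for_thread() are
hypothetical, and plain calloc() stands in for "requesting a new writer
from Sysprof", which is not designed yet.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical per-thread writer state. */
typedef struct {
  unsigned long n_events;
} ExampleWriter;

static pthread_key_t  example_writer_key;
static pthread_once_t example_writer_once = PTHREAD_ONCE_INIT;

/* Runs automatically at thread exit for threads that created a writer. */
static void
example_writer_free (void *data)
{
  free (data);
}

static void
example_writer_key_init (void)
{
  int r = pthread_key_create (&example_writer_key, example_writer_free);
  assert (r == 0);
  (void)r;
}

/* Returns this thread's writer, creating it on first use. No global
 * lock is taken once the writer exists. */
ExampleWriter *
example_writer_for_thread (void)
{
  ExampleWriter *writer;

  pthread_once (&example_writer_once, example_writer_key_init);

  writer = pthread_getspecific (example_writer_key);
  if (writer == NULL)
    {
      writer = calloc (1, sizeof *writer);
      if (writer != NULL &&
          pthread_setspecific (example_writer_key, writer) != 0)
        {
          free (writer);
          writer = NULL;
        }
    }

  return writer;
}
```

Only the first call on each thread pays for setup. Later calls are a
pthread_getspecific() lookup with no contention between threads.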
So that we have valid addresses to translate, we need to translate the
jitmap before we translate samples. Otherwise we likely won't have
anything to translate.