Unfortunately, otelcol currently has no receiver for logfiles that produces tracing data.
There is log tailing functionality, and e.g. a fluent forward protocol receiver, but they all create "log" data, not tracing data.
I've seen this proof of concept to implement a forward receiver that creates tracing data, but it seems to have no traction and no relation to the upstream project at all (not even git history).
Thus, we've settled on this approach:
- Format haproxy log as JSON, using the zipkin format
- Read log with fluentbit, send it via HTTP to the opentelemetry collector zipkin receiver
As far as I can tell, this is the only combination of fluentbit output and otelcol receiver that matches.
The only sensible fluentbit output for our purpose is HTTP, and the only otelcol receivers for plaintext over HTTP I can see are otlp and zipkin (everything else is binary, grpc/thrift/protobuf/etc).
The fluentbit HTTP JSON format is fixed to "list of records" `[{}, {}]`, which luckily works with zipkin.
(otlp json is `{"resourceSpans": [{},{}]}`, i.e. a toplevel dict and not a list, see here and here)
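
To make the shape difference concrete, here is a minimal Python sketch. The span values are invented for illustration, the field names follow the zipkin v2 JSON span model as I understand it, and the otlp body is abbreviated to its top-level key only.

```python
import json

# One record as fluentbit would forward it: a flat dict per log line.
# All values here are made up for illustration.
record = {
    "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
    "id": "00f067aa0ba902b7",
    "name": "GET /",
    "timestamp": 1700000000123000,  # microseconds since epoch, as an int
    "duration": 42000,              # microseconds
}

# The fluentbit HTTP output in JSON format sends a top-level list of records.
fluentbit_body = json.dumps([record])

# A zipkin v2 JSON payload is also a top-level list of spans, so the shapes agree.
zipkin_ok = isinstance(json.loads(fluentbit_body), list)

# An otlp JSON payload is a top-level object instead (contents abbreviated here).
otlp_body = json.dumps({"resourceSpans": []})
otlp_ok = isinstance(json.loads(otlp_body), list)

print(zipkin_ok, otlp_ok)
```

The point is only the outermost JSON type: fluentbit cannot wrap its list in an object, so the otlp receiver is out.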
The fluentbit HTTP output can format the timestamp as either iso8601 or double,
but the zipkin protocol requires an int, which matches neither of those (and otelcol rejects them, I've tried).
Thus we point fluentbit to a separate timestamp field (we called it `fluentbit`) and do not output the resulting fluentbit-internal timestamp (`json_date_key false`) at all.
Instead we generate and format a `timestamp` field which fluentbit just passes through, like any other data field.
For this `timestamp` field, zipkin requires microseconds since epoch, but haproxy offers milliseconds (via `%Ts%ms`).
The haproxy log-format placeholders are not delineated, so we cannot use `%Ts%ms000`.
Sample expressions are delineated (`%[foo]`), but they don't provide the request start time, which is the timestamp we need.
We cannot use the JSON number formatting hack (`%Ts%ms.e3`), as the zipkin protocol requires an int, not a float (and otelcol rejects it).
Thus we use a fluentbit lua filter to multiply by 1000, which seems to be the only way in fluentbit to modify a record value.
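
The arithmetic the filter has to perform is small. The following Python sketch only illustrates the conversion; it is not the lua filter itself, and the function name `haproxy_ms_to_zipkin_us` is hypothetical.

```python
def haproxy_ms_to_zipkin_us(ts_ms):
    """Convert the concatenated %Ts%ms value (milliseconds since epoch)
    into integer microseconds since epoch, as zipkin expects."""
    return int(ts_ms) * 1000


# %Ts yields the seconds (e.g. "1700000000"), %ms the millisecond part
# (e.g. "123"); written back to back they form milliseconds since epoch.
logged = "1700000000" + "123"
print(haproxy_ms_to_zipkin_us(logged))  # 1700000000123000
```

The result stays an int throughout, which is the property the `.e3` hack cannot provide for this field.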
For the duration, zipkin likewise requires microseconds, but the haproxy "total request time" `%Ta` is in milliseconds.
Interestingly enough, here the JSON hack works (`%Ta.e3`). I'm not sure why otelcol accepts this float as an int, but I sure am happy, since this seems way less cumbersome than the lua route.
This concrete setup assumes that haproxy is not the first system in the request chain. Thus it relies on the w3c traceparent header for propagation of both the trace-id and the sampling decision (head-based sampling). Doing it this way is not at all necessary, you could just as well generate the trace-id in haproxy; I mention it only to make the context clear.
We (ab)use the fact that the `trace-flags` field is currently specified as exactly "01" or "00" for "should record" or "not", and write this directly into the `sampling.priority` span attribute, where the otelcol `probabilistic_sampler` processor expects it (and it recognizes int/float/string values there, so this works directly).
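
As a rough illustration of that mapping, here is a Python sketch that splits a traceparent header into its four dash-separated fields and passes the flags through unchanged. The helper names are hypothetical, and the header value is an invented example; in the actual setup this happens in the haproxy log format, not in Python.

```python
def parse_traceparent(header):
    """Split a w3c traceparent header into its four dash-separated fields."""
    version, trace_id, parent_id, trace_flags = header.strip().split("-")
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "trace_flags": trace_flags,
    }


def sampling_attribute(header):
    """Build the span attribute carrying the head-based sampling decision."""
    fields = parse_traceparent(header)
    # The flags string ("01" or "00") is written through as-is.
    return {"sampling.priority": fields["trace_flags"]}


example = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(sampling_attribute(example))  # {'sampling.priority': '01'}
```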
Also related to head-based sampling, there is no standardization yet for specifying the sampling rate.
We're using https://honeycomb.io/ as storage, which uses a custom span attribute `SampleRate` (case sensitive) to transmit that (see e.g. here).
So we are (ab?)using the w3c `tracestate` header to propagate this (because opentelemetry implementations already propagate this header for us), combined with custom code/config in each system to parse that header and create a span attribute.
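
What such per-system parsing code might look like is sketched below in Python. The tracestate member key `samplerate` is purely hypothetical (use whatever key your systems agree on), as are the function names, and converting the value to an int is an assumption about what the storage backend wants.

```python
def parse_tracestate(header):
    """Parse a tracestate header (comma-separated key=value members) into a dict."""
    members = {}
    for member in header.split(","):
        member = member.strip()
        if not member or "=" not in member:
            continue  # skip empty or malformed members
        key, value = member.split("=", 1)
        members[key] = value
    return members


def sample_rate_attribute(header, key="samplerate"):
    """Return the SampleRate span attribute if the (hypothetical) key is present."""
    value = parse_tracestate(header).get(key)
    if value is None:
        return {}
    return {"SampleRate": int(value)}


print(sample_rate_attribute("othervendor=abc, samplerate=10"))  # {'SampleRate': 10}
```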
Hi wosc, I am very happy to see your great solution. I ran into the same question: how to collect trace info for requests that pass through haproxy. I found fastly.lvc in your solution; could you explain fastly.lvc? How is it used in a service? Where should the config file be placed? Thanks a lot.