Performance Difference Between dedup and stats

From Duckfez

I'm going to say there likely won't be a huge material difference in any of these


yes, leading wildcards usually matter, because they force a read of each tsidx file's lexicon from beginning to end. But index is a special case: all active index names are already known and held in memory, and the set is small, on the order of 1000 names

the second one - the one with stats - does not actually need the fields command, because stats knows which fields it needs and, thanks to "smart mode", only the needed fields get extracted when you run a transforming (reporting) command like stats. If you're running in "verbose mode" then the added call to fields may help some, but you should not be performance testing things in verbose mode anyway!

which leaves us with, "is dedup more or less efficient than stats?"

stats is really powerful from a performance point of view because it does map-reduce very well. Each indexer is able to independently compute its own "local stats" for your function and fields, and then pass back a much smaller result to the search head. The search head can then take the much smaller result set and compute the "global stats" based on the local ones. For things like max() this is obvious: the global max is just the max of the local maxes
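To make the local/global split concrete, here is a rough Python sketch of max(_time) by f1, f2 computed in two phases. It is only an illustration of the idea, not how Splunk implements it: the event dicts, the sample values, and the function names local_stats and global_stats are all made up for the example.

```python
# Illustration only: hypothetical names and data, not Splunk internals.

def local_stats(events):
    """Map phase, run per indexer: max(_time) for each (f1, f2) pair."""
    out = {}
    for e in events:
        key = (e["f1"], e["f2"])
        if key not in out or e["_time"] > out[key]:
            out[key] = e["_time"]
    return out

def global_stats(partials):
    """Reduce phase, run on the search head: max of the local maxes."""
    out = {}
    for partial in partials:
        for key, t in partial.items():
            if key not in out or t > out[key]:
                out[key] = t
    return out

# Pretend each list is the set of matching events held by one indexer.
indexer_a = [
    {"_time": 100, "f1": "web", "f2": "GET"},
    {"_time": 105, "f1": "web", "f2": "GET"},
    {"_time": 101, "f1": "db", "f2": "SELECT"},
]
indexer_b = [
    {"_time": 110, "f1": "web", "f2": "GET"},
    {"_time": 90, "f1": "db", "f2": "SELECT"},
]

# Each indexer sends back one row per (f1, f2), not every event.
partials = [local_stats(indexer_a), local_stats(indexer_b)]
result = global_stats(partials)
print(result)
```

The point is the size of what crosses the network: each indexer ships one row per distinct (f1, f2) pair rather than every matching event.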

but, dedup can do the exact same thing. dedup is also easily map-reduced because each indexer can compute its own local set of dedup'ed results, and then pass that back to the search head to run one more dedup cycle over everything
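The same two-phase shape can be sketched for dedup. Again this is a simplified Python illustration with made-up names (dedup, newest_first) and made-up events, not Splunk's actual code. It assumes events are processed newest first, so "keep the first one seen" means "keep the newest one".

```python
# Illustration only: hypothetical names and data, not Splunk internals.

def dedup(events):
    """Keep the first event seen for each (f1, f2) pair."""
    seen = set()
    kept = []
    for e in events:
        key = (e["f1"], e["f2"])
        if key not in seen:
            seen.add(key)
            kept.append(e)
    return kept

def newest_first(events):
    """Assumed ordering: newest events come first."""
    return sorted(events, key=lambda e: e["_time"], reverse=True)

indexer_a = [
    {"_time": 100, "f1": "web", "f2": "GET", "host": "a1"},
    {"_time": 105, "f1": "web", "f2": "GET", "host": "a2"},
    {"_time": 101, "f1": "db", "f2": "SELECT", "host": "a1"},
]
indexer_b = [
    {"_time": 110, "f1": "web", "f2": "GET", "host": "b1"},
    {"_time": 90, "f1": "db", "f2": "SELECT", "host": "b2"},
]

# Local pass on each indexer, then one more pass over the merged results.
local_a = dedup(newest_first(indexer_a))
local_b = dedup(newest_first(indexer_b))
final = dedup(newest_first(local_a + local_b))

for e in final:
    print(e)
```

Unlike the stats sketch, every surviving row here is a whole event with all of its fields (host included), which is the semantic difference that matters more than the performance one.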

... so far ... there's nothing here that suggests that there should be a large performance difference in either of these

there is one semantic difference though - the stats returns only the three fields - f1, f2, and max(_time).    The dedup returns the newest whole raw event for each value of f1, f2

( which is why you need the table )

From a semantic point of view, if all you needed was the max(_time) for values of f1 and f2 then the stats is more correct

But, I think that if you do a really careful objective performance test (account for other load on the system, kernel data caching, etc) ... you'll find the differences are so tiny as to be like 5% either way
