
Performance Difference Between dedup and stats

From Duckfez

I'm going to say there likely won't be a huge material difference in any of these

yes, leading wildcards usually matter, as they force the reading of the lexicon of each tsidx file from beginning to end. But index is a special case, because all active index names are already known, in memory, and form a set of around O(1000) entries

the second one - the one with stats - does not actually need the fields command, because stats knows which fields it needs, and "smart mode" will only extract the needed fields when running a transforming (reporting) command like stats. If you're running in "verbose mode" then the added call to fields may help some, but you should not be performance testing things in verbose mode anyway!

which leaves us with, "is dedup more or less efficient than stats"

stats is really powerful from a performance point of view because it does map-reduce very well. Each indexer is able to independently compute its own "local stats" for your function and fields, and then pass back a much smaller result to the search head. The search head can then take the much smaller result set and compute the "global stats" based on the local ones. For things like max() this is obvious
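To make the local/global split concrete, here is a minimal Python sketch of the idea for something like max(_time) by f1, f2. It is only an illustration of the map-reduce shape described above, not how Splunk implements it; the function names, the "indexer" lists, and the sample events are all made up for the example.

```python
# Hypothetical model: each "indexer" is just a list of event dicts.

def local_stats(events):
    """Indexer step: max _time per (f1, f2) over this indexer's events only."""
    partial = {}
    for ev in events:
        key = (ev["f1"], ev["f2"])
        if key not in partial or ev["_time"] > partial[key]:
            partial[key] = ev["_time"]
    return partial


def global_stats(partials):
    """Search-head step: the max of the per-indexer maxima is the global max."""
    merged = {}
    for partial in partials:
        for key, t in partial.items():
            if key not in merged or t > merged[key]:
                merged[key] = t
    return merged


# Made-up sample data spread over two indexers.
indexer_a = [
    {"_time": 100, "f1": "web", "f2": "GET"},
    {"_time": 130, "f1": "web", "f2": "GET"},
    {"_time": 110, "f1": "db", "f2": "SELECT"},
]
indexer_b = [
    {"_time": 125, "f1": "web", "f2": "GET"},
    {"_time": 140, "f1": "db", "f2": "SELECT"},
]

result = global_stats([local_stats(indexer_a), local_stats(indexer_b)])
print(result)
```

Each indexer sends back one row per (f1, f2) pair rather than every event, which is where the saving comes from.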

but, dedup can do the exact same thing. dedup is also easily map-reduced because each indexer can compute its own local set of dedup'ed results, and then pass that back to the search head to run one more dedup cycle over everything

... so far ... there's nothing here that suggests that there should be a large performance difference in either of these

there is one semantic difference though - the stats returns only the three fields - f1, f2, and max(_time).    The dedup returns the newest whole raw event for each value of f1, f2

( which is why you need the table )
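A small Python sketch of that difference, under the same kind of made-up model (lists of event dicts standing in for indexers): the dedup-style pass keeps the whole newest event per (f1, f2), and can be run locally and then once more over the combined results, while the stats-style answer is just the three values. The helper name and sample events are hypothetical.

```python
def dedup_newest(events):
    """Keep the whole newest event for each (f1, f2) pair."""
    newest = {}
    for ev in events:
        key = (ev["f1"], ev["f2"])
        if key not in newest or ev["_time"] > newest[key]["_time"]:
            newest[key] = ev
    return list(newest.values())


# Made-up events; "_raw" stands in for everything else an event carries.
indexer_a = [
    {"_time": 100, "f1": "web", "f2": "GET", "_raw": "a-100"},
    {"_time": 130, "f1": "web", "f2": "GET", "_raw": "a-130"},
    {"_time": 110, "f1": "db", "f2": "SELECT", "_raw": "a-110"},
]
indexer_b = [
    {"_time": 125, "f1": "web", "f2": "GET", "_raw": "b-125"},
    {"_time": 140, "f1": "db", "f2": "SELECT", "_raw": "b-140"},
]

# Local dedup on each indexer, then one more pass over the combined results.
final = dedup_newest(dedup_newest(indexer_a) + dedup_newest(indexer_b))

# The "table" step: project the whole events down to the three fields.
projected = sorted((ev["f1"], ev["f2"], ev["_time"]) for ev in final)
print(projected)
```

After the projection the two approaches give the same rows; before it, the dedup result still carries every field of the surviving events.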

From a semantic point of view, if all you needed was the max(_time) for values of f1 and f2 then the stats is more correct

But, I think that if you do a really careful objective performance test (account for other load on the system, kernel data caching, etc) ... you'll find the differences are so tiny as to be like 5% either way

Index config check

| rest splunk_server_group=dmc_group_indexer /servicesNS/-/-/data/indexes
| fields splunk_server title repFactor homePath homePath_expanded coldPath coldPath_expanded thawedPath thawedPath_expanded summaryHomePath_expanded tstatsHomePath tstatsHomePath_expanded
| eval Index = title, hot = mvappend(homePath, homePath_expanded), cold = mvappend(coldPath, coldPath_expanded), thawed = mvappend(thawedPath, thawedPath_expanded), summaries = summaryHomePath_expanded, dma = mvappend(tstatsHomePath, tstatsHomePath_expanded)
| stats values(splunk_server) AS "Indexers" values(repFactor) AS "Replication Factor" values(hot) AS "Hot/Warm" values(cold) AS "Cold" values(thawed) AS "Thawed" values(summaries) AS "Summaries" values(dma) AS "Data Model Accelerations" by Index

List all of your lookups

| rest splunk_server=local /servicesNS/-/-/data/transforms/lookups | fields title eai:appName type filename collection

Search to see number of concurrent searches

Courtesy of David Paper

index=_internal earliest=-1h group=search_concurrency host=<search head glob> ("system total") | rex field=_raw mode=sed "s/system total/user=system/g" |eval user=coalesce(user,"system") | timechart max(active_hist_searches) by user

Splunk clustering status


A peer showing no symptoms will be in the UP state. This is the peak of health.


If a peer shows concerning but tolerable symptoms it will be put in the UNSTABLE state.
In this state the peer is still searched but we emit warnings about our symptoms on the bulletin board.
Preempts all previous states. Currently symptoms that fall into this are:

  • Clock skew between search head and peer. We get the peer's time from the timestamp on the HTTP response headers during the heartbeat. If the difference exceeds a configurable threshold in limits.conf, we consider the clocks to be skewed.
  • Oversubscribed peers. If an indexer is streaming back search results at a much slower rate than the others, it can hold up the completion of the whole search. We currently have logic to detect such slow peers in the search process, and we use it to kill the peer before we get all the data. (Feature is off by default.)
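As a rough illustration of the clock skew check, the sketch below compares an HTTP Date header against the local clock in Python. This is an assumption about the general shape of such a check, not Splunk's code; the function names and the 60-second threshold are hypothetical and are not real limits.conf settings or defaults.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

MAX_SKEW_SECONDS = 60  # hypothetical threshold, chosen only for the example


def clock_skew_seconds(date_header, local_now):
    """Absolute difference between the peer's Date header and our clock."""
    peer_time = parsedate_to_datetime(date_header)
    return abs((local_now - peer_time).total_seconds())


def is_skewed(date_header, local_now, max_skew=MAX_SKEW_SECONDS):
    """True when the skew exceeds the configured threshold."""
    return clock_skew_seconds(date_header, local_now) > max_skew


# Made-up heartbeat: the peer's header is two minutes behind our clock.
now = datetime(2015, 10, 21, 7, 30, 0, tzinfo=timezone.utc)
skew = clock_skew_seconds("Wed, 21 Oct 2015 07:28:00 GMT", now)
print(skew, is_skewed("Wed, 21 Oct 2015 07:28:00 GMT", now))
```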


For all other symptoms we move the peer to the DOWN state. In this state the peer is not searched but we still heartbeat to monitor it. Preempts all previous states.


There should never be a situation where this state is reached. However, if this status code shows up in your indexing cluster, welp, there you are.

Data Durability Status and History

index=_internal host=indexer* OR host=cm* ((source=*splunkd.log* my guid) OR (source=*health* due_to_stanza="feature:data_searchable" color=red))
| eval type=case(match(source,"health"),"not searchable",match(source,"splunkd\.log"),"start-up")
| timechart span=1m dc(sourcetype) by type

Thanks to JonRust on Slack

Splunk dev with bump, refresh, restarts

_bump for “content files” (css/js/appserver), debug/refresh for “config changes/xml/conf” and “splunkweb restart” for persistent handlers. mod input, custom command py files are executed fresh each instantiation after the initial “pick up new things splunkd restart”. conf.spec requires restart

Thanks, alacercogitatus

Rolling authentication failures by device over 1 minute windows

|tstats summariesonly=true allow_old_summaries=true count from datamodel=Authentication where  Authentication.action="failure" by _time Authentication.dest span=1s 
| rename Authentication.* AS * 
| streamstats time_window=1m sum(count) AS dest_failures by dest
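For anyone who wants to sanity-check what the streamstats step is doing, here is a minimal Python sketch of a trailing one-minute sum per dest over per-second counts. It models the idea only: the function name and sample rows are made up, and the exact boundary handling (whether a bucket exactly 60 seconds old still counts) is an assumption that may not match streamstats.

```python
from collections import defaultdict, deque


def rolling_failures(rows, window_seconds=60):
    """rows: (time, dest, count) tuples sorted by time ascending.

    Returns (time, dest, dest_failures), where dest_failures is the sum of
    count for that dest over the trailing window ending at time.
    """
    windows = defaultdict(deque)  # dest -> (time, count) pairs still in window
    totals = defaultdict(int)     # dest -> running sum of counts in window
    out = []
    for t, dest, count in rows:
        win = windows[dest]
        win.append((t, count))
        totals[dest] += count
        # Drop buckets that have fallen out of the trailing window.
        while win and win[0][0] <= t - window_seconds:
            _, old_count = win.popleft()
            totals[dest] -= old_count
        out.append((t, dest, totals[dest]))
    return out


# Made-up per-second failure counts, as the tstats step would produce.
rows = [
    (0, "host1", 2),
    (30, "host1", 1),
    (45, "host2", 5),
    (59, "host1", 4),
    (61, "host1", 1),
]
rolled = rolling_failures(rows)
print(rolled)
```

In the sample, the bucket at time 0 drops out of host1's window by time 61, so its count is no longer included in that row's total.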