Understanding space usage in Azure Monitor logs

Data ingested into Azure Monitor logs is billed per gigabyte. A workspace will typically grow to hold data from many different sources and solutions, so it helps to have a set of queries that let you quickly drill into exactly where the GBs (or TBs!) of data you have stored come from.

I’ve found the queries below to be very helpful starting points for three main scenarios:

  • Regular monitoring (once/month) to see how data volumes are trending
  • Reacting to a monitoring alert based on overall ingestion volumes
  • Testing out a configuration change/new solution and observing the impact on data ingested

The latter is particularly important before rolling out a change to a workspace with long retention - you wouldn’t want (hypothetically :)) to accidentally ingest 100GB of IIS logs and then be forced to retain them for 2 years…

Workspace solution/data type usage

This query provides a summary of the top solutions and data types in a single workspace, along with how they contribute to overall usage. If your workspace has a lot of different solutions or data types you’ll benefit from changing the values passed to the top-nested operators.

Usage
| where TimeGenerated > startofday(ago(30d)) and TimeGenerated < startofday(now())
| where IsBillable
| top-nested 1 of 'All' by AllData = round(sum(Quantity),0)
, top-nested 2 of Solution with others='Others' by SolutionTotal = round(sum(Quantity), 0)
, top-nested 2 of DataType with others = 'Others' by SolutionDataTotal = round(sum(Quantity), 0)
| where AllData != 0
| extend SolutionPct = round((SolutionDataTotal / SolutionTotal) * 100, 1)
| extend OverallPct = round((SolutionDataTotal / AllData) * 100,1)
| project AllData, SolutionTotal, Solution, DataType, SolutionPct, SolutionDataTotal, OverallPct
| order by OverallPct desc

A couple of example results are below - one for a security-focused workspace, and one for a workspace that has a bit more going on.

Security Workspace

We can see here that this workspace is dominated by the Security event log - almost 70% of all the data is the SecurityEvent data type in the Security solution. This corresponds to the Windows Security Event Log records that are uploaded.

Complex Workspace

This workspace demonstrates how the top-nested operator works. Inside the LogManagement solution we’ve got two distinct data types (Perf and a custom log - the one ending in _CL). We’ve also got the ‘Others’ record, which tells us the longer tail of data types in that solution might be worth looking at, as they comprise nearly 30% of the solution total and 15% of overall data.
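
If you want to see exactly what is hiding in that tail, a small variation on the Usage query lists every data type in a single solution. This is just a sketch - the LogManagement solution name is taken from the example above (swap in whichever solution you’re chasing), and it assumes Quantity is reported in MB:

Usage
| where TimeGenerated > startofday(ago(30d)) and TimeGenerated < startofday(now())
| where IsBillable
| where Solution == "LogManagement" // the solution whose tail you want to inspect
| summarize TotalGB = round(sum(Quantity) / 1024, 2) by DataType
| order by TotalGB desc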

Workspace data over time

If you have multiple workspaces it can be helpful to see how their data trends over time. Extending this query to run over additional workspaces is unfortunately a matter of copy-paste:

workspace("primaryWorkspace").Usage
| where TimeGenerated > startofday(ago(30d)) and TimeGenerated < startofday(now())
| where IsBillable
| extend Workspace = "Primary"
| summarize sum(Quantity) by Solution, bin(TimeGenerated, 1d), DataType, Workspace
| union (
workspace("secondaryWorkspace").Usage
| where TimeGenerated > startofday(ago(30d)) and TimeGenerated < startofday(now())
| where IsBillable
| extend Workspace = "Secondary"
| summarize sum(Quantity) by Solution, bin(TimeGenerated, 1d), DataType, Workspace
)
| render timechart with(title = 'Workspace Data Usage')

In this example the primary workspace is responsible for capturing data which will be retained for a long period of time, and the secondary workspace captures higher volume data. Use the dropdown to change between viewing by Workspace, Solution or Data Type.

Data usage over time

Note the use of startofday(...) when filtering on TimeGenerated. Cutting the boundaries at midnight removes the odd artefact where the graph looks unusually low at either end because only part of a day’s data has been counted.
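
To see exactly what that alignment does, you can run a quick one-off check (purely illustrative - it isn’t part of the usage queries):

print RawLowerBound = ago(30d), DayAlignedLowerBound = startofday(ago(30d))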

Drilling into a single data type

At some point a graph will invariably point to something that looks a little weird - an upward trend or maybe a huge spike - and you’ll want to understand where that is coming from. The good news is that Azure Monitor logs provides a pair of virtual columns, _IsBillable and _BilledSize, that let you drill into any data type and find out exactly what it is costing you.

Event
| where TimeGenerated > startofday(ago(30d)) and TimeGenerated < startofday(now())
| where _IsBillable == true 
| extend computerName = tolower(tostring(split(Computer, '.')[0]))
| where computerName != ""
| summarize TotalVolumeMB = sum(_BilledSize / 1024 / 1024) by computerName, bin(TimeGenerated, 1d)
| render barchart with(title = 'EventLog by Computer')

In this example, assume we discovered the Event data type in a particular workspace had a spike - running the query quickly tells us whether this was limited to a single server or spread across all of them. And as you can see below, there were three servers that had a very eventful day a little while back!

EventLog by computer
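
Once you know which computer (or computers) to look at, the same pair of columns lets you go one level deeper - for example, splitting a single server’s volume by event log and source. This is a sketch, with ‘server01’ standing in for whichever machine the chart pointed you at:

Event
| where TimeGenerated > startofday(ago(30d)) and TimeGenerated < startofday(now())
| where _IsBillable == true
| where tolower(tostring(split(Computer, '.')[0])) == "server01" // hypothetical server name
| summarize TotalVolumeMB = round(sum(_BilledSize) / 1024.0 / 1024.0, 1) by EventLog, Source
| order by TotalVolumeMB desc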

Further reading

The official docs are a great place to start reading more about space usage and monitoring for Azure Monitor logs. I’ve also put a couple of queries on GitHub that I’ll update over time as I find any better ways of cutting this data.

[Read More]

Auto scale down all Event Hub namespaces with Azure Functions

A little over a year ago I lamented the lack of an auto-deflate feature for Event Hubs, and offered a way to programmatically scale down your namespaces. That solution still works, but requires a redeploy each time you want to add a namespace. Today we’ll look at an upgraded function app that programmatically discovers and scales down all Event Hub namespaces it has access to.

With the addition of a PowerShell script to grant the appropriate permissions to all of your namespaces, you can be up and running (or deflating) in a matter of minutes.

[Read More]

Overhead of Event Hub outputs with Azure Function Apps

In a world where you’re billed for what you use, it pays to really understand what exactly it is you are using. The pricing model of the Azure Function Apps consumption plan sounds pretty simple (pay based on execution count and execution time) - though as always the devil is in the detail.

In a recent project where we’ve migrated a workload from a dedicated (on-prem) server to a function app, someone asked what sounded like a fairly simple question:

  • Do we pay (on the function app side) for the output to an Event Hub, and if so how much does it cost?

This post explores the answer to that question for a trivial C# function, and provides a few pointers to help get your head around consumption billing.

If you’re wondering why we wanted to deconstruct the function’s cost when an execution unit costs a mere 16 picodollars (16x10^-12), consider what we saw after our first week at full load:

A lot of executions

5 trillion execution units… Interesting!
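
As a rough back-of-the-envelope using only the two numbers above (so ignoring the monthly free grant and exactly how the pricing page counts usage):

5 x 10^12 execution units × 16 x 10^-12 dollars per unit ≈ $80 for the week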

Execution units vs. what you see on the pricing page is covered later in the post.

[Read More]

Adding caching to your PowerShell scripts

Most of the time, when some part of a script takes a long time to run and you want to re-use the result, you’ll store it in a variable:

$databases = Get-ListOfDatabases # assume this is an expensive/long-running query
Invoke-SomeOperation -Server serverOne -Databases $databases
Invoke-SomeOtherOperation -Server serverOne -Databases $databases

As long as the function you’re calling allows you to pass in the ‘expensive’ argument you’re fine - but what about cases when the computation takes place inside the function:

Invoke-SomeOperation -Server serverOne # internally calls Get-ListOfDatabases
Invoke-SomeOtherOperation -Server serverOne # internally calls Get-ListOfDatabases

Sometimes it makes sense to rework the script so the argument can be passed in; other times it’s cleaner to cache the result at the call site rather than change every function to accept (and potentially pass through) that argument.

The rest of this post will go through a specific example that motivated caching, share a generic function that implements scriptblock-based caching, and call out a few gotchas.

[Read More]
Tags: PowerShell

Resumable Online Index Rebuilds - SQL 2017's Silver Bullet

Every new version of SQL Server comes with a whole grab-bag of new features that open up exciting possibilities, or obviate the need for workarounds (or dirty hacks). In the run-up to deploying SQL 2017 I thought that automatic plan correction was going to be the closest thing to a silver bullet I’d seen in any release so far (in terms of value added vs. effort to implement), but it has been eclipsed for us by the awesomeness of Resumable Online Index Rebuilds (ROIR).

In this post I’ll talk through a few of the scenarios where this feature really shines, and how it has transformed the way we think about index maintenance. If you’d like more details about how ROIR is implemented I’d encourage you to read through the excellent paper detailing it - this covers how the online index rebuild algorithm was updated, and demonstrates that in most cases ROIR outperforms the non-resumable version.

[Read More]
Tags: SQL