Case Study

A Windows case study.

The details of what I would do for Linux would be different, but the premise would remain.

Current Situation

A load-balanced set of two IIS servers running 50+ apps, with an application dependency on SQL Server running on another host.
Alert fired
Site is not responsive (assumed timing out)
An IIS process (PID) consumes 100% CPU AND memory, causing the application to hang or time out requests.
The server is possibly UP but handling requests very slowly, or not at all.

Consider having your load balancer return 503s or a custom response when backends don't respond. Status codes can and should be measured, so that alerts can help you narrow down which route or application has issues.
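As a rough illustration of measuring status codes per application, here is a minimal sketch. It assumes the load balancer or IIS access logs have already been parsed into `(application, status)` pairs; the function name `error_rates`, the record shape, the sample application names, and the 5% threshold are all assumptions for illustration, not part of the case study.

```python
from collections import Counter

def error_rates(records, threshold=0.05):
    """Return {app: 5xx ratio} for apps whose 5xx ratio exceeds `threshold`.

    records: iterable of (app_name, http_status) pairs (hypothetical shape).
    """
    totals = Counter()
    errors = Counter()
    for app, status in records:
        totals[app] += 1
        if 500 <= status <= 599:
            errors[app] += 1
    return {
        app: errors[app] / totals[app]
        for app in totals
        if errors[app] / totals[app] > threshold
    }

# Hypothetical sample: "billing" returns mostly 503s, "catalog" is healthy.
sample = [
    ("billing", 200), ("billing", 503), ("billing", 503),
    ("catalog", 200), ("catalog", 200),
]
suspects = error_rates(sample)
```

An alert keyed on the returned ratio per application would point at the application to investigate first, rather than only reporting that the site as a whole is slow.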

Given

Windows
IIS, Server count: 2 (Interesting that there are only 2)
Application count: 50+ (over committed?)
Load Balanced
- assumption: "evenly" implies round robin, especially with the absence of the words "session", "affinity", or "weight"
Alert has fired, the down event has already happened (assumed or via customer support calls)
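To make the round robin assumption concrete, here is a small sketch of strict rotation across the two backends. The function name `round_robin` and the backend labels are illustrative assumptions; a real load balancer may behave differently (weights, health checks, affinity).

```python
from collections import Counter
from itertools import cycle

def round_robin(backends, request_count):
    """Assign `request_count` requests to `backends` in strict rotation."""
    rotation = cycle(backends)
    return [next(rotation) for _ in range(request_count)]

assignments = round_robin(["IIS A", "IIS B"], 6)
per_backend = Counter(assignments)  # each backend receives half of the requests
```

If this assumption holds, a problematic request type would reach both servers in roughly equal measure, which is consistent with the whole site appearing unresponsive rather than only half of the traffic.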

Basic Topology

Below is a Basic Topology of the servers involved with the case study.

---
title: Basic Topology
---

flowchart LR
  subgraph Infra
    LB["Load Balancer"]
    serverA["IIS A"]
    serverB["IIS B"]
    sql["SQL Server(s)"]

    serverA --> sql
    serverB --> sql
  end

  user["Users"]

  user --> LB
  LB --round robin-->serverA
  LB --round robin-->serverB

Strategy Windows

The primary goal is to confirm the issue and identify the application, and possibly the route or API, that causes it.

Once the application is identified, attempt to identify AND confirm the request which is causing the issue. Reproduce it if possible.

Once we have confidence that we've narrowed down the specific application and request (which causes the issue), we can begin to dig deeper.

It could be that there is resource contention, such as disk or network, or an issue with a dependency such as a database or another service (API).

When investigating disk IO issues, first determine if the disk is full or experiencing saturated IO. Saturated IO can manifest as prolonged wait times, often indicative of a faulty device or slow network/storage connection. This can show up as excessive blocking, hindering overall system performance.
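The two disk questions above (is it full, is IO saturated) can be sketched as follows. This is a minimal illustration: `shutil.disk_usage` is standard library, but the helper names, the idea of feeding in samples from a counter such as PhysicalDisk "Avg. Disk Queue Length", and the threshold of 2.0 are assumptions that would need tuning per environment.

```python
import shutil

def disk_fill_ratio(path):
    """Fraction of the volume holding `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total if usage.total else 0.0

def io_looks_saturated(queue_length_samples, limit=2.0):
    """True when the mean of the sampled queue lengths exceeds `limit`.

    The samples are assumed to come from a disk queue length counter;
    `limit` is an assumed threshold, not a universal rule.
    """
    if not queue_length_samples:
        return False
    return sum(queue_length_samples) / len(queue_length_samples) > limit

ratio = disk_fill_ratio(".")
saturated = io_looks_saturated([0.4, 6.0, 7.5, 9.1])  # hypothetical samples
```

Checking fullness first is cheap and rules out the simplest cause before spending time on queue length and latency analysis.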

In the case of databases, locks on specific queries or sets of queries could be causing issues. Additionally, conflicts arising from interactions with other applications may contribute to (even compound) the problem. Analysis can help identify the root cause, including pinpointing any locks and associated queries.
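One way to pinpoint locks is to walk the blocking chain back to the session at its head. The sketch below assumes rows shaped like `(session_id, blocking_session_id)`, which are columns exposed by SQL Server's `sys.dm_exec_requests` view, with 0 meaning "not blocked". The function name `head_blockers` and the sample session ids are illustrative assumptions.

```python
def head_blockers(requests):
    """Map each head blocker to the sessions stuck behind it.

    requests: list of (session_id, blocking_session_id); 0 means not blocked.
    If the chain loops (a deadlock), the walk stops at the first repeated
    session, so the reported head is arbitrary in that case.
    """
    blocked_by = {sid: blocker for sid, blocker in requests if blocker}
    heads = {}
    for sid in blocked_by:
        seen = set()
        current = sid
        while current in blocked_by and current not in seen:
            seen.add(current)
            current = blocked_by[current]
        heads.setdefault(current, []).append(sid)
    return heads

# Hypothetical snapshot: 53 waits on 52, which waits on 51; 55 waits on 54.
rows = [(51, 0), (52, 51), (53, 52), (54, 0), (55, 54)]
chains = head_blockers(rows)
```

The query text of the head blocker is usually the place to start, since everything queued behind it is a symptom rather than the cause.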

The "Thundering Herd" problem comes to mind as well in situations like this. Do all the applications start AT ONCE, requiring excessive resources before slowly settling down? Are startup times staggered? Health checks should be differentiated from readiness checks.
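Staggering startup is one common mitigation for a thundering herd. Here is a minimal sketch that spaces application starts apart and adds random jitter; the function name `startup_delays`, the spacing and jitter values, and the application names are assumptions for illustration.

```python
import random

def startup_delays(app_names, spacing_seconds=5.0, jitter_seconds=2.0, seed=None):
    """Return {app: delay in seconds} so apps do not all start at once.

    Each app is offset by `spacing_seconds` from the previous one, plus a
    random jitter in [0, jitter_seconds]. `seed` makes the result repeatable.
    """
    rng = random.Random(seed)
    return {
        name: index * spacing_seconds + rng.uniform(0, jitter_seconds)
        for index, name in enumerate(app_names)
    }

delays = startup_delays(["billing", "catalog", "search"], seed=1)
```

With 50+ applications on two servers, even a few seconds of spacing spreads the startup cost over several minutes instead of one spike.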

There could also be a SINGLE database with multiple applications accessing the same data. Hopefully each application uses a different service account, so that each long-running query can be traced back to a specific user...
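If each application does have its own service account, attributing long-running queries becomes a simple grouping exercise. The sketch below assumes records of `(login_name, elapsed_seconds, query_text)` collected from whatever monitoring is in place; the function name, the 30 second cutoff, and the sample logins are hypothetical.

```python
from collections import defaultdict

def long_running_by_login(queries, min_seconds=30):
    """Group queries running at least `min_seconds` by login, slowest first.

    queries: iterable of (login_name, elapsed_seconds, query_text).
    """
    grouped = defaultdict(list)
    for login, elapsed, text in queries:
        if elapsed >= min_seconds:
            grouped[login].append((elapsed, text))
    return {login: sorted(items, reverse=True) for login, items in grouped.items()}

# Hypothetical sample: one service account owns both slow queries.
observed = [
    ("svc_billing", 120, "SELECT ... FROM invoices"),
    ("svc_catalog", 2, "SELECT ... FROM products"),
    ("svc_billing", 45, "UPDATE invoices SET ..."),
]
offenders = long_running_by_login(observed)
```

When all applications share one login, this grouping collapses into a single bucket, which is the practical argument for separate service accounts.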

If IO (disk or networking) and the database show no issues, identify the change request associated with said feature. Doing a git blame, or a similar source control blame, to trace requests to code will be required, though it is very possible that no code changes were made.

In that case, there should be a change management record of database changes. One should be able to cross-reference SQL profile information with the change management list.
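Cross-referencing can start with something as simple as listing the changes applied shortly before the incident. This is a minimal sketch under stated assumptions: change records are `(change_id, applied_at, description)` tuples, the ids such as `CHG-1` are made up, and the 24 hour window is arbitrary.

```python
from datetime import datetime, timedelta

def recent_changes(incident_time, changes, window=timedelta(hours=24)):
    """Return changes applied within `window` before the incident, newest first.

    changes: list of (change_id, applied_at, description) tuples.
    """
    candidates = [
        change for change in changes
        if timedelta(0) <= incident_time - change[1] <= window
    ]
    return sorted(candidates, key=lambda change: change[1], reverse=True)

incident = datetime(2024, 1, 10, 14, 0)
changes = [
    ("CHG-1", datetime(2024, 1, 10, 9, 30), "add index on orders"),
    ("CHG-2", datetime(2024, 1, 8, 16, 0), "alter column type"),
    ("CHG-3", datetime(2024, 1, 10, 15, 0), "applied after the incident"),
]
candidates = recent_changes(incident, changes)
```

The resulting shortlist can then be compared against the tables and queries that show up in the SQL profile data.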

SREs should be comfortable running explain plans on queries alongside developers, and understanding the code being deployed. SREs should shepherd and advise where possible. SREs need expertise in the tools being used, even if that means writing new ones or debugging the source code of openly used ones.

Confirmation

First confirm the alert is real. Network splits happen, monitoring tools can go down. Ensure the alert is real and factual. It is possible a manual human entry was made to the Load Balancer, and the rules are incorrect (If you nodded, tsk tsk, everything should be IaC and not manual ...)

---
title: Basic Confirmation
---

graph TB
  alert[Alert Fired] --> confirmAlert{Confirm Alert}
  confirmAlert -- yes --> sd[App Down]
  sd --> remove_server[consider taking backend out of LB rotation and reproducing the issue]
  confirmAlert -- no --> fixalert[Fix Alert or Wait for Net Split resolution]
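One way to confirm an alert is real is to probe the site from more than one vantage point and only treat it as down when most probes fail. The sketch below shows just the decision logic; the function name `confirm_alert`, the vantage point names, and the majority quorum are assumptions, and the actual probing is left out.

```python
def confirm_alert(probe_results, quorum=0.5):
    """Return True when more than `quorum` of the probes failed.

    probe_results: {vantage_point: True if the site responded, else False}.
    """
    if not probe_results:
        raise ValueError("at least one probe result is required")
    failures = sum(1 for ok in probe_results.values() if not ok)
    return failures / len(probe_results) > quorum

# Hypothetical results: only the monitoring host cannot reach the site,
# which hints at a network split rather than a real outage.
split = confirm_alert({"monitoring": False, "office": True, "cloud": True})
outage = confirm_alert({"monitoring": False, "office": False, "cloud": True})
```

A single failing vantage point leads to the "Fix Alert or Wait for Net Split resolution" branch, while a majority failure leads to "App Down".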

Q1 - Are server resources the issue?

---
title: Are server resources the issue?
---
graph TB
  counters[Look at Perf Counters] --> isdisk{Disk IO Issues}
  isdisk -- yes --> opsfix[Fix/Add/Swap Disk]
  isdisk -- no --> req_queue_depth{Request Queue Depth?}
  opsfix --> fixed{Fixed?}
  fixed -- yes --> iac[IaC the change]
  fixed -- no --> req_queue_depth

Q2 - Is the server right sized?

---
title: Is the server right sized or is the App slow?
---

graph TB
  req_queue_depth{Request Queue Depth?} -- shallow --> slowapp[Low RPS and app is just slow]
  slowapp --> devs[Work with Devs to profile and enhance app]
  req_queue_depth -- deep --> increase_resources[Add more resources]
  increase_resources --> fixed{Fixed?}
  devs --> fixed
  fixed -- yes --> iac[IaC the change]
  fixed -- no --> pools{Identify Single or Multiple Pools}
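The first decisions in the Q1 and Q2 flows can be condensed into a small function. This is only a sketch of the branching: the function name `next_step`, the returned strings, and the queue depth threshold of 100 are assumptions, and the right threshold depends on the workload.

```python
def next_step(disk_io_issue, request_queue_depth, deep_threshold=100):
    """Pick the next action from the Q1/Q2 flowcharts.

    disk_io_issue: True when perf counters point at disk IO problems.
    request_queue_depth: sampled depth of the request queue.
    """
    if disk_io_issue:
        return "fix, add, or swap disk"
    if request_queue_depth >= deep_threshold:
        return "add more resources"
    return "profile the app with developers"

action = next_step(disk_io_issue=False, request_queue_depth=3)
```

A shallow queue with slow responses suggests the application itself is slow, while a deep queue suggests the server cannot keep up with demand.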

Q3 - Which App and what's wrong with it?

---
title: Which App and what's wrong with it?
---

graph TB
  pools{Identify Single or Multiple Pools} --Single--> check_pool[Check pool Resources]
  check_pool --> human_error{Human error? Throttle set?}
  human_error -- yes --> iac[fix and IaC the change]
  pools --Multi--> identpool[Identify Specific Pool]
  human_error -- no --> identpool
  identpool --> inspect_curr_req[inspect pool's current running requests]
  inspect_curr_req --> identify_app[Identify The Specific App]
  identify_app --> need_more{Need More Info?}
  need_more -- yes --> logging[Analyze App Logs associated with Req]
  need_more -- no --> workdevs[Work With Devs]
  logging --> need_more2{Need More Info?}
  need_more2 -- yes --> identify_deps[Identify App Dependencies]
  need_more2 -- no --> id_app[Identify App]
  id_app --> workdevs
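Identifying the specific pool from currently running requests can be sketched as below. It assumes a snapshot of in-flight requests as `(app_pool, url, elapsed_ms)` tuples, however they were collected; the function name `suspect_pool`, the 10 second cutoff, and the pool and URL names are hypothetical.

```python
from collections import Counter

def suspect_pool(requests, min_elapsed_ms=10_000):
    """Return (pool, urls) for the pool with the most long-running requests.

    requests: list of (app_pool, url, elapsed_ms) for in-flight requests.
    Returns (None, []) when nothing has been running that long.
    """
    stuck = [r for r in requests if r[2] >= min_elapsed_ms]
    if not stuck:
        return None, []
    pool, _ = Counter(r[0] for r in stuck).most_common(1)[0]
    urls = sorted({r[1] for r in stuck if r[0] == pool})
    return pool, urls

# Hypothetical snapshot of in-flight requests.
snapshot = [
    ("BillingPool", "/billing/report", 48_000),
    ("BillingPool", "/billing/report", 52_000),
    ("CatalogPool", "/catalog/list", 120),
    ("BillingPool", "/billing/export", 31_000),
]
pool, urls = suspect_pool(snapshot)
```

The returned URLs are the candidates to look up in the application logs and, if possible, to reproduce in isolation.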

Q4 - App Dependencies?

---
title: App Dependencies?
---

graph TB
  identify_deps[Identify App Dependencies] --> other_svc{Calls to Another Service?}
  identify_deps -- optionally --> profileapp[Profile App]
  profileapp --> other_svc
  other_svc -- yes --> seeq3[See Q3 Above]
  other_svc -- no --> database[Database]
  database --> res_issues{Resource Issue?}
  res_issues -- yes --> seeq1[see Q1 above]
  seeq1 --> fixed1{Fixed?}
  fixed1 -- yes --> iac[IaC the change]
  fixed1 -- no --> seeq2[see Q2 above]
  seeq2 --> fixed2{Fixed?}
  fixed2 -- yes --> iac
  fixed2 -- no --> analyze_db[Analyze DB long running queries and locks]
  res_issues -- no --> analyze_db
  analyze_db --> devs[Work with Devs]

Interesting Tools

Windows:
- May consider Network Monitor Agent on Windows (at least from Server 2003...)
- SQL Activity Monitor
- SQL Query Store
- SQL Live Query Statistics (LQS)
- SQL Perf Dashboard
- SQL Replay (replay request ability is a fantastic reproducibility technique, and should be applied to backend services and APIs as well)
- Perf Counters
- Extended Events
- DebugDiag2
- dotnet-dump
- dotnet-gcdump
- dotnet-trace
- there are many dotnet cli global tools one may use

Short Term Recommendations

Add more IIS Servers, treat them as cattle and not as pets. Two servers seem like they might be pets. If you named the servers, they're Pets.
Look at server pools, and add more.
- Server pools WILL need to be constrained. Each app must be a good citizen. No single app NOR pool, should consume all resources.
- If on physical hardware, consider pinning to specific cores, and don't span sockets; otherwise CPU cache contents must be duplicated across sockets, lowering performance.
- Can't do it at the IIS level? Try resource pooling with VMware.
Scale Servers OUT, and start to think about scaling each app independently. Think about 5 IIS servers, and putting for example one app across only 3 of them. Kubernetes makes this much, much easier (which is why I recommend it...).
Assume more issues will arise, and put metrics in place right away.
- Predictive alerts should be put in place that can fire much earlier for the specific issue that was seen, especially if it is likely to recur.
From day 1, some things need to be done via IaC; one must start to eliminate the possibility of human error.
Documentation should be updated right away for personnel to quickly identify Apps and each Application's dependencies.
SOPs should immediately be re-evaluated
App and Infra docs should immediately be re-evaluated
Remediation process should be put in place and evaluated for effectiveness
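A predictive alert, as recommended above, can be as simple as extrapolating a resource trend and alerting when exhaustion is close. The sketch below fits a least-squares line to recent samples and estimates the time left until a limit is reached; the function name `seconds_until_limit`, the sample values, and the limit of 92% are assumptions for illustration.

```python
def seconds_until_limit(samples, limit):
    """Estimate seconds until `limit` is reached, or None if not trending up.

    samples: list of (timestamp_seconds, value), e.g. memory percent over time.
    Uses a least-squares slope over the samples and extrapolates from the
    most recent value.
    """
    if len(samples) < 2:
        return None
    mean_t = sum(t for t, _ in samples) / len(samples)
    mean_v = sum(v for _, v in samples) / len(samples)
    spread = sum((t - mean_t) ** 2 for t, _ in samples)
    if spread == 0:
        return None
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / spread
    if slope <= 0:
        return None
    last_value = samples[-1][1]
    return max(0.0, (limit - last_value) / slope)

# Hypothetical samples: memory grows 6 points per minute.
remaining = seconds_until_limit([(0, 50), (60, 56), (120, 62)], limit=92)
```

Alerting when the estimate drops below, say, the time it takes to drain a server from the load balancer gives the on-call engineer a chance to act before requests start timing out.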

Long Term Recommendations

Designing a no-expense-spared architecture requires a deep understanding of your specific needs, goals, and constraints, which can only be achieved through further dialogue and engagement.

Every client faces unique challenges, both in terms of technology and personnel. While I can offer general guidance and best practices, a tailored solution that meets your organization's needs requires a more thorough assessment and collaboration.

Building core competency and putting processes and systems in place takes time, especially when people are involved.

If you're interested in exploring professional services further and discussing a more detailed strategy, I invite you to initiate a dialogue. Together, we can work towards developing a comprehensive plan that aligns with your goals and objectives.

Non Technical Challenges

Below is a list of some non technical challenges one may face:

Alignment across business and development on Performance Metrics - Establishing Service Level Agreements (SLAs), Service Level Objectives (SLOs), and Service Level Indicators (SLIs) can lead to very challenging discussions with product owners. Questions may arise regarding the comfort level of stakeholders with SRE teams enforcing these metrics and holding teams accountable.
Financial Commitments - Convincing business stakeholders to invest in additional resources, such as personnel and cloud services, can be a challenging task.
Education - Promoting the adoption of SRE principles across the organization requires extensive education and awareness-building. This includes educating business leaders on the value proposition and implications of SRE practices.
Cross Team Collaboration - Development managers may initially perceive new processes introduced by SRE teams as hindrances to their own workflows. Building a high degree of collaboration between development and SRE teams is essential for overcoming resistance and fostering a constant culture of cooperation.
Skill Enhancement and Process Refinement - As SRE practices are slowly integrated into the organization, team composition, skills, and Standard Operating Procedures will need to be continuously evaluated and adjusted. This ensures alignment with evolving business objectives and technological advancements.

JasonGiedymin/case_study.md