InIt-Ref

APM (Application Performance Management) + ARM (Application Resource Management)
https://sre.google/books/
https://www.youtube.com/playlist?list=PLIivdWyY5sqLOiLXJDlN-wKd0g7hf_9vC
https://www.youtube.com/watch?v=OnK4IKgLl24
https://www.youtube.com/watch?v=3EEZmSwMXp8
https://www.dynatrace.com/news/tag/sre/
https://video.dynatrace.com/watch/UDw5uqrt1xSigePvtceqAf?
https://www.dynatrace.com/trial/resources/
https://www.youtube.com/playlist?list=PLqt2rd0eew1arEMzMM_tCZzF0JwgANaFt
https://www.dynatrace.com/support/help/how-to-use-dynatrace/

InIt-Text

- Metric: latency, call count, erroneous calls, error rate
- Aggregation: sum, mean, min, max, percentiles (25th, 50th, 75th, 90th, 95th, 98th, 99th)
- Threshold unit: ms (latency), count (calls), % (error rate)
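
The metric, aggregation, and threshold items above can be sketched in a few lines of Python. This is only an illustration, not how any particular APM tool computes its values: the function names, the sample latencies, and the threshold values are all made up for the example, and the percentile uses the simple nearest-rank method.

```python
import math
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(latencies_ms, errors):
    """Aggregate one latency sample per call plus a count of erroneous calls."""
    call_count = len(latencies_ms)
    return {
        "call_count": call_count,
        "erroneous_calls": errors,
        "error_rate_pct": 100 * errors / call_count,
        "sum_ms": sum(latencies_ms),
        "mean_ms": mean(latencies_ms),
        "min_ms": min(latencies_ms),
        "max_ms": max(latencies_ms),
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
    }


def breaches(summary, thresholds):
    """Return the names of aggregated metrics that exceed their threshold."""
    return sorted(name for name, limit in thresholds.items() if summary[name] > limit)


# Hypothetical sample: ten calls, one of which failed.
sample = [12, 15, 11, 240, 18, 14, 13, 16, 900, 17]
stats = summarize(sample, errors=1)

# Hypothetical thresholds in ms, %, and count.
alerts = breaches(stats, {"p95_ms": 500, "error_rate_pct": 5, "call_count": 1000})
print(stats)
print(alerts)
```

The mean of this sample is 125.6 ms while the median is 15 ms, which is why percentile aggregations are usually preferred over the mean for latency thresholds.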

InIt-Notes

SRE focuses on improving software system reliability across key categories including availability, performance, latency, efficiency, capacity, and incident response.
Reliability is measured with service-level indicators (SLIs) and targeted with service-level objectives (SLOs).
Uptime: "five nines" or 99.999% allows just over five minutes (about 5.26) of downtime per year.
Uptime: "four nines" or 99.99% allows nearly an hour (about 52.6 minutes) of downtime per year.
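
The downtime figures for the two uptime levels follow from simple arithmetic. A minimal sketch, assuming a 365.25-day year (the function name is made up for this note):

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes, assuming a 365.25-day year


def allowed_downtime_minutes(availability_pct, period_minutes=MINUTES_PER_YEAR):
    """Downtime allowed in a period for a given availability percentage."""
    return period_minutes * (1 - availability_pct / 100)


five_nines = allowed_downtime_minutes(99.999)
four_nines = allowed_downtime_minutes(99.99)

print(f"99.999%: {five_nines:.2f} min/year")
print(f"99.99%:  {four_nines:.2f} min/year")
```

Each extra nine divides the allowed downtime by ten.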
Dynatrace is both an application performance monitoring and an application management tool. It can be used as a cloud-based SaaS offering or installed on-premises.
APM: application performance management
ELK Stack: an acronym for three open source projects: Elasticsearch, Logstash, and Kibana
ELK Stack/Elastic & New Relic & Datadog & Dynatrace
Azure, Terraform, Ansible, concourse-ci, Elasticsearch/Kibana, Dynatrace, Prometheus, Graylog, StoreBox
NEW-Work: AWS, Azure, Concourse, Jenkins, Aurora DB, Dynatrace, New Relic, Elasticsearch, Kibana

InIt-Youtube

https://www.youtube.com/watch?v=X9r0sjBWdlA
https://www.dynatrace.com/news/blog/openstack-monitoring-beyond-the-elastic-stack-part-2/
https://www.youtube.com/watch?v=C9Sm0pmQLC0 (Turbonomic)
https://www.youtube.com/watch?v=MjehIjs8ilY (Instana & Turbonomic)

SRE-Google

https://cloud.google.com/blog/products/devops-sre/sre-fundamentals-sli-vs-slo-vs-sla

InIt-Definitions

Source: https://www.leanix.net/en/wiki/vsm/site-reliability-engineering-sre

SREs monitor systems in production and analyze their performance to identify areas for improvement.
SREs use their observations to calculate the potential cost of outages and to plan for contingencies.
SREs usually split their time between operations and the development of systems and software.
SREs spend time building and deploying services that optimize the workflow for IT and support departments.
SREs use SLAs, SLIs, and SLOs to determine which new features can be implemented and when.
These are Service Level Agreements (SLAs), Service Level Indicators (SLIs), and Service Level Objectives (SLOs).

Observability

- Comprehensive Log Collection.
- Comprehensive Metric Collection.
- Comprehensive Tracing Collection.
- Comprehensive Dependency Collection.
- Comprehensive Relating of Logs, Metrics, Dependencies.
- Automated and Instant Instrumentation.
- High Cardinality Analytics.
- Dependency Map and AI Based Root Cause.
- Automated Problem Resolution.

Definitions

Source: https://cloud.google.com/blog/products/devops-sre/sre-fundamentals-sli-vs-slo-vs-sla

Service-Level Objective (SLO)

SRE begins with the idea that a prerequisite to success is availability. A system that is unavailable cannot perform its function and will fail by default. Availability, in SRE terms, defines whether a system is able to fulfill its intended function at a point in time. In addition to being used as a reporting tool, the historical availability measurement can also describe the probability that your system will perform as expected in the future.

Service-Level Agreement (SLA)

An SLA normally involves a promise to someone using your service that its availability SLO should meet a certain level over a certain period, and if it fails to do so then some kind of penalty will be paid. This might be a partial refund of the service subscription fee paid by customers for that period, or additional subscription time added for free. The concept is that going out of SLO is going to hurt the service team, so they will push hard to stay within SLO. If you’re charging your customers money, you will probably need an SLA.

Service-Level Indicator (SLI)

A Service-Level Indicator (SLI) is a direct measurement of a service’s behavior, for example the frequency of successful probes of a system. When we evaluate whether our system has been running within SLO for the past week, we look at the SLI to get the service availability percentage. If it goes below the specified SLO, we have a problem and may need to make the system more available in some way, such as running a second instance of the service in a different city and load-balancing between the two. If you want to know how reliable your service is, you must be able to measure the rates of successful and unsuccessful queries as your SLIs.
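
The weekly check described above can be sketched as follows. This is a hedged illustration, not the method from the Google article: the function names, the one-probe-per-minute schedule, and the 99.9% SLO are assumptions chosen for the example.

```python
def availability_sli(successes, total):
    """Availability SLI as the percentage of successful probes."""
    if total == 0:
        raise ValueError("no probes recorded")
    return 100 * successes / total


def within_slo(sli_pct, slo_pct):
    """True when the measured SLI meets or exceeds the SLO target."""
    return sli_pct >= slo_pct


# Hypothetical week: one probe per minute gives 7 * 24 * 60 = 10,080 probes.
PROBES_PER_WEEK = 7 * 24 * 60
SLO_PCT = 99.9  # assumed target, not taken from the source

good_week = availability_sli(PROBES_PER_WEEK - 10, PROBES_PER_WEEK)  # 10 failed probes
bad_week = availability_sli(PROBES_PER_WEEK - 12, PROBES_PER_WEEK)   # 12 failed probes

print(f"{good_week:.3f}% within SLO: {within_slo(good_week, SLO_PCT)}")
print(f"{bad_week:.3f}% within SLO: {within_slo(bad_week, SLO_PCT)}")
```

With these assumed numbers, ten failed probes in the week stay within the SLO and twelve fall below it, which shows how narrow the margin is at 99.9%.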

Dynatrace

SRE Toolchain

https://www.dynatrace.com/news/blog/sre-vs-devops/

Containers for Microservices

Docker
Kubernetes
Swarm
Apache Mesos
Podman

Source Control Tools

Git

CI/CD Tools

Jenkins
CircleCI
GitLab
GoCD
Semaphore
Concourse: https://concourse-ci.org/

Data Storage Tools

MySQL
PostgreSQL
MongoDB
Apache Hadoop
Apache Hive
Amazon Aurora (MySQL and PostgreSQL-compatible)
MariaDB (fork from MySQL)

Configuration Management Tools

Ansible
Chef
Puppet
SaltStack

Metrics Collection Tools

Prometheus
Stackdriver (Google Cloud Operations)
InfluxDB
Sensu Go

Log Aggregation Tools

Fluentd
Sentry
Logstash

Distributed Tracing Tools

OpenTelemetry
Jaeger

Application Performance Monitoring Tools

AppDynamics
New Relic
Dynatrace

Dashboarding Tools

Grafana
Stashboard
Redash
Metabase

Incident Management

PagerDuty
Opsgenie
Squadcast

IT-SDK-SRE

Contents