IT-SDK-SRE

From wiki.samerhijazi.net
Revision as of 18:52, 2 December 2022 by Samerhijazi (talk | contribs) (InIt-Notes)

InIt-Ref

INIT-Text

- Metric: Latency, Call count, Erroneous calls, Error rate
- Aggregation: sum, mean, min, max, percentiles (25th, 50th, 75th, 90th, 95th, 98th, 99th)
- Threshold: ms, count, %
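The metric, aggregation, and threshold triple above can be sketched in a few lines. This is a minimal illustration, not how any particular monitoring tool computes its values: the function names `percentile` and `summarize` are hypothetical, and the nearest-rank percentile definition is an assumption (tools often interpolate instead).

```python
import math
from statistics import mean


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize(latencies_ms, threshold_ms):
    """Aggregate latency samples and flag whether the 99th percentile exceeds the threshold."""
    p99 = percentile(latencies_ms, 99)
    return {
        "sum": sum(latencies_ms),
        "mean": mean(latencies_ms),
        "min": min(latencies_ms),
        "max": max(latencies_ms),
        "p50": percentile(latencies_ms, 50),
        "p90": percentile(latencies_ms, 90),
        "p99": p99,
        "p99_breach": p99 > threshold_ms,
    }


if __name__ == "__main__":
    # Hypothetical latency samples in milliseconds.
    print(summarize([120, 80, 95, 300, 110], threshold_ms=250))
```

The same pattern applies to the other metrics: a call count would be aggregated with a sum and compared against a count threshold, and an error rate against a percentage threshold.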

InIt-Notes

  • SRE focuses on improving software system reliability across key categories including availability, performance, latency, efficiency, capacity, and incident response.
  • service-level indicators (SLIs) and service-level objectives (SLOs)
  • Uptime: "five nines" or 99.999%, about 5.26 minutes of downtime per year.
  • Uptime: "four nines" or 99.99%, about 52.6 minutes (nearly an hour) of downtime per year.
  • Dynatrace is both an application performance monitoring and an application management tool. It can be used as a cloud-based SaaS offering or installed on-premises.
  • APM: application performance management
  • ELK Stack: an acronym for three open source projects: Elasticsearch, Logstash, and Kibana
  • ELK Stack/Elastic & New Relic & Datadog & Dynatrace
  • Azure, Terraform, Ansible, concourse-ci, Elasticsearch/Kibana, Dynatrace, Prometheus, Graylog, StoreBox
  • NEW-Work: AWS, Azure, concourse, Jenkins, Aurora DB, Dynatrace, New Relic, ElasticSearch, Kibana
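The "five nines" and "four nines" downtime figures in the notes above follow directly from the availability percentage. A minimal sketch, assuming a 365-day year; the function name `allowed_downtime_minutes` is hypothetical:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def allowed_downtime_minutes(availability_percent):
    """Return the downtime per year (in minutes) permitted by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes/year")
```

Each additional nine divides the permitted downtime by ten, which is why moving from four nines to five nines shrinks the yearly allowance from nearly an hour to a little over five minutes.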

InIt-Youtube

SRE-Google

InIt-Definitions

Source: https://www.leanix.net/en/wiki/vsm/site-reliability-engineering-sre

  • SREs monitor systems in production and analyze their performance to detect areas for improvement.
  • Their observations help them calculate the potential cost of outages and plan for contingencies.
  • SREs usually split their time between operations and the development of systems and software.
  • SREs spend time building and deploying services that optimize the workflow for IT and support departments.
  • SREs determine which new features can be implemented, and when, with the help of SLAs, SLIs, and SLOs.
  • Service Level Agreements (SLAs), Service Level Indicators (SLIs), and Service Level Objectives (SLOs).

Observability

- Comprehensive Log Collection.
- Comprehensive Metric Collection.
- Comprehensive Tracing Collection.
- Comprehensive Dependency Collection.
- Comprehensive Relating of Logs, Metrics, Dependencies.
- Automated and Instant Instrumentation.
- High Cardinality Analytics.
- Dependency Map and AI Based Root Cause.
- Automated Problem Resolution.

Definitions

Service-Level Objective (SLO)

SRE begins with the idea that a prerequisite to success is availability. A system that is unavailable cannot perform its function and will fail by default. Availability, in SRE terms, defines whether a system is able to fulfill its intended function at a point in time. In addition to being used as a reporting tool, the historical availability measurement can also describe the probability that your system will perform as expected in the future.

Service-Level Agreement (SLA)

An SLA normally involves a promise to someone using your service that its availability SLO should meet a certain level over a certain period, and if it fails to do so then some kind of penalty will be paid. This might be a partial refund of the service subscription fee paid by customers for that period, or additional subscription time added for free. The concept is that going out of SLO is going to hurt the service team, so they will push hard to stay within SLO. If you’re charging your customers money, you will probably need an SLA.
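The SLA mechanism described here (a penalty when availability falls below the agreed level) can be sketched as a simple check. This is an illustration only: the function name `sla_refund` and the 10% refund fraction are hypothetical, and real agreements define their own terms and tiers.

```python
def sla_refund(measured_availability, slo_target, subscription_fee, refund_fraction=0.10):
    """Return the refund owed for a period, or 0.0 if the availability SLO was met.

    refund_fraction is a made-up example value, not taken from any real SLA.
    """
    if measured_availability >= slo_target:
        return 0.0
    return subscription_fee * refund_fraction


if __name__ == "__main__":
    # SLO met: no penalty.
    print(sla_refund(99.95, 99.9, subscription_fee=200.0))
    # SLO missed: a partial refund of the subscription fee is due.
    print(sla_refund(99.5, 99.9, subscription_fee=200.0))
```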

Service-Level Indicator (SLI)

A Service-Level Indicator (SLI) is a measurement of a service’s behavior, for example the frequency of successful probes of a system. When we evaluate whether our system has been running within SLO for the past week, we look at the SLI to get the service availability percentage. If it goes below the specified SLO, we have a problem and may need to make the system more available in some way, such as running a second instance of the service in a different city and load-balancing between the two. If you want to know how reliable your service is, you must be able to measure the rates of successful and unsuccessful queries as your SLIs.
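As a rough sketch of the evaluation described above, an availability SLI can be computed from probe counts and compared against the SLO. The function names and the probe numbers are hypothetical:

```python
def availability_sli(successful_probes, total_probes):
    """Availability SLI: percentage of probes that succeeded in the evaluation window."""
    if total_probes <= 0:
        raise ValueError("total_probes must be positive")
    if not 0 <= successful_probes <= total_probes:
        raise ValueError("successful_probes must be between 0 and total_probes")
    return 100.0 * successful_probes / total_probes


def within_slo(sli_percent, slo_percent):
    """True if the measured SLI meets or exceeds the SLO target."""
    return sli_percent >= slo_percent


if __name__ == "__main__":
    # Hypothetical week: 10,000 probes, 15 of which failed, against a 99.9% SLO.
    sli = availability_sli(successful_probes=9985, total_probes=10000)
    print(f"SLI = {sli:.2f}%, within SLO: {within_slo(sli, 99.9)}")
```

In this example the service is out of SLO for the week, which is the situation where the text suggests adding capacity such as a second, load-balanced instance.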

Dynatrace

SRE Toolchain

Containers for Microservices

  • Docker
  • Kubernetes
  • Swarm
  • Apache Mesos
  • Podman

Source Control Tools

  • Git

CI/CD Tools

Data Storage Tools

  • MySQL
  • PostgreSQL
  • MongoDB
  • Apache Hadoop
  • Apache Hive
  • Amazon Aurora (MySQL and PostgreSQL-compatible)
  • MariaDB (fork from MySQL)

Configuration Management Tools

  • Ansible
  • Chef
  • Puppet
  • SaltStack

Metrics Collection Tools

  • Prometheus
  • Stackdriver (Google Cloud Operations)
  • InfluxDB
  • Sensu Go

Log Aggregation Tools

  • Fluentd
  • Sentry
  • Logstash

Distributed Tracing Tools

  • OpenTelemetry
  • Jaeger

Application Performance Monitoring Tools

  • AppDynamics
  • New Relic
  • Dynatrace

Dashboarding Tools

  • Grafana
  • Stashboard
  • Redash
  • Metabase

Incident Management

  • PagerDuty
  • Opsgenie
  • Squadcast