IT-SDK-SRE

From wiki.samerhijazi.net
Revision as of 14:45, 5 December 2022 by Studying (talk | contribs) (InIt-Ref)

init-Ref

INIT-Text

- Metrics: latency, call count, erroneous calls, error rate
- Aggregation: sum, mean, min, max, and the 25th, 50th, 75th, 90th, 95th, 98th, and 99th percentiles
- Threshold: The pass/fail criterion (expressed as a time, a count, or a percentage) that you define for your test metrics.
- Latency: The time taken for a packet to be transferred across a network. You can measure this as one-way to its destination or as a round trip.
- Throughput: The quantity of data being sent and received within a unit of time.
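
The metrics, aggregations, and thresholds listed above can be sketched in a few lines of Python. This is only an illustration: the sample data, the threshold values, and the helper name `percentile` are assumptions made for the example, and the nearest-rank method used here is just one of several common percentile definitions.

```python
import statistics

# Hypothetical samples, for illustration only.
latencies_ms = [12, 15, 11, 14, 250, 13, 16, 12, 18, 14]
call_succeeded = [True, True, True, True, False, True, True, True, True, True]


def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # ceiling of p% of n, in integer math
    return ordered[max(rank, 1) - 1]


# Metrics: call count, erroneous calls, error rate.
call_count = len(call_succeeded)
erroneous_calls = call_succeeded.count(False)
error_rate = erroneous_calls / call_count * 100  # percent

# Aggregations over the latency metric.
summary = {
    "sum": sum(latencies_ms),
    "mean": statistics.mean(latencies_ms),
    "min": min(latencies_ms),
    "max": max(latencies_ms),
    "p50": percentile(latencies_ms, 50),
    "p90": percentile(latencies_ms, 90),
    "p99": percentile(latencies_ms, 99),
}

# Thresholds: hypothetical pass/fail criteria (a time and a percentage).
checks = {
    "p90 latency <= 100 ms": summary["p90"] <= 100,
    "error rate <= 5 %": error_rate <= 5,
}

for name, passed in checks.items():
    print(f"{name}: {'pass' if passed else 'fail'}")
```

With this sample data, a single 250 ms outlier pulls the mean up to 37.5 ms while the 50th and 90th percentiles stay at 14 ms and 18 ms, which is one reason percentiles are usually preferred over the mean for latency thresholds.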

InIt-Notes

  • SRE focuses on improving software system reliability across key categories including availability, performance, latency, efficiency, capacity, and incident response.
  • service-level indicators (SLIs) and service-level objectives (SLOs)
  • Uptime: "five nines" or 99.999%, just over five minutes of downtime per year.
  • Uptime: "four nines" or 99.99%, about 53 minutes (nearly an hour) of downtime per year.
  • Dynatrace is both an application performance monitoring and application management tool. It can be used as a cloud-based SaaS offering or installed on-premises.
  • APM: application performance management
  • ELK Stack: an acronym for three open-source projects: Elasticsearch, Logstash, and Kibana
  • ELK Stack/Elastic & New Relic & Datadog & Dynatrace
  • Azure, Terraform, Ansible, concourse-ci, Elasticsearch/Kibana, Dynatrace, Prometheus, Graylog, StoreBox
  • NEW-Work: AWS, Azure, concourse, Jenkins, Aurora DB, Dynatrace, New Relic, Elasticsearch, Kibana
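
The downtime figures quoted for the "nines" in the notes above follow from simple arithmetic. A minimal sketch, assuming a 365-day year; the function name `allowed_downtime_minutes` is made up for this example:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent):
    """Minutes of downtime per year permitted at a given uptime percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for label, pct in [("four nines", 99.99), ("five nines", 99.999)]:
    print(f"{label} ({pct}%): {allowed_downtime_minutes(pct):.2f} min/year")
```

Each additional nine divides the allowed downtime by ten.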

InIt-Youtube

SRE-Google

InIt-Definitions

Source: https://www.leanix.net/en/wiki/vsm/site-reliability-engineering-sre

  • SREs monitor systems in production and analyze their performance to identify areas for improvement.
  • SREs' observations help them calculate the potential cost of outages and plan for contingencies.
  • SREs usually split their time between operations and the development of systems and software.
  • SREs spend time building and deploying services that optimize the workflow for IT and support departments.
  • SREs determine which new features can be implemented, and when, with the help of SLAs, SLIs, and SLOs.
  • Service Level Agreements (SLAs), Service Level Indicators (SLIs), and Service Level Objectives (SLOs).

Monitoring & Observability

Monitoring is tooling or a technical solution that allows teams to watch and understand the state of their systems. Monitoring is based on gathering predefined sets of metrics or logs.
Observability is tooling or a technical solution that allows teams to actively debug their system. Observability is based on exploring properties and patterns not defined in advance.
Monitoring is the process of using pre-configured telemetry data with dashboards and alerts to understand your application's health and performance.
Observability is the ability to understand the inner state of your evolving systems by analyzing all available outputs in real time.

Observability

- Comprehensive Log Collection.
- Comprehensive Metric Collection.
- Comprehensive Tracing Collection.
- Comprehensive Dependency Collection.
- Comprehensive Relating of Logs, Metrics, Dependencies.
- Automated and Instant Instrumentation.
- High Cardinality Analytics.
- Dependency Map and AI Based Root Cause.
- Automated Problem Resolution.

Definitions

Service-Level Objective (SLO)

SRE begins with the idea that a prerequisite to success is availability. A system that is unavailable cannot perform its function and will fail by default. Availability, in SRE terms, defines whether a system is able to fulfill its intended function at a point in time. In addition to being used as a reporting tool, the historical availability measurement can also describe the probability that your system will perform as expected in the future.

Service-Level Agreement (SLA)

An SLA normally involves a promise to someone using your service that its availability SLO should meet a certain level over a certain period, and if it fails to do so then some kind of penalty will be paid. This might be a partial refund of the service subscription fee paid by customers for that period, or additional subscription time added for free. The concept is that going out of SLO is going to hurt the service team, so they will push hard to stay within SLO. If you’re charging your customers money, you will probably need an SLA.

Service-Level Indicator (SLI)

A Service-Level Indicator (SLI) is a measurement of a service’s behavior, for example the frequency of successful probes of a system. When we evaluate whether our system has been running within SLO for the past week, we look at the SLI to get the service availability percentage. If it goes below the specified SLO, we have a problem and may need to make the system more available in some way, such as running a second instance of the service in a different city and load-balancing between the two. If you want to know how reliable your service is, you must be able to measure the rates of successful and unsuccessful queries as your SLIs.
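
As a rough illustration of the check described above, the following sketch derives an availability SLI from probe counts and compares it with an SLO. The probe counts, the 99.9% target, and the function name `availability_sli` are assumptions made for this example, not values from any particular service.

```python
def availability_sli(successful_probes, total_probes):
    """Availability SLI: percentage of probes that succeeded."""
    if total_probes <= 0:
        raise ValueError("total_probes must be positive")
    return successful_probes / total_probes * 100


SLO_PERCENT = 99.9  # hypothetical availability target

# Hypothetical week of one probe per minute: 7 * 24 * 60 = 10,080 probes.
total_probes = 7 * 24 * 60

good_week = availability_sli(10_070, total_probes)  # 10 failed probes
bad_week = availability_sli(10_060, total_probes)   # 20 failed probes

for name, sli in [("good week", good_week), ("bad week", bad_week)]:
    status = "within SLO" if sli >= SLO_PERCENT else "out of SLO"
    print(f"{name}: SLI = {sli:.3f}% -> {status}")
```

The same comparison works for request-based SLIs by substituting counts of successful and total queries for the probe counts.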

Dynatrace

SRE Toolchain

Containers for Microservices

  • Docker
  • Kubernetes
  • Swarm
  • Apache Mesos
  • Podman

Source Control Tools

  • Git

CI/CD Tools

Data Storage Tools

  • MySQL
  • PostgreSQL
  • MongoDB
  • Apache Hadoop
  • Apache Hive
  • Amazon Aurora (MySQL and PostgreSQL-compatible)
  • MariaDB (fork from MySQL)

Configuration Management Tools

  • Ansible
  • Chef
  • Puppet
  • SaltStack

Metrics Collection Tools

  • Prometheus
  • Stackdriver (Google Cloud Operations)
  • InfluxDB
  • Sensu Go

Log Aggregation Tools

  • Fluentd
  • Sentry
  • Logstash

Distributed Tracing Tools

  • OpenTelemetry
  • Jaeger

Application Performance Monitoring Tools

  • AppDynamics
  • New Relic
  • Dynatrace

Dashboarding Tools

  • Grafana
  • Stashboard
  • Redash
  • Metabase

Incident Management

  • PagerDuty
  • Opsgenie
  • Squadcast