IT-SDK-SRE
Contents
- 1 init-Ref
- 2 INIT-Text
- 3 InIt-Notes
- 4 InIt-Youtube
- 5 SRE-Google
- 6 InIt-Definitions
- 7 Monitoring & Observability
- 8 Definitions
- 9 Dynatrace
- 10 SRE Toolchain
- 10.1 Containers for Microservices
- 10.2 Source Control Tools
- 10.3 CI/CD Tools
- 10.4 Data Storage Tools
- 10.5 Configuration Management Tools
- 10.6 Metrics Collection Tools
- 10.7 Log Aggregation Tools
- 10.8 Distributed Tracing Tools
- 10.9 Application Performance Monitoring Tools
- 10.10 Dashboarding Tools
- 10.11 Incident Management
init-Ref
- APM (Application Performance Management) + ARM (Application Resource Management)
- https://www.dynatrace.com/news/blog/what-are-slos/
- https://sre.google/books/
- https://www.youtube.com/playlist?list=PLIivdWyY5sqLOiLXJDlN-wKd0g7hf_9vC
- https://www.youtube.com/watch?v=OnK4IKgLl24
- https://www.youtube.com/watch?v=3EEZmSwMXp8
- https://www.dynatrace.com/news/tag/sre/
- https://video.dynatrace.com/watch/UDw5uqrt1xSigePvtceqAf?
- https://www.dynatrace.com/trial/resources/
- https://www.youtube.com/playlist?list=PLqt2rd0eew1arEMzMM_tCZzF0JwgANaFt
- https://www.dynatrace.com/support/help/how-to-use-dynatrace/
INIT-Text
- Metrics: latency, call count, erroneous calls, error rate.
- Aggregation: sum, mean, min, max, and the 25th, 50th, 75th, 90th, 95th, 98th, and 99th percentiles.
- Threshold: the pass/fail criterion (expressed as a time, a count, or a percentage) that you define for your test metrics.
- Latency: the time taken for a packet to travel across a network. You can measure it one-way to its destination or as a round trip.
- Throughput: the quantity of data sent and received per unit of time.
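The aggregations and the threshold check listed above can be sketched in a few lines of Python. This is only an illustration: the function names, the sample latencies, and the 200 ms and 5% limits are made up for the example, and the percentile uses the simple nearest-rank method, which may differ slightly from what a given monitoring tool computes.

```python
from statistics import mean

PERCENTILES = (25, 50, 75, 90, 95, 98, 99)


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= p <= 100:
        raise ValueError("p must be an integer between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, which avoids floating-point rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


def aggregate(samples):
    """Return sum, mean, min, max and the listed percentiles for one metric."""
    result = {
        "sum": sum(samples),
        "mean": mean(samples),
        "min": min(samples),
        "max": max(samples),
    }
    for p in PERCENTILES:
        result[f"p{p}"] = percentile(samples, p)
    return result


def passes_threshold(value, limit):
    """Pass/fail criterion: the measured value must not exceed the limit."""
    return value <= limit


# Hypothetical measurements for one test run (latency in milliseconds).
latencies_ms = [12, 15, 11, 240, 14, 13, 18, 16, 17, 19]
call_count = len(latencies_ms)
erroneous_calls = 1

stats = aggregate(latencies_ms)
error_rate_percent = 100 * erroneous_calls / call_count

print(stats)
print("p90 within 200 ms:", passes_threshold(stats["p90"], 200))
print("error rate within 5%:", passes_threshold(error_rate_percent, 5))
```

The single 240 ms outlier pulls the mean well above the median, which is why thresholds are often set on a high percentile rather than on the mean.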
InIt-Notes
- SRE focuses on improving software system reliability across key categories including availability, performance, latency, efficiency, capacity, and incident response.
- service-level indicators (SLIs) and service-level objectives (SLOs)
- Uptime: "five nines" or 99.999%, just over five minutes of downtime per year.
- Uptime: "four nines" or 99.99%, nearly an hour (about 53 minutes) of downtime per year.
- Dynatrace is both an application performance monitoring tool and an application management tool. It is available as a cloud-based SaaS offering or can be installed on-premises.
- APM: application performance management
- ELK Stack: an acronym for three open source projects: Elasticsearch, Logstash, and Kibana.
- ELK Stack/Elastic & New Relic & Datadog & Dynatrace
- Azure, Terraform, Ansible, concourse-ci, Elasticsearch/Kibana, Dynatrace, Prometheus, Graylog, StoreBox
- NEW-Work: AWS, Azure, Concourse, Jenkins, Aurora DB, Dynatrace, New Relic, Elasticsearch, Kibana
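The "nines" figures in the uptime notes above can be checked with a little arithmetic. The sketch below is only an illustration: the function name is made up, and it assumes an average year of 365.25 days.

```python
# Assumption: an average year of 365.25 days.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def allowed_downtime_minutes(availability_percent, period_minutes=MINUTES_PER_YEAR):
    """Downtime that still meets the given availability over the period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return period_minutes * (100 - availability_percent) / 100


for label, target in (("four nines", 99.99), ("five nines", 99.999)):
    minutes = allowed_downtime_minutes(target)
    print(f"{label} ({target}%): {minutes:.2f} minutes of downtime per year")
```

Five nines comes out at about 5.26 minutes per year and four nines at about 52.6 minutes, matching the rounded figures in the notes.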
InIt-Youtube
- https://www.youtube.com/watch?v=X9r0sjBWdlA
- https://www.dynatrace.com/news/blog/openstack-monitoring-beyond-the-elastic-stack-part-2/
- https://www.youtube.com/watch?v=C9Sm0pmQLC0 (Turbonomic)
- https://www.youtube.com/watch?v=MjehIjs8ilY (Instana & Turbonomic)
SRE-Google
InIt-Definitions
Source: https://www.leanix.net/en/wiki/vsm/site-reliability-engineering-sre
- SREs monitor systems in production and analyze their performance to identify areas for improvement.
- Their observations help them calculate the potential cost of outages and plan for contingencies.
- SREs usually split their time between operations and the development of systems and software.
- SREs spend time building and deploying services that optimize workflows for IT and support departments.
- SREs use SLAs, SLIs, and SLOs to determine which new features can be implemented and when.
- Service Level Agreements (SLAs), Service Level Indicators (SLIs), and Service Level Objectives (SLOs).
Monitoring & Observability
- https://www.instana.com/blog/observability-vs-monitoring/
- https://cloud.google.com/architecture/devops/devops-measurement-monitoring-and-observability#:~:text=Monitoring%20is%20based%20on%20gathering,patterns%20not%20defined%20in%20advance.
Monitoring is tooling or a technical solution that allows teams to watch and understand the state of their systems. Monitoring is based on gathering predefined sets of metrics or logs. Observability is tooling or a technical solution that allows teams to actively debug their system. Observability is based on exploring properties and patterns not defined in advance.
Monitoring is the process of using pre-configured telemetry data with dashboards and alerts to understand your application's health and performance. Observability is the ability to understand the inner state of your evolving systems by analyzing all available outputs in real time.
Observability
- https://www.instana.com/media/securepdfs/Ranking-the-Observability-Offerings-APM-Experts.pdf
- https://play-with.instana.io/#/home
- Comprehensive Log Collection.
- Comprehensive Metric Collection.
- Comprehensive Tracing Collection.
- Comprehensive Dependency Collection.
- Comprehensive Relating of Logs, Metrics, Dependencies.
- Automated and Instant Instrumentation.
- High Cardinality Analytics.
- Dependency Map and AI Based Root Cause.
- Automated Problem Resolution.
Definitions
Service-Level Objective (SLO)
SRE begins with the idea that a prerequisite to success is availability. A system that is unavailable cannot perform its function and will fail by default. Availability, in SRE terms, defines whether a system is able to fulfill its intended function at a point in time. In addition to being used as a reporting tool, the historical availability measurement can also describe the probability that your system will perform as expected in the future. An SLO states the target level of availability that the service is expected to meet over a given period.
Service-Level Agreement (SLA)
An SLA normally involves a promise to someone using your service that its availability SLO should meet a certain level over a certain period, and if it fails to do so then some kind of penalty will be paid. This might be a partial refund of the service subscription fee paid by customers for that period, or additional subscription time added for free. The concept is that going out of SLO is going to hurt the service team, so they will push hard to stay within SLO. If you’re charging your customers money, you will probably need an SLA.
Service-Level Indicator (SLI)
A Service-Level Indicator (SLI) is a measurement of a service's behavior, for example the frequency of successful probes of a system. When we evaluate whether our system has been running within SLO for the past week, we look at the SLI to get the service availability percentage. If it goes below the specified SLO, we have a problem and may need to make the system more available in some way, such as running a second instance of the service in a different city and load-balancing between the two. If you want to know how reliable your service is, you must be able to measure the rates of successful and unsuccessful queries as your SLIs.
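The relationship between an SLI and an SLO described above can be sketched as follows. This is a minimal illustration, not how any particular tool computes it: the function names, the query counts, and the SLO targets are all hypothetical.

```python
def availability_sli(successful_queries, total_queries):
    """Availability SLI: percentage of queries that succeeded."""
    if total_queries <= 0:
        raise ValueError("total_queries must be positive")
    if not 0 <= successful_queries <= total_queries:
        raise ValueError("successful_queries must be between 0 and total_queries")
    return 100 * successful_queries / total_queries


def within_slo(sli_percent, slo_percent):
    """The service is within SLO when the measured SLI meets or exceeds the target."""
    return sli_percent >= slo_percent


# Hypothetical query counts for the past week.
successful = 99_950
total = 100_000

sli = availability_sli(successful, total)
for slo in (99.9, 99.99):
    status = "within SLO" if within_slo(sli, slo) else "out of SLO"
    print(f"SLI {sli:.2f}% against SLO {slo}%: {status}")
```

With these made-up numbers the same measured SLI meets a 99.9% target but misses a 99.99% target, which shows that being "within SLO" depends on the target chosen as much as on the measurement.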
Dynatrace
- https://www.dynatrace.com/support/help/
- https://university.dynatrace.com/ondemand/course/22170
- https://community.dynatrace.com/
SRE Toolchain
Containers for Microservices
- Docker
- Kubernetes
- Swarm
- Apache Mesos
- Podman
Source Control Tools
- Git
CI/CD Tools
- Jenkins
- CircleCI
- GitLab
- GoCD
- Semaphore
- Concourse: https://concourse-ci.org/
Data Storage Tools
- MySQL
- PostgreSQL
- MongoDB
- Apache Hadoop
- Apache Hive
- Amazon Aurora (MySQL and PostgreSQL-compatible)
- MariaDB (a fork of MySQL)
Configuration Management Tools
- Ansible
- Chef
- Puppet
- SaltStack
Metrics Collection Tools
- Prometheus
- Stackdriver (Google Cloud Operations)
- InfluxDB
- Sensu Go
Log Aggregation Tools
- Fluentd
- Sentry
- Logstash
Distributed Tracing Tools
- OpenTelemetry
- Jaeger
Application Performance Monitoring Tools
- AppDynamics
- New Relic
- Dynatrace
Dashboarding Tools
- Grafana
- Stashboard
- Redash
- Metabase
Incident Management
- PagerDuty
- Opsgenie
- Squadcast