At face value, uptime is the easiest monitor to set up. But when it matters, it is one of the hardest SLOs to set. As long as the targets are low, it's simple:
Uptime % = Up / (Up + Down) × 100
But when it starts crossing the 99.9x% mark is where this gets interesting. If something is checked only every x seconds, was it always up, or only up during the checks? And who monitors the uptime of the uptime monitor?
As always, there is no silver bullet, just trade-offs. The talk is a tale of these trade-offs.
Scars, Battle scars, and Expensive scars.
More often than not, software engineers don't know the operational challenges of their code: what can fail in production.
Manjot Pahwa, Rishu Mehrotra, Kalyan Somasundaram, and I discuss:
- Google, LinkedIn, and startup worldviews of SRE
- An often-neglected area of SRE work: cost
- Cost of SRE and reliability
- Incident management
- Authority and roles
- Day-to-day and incident management
- Operational load and toil
Pulumi or Terraform?
How much, and what exactly, to automate?
Automated vs automatic?
Ansible or K8s?
Serverless or needless?
Choices, choices, choices - and very expensive these choices are.
An attempt to separate facts from popular tech fiction and lay out the trade-offs associated with these choices.
Obsolete software == stable software.
Stability or release velocity?
Treat operations as product.
Reliability != uptime.
Reliability != buffet lunch. Pay for what you order.
Treat SLOs as gears.
Focus on SRE when \$ spend < \$ lost.
SRE tenets: minimize downtime, find where else it is happening, prevent future failures.
SREs should be able to withstand boredom.
SRE maturity model.
Systems fail, but the real failures are the ones we learn nothing from.
This talk is a tale of a few such failures that went right under our noses and what we did to prevent them. The failures covered range from heterogeneous systems, unordered events, and missing correlations to plain human error.
Software is opaque. To see what it's doing, you inject observation capability into it. This goes beyond logs and stepping through a debugger, for you have to observe the live system, not your sandbox. How does control theory use observability to build systems that thrive on feedback and improve? Slides ⧉
Every product either dies a hero or lives long enough to hit reliability issues. While you go about fixing this, what is the cost, both in effort and in business lost, of failure, and how much does each nine of reliability cost? The talk takes a simple, straightforward product and evaluates the depth of each failure point. We take one fault at a time and introduce incremental changes to the architecture, the product, and the support structure, such as monitoring and logging, to detect and overcome those failures. Slides ⧉
TLS has come a long way and is probably one of the least-discussed topics in public talks. The talk walks through understanding certs and how a server-to-server TLS exchange happens. What is a CRL, and how do you check revocation lists? What is the problem with CRLs? What is OCSP, and how does it solve the problems of CRLs? What is the problem with OCSP? What is OCSP stapling? Why do language runtimes not address the problem of identifying revoked and expired certs? How do we bring this all together into a genuinely trustworthy server-to-server exchange? Slides ⧉
You know the Single Responsibility Principle, the Dependency Inversion Principle, and the rest of the SOLID principles. You know DRY, loose coupling, and the CAP theorem. You have probably also heard of the benefits of functional programming. Can one apply these learnings and programming principles to scale an organization as well? Slides ⧉
What are containers, and how is Docker made? A container is a bunch of namespaces and cgroups put together to build the process isolation that we see. What are namespaces, and how do they operate? The talk invokes one Linux namespace at a time, as system calls from Go code, building up to a full-fledged container. Slides ⧉
Cgroups and namespaces are the shoes and shorts of the container race, in no particular order. They have been around for a while, but not many see the usage and power they offer. The talk is a collection of cookbooks where these were used to solve infrastructure problems I have encountered. Slides ⧉
The talk is an anatomy of data processing systems: their building blocks, methods, and purpose. We split the system into layers, defining the relevance, need, and behavior of each. We study common frameworks and tools and the layers they fit into, later showcasing typical architectures and deployments. Slides ⧉
How much does your application weigh when you build it using HTTP constructs? Can you achieve the same availability and reliability using other alternatives? Slides ⧉
Microservices have rapidly evolved over the years as a popular way of developing applications. But they bring their own set of challenges: which design patterns to use, monitoring, logging, error detection, scaling, and service discovery.
The talk explores the common characteristics and design patterns to consider when dealing with service-oriented architectures. It also covers signal-slots, RPC architectures, monitoring, log and error handling, function-point scaling, and the common Unix philosophies that help you design scalable distributed systems.
Diving into code samples, demos, and production deployments, I showcase Gilmour, a cross-language library we authored for effective microservices that exchange data over non-HTTP transports. Slides ⧉
We all build software, and we find ourselves using OOP in some manner or another. Inheritance is one of the core properties of OOP. What are its common variants? Single, multiple, and mixin-based inheritance.
All of these suffer from conceptual and practical limitations. Irrespective of the choice of language, our design ends up the same way: a mesh of interconnected types. As the project grows and we introduce more types, the complexity and cost of testing the system keep increasing. The Internet is full of memes about it. We go about identifying and illustrating these problems.
We then talk about traits. A trait is essentially a group of pure methods that compose classes and is a primitive unit of code reuse.
In this model, classes are composed from a set of traits by specifying glue code that connects the traits together and accesses the necessary state. We demonstrate how traits overcome these problems, and help you build simpler and reusable code.
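The paper's examples are in Smalltalk; Go has no traits, but as a rough sketch of the same idea, a bundle of pure methods can be written against an interface describing the state it needs, and "glue code" in the composing type wires the two together (all names here are my own illustration):

```go
package main

import "fmt"

// The state a trait requires, declared as an interface.
type magnitude interface{ Value() int }

// CompareTrait is a reusable bundle of pure methods: it holds no state
// of its own, only a reference to whatever provides Value().
type CompareTrait struct{ magnitude }

func (c CompareTrait) Less(o magnitude) bool  { return c.Value() < o.Value() }
func (c CompareTrait) Equal(o magnitude) bool { return c.Value() == o.Value() }

// Celsius composes the trait; the glue is the Value method plus the
// wiring in the constructor.
type Celsius struct {
	CompareTrait
	deg int
}

func (c Celsius) Value() int { return c.deg }

func NewCelsius(deg int) Celsius {
	c := Celsius{deg: deg}
	c.CompareTrait = CompareTrait{c} // connect trait to state
	return c
}

func main() {
	a, b := NewCelsius(20), NewCelsius(25)
	fmt.Println(a.Less(b), a.Equal(b)) // true false
}
```

Another type, say a Weight, gets the same comparison behavior by supplying its own `Value()`, with no inheritance hierarchy involved.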
This talk is based on the research paper "Traits: Composable Units of Behaviour".
CDNlysis syncs Amazon CloudFront log entries from an S3 bucket and streams them to multiple database backends. You can later query to:
- Understand how bandwidth is being used.
- Find the most popular and most downloaded content.
- Generate trends for your most popular videos, audio, slides, etc.
- Understand the geographical behaviour of requests.
- Measure the bytes transferred to and from your CloudFront distributions.
- Find the most profitable referrers from which your content is accessed.
Gottp was designed with backend servers in mind and offered:
- Background workers
- Call aggregation using non-blocking or blocking pipes
- Optional listening on a Unix domain socket
- Built-in error traceback emails
- Optional data compression using zlib/gzip
- Automatic listing of all exposed URLs
We called it "Party": a persistent queue processor responsible for handling all the non-real-time work, from video encoding, indexing search documents, updating caches, and tracking participant progress, to sending emails, processing payments, and releasing Node.js sockets. Details ⧉