Note: Meeting Room 7 will be available as an On-Call Room for attendees.

Wednesday, August 30
 

07:30 IST

Morning Coffee and Tea
Wednesday August 30, 2017 07:30 - 09:00 IST
Prefunction

09:00 IST

Care and Feeding of SRE
As SRE enters the ops zeitgeist, much of the focus has been placed on tactics: techniques that individual operations teams can adopt to improve their effectiveness. While there is value in singleton adoption, I'll make the case in this talk that organizational support and culture that correspond with these tactics result in an impact far greater than the sum of its parts. I'll focus on three SRE goals: maintaining SLOs, managing operational load, and maximizing leverage, and discuss failure modes without sufficient organizational support. These aren't tactics that can be fully implemented by an operations team. SRE is an organizational strategy that needs to be adopted by the business.

Speakers

Narayan Desai

Google
Narayan is a jack of many trades, having worked as a sysadmin, software engineer, computational biologist, and computer science researcher, and most recently as an SRE manager at Google. When not working with computers, or people working with computers, he spends his time in high...


Wednesday August 30, 2017 09:00 - 09:40 IST
Pembroke and Lansdowne Rooms

09:40 IST

Diversity and Inclusion in SRE: A Postmortem
Whether a cause or a consequence of diversity & inclusion problems, members of minority groups in SRE experience harassment, bullying, and anti-social exclusion far too often. Although primarily an ethical and behavioural issue, it also has extremely costly negative effects on team effectiveness, arising from loss of psychological safety and even attrition. The data supporting these assertions are reasonably clear, but what is perhaps less clear is what to do about it.

We therefore analyse the situation in the form of a postmortem, suggesting some root causes, presenting a timeline, and analysing factors which contributed to (and offset) The Incident, and propose some actions to remediate.

Speakers

Niall Murphy

Microsoft
Niall Richard Murphy is Director of Engineering for Microsoft Azure Ireland, where his group works on cloud engineering systems and SRE. He is the instigator, co-author, and co-editor of two books on SRE, and a history of the Irish Internet. He is the holder of degrees in Computer...


Wednesday August 30, 2017 09:40 - 10:20 IST
Pembroke and Lansdowne Rooms

10:20 IST

Break with Refreshments
Wednesday August 30, 2017 10:20 - 11:00 IST
Prefunction

11:00 IST

Globalizing SRE in a Walkup Culture
The SRE discipline necessitates deep understanding of the organisation's business and technical needs, as well as clear objectives across the entire team.

What happens when you begin to distribute your SRE team across a large geographical divide?

We'll talk about what went into creating Wayfair's global SRE presence, from initial hire to full contributing team—what worked, what didn't, and what we have planned.

Speakers

Bill Lincoln

Director, Site Reliability Engineering, Wayfair
Assoc. Director of Platform Engineering. Bill manages the global SRE - Platform Engineering team at Wayfair. SRE's scope includes all things production that have an impact on e-commerce at Wayfair. We have expanded to a global team with team members in both the US and EU.


Wednesday August 30, 2017 11:00 - 11:30 IST
Lansdowne Room

11:00 IST

SRE Your gRPC—Building Reliable Distributed Systems (Illustrated with gRPC)
Distributed systems have sharp edges, and we have a wealth of experience cutting ourselves on them. We want to share our experience with SREs elsewhere, so they can skip making the same mistakes and join us making exciting new ones instead!
We will share practical suggestions from 14 years of failing gracefully:
  • In a distributed service, every component is a frontend to another one down the stack. How can it deal with backend failures so that the service as a whole does not go down?  
  • In a distributed service, every component is a backend for another one up the stack. How can it be scaled and managed, avoiding overload and under-use?  
  • In a distributed service, latency is often the biggest uncertainty. How can it be kept predictable?  
  • In a distributed service, availability, processing, and latency cost contributions are hard to assign. When things (inevitably) go wrong, which components are to blame? When they work, where are the biggest opportunities for improvement?
We will cover best and worst practices, using specific gRPC examples for illustration.

Speakers

Gabe Krabbe

Google
Gabe Krabbe has been a Site Reliability Engineer at Google for over 14 years. He has worked on, and sometimes against, multiple generations of the Ads management and serving infrastructure. Before joining Google, he worked for various companies as a system administrator. Gabe frequently...

Gráinne Sheerin

Google
Gráinne is a Site Reliability Engineer for Google Ireland. She's a tech lead responsible for Ad Serving infrastructure and has 5 years of experience in production engineering. She is a physicist with a doctorate in Nanoscience from Dublin City University. Prior to Google, she masqueraded...


Wednesday August 30, 2017 11:00 - 12:00 IST
Pembroke Room

11:00 IST

Load-Shedding: Overview of Different Methodologies
This talk gives an inventory and overview of the different methods for dealing with load-shedding and overload in production stacks, including an overview of the methods developed at Google and the open-source solutions.

We'll review the pros and cons, scope and effort levels of each method, and compare with existing approaches, including circuit-breakers.
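
As a rough illustration of the circuit-breaker approach mentioned above (a generic sketch, not taken from the talk or from any Google system), the Python below fails fast once a backend has produced several consecutive errors and allows a trial call after a cool-down. The class name `CircuitBreaker` and the parameters `max_failures` and `reset_after` are assumptions chosen for this example.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: shed calls after repeated backend errors."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable clock, handy for tests
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Circuit is open: reject immediately instead of loading the backend.
                raise RuntimeError("circuit open: request shed")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each backend request in `breaker.call(...)`. Rejecting early keeps an overloaded backend from receiving traffic it cannot serve, at the cost of some requests failing that might have succeeded.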

Speakers

Acacio Cruz

Director - Frameworks & Production Platforms, Google
Acacio has been an SRE manager since 2007, and manager of Google's Load-shedding & Traffic Management team since 2009. He is now a SWE Director in Frameworks and Software Infrastructure.


Wednesday August 30, 2017 11:00 - 12:00 IST
Meeting Rooms 1+2

11:00 IST

SRE 101
The purpose of an SRE team is to keep its services up, reliable, performant and efficient. How do effective SRE teams do this?


We'll run through an overview of key SRE competencies: monitoring and alerting, incident response, disaster recovery, performance and efficiency, change management and capacity planning.


We'll also look at the habits of successful SRE teams and some common pitfalls.

Speakers

Laura Nolan

Stanza
Laura Nolan is a software engineer and SRE. She has contributed to several books on SRE, such as the Site Reliability Engineering book, Seeking SRE, and 97 Things Every SRE Should Know. Laura is a Principal Engineer at Stanza, where she is building software to help humans understand...


Wednesday August 30, 2017 11:00 - 12:30 IST
Meeting Room 9

11:30 IST

Make Haste Slowly: Balancing SRE Diligence in Urgency Driven Organizations
Shopify is a commerce platform which has grown to power hundreds of thousands of businesses in a little over ten years. Along with the company, the production engineering organization has evolved from a founder's part-time job to a team of over seventy people. Because of all that rapid and continued growth, the culture highly rewards speed and urgency. "Move fast, break things" is fine…unless you are responsible for site reliability and availability. And the databases. Especially the databases.

This talk is about the tension between an urgency-driven organization and the diligent SRE teams that operate within it. We'll examine how to build, nurture, and support those teams. We'll look at how to celebrate and reward them for being prudent, cautious, and skeptical. And because it is the deliberate pace of these teams that allows the rest of the organization to move quickly, we'll dive into how to concretely measure the benefits and sell them as positives to the rest of the organization. Attendees will leave with tools and techniques to highlight the importance of their work as SREs, when to trade speed for diligence, and how to move fast and stay sane, all without cutting corners.

Speakers

Jason Hiltz-Laforge

Shopify
I'm a production engineering lead at Shopify, where I try not to break too many things at once. Apart from computers, I enjoy naming things and trying to convince my coworkers that all food is essentially salad. Outside of work, I like spending time with my wife and two daughters...


Wednesday August 30, 2017 11:30 - 12:00 IST
Lansdowne Room

12:00 IST

Want to Solve Over-Monitoring and Alert Fatigue? Create the Right Incentives!
Telemetry monitors and their (constant) beeping are a pretty common sight in hospitals. I saw these at the NICU where my twins were being cared for after being born prematurely. My wife and I used to freak out every time one of these went off. Unlike missing an alarm that says your site is down, failing to act on an alarm at a hospital can have much more critical consequences; in 2010 at a hospital in Massachusetts, a patient's death was directly linked to telemetry monitoring after alarms signaling a critical event went unnoticed by 10 nurses.

I attempted to solve this problem when I joined Zynga (in 2013) as the head of SRE. I will go over our failed attempts, including filtering the noise, adding heads, building more tools, etc. I will also cover how I came up with an initiative called "clean room" as a way to incentivize engineering teams to keep the noise levels low. Finally, I will go over some of the tactics that worked (and ones that didn't).

Most people I spoke to about "clean room" walked away having learned something (some have said it's common sense). Share, learn, ask questions, participate - I'll try to make it fun!

Speakers

Kishore Jalleda

Stealth
After a decade of leading (global) SRE teams at Microsoft, Yahoo, Zynga, and IMVU, Kishore Jalleda pivoted to full-time coding and building products that can organize the world's unstructured data and processes to help people lower their stress, make better decisions, and focus on...


Wednesday August 30, 2017 12:00 - 12:30 IST
Lansdowne Room

12:00 IST

Profiling Node Applications
Node runs on a powerful JavaScript engine, but that same engine can complicate things when it comes to obtaining accurate information on your application's performance. There are plenty of tools for profiling C++ or Java applications, but understanding JavaScript interactions with native code can be extremely challenging. In this talk we will discuss profiling options for Node.js, including perf_events, dtrace, the V8 engine's built-in --prof switch, and tools based on the bleeding-edge kernel BPF technology. We will also talk about turning profiler results into flame graphs, an innovative visualization tool for understanding stack sample reports, and for figuring out the time split across the JavaScript and native parts of your application.

Speakers

Sasha Goldshtein

CTO, Sela Group
Sasha Goldshtein is the CTO of Sela Group, a Microsoft Regional Director and MVP, Pluralsight and O’Reilly author, and international consultant and trainer. Sasha is the author of two books and multiple online courses, and a prolific blogger. He is also an active open source contributor...


Wednesday August 30, 2017 12:00 - 12:30 IST
Pembroke Room

12:00 IST

Managing SSH Access without Managing SSH Keys
Everyone uses SSH to manage their production infrastructure, but it's really difficult to do a good job of managing SSH keys. Many organisations don't know how many SSH keys have access to production systems or how protected those keys are. A trusted SSH private key can be years old, unprotected by passphrase, and shared among multiple people who may not even work for you.

With some tooling and configuration, SSH keys can be replaced with limited-use ephemeral certificates, issued centrally and with better access controls and automatic key expiration, solving many of the shortcomings of using SSH keys.


This talk will cover:


  • Managing SSH keys: The bad parts

  • Replacing SSH keys with ephemeral certificates: how & why

  • Discussion of an implementation of a CA for SSH certificates

  • Call for participation, showing GitHub source


Speakers

Niall Sheridan

Senior Systems Engineer, Intercom
I love a good disaster


Wednesday August 30, 2017 12:00 - 12:30 IST
Meeting Rooms 1+2

12:30 IST

Conference Luncheon
Sponsored by Palantir

Wednesday August 30, 2017 12:30 - 13:40 IST
Sussex Restaurant and Herbert Room

13:40 IST

The Dangers of Being Overly-Paranoid
Shipping code is not enough. You also need logs, tests and static analyzers to have the necessary confidence in the change you just deployed. Especially when you’re shipping to production every 10 minutes, at peak time.


A common philosophy for feeling that things are in control is simply adding more data. And when something bad happens, the first reaction is to add even more instrumentation to cover that specific scenario. You need more information in the logs to troubleshoot, more inputs for your tests and more linting rules. And then you’ll never run into that problem again, right!?


Well…maybe, but you’ve just hit a bigger one:


When your app is small, you can easily get away with test duplication, log noisiness and introducing new tools.


As you grow, the noise becomes impactful. You’re being slowed down by the shortcuts and poor decisions you made earlier, until it becomes a non-trivial problem to solve.


In this talk we will explore the approach we took at Intercom for introducing information sanity, by focusing on:


  • logging myths, and dangerous fallacies

  • being deliberate about operational and engineering needs

  • better troubleshooting using structured and canonical logs

  • getting more benefits and performance out of notoriously slow static analyzers

  • writing tests with performance in mind
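
To make the "structured and canonical logs" item above concrete, here is a minimal Python sketch of one common reading of the idea: gather fields over the life of a request and emit a single structured line at the end. It is an assumption-laden illustration, not Intercom's implementation; the class `CanonicalLog` and the field names (`request_id`, `path`, `db_queries`, and so on) are hypothetical.

```python
import json
import time


class CanonicalLog:
    """Collects fields during a request and emits one JSON line at the end."""

    def __init__(self, **fields):
        self.fields = dict(fields)
        self.start = time.monotonic()

    def add(self, **fields):
        # Later values overwrite earlier ones, so each key appears once.
        self.fields.update(fields)

    def emit(self):
        # Record the total request duration, then serialize with stable key order.
        elapsed_ms = (time.monotonic() - self.start) * 1000
        self.fields["duration_ms"] = round(elapsed_ms, 2)
        return json.dumps(self.fields, sort_keys=True)


# Hypothetical request handler flow: enrich as you go, emit once.
log = CanonicalLog(request_id="req-123", path="/checkout")
log.add(user_id=42, db_queries=3)
log.add(status=200)
line = log.emit()
```

One line per request with consistent keys tends to be easier to query than many free-form messages, which is one way to cut log noise without losing troubleshooting detail.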


Speakers

Ingrid Epure

Software Engineer, Intercom
Ingrid is an engineer currently working for Intercom in Dublin, Ireland. She is passionate about distributed systems, automation and simplifying things. She is a conference speaker, an active member in the Python community and loves mentoring and helping with community-driven eve...


Wednesday August 30, 2017 13:40 - 14:30 IST
Lansdowne Room

13:40 IST

Standing On the Shoulders of Giants: Unleashing the Power of Scriptable Load Balancers
Every year, our organizations continue adding more services. It’s unsustainable to have a dedicated team of SREs for each one. That’s why, as an industry, we’ve moved to the product-team SRE model. We’re now accustomed to building custom services that applications reach out to, but not middleware services that operate on requests before they reach their destination.

Load balancers have the potential to provide application-aware middleware without making changes to the application itself. However, traditional load balancers can’t be easily and deeply customized or redeployed quickly without significant risk. Instead, we can embed a scripting language to fulfill these requirements.

At Shopify, we do this with Nginx and LuaJIT via OpenResty. Our Nginx scripts deploy in 10 seconds, run through a thorough suite of automated tests, and have allowed us to solve sharding across data centers, handle some of the world’s biggest flash sales, and respond quickly to layer 7 DDoS attacks. What once took a large team of engineers can now be accomplished by one of any size.

The lessons learned from building this middleware framework are applicable to any service. By solving hard problems in your load balancers, you can benefit every application or service you run.

Speakers

Emil Stolarsky

Production Engineer, Shopify
Emil is a production engineer at Shopify where he works on performance, scriptable load balancers, and DNS tooling. When he's not trying to make Shopify's global performance heat map green, he's shivering over a spiked cup of coffee in the great Canadian north.


Wednesday August 30, 2017 13:40 - 14:30 IST
Pembroke Room

13:40 IST

Networks for SREs: What Do I Need to Know for Troubleshooting Applications
All of us depend on the underlying network to be stable, whether in the datacenter or in the cloud. We all have a basic knowledge of how traditional networks run. However, in the past 10 years, we’ve moved to building redundant physical topologies in our networks, optimized the routing methodologies accordingly, moved into the cloud and gotten greater visibility and tunables in the Linux kernel network stack. A lot has changed!

However, the way we troubleshoot the network in relation to the applications we support hasn’t adapted. In this session, we’ll review the progress that network infrastructure has made, look at specific examples where traditional troubleshooting responses fail us, and demonstrate our need to rethink our approach to making applications and the network interact harmoniously.

Speakers

Michael Kehoe

Staff SRE, LinkedIn
Michael Kehoe is a Staff SRE at LinkedIn who works on building scalable monitoring infrastructure, reliability principles, and incident management. Michael previously interned at NASA Ames on their PhoneSat project. Michael's key interests lie in network engineering and automatio...


Wednesday August 30, 2017 13:40 - 14:30 IST
Meeting Rooms 1+2

13:40 IST

SRE Your gRPC—Building Reliable Distributed Systems (Workshop)
Distributed systems have sharp edges, and we have a wealth of experience in cutting ourselves on them.

In this workshop, participants will learn how to specify and use gRPC-based services (including an introduction to protocol buffers). Particular emphasis will be placed on engineering for reliability in the face of inevitable failures and errors. This will include identifying and implementing appropriate strategies for different requirements and circumstances, as well as enabling effective debugging through strong instrumentation.

All topics covered will include hands-on coding exercises.

Participants need to have a working knowledge of C++, Go, Java, or Python, and must bring a laptop running the Chrome browser (and a suitable charger).

Speakers

Lisa Carey

Google
Lisa Carey is a Technical Writer for Google Cloud Platform in Dublin. She has written documentation for many technologies including Protocol Buffers, gRPC, and Cloud APIs, and regularly runs writing workshops for Google engineers. She holds degrees from Trinity College Dublin.

Gabe Krabbe

Google
Gabe Krabbe has been a Site Reliability Engineer at Google for over 14 years. He has worked on, and sometimes against, multiple generations of the Ads management and serving infrastructure. Before joining Google, he worked for various companies as a system administrator. Gabe frequently...

Gráinne Sheerin

Google
Gráinne is a Site Reliability Engineer for Google Ireland. She's a tech lead responsible for Ad Serving infrastructure and has 5 years of experience in production engineering. She is a physicist with a doctorate in Nanoscience from Dublin City University. Prior to Google, she masqueraded...


Wednesday August 30, 2017 13:40 - 17:00 IST
Meeting Room 8

13:40 IST

Mastering Linux Performance Tools
All kinds of applications run on Linux, from web servers to distributed database engines and embedded applications. Troubleshooting performance in the field, especially when invasive profilers can't be used, is a delicate art that requires a solid understanding of the system and low-overhead tools. In this workshop, we will visit a spectrum of Linux performance monitoring tools.

We will start with a simple performance checklist based on the USE method, including tools like top, iostat, vmstat, mpstat, sar, and others. Then, once we identify the overloaded resource, we will dig in deeper using perf: tracepoints, hardware events, dynamic probes, and USDT. We will also collect stack traces of heavy events (CPU usage, disk accesses, network) and visualize them using flame graphs.

Finally, we will discuss the emerging superpower for Linux performance monitoring: BPF and BCC. This is a new kernel technology that enables low-overhead, super-efficient monitoring and tracing tools, which perform aggregation closer to the source where the events occur and provide useful information at a fraction of the cost. We will review a performance checklist based on BCC tools, and explore one-liners from the general-purpose trace and argdist tools.

Speakers

Sasha Goldshtein

CTO, Sela Group
Sasha Goldshtein is the CTO of Sela Group, a Microsoft Regional Director and MVP, Pluralsight and O’Reilly author, and international consultant and trainer. Sasha is the author of two books and multiple online courses, and a prolific blogger. He is also an active open source contributor...


Wednesday August 30, 2017 13:40 - 17:00 IST
Meeting Room 9

14:30 IST

Show Me the RIGHT Numbers! Are Our Users Happy?
Are your users happy? If you’re not really sure, then you’re focusing on the wrong metrics.


In this session we will show you outside-in approaches to choosing service level indicators and objectives that reflect user happiness.
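
For readers new to the terms, the arithmetic behind a request-based service level indicator and its objective can be sketched in a few lines of Python. This is a generic illustration rather than material from the session; the function names and the sample numbers are assumptions.

```python
def availability_sli(good_requests, total_requests):
    """Fraction of requests that users experienced as successful."""
    if total_requests == 0:
        return 1.0  # no traffic: treat the indicator as fully met
    return good_requests / total_requests


def error_budget_remaining(good_requests, total_requests, slo_target):
    """Share of the error budget still unspent (negative means overspent)."""
    allowed_failures = (1.0 - slo_target) * total_requests
    actual_failures = total_requests - good_requests
    if allowed_failures == 0:
        return 1.0 if actual_failures == 0 else 0.0
    return 1.0 - actual_failures / allowed_failures


# Hypothetical month: one million requests, 500 failures, 99.9% objective.
sli = availability_sli(999_500, 1_000_000)
remaining = error_budget_remaining(999_500, 1_000_000, slo_target=0.999)
```

With these sample numbers the objective allows roughly 1,000 failed requests, so 500 failures leave about half of the budget. What counts as a "good" request is the part that should be chosen from the user's point of view.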

Speakers

Perry Statham

Observability Architect, Kyndryl
Perry Statham is a software architect with Kyndryl. As a veteran of the development vs. operations wars, he’s been doing DevOps since long before it was a buzzword.


Wednesday August 30, 2017 14:30 - 15:00 IST
Lansdowne Room

14:30 IST

InStream: Large Scale Distribution using BitTorrent, Python, Salt, and Kafka
Deploying applications and services to all the servers across every datacenter can be painful for any company with a big infrastructure, including LinkedIn.

Our deployment model had some centralized pieces which became bottlenecks at scale. This talk will describe how we built a service in Python, based on SaltStack and Kafka, which can deploy any service to all servers asynchronously with a P2P distribution model, rate limiting and fast rollbacks.

Speakers

Harsh Sharma

SRE
I've been an SRE at LinkedIn for over a year, working with Platform and Horizontal teams, and as one of the primary owners of InStream, building internal tools and supporting different platform services. I enjoy being an SRE and wish to contribute as much as I can to the global SRE...


Wednesday August 30, 2017 14:30 - 15:00 IST
Pembroke Room

14:30 IST

Anycast Is Not Load Balancing
We'll discuss IP anycast (what it is, how it works), what use cases it's more or less suited to, and some of the complexity it introduces (complete with war stories).

Speakers

Murali Suriar

Google
Lapsed computer science graduate, turned network engineer, turned SRE. Currently working at Google running software defined network control systems. Left Google to get on a boat. Got bored and came back.


Wednesday August 30, 2017 14:30 - 15:00 IST
Meeting Rooms 1+2

15:00 IST

Break with Refreshments
Wednesday August 30, 2017 15:00 - 15:40 IST
Prefunction

15:40 IST

Use Load Testing to Build a Proper Mental Model of Your Service
Large organisations often have teams dedicated to building and using load test frameworks for their production services. Intercom's engineering team was too small to have accurate load tests for all of its systems, but as we acquired larger customers, having accurate load-test numbers and being able to communicate them to the business became more critical. This talk will cover some things we learnt about load testing, and how it changed our mental models of some of our infrastructure.

Speakers

John Looney

Intercom
John Looney did 24x7 support for a webhosting company, spent nearly 12 years in Google as an SRE (compute, storage, datacenters and Ads) as well as running team-build courses. He is now applying SRE to Intercom's infrastructure. He is passionate about ensuring that engineers know...


Wednesday August 30, 2017 15:40 - 16:10 IST
Lansdowne Room

15:40 IST

Capturing and Analyzing Millions of Queries without Any Overhead
This talk is about a new way of monitoring and analyzing millions of queries with no overhead.

Optimizing queries is the most important aspect of scaling database servers. Before we can optimize, we need to identify the problematic queries. MySQL has a slow query log, for which we can set a threshold: all queries crossing the threshold are logged to a file that can later be used for analysis. The other way is to use the performance_schema database inside MySQL, which gives various metrics about queries.

The problem is that enabling the slow query log will incur a 25-35% overhead on the database, since we have to write to a file. Additionally, since only queries exceeding the threshold will be logged, we won't have any data about queries below that threshold. Meanwhile, enabling performance_schema incurs a 10-20% overhead, and is complex to understand.

To minimize overhead and effectively measure all queries, we have built a query analyzer which incurs less than 3% CPU overhead and no overhead on any other resources.

Speakers

Karthik Appigatla

Staff SRE, LinkedIn
Karthik Appigatla has been working on various large-scale data stores for a decade, primarily focused on MySQL. He has been working at LinkedIn for the last 5 years. Prior to LinkedIn, he worked for Yahoo, Pythian and Percona where he was responsible for helping clients...

Basavaiah Thambara

Sr. Database Engineer, LinkedIn
Basavaiah Thambara (Basu) has a decade of experience designing, building and scaling MySQL databases. He is currently working as a staff database engineer at LinkedIn managing Espresso, an in-house distributed NoSQL datastore. He currently lives in Bangalore, India https://in.linked...


Wednesday August 30, 2017 15:40 - 16:10 IST
Pembroke Room

15:40 IST

Bots Are Fast, Humans Are Smarter—Eliminate Unwanted Traffic and Defend Against DDoS
In a world with ever-growing DDoS attacks, L7 attacks give even the most experienced engineers the sweats. Imagine if instead of following easy-to-detect patterns, bots could mimic the behaviour of customers. Well, that’s exactly what Shopify sees every day during flash sales.

Come and learn how we block nearly all bot traffic on our load balancers without any human intervention. We will share our challenges of differentiating between web crawlers and bots, users behind NATs and bots rotating user agents, as well as fast humans and browser extensions. When the stakes are blocking a customer completing a checkout, misclassification isn’t an option.

This is not yet another machine learning talk, but an example of how simple statistics, heuristics and some sane limits can give great results with minimal complexity. The lessons learned in this talk are applicable to any real-world problem with inexact constraints.
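
As one example of what a "sane limit" can look like in code, the Python sketch below counts requests per client in a sliding time window and rejects a client once it exceeds a cap. This is a generic technique offered for illustration only, not Shopify's classifier; `SlidingWindowLimiter`, the client key, and the thresholds are all hypothetical.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within the last window_seconds."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = defaultdict(deque)  # client id -> timestamps of allowed hits

    def allow(self, client_id, now):
        timestamps = self.hits[client_id]
        # Drop timestamps that have aged out of the window.
        while timestamps and now - timestamps[0] >= self.window:
            timestamps.popleft()
        if len(timestamps) >= self.max_requests:
            return False  # over the limit: treat as unwanted traffic
        timestamps.append(now)
        return True
```

A limit like this is cheap to evaluate on a load balancer, but keying on an IP address alone would penalize many users behind one NAT, which is the kind of misclassification the abstract warns about.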

Speakers

Felix Glaser

Senior Production Security Engineer ☁️ 生产安全工程师 ☁️, Shopify
Felix likes to climb, cycle, and code in Canada. The first two outside and the other one at Shopify, where he works on securing containers and their deployment into the cloud.


Wednesday August 30, 2017 15:40 - 16:10 IST
Meeting Rooms 1+2

16:10 IST

Traffic Steering using RUM DNS @ LinkedIn
Do you serve customers across the world with varying network conditions? Do you struggle with automatically sending your users away from an unavailable POP/DC to the next best one? Wouldn’t it be great if you magically knew about the regional issues related to last mile connectivity from the users?

LinkedIn uses real user measurements backed by triple-vendor DNS. These measurements are collected from members’ browsers to gain insight into our performance from every last mile. We then leverage Big Data to send members to the closest edges in real time and deliver a fast member experience.

LinkedIn members from Mumbai visiting LinkedIn will be sent to our Mumbai POP using the member’s geolocation. If the POP is unreachable, member browsers elsewhere will report this to our RUM backend and our DNS will learn to stop resolving to the unreachable POP - all within a matter of a few seconds.
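
The steering decision described above can be approximated with a small Python sketch: given recent browser measurements per POP, skip POPs that too many clients fail to reach and pick the one with the lowest median latency. This is a simplified illustration under stated assumptions, not LinkedIn's system; `pick_pop`, the `min_success_rate` threshold, and the sample data are hypothetical.

```python
from statistics import median


def pick_pop(samples, min_success_rate=0.9):
    """Choose a POP from RUM samples.

    samples maps a POP name to a list of measurements, where each entry is a
    latency in milliseconds, or None if the browser could not reach the POP.
    Returns the healthy POP with the lowest median latency, or None.
    """
    best_pop, best_latency = None, None
    for pop, results in samples.items():
        ok = [r for r in results if r is not None]
        # Skip POPs with no successful samples or too many failures.
        if not ok or len(ok) / len(results) < min_success_rate:
            continue
        latency = median(ok)
        if best_latency is None or latency < best_latency:
            best_pop, best_latency = pop, latency
    return best_pop


# Hypothetical measurements from members in one region.
measurements = {
    "mumbai": [None, None, 20, None],      # mostly unreachable
    "singapore": [80, 85, 90, 82],
    "frankfurt": [150, 160, 155, 158],
}
chosen = pick_pop(measurements)
```

In this sample the nearby POP is excluded because most probes failed, so the next-best POP by latency is selected.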

Attendees will learn about:


  • Web Performance: CDNs, POPs, RUM steering

  • Multi-vendor strategy for redundancy & performance

  • Tools for cross-vendor consistency, vendor fail-out, global site monitoring & status boards


Speakers

Abhijeet Rastogi

Senior Site Reliability Engineer, LinkedIn
Abhijeet has been working with LinkedIn for around 2.5 years and a total of 5 years as an SRE. He joined LinkedIn with experience of architecting VPS hosting using OpenStack and hosting email infrastructure at scale. He has worked on managing DNS, CDN and Traffic infrastructure at...


Wednesday August 30, 2017 16:10 - 17:00 IST
Lansdowne Room

16:10 IST

OK Log: Distributed and Coördination-Free Logging
This talk explores the motivation, design, prototype, and optimization of OK Log, a distributed and coördination-free log system for big ol' (cloud-native) clusters.

We first motivate the need for such a system, setting it apart from existing products like Elasticsearch. Then, we carve out a solution in the distributed systems space, paying due homage to the old gremlins of consistency and coördination. Finally, we review the component and architecture model, and demonstrate how it copes with typical operations and failure modes.

This talk is about an open-source product, but it is not a product pitch. Instead, it's meant to be a case study of a learning exercise: approaching a deceptively subtle problem domain from first principles, and using methodological software engineering to derive a solution. I hope it inspires others to reach for something more self-actualizing than the plumbing together of databases and message busses.

Speakers

Peter Bourgon

Fastly
Peter Bourgon is a Go aficionado and is quite keen on distributed systems. He's written Go kit, a toolkit for microservices in Go, among several other OSS projects. He is a professional typist, and has typed for Bloomberg, SoundCloud, and Weaveworks; he currently types for Fastly...


Wednesday August 30, 2017 16:10 - 17:00 IST
Pembroke Room

16:10 IST

Google SDN Peering: An Early Engagement Case Study
How do you build a new SRE team around a completely novel product? This talk will deal with some of the challenges involved in launching Espresso, Google's software defined peering architecture.


  • How do you build an SRE team for a product which isn't serving real users yet?

  • How do you build a cohesive team and structure out of many disparate teams? (Networking, SRE, software development)

  • How do you build oncall discipline in a team which largely hasn't been oncall before?


And as an aside, we'll also get into some of the technical details of Espresso, since it's necessary to understand what made it so challenging and different.

Speakers

Murali Suriar

Google
Lapsed computer science graduate, turned network engineer, turned SRE. Currently working at Google running software defined network control systems. Left Google to get on a boat. Got bored and came back.


Wednesday August 30, 2017 16:10 - 17:00 IST
Meeting Rooms 1+2

17:00 IST

Happy Hour
Sponsored by Facebook

Wednesday August 30, 2017 17:00 - 18:00 IST
Herbert Room

18:00 IST

State of SRE: Share Your Successes and Challenges

Many companies are adopting SRE. SREcon is a fabulous way to learn about how others in the industry are implementing and approaching SRE. We learn about some companies through talks, and others through networking. This BoF has two aims: a) share our companies' successes and challenges in implementing SRE and b) consider whether the industry would benefit from a survey and white paper (similar to the State of DevOps survey and white paper).


Wednesday August 30, 2017 18:00 - 19:30 IST
Meeting Rooms 1+2

18:00 IST

Burnout and Work-Life Balance

Wednesday August 30, 2017 18:00 - 19:30 IST
Pembroke Room

18:00 IST

Tales from Production
Speakers

Kurt Andersen

Program Committee, LinkedIn
Kurt Andersen was one of the co-chairs for SREcon-Americas in 2017 and 2018. He has been active in the anti-abuse community for over 20 years and is currently the senior IC for the Product SRE team at LinkedIn. He also works as one of the Program Committee Chairs for the Messaging... Read More →

John Looney

Intercom
John Looney did 24x7 support for a webhosting company, spent nearly 12 years in Google as an SRE (compute, storage, datacenters and Ads) as well as running team-build courses. He is now applying SRE to Intercom's infrastructure. He is passionate about ensuring that engineers know... Read More →


Wednesday August 30, 2017 18:00 - 19:30 IST
Lansdowne Room

19:30 IST

SRENOG: Networking skills for SREs
Speakers

Murali Suriar

Google
Lapsed computer science graduate, turned network engineer, turned SRE. Currently working at Google running software defined network control systems. Left Google to get on a boat. Got bored and came back.


Wednesday August 30, 2017 19:30 - 21:00 IST
Pembroke Room

19:30 IST

Building SRE in small orgs
Speakers

Chris Sinjakli

SRE, PlanetScale
Chris enjoys working on the strange parts of computing where software and systems meet. He especially likes the challenges of databases and distributed systems. All his programs are made from organic, hand-picked, artisanal keypresses.


Wednesday August 30, 2017 19:30 - 21:00 IST
Lansdowne Room
 
Thursday, August 31
 

08:00 IST

Morning Coffee and Tea
Thursday August 31, 2017 08:00 - 09:00 IST
Prefunction

09:00 IST

How We Try to Make a Lion Bulletproof; Setting Up SRE in a Global Financial Organization
By now, most of us have read the O’Reilly book about Google SRE or have heard of other tech companies’ SRE teams. Our story is about doing SRE in a more traditional and more regulated environment: in the largest bank in the Netherlands.

This talk will address the past, present, and future of SRE within a financial institution with a BizDevOps way of working. In doing so, we will talk about how our journey started at SREcon in Dublin last year, why we wanted and needed to do SRE within ING, how our SREs started, the distributed way of working of SRE within ING, what technologies we actually work with and why, and our plans for the future.

Lastly, we will share our SRE dos and don’ts after a year of experience. We hope our talk inspires engineers or tech leads to start doing SRE within their (possibly more traditional) company and that those who have already implemented SRE can learn from our journey.

Speakers

Janna Brummel

ING
Janna is IT chapter lead for the site reliability engineering squad within the Domestic Bank (Retail) for ING in the Netherlands. Her job is to help other teams within the bank to know more about their services' performance and to be able to respond more efficiently to incidents... Read More →

Robin van Zijll

SRE, ING N.V.
Robin is a Site Reliability Engineer @ ING and PO of the SRE Team, and has years of experience in being on-call for all services offered to our retail customers.


Thursday August 31, 2017 09:00 - 09:50 IST
Lansdowne Room

09:00 IST

Incident Command at the Edge
As a content delivery network, Fastly operates an edge environment for many large-scale web properties and APIs. In order to deal with emerging threats to its network, Fastly needed to develop processes that allowed it to respond effectively to availability and security incidents at scale. The network engineering, SRE, and security teams at Fastly leverage a protocol called “Incident Command” to rapidly engage various teams across the company, and make sure customer properties are protected. Let the Fastly VP of SRE take you to the far side of the edge, and learn more about the challenges a large global network faces and the protocols that we found helped us.

Speakers

Lisa Phillips

Fastly
Lisa Phillips is a leader in reliability, with particular interest in social media and speeding up content delivery. She has worked for 20 years in tech and database operational roles for large sites Livejournal, Six Apart and Twitter - where she helped kill the fail whale. Lisa... Read More →


Thursday August 31, 2017 09:00 - 09:50 IST
Pembroke Room

09:00 IST

Deploying Changes to Production in the Age of the Microservice
You decoupled your APIs from their implementations and put them behind RPC interfaces. You build and deploy services independently. Your code health is impeccable. You put your user data in a persistent, replicated, and consistent store, where it belongs. Your developer velocity has skyrocketed.

Now we have new problems. We’ve got N independent services with M edges of interaction between them. That’s N services that need to be built, tested, and deployed on the infrastructure that expected you to have one service whose mess of entanglement was a secret you had with the compiler.

How do we deploy N binaries with N sources of static configuration and M sources of runtime configuration safely without losing our collective minds? In this talk, I’ll share some of how we grew that aforementioned N from 1 to many in Gmail. Specifically:


  • Consistent naming schemas for services, environments
  • Maintaining lightweight, easy-to-change production configuration abstraction layers
  • Release early, often
  • Canary everything by sharding into more A/B environments than you'd think you’d need
  • Encourage backwards compatibility in all APIs
  • Validate and test all configuration before changing global state

And, of course, some of the things we (Gmail) learned by breaking things along the way.

Speakers

Samantha Schaevitz

Staff Software & Site Reliability Engineer, Google Apps, Google
Samantha Schaevitz is a Staff SRE who's worked on Google Apps since 2013. She enjoys simplifying complex systems and skiing in the Alps near Zürich, where she lives.


Thursday August 31, 2017 09:00 - 09:50 IST
Meeting Rooms 1+2

09:00 IST

Tech Writing 101 for SREs
From post-mortems to operations manuals to code comments, writing things down for others is an unavoidable part of the life of an SRE.

In this workshop, you’ll learn writing principles to help you present technical information from two experienced Google technical writers - and each other! Through a series of pair-work exercises you’ll work through a variety of topics to improve the clarity, readability, and effectiveness of your writing, and possibly think about a toothbrush like you’ve never thought about one before. If you've never before taken any technical writing training, this workshop is perfect for you. If you've taken technical writing training, this class will serve as a great refresher.

There is a small amount of pre-reading for participants in this workshop (~30 minutes of reading about basic technical writing concepts).

The workshop runs for two hours with a short break.

Speakers

Betsy Beyer

Google
Betsy Beyer is a Technical Writer for Google Site Reliability Engineering in NYC. She has previously written documentation for Google Datacenters and Hardware Operations teams. Before moving to New York, Betsy was a lecturer on technical writing at Stanford University. She holds degrees... Read More →

Lisa Carey

Google
Lisa Carey is a Technical Writer for Google Cloud Platform in Dublin. She has written documentation for many technologies including Protocol Buffers, gRPC, and Cloud APIs, and regularly runs writing workshops for Google engineers. She holds degrees from Trinity College Dublin.


Thursday August 31, 2017 09:00 - 12:30 IST
Meeting Room 9

09:00 IST

Distributed Systems Reasoning
All distributed systems make tradeoffs and compromises. Different designs behave very differently with respect to cost, performance, and how they behave under failure conditions.

It's important to understand the tradeoffs that the building blocks in your systems make, and the implications this has for your system as a whole. In this workshop we'll look at several examples of different real-world distributed systems and discuss their strengths and shortcomings.

This workshop will include some practical elements. You will be given some system designs to read and to evaluate, and then we'll discuss the implications of each design together as a group.

Speakers

John Looney

Intercom
John Looney did 24x7 support for a webhosting company, spent nearly 12 years in Google as an SRE (compute, storage, datacenters and Ads) as well as running team-build courses. He is now applying SRE to Intercom's infrastructure. He is passionate about ensuring that engineers know... Read More →

Theo Schlossnagle

Circonus
Theo Schlossnagle is the founder and CEO of Circonus. Previously, he founded OmniTI, the go-to source for organizations facing today’s most challenging scalability, performance, and security problems; was the Founder of Message Systems, Inc. now Sparkpost; and researched distributed... Read More →


Thursday August 31, 2017 09:00 - 17:00 IST
Meeting Room 8

09:50 IST

From Firefighting to Proactive Work: the Journey of a Small Infrastructure Team in a Hyper Growth Environment
Due to an incident on our main datastore, we reacted and spent an entire week trying to keep Intercom up, with the help of 20 engineers from other teams. During this tough week, we were obliged to drop all other projects and focus on building a firefighting organization.

After the urgent period, it became evident to us that we needed to focus on proactive work to prevent the incident from happening again. It was the launchpad for a brand-new organization of our team, focused on ownership and high-impact work.

A few months later, the results vindicated our hard work: we had reduced system interruptions by more than 80%! But good news and radical changes also come with consequences: we needed to deal with multiple implications and drastically change the way we work as a team.

During this talk we will cover:


  • our journey from a firefighting to a proactive work organization

  • good and bad organizational decisions we made

  • impacts on the morale of the team


Speakers

Alex Gerlic

Intercom
Engineering Manager at Intercom


Thursday August 31, 2017 09:50 - 10:20 IST
Lansdowne Room

09:50 IST

Resiliency Testing with Toxiproxy
Fibers get cut, databases crash, and you’ve adopted Chaos Engineering to challenge your production environment as much as possible. But what are you doing to craft resiliency test suites that minimize the impact of failure on your application as much as possible? How do you debug resiliency problems locally and make sure single points of failure don't creep into the application in the first place? We’ve used the open-source Toxiproxy for the past two years to emulate timeouts, latency and outages in development environments. This talk will equip you with the tools to start writing resiliency test suites to harden your own applications, to supplement other chaos engineering practices.

Speakers

Jake Pittis

Shopify
In between teaching his team about jazz, Jake can be found on the Production Engineering Team at Shopify. He's worked preparing the platform for massive celebrity sales, making Shopify run out of multiple data centres, and the resiliency stack to protect the app against misbehaving... Read More →


Thursday August 31, 2017 09:50 - 10:20 IST
Pembroke Room

09:50 IST

Application Automation with Habitat
Container Orchestration Systems make for a great operational experience for deploying and managing containers. But that’s only part of the story when running containers in production. How do you build containers that contain only what you need (like no build systems/tools)? How do you orchestrate configuration of your application after the containers have been launched? How do you make it easy to modify an application config while keeping the containers immutable? How can you give your developers a means to declare dependencies for their applications?

Habitat, our open-source project for application automation, simplifies container management by packaging applications in a compact, atomic, and easily auditable format that makes it easier to deploy your application on various container runtimes and manage them over their lifecycle.

Speakers

Mandi

Technical Community Manager, Chef Software
Mandi Walls is Technical Community Manager, EMEA at Chef. For Chef, she helps organizations increase their effectiveness using configuration management and modernizing IT practices. She is a long-time sysadmin focusing on large complex web systems.


Thursday August 31, 2017 09:50 - 10:20 IST
Meeting Rooms 1+2

10:20 IST

Break with Refreshments
Thursday August 31, 2017 10:20 - 10:50 IST
Prefunction

10:50 IST

Building a Culture of Reliability
Getting customers to care about Reliability is hard. Getting stakeholders to care about Reliability is harder. Getting the entire company to care about Reliability is even harder.

In this talk, I will cover the steps that every leader in any organization can take to get more people to care about Reliability. Because Reliability is one of those things that people only notice when it goes in the wrong direction, it can be hard to show its value and why it is so important.

We will walk through cultural and management changes, metrics to watch and obsess over, and some tooling that can help along the way.

Speakers

Arup Chakrabarti

Director of Engineering, PagerDuty
Arup has been working in the space of software operations since 2007. He started out as an Operations Engineer at Amazon, helping to reduce customer defects with multiple teams for the Amazon Marketplace. Since then, he has managed and built operations teams at Amazon and Netflix... Read More →


Thursday August 31, 2017 10:50 - 11:45 IST
Lansdowne Room

10:50 IST

Case Study: Lessons Learned from Our First Worldwide Outage
Last year, on March 10, Incapsula experienced the first worldwide outage in its history… While relatively short in duration, it affected thousands of websites that rely on our security and acceleration every day.

Rooted in a 3-year-old dormant bug in our IncapRules code, this outage made us realize there were changes we needed to make in the way we write and qualify code. As VP of Engineering, the faulty code and our testing procedures are my responsibility, and it was up to me to lead the team to achieve an order of magnitude higher reliability.

One of the key things we were missing was a way to propagate customer configuration across our network in a way that is fast but without compromising on safety. The result was a new configuration sandbox system which achieved that.

In this talk I’ll present the process we took to analyze the true reliability of our system and the framework we use to reason about it, to prioritize tasks across teams and to design a more reliable service.

Speakers

Yoav Cohen

Imperva Incapsula
Yoav is VP of Engineering for Imperva Incapsula, and has been with the company since they made their first sale. In between meetings you will find him working on build systems or nasty performance bugs. When not doing so he tries to sneak a few minutes on his guitar or doing laps... Read More →


Thursday August 31, 2017 10:50 - 11:45 IST
Pembroke Room

10:50 IST

One Ring to Rule them…
Rollout automation is something that every service and team needs, and many reinvent the wheel. I'll talk about:

  • why the wheel gets reinvented
  • a system design that discourages reinvention, including an architecture diagram
  • the organisational challenges encountered when converting many services to use this new system design
  • how well the conversion attempt worked in practice

This is based on my experience initiating and running a program to replace rollout automation across Storage SRE in Google.

Speakers

John Tobin

Google
John Tobin manages Bigtable SRE and Cloud Bigtable SRE at Google Dublin, and has worked on several of Google's storage systems. He is currently involved in efforts to improve collaboration between teams across Storage SRE - standardising tools and processes, reducing duplicated effort... Read More →


Thursday August 31, 2017 10:50 - 11:45 IST
Meeting Rooms 1+2

11:45 IST

Tech Leadership in SRE
The job of a technical lead (TL) is probably not what you think it is.

The role of a TL in a team is as much about fostering social bonds as it is about technical excellence. For a tech lead, it's not enough to be a technical expert; they must also foster unity within a team so that it can move forward, together, towards the same goal.

This talk will (quickly) cover the role of a tech lead, some myths about tech leads, interactions/overlaps with other team roles (e.g., manager, individual contributor), and how to best make use of your own TL.

Speakers

Sean Rees

SRE, Google
Sean Rees is a long-time SRE tech lead in Google, now working in the networking space after stints in storage and ads.


Thursday August 31, 2017 11:45 - 12:30 IST
Lansdowne Room

11:45 IST

When Trouble Comes to Town
One's inclination when tackling an incident is usually to dive to the bottom of the stack where the problem is occurring and start debugging the root cause. However, it's important to first take a step back and approach the incident at a high level to ensure the fastest and most efficient resolution possible. This talk proposes seven steps to consider when tackling an incident: assessing the impact; communicating internally; looking for what changed; trying to mitigate; investigating the root cause; confirming resolution; and documenting and following up. It also touches on various tools which help with these steps.

Speakers

Michael Gorven

Facebook
Michael Gorven is a Production Engineer at Facebook, where he works on the Web Foundation team and previously Instagram. He fixes things when they break, improves the reliability of the system, helps engineer it to scale, and reverts diffs. Previously he was an early employee at South... Read More →


Thursday August 31, 2017 11:45 - 12:30 IST
Pembroke Room

11:45 IST

Dancing with Squads—Do you know what your Code Repos are Telling You?
Have you ever wondered why certain service teams are always at the center of issues? Code commits fail in certain areas; have you ever wondered why? In our quest to understand our services and data by looking from the outside in, we will take the audience through the methodology that our data scientists developed in collaboration with university researchers to identify the areas that most affected site reliability, and how the SRE team could use this data to develop new policies.

We found ourselves asking: Are you listening to what your code and issue history is telling you? Do you know what risks you are taking with your code? How do squad organization and climate create patterns that impact availability? This session will take the audience through how we answered those questions, and more, about our own code.

Speakers

Don Cronin

IBM
Don has more than 25 years' experience in developing software. He currently leads the DevOps Analytics mission. His focus is on improving the DevOps lifecycle using big data techniques so developers can deliver greater quality with faster velocity. Previously he led an adtech group... Read More →

Rob Orr

Offering Manager SRE, IBM
Rob Orr is Offering Manager at IBM focusing on bringing SRE Capabilities into IBM's offerings and capabilities.


Thursday August 31, 2017 11:45 - 12:30 IST
Meeting Rooms 1+2

12:30 IST

Conference Luncheon
Sponsored by Google

Thursday August 31, 2017 12:30 - 13:30 IST
Sussex Restaurant and Herbert Room

13:30 IST

Building an SRE Capability Inside a Large Organization
Agilent Technologies, while traditionally known as a hardware company, has started to deliver several Software-as-a-Service offerings through acquisitions and in-house product development. This talk will discuss how to transform the thinking across a decentralized, large corporate organization to approach software delivery differently. Traditionally, we are used to delivering software by burning and shipping CDs/DVDs with annual release cycles. Fast-forward to today: SaaS products use CI/CD approaches and release as frequently as twice a week, which requires a completely different operating model and doesn’t fit the traditional organizational framework and processes.

Speakers

Sriram Gollapalli

Agilent Technologies, Inc.
Sriram Gollapalli is the Director of Technology in the CrossLab Group at Agilent Technologies, Inc. He was the Co-Founder, CTO/COO at iLab Solutions from September 2006 until it was acquired by Agilent in August, 2016. The iLab Operating Software is a SaaS product serving academic... Read More →


Thursday August 31, 2017 13:30 - 14:30 IST
Lansdowne Room

13:30 IST

Cognitive Bias and On-Call
This talk will comprise:



  • An analysis of a set of cognitive biases, with illustrated examples (e.g. anchoring/priming, substitution/availability, loss aversion, etc.)

  • Introduction of Kahneman/Tversky's "System 1/System 2" hypothesis (i.e. that our mental architecture is divided quite sharply into two modes of thinking/being)

  • Description of the on-call experience for SREs

  • Relation of this to previous cognitive biases; assertion that on-call is actually about using humans' infinite "jump out of the system" ability, otherwise the software could just fix itself

  • Description of techniques to move an engineer from System 1 to System 2 thinking (which is what you actually want)

  • Call to action for self-healing software



The attendees will learn:


  • What psychological tricks might affect their next on-call shift

  • What to do about them

  • Why on-call sucks (no, really) and why there may be a future without it


Speakers

Niall Murphy

Microsoft
Niall Richard Murphy is Director of Engineering for Microsoft Azure Ireland, where his group works on cloud engineering systems and SRE. He is the instigator, co-author, and co-editor of two books on SRE, and a history of the Irish Internet. He is the holder of degrees in Computer... Read More →


Thursday August 31, 2017 13:30 - 14:30 IST
Pembroke Room

13:30 IST

SRE 101, Revisited

This presentation replaces the talk by Dinah McNutt, who is unable to attend. Laura Nolan will revisit her SRE 101 content from yesterday; if you missed the session due to the meeting room being at capacity, this is your opportunity to attend.

The purpose of an SRE team is to keep its services up, reliable, performant and efficient. How do effective SRE teams do this?

We'll run through an overview of key SRE competencies: monitoring and alerting, incident response, disaster recovery, performance and efficiency, change management and capacity planning.

We'll also look at the habits of successful SRE teams and some common pitfalls.


Speakers

Laura Nolan

Stanza
Laura Nolan is a software engineer and SRE. She has contributed to several books on SRE, such as the Site Reliability Engineering book, Seeking SRE, and 97 Things Every SRE Should Know. Laura is a Principal Engineer at Stanza, where she is building software to help humans understand... Read More →


Thursday August 31, 2017 13:30 - 14:30 IST
Meeting Rooms 1+2

13:30 IST

Linux System Metrics
While you can learn a lot by emitting metrics from your application, some insights can only be gained by looking at OS metrics. Yet OS metrics, despite being commonly used, are also frequently misunderstood.

In this hands-on workshop, we will go over commonly used CPU, memory, disk and network metrics, and make sure we understand each of them. We will experiment with old and new tools to acquire and analyse metrics, evaluate black-box workloads by looking at metrics, and assess the effect of extreme metric values on simple applications. During our adventure, we will learn about Linux internals, its underlying optimizations and unexpected limitations.

To participate, just make sure to bring a laptop with a Chromium-based browser installed.

Speakers

Nati Cohen

HERE Mobility
Nati Cohen is a Production Engineer at Here Technologies and a Teaching Assistant at the Interdisciplinary Center Herzliya. Previous experience includes: operations consulting, software development, *nix administration and security research in the Intelligence Corps as well as in... Read More →

Avishai Ish-Shalom

Engineer in Residence, Aleph VC
Avishai is a veteran operations and software engineer with years of high scale production experience. At present, Avishai helps growing startups and the Israeli high-tech eco-system as Engineer in Residence in Aleph VC fund. In his spare time, Avishai is spreading weird ideas and... Read More →


Thursday August 31, 2017 13:30 - 17:00 IST
Meeting Room 9

14:30 IST

The Why, What, and How of Starting an SRE Engagement
One of the hardest things to do is trust an outside voice. What are the boundaries between live site features and service features? How much expertise is required to be on-call? Who decides what’s in the best interests of the service? How is this not another Ops team or a staff augment? Who’s "in charge" and who makes prioritization calls? How do you build mutual trust? These are just some of the challenges in building a successful partnership between a product group and SRE.

In this talk we will present what we learned about the technical, organizational, and political systems that were needed to provide SRE to the Azure Internet-of-Things product group and how this can be used as a template for your services. We will discuss how to start an engagement, build partnerships and trust across organizations, provide ROI, keep a distinct identity and the frameworks that were developed to maintain tight organizational alignment including a new take on error budgets.

Let’s continue the conversation!

Speakers

Richard Clawson

Microsoft Azure
Richard Clawson is a Site Reliability Engineer working on the Azure SRE Team. He is part of the team in Azure that is working to improve operations across the Azure stack. Currently he is focused on creating repeatable patterns and practices for SRE engagements. Before Azure he was... Read More →

Josh Gilliland

Microsoft Azure
Josh Gilliland is a Program Manager for Azure SRE at Microsoft, focused on maturing the SRE model for Azure. Prior to that he was in program management in support of Azure networking.


Thursday August 31, 2017 14:30 - 15:00 IST
Lansdowne Room

14:30 IST

Reducing MTTR and False Escalations: Event Correlation at LinkedIn
LinkedIn’s production stack is made up of over 900 applications, 2200 internal APIs and hundreds of databases. With any given application having many interconnected pieces, it is difficult to escalate to the right person in a timely manner.


In order to combat this, LinkedIn built an Event Correlation Engine that monitors service health and maps dependencies between services to correctly escalate to the SREs who own the unhealthy service.


We’ll discuss the approach we used in building a correlation engine and how it has been used at LinkedIn to reduce incident impact and provide better quality of life to LinkedIn’s oncall engineers.
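As a rough illustration of the dependency-mapping idea described above, the following minimal sketch walks a service dependency graph to find the deepest unhealthy service and the team to escalate to. It is not LinkedIn's engine: the function names, the service names, and the owner table are all hypothetical, and the sketch assumes health status and dependencies are already available as plain Python data.

```python
from typing import Dict, List, Set


def root_unhealthy(service: str, deps: Dict[str, List[str]], unhealthy: Set[str]) -> List[str]:
    """Return unhealthy services reachable from `service` (following only
    unhealthy dependencies) that have no unhealthy dependencies of their own."""
    roots, seen, stack = [], set(), [service]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        bad_deps = [d for d in deps.get(current, []) if d in unhealthy]
        if current in unhealthy and not bad_deps:
            roots.append(current)  # nothing below it is failing: likely culprit
        stack.extend(bad_deps)
    return sorted(roots)


def escalation_targets(service: str, deps: Dict[str, List[str]],
                       unhealthy: Set[str], owners: Dict[str, str]) -> List[str]:
    """Map each likely culprit to its owning team (hypothetical owner table)."""
    return [owners[s] for s in root_unhealthy(service, deps, unhealthy)]


# Hypothetical example data, for illustration only.
deps = {"frontend": ["profile-api", "search-api"], "profile-api": ["profile-db"]}
unhealthy = {"frontend", "profile-api", "profile-db"}
owners = {"frontend": "web-sre", "profile-api": "profile-sre", "profile-db": "db-sre"}
```

With this data, an alert on `frontend` would be routed to the owners of `profile-db` rather than to the frontend on-call. A real system would also need to handle dependency cycles and unhealthy services hidden behind healthy intermediaries, which this sketch ignores.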

Speakers

Michael Kehoe

Staff SRE, LinkedIn
Michael Kehoe is a Staff SRE at LinkedIn who works on building scalable monitoring infrastructure, reliability principles, and incident management. Michael previously interned at NASA Ames on their PhoneSat project. Michael's key interests lie in network engineering and automatio... Read More →


Thursday August 31, 2017 14:30 - 15:00 IST
Pembroke Room

14:30 IST

Automated Debugging of Bad Deployments
Debugging a bad deployment can be tedious, from identifying new stack traces to figuring out who introduced them. At Pinterest we have automated most of these processes, using Elasticsearch to identify new stack traces and git-stacktrace to figure out who caused them. Git-stacktrace parses the stack trace and looks for related git changes. This has reduced the time needed to figure out who broke the build from minutes to just a few seconds.
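The general idea of matching a stack trace against recent changes can be sketched as follows. This is not git-stacktrace's actual implementation or API: the function names are hypothetical, the trace is assumed to be a Python traceback, and the commit data is passed in as a plain dictionary rather than read from git.

```python
import re
from typing import Dict, List, Tuple

# Matches the 'File "path", line N' frames of a Python traceback.
FRAME_RE = re.compile(r'File "([^"]+)", line (\d+)')


def files_in_trace(trace: str) -> List[str]:
    """Return the file paths named in a traceback, in order, without duplicates."""
    seen: List[str] = []
    for path, _line in FRAME_RE.findall(trace):
        if path not in seen:
            seen.append(path)
    return seen


def suspect_commits(trace: str, commits: Dict[str, List[str]]) -> List[Tuple[str, int]]:
    """Rank commits by how many files from the trace they touched.

    `commits` maps a commit id to the files it changed (assumed input).
    Commits that touched none of the files in the trace are dropped.
    """
    trace_files = set(files_in_trace(trace))
    scored = [(sha, len(trace_files & set(changed))) for sha, changed in commits.items()]
    hits = [(sha, n) for sha, n in scored if n > 0]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Hypothetical example data, for illustration only.
trace = (
    "Traceback (most recent call last):\n"
    '  File "app/feed.py", line 10, in render\n'
    '  File "app/db.py", line 42, in query\n'
    "KeyError: 'pin'"
)
commits = {"a1": ["app/db.py", "app/feed.py"], "b2": ["README.md"], "c3": ["app/db.py"]}
```

Here the commit that touched both files in the trace ranks first. A fuller tool would likely also weigh line numbers and function names, which this sketch leaves out.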

Speakers

Joe Gordon

Pinterest
Joe is an SRE at Pinterest, where he works on homefeed and performance. He has previously spoken at numerous conferences such as EuroPython, LinuxCon and LCA (Linux Conference Australia).


Thursday August 31, 2017 14:30 - 15:00 IST
Meeting Rooms 1+2

15:00 IST

Startup Systems Engineer's Instruction Manual
What happens when you take the leap of faith and leave the security of a systems engineering team to become the first systems person at a startup? What should you expect?

This talk is about the challenges of being the sole systems engineer at a young company. The amount of work is overwhelming but the experience is worth it. We will explain the key elements of a newborn infrastructure and the stages leading it to maturity.

The challenges of this role are not limited to solving technical problems. The habits, processes and standards you establish will pave the way from a single engineer to a team.

Speakers

effie mouzeli

Systems Engineer, Logicea LLC
Systems Engineer at Logicea, a young software house. Main responsibilities are operations, automation (Deployment Pipelines, Configuration Management etc.), assist in product architecture, work closely with developers and occasionally, pull rabbits out of hats and chase them.


Thursday August 31, 2017 15:00 - 15:30 IST
Lansdowne Room

15:00 IST

The Never-Ending Story of Site Reliability
I strongly believe that the commonly proposed bogey-man of "automating ourselves out of a job" betrays a simplistic and highly incomplete understanding of the SRE field. SRE teams can always grow and develop.


The Dreyfus model is a model of professional expertise that plots an individual's progression through a series of five levels: novice, advanced beginner, competent, proficient, and expert. The idea in this talk is to take aspects of SRE practice (such as monitoring, measurement against SLOs, incident management and postmortems, etc) and provide indications of what these look like at the different Dreyfus levels - not so much at an individual level as at an organizational one.


Often, companies and teams will show uneven levels of proficiency - frequently due to pressures to develop some areas more than others. The intent of this talk is to provide a framework within which attendees can gauge their own company's progress and anticipate/plan for growing weak areas.


The concepts for this talk have emerged through exposure to companies at different skill/experience levels. It is not by any means definitive, but I hope to provide a useful rubric for discussion.

Speakers

Kurt Andersen

Program Committee, LinkedIn
Kurt Andersen was one of the co-chairs for SREcon-Americas in 2017 and 2018. He has been active in the anti-abuse community for over 20 years and is currently the senior IC for the Product SRE team at LinkedIn. He also works as one of the Program Committee Chairs for the Messaging... Read More →


Thursday August 31, 2017 15:00 - 15:30 IST
Pembroke Room

15:00 IST

Debugging at Scale—Going from Single Box to Production
It's very easy to launch a debugger on your dev box, attach to the right process and step through code. However, things are different when you need to debug an issue in production that's getting tens of thousands of requests per second. What if the issue reproduces only in production? How do you debug without affecting production traffic? What techniques can you use in your development to make it easier to debug issues? Does your application use tracing? What debug logs are written out to aid in analysis?


This talk will cover:


  1. Challenges with debugging in production

  2. Various approaches that are used in the industry

  3. Examples from Bing and Cortana incidents and steady state problems to illustrate the techniques

  4. Service design ideas that make services easier to debug


Speakers
Kumar Srinivasamurthy

Microsoft Corp
Kumar works at Microsoft and has been in the online services world for several years. He currently runs the Bing and Cortana Live site/SRE team. For the last several years, he has focused on growing the culture around live site quality, incident response and management, service hardening...


Thursday August 31, 2017 15:00 - 15:30 IST
Meeting Rooms 1+2

15:30 IST

Break with Refreshments
Thursday August 31, 2017 15:30 - 16:00 IST
Prefunction

16:00 IST

The History of How We Came to Be
The talk will feature research from international standardisation committees, the history of women, work, and computerization, and personal anecdotes of the mainframe era to construct a theory as to how and why engineering and operations split apart. We will also examine the issues with an eye to gender and workplace politics.

In the course of this talk, we will look at:


  • The earliest jobs in computers (~40s). What were they, why did they exist, and who held them?

  • What different and distinct things happened in the 50s, 60s, 70s, 80s, 90s, and noughties?

  • What was the difference between the keypunch operator and the computer operator?

  • What was unglamorous about programming in the 40s, such that women were allowed to do it?

  • What factors led to women being crowded out of programming as a profession?

  • Why did system administration come into being as a job family, and why was it minimally viable to separate that from programming?


Speakers
Niall Murphy

Microsoft
Niall Richard Murphy is Director of Engineering for Microsoft Azure Ireland, where his group works on cloud engineering systems and SRE. He is the instigator, co-author, and co-editor of two books on SRE, and a history of the Irish Internet. He is the holder of degrees in Computer...


Thursday August 31, 2017 16:00 - 16:30 IST
Lansdowne Room

16:00 IST

Gamifying Reliability Excellence—The Service Score Card
What makes a “good” service is a moving target. Technologies and requirements change over time, and it can be impossible to ensure that none of your services has been left behind. The Service ScoreCard approach is to have a small check for each service initiative we have. A check can be anything measurable: deployment frequency, whether everyone on the oncall team has a phone, or whether the service runs the latest version of the JVM. The Service ScoreCard gives each service a grade from 'F' to 'A+', based on passing or failing the list of checks. As soon as anyone sees a service's grade slipping, everyone rallies to improve it. We can then set up rules based on the grades: “Only B and above services can deploy 24/7”, “moratorium on services without an A+”, or “No SRE support for services below C grade”.

Speakers
Daniel Lawrence

LinkedIn
Daniel will fix anything with python, even if it's not broken. He is an Aussie on loan to LinkedIn in the USA as an SRE, focusing on looking after the jobs and recruiting services. When he is not working on tricky problems for LinkedIn, he plays _a lot_ of video games.


Thursday August 31, 2017 16:00 - 16:30 IST
Pembroke Room

16:00 IST

Fast and Safe Production Monitoring of JVM Applications with BPF Magic
All of us have seen these evasive performance issues or production bugs in the field, which standard monitoring tools don't see or catch. BPF is a Linux kernel technology that enables fast, safe, dynamic tracing of a running system without any preparation or instrumentation in advance. The JVM itself has a myriad of insertion points for tracing garbage collections, object allocations, JNI calls, and even method calls with extended probes. When the JVM tracepoints don't cut it, the Linux kernel and libraries allow tracing system calls, network packets, scheduler events, off-CPU time, time blocked on disk accesses, and even database queries. In this talk, we will see a holistic set of BPF-based tools for monitoring JVM applications on Linux, and revisit a systems performance checklist that includes classics like fileslower, opensnoop, and strace—all based on the non-invasive, fast, and safe BPF technology.

Speakers
Sasha Goldshtein

CTO, Sela Group
Sasha Goldshtein is the CTO of Sela Group, a Microsoft Regional Director and MVP, Pluralsight and O’Reilly author, and international consultant and trainer. Sasha is the author of two books and multiple online courses, and a prolific blogger. He is also an active open source contributor...


Thursday August 31, 2017 16:00 - 17:00 IST
Meeting Rooms 1+2

16:30 IST

Hiring SREs May Be Literally Impossible
If we're gonna do this SRE thing, we need to find the right people to do it.


After a few recent discussions, it became clear just how much everyone—at large companies and small—is struggling to find those people.


You can barely get enough applicants in the door, and by the time you've run your interview process you're left making a handful of offers.


Hiring SREs from the outside world is a competitive, expensive game to play. So why focus so much on people outside your company? You've got potential SREs sat all around you!


In this talk, we'll set the scene with a little look at the realities of hiring SREs. We won't stay there for too long though, because that's not what's going to save us!


The bulk of the talk will be spent looking at ways to discover budding SREs in your organisation, how to nurture their interest, and how to coach them in a role that's new to them.

Speakers
Chris Sinjakli

SRE, PlanetScale
Chris enjoys working on the strange parts of computing where software and systems meet. He especially likes the challenges of databases and distributed systems. All his programs are made from organic, hand-picked, artisanal keypresses.


Thursday August 31, 2017 16:30 - 17:00 IST
Lansdowne Room

16:30 IST

Incident Management and Chatops at Shopify
SREs are expected to be incident management experts. Yet, incident handling is hard, often messy, and exhausting. We encounter new incidents, look everywhere for possible explanations, sometimes tunnel on symptoms, and, under pressure, forget some good practices.


At Shopify, we care not only about handling incidents quickly and efficiently, but also about SRE well-being. We have a special IMOC (Incident Manager On Call) rotation and an incident chatbot to assist IMOCs. In this talk, I’ll first explain the IMOC role and how training SREs for this duty is essential to handling incidents well.


Our chatbot assists the IMOC by reducing manual effort and context switching. We integrated the bot with our conversation tool and several third-party tools (PagerDuty, StatusPage, GitHub) to send timely reminders. It also binds the incident to a discussion channel where all communications happen, allows status page updates directly from the chat room, keeps notes and records event times, and generates service disruption content. To avoid burnout for long-running incidents, the chatbot also reaches out to other IMOCs.


Our chatbot supports best practices and "streamlines" incident response. Attendees will leave with strategies for incorporating chatbots into their incident management and considerations for automating precisely and smartly.

Speakers
Daniella Niyonkuru

Shopify
Daniella Niyonkuru is a Production Engineer at Shopify where she helps build a better, faster and more resilient platform. Previously, Daniella worked as an Aircraft System Software Specialist, and researched Formal Model Driven Development for Embedded Systems.


Thursday August 31, 2017 16:30 - 17:00 IST
Pembroke Room

17:00 IST

Lightning Talks
  • 6 Ways a Culture of Communication Strengthens Your Team’s Resiliency
    Jaime Woo, Shopify
  • Dynamic Documentation in 5 minutes
    Daniel Lawrence, LinkedIn
  • Resource management and isolation, the non-shiny way
    Luiz Viana, Demonware
  • Collecting metrics with Snap - the open telemetry framework
    Guy Fighel, SignifAI
  • Decentralized Data
    Jason Koppe, Indeed

Thursday August 31, 2017 17:00 - 18:00 IST
Pembroke and Lansdowne Rooms

18:00 IST

Conference Reception
Sponsored by Circonus

Thursday August 31, 2017 18:00 - 19:30 IST
Herbert Room

19:30 IST

Developing Software Skills
Thursday August 31, 2017 19:30 - 21:00 IST
Meeting Rooms 1+2

19:30 IST

Psychological Safety in SRE
Speakers
John Looney

Intercom
John Looney did 24x7 support for a webhosting company, spent nearly 12 years in Google as an SRE (compute, storage, datacenters and Ads) as well as running team-build courses. He is now applying SRE to Intercom's infrastructure. He is passionate about ensuring that engineers know...


Thursday August 31, 2017 19:30 - 21:00 IST
Pembroke Room

19:30 IST

The SRE Book
Speakers
Niall Murphy

Microsoft
Niall Richard Murphy is Director of Engineering for Microsoft Azure Ireland, where his group works on cloud engineering systems and SRE. He is the instigator, co-author, and co-editor of two books on SRE, and a history of the Irish Internet. He is the holder of degrees in Computer...


Thursday August 31, 2017 19:30 - 21:00 IST
Lansdowne Room
 
Friday, September 1
 

08:00 IST

Morning Coffee and Tea
Friday September 1, 2017 08:00 - 09:00 IST
Prefunction

09:00 IST

Tech Writing: How a Writer Can Make Your Life Easier, and Your Work Have More Impact!
The sparsely-attended SREcon17 Americas Tech Writing talk focused on HOW to work with Tech Writers. I'm instead focusing on WHY you should work with a TW--because they make your life easier, and can make the work you're already doing have more impact.

Reasons to engage a TW:


  • If you need to explain your product/service/etc. to users: Chances are, the most satisfying, enjoyable, and rewarding part of your job is engineering work—creating a tool, fixing a problem, redesigning infrastructure, etc. It *isn't* answering the same questions from users over and over. → Get solid documentation in place to free up engineer time.

  • If your team's internal documentation is a mess: It might be hard to find the docs you need when you get paged, hard to identify current content, or some information might just be flat-out missing. → A TW can help you whip a documentation rat's nest into shape, and give you the tools to maintain your docs easily moving forward.

  • If you want to make your work more visible (so other people can leverage it and learn from it): → A TW can help you get that information out there!


Speakers
Betsy Beyer

Google
Betsy Beyer is a Technical Writer for Google Site Reliability Engineering in NYC. She has previously written documentation for Google Datacenters and Hardware Operations teams. Before moving to New York, Betsy was a lecturer on technical writing at Stanford University. She holds degrees...


Friday September 1, 2017 09:00 - 09:30 IST
Lansdowne Room

09:00 IST

Building an On-Premise Kubernetes Cluster For a Large Web Application
Recently, Shopify began migrating from our custom container management system to Kubernetes. This switch will make us more efficient at running our large Rails monolith, as well as the current and future microservices that run alongside. The first step in migrating was building a cluster using our own hardware. Running Kubernetes on-premise requires building services that cloud providers hide from their customers: Etcd, high-availability master nodes, scalable networking, Ingress, and persistent storage. We believe that understanding the challenges and tradeoffs in providing these services is beneficial to not only those who run their own cluster, but also to those who use cloud providers.


Beyond building the cluster, we also had to modify our core application and tooling to fit Kubernetes’ container-centric framework. We expect that most applications currently on homegrown deployment systems will have to similarly overcome host-based assumptions. In our case: unbounded jobs, hard coded assumptions about hosts, and services exposed to external monitoring tools via global DNS.


Attendees will leave this talk equipped to decide if running their own Kubernetes cluster is right for them and how to make the shift as successful as possible.

Speakers
Danny Turner

Production Engineer, Shopify
Daniel Turner is a Production Engineer at Shopify. He is part of the team building our Kubernetes clusters as well as maintaining Shopify’s data centers.


Friday September 1, 2017 09:00 - 09:30 IST
Pembroke Room

09:00 IST

The EU's New Data Protection Law - a Survival Guide
What data do you hold?
Are you processing the data, or controlling it?
Do you have the consents to use that data like that?
Do you have a register of all that data and every way you use it, and what for?
Can you find every piece of data you hold that relates to an individual, copy it and send it to them—for free—within 30 days?
What happens when they say they want it erased?
The General Data Protection Regulation (GDPR) comes into force on 25 May 2018. New powers mean regulators can impose fines for breaches of up to 4% of annual turnover. This workshop is for anyone trying to make sure that their organisation isn't in breach by the implementation date.
GDPR isn't just a compliance project. It's a business culture change project. Let's struggle our way through together.

Speakers
John Looney

Intercom
John Looney did 24x7 support for a webhosting company, spent nearly 12 years in Google as an SRE (compute, storage, datacenters and Ads) as well as running team-build courses. He is now applying SRE to Intercom's infrastructure. He is passionate about ensuring that engineers know...
Simon McGarr

M3AAWG Senior Advisor, Data Compliance Europe
Simon McGarr is a lawyer with McGarr Solicitors in Dublin, and the managing director of Data Compliance Europe, a global consultancy on GDPR and data protection matters. He's a Senior Policy Advisor for M3AAWG and a guest lecturer with the European Academy of Law in Trier as well...


Friday September 1, 2017 09:00 - 12:00 IST
Meeting Rooms 1+2

09:00 IST

Statistics for Engineers
Statistics is the art of extracting information from data. In this workshop, we will visit the statistical methods that are relevant for operating modern IT infrastructures. Containerized cloud architectures are incredibly difficult monitoring targets. Creating probabilistic models of the behaviors of these systems that can be used for reliable predictions is a very difficult task. In fact, it's so difficult that I don't think anyone has done that yet. We will certainly not try to here.

Instead, we will take a different path in this workshop, and talk about statistical methods that are known to work and provide value in your daily job as an SRE. In this workshop you will learn:


  • How to measure the quality of APIs you provide and consume.

  • How to interpret the telemetry data that is emitted from the systems you are running.

  • How to aggregate metrics from single nodes to service-level views.


Topics we will cover in depth include: data visualisation, averages, percentiles, histograms, regressions, robustness and mergeability. We will cover the material from a theoretical and a practical perspective. Bring pen and paper as well as your laptop!

Speakers
Heinrich Hartmann

Analytics Lead, Circonus
Heinrich Hartmann is the Analytics Lead at Circonus. He is driving the development of analytics methods that transform monitoring data into actionable information as part of the Circonus monitoring platform. In his prior life, Heinrich pursued an academic career as a mathematician...


Friday September 1, 2017 09:00 - 12:00 IST
Meeting Room 9

09:30 IST

Distributed Systems, Like It or Not
Over the last twenty years, complex distributed systems have been deployed to solve the leading challenges in the systems resiliency and robustness realm. At this point in systems architecture design, distributed systems are everywhere in everything; even the most simple architectures incorporate distributed software and carry with that the failure scenarios they bring.

SREs are put in an even more complicated situation, because of their wide net of responsibilities, to manage distributed systems of distributed systems. Things can and will go wrong and one of the fundamental skills for SREs going forward will be strong distributed systems reasoning skills.

In this talk we discuss the types of failure scenarios that distributed systems bring with them (with anecdotes) and develop various reasoning skills that can be used to tackle these challenges with increased confidence.

Speakers
Theo Schlossnagle

Circonus
Theo Schlossnagle is the founder and CEO of Circonus. Previously, he founded OmniTI, the go-to source for organizations facing today’s most challenging scalability, performance, and security problems; was the Founder of Message Systems, Inc. now Sparkpost; and researched distributed...


Friday September 1, 2017 09:30 - 10:00 IST
Pembroke Room

09:30 IST

Postmortem Action Items: Plan the Work and Work the Plan
We discuss best practices and challenges for developing high-quality action items (AIs) for a postmortem, plus methods of ensuring these AIs actually get implemented.

Speakers
John Lunney

G Suite SRE, Google
John Lunney is a Senior Site Reliability Engineer at Google Zürich. His team manages G Suite, productivity apps for Enterprise customers. He holds a degree in Computational Linguistics from Trinity College in Dublin, Ireland. Before Google, he worked on several lexicography projects...


Friday September 1, 2017 09:30 - 10:30 IST
Lansdowne Room

10:00 IST

Avoiding and Breaking Out of Capacity Prison
Capacity management at any scale has many moving pieces and requires a range of activities from capacity forecasting to emergency response. Capacity issues can directly impact your service scalability, performance and availability. Lead time to acquire new capacity can make a capacity management plan as important as your service monitoring. Being prepared can help ensure a great customer experience even during difficult times.


In this talk, we will present a comprehensive set of activities necessary to execute a capacity management plan for a storage service of any size. We will discuss learnings from Microsoft Azure Storage - one of the largest and fastest growing storage systems on the planet and how SREs used code to proactively scale and remove complex manual effort and toil through automation. The work here has resulted in an improved customer experience, better work/life balance and reduced cost.

Speakers
Jake Welch

Principal Software Engineer, Microsoft
Jake Welch is a Site Reliability Engineer on the Microsoft Azure team in NYC. He has worked on large scale services for a decade, in both dev and operational roles. At Microsoft, he primarily works on infrastructure services with focus on Storage and Security.


Friday September 1, 2017 10:00 - 10:30 IST
Pembroke Room

10:30 IST

Break with Refreshments
Friday September 1, 2017 10:30 - 11:00 IST
Prefunction

11:00 IST

Service with an Angry Smile: Passive-Aggressive Behavior in SRE
Awareness and discussion of psychological safety as a key ingredient for productive and successful teams has grown recently, thanks to media coverage and pioneering research by companies and scholars. While flagrant forms of disrespect like angry shouting and insults obviously threaten psychological safety in teams, so too can passive-aggressive behaviors such as complaining, pouty silence, “forgetting” to complete tasks, and stubbornness.


In the SRE context, passive-aggressive behaviors can have disastrous consequences. These include outages and incidents that could have been avoided with better preparation or notification; narrowly focused quick-fixes instead of systemic, long-term maintenance efforts; blame instead of solutions-oriented post-mortems; and refusal to share knowledge. In many cases, few or no words are spoken; silent resistance is the hostile act.


This talk will bring attention to passive-aggressive behavior as a set of hostile acts that one can (and should) identify, manage and overcome in the tech/SRE environment. For context, it will draw upon psychological research, history, a bit of pop culture for fun, and anecdotes from SREs. And for guidance, you’ll hear some tried-and-true agile and communications methods for managing or eliminating passive-aggressive behavior in your teams and interactions.

Speakers
Lauri Apple

Agile coach/Project manager, Zalando
Based in Berlin, Lauri Apple develops and evangelizes Zalando’s open source efforts. She's also a producer/agile project manager for the company's core search engineering team and co-leads Zalando’s InnerSource initiative. Before joining Zalando, Lauri was the tech evangelist...


Friday September 1, 2017 11:00 - 11:30 IST
Lansdowne Room

11:00 IST

Run Less Software; Use Less Bits
At Intercom, we believe that to:


  • improve availability and reduce risks,

  • save time and money,

  • improve operability,

  • and move fast for the long term,


we should build and run Intercom using the smallest sensible set of core infrastructure components:


  • we are cautious about adding new technologies to the mix

  • we’d often rather consider using one of our existing/established technology components and write (and maintain) more software ourselves, rather than take on the overhead of learning and maintaining expertise in a new / more powerful technology.

  • where tools, systems or workflows are required to support building/managing Intercom, especially in areas that could be deemed “undifferentiated heavy lifting”, we’d rather not write any software or operate any systems ourselves at all, and instead use world-class 3rd-party services.



We summarise these beliefs in an infrastructure design principle we call “run less software, use less bits”.


In this talk we use some real examples to go deep on how we use this principle to make hard decisions and good, informed, deliberate trade-offs.

Speakers
Rich Archbold

Director of Engineering, Intercom
Richard Archbold is an Engineering Director at Intercom, a highly successful and fast growing Irish technology startup company that provides customer communication software to Internet businesses. Intercom's mission is to make web business personal. Previous to Intercom, Richard has...


Friday September 1, 2017 11:00 - 11:30 IST
Pembroke Room

11:30 IST

The Cult(ure) of Strength
"Strength," "Courage," and "Bravery" are virtues often heaped upon individuals undergoing hardship. These compliments come from a deep-rooted cultural value that sacrifice should be praiseworthy and that performing in the face of difficulty is a sign of virtue. In tech, strength is valued to the point of caricature, creating a culture of depersonalization and overwork that disproportionately affects people who by their identities or job descriptions are asked too often to "take one for the team."

Through the lens of my 15+ year journey through the STEM pipeline, I'll talk about the culture of strength and how we can better set expectations to manage hardship and workload in the workplace or community.

Speakers
Emily Gorcenski

Simple
A data scientist and technologist with a background in aeronautical engineering, plasma physics, and biotechnology, Emily likes exploring the intersection of society and technology and is driven towards building good technological citizenship.


Friday September 1, 2017 11:30 - 12:00 IST
Lansdowne Room

11:30 IST

Monitoring Cloudflare's Planet-Scale Edge Network
Cloudflare operates a global anycast edge network serving content for 6 million web sites. This talk explains how we monitor our network, how we migrated from Nagios to Prometheus and the architecture we chose to provide maximum reliability for monitoring. We'll also discuss the impact of alert fatigue and how we reduced alert noise by analysing data, making alerts more actionable and alerting on symptoms rather than causes.


This talk will cover:



  • The challenges of monitoring a high volume, anycast, edge network across 100+ locations

  • The architecture we chose to maximise the reliability of our monitoring

  • Why Prometheus excels as the new industry standard for modern monitoring

  • Approaches to reducing alert noise and alert fatigue

  • Triaging alerts into a ticket system
  • Analysing past alert data for continuous improvement

  • The pain points we endured

  • Effecting change across engineering teams


Speakers
Matt Bostock

Platform Operations, Cloudflare
Matt is a Platform Operations engineer at Cloudflare, where he has spent the last year promoting a monitoring utopia. He was previously tech lead for the GOV.UK Infrastructure team and is a keen contributor to open source software. He also loves bacon, avocado, running, and the Oxford...


Friday September 1, 2017 11:30 - 12:00 IST
Pembroke Room

12:00 IST

Conference Luncheon
Friday September 1, 2017 12:00 - 13:00 IST
Sussex Restaurant and Herbert Room

13:00 IST

Panel: AMA for New SREs
If you're new to SRE, or considering becoming an SRE, and you have questions, come to this session. You'll get the opportunity to ask a variety of experienced SREs for their opinion on topics related to SRE teams and culture, hiring, oncall, troubleshooting, performance, release management, and more.

Speakers
Ola Klapcinska

Google
Ola has been a Site Reliability Engineer at Google London for three years. She has been SREing at Ads and Cloud fronted teams, and most recently focusing on Monitoring. When not at work, she roams around Europe and occasionally other continents.
John Looney

Intercom
John Looney did 24x7 support for a webhosting company, spent nearly 12 years in Google as an SRE (compute, storage, datacenters and Ads) as well as running team-build courses. He is now applying SRE to Intercom's infrastructure. He is passionate about ensuring that engineers know...
Gráinne Sheerin

Google
Gráinne is a Site Reliability Engineer for Google Ireland. She's a tech lead responsible for Ad Serving infrastructure and has 5 years of experience in production engineering. She is a physicist who earned a doctorate in Nanoscience from Dublin City University. Prior to Google, she masqueraded...
Chris Sinjakli

SRE, PlanetScale
Chris enjoys working on the strange parts of computing where software and systems meet. He especially likes the challenges of databases and distributed systems. All his programs are made from organic, hand-picked, artisanal keypresses.


Friday September 1, 2017 13:00 - 14:00 IST
Meeting Room 8

13:00 IST

Monitoring Design Principles
In this presentation we'll re-examine monitoring to understand how to formulate valuable goals and align monitoring design and implementation with those goals. With a focus on outcomes, and on the behavior that leads to them, we'll concentrate on performance data rather than security monitoring.


Attendees will learn to ask the right questions when approaching the monitoring of systems and businesses. They will understand why and how monitoring should fit into the overall systems architecture to reduce risk and increase value.

Speakers
Theo Schlossnagle

Circonus
Theo Schlossnagle is the founder and CEO of Circonus. Previously, he founded OmniTI, the go-to source for organizations facing today’s most challenging scalability, performance, and security problems; was the Founder of Message Systems, Inc. now Sparkpost; and researched distributed...


Friday September 1, 2017 13:00 - 14:00 IST
Lansdowne Room

13:00 IST

And the CFO Wept: AWS Cost Control
"I'll just spin up an instance to test something this afternoon" you say with the best of intentions. Unbeknownst to you, you'll retire before that instance does. In this hilarious talk, Corey delves into the details of how the AWS bill goes from a few cents an hour into something suspiciously reminiscent of a phone number.


From low hanging fruit to weird Amazon billing gotchas, this talk serves as a survey of all of the sharp edges around Amazon's least understood product—its ridiculous monthly bill.

Speakers
Corey Quinn

Editor, Last Week in AWS
Corey is a Cloud Economist at the Quinn Advisory Group and an advisor to ReactiveOps. He has a history as an engineering director, public speaker, and cloud architect. Corey specializes in helping companies address horrifying AWS bills, hosts the "Screaming in the Cloud" podcast...


Friday September 1, 2017 13:00 - 14:00 IST
Pembroke Room

13:00 IST

CRE: Expanding SRE to inside Your Customer's Organisation
The Cloud is a scary place. You're trusting your entire business to a platform out of your control. This applies to any platform, any SaaS, PaaS, IaaS provider. If they have an outage it's out of your hands!


Introducing CRE: Expanding SRE to inside your customer's organisation. Making a reliable system for your own business is one thing, but doing it when you are providing a platform to another company is a new and exciting area.


Learn how Google is exploring Customer Reliability Engineering, sharing everything from design decisions to monitoring.

Speakers
Stephen Thorne

Google
Stephen is a Senior Site Reliability Engineer working at Google, and a founding member of the EMEA Customer Reliability Engineering team. He writes a blog at medium.com/@jerub about the SRE book, and thoroughly enjoys experiencing exciting new failures and making sure they never happen...


Friday September 1, 2017 13:00 - 14:00 IST
Meeting Rooms 1+2

13:00 IST

Being an Effective Ally to LGBTQ+, Non-Binary, Women, and PoC in the Tech Industry
There are a lot of white, male heterosexuals in the tech industry; this demographic makes up the majority by a large margin. This is also the demographic which holds the most privilege.


Homogenization in our industry is bad. We need the creative ideas and mindset that diversity brings to allow us to innovate and build amazing things.


Those with privilege and power need to understand it, and learn how to use it to become an ally to those people who are in marginalized parts of our industry, to help create safe and welcoming spaces where people can feel that they can express themselves and be celebrated for their differences.


This talk will unpack privilege and discuss how you can be an effective ally to those who have no voice.

Speakers
Chris Stankaitis

The Pythian Group
Working for Pythian, Chris builds and manages high performing SRE and Hadoop Teams which are globally distributed (follow the sun) and remote (work from home) based. Working with companies from startup to web-scale, Chris's teams keep many of the sites and services people use on...


Friday September 1, 2017 13:00 - 14:00 IST
Meeting Room 9

14:00 IST

Break with Refreshments
Friday September 1, 2017 14:00 - 14:40 IST
Prefunction

14:40 IST

Have You Tried Turning It Off and Turning It On Again?
Most of us have a backup strategy, many of us have a restore strategy and several of us have a fully tested restore strategy. But backups are not the whole story. I'll talk about the parts of disaster recovery we're less prepared for, and dependencies that you might not think about until one day when you really do turn an entire service, entire site or (perish the thought!) an entire company off and on again.


This talk will cover managing complexity, testing your fallback plan and avoiding dependency cycles that make it impossible to restart groups of systems. Like, where do you store the documentation on how to recover the documentation server?

Speakers

Tanya Reilly

Squarespace
Tanya Reilly has been a Site Reliability Engineer at Google since 2005, working on low level infrastructure like distributed locking, load balancing, and bootstrapping. Before Google, she worked as a Systems Administrator at eircom.net, Ireland's largest ISP, and before that she was... Read More →


Friday September 1, 2017 14:40 - 15:30 IST
Pembroke and Lansdowne Rooms

15:30 IST

100 Teams, 100 Ways to Fail
Every SRE organization hits the same problems at some point: How do we convince teams to let us help, and own the work and results together? As an SRE, you will encounter different kinds of resistance from the teams you work with.


Azure has 100+ teams, and Azure SRE has gained experience with every type of engagement on the map. If any of these scenarios sound familiar to you:

  • Engineers that do not understand your utopian visions (nobody understands me)!
  • Everyone is rational, no one is right
  • “This too shall pass” i.e. the team that knows you will eventually sort out the rough edges in your tooling or find another job, and can safely ignore you until then

Come join Azure SRE as we share stories about teams we’ve worked with, the resistance we’ve run into, and sometimes even how we fixed it.

Speakers

John Keiser

Microsoft Azure
John Keiser is a Mad Scientist of the internet age, having developed, tested and led teams for the last 20 years at places like Netscape, Bing, and Chef. Microsoft Azure now lets him play with their service, having been convinced he wouldn’t rewrite anything too critical. If his... Read More →

Ben Broderick Phillips

Microsoft Azure
Ben Broderick Phillips spent the first five years of his career automating infrastructure in a “worldwide datacenter,” which he later found out was just the server closet. After they let him out, he went to work for Microsoft, who gave him his very own desk and asked him to build... Read More →


Friday September 1, 2017 15:30 - 16:20 IST
Pembroke and Lansdowne Rooms

16:20 IST

Persistent SRE Antipatterns: Pitfalls On the Road to Creating a Successful SRE Program Like Netflix and Google
What isn’t Site Reliability Engineering? Does your NOC escalate outages to your DevOops Engineer, who in turn calls your Packaging and Deployment Team? Did your Chef just sprinkle some Salt on your Ansible Red Hat and call it SRE? Lots of companies claim to have SRE teams, but some don’t quite understand the full value proposition, or what shiny technologies and organizational structures will negatively impact your operations, rather than empowering your team to accomplish your mission.


You’ll hear stories about anti-patterns in Monitoring, Incident Response, Configuration Management, and more that we’ve tripped over in our own teams, seen actually proposed as good practice in talks at other conferences, and heard as we speak to peers scattered around the industry. We'll also discuss how Google and Netflix each view the role of the SRE, and how it differs from the traditional Systems Administrator role. The talk also explains why freedom and responsibility are key, trust is required, and when chaos is your friend.

Speakers

Blake Bisset

Dropbox
Blake got his first legal tech job at 16, long enough ago that he’s entitled to make shakeyfists while shouting “Get off my LAN!” He did three startups (a Dupont/ConAgra venture; a UW biotech spinoff; and this other time some kids were sitting around New Year's Eve, wondering... Read More →

Jonah Horowitz

Site Reliability Engineer
Jonah is a Senior Site Reliability Engineer with 18 years of experience building and scaling production applications. He's worked at several startups and large companies including Quantcast, Netflix, and Stripe.


Friday September 1, 2017 16:20 - 17:10 IST
Pembroke and Lansdowne Rooms

17:10 IST

Closing Remarks
Speakers

Avishai Ish-Shalom

Engineer in Residence, Aleph VC
Avishai is a veteran operations and software engineer with years of high-scale production experience. At present, Avishai helps growing startups and the Israeli high-tech ecosystem as Engineer in Residence at the Aleph VC fund. In his spare time, Avishai is spreading weird ideas and... Read More →

Laura Nolan

Stanza
Laura Nolan is a software engineer and SRE. She has contributed to several books on SRE, such as the Site Reliability Engineering book, Seeking SRE, and 97 Things Every SRE Should Know. Laura is a Principal Engineer at Stanza, where she is building software to help humans understand... Read More →


Friday September 1, 2017 17:10 - 17:20 IST
Pembroke and Lansdowne Rooms
 