
Note: Meeting Room 7 will be available as an On-Call Room for attendees.

Track 3
Wednesday, August 30
 

11:00 IST

Load-Shedding: Overview of Different Methodologies
This talk gives an inventory and overview of the different methods for dealing with load-shedding and overload in production stacks, including an overview of the methods developed at Google and the open-source solutions.

We'll review the pros and cons, scope and effort levels of each method, and compare with existing approaches, including circuit-breakers.
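As a rough illustration of the contrast drawn above, here is a minimal sketch of a server-side load shedder next to a client-side circuit breaker. It is not any of the methods presented in the talk; the class names, the in-flight limit, and the failure threshold are all assumptions made for this example, and the breaker omits the timed "half-open" retry that real implementations usually have.

```python
import threading


class LoadShedder:
    """Server side: reject new work once too many requests are in flight."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight  # hypothetical capacity limit
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        """Return True if the request may proceed, False if it is shed."""
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                return False
            self._in_flight += 1
            return True

    def release(self):
        """Call when a previously admitted request has finished."""
        with self._lock:
            self._in_flight -= 1


class CircuitBreaker:
    """Client side: stop calling a backend after consecutive failures."""

    def __init__(self, failure_threshold):
        self.failure_threshold = failure_threshold  # hypothetical threshold
        self._failures = 0

    def allow(self):
        """Return True while the failure streak is below the threshold."""
        return self._failures < self.failure_threshold

    def record(self, success):
        """Reset the streak on success, extend it on failure."""
        self._failures = 0 if success else self._failures + 1
```

The shedder protects the server based on its own load, while the breaker protects the caller based on observed failures, which is one reason the two are often compared.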

Speakers

Acacio Cruz

Director - Frameworks & Production Platforms, Google
Acacio has been an SRE manager since 2007, and manager of Google's Load-shedding & Traffic Management team since 2009. He is now a SWE Director in Frameworks and Software Infrastructure.


Wednesday August 30, 2017 11:00 - 12:00 IST
Meeting Rooms 1+2

12:00 IST

Managing SSH Access without Managing SSH Keys
Everyone uses SSH to manage their production infrastructure, but it's really difficult to do a good job of managing SSH keys. Many organisations don't know how many SSH keys have access to production systems or how protected those keys are. A trusted SSH private key can be years old, unprotected by passphrase, and shared among multiple people who may not even work for you.

With some tooling and configuration, SSH keys can be replaced with limited-use ephemeral certificates that are issued centrally, with better access controls and automatic key expiration, solving many of the shortcomings of using SSH keys.


This talk will cover:


  • Managing SSH keys: The bad parts

  • Replacing SSH keys with ephemeral certificates: how & why

  • Discussion of an implementation of a CA for SSH certificates

  • Call for participation, showing GitHub source
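To make the idea of short-lived certificates concrete, here is a minimal sketch that builds an OpenSSH `ssh-keygen` signing command. It is not the CA implementation discussed in the talk; the function name, file paths, identity, and principal are hypothetical, and only the `-s`, `-I`, `-n`, and `-V` options of `ssh-keygen` are relied upon.

```python
def signing_command(ca_key, user_pubkey, identity, principals, valid_for="+1h"):
    """Build the argv for signing a user public key with an SSH CA key.

    All argument values are supplied by the caller; nothing is executed here.
    """
    return [
        "ssh-keygen",
        "-s", ca_key,                 # CA private key used to sign
        "-I", identity,               # certificate identity, shown in server logs
        "-n", ",".join(principals),   # usernames the certificate is valid for
        "-V", valid_for,              # validity interval; "+1h" expires in one hour
        user_pubkey,                  # public key file to sign
    ]


# Hypothetical paths and names, for illustration only.
cmd = signing_command("ca_key", "id_ed25519.pub", "alice@example.com", ["deploy"])
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would write the certificate next to the public key (here `id_ed25519-cert.pub`). Because the certificate expires on its own, there is no long-lived trusted key to track or revoke.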


Speakers

Niall Sheridan

Senior Systems Engineer, Intercom
I love a good disaster


Wednesday August 30, 2017 12:00 - 12:30 IST
Meeting Rooms 1+2

13:40 IST

Networks for SREs: What Do I Need to Know for Troubleshooting Applications
All of us depend on the underlying network to be stable, whether in the datacenter or in the cloud. We all have a basic knowledge of how traditional networks run. However, in the past 10 years we’ve moved to building redundant physical topologies in our networks, optimized the routing methodologies accordingly, moved into the cloud, and gotten greater visibility and tunables in the Linux kernel network stack. A lot has changed!

However, the way we troubleshoot the network in relation to the applications we support hasn’t adapted. In this session, we’ll review the progress that network infrastructure has made, look at specific examples where traditional troubleshooting responses fail us, and demonstrate our need to rethink our approach to making applications and the network interact harmoniously.

Speakers

Michael Kehoe

Staff SRE, LinkedIn
Michael Kehoe is a Staff SRE at LinkedIn who works on building scalable monitoring infrastructure, reliability principles, and incident management. Michael previously interned at NASA Ames on their PhoneSat project. Michael's key interests lie in network engineering and automatio...


Wednesday August 30, 2017 13:40 - 14:30 IST
Meeting Rooms 1+2

14:30 IST

Anycast Is Not Load Balancing
We'll discuss IP anycast (what it is, how it works), what use cases it's more or less suited to, and some of the complexity it introduces (complete with war stories).

Speakers

Murali Suriar

Google
Lapsed computer science graduate, turned network engineer, turned SRE. Currently working at Google running software defined network control systems. Left Google to get on a boat. Got bored and came back.


Wednesday August 30, 2017 14:30 - 15:00 IST
Meeting Rooms 1+2

15:40 IST

Bots Are Fast, Humans Are Smarter—Eliminate Unwanted Traffic and Defend Against DDoS
In a world with ever-growing DDoS attacks, L7 attacks give even the most experienced engineers the sweats. Imagine if, instead of following easy-to-detect patterns, bots could mimic the behaviour of customers. Well, that’s exactly what Shopify sees every day during flash sales.

Come and learn how we block nearly all bot traffic on our load balancers without any human intervention. We will share our challenges of differentiating between web crawlers and bots, users behind NATs and bots rotating user agents, as well as fast humans and browser extensions. When the stakes are blocking a customer completing a checkout, misclassification isn’t an option.

This is not yet another machine learning talk, but an example of how simple statistics, heuristics and some sane limits can give great results with minimal complexity. The lessons learned in this talk are applicable to any real-world problem with inexact constraints.

Speakers

Felix Glaser

Senior Production Security Engineer ☁️ 生产安全工程师 ☁️, Shopify
Felix likes to climb, cycle, and code in Canada. The first two outside and the other one at Shopify, where he works on securing containers and their deployment into the cloud.


Wednesday August 30, 2017 15:40 - 16:10 IST
Meeting Rooms 1+2

16:10 IST

Google SDN Peering: An Early Engagement Case Study
How do you build a new SRE team around a completely novel product? This talk will deal with some of the challenges involved in launching Espresso, Google's software defined peering architecture.


  • How do you build an SRE team for a product which isn't serving real users yet?

  • How do you build a cohesive team and structure out of many disparate teams? (Networking, SRE, software development)

  • How do you build oncall discipline in a team which largely hasn't been oncall before?


And as an aside, we'll also get into some of the technical details of Espresso, since it's necessary to understand what made it so challenging and different.

Speakers

Murali Suriar

Google
Lapsed computer science graduate, turned network engineer, turned SRE. Currently working at Google running software defined network control systems. Left Google to get on a boat. Got bored and came back.


Wednesday August 30, 2017 16:10 - 17:00 IST
Meeting Rooms 1+2
 
Thursday, August 31
 

09:00 IST

Deploying Changes to Production in the Age of the Microservice
You decoupled your APIs from their implementations and put them behind RPC interfaces. You build and deploy services independently. Your code health is impeccable. You put your user data in a persistent, replicated, and consistent store, where it belongs. Your developer velocity has skyrocketed.

Now we have new problems. We’ve got N independent services with M edges of interaction between them. That’s N services that need to be built, tested, and deployed on the infrastructure that expected you to have one service whose mess of entanglement was a secret you had with the compiler.

How do we deploy N binaries with N sources of static configuration and M sources of runtime configuration safely without losing our collective minds? In this talk, I’ll share some of how we grew that aforementioned N from 1 to many in Gmail. Specifically:


  • Consistent naming schemas for services, environments
  • Maintaining lightweight, easy-to-change production configuration abstraction layers
  • Release early, often
  • Canary everything by sharding into more A/B environments than you'd think you’d need
  • Encourage backwards compatibility in all APIs
  • Validate and test all configuration before changing global state

And, of course, some of the things we (Gmail) learned by breaking things along the way.

Speakers

Samantha Schaevitz

Staff Software & Site Reliability Engineer, Google Apps, Google
Samantha Schaevitz is a Staff SRE who's worked on Google Apps since 2013. She enjoys simplifying complex systems and skiing in the Alps near Zürich, where she lives.


Thursday August 31, 2017 09:00 - 09:50 IST
Meeting Rooms 1+2

09:50 IST

Application Automation with Habitat
Container orchestration systems make for a great operational experience when deploying and managing containers. But that’s only part of the story when running containers in production. How do you build containers that contain only what you need (for example, no build systems or tools)? How do you orchestrate configuration of your application after the containers have been launched? How do you make it easy to modify an application config while keeping the containers immutable? How can you give your developers a means to declare dependencies for their applications?

Habitat, our open-source project for application automation, simplifies container management by packaging applications in a compact, atomic, and easily auditable format that makes it easier to deploy your application on various container runtimes and manage them over their lifecycle.

Speakers

Mandi

Technical Community Manager, Chef Software
Mandi Walls is Technical Community Manager, EMEA at Chef. For Chef, she helps organizations increase their effectiveness using configuration management and modernizing IT practices. She is a long-time sysadmin focusing on large complex web systems.


Thursday August 31, 2017 09:50 - 10:20 IST
Meeting Rooms 1+2

10:50 IST

One Ring to Rule them…
Rollout automation is something that every service and team needs, and many reinvent the wheel. I'll talk about:

  • why the wheel gets reinvented
  • a system design that discourages reinvention, including an architecture diagram
  • the organisational challenges encountered when converting many services to use this new system design
  • how well the conversion attempt worked in practice

This is based on my experience initiating and running a program to replace rollout automation across Storage SRE in Google.

Speakers

John Tobin

Google
John Tobin manages Bigtable SRE and Cloud Bigtable SRE at Google Dublin, and has worked on several of Google's storage systems. He is currently involved in efforts to improve collaboration between teams across Storage SRE - standardising tools and processes, reducing duplicated effort...


Thursday August 31, 2017 10:50 - 11:45 IST
Meeting Rooms 1+2

11:45 IST

Dancing with Squads—Do you know what your Code Repos are Telling You?
Have you ever wondered why certain service teams are always at the center of issues? Code commits fail in certain areas; have you ever wondered why? In our quest to understand our services and data by looking from the outside in, we will take the audience through the methodology our data scientists developed in collaboration with university researchers to identify the areas that most impacted site reliability, and how the SRE team could use this data to develop new policies.

We found ourselves asking: are you listening to what your code and issue history is telling you? Do you know what risks you are taking with your code? How do squad organization and climate create patterns that impact availability? This session will take the audience through how we answered those questions, and more, about our own code.

Speakers

Don Cronin

IBM
Don has more than 25 years' experience in developing software. He currently leads the DevOps Analytics mission. His focus is on improving the DevOps lifecycle using big data techniques so developers can deliver greater quality with faster velocity. Previously he led an adtech group...

Rob Orr

Offering Manager SRE, IBM
Rob Orr is Offering Manager at IBM focusing on bringing SRE Capabilities into IBM's offerings and capabilities.


Thursday August 31, 2017 11:45 - 12:30 IST
Meeting Rooms 1+2

13:30 IST

SRE 101, Revisited

This presentation replaces the talk by Dinah McNutt, who is unable to attend. Laura Nolan will revisit her SRE 101 content from yesterday; if you missed the session due to the meeting room being at capacity, this is your opportunity to attend.

The purpose of an SRE team is to keep its services up, reliable, performant and efficient. How do effective SRE teams do this?

We'll run through an overview of key SRE competencies: monitoring and alerting, incident response, disaster recovery, performance and efficiency, change management and capacity planning.

We'll also look at the habits of successful SRE teams and some common pitfalls.


Speakers

Laura Nolan

Stanza
Laura Nolan is a software engineer and SRE. She has contributed to several books on SRE, such as the Site Reliability Engineering book, Seeking SRE, and 97 Things Every SRE Should Know. Laura is a Principal Engineer at Stanza, where she is building software to help humans understand...


Thursday August 31, 2017 13:30 - 14:30 IST
Meeting Rooms 1+2

14:30 IST

Automated Debugging of Bad Deployments
Debugging a bad deployment can be tedious, from identifying new stack traces to figuring out who introduced them. At Pinterest we have automated most of these processes, using Elasticsearch to identify new stack traces and git-stacktrace to figure out who caused them. Git-stacktrace parses the stack trace and looks for related git changes. This has reduced the time needed to figure out who broke the build from minutes to just a few seconds.
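The general idea of matching a stack trace against recent changes can be sketched as follows. This is not git-stacktrace's actual code or API; the function names, the tuple format for commits, and the suffix-matching rule are assumptions made for illustration, and it only handles Python-style tracebacks.

```python
import re

# Matches frame lines such as:  File "/srv/app/feed/score.py", line 7, in score
FRAME_RE = re.compile(r'File "(?P<path>[^"]+)", line (?P<line>\d+)')


def files_in_traceback(traceback_text):
    """Return file paths named in a Python traceback, in order, without duplicates."""
    seen = []
    for match in FRAME_RE.finditer(traceback_text):
        path = match.group("path")
        if path not in seen:
            seen.append(path)
    return seen


def suspect_commits(traceback_text, recent_commits):
    """Return ids of commits that touched any file appearing in the traceback.

    recent_commits is a list of (commit_id, [changed repo-relative paths]).
    Matching is by path suffix, which is crude and can yield false positives.
    """
    trace_files = files_in_traceback(traceback_text)
    suspects = []
    for commit_id, changed in recent_commits:
        if any(tf.endswith(c) for tf in trace_files for c in changed):
            suspects.append(commit_id)
    return suspects
```

In practice the commit list could be built from the output of `git log --name-only` over the range of commits included in the deployment.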

Speakers

Joe Gordon

Pinterest
Joe is an SRE at Pinterest, where he works on homefeed and performance. He has previously spoken at numerous conferences such as EuroPython, LinuxCon and LCA (Linux Conference Australia).


Thursday August 31, 2017 14:30 - 15:00 IST
Meeting Rooms 1+2

15:00 IST

Debugging at Scale—Going from Single Box to Production
It's very easy to launch a debugger on your dev box, attach to the right process and step through code. However, things are different when you need to debug an issue in production that's getting tens of thousands of requests per second. What if the issue reproduces only in production? How do you debug without affecting production traffic? What techniques can you use in your development to make it easier to debug issues? Does your application use tracing? What debug logs are written out to aid in analysis?


This talk will cover:


  1. Challenges with debugging in production

  2. Various approaches that are used in the industry

  3. Examples from Bing and Cortana incidents and steady state problems to illustrate the techniques

  4. Service design ideas that make them easier to debug


Speakers

Kumar Srinivasamurthy

Microsoft Corp
Kumar works at Microsoft and has been in the online services world for several years. He currently runs the Bing and Cortana Live site/SRE team. For the last several years, he has focused on growing the culture around live site quality, incident response and management, service hardening...


Thursday August 31, 2017 15:00 - 15:30 IST
Meeting Rooms 1+2

16:00 IST

Fast and Safe Production Monitoring of JVM Applications with BPF Magic
All of us have seen these evasive performance issues or production bugs in the field, which standard monitoring tools don't see or catch. BPF is a Linux kernel technology that enables fast, safe, dynamic tracing of a running system without any preparation or instrumentation in advance. The JVM itself has a myriad of insertion points for tracing garbage collections, object allocations, JNI calls, and even method calls with extended probes. When the JVM tracepoints don't cut it, the Linux kernel and libraries allow tracing system calls, network packets, scheduler events, off-CPU time, time blocked on disk accesses, and even database queries. In this talk, we will see a holistic set of BPF-based tools for monitoring JVM applications on Linux, and revisit a systems performance checklist that includes classics like fileslower, opensnoop, and strace—all based on the non-invasive, fast, and safe BPF technology.

Speakers

Sasha Goldshtein

CTO, Sela Group
Sasha Goldshtein is the CTO of Sela Group, a Microsoft Regional Director and MVP, Pluralsight and O’Reilly author, and international consultant and trainer. Sasha is the author of two books and multiple online courses, and a prolific blogger. He is also an active open source contributor...


Thursday August 31, 2017 16:00 - 17:00 IST
Meeting Rooms 1+2
 
Friday, September 1
 

09:00 IST

The EU's New Data Protection Law - a Survival Guide
What data do you hold?
Are you processing the data, or controlling it?
Do you have the consents to use that data like that?
Do you have a register of all that data and every way you use it, and what for?
Can you find every piece of data you hold that relates to an individual, copy it and send it to them—for free—within 30 days?
What happens when they say they want it erased?
The General Data Protection Regulation (GDPR) comes into force on 25 May 2018. New powers mean regulators can impose fines for breaches of up to 4% of annual turnover. This workshop is for anyone trying to make sure that their organisation isn't in breach by the implementation date.
GDPR isn't just a compliance project. It's a business culture change project. Let's struggle our way through together.

Speakers

John Looney

Intercom
John Looney did 24x7 support for a webhosting company, spent nearly 12 years in Google as an SRE (compute, storage, datacenters and Ads) as well as running team-build courses. He is now applying SRE to Intercom's infrastructure. He is passionate about ensuring that engineers know...

Simon McGarr

M3AAWG Senior Advisor, Data Compliance Europe
Simon McGarr is a lawyer with McGarr Solicitors in Dublin, and the managing director of Data Compliance Europe, a global consultancy on GDPR and data protection matters. He's a Senior Policy Advisor for M3AAWG and a guest lecturer with the European Academy of Law in Trier as well...


Friday September 1, 2017 09:00 - 12:00 IST
Meeting Rooms 1+2

13:00 IST

CRE: Expanding SRE to inside Your Customer's Organisation
The Cloud is a scary place. You're trusting your entire business to a platform out of your control. This applies to any platform, any SaaS, PaaS, IaaS provider. If they have an outage it's out of your hands!


Introducing CRE: Expanding SRE to inside your customer's organisation. Making a reliable system for your own business is one thing, but doing it when you are providing a platform to another company is a new and exciting area.


Learn how Google is exploring Customer Reliability Engineering, sharing everything from design decisions to monitoring.

Speakers

Stephen Thorne

Google
Stephen is a Senior Site Reliability Engineer working at Google, and a founding member of the EMEA Customer Reliability Engineering team. He writes a blog at medium.com/@jerub about the SRE book, and thoroughly enjoys experiencing exciting new failures and making sure they never happen...


Friday September 1, 2017 13:00 - 14:00 IST
Meeting Rooms 1+2
 