Academic Positions

  • 2015 - Present

    Head Teaching Assistant

    ECE454, Computer System Programming

  • 2018 - Present

    Lab Teaching Assistant

    ECE568, Computer Security

  • 2015 - 2018

    Lecturing Teaching Assistant

    ECE472, Economics & Entrepreneurship

Industry Positions

  • 2016/5 - 2016/9

    Sales & Trading ETF Desk Analyst

    BMO Capital Markets
    Downtown Toronto, Ontario, Canada

  • 2012/5 - 2013/9

    IBM DB2 Kernel Developer

    IBM Canada Lab
    Markham, Ontario, Canada

Education

  • Ph.D. 2017 - Present

    Doctor of Philosophy in Computer Engineering

    Supervised by Ding Yuan

    University of Toronto, Canada

  • M.A.Sc. 2017

    Master of Applied Science in Computer Engineering

    Supervised by Ding Yuan

    University of Toronto, Canada

  • B.A.Sc. 2015

    Bachelor of Applied Science

    Computer Engineering Major, Business Minor

    University of Toronto, Canada

Honors, Awards and Grants

  • 2019 Oct. - Present
    SOSP/OSDI (the top conferences for systems research) Hall of Fame
    Currently ranked 84th in the world and one of only two graduate students on the Hall of Fame list.
  • 2015 Sept. - Present
    Rogers Research Fellowship
    The Rogers Research Fellowship is awarded to M.A.Sc. or Ph.D. students who are in good academic standing and are making satisfactory progress toward the completion of their degree, for the duration of their course of studies.
  • 2019
    ACM SOSP'19 Student Scholarship
  • 2018
    USENIX OSDI'18 Student Grant
  • 2017
    ACM SOSP'17 Student Scholarship
  • 2015 Sept. - Present
    M.A.Sc. Dean's Honour List
    The Dean's Honour List is awarded to students whose cumulative GPA is greater than or equal to 3.7.
  • 2015 Aug.
    ACM SOSP'15 Student Scholarship
  • 2014 Sept.
    USENIX OSDI'14 Student Grant
  • 2014 March
    Accenture Business Case Competition - 1st Place
    Randomly formed teams were given a business case created by Accenture to solve. Our team presented a solution for data gathering and analysis in health-care systems to a panel of Accenture judges and was awarded first place.
  • 2014 Jan.
    IEEEXtreme Programming Competition - Ranked 135th Worldwide
    IEEEXtreme is a global challenge in which teams of IEEE Student members—advised and proctored by an IEEE member, and often supported by an IEEE Student Branch—compete in a 24-hour time span against each other to solve a set of programming problems.

Research Side Projects

  • High Performance Computing Cluster Design

    Responsible for the design, procurement, implementation, upgrade, maintenance, and management of a cost-efficient, high-performance, high-capacity, high-network-throughput physical and virtual server farm.

Publications

Log Processing And Analysis

Yu Luo, Kirk Rodrigues, Michael Stumm, Ding Yuan, Xu Zhao
US Patent 2020

Abstract

A log of execution of an executable program is obtained. Log messages contained in the log are parsed to generate object identifiers representative of instances of programmatic elements in the executable program. Relationships among the object identifiers are identified. A representation of identified relationships is constructed and outputted as, for example, a visual representation.

Systems And Processes For Computer Log Analysis

Muhammad Faizan Ullah, David Lion, Yu Luo, Michael Stumm, Ding Yuan, Xu Zhao, Yongle Zhang
US Patent 2019

Abstract

Existing program code, which is executable on one or more computers forming part of a distributed computer system, is analyzed. The analysis identifies log output instructions present in the program code. Log output instructions are those statements or other code that generate log messages related to service requests processed by the program code. A log model is generated using the analysis. The log model is representative of causal relationships among service requests defined by the program code. The log model can then be applied to logs containing log messages generated by execution of the program code, during its normal operation, to group log messages for improved analysis, including visualization, of the performance and behavior of the distributed computer system.

The inflection point hypothesis: a principled debugging approach for locating the root cause of a failure

Yongle Zhang, Kirk Rodrigues, Yu Luo, Michael Stumm, Ding Yuan
Conference Papers ACM SOSP'19

Abstract

The end goal of failure diagnosis is to locate the root cause. Prior root cause localization approaches almost all rely on statistical analysis. This paper proposes taking a different approach based on the observation that if we model an execution as a totally ordered sequence of instructions, then the root cause can be identified by the first instruction where the failure execution deviates from the non-failure execution that has the longest instruction sequence prefix in common with that of the failure execution. Thus, root cause analysis is transformed into a principled search problem to identify the non-failure execution with the longest common prefix. We present Kairux, a tool that does just that. It is, in most cases, capable of pinpointing the root cause of a failure in a distributed system, in a fully automated way. Kairux uses tests from the system's rich unit test suite as building blocks to construct the non-failure execution that has the longest common prefix with the failure execution in order to locate the root cause. By evaluating Kairux on some of the most complex, real-world failures from HBase, HDFS, and ZooKeeper, we show that Kairux can accurately pinpoint each failure's respective root cause.
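
To make the inflection point hypothesis concrete, below is a minimal Python sketch of the idea as stated in the abstract, not of Kairux itself: executions are modeled as instruction-name sequences, and the candidate root cause is the first instruction at which the failing run deviates from the non-failing run sharing the longest common prefix with it. All names and example sequences are illustrative.

    # A minimal sketch of the hypothesis only (not Kairux): model executions as
    # instruction-name sequences, find the non-failing run with the longest common
    # prefix, and report the first instruction where the failing run deviates.

    from typing import List, Sequence, Tuple


    def common_prefix_len(a: Sequence[str], b: Sequence[str]) -> int:
        """Length of the longest common prefix of two instruction sequences."""
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n


    def inflection_point(failure: Sequence[str],
                         non_failures: List[Sequence[str]]) -> Tuple[int, str]:
        """Index and instruction where the failing run first deviates from the
        non-failing run that shares the longest common prefix with it."""
        best = max(non_failures, key=lambda run: common_prefix_len(failure, run))
        k = common_prefix_len(failure, best)
        return k, failure[k] if k < len(failure) else "<end of failing run>"


    if __name__ == "__main__":
        # Hypothetical, hand-made instruction sequences for illustration.
        failing = ["open()", "read()", "parse()", "handle_null()", "crash()"]
        passing = [
            ["open()", "read()", "parse()", "validate()", "close()"],
            ["open()", "read()", "close()"],
        ]
        idx, instr = inflection_point(failing, passing)
        print(f"candidate root cause at instruction {idx}: {instr}")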

Log20: Fully Automated Optimal Placement of Log Printing Statements under Specified Overhead Threshold

Xu Zhao, Kirk Rodrigues, Yu Luo, Michael Stumm, Ding Yuan, Yuanyuan Zhou
Conference Papers ACM SOSP'17

Abstract

When systems fail in production environments, log data is often the only information available to programmers for postmortem debugging. Consequently, programmers' decision on where to place a log printing statement is of crucial importance, as it directly affects how effective and efficient postmortem debugging can be. This paper presents Log20, a tool that determines a near optimal placement of log printing statements under the constraint of adding less than a specified amount of performance overhead. Log20 does this in an automated way without any human involvement. Guided by information theory, the core of our algorithm measures how effective each log printing statement is in disambiguating code paths. To do so, it uses the frequencies of different execution paths that are collected from a production environment by a low-overhead tracing library. We evaluated Log20 on HDFS, HBase, Cassandra, and ZooKeeper, and observed that Log20 is substantially more efficient in code path disambiguation compared to the developers' manually placed log printing statements. Log20 can also output a curve showing the trade-off between the informativeness of the logs and the performance slowdown, so that a developer can choose the right balance.
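
The following is a small, self-contained sketch of the information-theoretic intuition described above; it is an illustration under simplifying assumptions (paths given as tuples of basic-block names with observed frequencies), not Log20's actual algorithm. It measures how many bits of path uncertainty a single hypothetical log placement would remove.

    # An illustrative entropy calculation only (not Log20's algorithm): given
    # execution-path frequencies, the entropy of the path distribution measures how
    # uncertain we are about which path ran; logging a basic block partitions paths
    # by how often that block appears, and a placement's value is the entropy removed.

    import math
    from collections import defaultdict
    from typing import Dict, Tuple


    def entropy(freqs: Dict[Tuple[str, ...], float]) -> float:
        """Shannon entropy (in bits) of the path distribution."""
        total = sum(freqs.values())
        return -sum((f / total) * math.log2(f / total) for f in freqs.values() if f > 0)


    def remaining_entropy(freqs: Dict[Tuple[str, ...], float], logged_block: str) -> float:
        """Expected entropy left after observing how often `logged_block` occurs on the path."""
        groups: Dict[int, Dict[Tuple[str, ...], float]] = defaultdict(dict)
        for path, f in freqs.items():
            groups[path.count(logged_block)][path] = f
        total = sum(freqs.values())
        return sum((sum(g.values()) / total) * entropy(g) for g in groups.values())


    if __name__ == "__main__":
        # Hypothetical paths (tuples of basic-block names) with observed frequencies.
        paths = {("A", "B", "D"): 50.0, ("A", "C", "D"): 30.0, ("A", "C", "E"): 20.0}
        print("baseline entropy:", round(entropy(paths), 3))
        for block in ("B", "C", "D", "E"):
            gain = entropy(paths) - remaining_entropy(paths, block)
            print(f"logging {block}: removes {gain:.3f} bits of path uncertainty")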

The Game of Twenty Questions: Do You Know Where to Log?

Xu Zhao, Kirk Rodrigues, Yu Luo, Michael Stumm, Ding Yuan, Yuanyuan Zhou
Conference Papers ACM HotOS'17

Abstract

A production system's printed logs are often the only source of runtime information available for postmortem debugging, performance analysis and profiling, security auditing, and user behavior analytics. Therefore, the quality of this data is critically important. Recent work has attempted to enhance log quality by recording additional variable values, but logging statement placement, i.e., where to place a logging statement, which is the most challenging and fundamental problem for improving log quality, has not been adequately addressed so far. This position paper proposes we automate the placement of logging statements by measuring how much uncertainty, i.e., the expected number of possible execution code paths taken by the software, can be removed by adding a logging statement to a basic block. Guided by ideas from information theory, we describe a simple approach that automates logging statement placement. Preliminary results suggest that our algorithm can effectively cover, and further improve, the existing logging statement placements selected by developers. It can compute an optimal logging statement placement that disambiguates the entire function call path with only 0.218% slowdown.

Non-intrusive Performance Profiling for Entire Software Stacks based on the Flow Reconstruction Principle

Xu Zhao, Kirk Rodrigues, Yu Luo, Ding Yuan, Michael Stumm
Conference Papers USENIX OSDI'16

Abstract

Understanding the performance behavior of distributed server stacks at scale is non-trivial. Servicing a single request can trigger numerous sub-requests to heterogeneous software components; many similar requests are serviced concurrently and in parallel. When a user experiences a performance slow-down, it is extremely difficult to identify the root cause, software components, and machines that are the culprits.

This paper describes Stitch, a non-intrusive tool capable of profiling the performance of an entire distributed software stack solely from the unstructured logs output by heterogeneous software components. Stitch is substantially different from all prior related tools. It is the first tool capable of constructing a system model of an entire software stack without requiring any domain knowledge. It is the first non-intrusive tool able to help diagnose complex cross-component performance issues. It focuses entirely on objects, their relationships and their interactions as a way to deal with complexity.

We have evaluated Stitch on various software stacks, including Hive/Hadoop, OpenStack, and Spark, and found Stitch miscategorized 3% of all objects. A controlled user study shows that Stitch can speed up various profiling and diagnosis tasks on real-world systems by a factor of at least 4.6 when compared with completing the same tasks without the tool.
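
As a rough illustration of the object-centric idea (extracting object identifiers from unstructured logs and relating them), here is a toy Python sketch; the identifier regex and sample log lines are invented for the example and bear no relation to Stitch's actual analysis, which requires no such domain-specific patterns.

    # A toy illustration of the object-centric idea only (not Stitch): pull
    # identifier-like tokens out of unstructured log lines and connect identifiers
    # that co-occur in the same message, yielding a small object-relationship graph.
    # The regex and the sample log lines are invented for this example.

    import re
    from collections import defaultdict
    from itertools import combinations

    ID_PATTERN = re.compile(r"\b(?:application|container|blk|task)_[0-9_]+\b")


    def build_object_graph(log_lines):
        """Map each identifier to the set of identifiers it co-occurs with."""
        graph = defaultdict(set)
        for line in log_lines:
            ids = set(ID_PATTERN.findall(line))
            for a, b in combinations(sorted(ids), 2):
                graph[a].add(b)
                graph[b].add(a)
        return graph


    if __name__ == "__main__":
        sample = [
            "Launching container_1401_0003 for application_1401",
            "container_1401_0003 reading blk_738299 from datanode",
            "application_1401 finished with state SUCCEEDED",
        ]
        for obj, related in sorted(build_object_graph(sample).items()):
            print(obj, "->", ", ".join(sorted(related)))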

Simple Testing Can Prevent Most Critical Failures

Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Renna Rodrigues, Xu Zhao, Yongle Zhang, Pranay U. Jain, and Michael Stumm
Featured Magazine Article ;login: The USENIX Magazine, February 2015, Volume 40, No. 1

Abstract

Large, production-quality distributed systems still fail periodically, sometimes catastrophically, where most or all users experience an outage or data loss. Conventional wisdom has it that these failures can only manifest themselves on large production clusters and are extremely difficult to prevent a priori, because these systems are designed to be fault tolerant and are well-tested. By investigating 198 user-reported failures that occurred on production-quality distributed systems, we found that almost all (92%) of the catastrophic system failures are the result of incorrect handling of non-fatal errors, and, surprisingly, many of them are caused by trivial mistakes such as error handlers that are empty or that contain expressions like "FIXME" or "TODO" in the comments. We therefore developed a simple static checker, Aspirator, capable of locating trivial bugs in error handlers; it found 143 new bugs and bad practices that have been fixed or confirmed by the developers.

lprof: A Non-intrusive Request Flow Profiler for Distributed Systems

Xu Zhao†, Yongle Zhang†, David Lion, Muhammad Faizan Ullah, Yu Luo, Ding Yuan, Michael Stumm
Conference Papers USENIX OSDI'14

Abstract

Applications implementing cloud services, such as HDFS, Hadoop YARN, Cassandra, and HBase, are mostly built as distributed systems designed to scale. In order to analyze and debug the performance of these systems effectively and efficiently, it is essential to understand the performance behavior of service requests, both in aggregate and individually.

lprof is a profiling tool that automatically reconstructs the execution flow of each request in a distributed application. In contrast to existing approaches that require instrumentation, lprof infers the request flow entirely from runtime logs and thus does not require any modifications to source code. lprof first statically analyzes an application's binary code to infer how logs can be parsed so that the dispersed and intertwined log entries can be stitched together and associated with specific individual requests.

We validate lprof using the four widely used distributed services mentioned above. Our evaluation shows lprof's precision in request extraction is 90%, and lprof is helpful in diagnosing 65% of the sampled real-world performance anomalies.

Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-intensive Systems

Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Renna Rodrigues, Xu Zhao, Yongle Zhang, Pranay U. Jain, and Michael Stumm
Conference Papers USENIX OSDI'14

Abstract

Large, production quality distributed systems still fail periodically, and do so sometimes catastrophically, where most or all users experience an outage or data loss. We present the result of a comprehensive study investigating 198 randomly selected, user-reported failures that occurred on Cassandra, HBase, Hadoop Distributed File System (HDFS), Hadoop MapReduce, and Redis, with the goal of understanding how one or multiple faults eventually evolve into a user-visible failure. We found that from a testing point of view, almost all failures require only 3 or fewer nodes to reproduce, which is good news considering that these services typically run on a very large number of nodes. However, multiple inputs are needed to trigger the failures, with the order between them being important. Finally, we found the error logs of these systems typically contain sufficient data on both the errors and the input events that triggered the failure, enabling the diagnosis and the reproduction of the production failures.

We found the majority of catastrophic failures could easily have been prevented by performing simple testing on error handling code – the last line of defense – even without an understanding of the software design. We extracted three simple rules from the bugs that have led to some of the catastrophic failures, and developed a static checker, Aspirator, capable of locating these bugs. Over 30% of the catastrophic failures would have been prevented had Aspirator been used and the identified bugs fixed. Running Aspirator on the code of 9 distributed systems located 143 bugs and bad practices that have been fixed or confirmed by the developers.
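
To give a flavor of the kind of rule described above, here is a toy, regex-based sketch in Python that flags Java catch blocks that are empty or contain only a TODO/FIXME comment; the real checker operates on Java bytecode and is considerably more precise, and the sample snippet is made up for illustration.

    # A toy, regex-based sketch of the kind of rule described above (the real
    # Aspirator analyzes Java bytecode and is far more precise): flag catch blocks
    # that are empty or whose only content is a TODO/FIXME comment.
    # The Java snippet below is made up for illustration.

    import re

    # Matches `catch (...) { ... }` whose body is only whitespace and/or comments.
    EMPTY_CATCH = re.compile(
        r"catch\s*\([^)]*\)\s*\{(?P<body>(?:\s|//[^\n]*\n|/\*.*?\*/)*)\}",
        re.DOTALL,
    )


    def find_suspicious_catches(java_source: str):
        findings = []
        for m in EMPTY_CATCH.finditer(java_source):
            body = m.group("body")
            line = java_source.count("\n", 0, m.start()) + 1
            if not body.strip():
                findings.append((line, "empty catch block"))
            elif "TODO" in body or "FIXME" in body:
                findings.append((line, "catch block contains only a TODO/FIXME comment"))
        return findings


    if __name__ == "__main__":
        snippet = """
        try {
            namenode.commit(blockId);
        } catch (IOException e) {
            // TODO: handle this properly
        }
        try {
            fs.close();
        } catch (Exception ignored) { }
        """
        for line, reason in find_suspicious_catches(snippet):
            print(f"line {line}: {reason}")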

QTrace: An interface for customizable full system instrumentation

Xin Tong, Jack Luo, Andreas Moshovos
Conference Papers IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS'13)

Abstract

This work presents QTrace, an open-source instrumentation extension API developed on top of QEMU. QTrace instruments unmodified applications and OS binaries for uni- and multi-processor systems, enabling custom, full-system instrumentation tools for the x86 guest architecture. Computer architects can use QTrace to study whole program execution including system-level code. This paper motivates the need for QTrace, illustrates what QTrace can do, and discusses how QEMU was modified to implement QTrace.

At My Office

You can find me at my office located at:

D.L. Pratt Building Room 372

I am usually at my office on weekdays from 9:00 am until 6:00 pm, but please consider sending an email to arrange an appointment.