Academic Positions

  • 2015 - Present

    Lab Teaching Assistant

    ECE454, Computer System Programming

  • 2015 - Present

    Lecturing Teaching Assistant

    ECE472, Economics & Entrepreneurship

Industry Positions

  • 2016/5 - 2016/9

    Sales & Trading ETF Desk Analyst

    BMO Capital Markets
    Downtown Toronto, Ontario, Canada

  • 2012/5 - 2013/9

    IBM DB2 Kernel Developer

    IBM Canada Lab
    Markham, Ontario, Canada

Education

  • M.A.Sc., Fall 2015 - Present

    Master of Applied Science in Computer Engineering

    Supervised by Ding Yuan

    University of Toronto, Canada

  • B.A.Sc., 2015

    Bachelor of Applied Science in Computer Engineering

    with a Business Minor

    University of Toronto, Canada

Honors, Awards and Grants

  • 2015 Sept. - Present
    Rogers Research Fellowship
    The Rogers Research Fellowship is awarded to M.A.Sc. or Ph.D. students who are in good academic standing and making satisfactory progress toward the completion of their degree, for the duration of their course of studies.
  • 2015 Sept. - Present
    M.A.Sc. Dean's Honour List
    The Dean's Honour List is awarded to students whose cumulative GPA is greater than or equal to 3.7.
  • 2015 Aug.
    ACM SOSP'15 Student Scholarship
  • 2014 Sept.
    USENIX OSDI'14 Student Grant
  • 2014 March
    Accenture Business Case Competition - 1st Place
    Randomly formed teams were given a business case created by Accenture to solve. Our team presented a solution for data gathering and analysis in health-care systems to a panel of Accenture judges and was awarded first place.
  • 2014 Jan.
    IEEEXtreme Programming Competition - Ranked 135th Worldwide
    IEEEXtreme is a global challenge in which teams of IEEE student members, advised and proctored by an IEEE member and often supported by an IEEE Student Branch, compete in a 24-hour time span against each other to solve a set of programming problems.

Research Side Projects

  • High Performance Computing Cluster Design

    Responsible for the design, procurement, implementation, upgrade, maintenance, and management of a cost-efficient, high-performance, high-capacity, high-network-throughput physical and virtual server farm.

Publications

Non-intrusive Performance Profiling for Entire Software Stacks based on the Flow Reconstruction Principle

Xu Zhao, Kirk Rodrigues, Yu Luo, Ding Yuan, Michael Stumm
Conference Paper, USENIX OSDI'16

Abstract

Understanding the performance behavior of distributed server stacks at scale is non-trivial. Servicing a single request can trigger numerous sub-requests to heterogeneous software components; many similar requests are serviced concurrently and in parallel. When a user experiences a performance slow-down, it is extremely difficult to identify the root cause, software components, and machines that are the culprits.

This paper describes Stitch, a non-intrusive tool capable of profiling the performance of an entire distributed software stack solely from the unstructured logs output by heterogeneous software components. Stitch is substantially different from all prior related tools. It is the first tool capable of constructing a system model of an entire software stack without requiring any domain knowledge. It is the first non-intrusive tool able to help diagnose complex cross-component performance issues. It focuses entirely on objects, their relationships, and their interactions as a way to deal with complexity.

We have evaluated Stitch on various software stacks, including Hive/Hadoop, OpenStack, and Spark, and found Stitch miscategorized 3% of all objects. A controlled user study shows that Stitch can speed up various profiling and diagnosis tasks on real-world systems by a factor of at least 4.6 when compared with completing the same tasks without the tool.
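
The flow reconstruction idea above can be illustrated with a small toy: group unstructured log lines by the object identifiers they mention, and treat identifiers that appear on the same line as related objects. The Python sketch below does exactly that with made-up identifier patterns and log messages; it only conveys the flavor of the approach, not Stitch's actual algorithm or implementation.

    import re
    from collections import defaultdict

    # Hypothetical identifier formats (request, container, and block IDs); real
    # systems name their objects differently, and Stitch infers identifiers
    # rather than hard-coding patterns like these.
    ID_REGEX = re.compile(r"req_[0-9a-f]+|container_\d+_\d+|blk_-?\d+")

    def stitch(log_lines):
        """Group log lines by the object identifiers they mention and record
        which identifiers co-occur on a line (a crude object-relationship edge)."""
        lines_by_id = defaultdict(list)
        edges = set()
        for line in log_lines:
            ids = sorted(set(ID_REGEX.findall(line)))
            for obj_id in ids:
                lines_by_id[obj_id].append(line)
            # Identifiers printed together hint at a relationship between objects.
            for i, a in enumerate(ids):
                for b in ids[i + 1:]:
                    edges.add((a, b))
        return lines_by_id, edges

    if __name__ == "__main__":
        logs = [
            "10:00:00 INFO scheduler: launching container_01_07 for req_a3f",
            "10:00:02 INFO datanode: req_a3f reading blk_42",
            "10:00:05 WARN datanode: slow read on blk_42",
        ]
        _, edges = stitch(logs)
        print(sorted(edges))
        # [('blk_42', 'req_a3f'), ('container_01_07', 'req_a3f')]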

Simple Testing Can Prevent Most Critical Failures

Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Renna Rodrigues, Xu Zhao, Yongle Zhang, Pranay U. Jain, and Michael Stumm
Featured Magazine Article, ;login: The USENIX Magazine, February 2015, Volume 40, No. 1

Abstract

Large, production-quality distributed systems still fail periodically, sometimes catastrophically, where most or all users experience an outage or data loss. Conventional wisdom has it that these failures can only manifest themselves on large production clusters and are extremely difficult to prevent a priori, because these systems are designed to be fault tolerant and are well-tested. By investigating 198 user-reported failures that occurred on production-quality distributed systems, we found that almost all (92%) of the catastrophic system failures are the result of incorrect handling of non-fatal errors, and, surprisingly, many of them are caused by trivial mistakes such as error handlers that are empty or that contain expressions like "FIXME" or "TODO" in the comments. We therefore developed a simple static checker, Aspirator, capable of locating trivial bugs in error handlers; it found 143 new bugs and bad practices that have been fixed or confirmed by the developers.

lprof: A Non-intrusive Request Flow Profiler for Distributed Systems

Xu Zhao†, Yongle Zhang†, David Lion, Muhammad Faizan Ullah, Yu Luo, Ding Yuan, Michael Stumm
Conference Paper, USENIX OSDI'14

Abstract

Applications implementing cloud services, such as HDFS, Hadoop YARN, Cassandra, and HBase, are mostly built as distributed systems designed to scale. In order to analyze and debug the performance of these systems effectively and efficiently, it is essential to understand the performance behavior of service requests, both in aggregate and individually.

lprof is a profiling tool that automatically reconstructs the execution flow of each request in a distributed application. In contrast to existing approaches that require instrumentation, lprof infers the request flow entirely from runtime logs and thus does not require any modifications to source code. lprof first statically analyzes an application's binary code to infer how logs can be parsed so that the dispersed and intertwined log entries can be stitched together and associated with specific individual requests.

We validate lprof using the four widely used distributed services mentioned above. Our evaluation shows lprof's precision in request extraction is 90%, and lprof is helpful in diagnosing 65% of the sampled real-world performance anomalies.
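
As a small illustration of what becomes possible once log entries are attributed to individual requests (the hard part that lprof automates), the Python sketch below computes each request's end-to-end latency from the earliest and latest timestamps attributed to it. The request IDs, timestamp format, and values are invented for the example; this is not lprof's implementation.

    from collections import defaultdict
    from datetime import datetime

    def request_latencies(entries):
        """entries: (request_id, timestamp_string) pairs already attributed to
        requests; returns each request's end-to-end latency in seconds."""
        stamps = defaultdict(list)
        for req_id, ts in entries:
            stamps[req_id].append(datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f"))
        return {req_id: (max(t) - min(t)).total_seconds()
                for req_id, t in stamps.items()}

    if __name__ == "__main__":
        parsed = [
            ("req_1", "2014-10-06 09:00:00.000"),
            ("req_1", "2014-10-06 09:00:00.350"),
            ("req_2", "2014-10-06 09:00:01.000"),
            ("req_2", "2014-10-06 09:00:03.200"),
        ]
        for req_id, latency in sorted(request_latencies(parsed).items()):
            print(f"{req_id}: {latency:.3f}s")  # req_1: 0.350s / req_2: 2.200s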

Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-intensive Systems

Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Renna Rodrigues, Xu Zhao, Yongle Zhang, Pranay U. Jain, and Michael Stumm
Conference Paper, USENIX OSDI'14

Abstract

Large, production quality distributed systems still fail periodically, and do so sometimes catastrophically, where most or all users experience an outage or data loss. We present the result of a comprehensive study investigating 198 randomly selected, user-reported failures that occurred on Cassandra, HBase, Hadoop Distributed File System (HDFS), Hadoop MapReduce, and Redis, with the goal of understanding how one or multiple faults eventually evolve into a user-visible failure. We found that from a testing point of view, almost all failures require only 3 or fewer nodes to reproduce, which is good news considering that these services typically run on a very large number of nodes. However, multiple inputs are needed to trigger the failures with the order between them being important. Finally, we found the error logs of these systems typically contain sufficient data on both the errors and the input events that triggered the failure, enabling the diagnosis and reproduction of the production failures.

We found the majority of catastrophic failures could easily have been prevented by performing simple testing on error handling code – the last line of defense – even without an understanding of the software design. We extracted three simple rules from the bugs that have led to some of the catastrophic failures, and developed a static checker, Aspirator, capable of locating these bugs. Over 30% of the catastrophic failures would have been prevented had Aspirator been used and the identified bugs fixed. Running Aspirator on the code of 9 distributed systems located 143 bugs and bad practices that have been fixed or confirmed by the developers.
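
Aspirator is a static checker for Java bytecode, so the sketch below is only a language-shifted analogy of the rule it enforces: an error handler should never be empty or a mere placeholder. The toy Python checker flags except blocks whose body is a lone pass statement or a TODO/FIXME placeholder string; it illustrates the idea, not Aspirator itself.

    import ast

    def check_error_handlers(source):
        """Flag except blocks that are empty (a lone pass) or that contain only a
        TODO/FIXME placeholder string; returns (line number, message) pairs."""
        findings = []
        for node in ast.walk(ast.parse(source)):
            if not isinstance(node, ast.ExceptHandler):
                continue
            body = node.body
            if len(body) == 1 and isinstance(body[0], ast.Pass):
                findings.append((node.lineno, "empty error handler"))
            elif (len(body) == 1 and isinstance(body[0], ast.Expr)
                  and isinstance(body[0].value, ast.Constant)
                  and isinstance(body[0].value.value, str)
                  and ("TODO" in body[0].value.value
                       or "FIXME" in body[0].value.value)):
                findings.append((node.lineno, "placeholder error handler"))
        return findings

    if __name__ == "__main__":
        sample = (
            "try:\n"
            "    commit_transaction()\n"
            "except IOError:\n"
            "    pass\n"
            "except ValueError:\n"
            "    'TODO: handle corrupt input'\n"
        )
        for lineno, message in check_error_handlers(sample):
            print(f"line {lineno}: {message}")
        # line 3: empty error handler
        # line 5: placeholder error handler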

QTrace: An interface for customizable full system instrumentation

Xin Tong, Jack Luo, Andreas Moshovos
Conference Paper, IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS'13)

Abstract

This work presents QTrace, an open-source instrumentation extension API developed on top of QEMU. QTrace instruments unmodified applications and OS binaries for uni- and multi-processor systems, enabling custom, full-system instrumentation tools for the x86 guest architecture. Computer architects can use QTrace to study whole program execution including system-level code. This paper motivates the need for QTrace, illustrates what QTrace can do, and discusses how QEMU was modified to implement QTrace.

At My Office

You can find me at my office located at:

D.L. Pratt Building Room 372

I am usually at my office every weekday from 9:00 am until 6:00 pm, but you may want to send an email first to arrange an appointment.