Head Teaching Assistant
ECE454, Computer System Programming
I am a fourth-year PhD candidate in the Computer Engineering Department at the University of Toronto. I joined Professor Ding Yuan's research group in 2015 as a master's student, after working with him for extended periods during my undergraduate studies. I also received my bachelor's and master's degrees in computer engineering, with a business minor, from the University of Toronto.
My main interests are aerial drone photography and cooking. Since I don't have a lot of free time, my only sports are badminton and snowboarding. I also love my significant other and our two dogs, a French Bulldog and a Boston Terrier. They are so cute!!!
ECE454, Computer System Programming
ECE568, Computer Security
ECE472, Economics & Entrepreneurship
BMO Capital Markets
Downtown Toronto, Ontario, Canada
IBM Canada Lab
Markham, Ontario, Canada
Doctor of Philosophy in
Computer Engineering Supervised by Ding Yuan
University of Toronto, Canada
Master of Applied Science in
Computer Engineering Supervised by Ding Yuan
University of Toronto, Canada
Bachelor of Applied Science in
Computer Engineering Major and Business Minor
University of Toronto, Canada
My research interest is in system software, with a focus on developing practical solutions to improve the availability and performance of large software systems. Much of our research group's pioneering work is in the field of failure diagnosis in large distributed systems via log analysis. Our ultimate goal is to build effective, practical tools that can triage failures, help users perform post-mortem failure analysis, and provide non-intrusive monitoring of full-stack distributed systems. Many intermediate steps are required to reach this ambitious target; much of our work can be found in the papers listed in the Publications section as well as on my advisor's webpage. We hope our research will make a great impact within the systems community and beyond.
A log of execution of an executable program is obtained. Log messages contained in the log are parsed to generate object identifiers representative of instances of programmatic elements in the executable program. Relationships among the object identifiers are identified. A representation of identified relationships is constructed and outputted as, for example, a visual representation.
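To give a flavour of the idea, here is a minimal sketch in Java. It is not the patented method itself; the identifier pattern, class name, and log lines are invented for illustration. It extracts identifier-like tokens from log messages and links identifiers that co-occur in the same message as a crude relationship.

import java.util.*;
import java.util.regex.*;

// A minimal sketch, not the patented method: pull identifier-like tokens
// (e.g. block or container IDs) out of free-form log messages and record
// which identifiers co-occur in the same message as a crude "relationship".
public class LogObjectSketch {
    // Hypothetical pattern for identifiers such as "blk_1001" or "container_0007".
    private static final Pattern ID = Pattern.compile("\\b(blk|container|appattempt)_[0-9]+\\b");

    public static void main(String[] args) {
        List<String> log = List.of(
            "Receiving blk_1001 from container_0007",
            "blk_1001 replicated to datanode 3");

        Map<String, Set<String>> related = new HashMap<>();
        for (String line : log) {
            List<String> ids = new ArrayList<>();
            Matcher m = ID.matcher(line);
            while (m.find()) ids.add(m.group());
            // Identifiers appearing in the same message are linked to each other.
            for (String a : ids)
                for (String b : ids)
                    if (!a.equals(b))
                        related.computeIfAbsent(a, k -> new TreeSet<>()).add(b);
        }
        related.forEach((id, others) -> System.out.println(id + " -> " + others));
    }
}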
Existing program code, which is executable on one or more computers forming part of a distributed computer system, is analyzed. The analysis identifies log output instructions present in the program code. Log output instructions are those statements or other code that generate log messages related to service requests processed by the program code. A log model is generated using the analysis. The log model is representative of causal relationships among service requests defined by the program code. The log model can then be applied to logs containing log messages generated by execution of the program code, during its normal operation, to group log messages for improved analysis, including visualization, of the performance and behavior of the distributed computer system.
The end goal of failure diagnosis is to locate the root cause. Prior root cause localization approaches almost all rely on statistical analysis. This paper proposes taking a different approach based on the observation that if we model an execution as a totally ordered sequence of instructions, then the root cause can be identified by the first instruction where the failure execution deviates from the non-failure execution that has the longest instruction sequence prefix in common with that of the failure execution. Thus, root cause analysis is transformed into a principled search problem to identify the non-failure execution with the longest common prefix. We present Kairux, a tool that does just that. It is, in most cases, capable of pinpointing the root cause of a failure in a distributed system, in a fully automated way. Kairux uses tests from the system's rich unit test suite as building blocks to construct the non-failure execution that has the longest common prefix with the failure execution in order to locate the root cause. By evaluating Kairux on some of the most complex, real-world failures from HBase, HDFS, and ZooKeeper, we show that Kairux can accurately pinpoint each failure's respective root cause.
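A highly simplified sketch of the core idea, not Kairux's implementation: executions are modeled as instruction sequences, and the first instruction after the longest common prefix with any non-failure execution is reported as the root-cause candidate. The class name and sequences below are made up.

import java.util.List;

// Simplified illustration of the longest-common-prefix idea described above.
public class RootCauseSketch {
    static int commonPrefix(List<String> a, List<String> b) {
        int i = 0;
        while (i < a.size() && i < b.size() && a.get(i).equals(b.get(i))) i++;
        return i;
    }

    public static void main(String[] args) {
        List<String> failure = List.of("open", "read", "checkState", "throwNPE");
        List<List<String>> nonFailures = List.of(
            List.of("open", "read", "checkState", "write", "close"),
            List.of("open", "close"));

        // Pick the non-failure execution sharing the longest prefix with the failure.
        int best = -1;
        List<String> bestRun = null;
        for (List<String> run : nonFailures) {
            int p = commonPrefix(failure, run);
            if (p > best) { best = p; bestRun = run; }
        }
        // The failure execution first deviates at index `best`.
        if (best >= 0 && best < failure.size())
            System.out.println("Root-cause candidate: " + failure.get(best)
                + " (diverges from non-failure run " + bestRun + ")");
    }
}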
When systems fail in production environments, log data is often the only information available to programmers for postmortem debugging. Consequently, programmers' decision on where to place a log printing statement is of crucial importance, as it directly affects how effective and efficient postmortem debugging can be. This paper presents Log20, a tool that determines a near optimal placement of log printing statements under the constraint of adding less than a specified amount of performance overhead. Log20 does this in an automated way without any human involvement. Guided by information theory, the core of our algorithm measures how effective each log printing statement is in disambiguating code paths. To do so, it uses the frequencies of different execution paths that are collected from a production environment by a low-overhead tracing library. We evaluated Log20 on HDFS, HBase, Cassandra, and ZooKeeper, and observed that Log20 is substantially more efficient in code path disambiguation compared to the developers' manually placed log printing statements. Log20 can also output a curve showing the trade-off between the informativeness of the logs and the performance slowdown, so that a developer can choose the right balance.
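A toy sketch of the information-theoretic intuition, not Log20's actual algorithm: given path frequencies collected from production, paths that touch the same subset of instrumented basic blocks remain indistinguishable, and the weighted entropy within those groups is the uncertainty left after logging. The paths, frequencies, and block names are hypothetical.

import java.util.*;

// Toy entropy calculation for candidate logging placements.
public class PlacementEntropySketch {
    // Hypothetical paths (lists of basic-block ids) and their observed frequencies.
    static Map<List<String>, Double> paths = Map.of(
        List.of("A", "B", "D"), 50.0,
        List.of("A", "C", "D"), 30.0,
        List.of("A", "C", "E"), 20.0);

    static double remainingEntropy(Set<String> placement) {
        double total = paths.values().stream().mapToDouble(Double::doubleValue).sum();
        // Group paths by the subset of instrumented blocks they execute.
        Map<Set<String>, List<Double>> groups = new HashMap<>();
        paths.forEach((path, freq) -> {
            Set<String> observed = new TreeSet<>(path);
            observed.retainAll(placement);
            groups.computeIfAbsent(observed, k -> new ArrayList<>()).add(freq);
        });
        // Remaining uncertainty: frequency-weighted entropy within each group.
        double h = 0.0;
        for (List<Double> freqs : groups.values()) {
            double groupTotal = freqs.stream().mapToDouble(Double::doubleValue).sum();
            for (double f : freqs) {
                double p = f / groupTotal;
                h += (groupTotal / total) * (-p * Math.log(p) / Math.log(2));
            }
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println("No logs:       " + remainingEntropy(Set.of()));
        System.out.println("Log in B:      " + remainingEntropy(Set.of("B")));
        System.out.println("Log in B and E: " + remainingEntropy(Set.of("B", "E")));
    }
}

In this toy example, instrumenting blocks B and E drives the remaining entropy to zero, i.e. every observed path becomes distinguishable from the logs alone.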
A production system's printed logs are often the only source of runtime information available for postmortem debugging, performance analysis and profiling, security auditing, and user behavior analytics. Therefore, the quality of this data is critically important. Recent work has attempted to enhance log quality by recording additional variable values, but logging statement placement, i.e., where to place a logging statement, which is the most challenging and fundamental problem for improving log quality, has not been adequately addressed so far. This position paper proposes that we automate the placement of logging statements by measuring how much uncertainty, i.e., the expected number of possible execution code paths taken by the software, can be removed by adding a logging statement to a basic block. Guided by ideas from information theory, we describe a simple approach that automates logging statement placement. Preliminary results suggest that our algorithm can effectively cover, and further improve, the existing logging statement placements selected by developers. It can compute an optimal logging statement placement that disambiguates the entire function call path with only 0.218% slowdown.
Understanding the performance behavior of distributed server stacks at scale is non-trivial. Servicing a single request can trigger numerous sub-requests to heterogeneous software components; many similar requests are serviced concurrently and in parallel. When a user experiences a performance slowdown, it is extremely difficult to identify the root cause, software components, and machines that are the culprits.
This paper describes Stitch, a non-intrusive tool capable of profiling the performance of an entire distributed software stack solely from the unstructured logs output by heterogeneous software components. Stitch is substantially different from all prior related tools. It is the first tool capable of constructing a system model of an entire software stack without requiring any domain knowledge. It is the first non-intrusive tool able to help diagnose complex cross-component performance issues. It focuses entirely on objects, their relationships, and their interactions as a way to deal with complexity.
We have evaluated Stitch on various software stacks, including Hive/Hadoop, OpenStack, and Spark, and found that Stitch miscategorized 3% of all objects. A controlled user study shows that Stitch can speed up various profiling and diagnosis tasks on real-world systems by a factor of at least 4.6 when compared with completing the same tasks without the tool.
Large, production-quality distributed systems still fail periodically, sometimes catastrophically, where most or all users experience an outage or data loss. Conventional wisdom has it that these failures can only manifest themselves on large production clusters and are extremely difficult to prevent a priori, because these systems are designed to be fault tolerant and are well-tested. By investigating 198 user-reported failures that occurred on production-quality distributed systems, we found that almost all (92%) of the catastrophic system failures are the result of incorrect handling of non-fatal errors, and, surprisingly, many of them are caused by trivial mistakes such as error handlers that are empty or that contain expressions like "FIXME" or "TODO" in the comments. We therefore developed a simple static checker, Aspirator, capable of locating trivial bugs in error handlers; it found 143 new bugs and bad practices that have been fixed or confirmed by the developers.
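For illustration, the kind of trivial error-handling bug described above looks roughly like the following snippet. This is my own made-up example, not code from any of the studied systems.

// Illustrative only: the style of error-handling bug the study highlights.
// The catch block silently swallows an error the caller cannot safely ignore.
public class SilentHandlerExample {
    void flushEdits(java.io.OutputStream out, byte[] edits) {
        try {
            out.write(edits);
            out.flush();
        } catch (java.io.IOException e) {
            // TODO: handle this properly -- silently dropping the edit log
            // here is exactly the pattern a checker like Aspirator flags.
        }
    }
}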
Applications implementing cloud services, such as HDFS, Hadoop YARN, Cassandra, and HBase, are mostly built as distributed systems designed to scale. In order to analyze and debug the performance of these systems effectively and efficiently, it is essential to understand the performance behavior of service requests, both in aggregate and individually.
lprof is a profiling tool that automatically reconstructs the execution flow of each request in a distributed application. In contrast to existing approaches that require instrumentation, lprof infers the request-flow entirely from runtime logs and thus does not require any modifications to source code. lprof first statically analyzes an application's binary code to infer how logs can be parsed so that the dispersed and intertwined log entries can be stitched together and associated to specific individual requests.
We validate lprof using the four widely used distributed services mentioned above. Our evaluation shows lprof's precision in request extraction is 90%, and lprof is helpful in diagnosing 65% of the sampled real-world performance anomalies.
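To illustrate the stitching step at a very high level, here is a toy sketch. It is not lprof's implementation: lprof infers which log fields identify a request by statically analyzing the binaries, whereas this sketch simply assumes a request-id pattern, and the log lines are invented.

import java.util.*;
import java.util.regex.*;

// Toy request-stitching: group dispersed, intertwined log entries by a
// per-request identifier so each request's flow can be reconstructed.
public class RequestStitchSketch {
    // Assumed (hypothetical) request-id field in the log format.
    private static final Pattern REQ_ID = Pattern.compile("req=([0-9a-f]+)");

    public static void main(String[] args) {
        List<String> interleaved = List.of(
            "10:00:01 node1 req=a1 received PUT /table/row7",
            "10:00:01 node2 req=b2 received GET /table/row9",
            "10:00:02 node3 req=a1 forwarding to region server",
            "10:00:03 node3 req=a1 completed in 2000 ms");

        Map<String, List<String>> perRequest = new LinkedHashMap<>();
        for (String line : interleaved) {
            Matcher m = REQ_ID.matcher(line);
            if (m.find())
                perRequest.computeIfAbsent(m.group(1), k -> new ArrayList<>()).add(line);
        }
        perRequest.forEach((id, flow) ->
            System.out.println("request " + id + ": " + flow.size() + " log entries"));
    }
}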
Large, production-quality distributed systems still fail periodically, and do so sometimes catastrophically, where most or all users experience an outage or data loss. We present the result of a comprehensive study investigating 198 randomly selected, user-reported failures that occurred on Cassandra, HBase, Hadoop Distributed File System (HDFS), Hadoop MapReduce, and Redis, with the goal of understanding how one or multiple faults eventually evolve into a user-visible failure. We found that from a testing point of view, almost all failures require only 3 or fewer nodes to reproduce, which is good news considering that these services typically run on a very large number of nodes. However, multiple inputs are needed to trigger the failures, and the order between them is important. Finally, we found the error logs of these systems typically contain sufficient data on both the errors and the input events that triggered the failure, enabling the diagnosis and reproduction of the production failures.
We found that the majority of catastrophic failures could easily have been prevented by performing simple testing on error handling code – the last line of defense – even without an understanding of the software design. We extracted three simple rules from the bugs that have led to some of the catastrophic failures, and developed a static checker, Aspirator, capable of locating these bugs. Over 30% of the catastrophic failures would have been prevented had Aspirator been used and the identified bugs fixed. Running Aspirator on the code of 9 distributed systems located 143 bugs and bad practices that have been fixed or confirmed by the developers.
This work presents QTrace, an open-source instrumentation extension API developed on top of QEMU. QTrace instruments unmodified applications and OS binaries for uni- and multi-processor systems, enabling custom, full-system instrumentation tools for the x86 guest architecture. Computer architects can use QTrace to study whole-program execution, including system-level code. This paper motivates the need for QTrace, illustrates what QTrace can do, and discusses how QEMU was modified to implement QTrace.
I would be happy to talk to you if you need my assistance in your research. Though I have limited time, skills, and experience as a student, I will try to help you out as best as I can.
You can find me at my office located at:
D.L. Pratt Building Room 372
I am usually in my office every weekday from 9:00 am until 6:00 pm, but you may want to send an email first to set up an appointment.