Dependable Computing Systems Lab

 

The DCS Lab, directed by Dr. Song Fu, seeks an in-depth understanding of reliability, availability, and performance in distributed and cloud computing systems, and develops innovative system technologies. The research has been supported in part by funding from the National Science Foundation and Los Alamos National Laboratory.


 People

Director:  Dr. Song Fu

PhD candidates:  Ziming Zhang,  Qiang Guan

Undergraduate research students:  Oluwaseun Adeyemi, Chi-Chen Chiu

Research Topics

System Reliability and Availability, Dependable Computing, Self-Managing and Reconfigurable Systems, Virtual Machines, Cloud Computing, Distributed Systems and Networking, Power Management, and Performance Evaluation.

Research Grants

  • NSF, “CSR: Failure-Aware Monitoring and Management of Online Availability and Performance for Dependable Computing Clusters”, 9/2009-8/2012 (PI)

  • Amazon, “Advanced Distributed Computing on Amazon EC2 Cloud”, 4/2012-3/2013 (PI)

  • LANL, “An Infrastructure to Support Research on Dependable High-Productivity Cluster Systems”, 11/2008-9/2009 (PI)

  • Xilinx Inc., “Reconfigurable Computing on FPGAs”, 10/2010-9/2011 (PI)

  • UNT, “RIG: Autonomic Anomaly Identification for Developing Highly Dependable Cloud Computing Systems”, 2/2012-1/2013 (PI)

  • UNT, “ROP: Developing High-Productivity and Green Resource Management Systems in Cloud Computing”, 9/2011-8/2012 (PI)

  • UNT, “RIG: Explore Failure Prediction for Proactive Resource Management in Highly Dependable Systems”, 1/2011-12/2011 (PI)

  • NMT, “Failure-Aware Resource Management for High-Availability Computing with Reconfigurable Distributed Virtual Machines”, 1/2010-12/2010 (PI)

Projects

  • Failure-aware resource management for high-availability computing in networked computer systems

In large-scale networked computer systems, component failures are the norm rather than the exception. Failure prediction is therefore a crucial technique for self-managing systems. In this project, we are exploring proactive failure management techniques to enhance system availability. Our long-term objective is to develop a self-managing system that can predict, detect, and recover from failures automatically. To characterize failure dynamics, we propose theoretical models that quantify the temporal and spatial correlation among failure events. We cluster failure events based on their correlations and predict their future occurrences. We implemented a failure prediction framework called PREdictor of Failure Events Correlated Temporal-Spatially (hPREFECTS). We evaluated hPREFECTS both in offline prediction of failures, using the Los Alamos HPC traces, and in online prediction in a production computational grid. Building on these failure prediction techniques, we are developing failure-aware resource management mechanisms to construct resilient computing systems with dependability assurance.
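The idea of grouping failure events by their temporal and spatial correlation can be illustrated with a minimal sketch. It is not the hPREFECTS algorithm: the FailureEvent fields, the fixed max_gap threshold, and the use of a rack label as a stand-in for spatial proximity are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureEvent:
    time: float  # seconds since the start of the trace (assumed unit)
    node: str    # hypothetical node identifier
    rack: str    # hypothetical coarse location, used as "spatial" proximity


def temporal_clusters(events, max_gap):
    """Group events whose gap to the previous event is at most max_gap seconds."""
    clusters = []
    for ev in sorted(events, key=lambda e: e.time):
        if clusters and ev.time - clusters[-1][-1].time <= max_gap:
            clusters[-1].append(ev)   # close in time: same cluster
        else:
            clusters.append([ev])     # gap too large: start a new cluster
    return clusters


def spatial_fraction(cluster):
    """Fraction of a cluster's events that fall in its most common rack."""
    counts = Counter(ev.rack for ev in cluster)
    return max(counts.values()) / len(cluster)


# Made-up events, purely for illustration.
events = [
    FailureEvent(10, "n1", "r1"),
    FailureEvent(40, "n2", "r1"),
    FailureEvent(55, "n7", "r2"),
    FailureEvent(900, "n3", "r3"),
]
clusters = temporal_clusters(events, max_gap=60)
for c in clusters:
    print(len(c), round(spatial_fraction(c), 2))
```

A cluster with many events and a high spatial fraction suggests failures that are correlated in both time and location. A predictor could weight such clusters more heavily, although how the actual framework does this is not described here.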

Q. Guan, Z. Zhang and S. Fu, "Ensemble of Bayesian Predictors and Decision Trees for Proactive Failure Management in Cloud Computing Systems", Journal of Communications, pp 52--61, 7(1), 2012.

S. Fu, "Failure Detection and Prediction for Dependable Cloud Computing Systems", Proc. of IEEE Global Communication Conference (GLOBECOM'11), December 2011.

N. DeBardeleben, S. Blanchard, Q. Guan, Z. Zhang, and S. Fu, "Experimental Framework for Injecting Logic Errors in a Virtual Machine to Profile Applications for Soft Error Resilience", Proc. of Resilience, the 17th International European Conference on Parallel and Distributed Computing (Euro-Par'11), September 2011.

Q. Guan, Z. Zhang and S. Fu, "Proactive Failure Management by Integrated Unsupervised and Semi-Supervised Learning for Dependable Cloud Systems", Proc. of IEEE International Conference on Availability, Reliability and Security (ARES'11), August 2011.

Q. Guan, Z. Zhang and S. Fu, "Ensemble of Bayesian Predictors for Autonomic Failure Management in Cloud Computing", Proc. of the 20th IEEE International Conference on Computer Communications and Networks (ICCCN'11), July 2011.

Q. Guan and S. Fu, "auto-AID: A Data Mining Framework for Autonomic Anomaly Identification in Networked Computer Systems", Proc. of the 29th IEEE International Performance Computing and Communications Conference (IPCCC'10), December 2010.

S. Fu and C.-Z. Xu, "Quantifying Event Correlations for Proactive Failure Management in Networked Computing Systems", Journal of Parallel and Distributed Computing, Elsevier, pp 1100--1109, 70(11), November 2010.

Z. Zhang and S. Fu, "A Hierarchical Failure Management Framework for Dependability Assurance in Compute Clusters", International Journal of Computational Science, pp 313--326, 4(4), October 2010.

Q. Guan, D. Smith and S. Fu, "Anomaly Detection in Large-Scale Coalition Clusters for Dependability Assurance", Proc. of the 17th IEEE International Conference on High Performance Computing (HiPC'10), December 2010.

Z. Zhang and S. Fu, "Failure Prediction for Autonomic Management of Networked Computer Systems with Availability Assurance", Proc. of DPDNS, IEEE International Parallel and Distributed Processing Symposium (IPDPS'10), April, 2010.

S. Fu, "Failure-Aware Resource Management for High-Availability Computing Clusters with Distributed Virtual Machines", Journal of Parallel and Distributed Computing, Elsevier, pp 384--393, 70(4), April 2010.

S. Fu, "Dependability Enhancement for Coalition Clusters with Autonomic Failure Management", Proc. of the 15th IEEE International Symposium on Computers and Communications (ISCC'10), June, 2010.

Z. Zhang and S. Fu, "Proactive Failure Management for High Availability Computing in Computer Clusters", Proc. of IEEE International Conference on Computational Sciences and Optimization (CSO'10), May, 2010.

S. Fu, "Failure-Aware Construction and Reconfiguration of Distributed Virtual Machines for High Availability Computing", Proc. of the IEEE International Symposium on Cluster Computing and the Grid (CCGrid'09), May, 2009.

S. Fu and C.-Z. Xu, "Proactive Resource Management for Failure Resilient High Performance Computing Clusters", Proc. of the IEEE International Conference on Availability, Reliability and Security (ARES'09), March, 2009.

S. Fu and C.-Z. Xu, "Exploring Event Correlation for Failure Prediction in Coalitions of Clusters", Proc. of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'07), Reno, Nevada, November, 2007.

S. Fu and C.-Z. Xu, "Quantifying Temporal and Spatial Correlation of Failure Events for Proactive Management", Proc. of the 26th IEEE International Symposium on Reliable Distributed Systems (SRDS'07), Beijing, October, 2007.

S. Fu, C.-Z. Xu and H. Shen, "Random Choices for Churn Resilient Load Balancing in Peer-to-Peer Networks", Proc. of the 22nd ACM/IEEE International Parallel and Distributed Processing Symposium (IPDPS'08), 2008.

S. Fu and C.-Z. Xu, "hPREFECTS: a Failure Prediction Framework Exploring Spatio-Temporal Event Correlation in Coalitions of Clusters", Submitted for journal publication.

Poster: Proactive Failure Management for Dependable Networked Computer Systems, University of North Texas.

 

  • Service migration in distributed virtual machines for adaptive computing

Large-scale applications run on a large number of computers. Hardware and software faults are common in such an environment, and they force all or part of a computation to be redone. To tackle this problem, we proposed service migration, which migrates runtime support services in distributed virtual machines for adaptive computing. As an illustrative system, we developed a mobile distributed shared array (M-DSA), which supports a Java-compliant distributed virtual machine to accommodate parallel computation in heterogeneous grids. The migration decision problem was solved with a stochastic optimization model. We evaluated the system performance on a campus grid with several benchmark application programs.

S. Fu and C.-Z. Xu, "Stochastic Modeling and Analysis of Hybrid Mobility in Reconfigurable Distributed Virtual Machines", Journal of Parallel and Distributed Computing, pp 1442--1454, 66(11), Elsevier, 2006.

S. Fu, C.-Z. Xu, B. Wims, and R. Basharahil, "Distributed Shared Arrays with Mobility Support for Reconfigurable Distributed Virtual Machines", Journal of Cluster Computing, pp 237--255, 9(3), 2006.

S. Fu and C.-Z. Xu, "Service Migration in Distributed Virtual Machines for Adaptive Grid Computing", Proc. of the 34th IEEE International Conference on Parallel Processing (ICPP'05), Norway, June, 2005. (Best paper award nominee)

S. Fu and C.-Z. Xu, "Mobility Support for Adaptive Grid Computing" (book chapter), Scalable and Secure Internet Services and Architecture, Chapman & Hall/CRC, 2005.

S. Fu and C.-Z. Xu, "Migration Decision for Hybrid Mobility in Reconfigurable Distributed Virtual Machines", Proc. of the 33rd IEEE International Conference on Parallel Processing (ICPP'04), Montreal, Quebec, Canada, Aug. 15-18, 2004.

R. Basharahil, B. Wims, C.-Z. Xu, and S. Fu, "Distributed shared arrays: An integration of message passing and multithreading on SMP clusters", Journal of Supercomputing, pp 161--184, 31(2), 2004.

Poster: Service Migration in Distributed Virtual Machines for Adaptive Computing, Cluster and Internet Computing Laboratory, Wayne State University

 

  • Coordinated Access Control for Mobile Computing

The defining trait of mobile code is its ability to travel from machine to machine autonomously on behalf of its owner. As a side effect of this flexibility, mobile code can visit a machine that its itinerary does not permit, or access shared resources multiple times in a coalition environment. As a formal treatment of these safety issues, we proposed an itinerary language and its operational semantics to reason about itinerary safety for mobile code with proactive migrations. Shared resources in a coalition environment are protected by a coordinated access control mechanism, which checks the access history of a mobile code. The protection policies for shared resources are specified by temporal and spatial constraints and are enforced by an access control system based on the roles of mobile codes. The coordinated access control mechanism has been implemented in the Naplet mobile agent system.
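A history-based access check with role, spatial, and temporal constraints can be sketched as follows. This is a hypothetical illustration, not the Naplet implementation or its API: the Policy fields, the sliding time window, and the names check_access and access are assumptions chosen only to show the idea.

```python
from dataclasses import dataclass, field


@dataclass
class AccessRecord:
    time: float
    site: str
    resource: str


@dataclass
class Policy:
    resource: str
    allowed_roles: frozenset   # role constraint
    allowed_sites: frozenset   # spatial constraint (assumed form)
    max_accesses: int          # limit counted across the whole coalition
    window: float              # temporal constraint, in seconds (assumed form)


@dataclass
class MobileCode:
    role: str
    history: list = field(default_factory=list)  # travels with the code


def check_access(agent, policy, site, now):
    """Return True if the agent's role, location, and history allow access."""
    if agent.role not in policy.allowed_roles:
        return False
    if site not in policy.allowed_sites:
        return False
    recent = [r for r in agent.history
              if r.resource == policy.resource and now - r.time <= policy.window]
    return len(recent) < policy.max_accesses


def access(agent, policy, site, now):
    """Check the request and, if granted, record it in the agent's history."""
    granted = check_access(agent, policy, site, now)
    if granted:
        agent.history.append(AccessRecord(now, site, policy.resource))
    return granted


# Made-up policy and agent, purely for illustration.
policy = Policy("license-db", frozenset({"auditor"}),
                frozenset({"siteA", "siteB"}), max_accesses=2, window=100.0)
agent = MobileCode(role="auditor")
results = [
    access(agent, policy, "siteA", 0),    # first access within the limit
    access(agent, policy, "siteB", 10),   # second access, at a different site
    access(agent, policy, "siteA", 20),   # limit of two reached in the window
    access(agent, policy, "siteC", 20),   # site not allowed by the policy
    access(agent, policy, "siteA", 150),  # earlier accesses have left the window
]
print(results)
```

Because the history is carried by the mobile code rather than kept at any single site, each site in the coalition can apply the same limit without a central counter. Whether the actual system stores the history this way is not stated in the text above.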

S. Fu and C.-Z. Xu, "Coordinated Access Control with Temporal and Spatial Constraints on Mobile Execution in Coalition Environments", Journal of Future Generation Computer Systems, pp 804--815, 23(6), Elsevier, 2007.

S. Fu and C.-Z. Xu, "Mobile Code and Protection" (book chapter), Handbook of Information Security, John Wiley & Sons, 2005.

S. Fu and C.-Z. Xu, "A Coordinated Spatio-Temporal Access Control Model for Mobile Computing in Coalition Environments", Proc. of the 19th ACM/IEEE International Parallel and Distributed Processing Symposium (IPDPS'05), Denver, Colorado, Apr. 3-8, 2005.

C.-Z. Xu and S. Fu, "Privilege delegation and agent-oriented access control in Naplet", Proc. of IEEE International Workshop on Mobile Distributed Computing (In conjunction with ICDCS'03), Providence, Rhode Island, Jun. 2003.

Poster: Mobile Code and Protection, Cluster and Internet Computing Laboratory, Wayne State University