Dependable Computing Systems Lab

 

The DCS Lab, directed by Dr. Song Fu, aims to develop an in-depth understanding of reliability, energy efficiency, and performance in HPC, distributed, and cloud systems, and to build innovative system technologies. The research has been supported in part by funding from the National Science Foundation, Los Alamos National Laboratory, Amazon, Nvidia, Xilinx, and the University of North Texas.


 People

Director: Dr. Song Fu

PhD students: Ziming Zhang, Song Huang, Mahendra Talasila (co-advised with Dr. Shengli Fu), Bayaner Arigong (co-advised with Dr. Hualiang Zhang)

Master's students: Ramureddy Venumuddala, Brisia Munoz, Siddhartha Katragadda, Milko Balakchiev

Undergraduate and TAMS research students: Jason He, Alex Lu

Visiting scholar: Xiajun Wang

Alumni: Qiang Guan (PhD), Husanbir Singh Pannu (PhD), Savitha Chintakindi (MS), Devender Singh (MS), Neeraja Ganesh (MS), Kayleigh Bush (BS), Yash Ramamurthy (BS), Chi-Chen Chiu (BS), Oluwaseun Adeyemi (BS), Yosef Borga (BS), Michael Wang (TAMS).

Research Topics

System Resilience, Fault Avoidance and Tolerance, Anomaly Detection, Soft Errors, Resource Management, Energy and Performance.

Projects & Publications

  • Autonomic anomaly and soft error management

Q. Guan, S. Fu, N. DeBardeleben and S. Blanchard, "F-SEFI: A Fine-grained Soft Error Fault Injection Tool for Profiling Application Vulnerability", Proc. of the 28th IEEE International Parallel & Distributed Processing Symposium (IPDPS'14), 10 pages, May 2014.

Q. Guan, N. DeBardeleben, S. Blanchard and S. Fu, "Towards Exploring the Soft Error Susceptibility of Heapsort Algorithms", accepted by the 44th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'14), June 2014.

Q. Guan and S. Fu, "Adaptive Anomaly Identification by Exploring Metric Subspace in Cloud Computing Infrastructures", Proc. of the 32nd IEEE International Symposium on Reliable Distributed Systems (SRDS'13), 10 pages, October 2013.

Q. Guan, S. Fu, N. DeBardeleben and S. Blanchard, "Exploring Time and Frequency Domains for Accurate and Automated Anomaly Detection in Cloud Computing Systems", Proc. of the 19th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC'13), 10 pages, December 2013.

Q. Guan and S. Fu, "Autonomic Failure Identification and Diagnosis for Building Dependable Computing Systems", Proc. of ACM/IEEE Supercomputing Conference (SC'13), November 2013.

Q. Guan and S. Fu, "Wavelet-Based Multi-Scale Anomaly Identification in Cloud Computing Systems", Proc. of IEEE Global Communications Conference (GLOBECOM'13), December 2013.

H. Pannu, J. Liu and S. Fu, "AAD: Adaptive Anomaly Detection System for Cloud Computing Infrastructures", Proc. of the 31st IEEE International Symposium on Reliable Distributed Systems (SRDS'12), October 2012.

H. Pannu, J. Liu and S. Fu, "A Self-Evolving Anomaly Detection Framework for Developing Highly Dependable Utility Clouds", Proc. of IEEE Global Communications Conference (GLOBECOM'12), December 2012.

Q. Guan, C.-C. Chiu and S. Fu, "A Cloud Dependability Analysis Framework for Assessing System Dependability in Cloud Computing Infrastructures", Proc. of the 18th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC'12), 10 pages, December 2012.

H. Pannu, J. Liu, Q. Guan and S. Fu, "An Autonomic Failure Detection System for Cloud Computing Infrastructures", Proc. of the 31st IEEE International Performance Computing and Communications Conference (IPCCC'12), 10 pages, December 2012.

Q. Guan, C.-C. Chiu, Z. Zhang and S. Fu, "Efficient and Accurate Anomaly Identification Using Reduced Metric Space in Utility Clouds", Proc. of IEEE International Conference on Networking, Architecture, and Storage (NAS'12), 10 pages, June 2012.

Q. Guan, Z. Zhang and S. Fu, "Ensemble of Bayesian Predictors and Decision Trees for Proactive Failure Management in Cloud Computing Systems", Journal of Communications, pp 52--61, 7(1), 2012.

H. Pannu, J. Liu and S. Fu, "A Hybrid Anomaly Detection Framework in Cloud Computing using One-Class and Two-Class Support Vector Machines", Proc. of International Conference on Advanced Data Mining and Applications (ADMA'12), 12 pages, December 2012.

S. Fu, "Performance Metric Selection for Autonomic Anomaly Detection on Cloud Computing Systems", Proc. of IEEE Global Communications Conference (GLOBECOM'11), December 2011.

N. DeBardeleben, S. Blanchard, Q. Guan, Z. Zhang, and S. Fu, "Experimental Framework for Injecting Logic Errors in a Virtual Machine to Profile Applications for Soft Error Resilience", Proc. of Resilience, the 17th International European Conference on Parallel and Distributed Computing (Euro-Par'11), 10 pages, September 2011.

Q. Guan, Z. Zhang and S. Fu, "Proactive Failure Management by Integrated Unsupervised and Semi-Supervised Learning for Dependable Cloud Systems", Proc. of IEEE International Conference on Availability, Reliability and Security (ARES'11), August 2011.

Q. Guan, Z. Zhang and S. Fu, "Ensemble of Bayesian Predictors for Autonomic Failure Management in Cloud Computing", Proc. of the 20th IEEE International Conference on Computer Communications and Networks (ICCCN'11), July 2011.

  • Power profiling and management

Z. Zhang, Q. Guan, J. He and S. Fu, "Adaptive Power and Performance Management with Resource Auto-Configuration in Cloud Computing Systems", Journal of Communications, in press, 2013.

Z. Zhang, Q. Guan and S. Fu, "An Adaptive Power Management Framework for Autonomic Resource Configuration in Cloud Computing Infrastructures", Proc. of the 31st IEEE International Performance Computing and Communications Conference (IPCCC'12), 10 pages, December 2012.

Z. Zhang and S. Fu, "Characterizing Power and Energy Usage in Cloud Computing Systems", Proc. of IEEE International Conference on Cloud Computing Technology and Science (CloudCom'11), December 2011.

Z. Zhang and S. Fu, "macropower: A Coarse-Grain Power Profiling Framework for Energy-Efficient Cloud Computing", Proc. of the 30th IEEE International Performance Computing and Communications Conference (IPCCC'11), December 2011.

  • Failure prediction and resource management in computer clusters

S. Fu and C.-Z. Xu, "Failure-Aware Resource Management for High-Availability Computing Clusters with Distributed Virtual Machines", Journal of Parallel and Distributed Computing, Elsevier, pp 384--393, 70(4), April 2010.

S. Fu and C.-Z. Xu, "Quantifying Event Correlations for Proactive Failure Management in Networked Computing Systems", Journal of Parallel and Distributed Computing, Elsevier, pp 1100--1109, 70(11), November 2010.

S. Fu and C.-Z. Xu, "Exploring Event Correlation for Failure Prediction in Coalitions of Clusters", Proc. of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'07), Reno, Nevada, November 2007.

S. Fu and C.-Z. Xu, "Quantifying Temporal and Spatial Correlation of Failure Events for Proactive Management", Proc. of the 26th IEEE International Symposium on Reliable Distributed Systems (SRDS'07), Beijing, October 2007.

S. Fu, "Failure-Aware Construction and Reconfiguration of Distributed Virtual Machines for High Availability Computing", Proc. of the IEEE International Symposium on Cluster Computing and the Grid (CCGrid'09), May 2009.

Q. Guan and S. Fu, "auto-AID: A Data Mining Framework for Autonomic Anomaly Identification in Networked Computer Systems", Proc. of the 29th IEEE International Performance Computing and Communications Conference (IPCCC'10), December 2010.

Q. Guan, D. Smith and S. Fu, "Anomaly Detection in Large-Scale Coalition Clusters for Dependability Assurance", Proc. of the 17th IEEE International Conference on High Performance Computing (HiPC'10), December 2010.

Z. Zhang and S. Fu, "Failure Prediction for Autonomic Management of Networked Computer Systems with Availability Assurance", Proc. of DPDNS, IEEE International Parallel and Distributed Processing Symposium (IPDPS'10), April 2010.

S. Fu, C.-Z. Xu and H. Shen, "Random Choices for Churn Resilient Load Balancing in Peer-to-Peer Networks", Proc. of the 22nd ACM/IEEE International Parallel and Distributed Processing Symposium (IPDPS'08), 2008.

  • Service migration in distributed virtual machines for adaptive computing

S. Fu and C.-Z. Xu, "Stochastic Modeling and Analysis of Hybrid Mobility in Reconfigurable Distributed Virtual Machines", Journal of Parallel and Distributed Computing, pp 1442--1454, 66(11), Elsevier, 2006.

S. Fu, C.-Z. Xu, B. Wims, and R. Basharahil, "Distributed Shared Arrays with Mobility Support for Reconfigurable Distributed Virtual Machines", Journal of Cluster Computing, pp 237--255, 9(3), 2006.

S. Fu and C.-Z. Xu, "Service Migration in Distributed Virtual Machines for Adaptive Grid Computing", Proc. of the 34th IEEE International Conference on Parallel Processing (ICPP'05), Norway, June 2005. (Best paper award nominee)

S. Fu and C.-Z. Xu, "Mobility Support for Adaptive Grid Computing" (book chapter), Scalable and Secure Internet Services and Architecture, Chapman & Hall/CRC, 2005.

S. Fu and C.-Z. Xu, "Migration Decision for Hybrid Mobility in Reconfigurable Distributed Virtual Machines", Proc. of the 33rd IEEE International Conference on Parallel Processing (ICPP'04), Montreal, Quebec, Canada, Aug. 15-18, 2004.

R. Basharahil, B. Wims, C.-Z. Xu, and S. Fu, "Distributed shared arrays: An integration of message passing and multithreading on SMP clusters", Journal of Supercomputing, pp 161--184, 31(2), 2004.

  • Coordinated Access Control for Mobile Computing

S. Fu and C.-Z. Xu, "Coordinated Access Control with Temporal and Spatial Constraints on Mobile Execution in Coalition Environments", Journal of Future Generation Computer Systems, pp 804--815, 23(6), Elsevier, 2007.

S. Fu and C.-Z. Xu, "Mobile Code and Protection" (book chapter), Handbook of Information Security, John Wiley & Sons, 2005.

S. Fu and C.-Z. Xu, "A Coordinated Spatio-Temporal Access Control Model for Mobile Computing in Coalition Environments", Proc. of the 19th ACM/IEEE International Parallel and Distributed Processing Symposium (IPDPS'05), Denver, Colorado, Apr. 3-8, 2005.

C.-Z. Xu and S. Fu, "Privilege Delegation and Agent-Oriented Access Control in Naplet", Proc. of IEEE International Workshop on Mobile Distributed Computing in conjunction with ICDCS'03, Providence, Rhode Island, Jun. 2003.

Sponsors