Dependable Computing Systems Lab
The DCS Lab, directed by Dr. Song Fu, aims to build an in-depth
understanding of reliability, availability, and performance in
distributed and cloud computing systems, and to develop innovative system
technologies. The research has been supported in part by funding from
the National Science Foundation and Los Alamos National Laboratory.
People
Director: Dr. Song Fu
PhD candidates: Ziming Zhang, Qiang Guan
Undergraduate research students: Oluwaseun Adeyemi, Chi-Chen Chiu
Research Topics
System Reliability and Availability, Dependable Computing, Self-Managing and
Reconfigurable Systems, Virtual Machines,
Cloud Computing, Distributed Systems and Networking,
Power Management, and Performance Evaluation.
Research Grants
- NSF, “CSR: Failure-Aware Monitoring and Management of Online Availability and Performance for Dependable Computing Clusters”, 9/2009-8/2012 (PI)
- Amazon, “Advanced Distributed Computing on Amazon EC2 Cloud”, 4/2012-3/2013 (PI)
- LANL, “An Infrastructure to Support Research on Dependable High-Productivity Cluster Systems”, 11/2008-9/2009 (PI)
- Xilinx Inc., “Reconfigurable Computing on FPGAs”, 10/2010-9/2011 (PI)
- UNT, “RIG: Autonomic Anomaly Identification for Developing Highly Dependable Cloud Computing Systems”, 2/2012-1/2013 (PI)
- UNT, “ROP: Developing High-Productivity and Green Resource Management Systems in Cloud Computing”, 9/2011-8/2012 (PI)
- UNT, “RIG: Explore Failure Prediction for Proactive Resource Management in Highly Dependable Systems”, 1/2011-12/2011 (PI)
- NMT, “Failure-Aware Resource Management for High-Availability Computing with Reconfigurable Distributed Virtual Machines”, 1/2010-12/2010 (PI)
Projects
In large-scale networked computer systems, component failures are the norm
rather than the exception. Failure prediction is a crucial technique for the
self-management of resources in such systems. In this project, we are
exploring proactive failure management techniques to enhance system
availability. Our long-term objective is to develop a self-managing system
capable of predicting, detecting, and recovering from failures automatically.
To characterize failure dynamics, we propose theoretical models that quantify
the temporal and spatial correlation among failure events. We cluster failure
events based on their correlations and predict their future occurrences. We
implemented a failure prediction framework called PREdictor
of Failure Events Correlated Temporal-Spatially (hPREFECTS). We evaluated
hPREFECTS both offline, using the Los Alamos HPC traces, and online, in a
production computational grid.
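As a rough illustration of grouping failure events by their correlation in time and space, the following minimal Python sketch clusters events whose inter-arrival gap is small and then counts how many distinct nodes each cluster touches. It is not the hPREFECTS algorithm: the gap-based grouping stands in for the correlation models described above, and the names `FailureEvent`, `cluster_by_time`, `spatial_spread`, and the `max_gap` parameter are assumptions made only for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FailureEvent:
    """One failure record (hypothetical schema for this sketch)."""
    timestamp: float  # seconds since the start of the trace
    node: str         # identifier of the node that failed


def cluster_by_time(events: List[FailureEvent], max_gap: float) -> List[List[FailureEvent]]:
    """Group events so that consecutive events in a cluster are at most max_gap apart."""
    clusters: List[List[FailureEvent]] = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        if clusters and ev.timestamp - clusters[-1][-1].timestamp <= max_gap:
            clusters[-1].append(ev)   # close in time to the previous event
        else:
            clusters.append([ev])     # start a new temporal cluster
    return clusters


def spatial_spread(cluster: List[FailureEvent]) -> int:
    """Number of distinct nodes involved in one temporal cluster."""
    return len({ev.node for ev in cluster})


# Illustrative (made-up) trace: two bursts and one isolated failure.
trace = [
    FailureEvent(0.0, "n1"),
    FailureEvent(5.0, "n2"),
    FailureEvent(100.0, "n1"),
    FailureEvent(103.0, "n1"),
    FailureEvent(500.0, "n3"),
]
clusters = cluster_by_time(trace, max_gap=10.0)
spreads = [spatial_spread(c) for c in clusters]
```

A cluster with a spread greater than one suggests failures that propagate across nodes, while a spread of one suggests repeated failures on a single node. A real predictor would replace the fixed gap with a fitted correlation model.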
By
exploring failure prediction techniques, we are developing failure-aware
resource management mechanisms to construct resilient computing systems with
dependability assurance.
Q. Guan, Z. Zhang and S.
Fu, "Ensemble of Bayesian
Predictors and Decision Trees for Proactive Failure Management in Cloud
Computing Systems",
Journal of Communications,
pp 52--61, 7(1), 2012.
S. Fu, "Failure Detection and Prediction for Dependable Cloud Computing Systems",
Proc. of
IEEE Global Communication Conference (GLOBECOM'11),
December 2011.
N. DeBardeleben, S. Blanchard, Q.
Guan, Z. Zhang, and S. Fu, "Experimental Framework for Injecting
Logic Errors in a Virtual Machine to Profile Applications for Soft Error
Resilience",
Proc. of
Resilience, the 17th International European Conference on Parallel and
Distributed Computing (Euro-Par'11),
September 2011.
Q. Guan, Z. Zhang and S.
Fu, "Proactive Failure
Management by Integrated Unsupervised and Semi-Supervised Learning for
Dependable Cloud Systems",
Proc. of
IEEE
International Conference on Availability, Reliability and Security (ARES'11),
August 2011.
Q. Guan, Z. Zhang and S. Fu, "Ensemble
of Bayesian Predictors for Autonomic Failure Management in Cloud Computing",
Proc. of the
20th IEEE
International Conference on Computer Communications and Networks (ICCCN'11),
July 2011.
Q. Guan and S. Fu, "auto-AID:
A Data Mining Framework for Autonomic Anomaly Identification in Networked
Computer Systems",
Proc. of the 29th IEEE
International Performance Computing and Communications Conference (IPCCC'10),
December 2010.
S. Fu and C.-Z. Xu, "Quantifying
Event Correlations for Proactive Failure Management in Networked Computing
Systems", Journal of
Parallel and Distributed Computing, Elsevier, pp 1100--1109, 70(11), November 2010.
Z. Zhang and S. Fu, "A
Hierarchical Failure Management Framework for Dependability Assurance in
Compute Clusters", International Journal of
Computational Science, pp 313--326, 4(4),
October 2010.
Q. Guan, D. Smith and S. Fu, "Anomaly
Detection in Large-Scale Coalition Clusters for Dependability Assurance",
Proc. of
the 17th IEEE International Conference on High Performance Computing (HiPC'10),
December 2010.
Z. Zhang and S. Fu, "Failure Prediction for Autonomic Management
of Networked Computer Systems with Availability Assurance",
Proc. of
DPDNS, IEEE International Parallel and Distributed Processing Symposium
(IPDPS'10), April, 2010.
S. Fu, "Failure-Aware
Resource Management for High-Availability Computing Clusters with Distributed
Virtual Machines", Journal of
Parallel and Distributed Computing, Elsevier, pp 384--393, 70(4), April 2010.
S. Fu, "Dependability Enhancement for Coalition Clusters with
Autonomic Failure Management",
Proc. of
the 15th IEEE
International
Symposium on Computers and Communications (ISCC'10), June, 2010.
Z. Zhang and S. Fu, "Proactive Failure Management for High
Availability Computing in Computer Clusters",
Proc. of
IEEE International Conference on
Computational Sciences and Optimization (CSO'10), May, 2010.
S. Fu, "Failure-Aware Construction and Reconfiguration
of Distributed Virtual Machines for High Availability Computing",
Proc. of the IEEE International Symposium
on Cluster Computing and the Grid (CCGrid'09), May, 2009.
S. Fu and C.-Z. Xu,
"Proactive Resource Management for Failure Resilient High
Performance Computing Clusters",
Proc. of the IEEE
International Conference on Availability, Reliability and Security (ARES'09),
March, 2009.
S. Fu and C.-Z. Xu, "Exploring Event Correlation for Failure Prediction in Coalitions of Clusters", Proc. of the ACM/IEEE International
Conference for High Performance Computing, Networking, Storage, and Analysis
(SC'07), Reno, Nevada, November, 2007.
S. Fu and C.-Z. Xu, "Quantifying Temporal and Spatial Correlation of Failure Events for Proactive Management",
Proc. of the 26th IEEE International Symposium
on Reliable Distributed Systems (SRDS'07), Beijing, October, 2007.
S. Fu, C.-Z. Xu and H. Shen, "Random Choices
for Churn Resilient Load Balancing in Peer-to-Peer Networks", Proc. of the 22nd ACM/IEEE International
Parallel and Distributed Processing Symposium (IPDPS'08), 2008.
S. Fu and C.-Z. Xu, "hPREFECTS: a Failure Prediction Framework Exploring Spatio-Temporal Event Correlation in Coalitions of
Clusters", Submitted for journal publication.
Poster: Proactive
Failure Management for Dependable Networked Computer Systems, University
of North Texas.
Grand applications run on a large number of computers, where hardware and
software faults are common. Such faults force all or part of a computation
to be redone. To tackle this problem, we proposed service
migration, which migrates runtime support services in distributed virtual
machines for adaptive computing. As an illustrative system, we developed a
mobile distributed shared array (M-DSA), which supports a Java-compliant
distributed virtual machine to accommodate parallel computation in
heterogeneous grids. The migration decision problem
was solved using a stochastic optimization model. We evaluated the system's
performance on a campus grid with several benchmark application programs.
S. Fu and C.-Z.
Xu, "Stochastic Modeling
and Analysis of Hybrid Mobility in Reconfigurable Distributed Virtual
Machines", Journal of Parallel
and Distributed Computing, pp 1442--1454, 66(11), Elsevier, 2006.
S. Fu, C.-Z.
Xu, B. Wims, and R. Basharahil, "Distributed
Shared Arrays with Mobility Support for Reconfigurable Distributed Virtual
Machines", Journal of Cluster
Computing, pp 237--255, 9(3), 2006.
S. Fu and
C.-Z. Xu, "Service
Migration in Distributed Virtual Machines for Adaptive Grid Computing",
Proc. of the 34th IEEE International Conference on Parallel Processing
(ICPP'05), Norway, June, 2005.
(Best paper award nominee)
S. Fu and C.-Z. Xu, "Mobility
Support for Adaptive Grid Computing" (book chapter), Scalable and Secure Internet Services and
Architecture, Chapman & Hall/CRC, 2005.
S. Fu and C.-Z. Xu, "Migration Decision for Hybrid Mobility in Reconfigurable Distributed Virtual Machines",
Proc. of the 33rd IEEE International Conference on Parallel Processing
(ICPP'04), Montreal, Quebec, Canada, Aug. 15-18, 2004.
R. Basharahil, B. Wims, C.-Z. Xu, and S. Fu, "Distributed
shared arrays: An integration of message passing and multithreading on SMP
clusters", Journal of
Supercomputing, pp 161--184, 31(2), 2004.
Poster: Service Migration in Distributed Virtual Machines for Adaptive Computing, Cluster
and Internet Computing Laboratory, Wayne State University
Mobile code
has as its defining trait the ability to travel from machine to machine
autonomously on behalf of its owner. As a side effect of this flexibility,
mobile code can visit a machine that its itinerary does not allow and/or
access shared resources multiple times in a coalition
environment. As a formal treatment of these safety issues, we proposed an
itinerary language and its operational semantics to reason about the itinerary safety of mobile code with proactive
migrations. Protection of the shared resources in a coalition environment is
realized by a coordinated access control mechanism,
which checks the access history of the mobile code. The protection policies for
shared resources are specified by temporal and spatial constraints and are
enforced by an access control system based on the roles of the mobile code. The
coordinated access control mechanism has been implemented in the Naplet mobile agent system.
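To make the idea of a history-based check concrete, here is a minimal Python sketch of a role-based decision that counts a mobile code's recent accesses to a resource across all servers in the coalition. It does not reproduce the Naplet implementation or the itinerary language: the `Access` and `Policy` records, the `is_permitted` function, and the "at most N accesses per time window" form of the constraint are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Access:
    """One entry in a mobile code's access history (hypothetical schema)."""
    time: float     # when the access happened
    server: str     # coalition server on which it happened
    resource: str   # shared resource that was accessed


@dataclass(frozen=True)
class Policy:
    """Assumed constraint: a role may access a resource at most
    max_accesses times, counted over all servers, within window time units."""
    role: str
    resource: str
    max_accesses: int
    window: float


def is_permitted(role: str, resource: str, now: float,
                 history: List[Access], policies: List[Policy]) -> bool:
    """Decide a new access request from the role's policy and the access history."""
    for p in policies:
        if p.role == role and p.resource == resource:
            # Count earlier accesses to this resource on any server
            # (spatial scope) that fall inside the window (temporal scope).
            recent = [a for a in history
                      if a.resource == resource and 0 <= now - a.time <= p.window]
            return len(recent) < p.max_accesses
    return False  # no matching policy: deny by default


policies = [Policy(role="guest", resource="db", max_accesses=2, window=60.0)]
history = [Access(0.0, "s1", "db"), Access(30.0, "s2", "db")]

denied = is_permitted("guest", "db", 50.0, history, policies)    # both accesses are recent
allowed = is_permitted("guest", "db", 70.0, history, policies)   # the first access has expired
unknown = is_permitted("admin", "db", 50.0, history, policies)   # no policy for this role
```

Because the history spans servers `s1` and `s2`, the limit applies to the coalition as a whole rather than to each server separately, which is the point of coordinating the check.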
S. Fu and C.-Z.
Xu, "Coordinated
Access Control with Temporal and Spatial Constraints on Mobile Execution in
Coalition Environments", Journal
of Future Generation Computer Systems, pp 804--815, 23(6),
Elsevier, 2007.
S. Fu and C.-Z.
Xu, "Mobile Code and Protection"
(book chapter), Handbook of Information Security, John Wiley &
Sons, 2005.
S. Fu and C.-Z. Xu, "A Coordinated Spatio-Temporal Access Control Model
for Mobile Computing in Coalition Environments", Proc. of the
19th ACM/IEEE International Parallel and Distributed Processing Symposium
(IPDPS'05), Denver, Colorado, Apr. 3-8, 2005.
C.-Z. Xu and S. Fu, "Privilege Delegation and Agent-Oriented Access Control in Naplet",
Proc. of IEEE International Workshop on
Mobile Distributed Computing (in conjunction with ICDCS'03),
Providence, Rhode Island, Jun. 2003.
Poster: Mobile Code and Protection, Cluster and Internet Computing Laboratory,
Wayne State University