The Warren Center for Network and Data Sciences
We are delighted that a new center is being launched to focus on topics closely aligned with the NETS Program. The Warren Center for Network and Data Sciences will include faculty from the NETS Program as well as from across Penn, all focused on network science and data science topics.
ASPEN - cluster processing of heterogeneous dynamic data
Managing Heterogeneity in Highly Distributed Stream, Cloud, and Sensor Systems
With the advent of low-cost wireless sensing devices, it is predicted that the world will quickly move to one in which many environments are instrumented for reasons of security, scientific monitoring, environmental control, entertainment, etc. There are many fundamental questions about how to develop applications in this emerging sensor network world. Perhaps the most important is how to support rich, complex applications that may have confidentiality requirements, heterogeneous types of sensors, different connectivity levels, and timing constraints. The Aspen (Abstract Sensor Programming Environment) project focuses on the challenges in developing a programming environment and runtime system for this style of environment. We are investigating a number of complementary topics and ideas:
- Complex analysis in a cluster/cloud setting: Many sensor and stream data items need complex analysis. Building upon ideas from MapReduce and from our ORCHESTRA distributed query engine, we are developing new techniques for supporting cluster computation with incremental updates over recursive operations (e.g., PageRank, optimization); see the first sketch after this list.
- Distributed coordination and control: Many complex computations need to be continuously rebalanced, redistributed, and replanned based on monitored activity -- this is a form of adaptive processing. We are developing new declarative techniques to address these problems.
- New programming model: We are building upon a declarative style of programming to develop a new language, group-based programming, for complex sensor applications. The goal is to combine compositional, database-style declarative computation with constraints on timing, security, distribution, and actuation in a seamless way. This work is funded by NSF CNS-0721541.
- Security and privacy: We have studied how sensor network application security is affected by node-level compromise. We are developing further language constructs for specifying encryption levels and other properties for data along certain channels.
- Runtime monitoring and checking: We seek to develop techniques for monitoring performance and triggering events in response to constraint violations. This work is funded by NSF CNS-0721541.
- Home health care and hospital applications: We hope to develop a number of applications useful in home hospice and hospital care, which monitor patients and also connect patients with the care they need. This work is funded by NSF CNS-0721541.
- Declarative information integration and query optimization: The core programming model is based on database query languages. We are developing techniques for supporting schema mappings over streams, distributed in-network join computation, and recursive queries for regions. Importantly, we are developing techniques for performing distributed, decentralized optimization of such computations. This work is funded by NSF IIS-0713267.
- Stream algorithms: In a distributed setting, many nodes have limited resources and must use approximate algorithms to make decisions and capture synopses of system activity; see the second sketch after this list. This work is funded by NSF IIS-0713267.
- Interfacing to Java code: Many real control systems require Java, C, or other procedural code for sophisticated sensor data processing or decision-making. This work is funded by Lockheed Martin.
- Declarative monitoring and re-optimization: We seek to build a declarative infrastructure for monitoring distributed query execution status, plus adaptive re-optimization, using declarative techniques. This work is funded by Lockheed Martin.
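To make the first of these items concrete, here is a minimal Python sketch (not ASPEN's actual engine; the graph, tolerance, and warm-start strategy are illustrative assumptions) of incremental computation over a recursive operation: PageRank is iterated to a fixpoint, and when the graph changes, iteration resumes from the previous solution rather than restarting from scratch.

```python
# A minimal sketch of incremental, recursive computation: PageRank run to a
# fixpoint, then re-converged from the previous solution after a graph update.
# Not ASPEN's implementation; graph and parameters are invented.

def pagerank(graph, ranks=None, damping=0.85, tol=1e-6, max_iter=100):
    """graph: dict mapping each node to a list of its out-neighbors."""
    nodes = list(graph)
    n = len(nodes)
    if ranks is None:                       # cold start: uniform ranks
        ranks = {v: 1.0 / n for v in nodes}
    else:                                   # warm start: reuse old fixpoint
        ranks = {v: ranks.get(v, 1.0 / n) for v in nodes}
    for _ in range(max_iter):
        new = {v: (1 - damping) / n for v in nodes}
        for u in nodes:
            out = graph[u]
            if not out:
                continue                    # dangling node: ignored for brevity
            share = damping * ranks[u] / len(out)
            for v in out:
                new[v] += share
        delta = sum(abs(new[v] - ranks[v]) for v in nodes)
        ranks = new
        if delta < tol:
            break
    return ranks

g = {"a": ["b"], "b": ["c"], "c": ["a"]}
r1 = pagerank(g)             # full computation from scratch
g["a"].append("c")           # incremental update: a new edge arrives
r2 = pagerank(g, ranks=r1)   # warm start converges in far fewer iterations
```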
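And to illustrate the stream-algorithms item, the sketch below shows one classic bounded-memory synopsis, a count-min sketch, which approximates per-item counts in fixed space. The width/depth parameters and the packet-source workload are hypothetical.

```python
# A count-min sketch: a fixed-space synopsis that estimates per-item counts.
# Illustrative only; parameters and workload are invented.
import hashlib

class CountMinSketch:
    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # One independent-ish hash bucket per row, derived from SHA-1.
        for row in range(self.depth):
            h = hashlib.sha1(f"{row}:{item}".encode()).hexdigest()
            yield row, int(h, 16) % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Never underestimates; overestimates only by bounded collision noise.
        return min(self.table[row][col] for row, col in self._buckets(item))

cms = CountMinSketch()
for src in ["10.0.0.1", "10.0.0.2", "10.0.0.1"]:
    cms.add(src)
print(cms.estimate("10.0.0.1"))   # approximately 2
```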
Query-driven data integration
The Q Query System
One of the major challenges for end users today (whether scientists, researchers, policymakers, etc.) is how to pose integrative queries over Web data sources. Today one can fairly easily input a keyword search into Google (Bing, Yahoo, etc.) and receive satisfactory answers if one's information need matches that of someone in the past: the search engine will point at a Web page containing the already-assembled content.
The challenge arises when an information discovery query is posed -- one that requires assembling content from multiple data items but has not been posed before. The Q System attempts to provide an intuitive means of posing such queries.
In Q, the user first defines a web query form to answer queries related to a specific topic or topic domain: this is done by describing (using keywords) the set of concepts that need to be interrelated. The system finds sets of data sources related to each of the topics. Then, using automatic schema matching algorithms, it finds ways of combining source data items to return results.
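As a rough illustration of the matching step, the sketch below scores attribute-name pairs across two sources with a simple string-similarity measure and proposes the best-scoring pairs as join candidates. Q's actual matchers and ranking are considerably more sophisticated; the schemas, threshold, and similarity measure here are invented for the example.

```python
# A toy schema matcher: score attribute pairs across two sources by string
# similarity and propose high-scoring pairs as join candidates.
from difflib import SequenceMatcher

def attribute_matches(schema_a, schema_b, threshold=0.6):
    """Return candidate (attr_a, attr_b, score) pairs above threshold."""
    candidates = []
    for a in schema_a:
        for b in schema_b:
            score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if score >= threshold:
                candidates.append((a, b, score))
    return sorted(candidates, key=lambda t: -t[2])

genes = ["gene_id", "symbol", "organism"]
expression = ["GeneID", "tissue", "expression_level"]
print(attribute_matches(genes, expression))
# -> [('gene_id', 'GeneID', 0.92...)]  suggests a join on gene identifiers
```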
Differential privacy
Putting Differential Privacy to Work
A wealth of data about individuals is constantly accumulating in various databases in the form of medical records, social network graphs, mobility traces in cellular networks, search logs, and movie ratings, to name only a few. There are many valuable uses for such datasets, but it is difficult to realize these uses while protecting privacy. Even when data collectors try to protect the privacy of their customers by releasing anonymized or aggregated data, this data often reveals much more information than intended. To reliably prevent such privacy violations, we need to replace the current ad hoc solutions with a principled data release mechanism that offers strong, provable privacy guarantees. Recent research on differential privacy has brought us a big step closer to achieving this goal. Differential privacy allows us to reason formally about what an adversary could learn from released data, while avoiding the need for many assumptions (e.g., about what an adversary might already know) whose failure has been the cause of privacy violations in the past. However, despite its great promise, differential privacy is still rarely used in practice: proving that a given computation is differentially private requires substantial manual effort by experts in the field, which prevents the approach from scaling.
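To make the guarantee concrete, here is a minimal sketch of the standard Laplace mechanism (not this project's system): a counting query has sensitivity 1, so adding Laplace noise with scale 1/epsilon to the true count yields an epsilon-differentially-private answer. The patient records below are invented for the example.

```python
# The Laplace mechanism for a counting query. Sensitivity of a count is 1,
# so Laplace(1/epsilon) noise gives epsilon-differential privacy.
import math
import random

def laplace_noise(scale):
    # Inverse-CDF sampling from Laplace(0, scale).
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(records, predicate, epsilon):
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

patients = [{"age": 34, "smoker": True},
            {"age": 51, "smoker": False},
            {"age": 47, "smoker": True}]
print(private_count(patients, lambda r: r["smoker"], epsilon=0.5))
```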
This project aims to put differential privacy to work -- to build a system that supports differentially private data analysis, can be used by the average programmer, and is general enough to be used in a wide variety of applications. Such a system could be used pervasively and make strong privacy guarantees a standard feature wherever sensitive data is being released or analyzed. The long-term goal is to combine ideas from differential privacy, programming languages, and distributed systems to make data analysis techniques with strong, provable privacy guarantees practical for general use.
Secure network provenance
Operators of distributed systems often find themselves needing to answer a diagnostic or forensic question. Some part of the system is found to be in an unexpected state; for example, a suspicious routing table entry is discovered, or a proxy cache is found to contain an unusually large number of advertisements. The operators must determine the causes of this state before they can decide on an appropriate response. On the one hand, there may be an innocent explanation: the routing table entry could be the result of a misconfiguration, and the cache entries could have appeared due to a workload change. On the other hand, the unexpected state may be the symptom of an ongoing attack: the routing table entry could be the result of route hijacking, and the cache entries could be a side-effect of a malware infection. In this situation, it would be helpful to be able to ask the system to "explain" its own state, e.g., by describing a chain of events that link the state to its root causes, such as external inputs.
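In its simplest form, such an explanation is a walk backward through a provenance graph from the state in question to external inputs. The toy sketch below illustrates the idea with invented routing facts; real network provenance systems maintain and query this graph across many distributed, mutually untrusting nodes.

```python
# A toy "explain this state" query: each derived fact records the facts or
# events it was derived from, and an explanation is the chain back to
# external inputs. The routing facts below are invented.

provenance = {
    "route(r1, 10.0.0.0/8)":  ["advert(r2, 10.0.0.0/8)", "policy(r1, accept)"],
    "advert(r2, 10.0.0.0/8)": ["route(r2, 10.0.0.0/8)"],
    "route(r2, 10.0.0.0/8)":  [],   # external input: a root cause
    "policy(r1, accept)":     [],   # operator configuration: a root cause
}

def explain(state, depth=0):
    """Print the chain of events linking a state to its root causes."""
    print("  " * depth + state)
    for cause in provenance.get(state, []):
        explain(cause, depth + 1)

explain("route(r1, 10.0.0.0/8)")
```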
As long as the system is working correctly, emerging network provenance techniques can construct such explanations. However, if some of the nodes are faulty or have been compromised by an adversary, the situation is complicated by the fact that the adversary can cause the nodes under his control to lie, suppress information, tamper with existing data, or report nonexistent events. This can cause the provenance system to turn from an advantage into a liability: its answers may cause operators to stop investigating an ongoing attack because everything looks fine.
The goal of this project is to provide secure network provenance, that is, the ability to correctly explain system states even when (and especially when) the system is faulty or under attack. Towards this goal, we are substantially extending and generalizing the concept of network provenance by adding capabilities needed in a forensic setting, we are developing techniques for securely storing provenance without trusted components, and we are designing methods for efficiently querying secure provenance. We are evaluating our techniques in the context of concrete applications, such as Hadoop MapReduce or BGP interdomain routing.
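As one small illustration of the tamper-evident storage ingredient, the sketch below hash-chains log entries so that modifying any recorded event invalidates every later digest. This is only a toy building block under simplifying assumptions, with invented events, not the project's actual protocol.

```python
# A hash chain over log entries: each digest covers the previous digest plus
# the new event, so tampering with history is detectable. Toy example only.
import hashlib

def append(log, event):
    prev = log[-1][1] if log else "0" * 64
    digest = hashlib.sha256((prev + event).encode()).hexdigest()
    log.append((event, digest))

def verify(log):
    prev = "0" * 64
    for event, digest in log:
        if hashlib.sha256((prev + event).encode()).hexdigest() != digest:
            return False
        prev = digest
    return True

log = []
append(log, "recv advert(r2, 10.0.0.0/8)")
append(log, "install route(r1, 10.0.0.0/8)")
assert verify(log)
log[0] = ("recv advert(r3, 10.0.0.0/8)", log[0][1])   # tampering...
assert not verify(log)                                 # ...is detected
```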