Data Feast, Privacy Famine: What Is a Healthy Data Diet?
Chief Privacy Officer & General Manager of Data Systems, Intelius, Inc.
The big data feast is upon us, but are we just gorging on junk food? Is there sufficient awareness, control, context, fairness, and history to keep us from bloating our collective waistlines? Would we know a healthy data diet if we saw one? This talk will draw parallels between food and data, examining how science, business, and societal values shape how we produce, consume, regulate, and think about both. For food, understanding starts with the science of chemistry and biology. From this understanding grow the culinary arts, social rituals, and societal values around how we consume food. Businesses determine how food is produced and marketed, and governments regulate abuses.
Data is certainly more nuanced and abstract than food, but it is following a similar trajectory. For example, social media data has exploded amid disruptive information technology, massively parallel computing, and machine learning. Now the challenge is to fortify social media with our societal values (like discretion, disclosure, fairness, equality) that have governed every media innovation since the invention of parchment. Some believe that data and privacy are inversely related—that is, with more data comes less privacy. That is not necessarily true. Privacy isn't just about data that's breached a security wall. For data that wants to freely flow, privacy is about respecting boundaries and defining appropriate uses. Responsible innovation will mean healthier data use in line with long-held social traditions.
Data is the new medium of social communication and is forcing a healthy debate to define public/private boundaries, fair access, and appropriate use. Like food, social communication (and the data that drives it) is a necessity for humanity's survival. This talk will discuss the key ingredients to avoid the empty calories.
Crowdsourcing Big Data
Chairman and Co-founder, CrowdFlower
In this presentation, Lukas Biewald will discuss how crowdsourcing provides channels for researchers, businesses, or even armchair social scientists to gather large amounts of data overnight rather than waiting years. Traditional means of data collection are often time consuming, tedious, and flawed. Biewald will demonstrate how crowdsourcing, utilizing robust quality-control mechanisms, offers a faster, more accurate, and scalable solution.
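One common family of quality-control mechanisms combines redundant labeling with hidden gold-standard test questions. The sketch below is purely illustrative (the function names and data are hypothetical, not CrowdFlower's actual pipeline): each item is labeled by several workers and aggregated by majority vote, while workers are scored against test questions with known answers.

```python
from collections import Counter

def majority_label(judgments):
    """Aggregate redundant worker judgments for one item by majority vote."""
    counts = Counter(judgments)
    label, _ = counts.most_common(1)[0]
    return label

def worker_accuracy(worker_answers, gold):
    """Score a worker against hidden gold-standard test questions.

    worker_answers and gold map item ids to labels; only items that
    appear in the gold set are scored. Returns None if none were seen.
    """
    scored = [item for item in worker_answers if item in gold]
    if not scored:
        return None
    correct = sum(worker_answers[item] == gold[item] for item in scored)
    return correct / len(scored)
```

In practice a worker whose gold-question accuracy falls below a threshold would have their judgments discarded or down-weighted before the vote.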
Data Science in Education and for Discovery
Professor of Astrophysics and Computational Sciences, George Mason University
I will discuss the rise of data science as a new academic and research discipline. Data-intensive opportunities are growing significantly across the spectrum of academic, government, and business enterprises. In order to respond to this data-driven digital transformation, it is imperative to train the next-generation workforce in the data-science skill areas. Among these skills are knowledge discovery and information extraction from massive data collections. I will describe some of the techniques that we are applying both in research (for scientific discovery) and in the classroom (to engage students in inquiry-driven evidence-based learning). Specific examples of surprise detection in big data will be presented.
IPUMS International—Building a Census Data Time Machine
IT Core Director, Minnesota Population Center, University of Minnesota
The IPUMS-International project has collected nearly four hundred million person-records of census data from around the world, with over thirty thousand unique variables. This data comes from many sources and in many forms, but we make it comparable across time and location. This session will cover how we organize and integrate the data; how metadata are created, organized, and processed; and how our processes have fared as the project scales.
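Making census variables comparable across time and place typically means recoding each source's scheme into one integrated coding scheme. A minimal sketch of the idea, with entirely hypothetical sample names and code maps (IPUMS's real harmonization metadata is far richer):

```python
# Integrated coding scheme for a marital-status variable (hypothetical).
MARST_HARMONIZED = {1: "single", 2: "married", 3: "widowed", 4: "divorced"}

# Each census sample uses its own source-specific codes; harmonization
# maps them onto the integrated scheme. Sample names and codes are made up.
RECODE = {
    "mx1990": {0: 1, 1: 2, 2: 2, 3: 3, 4: 4},   # numeric source codes
    "fr1999": {"C": 1, "M": 2, "V": 3, "D": 4},  # letter source codes
}

def harmonize(sample, source_value):
    """Translate a source-specific code into the integrated coding scheme."""
    return RECODE[sample][source_value]
```

The point of the design is that analysis code only ever sees the integrated codes, regardless of which of the many sources a record came from.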
The Role of Visualization and Citizen Science in Astronomy
Archive Scientist, Space Telescope Science Institute
Like many other scientific disciplines, astronomy has witnessed tremendous growth over the past two decades. As a result, astronomers have become very efficient at creating massive datasets that describe the properties of our nearby universe. Given its primarily visual focus, and its potential to address fundamental questions about humanity, astronomy is in the unique position to be an ideal testbed for algorithms and techniques that address "big data" problems. As a result, the astronomical community, in coordination with all NASA data centers, is trying to cope with "big data" by making use of novel approaches to data mining, data visualization, and data distribution.
Here I will present two applications that showcase how astronomers are tackling the problems associated with the wealth of data at their disposal. First I will describe how GPUs, coupled with clever image processing and Mathematica, are helping the search for extraterrestrial planets. Then I will show how citizen science is changing the social fabric of astronomy and redefining what scientific questions can be addressed by large datasets.
A Rapid-Learning Health System: Using Electronic Health Records and Apps
Director, Rapid Learning Project, George Washington University
This talk will update progress toward a national rapid-learning health system, using tens of billions of dollars of public investment for electronic health records, patient registries, and learning networks. In particular, the talk will highlight a national apps strategy, via public policy and marketplace developments, as a creative new approach to collect many more data modules and to generate maximum benefits for many more users and uses. Specific references will likely be made to leading-edge developments such as the forthcoming Real-Time Oncology Network; the National Cardiovascular Research Infrastructure; an in-the-works National Quality Registry Network; and to selected National Institutes of Health, Centers for Disease Control and Prevention, and Food and Drug Administration databases. The talk will also describe the evolving strategy of "rapid cycle" learning that will use ten billion dollars in the Center for Medicare and Medicaid Innovation funds to test, pilot, and roll out new research findings and best practices into the healthcare system as part of a continuous learning cycle.
ACCRA Cost of Living Index—A Private Data Collection Effort since 1968
COLI Project Manager, C2ER
The Council for Community and Economic Research (C2ER) produces the ACCRA Cost of Living Index (COLI) to provide a useful and reasonably accurate measure for comparing cost-of-living differences among urban areas in the United States. This session will present the COLI methodology, data-collection procedures, and quality-control measures. It will also demonstrate how and where the index data can be used.
A New and Old View of Computing and Data
David Alan Grier
Associate Professor of International Science and Technology Policy, Elliott School of International Affairs
For the past 70+ years, our view of computing and hence our view of data has been locked to the finite discrete automata, the idea behind Alan Turing's abstract machines and John von Neumann's more physical ideas. With crowdsourcing, we are being pushed back into a processing model that flourished during the years that Turing and von Neumann developed their ideas of computation. This model, which was used heavily by the Works Progress Administration, employed large numbers of workers and labor markets to handle computational and data processing problems. It is a natural extension of the classical finite discrete automata. It provides new capabilities and new ways of conceiving data, but it also suggests new limitations to the nature of computation and data gathering.
Managing Technical Talent: How to Find the Right Analyst for Your Problem
In an age of big and complex data and myriad analytical techniques, the governance of technical expertise is a critical issue, yet it is rarely given serious consideration. Competitions generate much-needed objective information about which analysts and techniques work best in specific situations. A single data scientist can do well on a problem, but how can the best one be found? Competition also adds fresh eyes and new ideas and elicits greater effort (the Roger Bannister effect). Kaggle has hosted competitions that have raced to the frontier of what's humanly possible in areas as diverse as prioritizing preventative health care, designing games-rating systems, and predicting traffic flow.
The Need for Data Standards: How the InChI Project Is More than Just a Standard for Chemists
Project Director, InChI Trust
InChI, the IUPAC International Chemical Identifier, was developed to be an open-source, computer-readable standard for representing chemical structures in the modern world (e.g., internet and search engines). InChI is more than a standard, because what we need is not just any standard; what we need is an arbitrary standard that can, in practice, be used by everyone. For practical (i.e., political) reasons we need a standard that does not conflict with any existing structure representation that any person, group, or organization is currently using. InChI is not a replacement for what is currently being used by anyone. Ninety-nine percent of the value of InChI is its unique ability to link information from diverse sources—chemical, physical, biological, environmental, medical, and so on. If everyone adds the arbitrary standard InChI and InChIKey to their computer-readable record of information, this will greatly improve the search for information and knowledge. InChI is designed to maximize access to information and data internally and on the internet (fee or free) in the most effective manner.
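The linking idea is simple: because the InChIKey is a fixed-length identifier derived from the full InChI, records in unrelated databases that carry the same key describe the same structure and can be joined on it. A minimal sketch, using placeholder strings rather than real InChIKeys, and hypothetical record fields:

```python
# Two independent, hypothetical data sources, each annotating records with
# an InChIKey. The key below is a placeholder, not a real InChIKey.
chemical_db = {"AAAAAAAAAAAAAA-BBBBBBBBBB-N": {"name": "compound X", "mw": 194.19}}
toxicity_db = {"AAAAAAAAAAAAAA-BBBBBBBBBB-N": {"ld50_mg_kg": 192}}

def link(key):
    """Join records from diverse sources on a shared InChIKey."""
    merged = {}
    for db in (chemical_db, toxicity_db):
        if key in db:
            merged.update(db[key])
    return merged
```

No source has to change its own internal representation; each simply adds the key alongside whatever it already stores, which is exactly the "arbitrary standard" argument the abstract makes.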
Metadata Standards and XML Technologies for Unlocking Statistical Data
Vice President, Metadata Technology/Open Data Foundation
As demand for socioeconomic data, health data, and official statistics continues to grow, government agencies, international organizations, data producers, and research centers are under increased pressure to make data more widely available to researchers, stakeholders, or the general public. This presents significant challenges, as such information cannot always be easily released. Fundamental statistical principles and national legislation require the custodians to protect the privacy of the underlying respondents and ensure the data is used according to its intended purposes. Data by itself is also of limited usefulness unless it is of high quality and accompanied by comprehensive documentation. Effectively and responsibly providing access to statistical data is not a trivial task.
At the same time, funding agencies around the globe are formulating new policies encouraging or requiring researchers to provide data management plans or strategies as an integrated component of their proposals. This aims to encourage knowledge sharing, collaboration, and open access to publicly funded research data. While sound, these policies raise new challenges for individual data users.
Fortunately, the past decade has seen the emergence of metadata standards, best practices, and technologies that can facilitate such processes. Specifications such as the Statistical Data and Metadata Exchange standard (SDMX) or the Data Documentation Initiative (DDI) have come to maturity and are rapidly being adopted by agencies and individuals around the globe. Unlike in other domains, these standards have been widely endorsed and have the advantage of facing little or no competition. Numerous tools and platforms are also becoming available to facilitate the management, discovery, access, or analysis of data. Combined, these provide powerful instruments to realize effective and secure data preservation, dissemination, exchange, and sharing solutions.
Our presentation will summarize the challenges of providing access to statistical data; outline the standards and technology landscape surrounding socioeconomic data, health data, and official statistics; provide an update on recent achievements; and highlight ongoing initiatives around the globe.
Data Modeling among Non-programmers
Data Architect, Danish Commerce and Companies Agency
Data modeling is in practice an interdisciplinary and group-based activity that leads to a symbolic representation of selected aspects of a domain. Efficient and adequate physical implementations of data models require some understanding of computer programming. Unfortunately this understanding does not come easy and often involves years of practical experience. Based on experiences from within two different domains—functional genomics (genes, diseases, and patients) and government data (citizens, cars, and businesses)—I will try to highlight three concepts that have proven difficult but valuable to introduce to the domain experts.
Commercial Search Engine Developers and Universities: A Critical Time for Collaboration in the Coming Age of Publicly Accessible Research Data
Research Data Management Librarian, Cornell Institute for Social and Economic Research
Driven by new data-sharing requirements from funding agencies, most recently and notably the National Science Foundation, academic researchers are on the verge of making rapidly increasing amounts and varieties of research data available for replication of findings and reuse. Universities are now building or enhancing repositories to help researchers make their data available, and are employing and helping develop domain-specific metadata standards, such as the DDI, to aid in the discoverability and manageability of these datasets. However, with the growing amount of data and number of repositories, the risk of "data silos" increases as well. Providers of commercial search engines must join the current efforts of global, web-scale data discovery—otherwise, the usefulness of the search engines and the research data generated with public funding are both at risk.
From Dollars to Ideas: New Tools for Measuring Influence
Director of Sunlight Labs, Sunlight Foundation
To date, analytic examinations of the problem of political influence have centered on the flow of money through mechanisms like campaign contributions, contracts, and earmarks. But financial transactions are only one signal that can be used to detect when someone has gained an inappropriate amount of control over our political institutions. The Sunlight Foundation's Tom Lee will discuss new tools and datasets for tracking the manipulation of government through the systematic use of language and ideas.
How to Compare One Million Images? Visualizing Patterns in Art, Games, Comics, Photography, Cinema, Animation, Web, and Print Media
Professor, University of California, San Diego (UCSD)
The explosive growth of cultural content on the web, including social media and the digitization work by museums, libraries, and companies, makes possible a fundamentally new paradigm for the study of cultural content. We can use computational data analysis and new interactive visualization techniques to analyze patterns and trends in massive cultural datasets. We call this paradigm cultural analytics. I will show examples of visualizations of patterns in cinema, animation, video games, magazines, and comics created in our lab (softwarestudies.com) at the University of California, San Diego (UCSD) and California Institute for Telecommunications and Information Technology (Calit2). The presentation will highlight new visualization techniques for big data that use next-generation scalable displays such as the HIPerSpace system, which offers a resolution of 35,840 × 8,000 pixels.
Statistical Abstract of the United States: The Value of Data
Branch Chief, U.S. Census Bureau
The presentation discusses the value of the Statistical Abstract to the statistical community, government, researchers, and decision makers. It will also highlight the collaboration between agencies, organizations, and private companies that make up the three hundred sources of data.
Introducing Encyclopedia of Life V2: International, Personal, and Reusable Biodiversity Data
Director, Species Pages Group Encyclopedia of Life, Smithsonian's National Museum of Natural History
EOL connects worldwide audiences with information on the organisms with whom we share our planet. The scope of our task is vast—1.9 million species have already been described over the last few hundred years, and 15 to 20 thousand more are described every year. What we know is constantly changing, and different audiences need information relevant to them, in the languages that they speak. This week marks the launch of a major upgrade at www.eol.org, designed to accelerate and deepen engagement with this unique content curation community. I will present our new features such as virtual collections and communities, data richness scores, and internationalization. EOL V2 addresses the demand for informative contexts, language translation, data quality, content building, and data reuse.
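A data richness score of the kind the abstract mentions can be thought of as a weighted tally of the kinds of content a species page carries. The weights, content categories, and cap below are entirely hypothetical and are not EOL's actual scoring formula; they only illustrate the shape of such a metric:

```python
# Hypothetical weights -- EOL's real richness scoring is more involved.
WEIGHTS = {"text_sections": 2.0, "images": 1.0, "maps": 1.5, "references": 0.5}

def richness_score(page_counts, cap=100.0):
    """Score a species page by weighting the kinds of content it carries.

    page_counts maps a content category to how many items of that kind
    the page has; unknown categories are ignored, and the score is capped
    so one content-rich page cannot dominate the scale.
    """
    raw = sum(WEIGHTS[k] * n for k, n in page_counts.items() if k in WEIGHTS)
    return min(raw, cap)
```

A score like this lets curators rank pages that need attention and lets readers see at a glance how complete a page is.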
Thomson Reuters and Big Data
CTO, Thomson Reuters
Thomson Reuters is the leading source of intelligent information for the world's businesses and professionals. The massive changes in the scale, volatility, and latency requirements caused by big data demand a significant change in the way we build and manage our systems. The sheer volume of information our customers need requires us to think differently about how we build context and apply it to these information sources.
Partner and Head of Visualization, Periscopic
The world is more than what is visible around us. Data visualization is a practice that can generate insight and hasten understanding. By working through two case studies, I will show how data visualization can transform the invisible into rich intelligence.
First, I will illuminate the email traffic sent and received through Yahoo! with a small interactive visualization. I will also demonstrate a forthcoming social media visualization for GE Healthymagination, which looks at conversations about breast cancer.
I will discuss challenges our team has encountered and how we've remedied them.
The Inscrutable Lines of Cause and Effect
Chief Innovation Officer, Demand Media
Why do Academy Award winners live longer than the other nominees, and first basemen outlive other players on the team? Why do children in schools with fluorescent lighting get fewer cavities than those in incandescent-lit schools? The universe is full of non-obvious causal relationships invisible to both the eye and intuition. How might a sophisticated computational answer engine of the future help us find these relationships and thereby cure disease, end poverty, and usher in a new golden age for humanity?
Sports Analytics: Managing and Making Sense of Player-Tracking Data
Associate Vice President, Commercial Products, STATS, LLC
This presentation shares developments in player-tracking technology in sports and the challenge of managing the large volume of new data now available from it. Precise player positioning and movement are now tracked multiple times per second, and this data is used by teams and media to derive never-before-available analysis. This presentation highlights the current and future uses of that data.
Doing Business in the Face of the Information Explosion
Vice President Global Data Strategy, Dun & Bradstreet
The massive rate of change in information availability provides a richness never before available for extracting information about business entities and their related attributes. Sadly, this same tsunami of information also poses a huge challenge to finding and adjudicating the unique identity of a business. When coupled with the increased propensity and sophistication of those who would misrepresent the truth, the problem of business-entity identification becomes increasingly complex. Another trend that compounds the problem is the increased incidence of businesses practicing across borders in different languages and writing systems, thereby forcing them to adopt different personas and nomenclatures. I will discuss these problems and how Dun & Bradstreet is thinking about them in the context of doing business in the face of the information explosion.
Global Health Data Exchange
Director of Data Development, Institute for Health Metrics and Evaluation
The Global Health Data Exchange is IHME's new tool for anyone interested in global and public health data, with a primary objective of increasing discoverability of health-related data and a secondary objective of increasing the amount of data being shared.
Making State Government Data Accessible and Understandable
State Representative, Washington State
The push toward open government data is accelerating, bringing the promise of better public policy and more transparency in government function. This effort shares many of the challenges facing any large enterprise that wants to open up an ocean of information to a vast audience. The audience for this effort is a spectrum from casual observers to policy experts, so a balance between high-level summary and agonizing detail is hard to establish. Information is isolated in departmental silos, with few standards for data presentation. For any given data set, it is difficult to communicate the relevant context or to automatically express its dynamic relationships with other variables. Using examples drawn from Washington State, we will examine some past efforts and consider improvements through approaches such as crowdsourcing, public APIs, and common standards.
Crowdsourced, Collaborative Genealogy
Geni's millions of users have created what may be the largest crowdsourced document in history, a single family tree that connects almost sixty million people. Hear about the technical and cultural challenges that Geni has faced in building its platform and growing its community. Geni also offers free access to its robust dataset through a public API.
Empowering People with Data—Data.gov: What's Now and What's Next
Alan Vander Mallie
Data.gov Program Manager, U.S. General Services Administration
Data discovery, visualization, and exploration are keys to empowerment for people all over the world. Although scientists, researchers, analysts, economists, media, and programmers are all interested in and routinely download government data, the technically untrained citizen or public constituent prefers online interactive exploration and visualization of data rather than downloading. As the nation's front door to U.S. data, Data.gov advances the shared understanding and ingenuity of citizens and keeps government accountable. The first national effort of its kind, Data.gov "democratizes data" and puts it to work in the American people's hands. One operating principle of Data.gov is to meet the public's need for information and knowledge by making data available online using intuitive and familiar web standards for searching, browsing, visualizing, and sharing information. By streamlining publishing, Data.gov is making it easier for agencies to contribute high-value data. By providing more and better data for analysts, journalists, and researchers, Data.gov is leading the development of apps and reports for people to make better-informed decisions.
Drug Efficacy in the Wild
Research Scientist, PatientsLikeMe
Amyotrophic lateral sclerosis (ALS) is a devastating illness that is uniformly fatal, typically within two to four years. I'll describe the online collection and analysis of data from ALS patients who experimented with taking lithium carbonate to slow the progression of their disease. In particular, I'll describe the algorithm we developed to reduce potential bias owing to lack of randomization. Our findings contradicted the results of the research trial that had originally motivated the patients to use this treatment.
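When treatment is self-selected rather than randomized, a common way to reduce bias is to pair each treated patient with an untreated patient whose baseline disease trajectory looks similar, and then compare outcomes within pairs. The sketch below is only an illustration of that general idea (greedy nearest-neighbor matching on a single baseline progression rate); it is not the algorithm PatientsLikeMe published:

```python
def match_controls(treated, controls, tolerance=0.5):
    """Pair each treated patient with the control whose baseline
    progression rate is closest, within a tolerance.

    treated and controls map patient ids to baseline slopes (e.g., points
    lost per month on a functional rating scale). Each control is used at
    most once; patients with no close-enough match are dropped.
    Illustrative only -- not the published PatientsLikeMe algorithm.
    """
    pairs = []
    available = dict(controls)
    for pid, slope in treated.items():
        if not available:
            break
        best = min(available, key=lambda c: abs(available[c] - slope))
        if abs(available[best] - slope) <= tolerance:
            pairs.append((pid, best))
            del available[best]
    return pairs
```

Comparing post-treatment progression within such matched pairs approximates the contrast a randomized trial would provide, at the cost of only controlling for the variables used in the match.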
Financial Data Management
Chief Technology Officer, Morningstar Inc.
The presentation will introduce financial data with a focus on time series data: how they are usually collected, managed, and used in a financial setting. It will include case studies of challenges and solutions on managing the time series data, for example, how the landscape changed when we extended our coverage to the global market.
The Sea Around Us: Seeing Our Past, Present, and Future through Data in Space and Time (Impacts of Fisheries on the World's Marine Ecosystems)
Project Manager & Senior Researcher, Sea Around Us Project, UBC Fisheries Centre; FishBase & SeaLifeBase
The Sea Around Us Project (named after the book by Rachel Carson) at the University of British Columbia develops and uses fully integrated and cross-linked databases on all aspects of global fisheries, both through in-house efforts (e.g., global fisheries catches, fishing effort, water temperature, primary production, marine habitats, and socioeconomic data such as prices, fishing costs, and government subsidies) and through close collaboration with deep-linked datasets on biodiversity from the globally leading resources FishBase and SeaLifeBase. We heavily emphasize and ensure global coverage of all our datasets in time (back to 1950) and space (using 180,000 half-degree latitude-by-longitude cells). This emphasis on complete coverage in time and space, as well as our insistence on comprehensive interconnectivity, has contributed to our globally unique leadership position for assessing, documenting, and communicating the effects of fishing (both ecological and socioeconomic) on societies and marine ecosystems. In contrast to most other agencies dealing with fisheries around the world, ours primarily addresses questions and issues of concern to non-governmental organizations (NGOs). Increasingly, we are being called upon by international agencies (UNEP, FAO, World Bank, WTO, etc.) for input. Interestingly, the entire project, since its inception in 1999 as a global, collaborative effort, has been funded entirely outside both governmental and private-enterprise funding streams, through generous support from the Pew Charitable Trusts, driven by a clear strategic vision. I will illustrate our approach as well as fundamentals of the underlying and derived data streams of both the Sea Around Us Project and FishBase and SeaLifeBase, and place this in data- and content-specific context.