Schedule

Thursday, September 5

Rooms: Gallery Ballroom 1–2 | Gallery Ballroom 3
8–9:30am
Foyer: Sign in / Breakfast
 
9:30–10:45am
Welcome and Keynote
Stephen Wolfram, CEO of Wolfram Research & Creator of Wolfram|Alpha
 
10:45–11am
Foyer: Break
 
11–11:30am
"GDELT: Real-Time Automated Global Behavior and Beliefs Mapping, Modeling, and Forecasting Using Hundreds of Millions of Events"
Kalev Leetaru, Yahoo! Fellow in Residence, Georgetown University

Electronic media is allowing faster, more representative, and round-the-clock access to societal behavior around the globe. Today, documentation of and reactions to major events pour in within minutes from Bangladesh to Buenos Aires, offering unparalleled visibility into the heartbeat of global society. Moreover, the constant stream of daily life that flows across media platforms provides rich contextual background on the narratives and patterns of daily life in each region and culture. This talk will present early findings from a new dataset containing more than a quarter-billion geolocated events, from riots and protests to peace appeals and diplomatic exchanges, with global coverage from 1979 to the present and daily updates.

The Global Database of Events, Language, and Tone (GDELT) (http://gdelt.utdallas.edu) is a wide-ranging initiative to construct a catalog of human societal-scale behavior and beliefs across all countries of the world, down to the city level, over the last two centuries; to make all of this data freely available for open research; and to provide daily updates to create the first "real-time social sciences Earth observatory." Nearly a quarter-billion georeferenced events capture global behavior in more than 300 categories and cover all countries from 1979 to the present, with daily updates, creating an open academic research platform for studying global societal-scale behavior. GDELT uniquely combines the behavioral "event" data common to current watchboarding and modeling environments with a massive array of "beliefs" indicators and relationship networks, capturing both what is happening around the world and how populations are reacting to and understanding those events in real time.

GDELT is designed to support new theories and descriptive understandings of the behaviors and driving forces of global-scale social systems, from the micro level of the individual through the macro level of the entire planet, by synthesizing global societal-scale behavior in real time into a rich quantitative database that allows monitoring and analytical exploration of those trends. GDELT's evolving ability to capture ethnic, religious, and other social and cultural group identities and relationships will offer profoundly new insights into the interplay of those groups over time, providing a rich new platform for understanding patterns of social evolution. The data's real-time nature will also push current understanding of social systems beyond static snapshots toward theories that incorporate the nonlinear behavior and feedback effects that define human interaction, and will greatly enrich fragility indexes, early-warning systems, and forecasting efforts.

GDELT's goal is to help uncover previously obscured spatial, temporal, and perceptual evolutionary trends through new forms of analysis of the vast textual repositories that capture global societal activity, from news and social media archives to knowledge repositories.

This talk will present an overview of GDELT, some current applications, and future directions.

More information can be found on the GDELT website: http://gdelt.utdallas.edu.
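
For a concrete sense of how event data of this kind might be explored, here is a minimal sketch in Python. It assumes a tab-delimited GDELT daily event export; the file path and the subset of column names shown are illustrative placeholders (the authoritative layout is defined in the GDELT codebook), and the protest root code is taken from the CAMEO taxonomy.

```python
# Minimal sketch: load a GDELT-style daily event export and count protest-type
# events by country. Column names and the file path are placeholders; consult
# the GDELT codebook for the authoritative tab-delimited layout.
import pandas as pd

COLUMNS = ["GlobalEventID", "Day", "Actor1Code", "Actor2Code",
           "EventRootCode", "GoldsteinScale", "AvgTone",
           "ActionGeo_CountryCode", "ActionGeo_Lat", "ActionGeo_Long"]

events = pd.read_csv("gdelt_daily_export.tsv", sep="\t",
                     names=COLUMNS, header=None)

# CAMEO root code 14 covers protest events (assumption drawn from the CAMEO taxonomy).
protests = events[events["EventRootCode"].astype(str) == "14"]
print(protests.groupby("ActionGeo_CountryCode").size()
              .sort_values(ascending=False).head(10))
```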

 
11:30am–noon
"Geo Symptom Tracker"
Anurag Jain, VP Engineering, WebMD

Geo Symptom Tracker is a crowdsourcing solution implemented by WebMD for identifying the geographical distribution of disease activity. The data is presented to WebMD users as heat maps of the US for various diseases.

Implemented over three years ago for tracking cold & flu symptoms across the nation, the solution has tracked closely with CDC data on ILI (influenza-like illness) diagnosis numbers collected across the US.

Because the tracker's data is available more than a week sooner than actual diagnosis data, WebMD made it available to its partners and clients for identifying geographical areas of high disease activity, to help guide promotions as well as company operations. Over the 2012–2013 flu season, WebMD once again showed very high correlation with the CDC, while WebMD's key competitor had serious trouble with its projections.

The success of the WebMD solution comes from the fact that the data is collected against a defined vocabulary within Symptom Checker, from which users select and report symptoms. Other products that use search keywords as proxies for disease activity have strayed far from reality, while solutions based on actual diagnosis data have been too slow to provide timely guidance.

The solution involved consolidating disease data at several geographic levels: state, county, and DMA (designated market area). State and county data was analyzed to identify varying levels of disease activity for creating heat maps.

DMA-level data collected across the US over multiple years was analyzed to define baseline activity levels. Clinical statisticians designated, separately for each disease, a level above baseline as "active," based on the data and their clinical knowledge of the disease. The algorithms then identified, on a weekly basis, the DMAs where the disease was "active" with 95% statistical confidence. This solution has been implemented across multiple symptom sets at WebMD.
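
As a rough illustration of the kind of baseline-versus-active test described above, the sketch below flags DMAs whose weekly symptom-report rate exceeds a historical baseline at a one-sided 95% threshold. It is a simplified stand-in, not WebMD's actual algorithm; the data layout and the normal approximation are assumptions.

```python
# Simplified sketch of flagging "active" DMAs: compare this week's symptom-report
# rate against a historical baseline and mark DMAs exceeding a one-sided 95%
# threshold. This is an illustrative stand-in, not WebMD's production algorithm.
import pandas as pd

# Assumed layout: one row per DMA per week, with a per-capita report rate.
history = pd.read_csv("symptom_reports_history.csv")   # columns: dma, week, rate
current = pd.read_csv("symptom_reports_thisweek.csv")  # columns: dma, rate

baseline = history.groupby("dma")["rate"].agg(["mean", "std"]).reset_index()
merged = current.merge(baseline, on="dma")

# One-sided 95% threshold under a normal approximation (z ~= 1.645).
merged["active"] = merged["rate"] > merged["mean"] + 1.645 * merged["std"]
print(merged.loc[merged["active"], ["dma", "rate"]])
```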

 
noon–12:30pm
"OpenSpending: Building a Global Database of Public Finances"
Anders Pedersen, Community Coordinator, Open Knowledge Foundation

Across the world, the OpenSpending community is playing a leading role in enhancing access to spending data, discussing spending standards, and improving journalists' and researchers' access to government finances.

OpenSpending aims to track every government financial transaction across the world and present it in useful and engaging visualizations. The global database consists of public finances from more than 70 countries, including the United States, Brazil, Japan, and the United Kingdom, gathered by a global community.

The project enables users to add, share, and visualize both budget and transactional spending data. It has been instrumental in improving the quality and regularity of expenditure releases, by working with data.gov.uk to build a tool that helps those responsible for overseeing transparency commitments see which departments are complying and which are not. In Germany, the sister project OffenerHaushalt has received nearly 100 requests from local governments willing to release their financial data for a similar visualization of their finances. In Japan, the OpenSpending community has already mapped budgets from 45 cities and initiated discussions within cities to increase financial transparency, as an entirely civil-society-led initiative with no cost to the government.

This talk will explore the lessons learned from one of the most ambitious government financial transparency projects ever and how OpenSpending plans to achieve a more complete view of public finances to tackle issues such as inefficiencies, tax evasion, and public finance deficits.

"Why Put a Library in Everybody's Pocket?"
Greg Newby, Director, Project Gutenberg and the Arctic Region Supercomputing Center

What might it mean for everyone to have his or her own digital library, with full access and control? A library that could be shared, in parts or in whole. Where each item is fully unlocked: print, save, edit, mark up, and extract portions. We will discuss Project Gutenberg's approach to providing a personal library, and how Summit attendees could apply this principle to their own digital collections.

12:30–2pm
Salon: Lunch
 
2–2:30pm
"The Online Revolution: Education for Everyone"
Andrew Ng, Cofounder, Co-CEO, Coursera

In 2011, Stanford University offered three online courses, which anyone in the world could enroll in and take for free. Together, these three courses had enrollments of around 350,000 students, making this one of the largest experiments in online education ever performed. Since the beginning of 2012, we have transitioned this effort into a new venture, Coursera, a social entrepreneurship company whose mission is to make high-quality education accessible to everyone by allowing the best universities to offer courses to everyone around the world, for free. Coursera classes provide a real course experience to students, including video content, interactive exercises with meaningful feedback, using both auto-grading and peer grading, and a rich peer-to-peer interaction around the course materials. Currently, Coursera has 80 university and other partners and 3.6 million students enrolled in its nearly 400 courses. These courses span a range of topics, including computer science, business, medicine, science, humanities, social sciences, and more. In this talk, I'll report on this far-reaching experiment in education and why we believe this model can provide both an improved classroom experience for our on-campus students, via a flipped classroom model, and a meaningful learning experience for the millions of students around the world who would otherwise never have access to education of this quality.

 
2:30–3pm
"Data Mining Music"
Paul Lamere, Director of Developer Platform, The Echo Nest

Data mining is the process of extracting patterns and knowledge from large datasets. It has already helped revolutionize fields as diverse as advertising and medicine. In this talk we dive into mega-scale music data such as the Million Song Dataset (a recently released, freely available collection of detailed audio features and metadata for a million contemporary popular music tracks) to get a better understanding of the music and the artists who perform it. We explore how we can use music data mining for tasks such as automatic genre detection, song similarity for music recommendation, and data visualization for music exploration and discovery. We use these techniques to try to answer questions about music such as: Which drummers use click tracks to help set the tempo? Is music really faster and louder than it used to be? Finally, we look at techniques and challenges in processing these extremely large datasets.
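
As a hedged sketch of one of the questions above ("Is music really faster and louder than it used to be?"), the snippet below averages tempo and loudness by release year. It assumes the Million Song Dataset's per-track fields have already been extracted into a flat CSV; the file name and column names are placeholders.

```python
# Sketch: average tempo (BPM) and loudness (dB) by release year, assuming the
# Million Song Dataset's tempo, loudness, and year fields were pre-extracted
# into a CSV. File and column names are placeholders.
import pandas as pd

tracks = pd.read_csv("msd_track_features.csv")  # columns: year, tempo, loudness
tracks = tracks[tracks["year"] > 0]             # drop tracks with unknown year

by_year = tracks.groupby("year")[["tempo", "loudness"]].mean()
print(by_year.tail(20))  # crude view of recent trends in speed and loudness
```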

"The Future of Data Collection Using Smart Devices"
George Yu, CEO, Variable, Inc.

Collecting data is now easier than ever with the proliferation of affordable sensors and data-transmission infrastructure. The world is on the verge of a data collection revolution with the coming Internet of Everything era. However, challenges still remain, from sensing on devices to data fusion in the cloud. We will discuss our experience of data collection using smart devices such as the iPhone as a sensing hub.

3–3:30pm
"Metadata Matters"
Ian White, President, Urban Mapping

With Big Data come big responsibilities, or so the saying (sort of) goes. Federal transparency initiatives have spawned millions of rows of data; state and local programs engage developers and wonks with APIs, contests, and data galore. Sub-meter imagery ensures unparalleled accuracy, and collection efforts mean timely updates. Private industry offers attribute-laden device exhaust, forming a geo-footprint of who is going where, when, how, and (maybe) for what. The public disclosure of NSA-collected data has only served to elevate the significance of metadata. With opportunity comes challenge: the expertise in sourcing, identifying, collecting, normalizing, and maintaining geographic data is often overlooked in the mad rush to analyze. Curation, or the human side of extract, transform, and load (ETL), has increased in scope, scale, and importance as data proliferation translates to a deluge of nonstandardized data types lacking sufficient documentation or validation, calling their underlying value into question. Big Data calls for expertise in curating, acquiring, validating, and arranging data in collections that are relevant to the right audience at the right time.

 
3:30–4pm
Break
 
4–4:30pm
"Why the Census? Big Data from Enlightenment to Today"
Eric Newburger, Assistant to the Associate Director of Communications, US Census Bureau

How big is it? It is the first of the three fundamental questions statisticians ask, and necessarily the first, for the other two ("What difference does it make?" and "Are you sure that's not just dumb luck?") both rely upon measurement. For more than two centuries, the Census Bureau has been the source for answering this question about the United States and its constituent parts. The founders wrote into the Constitution that this data should be used to govern. However, in their other writings, they made it clear that they also intended this data to be an aid to commerce. In an age of internet speeds and ubiquitous electronic datasets, what role remains for official data collections that happen at most once a month?

"Data Synthesis—Addressing Small Data Problems Faced by Big Data"
Peter Sweeney, Founder & President, Primal

Big data technologies are plagued with small data problems. Their performance suffers in markets that aggregate a large number of unique interests. Some of the largest markets share these small data characteristics, including local e-commerce, personalized media, and interest networking. New approaches are needed that are far less sensitive to the cost and complexity of the data. In this talk, Primal will demonstrate how its semantic synthesis technology can overcome these small data problems. We'll draw on real-world experience in application areas such as personalized information services, recommendation engines, and expertise search.

4:30–5:30pm
"Having It All Is Not Having It All at All! Problem Formulation in the Face of Overwhelming Quantities of Data"
Anthony Scriffignano, Senior VP, Worldwide Data & Insight, Dun & Bradstreet

As many companies struggle with emerging technologies and nascent capabilities to discover and curate massive quantities of highly dynamic data, new problems are emerging around how to ask meaningful questions that leverage the "V's" of big data (e.g., volume, variety, velocity, veracity). In the business-to-business space, these challenges are creating both significant opportunity and ominous new types of risk. While the vast availability and dynamic nature of data allow business counterparties to find and do business with each other in new and exciting ways, new bad behaviors, false assumptions, and wholly inappropriate methods of problem formulation are driving great risk at an alarming and increasing rate. This session will address the phenomena impacting business-to-business decisions in the era of massively available data.

 
6–7pm
Foyer: Drinks
 
7–9pm
Salon: Dinner
 

Friday, September 6

Rooms: Gallery Ballroom 1–2 | Gallery Ballroom 3
8–9:30am
Foyer: Breakfast
 
9:30–10am
"How Will Usage-Based Auto Insurance Evolve?"
David Pratt, General Manager, Usage-Based Insurance, Progressive Insurance

Snapshot allows customers to earn discounts by sharing driving data with Progressive. Customers who drive relatively low mileage, rarely drive late at night, and avoid hard braking save money. Progressive has collected data from more than 1 million vehicles and more than 7 billion miles of driving activity. Pratt will describe how the data is used today, how it might evolve in the future, and the analytic challenges associated with this unique dataset.
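
To make the driving-behavior factors concrete, here is a small illustrative sketch that summarizes mileage, late-night driving, and hard-braking events from a hypothetical trip log. The thresholds, column names, and file are assumptions for illustration, not Progressive's rating model.

```python
# Illustrative summary of usage-based driving signals from a hypothetical trip log:
# total mileage, share of late-night driving, and hard-braking events per 100 miles.
# Thresholds and column names are assumptions, not Progressive's actual model.
import pandas as pd

trips = pd.read_csv("trips.csv")  # assumed columns: miles, start_hour, hard_brakes

late_night = trips["start_hour"].isin([0, 1, 2, 3, 4])  # assume midnight to 5am counts as late night
summary = {
    "total_miles": trips["miles"].sum(),
    "late_night_share": trips.loc[late_night, "miles"].sum() / trips["miles"].sum(),
    "hard_brakes_per_100mi": 100 * trips["hard_brakes"].sum() / trips["miles"].sum(),
}
print(summary)
```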

 
10–10:30am
"Health Data and the Recently Released Global Burden of Disease Study"
Peter Speyer, University of Washington

Published in December 2012, the Global Burden of Disease Study 2010 is the most comprehensive assessment of human health ever conducted. Coordinated by the Institute for Health Metrics and Evaluation (IHME), 488 coauthors from over 50 countries compiled all available evidence on 291 causes of disease and injury, as well as 67 risk factors, globally. The results are now available by country, age, and gender.

GBD uses three key metrics. Years of Life Lost (YLL) sums up years of life lost due to premature mortality. Years Lived with Disability (YLD) estimates how many years of life are lost to sickness, based on the prevalence and severity of diseases. Adding YLLs and YLDs together yields Disability-Adjusted Life Years (DALY), a measure of overall health loss due to diseases, injuries, and risk factors.

The analytic approach of GBD uses all available data on health outcomes, from surveys and censuses to vital registration, disease registries, hospital records, and published research. It starts by estimating the number of deaths for each country-age-gender group, then analyzes causes of death, making sure that every death is counted only once. It provides uncertainty bounds for every estimate, thereby indicating how reliable the input data is. And the metrics are fully comparable across ages, countries, and time, enabling comprehensive evaluation of levels, trends, and patterns in the data.

To make the best use of the available data, IHME has created a number of data visualizations that allow exploration of the data via maps, treemaps, line and bar charts, and more. The visualizations provide access to over 1 billion results and support everything from casual browsing to deep dives into the full detail of the study. In this talk, I will give a brief introduction to the setup and implementation of the study and use the visualization tools to illustrate results.
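
To make the metric arithmetic above concrete, here is a minimal sketch of how YLLs, YLDs, and DALYs combine; the life-expectancy and disability-weight inputs are placeholders, not the study's actual reference values.

```python
# Minimal sketch of the GBD metric arithmetic: DALY = YLL + YLD.
# Inputs below are placeholders, not the study's actual reference values.

def yll(deaths, age_at_death, reference_life_expectancy):
    """Years of Life Lost: deaths times remaining standard life expectancy."""
    return deaths * max(reference_life_expectancy - age_at_death, 0)

def yld(prevalence, disability_weight):
    """Years Lived with Disability: prevalent cases weighted by severity (0..1)."""
    return prevalence * disability_weight

# Hypothetical example: a cause with 1,000 deaths at age 60 (reference life
# expectancy 86) and 50,000 prevalent cases with a disability weight of 0.2.
daly = yll(1_000, 60, 86) + yld(50_000, 0.2)
print(f"DALYs: {daly:,.0f}")
```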

"Data Science with Wolfram Technologies: Context and Examples"
Dillon Tracy, Senior Kernel Developer, Wolfram Research

We present an overview of the data science workflow using Wolfram technologies, covering topics in statistics, visualization, import/export, and deployment. We also consider practical examples in classification, Hadoop integration, web analytics, and data scrubbing. We will preview the emerging technologies DataArray, for out-of-core computation, and ReportGeneration, for deployment. This talk will be of interest to practicing data scientists and is suitable for those with or without previous knowledge of Mathematica or the Wolfram Language.

10:30–11am
Foyer: Break
 
11–11:30am
"The Human Brain Project—An Overview"
Sean Hill, Professor, Blue Brain Project/Human Brain Project

Understanding the human brain is one of the greatest scientific challenges of our time. Such an understanding will lead to fundamentally new computing technologies, transform the diagnosis and treatment of brain diseases, and provide profound insights into our humanity. The goal of the Human Brain Project (HBP) is to catalyze a global collaborative effort to unite all existing knowledge about the human brain and to reconstruct the brain, piece by piece, in supercomputer-based models and simulations. The Blue Brain Project developed the first proof-of-principle of this data-driven process and created a unifying model of neocortical microcircuitry. Such a unifying model of the brain can serve as a catalyst for neuroscience research and offers the prospect of accelerating our understanding of the human brain and its diseases and of new computing technologies.

"Infectious Texts: Uncovering Reprinting Networks in Nineteenth-Century Newspapers"
Ryan Cordell, Assistant Professor, Northeastern University
David Smith, Assistant Professor of Computer Science, Northeastern University

Many studies of social networks and interactions use direct survey and observational data, but comparison of the language used by different members of a social network can provide additional evidence. Copying and quotation of text, in particular, can provide indirect evidence about social ties. These ties might be overt, as with scholarly citations, or covert, as with the interactions among interest groups and legislators.

We will describe our work on a particularly dense network of text reuse: the news stories, short fiction, and poetry that "went viral" in nineteenth-century American newspapers and magazines. Prior to copyright legislation and enforcement, literary texts as well as other nonfiction prose texts circulated promiscuously among newspapers as editors freely reprinted materials borrowed from other venues. What texts were reprinted and why? How did ideas—literary, political, scientific, economic—circulate in the public sphere and achieve critical force among audiences?

We will describe new approaches we are honing to identify clusters of reprinted passages. After employing space-efficient n-gram indexing techniques to identify candidate newspaper issues and then local models of alignment to identify reprinted passages, we group pairs of matching passages into larger clusters of text reuse. We will also describe new models we are developing to characterize reprinted texts, using both internal and external evidence. We augment models of the linguistic features of reprinted texts with features of the political, social, religious, and geographic affinities of the venues where they appeared and evaluate the effectiveness of both these components by manually constructing collections of reprinted texts.

This granular data about the nature of frequently circulated texts and the paths of their circulation will enable us to understand the shape and constraints of the public sphere, the development of which was key to nineteenth-century US history, including democratic extension of the franchise, antebellum sectionalism, the abolitionist movement, and westward growth of the nation.
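
To illustrate the first stage of such a pipeline (candidate detection via n-gram indexing), here is a minimal sketch that shingles documents into word 5-grams and finds pairs sharing shingles. It is a toy version under assumed inputs, not the authors' production system, and it omits the alignment and clustering stages.

```python
# Toy sketch of candidate detection for text reuse: index word 5-grams ("shingles")
# and report document pairs that share them. This omits the local-alignment and
# clustering stages described in the talk and is not the authors' system.
from collections import defaultdict
from itertools import combinations

def shingles(text, n=5):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

documents = {  # hypothetical newspaper issues
    "paper_A_1845_03_12": "the steamer arrived last evening with later news from europe",
    "paper_B_1845_03_19": "later news from europe the steamer arrived last evening",
}

index = defaultdict(set)               # shingle -> set of document ids
for doc_id, text in documents.items():
    for s in shingles(text):
        index[s].add(doc_id)

pair_counts = defaultdict(int)         # (doc_id, doc_id) -> shared shingle count
for ids in index.values():
    for a, b in combinations(sorted(ids), 2):
        pair_counts[(a, b)] += 1

for pair, count in sorted(pair_counts.items(), key=lambda kv: -kv[1]):
    print(pair, count)
```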

11:30–noon
"Bad Bugs: The 100K Food-Borne Pathogen Genome Project"
Sufian Al Khaldi, Scientist and Food Outbreak Investigator, Center for Food Safety and Applied Nutrition, FDA

The 100K Pathogen Genome Project involves sequencing pathogenic bacteria from all over the world and will transform the ways and means of tracking bacterial strains that cause food outbreaks. The project addresses the continuing challenge of food safety by seeking worldwide partners to generate a publicly available genetic database of the most common food-borne microbes isolated from clinical and food samples. To date, several thousand genomes have been collected from bacterial strains representing different countries. Collecting bacterial strains for genome sequencing will help public authorities tackle food outbreak problems rapidly by pinpointing the sources of food contamination, saving human lives and millions of dollars.

Out with the old—in with the new! The New Bad Bug Book CDF app is a new dynamic way of exploring data.

"15-Petabyte Public Interest Digital Library"
Roger Macdonald, Director, Television Archive, Internet Archive

I'll report on recent activities of the Internet Archive and invite engagement with our media. Our latest project, TV News Search & Borrow (archive.org/tv), offers opportunities to search more than 400,000 recent US television news programs, quote short segments, and borrow whole programs. We are experimenting with ways to facilitate deeper analysis of news media and their metadata while honoring varied stakeholder concerns.

noon–1:30pm
Salon: Lunch
 
1:30–2pm
"Sports Analytics v2.0: Assessing Team Strategy Using Spatiotemporal Data"
Patrick Lucey, Disney Research, Pittsburgh

The "Moneyball revolution" coincided with a shift in the way professional sporting organizations handle and utilize data in their decision-making processes. Due to the demand for better sports analytics and improvements in sensor technology, a plethora of ball- and player-tracking information is now generated within professional sports for analytical purposes. However, due to the continuous nature of the data and the lack of associated high-level labels to describe it, this rich set of information has had very limited use, especially in the analysis of a team's tactics and strategy. In this talk, I will give an overview of the types of analysis currently performed, mostly with event data, and highlight the problems associated with the influx of spatiotemporal data. By way of example, I will present an approach to dealing with spatiotemporal data that uses an entire season of ball-tracking data from the English Premier League (2010–2011 season). Using this analysis, I will show that home advantage in soccer is partly due to the conservative strategy of the away team. Additionally, I will discuss the permutation issues inherent in player-tracking data and how they can be avoided using a "role representation."
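
The "role representation" mentioned above addresses the fact that raw tracking data is only defined up to a permutation of players. One common way to resolve this, sketched below with made-up inputs, is to assign observed positions to role templates by minimum-cost matching (the Hungarian algorithm); this illustrates the general idea and is not necessarily the speaker's exact formulation.

```python
# Sketch: map an unordered set of player positions onto a fixed set of role
# templates (e.g., a formation) by minimum-cost assignment (Hungarian algorithm).
# Positions and templates below are made up; this illustrates the general idea
# of a "role representation," not the speaker's exact method.
import numpy as np
from scipy.optimize import linear_sum_assignment

players = np.array([[12.0, 30.0], [45.0, 10.0], [44.0, 55.0], [70.0, 34.0]])  # observed (x, y)
roles   = np.array([[10.0, 32.0], [46.0, 52.0], [46.0, 12.0], [72.0, 33.0]])  # role templates (x, y)

# Cost matrix: distance from each player to each role template.
cost = np.linalg.norm(players[:, None, :] - roles[None, :, :], axis=2)
player_idx, role_idx = linear_sum_assignment(cost)

for p, r in zip(player_idx, role_idx):
    print(f"player {p} -> role {r} (distance {cost[p, r]:.1f})")
```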

 
2–2:30pm
"The Dynamics of Correlated Novelties"
Vittorio Loreto, Professor, Sapienza University of Rome

One new thing often leads to another. Such correlated novelties are a familiar part of daily life. They are also thought to be fundamental to the evolution of biological systems, human society, and technology. By opening new possibilities, one novelty can pave the way for others, in a process that Kauffman has called "expanding the adjacent possible." The dynamics of correlated novelties, however, have yet to be quantified empirically or modeled mathematically. Nowadays, thanks to the availability of extensive longitudinal records of human activity online, it has become possible to test whether everyday novelties crop up by chance alone, or whether one truly does pave the way for another. Here I'll propose a simple mathematical model that mimics the process of exploring a physical, biological, or conceptual space that enlarges whenever a novelty occurs. The model predicts statistical laws for the rate at which novelties happen (analogous to Heaps' law) and for the probability distribution on the space explored (analogous to Zipf's law), as well as signatures of the hypothesized process by which one novelty sets the stage for another. These predictions have been tested on four datasets of human activity: the edit events of Wikipedia pages, the emergence of tags in annotation systems, the sequence of words in texts, and listening to new songs in online music catalogs. By quantifying the dynamics of correlated novelties, these results provide a starting point for a deeper understanding of the ever-expanding adjacent possible and its role in biological, linguistic, cultural, and technological evolution.
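
In the spirit of the model described (a space that expands whenever a novelty occurs), here is a minimal urn-with-triggering simulation. The parameters and this particular formulation are assumptions for illustration; the exact model is in the speaker's published work.

```python
# Minimal urn-with-triggering simulation in the spirit of the described model:
# drawing a color reinforces it (rho extra copies); drawing a never-seen color
# adds nu + 1 brand-new colors to the urn ("expanding the adjacent possible").
# Parameters and formulation are illustrative, not the exact published model.
import random

def simulate(steps=20000, rho=10, nu=3, seed=0):
    random.seed(seed)
    urn = [0]                 # start with a single color
    next_color = 1
    seen = set()
    distinct_seen = []        # number of distinct colors seen after each draw
    for _ in range(steps):
        ball = random.choice(urn)
        urn.extend([ball] * rho)          # reinforcement
        if ball not in seen:              # a novelty: expand the adjacent possible
            seen.add(ball)
            urn.extend(range(next_color, next_color + nu + 1))
            next_color += nu + 1
        distinct_seen.append(len(seen))
    return distinct_seen

growth = simulate()
# Heaps'-law-like behavior: distinct colors grow sublinearly with the number of draws.
for t in (100, 1000, 10000, 20000):
    print(t, growth[t - 1])
```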

"Cultural Heritage Institutions and Big Data Collections"
Leslie Johnston, Chief of Repository Development, Library of Congress

Cultural heritage organizations have, until recently, spoken of "collections" and "content" and "records" and even "files." Datasets are not just scientific and business tables and spreadsheets. Data is not just generated by satellites, identified during experiments, or collected during surveys. We still have collections, but what we also have is Big Data in our libraries, archives, and museums. We must collect and preserve research data, in addition to recognizing that the collections we already have are also data resources to be mined. This requires us to rethink the infrastructure that is needed to make use of our collections. This talk will present a case study from the Library of Congress on acquiring, collecting, and preserving large-scale digital collections in many formats and making them usable as collections and as data.

2:30–3pm
"Using Data for Social Good: Unlocking the Potential of Big Data to Change the World"
Peter Panepento, Chronicle of Philanthropy

Big data has the power to change the world and help organizations solve important problems. But for every new piece of valuable data, a larger pile of useless data obscures it. It's tough work to sift through all of it to find the pieces that lead to greater insights. It's even more difficult to translate that data into something useful and transformative. Organizations and individuals need to understand what stories they want to tell with data. When the right data is gathered in the right way and presented intelligently, that is where the magic of data begins to fulfill its promise. Learn how The Chronicle of Philanthropy is transforming the way it thinks about, collects, and presents data to help nonprofits, foundations, and other organizations make better decisions and better understand their place in the world. And learn how journalism organizations can partner with businesses and charities to develop rich, powerful data presentations that illuminate the world and help inspire change.

"People Hear the Title First: A Mixed-Method Study of the Cultural Place of Science Fiction across Media, Genres, and Decades"
Eric Rabkin, Associate Provost for Online Education, Stony Brook University

Titles of novels, short stories, films, and video games across all genres (science fiction, romance, Westerns, and so on) are generally short, succinct to the point of being cryptic, yet at the same time they are often both the first point of contact for the consumer—reader, viewer, listener, or player—and also the kernel that lodges itself in people's minds before, during, and after consuming the titled work. As such, titles play a vital role in media consumption experiences. Aside from simply recruiting people's attention, titles can convey and label the content of the media product they represent and, by their at least subliminal persistence during the extended consumption experience, titles influence that experience. In subtle ways, reading Gatsby would be different from reading The Great Gatsby or one of several alternative titles Fitzgerald seriously entertained, such as Trimalchio in West Egg.

Assuming that texts, films, and games are not produced in a contextual vacuum, but rather mark products within complex cultural systems of production and consumption, we expect titles to follow (after the fact), track (coeval with the fact), or even lead (before the fact) cultural dynamics noticed through other lenses (such as news media reports of prominent activities, such as war). Because of its wide cultural diffusion (in most entertainment media, in industrial design, in city planning, and so on), science fiction provides a superb field for cultural analysis.

To understand the differences among works of science fiction in different media and to study the status of science fiction titles as markers of cultural dynamics, we employ an innovative approach to performing data mining on titles from several different media. We conduct cross-sectional as well as longitudinal (across several decades, dating back to the 1930s where possible) frequency analysis of title words of science fiction novels, short stories, films, and video games to detect textual patterns that correlate with medium-specific consumption experiences and cultural dynamics. In addition to using publicly available databases (ISFDB for novels and short stories, IMDb for movies, and Giant Bomb for video games), we also use our custom-built GEP (Genre Evolution Project) database of short stories published in American science fiction magazines (1923–2000). For the GEP database, we have coded thousands of short stories along many dimensions of content and style, which allows us to identify much more detailed patterns of textual and cultural correlations than would title analysis alone. The GEP database provides a well-defined snapshot of the overall textual production in science fiction short stories.

In the present study, we extend results obtained from analyzing GEP data and results obtained by title analysis across media by linking both sets of observations and demonstrating that the more and less detailed approaches reinforce each other. A main methodological conclusion is that our minimalist approach to textual data mining of titles is sufficient for the detection of patterns that correlate with (follow, track, or lead) actual cultural dynamics.
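
As a minimal sketch of the longitudinal title-word frequency analysis described, the snippet below counts title words by decade from an assumed (title, year) table; the file and column names are placeholders, and the stop-word handling is deliberately crude.

```python
# Sketch of longitudinal title-word frequency analysis: count words in titles by
# decade. The input file and columns are placeholders; stop-word handling is crude.
import re
from collections import Counter, defaultdict
import pandas as pd

titles = pd.read_csv("sf_titles.csv")  # assumed columns: title, year
stopwords = {"the", "of", "a", "an", "and", "in", "to"}

by_decade = defaultdict(Counter)
for title, year in zip(titles["title"], titles["year"]):
    decade = (int(year) // 10) * 10
    for word in re.findall(r"[a-z']+", str(title).lower()):
        if word not in stopwords:
            by_decade[decade][word] += 1

for decade in sorted(by_decade):
    print(decade, by_decade[decade].most_common(5))
```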

3–3:30pm
Break
 
3:30–4pm
"DataCite—Making Datasets Citable"
Jan Brase, Executive Officer, DataCite

The scientific and information communities have largely mastered the presentation of and linkages between text-based electronic information by assigning persistent identifiers that give scientific literature unique identities and accessibility. Knowledge, as published through scientific literature, is, however, often the last step in a process originating from scientific research data. Today scientists use simulation, observational, and experimentation techniques that yield massive quantities of research data.

This data is analyzed, synthesized, and interpreted, and the outcome of this process is generally published as a scientific article. Access to the original data as the foundation of knowledge has become an important issue throughout the world, and different projects have started to find solutions.

Global collaboration and scientific advances could be accelerated through broader access to scientific research data. In other words, data access could be revolutionized through the same technologies used to make textual literature accessible.

The most obvious opportunity to broaden visibility of and access to research data is to integrate its access into the medium where it is most often cited: electronic textual information. Besides this opportunity, it is important, irrespective of where it is cited, for research data to have an internet identity.

Since 2005, the German National Library of Science and Technology (TIB) has offered a successful Digital Object Identifier (DOI) registration service for the persistent identification of research data. Since 2010, these services have been offered by the global consortium DataCite, carried by 17 member organizations from 12 countries, such as the British Library, the ETH Zurich Library, the California Digital Library, and the Australian National Data Service (ANDS).
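
To show what a dataset DOI buys in practice, the snippet below resolves a DOI through the doi.org proxy and prints where it currently points; the DOI string used here is a placeholder, and a real data citation would pair the DOI with creator, year, title, and publisher.

```python
# Sketch: resolve a dataset DOI through the doi.org proxy and show the landing
# URL it currently points to. Replace the placeholder DOI with a real one.
import urllib.request

doi = "10.1234/example-dataset"                      # placeholder DOI
req = urllib.request.Request(f"https://doi.org/{doi}", method="HEAD")
with urllib.request.urlopen(req) as response:
    print("resolves to:", response.url)              # final URL after redirects
```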

"Transforming Legacy Data into State-of-the-Art Interactive Visualizations"
Mark Elbert, Director, Office of Web Management, US Energy Information Administration

Statistical agencies have troves of data, often poorly presented and hosted in legacy systems. That formerly described the Energy Information Administration's electricity data. By mining the data and building interactive web pages with an array of open-source libraries, EIA has created interconnected APIs, interactive maps, and advanced query and visualization tools. This talk will focus on two themes: the web browser as a potent computing platform with a flexible developer kit of mature open-source libraries, and how a properly transformed legacy dataset can power an ecosystem of online dissemination tools.
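
As a small sketch of building on open data APIs like EIA's, the snippet below fetches a data series over HTTP and prints recent values. The endpoint, series id, and response shape are assumptions based on EIA's public series API of the time; check the current API documentation and supply a registered API key before use.

```python
# Sketch: pull an electricity data series from EIA's public API and print the
# most recent values. Endpoint, series id, and response shape are assumptions;
# consult EIA's API docs and register for an API key before running.
import json
import urllib.request

API_KEY = "YOUR_EIA_API_KEY"                 # placeholder
SERIES_ID = "ELEC.GEN.ALL-US-99.M"           # illustrative series id

url = f"https://api.eia.gov/series/?api_key={API_KEY}&series_id={SERIES_ID}"
with urllib.request.urlopen(url) as response:
    payload = json.load(response)

series = payload["series"][0]
for period, value in series["data"][:6]:     # assumed [period, value] pairs
    print(period, value)
```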

4–4:30pm
"Deep Data: Mapping the Legal Genome"
Adam Hahn, Cofounder & CTO, Judicata

At Judicata, our thesis is that legal research requires a deep understanding of a relatively small number of documents (with respect to traditional "big data" magnitudes). Instead of a simple analysis of a large volume of documents, we must comprehend both individual documents and the entire corpus with high accuracy. Learn how Judicata uses a combination of artificial intelligence and human-computer hybrid techniques to help litigators find needles, yet understand the haystack.

"Who Will Archive the Archives? Thoughts about the Future of Web Archiving"
Michael Nelson, Associate Professor, Old Dominion University

As more of our culture is encoded in the web, archiving the web becomes increasingly important. In our experience, when asked about web archiving, people often respond in one of two disappointingly uninformed ways: (1) "Who would be interested in old web pages?" and (2) "The Internet Archive already has every copy of every page ever created, so what's left to do?" Despite the yeoman's work of the Internet Archive, much remains to be done. Perhaps most importantly, much of the web is not archived. Because it is hard to generate a "representative sample" of the web, we sampled from four different sources and found that the percentage of each sample archived by at least one public web archive ranged from an encouraging 90% (sampled from dmoz.org) to a discouraging 16% (sampled from bit.ly). We have also discovered that the URLs we share via Twitter and Facebook are not as archivable as the URLs we use to build archive collections; in other words, what we save and what we share are fundamentally different. We are also interested in the integrity of the contents of web archives. Because web pages are constructed with (sometimes hundreds of) embedded resources, many of those resources are missing from the archive or were crawled years before or after the root HTML page that embeds them. In summary, we cannot always trust that a page rendered from an archive represents what a user saw on the day the repository claims to have archived it.
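
The kind of archival-coverage measurement described above can be approximated with the Internet Archive's Wayback Machine availability API; the sketch below checks whether a URL has at least one public snapshot there (a fuller study would also query other public web archives).

```python
# Sketch: check whether a URL has at least one snapshot in the Internet Archive's
# Wayback Machine, via its public availability API. A fuller coverage study would
# also query other public web archives.
import json
import urllib.parse
import urllib.request

def is_archived(url):
    query = urllib.parse.urlencode({"url": url})
    with urllib.request.urlopen(f"https://archive.org/wayback/available?{query}") as resp:
        data = json.load(resp)
    # "archived_snapshots" is empty when no public snapshot exists.
    return bool(data.get("archived_snapshots"))

for url in ["http://example.com/", "http://example.com/some/rarely-shared/page"]:
    print(url, "archived" if is_archived(url) else "not found in Wayback")
```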