Archive for the ‘Public Brainstorm’ Category

Fortifying Confidence by Stealing From Academics. And Scientists.

November 10th, 2009

Driven in large part by open government efforts initiated by the Obama Administration, and particularly Federal CIO Vivek Kundra, tremendous and rich data sets have become available from the federal government, as well as some state and local governments. This data is published digitally, in organized, well-known and documented formats.1

And because these government-amassed data sets have already been paid for with taxpayer dollars, they have rightfully been released into the public domain, free of charge.

Tim O’Reilly – founder and CEO of O’Reilly Media, as well as organizer and host of various technology conferences including the Government 2.0 Summit – describes the thinking behind this policy decision in an article he wrote for Forbes:

Rather than licensing government data to a few select “value added” providers, who then license the data downstream, the federal government (and many state and local governments) are beginning to provide an open platform that enables anyone with a good idea to build innovative services that connect government to citizens, give citizens visibility into the actions of government and even allow citizens to participate directly in policy-making.

The primary distribution point for the federal government’s data is the data.gov website (about which I’d earlier written). In another article he’d guest-authored for TechCrunch, Mr. O’Reilly talks about this website, writing:

Behind [the] site is the idea that government agencies shouldn’t just provide web sites, they should provide web services. These services, in effect, become the government’s SDK (software development kit). The government may build some applications using these APIs, but there’s an opportunity for private citizens and innovative companies to build new, unexpected applications. This is the phenomenon that Jonathan Zittrain refers to as “generativity”, the ability of open-ended platforms to create new possibilities not envisioned by their creators.

The range of potential applications for these data is difficult to exaggerate (or, frankly, to even imagine). A thorough exploration of these possibilities is beyond the scope of this post, but this showcase of apps built on data made available by the city of San Francisco gives a small peek at the broad range of uses for this government data.

In browsing that app showcase, I would note that none of the apps found there were written by the government. That’s zero. Rather, each was developed by a third party.

I would also note that most of those apps combine multiple data sets, many of which come from non-governmental sources.2

Clearly all this is just the beginning.

Opportunities and Challenges to Come

The data sets will grow broader as the federal government continues to expand its data offerings, and as more state and local governments begin to follow suit, as Utah, San Francisco, and even my home town of New York City have since done.

As the data sets become richer throughout this process, mining the information on offer will provide opportunities to develop insights about matters ranging from public health to environmental developments and energy consumption, and from regional commercial performance to educational development.

And once there’s some historical depth to these records – through a combination of digitally publishing data sets from earlier years and continuing to release new data – we will eventually start to see various types of projection models developed for many of the issues mentioned above, from economic development forecasts to predictions for the spread of disease outbreaks.

These data sets stand to revolutionize both entrepreneurial endeavors and academic research projects.

And with the research grant allocations en route from provisions of the American Recovery and Reinvestment Act of 2009, we’re likely to see a staggering number of new projects rise from both academia and the business world.

But with all the arguments and assertions about correlations, trends, and predictions that this number-crunching activity will generate, it will become increasingly crucial to have a mechanism by which the claimed results can be accounted for.

It’s not difficult to imagine, after all, the proliferation of claims that will begin to emerge, anchoring their proposed value on these mountains of data.3 Luckily, after decades of subjection to some of the most talented number-spinning tactics that statisticians teamed up with PR specialists have thrown about, many people have developed a thick skin (and perhaps even default suspicion) against allowing “the numbers” to speak to very much.

And rightfully so; “the numbers” can build nearly any narrative a storyteller wishes to weave, depending on how they’re sliced, diced, and manicured.

Numbers may not be able to lie, but men sure can.

Luckily, we can find some time-tested solutions for mitigating falsification and incompetence by looking to the techniques applied in works of scholarship and the practices of scientific peer review: scholars must meticulously cite their sources in bibliographies attached to their work, and scientists must accompany any publication of their results with a detailed description of their methods.

It must similarly become incumbent upon anybody publishing findings derived from mining such data to share both the sources and processes used to derive their results or conclusions. In cases of claims rooted in the fruits of data mining endeavors, it is specifically important that results indicate:

  1. exactly which data sets they draw from, and

  2. precisely which algorithm(s) processed the data in question.
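As a rough illustration of what such a disclosure could contain, the sketch below builds a small record that names each data set and algorithm alongside a SHA-256 fingerprint of its contents. The function names, labels, and sample bytes are all hypothetical, and a content hash is only a stand-in here, not an established identifier scheme.

```python
import hashlib
import json


def fingerprint(content: bytes) -> str:
    """Return the SHA-256 hex digest of the given bytes."""
    return hashlib.sha256(content).hexdigest()


def provenance_record(datasets: dict, algorithms: dict) -> dict:
    """Build a record naming every input by label and content fingerprint.

    Both arguments map a human-readable label to the raw bytes of the
    data set or the algorithm's source package (hypothetical inputs).
    """
    return {
        "datasets": {name: fingerprint(raw) for name, raw in sorted(datasets.items())},
        "algorithms": {name: fingerprint(src) for name, src in sorted(algorithms.items())},
    }


if __name__ == "__main__":
    # Made-up sample inputs, purely for illustration.
    record = provenance_record(
        datasets={"example-extract": b"region,value\nX,100\n"},
        algorithms={"example-mean": b"def mean(xs): return sum(xs) / len(xs)\n"},
    )
    print(json.dumps(record, indent=2))
```

Anyone holding the same inputs can recompute the fingerprints and confirm they match the published record, which is the minimum needed before trying to reproduce a result.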

The trouble, however, is that there is neither a comprehensive repository nor a system for unique canonical identifiers to publicly and universally identify such data sets and algorithms. Their absence makes any attempts to reproduce such results very challenging, at best.

Fortifying Confidence in the Results

Books, by contrast, have an ISBN. Books also have a governmental repository, called the Library of Congress.

So I propose that similar mechanisms must be worked out for data sets and algorithms. Perhaps serving as this repository becomes an evolutionary portion of the Library of Congress’ own charter. This repository would be a web service that exposes each individual data set and data mining algorithm source code package under permalinks which incorporate their respective canonical identifier.

Potential examples of such permalinks may look something like this:

  http://www.loc.gov/datasets/0123457/us-census-2010

  http://www.loc.gov/algorithms/76543210/higgs-boson-modeler
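To make the shape of these permalinks concrete, here is a minimal sketch that builds and parses URLs of the form shown above. It assumes a three-part path (kind, numeric identifier, descriptive slug); no such service exists, and the function names are hypothetical.

```python
from urllib.parse import urlparse

BASE = "http://www.loc.gov"  # host taken from the example permalinks
KINDS = ("datasets", "algorithms")


def build_permalink(kind: str, identifier: str, slug: str) -> str:
    """Compose a permalink from a resource kind, canonical identifier, and slug."""
    if kind not in KINDS:
        raise ValueError(f"unknown kind: {kind!r}")
    if not identifier.isdigit():
        raise ValueError(f"identifier must be numeric: {identifier!r}")
    return f"{BASE}/{kind}/{identifier}/{slug}"


def parse_permalink(url: str) -> dict:
    """Split a permalink back into its kind, identifier, and slug."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) != 3 or parts[0] not in KINDS or not parts[1].isdigit():
        raise ValueError(f"not a repository permalink: {url!r}")
    kind, identifier, slug = parts
    # The identifier stays a string so that leading zeros survive.
    return {"kind": kind, "identifier": identifier, "slug": slug}


if __name__ == "__main__":
    print(build_permalink("datasets", "0123457", "us-census-2010"))
    print(parse_permalink("http://www.loc.gov/algorithms/76543210/higgs-boson-modeler"))
```

In this sketch the identifier alone is meant to pin down the resource; the trailing slug is only there for human readers.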

Naturally, in some cases it may prove imperative to restrict access to resources stored in this repository, and that must be accounted for.

Although I’ve focused so far on publicly available data sets, it is inevitable that a number of valuable projects will on occasion leverage data sets whose rights are privately owned, and to which access must be controlled by obtaining permission of some sort from the owner.

The same concern is naturally prone to surface with some regularity for algorithms, as well.

This is a consideration that will require some real thought, but I’ll leave that to a future exploration. For the time being, I’ll simply note that HTTP does provide mechanisms for access restriction (particularly the 401, 402, and 403 status codes), leaving only the policy by which those mechanisms will be applied to be worked out.
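As a minimal sketch of how those three status codes could map onto an access decision, consider the following. The `Resource` fields and the order of the checks are assumptions made for illustration; the actual policy is exactly the part still to be worked out.

```python
from dataclasses import dataclass, field
from http import HTTPStatus
from typing import Optional, Set


@dataclass
class Resource:
    """A stored data set or algorithm (hypothetical model)."""
    public: bool = True
    requires_payment: bool = False
    allowed_users: Set[str] = field(default_factory=set)


def status_for(resource: Resource, user: Optional[str] = None,
               has_paid: bool = False) -> HTTPStatus:
    """Pick the HTTP status a repository might return for a request."""
    if resource.public:
        return HTTPStatus.OK
    if user is None:
        return HTTPStatus.UNAUTHORIZED       # 401: caller must authenticate first
    if user not in resource.allowed_users:
        return HTTPStatus.FORBIDDEN          # 403: known caller, no permission
    if resource.requires_payment and not has_paid:
        return HTTPStatus.PAYMENT_REQUIRED   # 402: permitted, but a fee is due
    return HTTPStatus.OK


if __name__ == "__main__":
    private = Resource(public=False, allowed_users={"alice"})
    print(status_for(private))           # anonymous request
    print(status_for(private, "bob"))    # authenticated but not permitted
    print(status_for(private, "alice"))  # authenticated and permitted
```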

Although there’s loads to work out about how such a repository can be actualized, its availability will become crucial in the coming years.

I simply hope that both the practice of sharing data sources and methods – as well as a suitable canonical repository for them – materialize sooner rather than later, since only a few silly and reckless abuses of this data can undermine public confidence in efforts to fully harness its potential value.

Footnotes

  1. Importantly, the manner in which all this data is distributed is an ideal packaging for use as input for processing by data-crunching algorithms developed by anyone interested in doing so.
  2. This practice of combining data sets from different sources to create a new, value-added data set is referred to as creating a mashup.
  3. It will also certainly be leveraged to evaluate the government’s performance, both by the current administration and – perhaps more compellingly – by its political opponents.

Public Brainstorm

Bloomberg Anachronistically Proposes 311 “Mass Transit Hotline”

August 29th, 2009

The mayoral election season is descending upon New York City, and it’s time for the candidates to start taking on the causes that will define their election platforms. One of the issues that incumbent mayor Michael Bloomberg is starting to get vocal about is a plan to implement MTA reforms, which his campaign website describes as:

A thoughtful, comprehensive 33-part plan that lays out tangible, realistic ideas to help the MTA reduce costs, reduce congestion, speed commutes, improve efficiency, enhance accessibility, and ultimately produce a safer, faster, cleaner, better mass transit system.

As a man whose daily routine has depended heavily on the operations of the MTA (particularly the subway system) for over a decade, this is a concern in which I’ve become heavily invested. I’ve encountered my share of frustrations with the organization’s results, and I frankly have much to say about ways to improve the overall quality of the MTA’s service.

To be sure, I have a number of specific thoughts about various points in this plan, but I’d like to focus on one particular point for the moment: the idea of turning 311 into the city’s “Mass Transit Hotline.”

I’m sorry — a hotline?

The stated goal of this hotline is to provide quick and easy access to transit information, such as service schedules, travel maps, and up-to-date alerts regarding planned and circumstantial service alterations. This is indeed an important goal, but a phone hotline is honestly one of the least effective ways I can think of to accomplish it.

Simply put, nobody likes to call in for “phone support” for anything. This is because phone support systems universally suck, for everyone involved.

Now, I do feel like it would be useful to also offer 311 as a source of travel information, but only for people who cannot get it by other means. It could be a valuable new offering for, say, the visually impaired. Or, as a last resort for a person in some other extenuating circumstance. As such, a transit hotline would be more of an accessibility enhancement for transit information.

The fact is there are already a number of ways to access timely transit information that are better and more effective than a call-in hotline. Unfortunately, the average MTA customer has no idea any of them exist.

One example is www.MyMtaAlerts.com. The tool allows registered users to subscribe to service alerts for specific subway lines, bus service, and more, which all get delivered to their email inbox, mobile phone, or both. Although there’s plenty of room for improvement, this tool does allow MTA customers to subscribe to important information about the specific parts of the MTA’s vast transportation system that are directly relevant to them, and it gets the information into customers’ hands without them having to even think about asking for it.

Other tools, including a trip planner, schedule listings, and more, are also available at www.mta.info… provided you actually manage to discover them in the train wreck of a website (yea, I’ll confess: pun fully intended).

The fact that these do exist, however, demonstrates that the MTA is tracking and managing all this information digitally.

So the bottom line here is that, if Bloomberg wishes to make a meaningful difference in getting transit information into the hands of New Yorkers, he’ll have to focus on making this data more accessible.

This broadly boils down to taking the following actions:

  1. Raise public awareness. Promote use of the existing tools in subway PA announcements. Rather than just reminding people that police may randomly search bags, discouraging them from giving money to panhandlers, or telling them to step back from the yellow safety lines as trains enter and leave stations, these messages can encourage people to sign up for email and text message alerts online. Print subway ads. Run TV spots. Feature these tools prominently on MetroCards. This can begin immediately.

  2. Redesign the MTA website. I don’t simply mean tweaking the colors, adding some gradients, and moving to some three-column layout. This site is in dire need of a ground-up rethinking of how it’s organized. Although I have loads of specific criticisms about this site, I’ll save those for a later post. For now, I’ll simply say that the home page needs, at minimum, to directly expose their existing travel tools. This can be pulled off iteratively, over the course of several months.

  3. Expose the transit information via data feeds and Web Service APIs. The MTA is clearly tracking service information digitally, as it’s using it to power both the MyMtaAlerts website and Google Maps’ capability to offer door-to-door travel directions via the MTA’s network. Connecting the infrastructure powering these services to data feeds and web services would allow both the MTA and third-party developers to create new web and mobile device applications designed to meet their customers’ evolving needs. This effort will take the longest of all, but it will prove to be an investment that creates a foundation for continued improvements to MTA customer service.
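To give a sense of what the third item could enable, here is a minimal sketch of a third-party client filtering a service-alert feed down to the lines a rider cares about. The JSON layout, field names, and sample alerts are invented for illustration; they do not describe any feed the MTA publishes.

```python
import json

# Invented sample feed; the field names are assumptions, not an MTA format.
SAMPLE_FEED = """
{
  "alerts": [
    {"line": "A", "kind": "planned", "message": "Sample planned-work notice"},
    {"line": "7", "kind": "delay", "message": "Sample delay notice"},
    {"line": "L", "kind": "planned", "message": "Sample planned-work notice"}
  ]
}
"""


def alerts_for(feed_json: str, subscribed_lines: set) -> list:
    """Return only the alerts whose line is in the rider's subscription set."""
    feed = json.loads(feed_json)
    return [alert for alert in feed["alerts"] if alert["line"] in subscribed_lines]


if __name__ == "__main__":
    for alert in alerts_for(SAMPLE_FEED, {"A", "L"}):
        print(f'{alert["line"]}: {alert["message"]}')
```

Once the data is published in a machine-readable form, a filter like this takes a few lines, and the same feed could back a website, a text-message service, or a mobile app.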

Having 311 take on the role of “Mass Transit Hotline” in an effort to get New Yorkers timely transit information is an idea that would have, quite frankly, been deficient even in the 20th century.

But it’s 2009 now.

Bloomberg and NYC need to look to where government and society are moving. Mobile and web are the only information delivery solutions that can improve today’s commutes, while investing in improving tomorrow’s.

Government 2.0, Public Brainstorm

Sketching the Migration to Digital Education

August 18th, 2009

California governor Arnold Schwarzenegger’s recent proposal to adopt so-called e-textbooks for his state’s public school system has triggered a flurry of press coverage, as well as new products like the Kindle DX and CourseSmart’s iPhone app in the market.

The idea has critics. There are concerns regarding its economic feasibility, as well as intellectual property management and, naturally, the functional requirements for such devices.

An overview of these matters includes the following:

  • Economic Feasibility

    1. How will the costs behind distributing the readers (the actual hardware units) to every student be covered?

    2. What business model(s) will be available for textbook publishers?

  • Intellectual Property

    1. What safeguards do publishers have against unauthorized distribution of their materials (e.g., piracy)?

    2. What safeguards does the educational system have against vendor lock-in? Schools should never become beholden to any one company.

    3. What about ownership of the software itself? The operating system, the format of the interactive materials, etc.

  • Functional Requirements

    1. What sort of hardware capabilities must these devices offer? Of course, they’ll have to display text in layouts with photos and diagrams, but what about video? What about 3D rendering for visualization purposes, or network connectivity?

    2. What sorts of interactions must these devices allow students to conduct with the educational material? Will it support touch-based hyperlinking, annotations, or some sort of data sharing? What about end-of-chapter quizzing?

Clearly there are several matters that need to be thought through, but here’s a “sketch” for a potential solution.

Read more…

Modernizing Education, Public Brainstorm

Designing the sfRESTClientPlugin: Sketching a Client API for RESTful Interactions

June 1st, 2009

I’ve lately been exploring the value proposition of RESTful APIs to organizations whose technological infrastructures are built upon a collection of legacy software components, customized to communicate with each other by highly tailored middleware software stacks.

That exploration will not unfold in this post, however. It could easily be an entire book unto itself.

Rather, I would like to focus specifically on ideas I’ve had about what a high-level, object-oriented API for interacting with RESTful services might look like, and funnel those thoughts into the design and implementation of a plugin I’m developing for the Symfony framework, called sfRESTClientPlugin.

Audience and Scope

This post assumes at least casual familiarity with Web development. I will explore some general principles of the RESTful interaction paradigm, but only to the extent to which they inform the design direction of the plugin’s API.

Although all the code samples will be in PHP, it is my hope that the exercise will yield material valuable to people working with other software stacks.

Read more…

Public Brainstorm