Recovering Uncarved

After many months of being utterly out of commission (and 8 years without any new posts… ehem), I’m pleased to finally have managed to restore my blog, so that visitors are no longer greeted with a directory listing featuring only a favicon.ico file. 🙄

First, a bit about why the blog vanished: my billing information had changed some time in 2017 and my hosting provider was unable to successfully bill me, so they deactivated my account. Of course, they did try to contact me, but those emails got lost in the sea of my Inbox, so my provider wound up deleting the files and databases associated with my account.

And so vanished Uncarved — for an embarrassingly long time. Sigh.

The good news, however, is that I had already begun a port of the WordPress site (its content, anyway…) to Hexo.

For anyone unfamiliar, Hexo is basically a Node equivalent to Jekyll.[1] Unlike WordPress, which is a PHP app that offers an Admin interface for authoring content that gets stored in a MySQL database and dynamically renders content to the user with each request,[2] Hexo is a CLI tool that generates a static HTML site from a bunch of Markdown and template files.[3]

Thanks to Hexo’s WordPress migrator, I had already done the data migration. The “only” work that remained, preventing me from rendering a site from the migrated content, was making the generated pages look at least something like they were part of prometheas.com.
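For anyone planning a similar move, the migration itself boils down to exporting the WordPress content as XML and feeding it to the migrator plugin; if memory serves, it’s roughly these two commands (the export filename below is just a placeholder):

npm install hexo-migrator-wordpress --save
hexo migrate wordpress wordpress-export.xml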

So I managed to find time this weekend to attack the theme customizing business, et voilà: Uncarved is restored! There remain some rough edges, like those silly blue links in the sidebar, and a couple of shortcodes (like [caption]) for which I have yet to implement a renderer, but the content is back.
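When I do get around to those shortcode renderers, each will probably amount to a small before_post_render filter dropped into Hexo’s scripts/ directory. Here’s a rough sketch of the idea for [caption]; the regular expression and the markup it emits are assumptions about how my migrated posts happen to use the shortcode, not a finished implementation:

// scripts/wp-caption.js
// Sketch: rewrite leftover WordPress [caption] shortcodes into <figure> markup
// before Hexo renders each post.
hexo.extend.filter.register('before_post_render', function (data) {
  data.content = data.content.replace(
    /\[caption[^\]]*\]\s*(<img[^>]+>)\s*([\s\S]*?)\[\/caption\]/g,
    function (match, img, captionText) {
      return '<figure class="wp-caption">' + img +
        '<figcaption>' + captionText.trim() + '</figcaption></figure>';
    }
  );
  return data;
});

Nothing fancy, but it should be enough to keep raw shortcodes from leaking into the rendered posts.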

Welcome back, Uncarved.


  1. It’s difficult to pick whether Hexo’s being JavaScript or its not being Ruby is more attractive to me, but the truth is frankly that it’s all pure Win to me.
  2. Ignoring stuff like page cache plugins and CDNs to keep things simple.
  3. That’s obviously a super-reductive comparison of the two site management solutions, and I intend to write a more considered comparison in some future post, but it’s enough to say I am beyond pleased never to have to worry about upgrading WordPress to avoid getting my site data hacked.

A Secret Agent Trick

I recently discovered a neat little “trick” on my iPad (and iPhone): I’ve stumbled upon a way to listen to music streaming from Internet radio stations while I do “other things,” like check my email, take photos, or write text messages.

While iPhone OS 4.0 — due out this summer — will finally deliver the long-requested ability to allow users to listen to their Pandora or Last.fm radio streams in the “background” by virtue of its new “multi-tasking” capabilities, the solution I’ve stumbled upon works (in slight variations) today with any device running iPhone OS 3.x.

Although this little trick won’t work with Pandora, since you must use a Pandora client to stream their music, you can use it with any radio station that exposes its MP3 or AAC music stream via a multimedia playlist file URL (which will typically end in .pls); basically any radio station you’ll find on Live 365, Soma FM, and more.
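If you’ve never peeked inside one of these, a .pls file is just a tiny INI-style text file that tells the player where the actual audio stream lives; it typically looks something like this (the URL and station name here are made up for illustration):

[playlist]
NumberOfEntries=1
File1=http://streams.example.com/station-128.mp3
Title1=Example Internet Radio Station
Length1=-1
Version=2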

I’m a fan of Soma FM’s Secret Agent radio station, so we’ll use that for our example; feel free to try this out for any station you like.

The process is super easy, but slightly different between the handheld iPhone OS devices (e.g., iPhone and iPod Touch) and iPads (for which it’s actually a bit spiffier), so I’ll take you through the steps for doing it on each one.

iPhone / iPod Touch

Launch Mobile Safari, and head to the following URL:

http://somafm.com/secretagent48.pls

You’ll see the following:


Safari fetches the PLS file URL

Once the playlist file is loaded, Safari will find the URL of the music stream and start playing it, and you’ll see this:


Safari has started playing the audio stream

Now — click the Home button and, say, check in on your email. Note that the music continues to play.

Isn’t that fantastic?

Just one caveat, though: you won’t be able to browse other websites in Safari until you click the “Done” button (top left), which — as you might expect — causes the music to stop playing.

One workaround is to use an alternative browser, like iCab, Opera Mini, or any of a number of other web browsers (some paid, some free) available in the App Store.

iPad

Things get a little cooler on the iPad. The steps to get you listening to the music stream are the same, but we can do a few more things once the music starts playing on the iPad.

Once the music starts to play, you’ll see this:

There is one key difference to note, however: unlike the iPhone’s Mobile Safari app, the iPad’s Mobile Safari continues to show you the browser chrome up top.

For starters, this means that you may continue browsing other websites in Safari on the iPad by simply tapping the tabs icon at the top:

What’s more, you can actually create a bookmark for the radio station, so you can quickly listen any time:

But — and this is where I started to get a little verklempt — it gets just slightly more fantastic: you can bookmark it to your Home Screen.

Looks like the folks at Soma FM went the extra mile to specify a Home Screen icon for their website. Your mileage will vary with the availability of your favorite station’s dedicated icon for your Home Screen, however, depending on the site publisher.
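For any site publishers curious how that works: it takes just one tag in a page’s head, something along these lines (the icon path here is only an example):

<link rel="apple-touch-icon" href="/apple-touch-icon.png" />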

Meanwhile, go forth and enjoy streaming some music while you’re sending those texts or reading the Times.

Dell's Faulty Product Page

As a number of perturbed status updates I’d posted to my Facebook profile in the wee hours of Friday morning suggested to my friends this AM, the health of my Mac Mini, Cylon.local, took a bit of a nose dive last night. Now, it’s probably just a hard drive failure, which is actually not so bad,[1] but I won’t know for sure until I take the little fella down to Tekserve’s “ER” this weekend and get it properly diagnosed.

So one of the thoughts that naturally occurred to me is that there’s at least some small chance that Cylon.local won’t be coming back; perhaps the resurrection ship was simply too far away when the dreadful moment arrived.

I’d just bought a Mac Mini for my parents this past Christmas, so I already know the value proposition of replacing it with the latest model.

But, while I’m entertaining the notion of replacement hardware, it occurs to me that Dell rolled out a competitor a few months ago, called the Inspiron Zino HD. Now don’t get me wrong: I’m quite happy with the Mini’s performance over the last four years, and I’d be happy to keep it for as long as it’ll stick around with me, but any sensible man would think to check in on his options.

Read More

On the Forum on Modernizing Government

Here’s a YouTube playlist of videos published by The White House, which includes the complete forum sessions that followed the President’s opening remarks.[1]

The forum is a series of round table panel discussions, featuring executives from the private sector (CEOs, CTOs, etc), as well as government officials brainstorming, sharing their experiences, and making recommendations.

One of my favorite parts comes at 0:56:25:

If the inefficiency isn’t captured and measured, and staring you in the face, it isn’t gonna be tackled as a project in the first place… If the government takes on a culture of streamlining, and attacking inefficiency, and looking for resource maximization, you’re going to start looking introspectively and measuring things that will — for the first time — put a line of sight on the inefficiency.

Watching all of these isn’t exactly an edge-of-your-seat thrill ride, to be sure, but think about how many times you or I have even had the opportunity to be a “fly on the wall” during official government discourse. The very idea that this forum was live-streamed and published for free public access is a fantastic example of the sort of operational practice I deeply hope to see continue to grow, particularly after the present Administration has completed its term in Washington.


  1. Start from video 2 in the playlist if you’ve already seen the President’s remarks, and just want to skip to the round table discussion.

The Twelve Year Road

In January of 1998, Netscape — in a last-ditch effort to retaliate against Microsoft’s domination of the browser market with its Internet Explorer browser — took to the strategy of open sourcing the code for their flagship product, Netscape Navigator. And so the Mozilla Project was born, which has since brought the world the Firefox web browser and the Thunderbird email client (as well as a handful of other things).

And only now, at the end of December 2009, has Firefox 3.5 — the latest release of the software open sourced twelve years ago — at long last edged out ahead of any single version of rival Internet Explorer.

Source: StatCounter Global Stats - Browser Version Market Share

It’s been a long road, Mozilla; congratulations on this hard-earned milestone.

White House Announces Open Government Plan

A post from earlier today on the White House blog by Peter Orszag, director of the Office of Management and Budget, announced the release of two new documents related to the Administration’s “open government” initiatives:

  • The Open Government Directive (download as pdf, txt, doc or view on Slideshare)
  • The Open Government Progress Report to the American People (download as pdf or view on Slideshare)

The post also includes a video of the live online chat in which federal CIO Vivek Kundra and federal CTO Aneesh Chopra announced the Open Government Plan and fielded some questions in real time from Facebook and Twitter.

Speaking of the value proposition of the initiative, Chopra explains:

So it’s having the conversation with each of our leaders to find out what are the big objectives that they wish to tackle on behalf of the President’s agenda, and in support of the American people. And how can the principles of Open Government, and in particular the datasets, allow others in the ecosystem to support — and advance on — those activities. We just can’t afford to have a federal government solution for every issue. By relying on the ingenuity of the American people we can advance these policy priorities in new and more creative ways.

I also particularly appreciate their speaking to an attempt to raise the quality of published data, particularly after it was discovered that some folks had provided shoddy data to Recovery.gov earlier this year.

Using Inspiration to Aim Education Towards Innovation

On 23 November 2009, President Barack Obama announced the new Educate to Innovate program (full transcript). The program is an initiative to stimulate America’s students to develop skills and consider careers in science, engineering, technology, and innovation.

What’s exciting about this program is that it aims beyond merely demanding improvements in public test scores for math and science from school districts. Unlike the No Child Left Behind Act, which — in a nutshell — is legislation targeted at making schools show improved standardized testing scores, the Educate to Innovate program instead aims directly at inspiring students to learn.

The program also ties in participation and investment commitments from the nation’s businesses, in an attempt to provide initiatives beyond the boundaries of the classroom:

Time Warner Cable is joining with the Coalition for Science After School and FIRST Robotics… to connect one million students with fun after-school activities, like robotics competitions. The MacArthur Foundation and industry leaders like Sony are launching a nationwide challenge to design compelling, freely available, science-related video games. And organizations representing teachers, scientists, mathematicians, and engineers – joined by volunteers in the community – are participating in a grassroots effort called “National Lab Day” to reach 10 million young people with hands-on learning.

Students will launch rockets, construct miniature windmills, and get their hands dirty. They’ll have the chance to build and create – and maybe destroy just a little bit … to see the promise of being the makers of things, and not just the consumers of things.

And the program doesn’t rely solely on the contributions of corporations; it also seeks to leverage the participation of teachers, science and technology professionals, and volunteers.

Of course, the players upon whose participation the program is counting are only part of the story. What’s additionally refreshing is the breadth of the approaches proposed to achieve the program’s goals. Academic competitions and after-school programs are fairly classic, but I’m rather pleased to see a proposal to create video games designed to catalyze the development of scientific skills — it speaks to an understanding of the communication culture of America’s youth. America’s young people aren’t engaged by slide shows and documentaries. They demand interactivity.

But interactivity isn’t all young people need. They also need role models. So the President also announced a new annual science fair at the White House, saying [emphasis mine]:

If you win the NCAA championship, you come to the White House. Well, if you’re a young person and you’ve produced the best experiment or design, the best hardware or software, you ought to be recognized for that achievement, too. Scientists and engineers ought to stand side by side with athletes and entertainers as role models, and here at the White House we’re going to lead by example. We’re going to show young people how cool science can be.

And finally, I was pleased to hear that part of the initiative’s core goals is to attempt to broaden the appeal of science, math, and technology to populations that aren’t traditionally the most likely to pursue such studies:

Through these efforts … we’re going to expand opportunities for all our young people – including women and minorities who too often have been underrepresented in scientific and technological fields, but who are no less capable of succeeding in math and science and pursuing careers that will help improve our lives and grow our economy.

Here’s a video of the President’s full speech (originally posted on the White House blog), which discusses additional parts of the initiative and offers several logistical details:

Additionally, here’s a video in which Education Secretary Arne Duncan and Office of Science and Technology Policy Director John P. Holdren answer questions about the “Educate to Innovate” initiative:

All in all, the initiative clearly has extremely ambitious goals.

And there is certainly a slew of improvements our educational system needs that this initiative simply doesn’t address, for while making a generation of critical-thinking, innovative, and technically savvy Americans is a worthy goal for several reasons, the education system must also take care to prepare us for “everything else” in life, like health and nutrition, personal finance, and social and civic participation, just to name a few.

Even so, I’m terrifically heartened at the innovation and sensibility that’s demonstrably been applied towards defining the initiative’s fundamental methods. It speaks to an understanding and harnessing of lessons learned in recent years about the power of social participation to drive individual accomplishment.

Climategate: a Case Study in How Not to Conduct Research

Sometimes events arrive with a timing that is both serendipitous and uncanny. Only days after my last post, wherein I state a case for the growing importance of referencing the datasets and algorithms used in the distillation of research conclusions, comes a story about leaked correspondence records (email messages) amongst climate researchers working in affiliation with the East Anglia Climate Research Unit, or CRU.

From the NYT article:

The e-mail messages, attributed to prominent American and British climate researchers, include discussions of scientific data and whether it should be released, exchanges about how best to combat the arguments of skeptics…. Drafts of scientific papers … were also among the hacked data, some of which dates back 13 years.

To say the least, the leaked materials contain some juicy fodder for skeptics of human-driven climate change.

Amongst these leaked emails, for example, are conversations which document various difficulties some of the CRU’s climate researchers have encountered over the years in trying to work with the data collected and managed by the organization. The Times article focuses on a discussion thread in which researcher Phil Jones mentions using a “trick” — originally employed by another colleague, Michael Mann — to “hide [a] decline” in temperatures apparently shown in some set of data.

In an interview about the leaked emails, Dr. Mann attempts to defuse the statement as a poor choice of words. Unfortunately, whether he’s being sincere or not, his is frankly a response that’s to be expected.

The article continues:

Some skeptics asserted Friday that the correspondence revealed an effort to withhold scientific information. “This is not a smoking gun; this is a mushroom cloud,” said Patrick J. Michaels, a climatologist who has long faulted evidence pointing to human-driven warming and is criticized in the documents.

This is also a statement that you’d expect from a climatologist building a career on a body of work disagreeing with the idea of human-driven warming. These emails are naturally material that skeptics of the human-driven climate change argument will latch onto (and, frankly, they certainly should; it’s just how scientific work is tested — through dispute).

The next several days see a flurry of activity throughout the media and the blogosphere.

Before long, the name “Climategate” (kitschy but concise) gets attached to the discussions about the leaked materials. And since there’s a bit of both data and program source code in the mix, techies from around the world immediately jump into the fray.

One of the files from the leak discussed most heavily in techie circles is called HARRY_READ_ME.txt (copies are available both in the original format and in a more structured edition). The story that unfolds in this file reveals the plight of a programmer named Harry, who struggled for three years attempting to reproduce some research results from a collection of data and the source code for an algorithm created to calculate research conclusions.

Sadly, this man’s three-year effort to reproduce the published results with the given material never succeeded. Here’s an excerpt from the file, for a glimpse at this poor fella’s mounting frustrations along the way:

getting seriously fed up with the state of the Australian data. so many new stations have been introduced, so many false references.. so many changes that aren’t documented. Every time a cloud forms I’m presented with a bewildering selection of similar-sounding sites, some with references, some with WMO codes, and some with both. And if I look up the station metadata with one of the local references, chances are the WMO code will be wrong (another station will have it) and the lat/lon will be wrong too. I’ve been at it for well over an hour, and I’ve reached the 294th station in the tmin database. Out of over 14,000. Now even accepting that it will get easier (as clouds can only be formed of what’s ahead of you), it is still very daunting. I go on leave for 10 days after tomorrow, and if I leave it running it isn’t likely to be there when I return! As to whether my ‘action dump’ will work (to save repetition).. who knows?

Yay! Two-and-a-half hours into the exercise and I’m in Argentina!

Pfft.. and back to Australia almost immediately :-( .. and then Chile. Getting there.

Unfortunately, after around 160 minutes of uninterrupted decision making, my screen has started to black out for half a second at a time. More video cable problems - but why now?!! The count is up to 1007 though.

I am very sorry to report that the rest of the databases seem to be in nearly as poor a state as Australia was. There are hundreds if not thousands of pairs of dummy stations, one with no WMO and one with, usually overlapping and with the same station name and very similar coordinates. I know it could be old and new stations, but why such large overlaps if that’s the case? Aarrggghhh!
There truly is no end in sight.

Assuming the original conclusions he was attempting to reproduce were all based on this data (and there’s frankly no reason to assume otherwise), it’s impossible to invest much confidence in their validity.

Charlie Martin (quoted at length below) points out that the data and algorithms with which Harry was working were “inherited” from a previous researcher (or researchers), and came in a poorly organized bundle with poor documentation. And what’s worse, he didn’t have access to anyone who had originally derived the conclusions he was tasked to reproduce.[1]

The real egg on the face in this anecdote is the fact that the CRU has clearly done an atrocious job of properly archiving its data and documenting the work its researchers produce. Naturally, this level of disorganization is a serious problem anywhere it may occur, but it’s a particularly glaring issue in the field of scientific research, where the validity of research results rests squarely upon the ability of independent third parties to reliably reproduce those results on their own. Yet here we find that the CRU has either managed its data so poorly as to prevent its own scientists from reproducing the organization’s own published results (in which case “embarrassing” doesn’t even begin to describe the situation), or manipulated the data and produced false results. Either story tells a horrible tale about the CRU.

Charlie Martin, in a post to the Pajamas Media blog, writes:

I think there’s a good reason the CRU didn’t want to give their data to people trying to replicate their work.

It’s in such a mess that they can’t replicate their own results.

This is not, sadly, all that unusual. Simply put, scientists aren’t software engineers. They don’t keep their code in nice packages and they tend to use whatever language they’re comfortable with. Even if they were taught to keep good research notes in the past, it’s not unusual for things to get sloppy later. But put this in the context of what else we know from the CRU data dump:

1. They didn’t want to release their data or code, and they particularly weren’t interested in releasing any intermediate steps that would help someone else

2. They clearly have some history of massaging the data… to get it to fit their other results….

3. They had successfully managed to restrict peer review to … the small group of true believers they knew could be trusted to say the right things.

As a result, it looks like they found themselves trapped. They had the big research organizations, the big grants — and when they found themselves challenged, they discovered they’d built their conclusions on fine beach sand.

I won’t belabor the discussion of the implications these leaked documents offer; there is no shortage of people writing about exactly that. In case you’re interested in some of the more detailed coverage of the tech community’s review of the leaked data and algorithms, I would point you to the following pieces:

There’s also some great ongoing coverage at Devil’s Kitchen.

Regardless of whether there’s any merit to any of the CRU’s climate research, however, this little drama leaves me unable to resist repeating an argument from my last post:

But with all these arguments and assertions about corollaries, trends, and predictions that this number-crunching activity will generate, it will become increasingly crucial to have a mechanism by which the results claimed to have been derived from the number-crunching can be accounted for.

It must … become incumbent upon anybody publishing findings derived from mining such data to share both the sources and processes used to derive their results or conclusions. In cases of claims rooted in the fruits of data mining endeavors, it is specifically important that results indicate:

1. exactly which data sets they draw from, and

2. precisely which algorithm(s) processed the data in question.

At this point, the specific implications this debacle has for the CRU’s research are irrelevant. For, whether by deceit or incompetence, this leaked data has left their published research on climate change completely unreliable.

Yet developing a confident clarity around the subject of their research remains of critical importance, for climate change is a real challenge that humankind must cope with. Regardless of whether human industrial activity is a driving factor for climate change, the fact is that the ice at our poles is melting at an accelerating rate. Decades’ worth of satellite photos and other survey data sufficiently demonstrate this fact. We similarly have data collected over the last several decades by the world’s meteorologists showing that global mean temperatures are rising, and that extreme weather (from droughts and famines to floods and more) is increasing around the world.

The climate debate isn’t over whether these events are occurring, but instead whether human industrial activity accounts for a relevant piece of it.

Governments around the planet will be forced to take some sort of action to deal with the prospective repercussions of these changes (e.g., rising sea levels, expansion of the Sahara, and the rest). The consideration at stake, therefore, is how each country will individually and collectively direct their efforts and invest their resources in dealing with it.

If human industrial activity has bearing on the matter, we’ll have to make some serious policy changes and invest heavily in developing alternative methods of production, lest we imperil our own (and other) species. But if, on the other hand, our industrial activity is not a determining factor in climate change, our efforts are best spent trying to figure out how we’re going to deal with the realities of a changing climate that we cannot mitigate simply by being more responsible with our emissions.

In any case, everyone needs to make informed decisions about where they’re investing their money and efforts.

And so a number of the world’s governmental and industrial leaders (including US President Barack Obama) are scheduled to meet — along with members of the climate research community — at the United Nations Climate Change Conference in Copenhagen this December in an attempt to work out policy directions to deal with climate change. I’m hoping the event will focus on methods to improve and reinforce confidence in the remainder of the climate research work being conducted around the world, and that it won’t turn into a political food fight.

Fingers crossed.

I am left hoping that some real good can rise from this mess. And so I call on climate change researchers and institutions around the world to take this opportunity to develop the practice of providing full disclosure on the sources of their data sets and the functionality of their algorithms. There will likely be many political, legal, and logistical obstacles to address and overcome in this effort, but the stakes of failing to do so are simply too high.


  1. I personally have plenty of experience attempting to work with poorly documented code and data inherited from some previous person’s work, and can directly attest to the maddening uphill battle of that situation.

Fortifying Confidence by Stealing From Academics. And Scientists.

Driven in large part by open government efforts initiated by the Obama Administration, and particularly Federal CIO Vivek Kundra, tremendous and rich data sets have become available from the federal government, as well as from some state and local governments. This data is published digitally, in organized, well-known, and documented formats.[1]

And because these government-amassed data sets have already been paid for by taxpayer dollars, they have rightfully been placed in the public domain and made accessible free of charge.

Tim O’Reilly — founder and CEO of O’Reilly Media, as well as organizer and host of various technology conferences, including the Government 2.0 Summit — describes the thinking behind this policy decision in an article he wrote for Forbes:

Rather than licensing government data to a few select “value added” providers, who then license the data downstream, the federal government (and many state and local governments) are beginning to provide an open platform that enables anyone with a good idea to build innovative services that connect government to citizens, give citizens visibility into the actions of government and even allow citizens to participate directly in policy-making.

The primary distribution point for the federal government’s data is the data.gov website (about which I’d earlier written). In another article he’d guest-authored for TechCrunch, Mr. O’Reilly talks about this website, writing:

Behind [the] site is the idea that government agencies shouldn’t just provide web sites, they should provide web services. These services, in effect, become the government’s SDK (software development kit). The government may build some applications using these APIs, but there’s an opportunity for private citizens and innovative companies to build new, unexpected applications. This is the phenomenon that Jonathan Zittrain refers to as “generativity”, the ability of open-ended platforms to create new possibilities not envisioned by their creators.

The range of potential applications for these data is difficult to exaggerate (or, frankly, even to imagine). A thorough exploration of these possibilities is beyond the scope of this post, but this showcase of apps built with data made available by the city of San Francisco gives a small peek at the broad range of uses for this government data.

In browsing that app showcase, I would note that none of the apps found there were written by the government. That’s zero. Rather, each was developed by a third party.

I would also note that most of those apps combine multiple data sets, many of which also include non-governmental data.[2]

Clearly all this is just the beginning.

Opportunities and Challenges to Come

The data sets will grow broader as the federal government continues to expand its data offerings and as more state and local governments follow suit, just as Utah, San Francisco, and even my hometown of New York City have already done.

As the data sets become richer throughout this process, mining the information on offer will provide opportunities to develop insights about matters ranging from public health to environmental developments and energy consumption, and from regional commercial performance to educational development.

And once there’s some historical depth to these records — through a combination of digitally publishing data sets from earlier years, as well as continuing to release emergent data — we will eventually even start to see the emergence of various types of projection models developed for many of the issues mentioned above, from economic development forecasts to predictions for the spread of disease outbreak.

These data sets stand to revolutionize both entrepreneurial endeavors and academic research projects.

And with the grant allocations for research en route from provisions of the American Recovery and Reinvestment Act of 2009, we’re likely to see a staggering number of new projects rise from both academia and the business world.

But with all these arguments and assertions about corollaries, trends, and predictions that this number-crunching activity will generate, it will become increasingly crucial to have a mechanism by which the results claimed to have been derived from the number-crunching can be accounted for.

It’s not difficult to imagine, after all, the proliferation of claims that will begin to emerge, anchoring their proposed value on these mountains of data.[3] Luckily, after decades of subjection to some of the most talented number-spinning tactics that statisticians teamed up with PR specialists have thrown about, many people have developed a thick skin (and perhaps even a default suspicion) against allowing “the numbers” to speak to very much.

And rightfully so; “the numbers” can build nearly any narrative a story teller wishes to weave, depending on how they’re sliced, diced, and manicured.

Numbers may not be able to lie, but men sure can.

Luckily, we can find some time-tested solutions for guarding against falsification and/or incompetence by looking to the techniques applied in works of scholarship and the practices of scientific peer review: scholars must meticulously cite their sources in bibliographies attached to their work, and scientists must accompany any publication of their results with a detailed description of their methods.

It must similarly become incumbent upon anybody publishing findings derived from mining such data to share both the sources and processes used to derive their results or conclusions. In cases of claims rooted in the fruits of data mining endeavors, it is specifically important that results indicate:

1. exactly which data sets they draw from, and

2. precisely which algorithm(s) processed the data in question.

The trouble, however, is that there is neither a comprehensive repository nor a system for unique canonical identifiers to publicly and universally identify such data sets and algorithms. Their absence makes any attempts to reproduce such results very challenging, at best.

Fortifying Confidence in the Results

Books, by contrast, have an ISBN. Books also have a governmental repository, called the Library of Congress.

So I propose that similar mechanisms must be worked out for data sets and algorithms. Perhaps serving as this repository becomes a natural evolution of the Library of Congress’s own charter. This repository would be a web service that exposes each individual data set and each data-mining algorithm’s source code package under permalinks that incorporate their respective canonical identifiers.

Potential examples of such permalinks may look something like this:

http://www.loc.gov/datasets/0123457/us-census-2010
http://www.loc.gov/algorithms/76543210/higgs-boson-modeler
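To make the idea a bit more concrete, a published finding could then carry a small, machine-readable provenance record pointing back at those permalinks. Here’s a purely hypothetical sketch; every field and identifier is made up for illustration:

{
  "claim": "An example published finding",
  "derived_from": {
    "datasets": ["http://www.loc.gov/datasets/0123457/us-census-2010"],
    "algorithms": ["http://www.loc.gov/algorithms/76543210/higgs-boson-modeler"]
  }
}

Anyone wishing to verify the claim would then know precisely which resources to pull from the repository and re-run.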

Naturally, there are cases in which it may prove imperative to restrict access to certain resources stored in this repository.

While I’ve focused so far on publicly available data sets, I would note that a number of valuable projects will inevitably leverage data sets whose rights are privately owned, and to which access must be controlled by obtaining some sort of permission from the owner.

The same concern is naturally prone to surface with some regularity for algorithms, as well.

This is a consideration that will require some real thought, but I’ll leave that to a future exploration. For the time being, I’ll simply note that the HTTP protocol does provide mechanisms for access restriction (particularly the 401, 402, and 403 status codes), leaving only the policies governing when those mechanisms are applied to be worked out.

Although there’s loads to work out about how such a repository can be actualized, its availability will become crucial in the coming years.

I simply hope that both the practice of sharing data sources and methods — as well as a suitable canonical repository for them — materialize sooner rather than later, since only a few silly and reckless abuses of this data could undermine public confidence in efforts to fully harness its potential value.


  1. Importantly, the manner in which all this data is distributed makes it ideal packaging for use as input to data-crunching algorithms developed by anyone interested in doing so.
  2. This practice of combining data sets from different sources to create a new, value-added data set is referred to as creating a mashup.
  3. It will also certainly be leveraged to evaluate the government’s performance, both by the current administration and — perhaps more compellingly — by its political opponents.

Don't Ask Me for My Email Address

These days, anyone organizing competent promotional efforts (for events, organizations, themselves, etc.) invests some degree of attention in online outreach. One reason for this is economics: efforts to “spread the word” online have the potential to reach more people while expending fewer resources and, therefore, less money.

One of the most commonly-leveraged contact points has become the email inbox.

Nearly everyone has an email address, and many of us have several — one for work, one personal. I presently have four, for example.

Generally speaking, people have largely become very comfortable communicating over email. It doesn’t carry the “burden” of requiring an immediate response, unlike a phone call, and can be whatever length the author thinks is appropriate for the correspondence.

It’s also easy to share information around the conversation in emails, by including a URL that points to further information on some website, or by attaching photos or other small files. This capability allows promoters to keep their message concise (if they’re clever), and yet provide leads to supplemental information for those with interest in pursuing the deeper details of the message.

Finally, it allows the author to write up a single message that can be delivered to a (theoretically) limitless number of people.

For all these reasons, one of the most common techniques that promoters adopt is the email campaign. They focus efforts on accumulating email addresses of people that could potentially be interested in their product, services, performances, or whatever it is they’re on a mission to promote.

Some years ago, I would share my email address with people and organizations whose news I’d have interest in following: bands, artists, pro-social organizations, and more.

But after a while, I noticed my inbox just blowing up.

The more I gave my email address out, the more emails I’d have to deal with every day.

I’m not really interested in anyone’s ideas on how I can be making millions from home, offers for debt reduction, or substances that promise me the ability to drive nails through wooden boards with my penis (promise me the same for granite, however, and maybe we’ll talk).

Read More