Facebook, Bye

Back on the evening of 30 September, not an hour after making a mundane Facebook post about the episode of Luke Cage I had been watching on Netflix, I announced that I’d decided to abandon my Facebook account. I wrote:

After one too many “WTF Facebook” moments, I’m shutting my account down soon.

I failed to specify when “soon” would be, but as I wrote the words, I had imagined giving it about a week, simply to allow folks to download photos, etc. Three hours from the moment I’m writing these words, the full week will have passed, but I’ve since realized a few things that I’ll need to stick around for–I’ve got some “Pages” to hand over, accounts to untie from Facebook (Spotify and a raft of others), and I’d prefer to keep Messenger active through a bit of upcoming travel in November. So I’m keeping the account until December 2018.

I also failed to specify what I’d meant by “shutting down”. By this I mean to say that I’m full-on deleting the account, including all posts, photos, etc.

In the meantime, however, as I wrap up 10 years’ worth of Facebook footprints, I have decided to try to document what might be interesting material for other folks who may, one day, decide to do something similar themselves. The plan is to write posts on my blog (where you’re reading this), and to then share these posts to both my Facebook and Micro.blog accounts.

Those future posts will focus on the What and How. In the remainder of this post, I’ll speak to the Why, and the Why So Long.

The Why

The Why is easy: privacy and security. Facebook has a lot of data about each of its members–way more data than the vast majority of people likely know. Moreover, Facebook also has way more data about most users’ friends than either the users or their friends know.

In fact, Facebook is so good at building a sense of who its users know that even people who open “fake” accounts, using no personal information, often find that Facebook begins to recommend people they actually know as likely friends.

Imagine how rich the information is about users who actively give Facebook information about themselves.

The Why So Long

Fair warning: this is the bit that gets particularly navel-gazing and retrospective. At the time of this writing, I’ve been on Facebook for 10 years and a few months. It’s fair to say that I’ve been quite an active poster for most of that time. And many experiences and interactions that have enriched my life have only been possible because Facebook existed. Not least of these is the extent to which I’ve been able to keep in touch with my family in Greece, the family-like friends I made in Rome in my mid-20s, and an array of friends scattered across the US and the rest of the world.

And, beyond “staying in touch”, Facebook has made possible some truly great and serendipitous shit. Last December (2017), for example, I posted about my arrival at Fiumicino airport. I do this largely to let family know that I’ve arrived somewhere safely, without texting a bunch of people the same message. But it had just so happened that one of my cousins was secretly taking a vacation with her manfriend to Rome–and she had arrived the very same day. She was laying low on social media because her employer wouldn’t give her leave, so she had to phone in sick, but her father (my uncle) caught my post and relayed the information to her, so we were able to meet for a coffee in Campo de’ Fiori! That simply would never–nor, I can’t imagine, ever would–have happened without Facebook.

I’ve also evolved a lot of my thinking about what it means to be something that’s been completely invisible to me for most of my life: a white, middle-class American male. And, more importantly, have had access to glimpses of what it’s like not to be one, from a variety of smart and thoughtful friends. I am humbled to learn what they’ve been willing to share, and am very grateful for the access Facebook has given me to their writing and thinking.

Facebook has also taught me interesting new things I never knew about people I’d known for ages; introduced me to sometimes-great recommendations for books, articles, movies, and more; and even facilitated the organized response of a trans-generational group of people from my old high school, who rallied to support one of the greatest teachers in all of our lives when the current administration was–by all appearances–simply trying to flush him out… it was like Dead Poets Society, except justice prevailed!

So I’m not merely saying Facebook was responsible for some “OK” stuff; I’m very much saying Facebook is responsible for some downright awesome shit.

Which is why making the decision to leave took so damned long.

But leaving Facebook is an idea I’d started entertaining last summer (2017), when my mobile phone was stolen during an otherwise lovely lunch date near Syntagma Square, in Athens. I was in the middle of my summer vacation, and I’ll tell you outright that the phone theft was a colossal pain in the ass–I almost certainly don’t need to tell any 21st century adult that I had been relying on my phone to get around town, and to track my travel and lodging plans. 1

Luckily, I was in a place I’ve visited almost every year since 2000, and I was very familiar with how things generally work. 2

But without a phone, I was not contactable (outside of my hotel room), had no access to GPS directions, and I didn’t have a camera… how would I ever remember what the food I ate looked like, without a flippin’ camera????

I was also unable to access the mobile internet, which meant no Facebook.

So for a couple of weeks, it was just me, my immediate environment, and my Kindle reader, passing each day on the Naxos beach. None of the bullshit Trump news reached me. I wasn’t stumbling upon random bigoted or racist comments from strangers or–depressingly–people I actually knew.

And you know what? It was nice… it was really nice.

Over time, though, I returned–to the US and then to Facebook. And it was largely uneventful for a while. Sure, it was marginally creepy how Facebook would keep showing me ads for shit I had looked at on other sites, but I turned a blind eye. Besides, I was tech lead for a team that looked after my employer’s “news” product at the time, so there were even a few minor work-related reasons to use the platform (not least of which was a private group the team used to share silly photos and plan karaoke outings).

But then Cambridge Analytica. And then the so-called “shadow contacts” thing. And not even a week later, the “View As” security exploit.

When a close friend responded to the “View As” vulnerability news by announcing he was shutting down his Facebook account “in a week”, it reminded me of that period I was completely off Facebook.

Then I asked myself: what exactly would need to happen to really push me over the edge to just leave? What shittier thing could they do with their lack of respect for privacy? They’ve already got way more information about who I know, where I’ve been, and what I like than just about anybody, and they keep slurping up more data about me–both directly from me, and from my friends. Or what sort of security breach would need to happen? How much of my information would have to be stolen in that breach to push me over the edge?

Finally, last Sunday, it occurred to me: if I wait until something significantly “bad enough” does happen, “after the fact” will simply be “too late” to do anything at all about it.

Over the course of the last 10 years, Facebook has shown–time and again–that it is hell-bent on gathering as much information about its users’ online and real-world interactions as it possibly can, whether directly reported by those users, or not. In fact, the recent “Shadow Contacts” story taught the world that Facebook doesn’t even stop at slurping up data about its users, but also seeks to retain as much data as it can about all their contacts–whether those contacts are on Facebook or not.

If I had even a small (though sensible) inkling that they were taking user privacy in any way seriously, I might have stuck it out, given all the value I do acknowledge the service has offered these years. Instead, they have repeatedly demonstrated the opposite.

Facebook, Bye.

  1. I use TripIt to keep track of my travel itineraries, from flights to ferries to lodging, so I had access to all my booking information.
  2. And, full disclosure: I had luckily also brought my iPad, so I was able to access all my contacts and email, but that required finding working WiFi.

Recovering Uncarved

After many months of being utterly out of commission (and 8 years without any new posts… ahem), I’m pleased to finally have managed to restore my blog, so that visitors are no longer greeted with a directory listing featuring only a favicon.ico file. 🙄

First, a bit about why the blog vanished: my billing information had changed some time in 2017 and my hosting provider was unable to successfully bill me, so they deactivated my account. Of course, they did try to contact me, but those emails got lost in the sea of my Inbox, so my provider wound up deleting the files and databases associated with my account.

And so vanished Uncarved—for an embarrassingly long time. Sigh.

The good news, however, is that I had already begun a port of the WordPress site (its content, anyway…) to Hexo.

For anyone unfamiliar, Hexo is basically a Node equivalent to Jekyll.1 Unlike WordPress, which is a PHP app that offers an Admin interface to author content that gets stored in a MySQL database and dynamically renders content to the user with each request,2 Hexo is a CLI tool that generates a static HTML site from a bunch of Markdown and template files.3
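The difference is easy to sketch in a few lines: a static generator is, at its core, a loop that runs each source document through a template once, at build time, producing plain HTML that any web server can serve without a database. The snippet below is a toy illustration of that idea only, not Hexo’s actual code:

```javascript
// A toy sketch of what a static site generator does: every source
// document is run through a template exactly once, at build time.
// (Illustrative only; Hexo's real pipeline reads Markdown files,
// applies theme templates, and writes the results to a public/ folder.)
const posts = [
  { title: 'Hello', body: 'First post.' },
  { title: 'Again', body: 'Second post.' },
];

const template = post =>
  `<article><h1>${post.title}</h1><p>${post.body}</p></article>`;

// "Generating the site" is just mapping sources through the template.
const pages = posts.map(template);
console.log(pages[0]);
// → <article><h1>Hello</h1><p>First post.</p></article>
```

Everything after that is static: no PHP process, no MySQL query, nothing to exploit on each request.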

Thanks to Hexo’s WordPress migrator, I had already done the data migration. The “only” remaining work, which had prevented me from rendering a site from the migrated content, was making the generated pages look at least something like they were part of prometheas.com.

So I managed to find time this weekend to attack the theme customizing business, et voilà: Uncarved is restored! There remain some rough edges, like those silly blue links in the sidebar, and a couple of shortcodes (like [caption]) for which I have yet to implement renderers, but the content is back.
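For the curious, here’s roughly what such a renderer needs to do: find the WordPress-style [caption]…[/caption] blocks left behind in the migrated Markdown and rewrite them as plain HTML. This is a standalone sketch of that transformation only; a real Hexo implementation would register a tag plugin rather than run a regex over raw files:

```javascript
// A toy sketch of a [caption] shortcode renderer: rewrite the
// WordPress-style [caption]...[/caption] blocks that survive migration
// into plain <figure>/<figcaption> HTML. Illustrative only.
function renderCaptions(markdown) {
  const pattern = /\[caption[^\]]*\]([\s\S]*?)\[\/caption\]/g;
  return markdown.replace(pattern, (match, body) => {
    // WordPress puts the <img> tag first, followed by the caption text.
    const imgMatch = body.match(/<img[^>]*>/);
    const img = imgMatch ? imgMatch[0] : '';
    const caption = body.replace(/<img[^>]*>/, '').trim();
    return `<figure>${img}<figcaption>${caption}</figcaption></figure>`;
  });
}

const input = '[caption id="a1"]<img src="x.jpg"> A photo[/caption]';
console.log(renderCaptions(input));
// → <figure><img src="x.jpg"><figcaption>A photo</figcaption></figure>
```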

Welcome back, Uncarved.

  1. It’s difficult to pick whether Hexo’s being JavaScript or its not being Ruby is more attractive to me, but the truth is frankly that it’s all pure Win to me.
  2. Ignoring stuff like page cache plugins and CDNs to keep things simple.
  3. That’s obviously a super-reductive comparison of the two site management solutions, and I intend to write a more considered comparison in some future post, but it’s enough to say I am beyond pleased never to have to worry about upgrading WordPress to avoid getting my site data hacked.

A Secret Agent Trick

I recently discovered a neat little “trick” on my iPad (and iPhone): I’ve stumbled upon a way to listen to music streaming from Internet radio stations while I do “other things,” like check my email, take photos, or write text messages.

While iPhone OS 4.0 — due out this summer — will finally deliver the long-requested ability to allow users to listen to their Pandora or Last.fm radio streams in the “background” by virtue of its new “multi-tasking” capabilities, the solution I’ve stumbled upon works (in slight variations) today with any device running iPhone OS 3.x.

Although this little trick won’t work with Pandora, since you must be using a Pandora client to stream their music, you can use it with any radio station which exposes its MP3 or AAC music stream via a multimedia playlist file URL (which will typically end in .pls); basically any radio station you’ll find on Live 365, Soma FM, and more.
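A .pls file, by the way, is nothing mysterious: it’s a small INI-style text file whose File1, File2, … entries hold the actual audio stream URLs that Safari ends up playing. Here’s a toy parser to illustrate (the station entries below are invented for the example; they aren’t real SomaFM URLs):

```javascript
// A .pls playlist is an INI-style text file; the audio stream URLs
// live in its File1, File2, ... entries. This toy parser extracts them.
function extractStreamUrls(plsText) {
  return plsText
    .split(/\r?\n/)
    .filter(line => /^File\d+=/i.test(line))
    .map(line => line.slice(line.indexOf('=') + 1).trim());
}

// Hypothetical example of what a station's .pls file might contain:
const pls = [
  '[playlist]',
  'NumberOfEntries=2',
  'File1=http://example.com/secretagent1.mp3',
  'Title1=Secret Agent',
  'File2=http://example.com/secretagent2.mp3',
  'Length1=-1',
].join('\n');

console.log(extractStreamUrls(pls)); // logs both stream URLs
```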

I’m a fan of Soma FM’s Secret Agent radio station, so we’ll use that for our example; feel free to try this out for any station you like.

The process is super easy, but slightly different between the handheld iPhone OS devices (eg, iPhone and iPod Touch) and iPads (for which it’s actually a bit spiffier), so I’ll take you through the steps for doing it on each one.

iPhone / iPod Touch

Launch Mobile Safari, and head to the following URL:


You’ll see the following:

Safari fetches the PLS file URL

Once the playlist file is loaded, Safari will find the URL of the music stream and start playing it, and you’ll see this:

Safari has started playing the audio stream

Now — click the Home button and, say, check in on your email. Note that the music continues to play.

Isn’t that fantastic?

Just one caveat, though: you won’t be able to browse other websites in Safari until you click the “Done” button (top left), which — as you might expect — causes the music to stop playing.

One workaround is to use an alternative browser, like iCab, Opera Mini, or any of a number of other web browsers (some paid, some free) available in the App Store.


iPad

Things get a little cooler on the iPad. The steps to get you listening to the music stream are the same, but we can do a few more things once the music starts playing on the iPad.

Once the music starts to play, you’ll see this:

Note one key difference, however: unlike the iPhone’s Mobile Safari app, the iPad’s Mobile Safari continues to show you the browser chrome up top.

For starters, this means that you may continue browsing other websites in Safari on the iPad by simply tapping the tabs icon at the top:

What’s more, you can actually create a bookmark for the radio station, so you can quickly listen any time:

But — and this is where I started to get a little verklempt — it gets just slightly more fantastic: you can bookmark it to your Home Screen.

Looks like the folks at Soma FM went the extra mile to specify a Home Screen icon for their website. Your mileage will vary, however, depending on whether your favorite station’s publisher has provided a dedicated Home Screen icon.

Meanwhile, go forth and enjoy streaming some music while you’re sending those texts or reading the Times.

Dell's Faulty Product Page

As a number of perturbed status updates I’d posted to my Facebook profile in the wee hours of Friday morning suggested to my friends this AM, the health of my Mac Mini, Cylon.local, took a bit of a nose dive last night. Now, it’s probably just a hard drive failure, which is actually not so bad 1, but I won’t know for sure until I take the little fella down to Tekserve‘s “ER” this weekend and get it properly diagnosed.

So one of the thoughts that naturally occurred to me is that there’s at least some small chance that Cylon.local won’t be coming back; perhaps the resurrection ship was simply too far away when the dreadful moment arrived.

I’d just bought a Mac Mini for my parents this past Christmas, so I already know the value proposition of replacing it with the latest model.

But, while I’m entertaining the notion of replacement hardware, it occurs to me that Dell rolled out a competitor a few months ago, called the Inspiron Zino HD. Now don’t get me wrong: I’m quite happy with the Mini’s performance over the last four years, and I’d be happy to keep it for as long as it’ll stick around with me, but any sensible man would think to check in on his options.

Read More

On the Forum on Modernizing Government

Here’s a YouTube playlist of videos published by The White House, which includes the complete forum sessions that followed the President’s opening remarks. 1

The forum is a series of round table panel discussions, featuring executives from the private sector (CEOs, CTOs, etc), as well as government officials brainstorming, sharing their experiences, and making recommendations.

One of my favorite parts comes at 0:56:25:

If the inefficiency isn’t captured and measured, and staring you in the face, it isn’t gonna be tackled as a project in the first place… If the government takes on a culture of streamlining, and attacking inefficiency, and looking for resource maximization, you’re going to start looking introspectively and measuring things that will — for the first time — put a line of sight on the inefficiency.

Watching all of these isn’t exactly an edge-of-your-seat thrill ride, to be sure, but think about how many times you or I have ever had the opportunity to be a “fly on the wall” during official government discourse. The very idea that this forum was live-streamed and published for free public access is a fantastic example of the sorts of operational practices that I deeply hope to see continue growing, particularly after the present Administration has completed its term in Washington.

  1. Start from video 2 in the playlist if you’ve already seen the President’s remarks, and just want to skip to the round table discussion.

The Twelve Year Road

In January of 1998, Netscape — in a last-ditch effort to retaliate against Microsoft’s domination of the browser market with its Internet Explorer browser — took to the strategy of open sourcing the code for their flagship product, Netscape Navigator. And so the Mozilla Project was born, which has since brought the world the Firefox web browser and the Thunderbird email client (as well as a handful of other things).

And only now, at the end of December 2009, Firefox 3.5 — the latest release of the software open sourced twelve years ago — has at long last edged out ahead of any single version of rival Internet Explorer.

Source: StatCounter Global Stats - Browser Version Market Share

It’s been a long road, Mozilla; congratulations on this hard-earned milestone.

White House Announces Open Government Plan

A post from earlier today on the White House blog by Peter Orszag, director of the Office of Management and Budget, announced the release of two new documents related to the Administration’s “open government” initiatives:

  • The Open Government Directive (download as pdf, txt, doc or view on Slideshare)
  • The Open Government Progress Report to the American People (download as pdf or view on Slideshare)

The post also includes a video of the live online chat in which federal CIO Vivek Kundra and federal CTO Aneesh Chopra announce the Open Government Plan, during which they fielded some questions in realtime from Facebook and Twitter.

Speaking of the value proposition of the initiative, Chopra explains:

So it’s having the conversation with each of our leaders to find out what are the big objectives that they wish to tackle on behalf of the President’s agenda, and in support of the American people. And how can the principles of Open Government, and in particular the datasets, allow others in the ecosystem to support — and advance on — those activities. We just can’t afford to have a federal government solution for every issue. By relying on the ingenuity of the American people we can advance these policy priorities in new and more creative ways.

I also particularly appreciate their speaking to an attempt to raise the quality of published data, particularly after it was discovered that some folks had provided shoddy data to Recovery.gov earlier this year.

Using Inspiration to Aim Education Towards Innovation

On 23 November 2009, President Barack Obama announced the new Educate to Innovate program (full transcript). The program is an initiative to stimulate America’s students to develop skills and consider careers in science, engineering, technology, and innovation.

What’s exciting about this program is that it aims beyond merely demanding improvements in public test scores for math and science from school districts. Unlike the No Child Left Behind Act, which — in a nutshell — is legislation targeted at making schools show improved standardized testing scores, the Educate to Innovate program instead aims directly at inspiring students to learn.

The program also ties in participation and investment commitments from the nation’s businesses, in an attempt to provide initiatives beyond the boundaries of the classroom:

Time Warner Cable is joining with the Coalition for Science After School and FIRST Robotics… to connect one million students with fun after-school activities, like robotics competitions. The MacArthur Foundation and industry leaders like Sony are launching a nationwide challenge to design compelling, freely available, science-related video games. And organizations representing teachers, scientists, mathematicians, and engineers – joined by volunteers in the community – are participating in a grassroots effort called “National Lab Day” to reach 10 million young people with hands-on learning.

Students will launch rockets, construct miniature windmills, and get their hands dirty. They’ll have the chance to build and create – and maybe destroy just a little bit … to see the promise of being the makers of things, and not just the consumers of things.

And the program doesn’t rely solely on the contributions of corporations; it also seeks to leverage the participation of teachers, science and technology professionals, and volunteers.

Of course, the players upon whose participation the program is counting are only part of the story. What’s additionally refreshing is the breadth of the approaches proposed to achieve the program’s goals. Academic competitions and after-school programs are fairly classic, but I’m rather pleased to see a proposal to create video games designed to catalyze the development of scientific skills — it speaks to an understanding of America’s youth communication culture. America’s young people aren’t engaged by slide shows and documentaries. They demand interactivity.

But interactivity isn’t all young people need. They also need role models. So the President also announced a new annual science fair at the White House, saying [emphasis mine]:

If you win the NCAA championship, you come to the White House. Well, if you’re a young person and you’ve produced the best experiment or design, the best hardware or software, you ought to be recognized for that achievement, too. Scientists and engineers ought to stand side by side with athletes and entertainers as role models, and here at the White House we’re going to lead by example. We’re going to show young people how cool science can be.

And finally, I was pleased to hear that part of the initiative’s core goals is to attempt to broaden the appeal of science, math, and technology to populations that aren’t traditionally the most likely to pursue such studies:

Through these efforts … we’re going to expand opportunities for all our young people – including women and minorities who too often have been underrepresented in scientific and technological fields, but who are no less capable of succeeding in math and science and pursuing careers that will help improve our lives and grow our economy.

Here’s a video of the President’s full speech (originally posted on the White House blog), which discusses additional parts of the initiative and offers several logistical details:

Additionally, here’s a video in which Education Secretary Arne Duncan and Office of Science and Technology Policy Director John P. Holdren answer questions about the “Educate to Innovate” initiative:

All in all, the initiative clearly has extremely ambitious goals.

And there are certainly a slew of improvements our educational system needs that this initiative simply doesn’t address, for while making a generation of critical-thinking, innovative, and technically-savvy Americans is a worthy goal for several reasons, the education system must also take care to prepare us for “everything else” in life, like health and nutrition, personal finance, and social and civic participation, just to name a few.

Even so, I’m terrifically heartened at the innovation and sensibility that’s demonstrably been applied towards defining the initiative’s fundamental methods. It speaks to an understanding and harnessing of lessons learned in recent years about the power of social participation to drive individual accomplishment.

Climategate: a Case Study in How Not to Conduct Research

Sometimes events arrive with a timing that is both serendipitous and uncanny. Only days after my last post, wherein I state a case for the growing importance of referencing the datasets and algorithms used in the distillation of research conclusions, comes a story about leaked correspondence records (email messages) amongst climate researchers working in affiliation with the East Anglia Climate Research Unit, or CRU.

From the NYT article:

The e-mail messages, attributed to prominent American and British climate researchers, include discussions of scientific data and whether it should be released, exchanges about how best to combat the arguments of skeptics…. Drafts of scientific papers … were also among the hacked data, some of which dates back 13 years.

To say the least, the leak contains some juicy fodder for skeptics of human-driven climate change.

Amongst these leaked emails, for example, are conversations which document various difficulties some of the CRU’s climate researchers have encountered over the years in trying to work with the data collected and managed by the organization. The Times article focuses on a discussion thread in which researcher Phil Jones mentions using a “trick” — originally employed by another colleague, Michael Mann — to “hide [a] decline” in temperatures apparently shown in some set of data.

In an interview about the leaked emails, Dr. Mann attempts to defuse the statement as a poor choice of words. Unfortunately, whether he’s being sincere or not, his is frankly a response that’s to be expected.

The article continues:

Some skeptics asserted Friday that the correspondence revealed an effort to withhold scientific information. “This is not a smoking gun; this is a mushroom cloud,” said Patrick J. Michaels, a climatologist who has long faulted evidence pointing to human-driven warming and is criticized in the documents.

This is also a statement that you’d expect from a climatologist building a career on a body of work disagreeing with the idea of human-driven warming. These emails are naturally material that skeptics of the human-driven climate change argument will latch onto (and, frankly, they certainly should; it’s just how scientific work is tested — through dispute).

The next several days saw a flurry of activity throughout the media and the blogosphere.

Before long, the name “Climategate” (kitschy but concise) gets attached to the discussions about the leaked materials. And since there’s a bit of both data and program source code in the mix, techies from around the world immediately jump into the fray.

One of the files from the leak discussed most heavily in techie circles is called HARRY_READ_ME.txt (copies available in both original format and a more structured edition). The story that unfolds in this file reveals the plight of a programmer named Harry, who struggled for three years attempting to reproduce some research results from a collection of data and the source code for an algorithm created to calculate research conclusions.

Sadly, this man’s three-year effort to reproduce the published results with the given material never succeeded. Here’s an excerpt from the file, for a glimpse at this poor fella’s mounting frustrations along the way:

getting seriously fed up with the state of the Australian data. so many new stations have been introduced, so many false references.. so many changes that aren’t documented. Every time a cloud forms I’m presented with a bewildering selection of similar-sounding sites, some with references, some with WMO codes, and some with both. And if I look up the station metadata with one of the local references, chances are the WMO code will be wrong (another station will have it) and the lat/lon will be wrong too. I’ve been at it for well over an hour, and I’ve reached the 294th station in the tmin database. Out of over 14,000. Now even accepting that it will get easier (as clouds can only be formed of what’s ahead of you), it is still very daunting. I go on leave for 10 days after tomorrow, and if I leave it running it isn’t likely to be there when I return! As to whether my ‘action dump’ will work (to save repetition).. who knows?

Yay! Two-and-a-half hours into the exercise and I’m in Argentina!

Pfft.. and back to Australia almost immediately :-( .. and then Chile. Getting there.

Unfortunately, after around 160 minutes of uninterrupted decision making, my screen has started to black out for half a second at a time. More video cable problems - but why now?!! The count is up to 1007 though.

I am very sorry to report that the rest of the databases seem to be in nearly as poor a state as Australia was. There are hundreds if not thousands of pairs of dummy stations, one with no WMO and one with, usually overlapping and with the same station name and very similar coordinates. I know it could be old and new stations, but why such large overlaps if that’s the case? Aarrggghhh!
There truly is no end in sight.

Assuming the original conclusions he was attempting to reproduce were all based on this data (and there’s frankly no reason to assume otherwise), it’s impossible to invest much confidence in their validity.

Charlie Martin, whose post I quote below, points out that the data and algorithms with which Harry was working were “inherited” from a previous researcher (or researchers), and came in a poorly-organized bundle with poor documentation. And what’s worse, Harry didn’t have access to anyone who had originally derived the conclusions he was tasked to reproduce. 1

The real egg on the face in this anecdote is that the CRU has clearly done an atrocious job of properly archiving their data and documenting the work their researchers produce. Naturally this level of disorganization is a serious problem anywhere it may occur, but it’s a particularly glaring issue in the field of scientific research, where the validity of research results rests squarely upon the ability of independent third parties to reliably reproduce those results on their own. Yet here we find that the CRU has either managed their data so poorly as to prevent its own scientists from reproducing the organization’s own published results (in which case “embarrassing” doesn’t even begin to describe the situation), or manipulated the data and produced false results. Either story tells a horrible tale about the CRU.

Charlie Martin, in a post to the Pajamas Media blog, writes:

I think there’s a good reason the CRU didn’t want to give their data to people trying to replicate their work.

It’s in such a mess that they can’t replicate their own results.

This is not, sadly, all that unusual. Simply put, scientists aren’t software engineers. They don’t keep their code in nice packages and they tend to use whatever language they’re comfortable with. Even if they were taught to keep good research notes in the past, it’s not unusual for things to get sloppy later. But put this in the context of what else we know from the CRU data dump:

1. They didn’t want to release their data or code, and they particularly weren’t interested in releasing any intermediate steps that would help someone else

2. They clearly have some history of massaging the data… to get it to fit their other results….

3. They had successfully managed to restrict peer review to … the small group of true believers they knew could be trusted to say the right things.

As a result, it looks like they found themselves trapped. They had the big research organizations, the big grants — and when they found themselves challenged, they discovered they’d built their conclusions on fine beach sand.

I won’t belabor the discussion of the implications these leaked documents offer; there is no shortage of people writing about exactly that. In case you’re interested in some of the more detailed coverage of the tech community’s review of the leaked data and algorithms, I would point you to the following pieces:

There’s also some great ongoing coverage at Devil’s Kitchen.

Regardless of whether there's any merit to the CRU's climate research, however, this little drama leaves me unable to resist repeating an argument from my last post:

But with all these arguments and assertions about corollaries, trends, and predictions that this number crunching activity will generate, it will become increasingly crucial to have a mechanism by which the results claimed to have been derived from the number-crunching can be accounted for.

It must … become incumbent upon anybody publishing findings derived from mining such data to share both the sources and processes used to derive their results or conclusions. In cases of claims rooted in the fruits of data mining endeavors, it is specifically important that results indicate:

1. exactly which data sets it draws from, and

2. precisely which algorithm(s) processed the data in question.

At this point, the specific implications this debacle has for the CRU's research are irrelevant. For, whether by deceit or incompetence, this leak has left their published research on climate change completely unreliable.

Yet developing confident clarity on the subject of their research remains critically important, for climate change is a real challenge that humankind must cope with. Regardless of whether human industrial activity is a driving factor for climate change, the ice at our poles _is_ melting at an accelerating rate; decades of satellite photos and other survey data sufficiently demonstrate this fact. Data collected over the last several decades by the world's meteorologists similarly show that global mean temperatures are rising and that extreme weather (from droughts and famines to floods and more) is increasing around the world.

The climate debate isn’t over whether these events are occurring, but instead whether human industrial activity accounts for a relevant piece of it.

Governments around the planet will be forced to take some sort of action to deal with the prospective repercussions of these changes (e.g., rising sea levels, expansion of the Sahara, and the rest). The consideration at stake, therefore, is how each country will individually and collectively direct their efforts and invest their resources in dealing with it.

If human industrial activity has bearing on the matter, we’ll have to make some serious policy changes and invest heavily in developing alternative methods of production, lest we imperil our own (and other) species. But if, on the other hand, our industrial activity is not a determining factor in climate change, our efforts are best spent trying to figure out how we’re going to deal with the realities of a changing climate that we cannot mitigate simply by being more responsible with our emissions.

In any case, everyone needs to make informed decisions about where they’re investing their money and efforts.

And so a number of the world’s governmental and industrial leaders (including US President Barack Obama) are scheduled to meet — along with members of the climate research community — at the United Nations Climate Change Conference in Copenhagen this December in an attempt to work out policy directions to deal with climate change. I’m hoping the event will focus on methods to improve and reinforce confidence in the remainder of the climate research work being conducted around the world, and that it won’t turn into a political food fight.

Fingers crossed.

I am left hoping that some real good can rise from this mess. And so I call on climate change researchers and institutions around the world to take this opportunity to develop the practice of providing full disclosure on the sources of their data sets and the functionality of their algorithms. There will likely be many political, legal, and logistical obstacles to address and overcome in this effort, but failure to do so carries stakes that are simply too high.

  1. I personally have plenty of experience attempting to work with poorly documented code and data inherited from someone else's work, and can directly attest to the maddening uphill battle that situation presents.

Fortifying Confidence by Stealing From Academics. And Scientists.

Driven in large part by open government efforts initiated by the Obama Administration, and particularly Federal CIO Vivek Kundra, tremendous and rich data sets have become available from the federal government, as well as some state and local governments. This data is published digitally, in organized, well-known and documented formats. 1

And because these government-amassed data sets have already been paid for with taxpayer dollars, they have rightfully been placed in the public domain, free of charge.

Tim O’Reilly — founder and CEO of O’Reilly Media, as well as organizer and host of various technology conferences including the Government 2.0 Summit — describes the thinking behind this policy decision in an article he wrote at Forbes, writing:

Rather than licensing government data to a few select “value added” providers, who then license the data downstream, the federal government (and many state and local governments) are beginning to provide an open platform that enables anyone with a good idea to build innovative services that connect government to citizens, give citizens visibility into the actions of government and even allow citizens to participate directly in policy-making.

The primary distribution point for the federal government’s data is the data.gov website (about which I’d earlier written). In another article he’d guest-authored for TechCrunch, Mr. O’Reilly talks about this website, writing:

Behind [the] site is the idea that government agencies shouldn’t just provide web sites, they should provide web services. These services, in effect, become the government’s SDK (software development kit). The government may build some applications using these APIs, but there’s an opportunity for private citizens and innovative companies to build new, unexpected applications. This is the phenomenon that Jonathan Zittrain refers to as “generativity”, the ability of open-ended platforms to create new possibilities not envisioned by their creators.

The range of potential applications for these data is difficult to exaggerate (or, frankly, even to imagine). A thorough exploration of these possibilities is beyond the scope of this post, but this showcase of apps built on data made available by the city of San Francisco gives a small peek at the broad range of uses for this government data.

In browsing that app showcase, I would note that none of the apps found there were written by the government. That’s zero. Rather, each was developed by a third party.

I would also note that most of those apps combine multiple data sets, many of which also include non-governmental data. 2
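This practice of combining data sets can be sketched in a few lines of Python. Everything below is invented for illustration: the permit and inspection records merely mimic the kind of civic data sets an app might join on a shared key, where real data would be fetched from the respective portals.

```python
# Minimal mashup sketch: joining a (hypothetical) restaurant-permit
# data set with a (hypothetical) health-inspection data set by permit ID.
permits = [
    {"permit_id": "P-001", "name": "Joe's Diner", "district": "Mission"},
    {"permit_id": "P-002", "name": "Bay Cafe", "district": "SoMa"},
]
inspections = [
    {"permit_id": "P-001", "score": 92},
    {"permit_id": "P-002", "score": 78},
]

# Index the second data set by the shared key for O(1) lookups.
scores = {row["permit_id"]: row["score"] for row in inspections}

# The "value-added" combined view that neither source provides on its own.
mashup = [{**p, "latest_score": scores.get(p["permit_id"])} for p in permits]

for row in mashup:
    print(row["name"], row["district"], row["latest_score"])
```

The interesting property is that the combined view answers questions ("which districts have low-scoring restaurants?") that neither source data set could answer alone.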

Clearly all this is just the beginning.

Opportunities and Challenges to Come

The data sets will grow broader as the federal government continues to expand its data offerings and more state and local governments follow suit, as Utah, San Francisco, and even my home town of New York City have already done.

As the data sets become richer throughout this process, mining the information on offer will provide opportunities to develop insights about matters ranging from public health to environmental developments and energy consumption, and from regional commercial performance to educational development.

And once there’s some historical depth to these records — through a combination of digitally publishing data sets from earlier years, as well as continuing to release emergent data — we will eventually even start to see the emergence of various types of projection models developed for many of the issues mentioned above, from economic development forecasts to predictions for the spread of disease outbreaks.

These data sets stand to revolutionize both entrepreneurial endeavors and academic research projects.

And with the research grant allocations en route from provisions of the American Recovery and Reinvestment Act of 2009, we’re likely to see a staggering number of new projects rise from both academia and the business world.

But with all these arguments and assertions about corollaries, trends, and predictions that this number crunching activity will generate, it will become increasingly crucial to have a mechanism by which the results claimed to have been derived from the number-crunching can be accounted for.

It’s not difficult to imagine, after all, the proliferation of claims that will begin to emerge, anchoring their proposed value in these mountains of data. 3 Luckily, after decades of exposure to some of the most talented number-spinning that statisticians teamed with PR specialists can produce, many people have developed a thick skin (and perhaps even a default suspicion) against letting “the numbers” speak to very much.

And rightfully so; “the numbers” can build nearly any narrative a storyteller wishes to weave, depending on how they’re sliced, diced, and manicured.

Numbers may not be able to lie, but men sure can.

Luckily, we can find some time-tested ways to mitigate falsification and incompetence by looking to the practices of scholarship and scientific peer review: scholars must meticulously cite their sources in bibliographies attached to their work, and scientists must accompany any published results with a detailed description of their methods.

It must similarly become incumbent upon anybody publishing findings derived from mining such data to share both the sources and processes used to derive their results or conclusions. In cases of claims rooted in the fruits of data mining endeavors, it is specifically important that results indicate:

1. exactly which data sets it draws from, and

2. precisely which algorithm(s) processed the data in question.

The trouble, however, is that there is neither a comprehensive repository nor a system for unique canonical identifiers to publicly and universally identify such data sets and algorithms. Their absence makes any attempts to reproduce such results very challenging, at best.
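Pending such a registry, one plausible building block is content addressing: derive the identifier from the bytes of the data set or source package itself, so that any party holding the same content computes the same identifier, with no central authority required. The sketch below is a minimal illustration of that idea; the `kind:sha256:digest` scheme and the manifest layout are my own inventions, not an existing standard.

```python
import hashlib
import json

def canonical_id(payload: bytes, kind: str) -> str:
    """Derive a stable identifier from the content itself.

    Hypothetical scheme: '<kind>:sha256:<digest>'. Identical bytes always
    yield the identical identifier, so a cited result can be checked
    against exactly the data and code that produced it.
    """
    digest = hashlib.sha256(payload).hexdigest()
    return f"{kind}:sha256:{digest}"

def citation_manifest(dataset: bytes, algorithm_source: bytes) -> str:
    """Bundle the two identifiers a published finding would need to cite:
    which data sets it draws from, and which algorithms processed them."""
    manifest = {
        "datasets": [canonical_id(dataset, "dataset")],
        "algorithms": [canonical_id(algorithm_source, "algorithm")],
    }
    return json.dumps(manifest, indent=2)

# Example: a toy temperature series and the script that processed it.
data = b"station,year,mean_temp\nNYC,2008,12.9\nNYC,2009,12.7\n"
code = b"def annual_mean(rows): ...\n"
print(citation_manifest(data, code))
```

A scheme like this would not replace a curated repository, but it would give the repository unambiguous keys to store and serve content under.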

Fortifying Confidence in the Results

Books, by contrast, have an ISBN. Books also have a governmental repository: the Library of Congress.

So I propose that similar mechanisms be worked out for data sets and algorithms. Perhaps serving as this repository becomes a natural evolution of the Library of Congress’s own charter. This repository would be a web service that exposes each individual data set and each data mining algorithm’s source code package under permalinks incorporating their respective canonical identifiers.

Potential examples of such permalinks may look something like this:


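In the spirit of the proposal, such permalinks might take roughly this shape; the host and path structure below are entirely hypothetical placeholders:

```
https://repository.example.gov/datasets/<dataset-canonical-id>
https://repository.example.gov/algorithms/<algorithm-canonical-id>
```

The essential property is only that each canonical identifier resolves to exactly one retrievable resource.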
Naturally, there are cases in which it may prove imperative to restrict access to some of the resources stored in this repository.

While I’ve focused so far on publicly available data sets, it is inevitable that some valuable projects will leverage data sets whose rights are privately owned, and to which access must be granted through some sort of permission from the owner.

The same concern is naturally prone to surface with some regularity for algorithms, as well.

This is a consideration that will require some real thought, but I’ll leave that to a future exploration. For the time being, I’ll simply note that HTTP does provide mechanisms for access restriction (particularly the 401, 402, and 403 status codes), leaving only the policies governing when those mechanisms apply to be worked out.

Although there’s loads to work out about how such a repository can be actualized, its availability will become crucial in the coming years.

I simply hope that both the practice of sharing data sources and methods — as well as a suitable canonical repository for them — materialize sooner rather than later, since only a few silly and reckless abuses of this data could undermine public confidence in efforts to fully harness its potential value.

  1. Importantly, the manner in which all this data is distributed makes it ideal packaging for use as input to data-crunching algorithms developed by anyone interested in doing so.
  2. This practice of combining data sets from different sources to create a new, value-added data set is referred to as creating a mashup.
  3. It will also certainly be leveraged to evaluate the government’s performance, both by the current administration and — perhaps more compellingly — by its political opponents.