22 Jul

Still Quite An Analog World We’re Living In

Sometimes it feels like we’re swimming in an endless sea of digital content, and to the extent that I, for instance, have more than a terabyte of media on my own local network (currently 16 IP devices!), we are. But once you dive in, you realize that a lot of it is digital only on the surface and analog underneath: MP3 files that are just a compressed sound wave rather than the original digital tracks, or scanned PDF files that you can’t search or index. Consider that Wal-Mart PDF annual reports from as early as the mid nineties are images, not text. Similarly, many government documents released after FOIA requests are just scans. What you see happening is digital sources dumbed down into digitized analog content. A lot of structure, metadata and meaning is lost in the process, and “print to PDF” as opposed to “save as PDF” makes a world of difference. I’m impressed by a number of ongoing efforts, from the Show Us a Better Way initiative in the UK to what sites such as Freebase are trying to accomplish, but we’re still a long way away from the universal data cube!
As a publisher we’re starting to work on adding our very modest contribution toward that elusive vision. An enormous amount of time is wasted within companies just gathering and aggregating market or industry data. You haven’t even started analyzing the data before you’re already exhausted by all the scrape-copy-paste-clean-massage-normalize work involved. This makes it hard to reach conclusive insights because the waters often remain muddied with apples mixed with oranges, and it’s no better when companies are dealing with their own internal data.
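The “massage-normalize” step above is the kind of thing that can be sketched in a few lines. This is a toy illustration, not our actual pipeline; the field names, units and records are all invented:

```python
# A minimal sketch of normalizing market figures scraped from different
# sources into one common unit, so apples stop mixing with oranges.
# Field names and records below are hypothetical.

UNIT_FACTORS = {"thousand": 1_000, "million": 1_000_000, "billion": 1_000_000_000}

def normalize_record(raw):
    """Turn a scraped record like {'revenue': '1.2', 'unit': 'billion'}
    into a plain dollar figure."""
    value = float(str(raw["revenue"]).replace(",", "").strip())
    unit = raw.get("unit", "").lower().strip()
    return value * UNIT_FACTORS.get(unit, 1)

records = [
    {"revenue": "1.2", "unit": "billion"},
    {"revenue": "350", "unit": "million"},
]
normalized = [normalize_record(r) for r in records]
```

The real pain, of course, is everything this sketch hides: figuring out each source’s units and conventions in the first place.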
We’re not going to go after this pain by trying to boil the semantic ocean (good luck with that to the start-ups in that field). Rather, we intend to put together tight data packages in our selected verticals. Trade publishing is stuck in the 80s for the most part, which helps explain the turmoil currently seen in companies such as Reed Elsevier, Penton or Cygnus. The print-and-events legacy is really hard for these guys to shake off. There’s a lot of value locked there that’s just not delivered to business audiences in convenient ways. Data products tend to be published behind the firewall through expensive and complicated offerings. I’m not saying we have the answer, but I do think we “have the question” better than most.
Hopefully we’ll start fleshing out these ideas into actual products within the next 12 months. It’s been baking for a while, from our Focus Article format at Defense Industry Daily to pretty much what MarketingCharts.com is all about. Now we intend to turn our sites into application/news hybrids (let’s face it, publishing charts in GIF format is just a stopgap), and that’s going to be a tough but fun ride. Now let me go back to shutting up and working on execution!

23 Nov

Google Base v. microformats

I think Jeff Jarvis frames the issue properly (open format vs. walled garden) but it’s very early to make a call about Google’s intent. I’d say they want to give themselves a head start in terms of surfacing Google Base content across their services (e.g. Local) but they’ll probably expose it to the outside world sooner or later. Not doing so seems not only at odds with their roots but, more importantly, it would leave them vulnerable to a more open joint effort by Microsoft and Yahoo, not to mention countless smaller competitors.

09 Sep

Movie Tags at Imdb

Movie Keywords Analyzer (MoKA), like Flickr, lets users apply tags to non-text content, where they make sense as a way to support free-wheeling descriptions and, in my opinion, complement rather than compete with more structured taxonomies. It’s not quite faceted navigation, but it’s still nice to find, say, movies that feature drugs, dream, death. Obviously there’s a lot of tagging left to be done; I’m curious about how fast and accurate it’s going to be.
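That multi-tag lookup is conceptually just a set intersection. Here is a toy sketch of it; the titles and tag sets are made up for illustration and have nothing to do with MoKA’s actual data:

```python
# A toy tag query: find movies whose tag sets contain ALL the
# requested free-form keywords. Catalog contents are invented.

movies = {
    "Requiem for a Dream": {"drugs", "dream", "death", "addiction"},
    "Trainspotting": {"drugs", "death", "humor"},
    "Inception": {"dream", "heist"},
}

def tagged_with(catalog, *tags):
    """Return titles whose tags are a superset of the requested tags."""
    wanted = set(tags)
    return sorted(title for title, movie_tags in catalog.items() if wanted <= movie_tags)

hits = tagged_with(movies, "drugs", "dream", "death")
```

Faceted navigation would layer orthogonal dimensions (genre, year, rating) on top of this flat keyword match, which is exactly what plain tagging doesn’t give you.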

09 Sep

Structured Data Blogging

For some reason I hadn’t bumped into Reger yet, but it’s definitely intriguing:

"[W]hat this tool excels at is allowing you to capture extended data fields with each entry. As you blog and collect data, you can then mine that data with custom graphs, advanced saved data searches and data-enabled RSS feeds. All right out of the box with no complex user manuals or custom code. You can create a custom log type to log any sort of activity you can imagine."

Watched a couple of videos; they’re OK, though the sound production values are so-so and it still seems rough around the edges, especially the overall visual design and more specifically the form rendering. But then, some of it should be considered “alpha software,” says the voice-over, and who likes HTML forms anyway? From a marketing perspective I get the sense Reger is trying to cover too much ground too quickly, at the risk of not being a clear killer app for anyone. That probably comes with the Swiss Army knife nature of such a tool. Since normal people don’t go around creating XML schemas, the whole thing needs to be streamlined and refined, but the basic premise is intriguing. Many people track the craziest things with Excel, and a lot of it would make sense online (as much as these things make sense in the first place, that is).
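The core idea — entries that carry extra typed fields you can mine later — is simple enough to sketch. This is my back-of-the-envelope reading of it, not Reger’s actual data model; the “running log” schema below is invented:

```python
# A sketch of structured-data blogging: each entry is a normal post plus
# arbitrary extra fields, which can then be mined without custom code.
# The field names and entries are hypothetical.

from statistics import mean

entries = [
    {"date": "2005-09-01", "text": "easy jog", "distance_km": 5.0},
    {"date": "2005-09-03", "text": "tempo run", "distance_km": 8.0},
    {"date": "2005-09-05", "text": "long run", "distance_km": 12.0},
]

def field_summary(log, field):
    """Aggregate one numeric field across all entries, Excel-style."""
    values = [e[field] for e in log if field in e]
    return {"count": len(values), "total": sum(values), "mean": mean(values)}

summary = field_summary(entries, "distance_km")
```

This is exactly the kind of thing people track in spreadsheets today; the interesting part of Reger is putting the capture form and the mining in the same place as the prose.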

19 Jul

Who’s Going to Take Online Real Estate to the Next Level?

Combine Realtor‘s MLS-based national (i.e. US) search with the HousingMaps UI (because who cares about Craigslist housing listings?), including smart ideas such as a history of past sales as featured by Redfin (which is not alone in overlaying local MLS data on maps). Mix in itinerary (by foot or car) and transportation applications.

23 Aug

Distributed Classification through Self-Interest

Victor Lombardi writes about a retailer he worked with that implemented "distributed classification":

"[T]hey have many thousands of products that need classifying on a regular basis. The products are relatively inexpensive commodities that change often and are sold through stores, a print catalog, and online. All the information about the products, including classification, is managed in one big content management system. […]
Instead of looking around their own organization for someone to classify (someone who has no interest in getting it done right, other than being paid to do so), they moved classification outside the organization to those who already have a self-interest in getting it done, the manufacturers whose main focus is selling more products. This is a more scalable solution than hiring a team of librarians. The rules of the system keep the manufacturers from abusing it."

You can’t just throw your taxonomy at outsiders and expect it to work as is, or else you’ll likely have a GIGO (garbage in, garbage out) problem. But if you provide them with some kind of training or guidelines, and make sure their input is clean and relevant, then yes, this is a good way to go. At our site about fantasy and science fiction books we collect a lot of information provided by readers and writers, and this lets us improve our database faster than doing it all by ourselves.
One of my goals is to use the site itself (rather than email) to collect that kind of metadata more systematically, and to cut the number of steps between third-party input and exposure on the site (I don’t plan to go 100% toward a wiki model, so there’s always going to be a validation step in the middle of the workflow rather than after the fact). But most of the data we get from the outside world is of high quality once we’ve established with these contributors what it is we’re looking for (something the public site already goes a long way toward making explicit). And the self-interest logic is fully at work for us too, which goes without saying for writers (the additional data will contribute to promoting their works better) but also holds for readers, who simply want to enjoy a better resource about their favorite genre.
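That validation step in the middle of the workflow is where GIGO gets stopped. A minimal sketch of the idea — the controlled vocabulary and submitted tags are hypothetical, not our actual genre list:

```python
# A sketch of validating third-party contributions against a controlled
# vocabulary before they touch the database. Vocabulary is hypothetical.

CONTROLLED_VOCAB = {"space opera", "hard sf", "epic fantasy", "urban fantasy"}

def validate_submission(submitted_tags):
    """Split a contributor's tags into accepted and rejected lists,
    normalizing case and whitespace to avoid spelling discrepancies."""
    tags = {t.strip().lower() for t in submitted_tags}
    return sorted(tags & CONTROLLED_VOCAB), sorted(tags - CONTROLLED_VOCAB)

accepted, rejected = validate_submission(["Space Opera", "grimdark", "Hard SF"])
```

In practice the rejected pile is worth reviewing by hand: it’s either noise, or a signal that the vocabulary itself needs a new term.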
01/07/05 update: Lou Rosenfeld: Folksonomies? How about Metadata Ecologies?, and Clay Shirky’s counter-point.

23 Jun

Interview with Ross Blanchard, Gracenote

Sandy and Dave’s Broadband report:

"[T]he CDDB database now includes nearly 3 million CDs and more than 36 million songs. Each day, users from ninety countries submit about ten thousand new albums to CDDB. In the US, Korea and Japan, Gracenote has editorial staff to “vet” the entries and “lock them down” to prevent modification by users. Because of the volume of entries, the editors are focused on the most popular content; of the most popular CDs, “the high 90% are locked down.”
For the smaller labels, Gracenote has a “content partner program” so that they can directly submit album and track metadata for their own CDs and lock them down.
Gracenote does not lock down the genre metadata, which is highly subjective. They allow for two different genres for each album, artist and song.
Gracenote licenses a CDDB Software Development Kit (SDK) to software developers. Most PC-based CD players and recorders use this kit, which is also available for Apple and Linux-based systems. Gracenote also provides an “Embedded CDDB” solution; for portable devices and others without an active Internet connection, this provides a copy of CDDB for storage on a hard drive."

Here’s another example where bottom-up contributions are mixed with a stricter controlled vocabulary.
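The mechanics of that mix — open user submissions plus an editorial lock-down on vetted entries — can be sketched in a few lines. This is my reading of the interview, not Gracenote’s actual (proprietary) system, and the API shape is invented:

```python
# A sketch of CDDB-style moderation: anyone can submit metadata, but
# editors can "lock down" vetted records against further user edits.

class MetadataStore:
    def __init__(self):
        self.records = {}
        self.locked = set()

    def submit(self, album_id, metadata, editor=False):
        """Accept a submission unless the record is locked to non-editors."""
        if album_id in self.locked and not editor:
            return False
        self.records[album_id] = metadata
        return True

    def lock(self, album_id):
        self.locked.add(album_id)

store = MetadataStore()
store.submit("cd-001", {"artist": "Miles Davis", "album": "Kind of Blue"})
store.lock("cd-001")
blocked = store.submit("cd-001", {"artist": "M. Davis"})  # user edit rejected
```

Note how the genre fields could simply be left out of the lock, matching Gracenote’s choice to keep the most subjective metadata open.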
07/07/04 update: Wired: The House That Music Fans Built.

24 May

How to Share a Taxonomy with Other Photoshop Album Users?

Photoshop Album is one of those rare applications that make meta tagging so easy that even "normal people" can use it. Now, let’s say I want to create a taxonomy and share it with other users. Is there a way to do it so that a) they don’t need to re-create those tags, and b) they’re going to use my controlled vocabulary in order to avoid alternate spellings and other discrepancies? I could even throw in some fun, such as nice, well-chosen thumbnails for all those tags. (Don’t you love an app that lets you choose your own thumbnails to visualize metatags? You should see the feature in action; it’s very graceful.)
The goal is to keep everyone on the same page to feed a common database. Is there an explicit intersection between desktop photo management tools and the semantic web (say, an ontology created with Protégé)?
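In the absence of a built-in feature, the shape of the answer is simple: serialize the taxonomy once and have everyone import the same file. A sketch under that assumption — the tag names and the JSON format are hypothetical, and Photoshop Album has no such import/export hook that I know of:

```python
# A sketch of sharing a controlled tag vocabulary: export it to a common
# file so every recipient gets identical spellings instead of re-typing
# (and misspelling) the tags. Taxonomy contents are invented.

import json

taxonomy = {
    "People": ["Family", "Friends"],
    "Places": ["Paris", "New York"],
    "Events": ["Birthday", "Vacation"],
}

def export_taxonomy(tax):
    return json.dumps(tax, sort_keys=True)

def import_taxonomy(payload):
    """Everyone imports the same canonical names: no 'NYC' vs 'New York'."""
    return json.loads(payload)

shared = import_taxonomy(export_taxonomy(taxonomy))
```

The thumbnails would just be one more field per tag; the hard part is social, not technical: getting everyone to actually import the file instead of free-typing.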

19 May

Freakingly Detailed Survey of a Single NYC Block

One Block Radius is fascinating:

"One Block Radius, a project of Brooklyn artists Christina Ray and Dave Mandl [known collaboratively as Glowlab], is an extensive psychogeographic survey of the block where New York’s New Museum of Contemporary Art will build a new 60,000 square foot facility beginning in late 2004. […]
While the block is bit-size in relation to the surrounding metropolis, the changes it is about to undergo are massive. One Block Radius plays with this idea of scale, aiming to zoom in and physically data-mine the tiny area for the amount of information one would normally find in a guide book for an entire city. This feature-rich urban record will include personal perspectives from diverse sources such as city workers, children, street performers, artists and architectural historians. Engaging a variety of tools and media such as blogs, video documentation, maps, field recordings and interviews, Glowlab will create a multi-layered portrait of the block as it has never been seen before (and will never be seen again)."

01 Mar

Let’s Start Memes: Best of Blog, Blog Spring Cleaning

It seems that at about the same time as me, Tom Coates is recategorizing his blog to extract his best entries into a “best of” category. Since my migration from Blogger last month, I’ve started doing likewise, though I’m not finished deleting the crap, fixing broken links and titles, wrapping quotes in blockquote tags, and generally speaking overhauling my archives so that they have better long-lasting value. Start identifying your best-of entries, and let’s aggregate them through a category trackback on some site. (Of course this is ripe for spamming, but self-declared quality might be worth tracking in Technorati.)
Here are a few thoughts so far related to my blog spring cleaning:

  • If you don’t already do it, start using semantics now. Go for the low-hanging fruit: wrap quotes in blockquote tags and use at least a few categories (try to define them properly beforehand, though), because some day you’ll want to do it and it will take you days to go back and redo it for all your old content.
  • I’d like some posts not to be archived. They’re interesting as a passing notice, but two years later it’s obvious they had no value past the transient homepage. Is there an MT plugin to do that on a per-post basis? I could use a specific category for it, but I don’t care to change all my archive templates to filter it out.
  • The “best of” category is only a start in categorization guided by hindsight. Other fun perspectives I could add might rate mood, tone, timeliness or clarity of vision, such as "passive-aggressive", "upbeat!", "self-righteous rants", "when I’m right, I’m right", "sign of the times", "meme laggard", or "I was wrong before anyone else."
  • Going back in time, there’s an obvious emergence of patterns based on what was on my mind at the time (for instance, the 2002 presidential election in France) or in the general zeitgeist (say, corporate scandals). A way to make them explicit would be to map the most-used category by month, quarter or year. I’ll probably do that when I have the time.
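That last mapping is a one-Counter job. A sketch of it, with invented posts standing in for my archive:

```python
# A sketch of "most-used category by period": tally categories per year
# from (date, category) pairs. The posts below are invented examples.

from collections import Counter, defaultdict

posts = [
    ("2002-04-23", "french politics"),
    ("2002-05-02", "french politics"),
    ("2002-07-11", "corporate scandals"),
    ("2003-03-14", "metadata"),
    ("2003-06-20", "metadata"),
]

def top_category_by_year(entries):
    by_year = defaultdict(Counter)
    for date, category in entries:
        by_year[date[:4]][category] += 1  # key on the year prefix
    return {year: counts.most_common(1)[0][0] for year, counts in by_year.items()}

zeitgeist = top_category_by_year(posts)
```

Slicing `date[:7]` instead would give the same view by month, which is probably the more interesting granularity for spotting passing obsessions.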