boringtechstuff

Friday, October 12, 2007

Watch the Alpha Geeks

Note that this article was written way back (in internet time) in 2002. It bears a striking similarity to what's happening today.

--------------------


Inventing the Future
by Tim O'Reilly
04/09/2002

"The future is here. It's just not evenly distributed yet." I recently came across that quote from science-fiction writer William Gibson, and I've been repeating it ever since.

So often, signs of the future are all around us, but it isn't until much later that most of the world realizes their significance. Meanwhile, the innovators who are busy inventing that future live in a world of their own. They see and act on premises not yet apparent to others. In the computer industry, these are the folks I affectionately call "the alpha geeks," the hackers who have such mastery of their tools that they "roll their own" when existing products don't give them what they need.

The alpha geeks are often a few years ahead of their time. They see the potential in existing technology, and push the envelope to get a little (or a lot) more out of it than its original creators intended. They are comfortable with new tools, and good at combining them to get unexpected results.

What we do at O'Reilly is watch these folks, learn from them, and try to spread the word by writing down (or helping them write down) what they've learned and then publishing it in books or online. We also organize conferences and hackathons at which they can meet face to face, and do advocacy to get wider notice for the most important and most overlooked ideas.
O'Reilly Emerging Technologies Conference

The 2002 O'Reilly Emerging Technologies Conference explored how P2P and Web services are coming together in a new Internet operating system.

So, what am I seeing today that I think the world will be writing about (and the venture capitalists and entrepreneurs chasing) in the years to come?

* Wireless. Community 802.11b networks are springing up everywhere as hackers realize they can share their high-speed, high-cost Internet connections, turning them into high-speed, low- or no-cost connections for a larger group of people. Companies like Apple are building 802.11b into their new hardware, but that's just a convenient springboard.

The hackers are extending the range of their networks with homemade antennas--the antenna shootout between Pringles cans, coffee cans, and tomato juice cans, and the discussion of how the ridges in the bottom of a can happen to match up to wireless wavelengths, represent hacker ingenuity at its best. But wireless community networks are only the tip of the iceberg.

If you watch the alpha geeks, you notice that they are already living in a future made up of ubiquitous wireless connectivity, not just for their PCs, but for a variety of computing devices. The furthest-out of them are into wearable computing, with access to the Net as much a part of what they put on each morning as a clean pair of socks.

* Next generation search engines. Early search engines used brute force. Google uses link information to make searches smarter. New search engines are taking this even farther, basing searches on the implicit webs of trust and interest reflected not only by link counts (a la Google) but by who specifically links to whom.

It's easy to take search engines for granted. But they are prototypes for functionality that we will all need when our personal data storage exceeds that which the entire Web required only a few years ago.

* Weblogs. These daily diaries of links and reflections on links are the new medium of communication for the technical elite. Replacing the high-cost, high-octane, venture-funded Web site with one that is intensely personal and built around the connectivity between people and ideas, they are creating a new set of synapses for the global brain. It's no accident that weblogs are increasingly turning up as the top hits on search engines, since they trade in the same currency as the best search engines--human intelligence, as reflected in who's already paying attention to what.

Weblogs aren't just the next generation of personal home pages, representing a return to text over design, and lightweight content management systems. They are also a platform for experimentation with the way the Web works: collective bookmarking, virtual communities, tools for syndication, referral, and Web services.

* Instant messaging, not just between people but between programs. A generation of people who grew up on IM ask themselves why it needs to be just a toy. They are making collaboration, "presence management," and instant communication into a business application, but more than that, they are making messaging the paradigm for a new class of applications. One developer we know used the Jabber instant messaging framework to let him control his SAP database--about as corporate as you get--from his cell phone. Microsoft is busy making instant messaging functionality a standard part of the developer toolkit in .Net MyServices.

Related Reading
Planning for Web Services: Obstacles and Opportunities
An O'Reilly Research Report
By Clay Shirky

* File sharing. Napster may have been shut down by the legal system, but the ideas behind it are blindingly obvious in retrospect. While entrepreneurs mired in the previous generation of computing built massive server farms to host downloadable music archives, Shawn Fanning, a young student who'd grown up in the age of the Internet, asked himself, "Why do I need to have all the songs in one place? My friends already have them. All I need is a way for them to point to each other." When everyone is connected, all that needs to be centralized is the knowledge of who has what.

Perhaps even more excitingly, projects like BitTorrent provide raw Internet performance increases, as downloads are streamed not from single sites but from a mesh of cooperating PCs, a global grid of high-performance anonymous storage. We're also seeing desktop Web sites exposing the local file-system via distributed-content management systems. This is fundamental infrastructure for a next generation global operating system.

* Grid computing. The success of SETI@home and other similar projects demonstrates that we can use the idle computing power of millions of interconnected PCs to work on problems that were previously intractable because of the cost of dedicated supercomputers. We're just scratching the surface here. Large-scale clustering, and the availability of large amounts of computer power on demand--a computing utility much like the power grid--will have an enormous impact on both science and business in the years to come.

* Web spidering. Once primarily the province of search engines, Web spidering is becoming ubiquitous, as hackers realize they can build "unauthorized interfaces" to the huge Web-facing databases behind large sites, and give themselves and their friends a new and useful set of tools. More on this in a moment.
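
The file-sharing point in the list above, that only "the knowledge of who has what" needs to be centralized, can be sketched in a few lines. This is an illustration only and not a description of Napster's actual protocol: the class name CentralIndex, its methods, and the peer addresses are all hypothetical. The index stores no songs, only which peers claim to hold each title.

```python
from collections import defaultdict


class CentralIndex:
    """Hypothetical central directory: it knows who has what, and holds no files."""

    def __init__(self):
        self._holders = defaultdict(set)  # song title -> set of peer addresses

    def announce(self, peer, titles):
        """A peer comes online and reports the titles it can share."""
        for title in titles:
            self._holders[title].add(peer)

    def depart(self, peer):
        """A peer goes offline; forget everything it offered."""
        for peers in self._holders.values():
            peers.discard(peer)

    def lookup(self, title):
        """Return the peers to contact directly for this title."""
        return sorted(self._holders.get(title, ()))


index = CentralIndex()
index.announce("alice:6699", ["Song A", "Song B"])
index.announce("bob:6699", ["Song B"])
both = index.lookup("Song B")      # the transfer itself would be peer to peer
index.depart("alice:6699")
after_departure = index.lookup("Song A")
print(both, after_departure)
```

The design choice worth noticing is that the server's storage grows with the number of titles and peers, not with the size of the music itself.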

All of these things come together into what I'm calling "the emergent Internet operating system." The facilities being pioneered by thousands of individual hackers and entrepreneurs will, without question, be integrated into a standardized platform that enables a next generation of applications. (This is the theme of our Emerging Technologies conference in Santa Clara May 13-16, "Building the Internet Operating System.") The question is, who will own that platform?

Both Microsoft and Sun (not to mention other companies like IBM and BEA) have made it clear that they consider network computing the next great competitive battleground. Microsoft's .Net and Sun's Java (from J2ME to J2EE) represent ambitious, massively engineered frameworks for network computing. These network operating systems--and yes, at bottom, that's what they are--are designed to make it easier for mainstream developers to use functions that the pioneers had to build from scratch.

But the most interesting part of the story is still untold, in the work of hundreds or thousands of independent projects that, like a progressively rendered image, will suddenly snap into focus. That's why I like to use the word "emergent." There's a story here that is emerging with increasing clarity.

What's more, I don't believe that the story will emerge whole-cloth from any large vendor. The large vendors are struggling with how to make money from this next generation of computing, and so they are moving forward slowly. But network computing is a classic case of what Clayton Christensen, author of The Innovator's Dilemma, calls a disruptive technology. It doesn't fit easily into existing business models or containers. It will belong to the upstarts, who don't have anything to lose, and the risk-takers among the big companies, who are willing to bet more heavily on the future than they do on the past.

Let's take Web services as an example. Microsoft recently announced they hadn't figured out a business model for Web services, and were slowing down their ambitious plans for building for-pay services. Meanwhile, the hackers, who don't worry too much about business models, but just try to find the shortest path to where they are going, are building "unauthorized" Web services by the thousands.

Spiders (programs which download pages automatically for purposes ranging from general search engines to specialized shopping comparison services to market research) are really a first-generation Web service, built from the outside in.

Spiders have been around since the early days of the Web, but what's getting interesting is that as the data resources out on the Net get richer, programmers are building more specialized spiders--and here's the really cool bit--sites built with spiders themselves are getting spidered, and spiders are increasingly combining data from one site with data from another.

One developer I know built a carpool planning tool that recommended ridesharing companions by taking data from the company's employee database, then spidering MapQuest to find people who live on the same route.

There are now dozens of Amazon rank spiders that will help authors keep track of their book's Amazon rank. We have a very powerful one at O'Reilly that provides many insights valuable to our business that are not available in the standard Amazon interface. It allows us to summarize and study things like pricing by publisher and topic, rank trends by publisher and topic over a two-year period, correlation between pricing and popularity, relative market share of publishers in each technology area, and so on. We combine this data with other data gleaned from Google link counts on technology sites, traffic trends on newsgroups, and other Internet data, to provide insights into tech trends that far outstrip what's available from traditional market research firms.

There are numerous services that keep track of eBay auctions for interested bidders. Hackers interested in the stock market have built their own tools for tracking pricing trends and program trading. The list goes on and on, an underground data economy in which Web sites are extended by outsiders to provide services that their owners didn't conceive.

Right now, these services are mostly built with brute force, using a technique referred to as "screen scraping." A program masquerading as a Web browser downloads a Web page, uses pattern matching to find the data it wants, keeps that, and throws the rest away. And because it's a program, not a person, the operation is repeated, perhaps thousands of times a minute, until all the desired data is in hand.
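
A minimal sketch of the screen-scraping loop just described, in Python. Everything specific here is an assumption made for illustration: the sample markup, the field names, and the regular expressions are hypothetical and do not reflect any real site's pages. The fetch function shows the "masquerading as a Web browser" step by sending a browser-like User-Agent header; it is defined but not called, so the example runs without network access.

```python
import re
import urllib.request

# Hypothetical markup: real pages differ and change without notice.
SAMPLE_PAGE = """
<html><body>
<h1 class="title">Programming Perl</h1>
<p>Lots of description, reader comments, and navigation...</p>
<span class="price">$49.95</span>
<span class="rank">Sales Rank: 1,234</span>
</body></html>
"""

# One pattern per wanted field; the rest of the page is thrown away.
PATTERNS = {
    "title": re.compile(r'<h1 class="title">(.*?)</h1>'),
    "price": re.compile(r'<span class="price">\$([\d.]+)</span>'),
    "rank": re.compile(r"Sales Rank: ([\d,]+)"),
}


def fetch(url):
    """Download a page while presenting a browser-like User-Agent header."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape(html):
    """Return the fields matched by PATTERNS; a field that is not found maps to None."""
    found = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(html)
        found[name] = match.group(1) if match else None
    return found


record = scrape(SAMPLE_PAGE)
print(record)
```

The fragility is visible in the patterns themselves: any change to the page layout silently breaks the match, which is part of why this approach counts as brute force.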

For example, every three hours, amaBooks, our publishing market research spider, downloads information about thousands of computer books from Amazon. The Amazon Web page for a book like Programming Perl is about 68,000 bytes by the time you include description, reader comments, etc. The first time we discover a new book, we want under a thousand bytes--its title, author, publisher, page count, publication date, price, rank, number of reader reviews, and average value of reader reviews. For later visits we need even less information: the latest rank, the latest number of reviews, and any change to pricing. For a typical run of our spider, we're downloading 24,000,000 bytes of data when we need under 10,000.
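
The waste implied by those figures can be worked out directly. The sketch below uses only the numbers quoted above; because the text gives the needed amounts as upper bounds ("under a thousand bytes," "under 10,000"), the ratios are lower bounds on the real overhead. The variable names are illustrative.

```python
# Figures quoted in the paragraph above.
PAGE_BYTES = 68_000            # approximate size of one Amazon book page
FIRST_VISIT_BYTES = 1_000      # upper bound on what a first visit needs
DOWNLOADED_BYTES = 24_000_000  # fetched in a typical run
NEEDED_BYTES = 10_000          # upper bound on what the run needs

page_overhead = PAGE_BYTES / FIRST_VISIT_BYTES  # per-page ratio on a first visit
run_overhead = DOWNLOADED_BYTES / NEEDED_BYTES  # whole-run ratio
useful_percent = 100 * NEEDED_BYTES / DOWNLOADED_BYTES

print(f"per page: at least {page_overhead:.0f}x more than needed")
print(f"per run:  at least {run_overhead:.0f}x more than needed")
print(f"useful share of a run: at most {useful_percent:.3f}%")
```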

Eventually, these inefficient, brute-force spiders, built that way because that's the only way possible, will give way to true Web services. The difference is that a site like Amazon or Google or MapQuest or E*Trade or eBay will not be the unwitting recipient of programmed data extraction, but a willing partner. These sites will offer XML-based APIs that allow remote programmers to request only the data they need, and to re-use it in creative new ways.
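
To make the contrast with screen scraping concrete, here is a sketch of what consuming such an API might look like. The XML document below is entirely hypothetical: the element and attribute names are invented for illustration and are not any real site's schema. The point is only that the response carries the handful of fields a follow-up visit needs (rank, review count, price) and nothing else.

```python
import xml.etree.ElementTree as ET

# Hypothetical response body; no real site's schema is implied.
SAMPLE_RESPONSE = """
<book id="example-id">
  <rank>1234</rank>
  <reviews count="87" average="4.5"/>
  <price currency="USD">49.95</price>
</book>
"""


def parse_update(xml_text):
    """Turn the (hypothetical) XML response into a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    reviews = root.find("reviews")
    return {
        "id": root.get("id"),
        "rank": int(root.findtext("rank")),
        "review_count": int(reviews.get("count")),
        "review_average": float(reviews.get("average")),
        "price": float(root.findtext("price")),
    }


update = parse_update(SAMPLE_RESPONSE)
response_size = len(SAMPLE_RESPONSE.encode("utf-8"))
print(update, response_size)
```

A response of this shape is a few hundred bytes, compared with the roughly 68,000-byte page mentioned earlier, and it needs no pattern matching against presentation markup.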

Why would a company that has a large and valuable data store open it up in this way?

My answer is a simple one: because if they don't ride the horse in the direction it's going, it will run away from them. The companies that "grasp the nettle firmly" (as my English mother likes to say) will reap the benefits of greater control over their future than those who simply wait for events to overtake them.

There are a number of ways for a company to get benefits out of providing data to remote programmers:

* Revenue. The brute force approach imposes costs both on the company whose data is being spidered and on the company doing the spidering. A simple API that makes the operation faster and more efficient is worth money. What's more, it opens up whole new markets. Amazon-powered library catalogs, anyone?

* Branding. A company that provides data to remote programmers can request branding as a condition of the service.

* Platform lock-in. As Microsoft has demonstrated time and time again, a platform strategy beats an application strategy every time. Once you become part of the platform that other applications rely on, you are a key part of the computing infrastructure, and very difficult to dislodge. The companies that knowingly take their data assets and make them indispensable to developers will cement their role as a key part of the computing infrastructure.

* Goodwill. Especially in the fast-moving high-tech industry, the "coolness" factor can make a huge difference both in attracting customers and in attracting the best staff.

Even though I believe that revenue is possible from turning Web spiders into Web services, I also believe that it's essential that we don't make this purely a business transaction. One of the beauties of the Internet is that it has an architecture that promotes unintended consequences. You don't have to get someone else's permission to build a new service. No business negotiation. Just do it. And if people like what you've done, they can find it and build on it.

As a result, I believe strongly that Web services APIs need to have, at minimum, a low-volume option that remains free of charge. It could be done in the same way that a company like Amazon now builds its affiliates network. A developer uses a self-service Web interface to sign up online for a unique ID, which must be presented for XML-based data access. At low volumes (say 1,000 requests a day), the service is free. This promotes experimentation and innovation. But at higher volumes, which would suggest a service with commercial possibility, pricing needs to be negotiated.
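
The free low-volume tier described above amounts to a per-ID daily counter. The sketch below is one simple way to model it, not a description of any existing service: the class name FreeTierMeter, the ID "dev-123", and the choice to count per calendar day are all assumptions. The 1,000-request default is the example figure from the text.

```python
from collections import defaultdict
from datetime import date

FREE_REQUESTS_PER_DAY = 1_000  # the example figure from the text


class FreeTierMeter:
    """Count requests per developer ID per day against a free daily quota."""

    def __init__(self, daily_limit=FREE_REQUESTS_PER_DAY):
        self.daily_limit = daily_limit
        self._counts = defaultdict(int)  # (developer_id, day) -> requests served

    def allow(self, developer_id, day=None):
        """Record one request; return True while the ID is within the free quota."""
        key = (developer_id, day or date.today())
        if self._counts[key] >= self.daily_limit:
            return False  # over quota: this is where negotiated pricing would begin
        self._counts[key] += 1
        return True


# A tiny limit keeps the demonstration short.
meter = FreeTierMeter(daily_limit=3)
day_one = date(2002, 4, 9)
same_day = [meter.allow("dev-123", day_one) for _ in range(4)]
next_day = meter.allow("dev-123", date(2002, 4, 10))
print(same_day, next_day)
```

Keying the counter on both the ID and the day means the quota resets on its own each day, with no cleanup step required for the logic to stay correct.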

Bit by bit, we'll watch the transformation of the Web services wilderness. The first stage, the pioneer stage, is marked by screen scraping and "unauthorized" special purpose interfaces to database-backed Web sites. In the second stage, the Web sites themselves will offer more efficient, XML-based APIs. (This is starting to happen now.) In the third stage, the hodgepodge of individual services will be integrated into a true operating system layer, in which a single vendor (or a few competing vendors) will provide a comprehensive set of APIs that turns the Internet into a huge collection of program-callable components, and integrates those components into applications that are used every day by non-technical people.

Tim O'Reilly is the founder and CEO of O'Reilly Media, Inc., thought by many to be the best computer book publisher in the world, and an activist for open standards. O'Reilly Media also publishes online through the O'Reilly Network and hosts conferences on technology topics, including the O'Reilly Open Source Convention, the O'Reilly Emerging Technology Conference, and the Web 2.0 Conference. Tim's blog, the O'Reilly Radar, "watches the alpha geeks" to determine emerging technology trends, and serves as a platform for advocacy about issues of importance to the technical community. For everything Tim, see tim.oreilly.com.