Like a lot of people in the computer business, I am intrigued by the impact of RFID tags and other sensor technology on IT. But my interest is fairly narrow: I'm curious about what kind of workloads these technologies will impose on corporate data centres. To understand this, I want to get a handle on the numbers: the sensor event rates (both the flow rate and the burstiness), what kind of intermediate aggregation and filtering will be performed, and what the resulting data centre workload will look like.
Sounds straightforward, doesn't it? Not that it's a simple problem, but we can construct some scenarios, assign some numbers, plug them into a queueing model, see how it looks. (Capacity Planning for Web Services: Metrics, Models and Methods includes some simple models.) The problem that I've found (repeatedly) is keeping people on topic whenever I try to discuss it.
"But we can't ignore privacy issues." "Centralized data centers are passé - let's project the data centre to the edge." (I love that - what on earth does it mean?) "Data centres - not data centre! Federation!!" Or we replace inventory control tags with hospital patient tags, in which case discussions of domain-specific issues rapidly crowd out everything else.
The general problem, which I've observed in various contexts, is that it's increasingly difficult to keep people focussed on simple problems. Of course all of the issues that people raise are real, but in most cases they are either irrelevant or simply complicate the problem in incalculable ways. We need to focus on the simplified versions of the problems in order to use them as tools to analyze alternative architectures.
My dream is that one day someone will listen to my scenario and immediately propose a simplification, in order to make it more computationally tractable. Most of the systems that we've dreamed up over the last twenty years are far too complicated, and the analysis of the whole becomes even more problematic if we load even more complex application patterns on top of them.
Posted by geoff2 at May 22, 2004 01:29 AM
+++My dream is that one day someone will listen to my scenario and immediately propose a simplification, in order to make it more computationally tractable.+++
A potential social solution:
"Avoid and ignore windbags who have a history of not delivering goods; fluffing people with verbiage about 'Going For The 80% Solution First' and talking for about 8..9 seconds (overwriting the audience's short-term memory) helps guide them/the meeting back on course."
8-)
Of course this technique has limitations; insular "surgical-team" development projects - those which have deflected inane "contribution" - are typically successful *until* they hit mandatory "process" people (legal clearance, etc) whose behaviour is predicated on the audit trail of urinary odour that is associated with a given project.
Any project that comes out of the wilds can end up shunned, stunted or undermined, unless enough people have pissed on it - unless you are a big name, of course. Sometimes you have to resort to outright subterfuge to prevent a project being killed by people who failed to attain kudos from it.
Tragic, eh?
ObBlastFromPast: http://www.wired.com/wired/archive/5.07/rants.html - item 6; not all those who interfere are droids; some also are dinosaurs, and others just plain clueless. Some, however, are insightful; would that we could tell which before they open their mouths...
Posted by: alecm at May 22, 2004 02:47 PM
As you say, how can we tell the difference in advance? In part, I think the problem is a reluctance to say "I don't know," a reluctance that seems to diminish with age. (At least this seems to be true in the scientific and technical domains; I don't know about politics...)
Metaphor is useful, even essential. The Network Is The Computer * still works for me, powerfully so, but anyone who uses it should be prepared to follow up immediately with a concrete, non-metaphorical explanation.
-----------------------
* Rob Gingell's lovely question, "If the network is the computer, what is the computer that is the network?" is a great way of starting a discussion about varieties of computer systems architectures.
+++Rob Gingell's lovely question, "If the network is the computer, what is the computer that is the network?" is a great way of starting a discussion about varieties of computer systems architectures.+++
Mmmm... I remember that great quote about USENET from years ago, describing various USENET newsgroups thusly:
comp.compilers: computer scientists discuss compiler theory
sci.genetics: computer scientists discuss genetics
sci.astro: computer scientists discuss astronomy
alt.sex: computer scientists theorise about sex
soc.religion.misc: computer scientists discuss religion
...or somesuch, the point (if it needs spelling out) being that USENET was a world full of opinionated people, chiefly exchanging heated views on topics in which they are not expert.
What I find bizarre is that in the engineering business this syndrome persists in microcosm and is not widely recognised, leading to (for instance) reference architectures which aren't, high availability which isn't (or: is not as much as it could be) - and (worse) scalable solutions which don't.
Eg: There are a handful of people I know - across my experience of several companies - whom I would implicitly trust to lay out an appropriate system-complementary Layer2 fabric for (say) a stock trader company; none of them are particularly senior, and those who become senior seem rapidly to lose their ability/edge...
...which (I suppose) brings us back to the "dinosaur versus insightful" theme, and the heretical suggestion that "might" does not imply "right"...
Posted by: alecm at May 22, 2004 06:01 PM
The role of the senior people who have lost their edge should be to
(1) humbly acknowledge that fact, and
(2) exploit the status which seniority confers to make life easier for those who have not yet lost their edge
There just isn't a real *problem* to be solved here. Let me explain:
Assume a 96-bit EPC tag and that each item is scanned as it makes a transition between the warehouse and the store-front. Each scan we'll call an 'event'. We'll assume, for a worst-case scenario, that we'll record each and every 'event' for all time. For each 'event' you want to store 12 bytes for the EPC data and another 4 bytes for the time at which it was read (unix time in seconds) and, just to be generous, another 4 bytes for the location at which it was read. That's 20 bytes per 'event'. Now say you are running a large operation with warehouses, distribution centers and store fronts and that, for the sake of argument, an item generates 6 'events' as it travels (arrive warehouse, palletization, leave warehouse, arrive distribution center, leave distribution center, arrive store). Let's further assume that you push 100,000,000 items through your supply chain yearly. That's 100,000,000 * 20 bytes, or 2 GB.
2GB.
I have that much free space on my laptop.
I can buy a server with enough memory to hold all of that in RAM for less than $4,000.
This is what I mean when I say there isn't a *problem* here to be solved.
Ok, so I didn't stay on topic, which is of course, the topic of staying on topic. Sorry.
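For anyone who wants to plug in their own numbers, here is a minimal sketch of the storage estimate above. The function and constant names are made up for illustration; the byte sizes, item count and event count are the ones stated in the comment. Note that 100,000,000 * 20 bytes is the volume for one event per item; counting all six events per item under the same assumptions gives roughly 12 GB.

```python
# Per-event record layout as described in the comment above.
EPC_BYTES = 12        # 96-bit EPC
TIMESTAMP_BYTES = 4   # Unix time in seconds
LOCATION_BYTES = 4    # reader location identifier
BYTES_PER_EVENT = EPC_BYTES + TIMESTAMP_BYTES + LOCATION_BYTES  # 20 bytes


def yearly_bytes(items_per_year: int, events_per_item: int) -> int:
    """Raw storage needed to keep every event for one year (no indexes, no overhead)."""
    return items_per_year * events_per_item * BYTES_PER_EVENT


ITEMS_PER_YEAR = 100_000_000

one_event_per_item = yearly_bytes(ITEMS_PER_YEAR, 1)
six_events_per_item = yearly_bytes(ITEMS_PER_YEAR, 6)

print(f"one event per item:  {one_event_per_item / 1e9:.0f} GB")
print(f"six events per item: {six_events_per_item / 1e9:.0f} GB")
```

This counts raw payload only; database row overhead and indexes would add to it.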
You stayed nicely on topic. But sheer data volume isn't the only issue. What do we want to do with those events? Assume for the moment that each one is individually delivered to some kind of inventory control application. How bursty is the traffic; how many events per second at peak? And what's the response required from the application? What's the maximum allowable lag between an RFID tag being scanned and the corresponding database record being updated? What's the processing time for each event? That will let me calculate the maximum number of concurrent events that I'll need to handle.
See where I'm going?
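One way to make that last calculation concrete is Little's law (average number in the system = arrival rate * time in the system). The sketch below is only an illustration: the peak rate and processing time are invented placeholder values, not measurements from any RFID deployment.

```python
import math


def concurrent_events(peak_events_per_sec: float, processing_time_sec: float) -> float:
    """Little's law: average number of events in flight = arrival rate * time per event."""
    return peak_events_per_sec * processing_time_sec


def workers_needed(peak_events_per_sec: float, processing_time_sec: float) -> int:
    """Smallest whole number of concurrent handlers that keeps up with the peak rate."""
    return math.ceil(concurrent_events(peak_events_per_sec, processing_time_sec))


# Hypothetical inputs: 200 events/s at peak, 0.25 s to process each event.
PEAK_RATE = 200.0
PROCESSING_TIME = 0.25

in_flight = concurrent_events(PEAK_RATE, PROCESSING_TIME)
print(f"events in flight at peak: {in_flight}")
print(f"handlers needed: {workers_needed(PEAK_RATE, PROCESSING_TIME)}")
```

This is an average at a sustained peak; it says nothing about queueing delay, so the allowable lag still has to be checked against a proper queueing model.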
Posted by: Geoff Arnold at May 24, 2004 12:29 AM
An excellent topic, making a nice wrapper for another interesting topic.
I think answers to your RFID questions will be application-specific. If you're gathering RFID events from shoppers exiting a store, you may be able to take some samples and work with a nice Poisson distribution. But, if you're capturing RFID events as a forklift drives by with a pallet of boxes, then your burstiness is determined by the number of RFIDed items on the forklift, right? It seems like the same applies to the question of latency. How long can it take to update the database? A long time (relatively speaking), if you're just tracking stuff, but a short time if the forklifts or customers have to wait until you do something with the data before they may proceed. In many cases, you should be able to use statistics from the current POS systems or inventory management systems that theoretically handle the same volume albeit without the convenience of RFID systems.
As to keeping people on topic, I'm afraid there's no hope. You'll just have to manage as best you can by being very exclusive with whom you spend time face-to-face. (And pay attention to alecm's precaution about including the politically essential, yet non-contributing people at the appropriate time.)
Posted by: Dave Lemen at May 24, 2004 10:13 AM
Yes, I do see where you are going.
>> How bursty is the traffic; how many events per second at peak?
A lot of this depends on your deployment. For example, how many items per pallet are tagged, and whether you are reading the whole pallet as it goes through a portal or just a single pallet tag. Also, you may have handheld readers that will send a burst of updates when they upload their data.
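To show how much the deployment choice changes the burst size, here is a rough sketch. All the numbers (48 tagged items per pallet, a 2-second portal read window) are hypothetical and only there to make the comparison visible.

```python
def peak_rate(tags_per_read: int, read_window_sec: float) -> float:
    """Events per second while one pallet passes through a portal."""
    return tags_per_read / read_window_sec


# Hypothetical deployment parameters.
ITEMS_PER_PALLET = 48
READ_WINDOW_SEC = 2.0

whole_pallet = peak_rate(ITEMS_PER_PALLET, READ_WINDOW_SEC)   # every item tag is read
pallet_tag_only = peak_rate(1, READ_WINDOW_SEC)               # only the pallet tag is read

print(f"item-level reads:   {whole_pallet} events/s per portal")
print(f"pallet-level reads: {pallet_tag_only} events/s per portal")
```

A handheld reader uploading a stored batch behaves like the item-level case, with the batch size in place of the pallet size and the upload time in place of the read window.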
The rest of your questions are very specific to what you are doing with the tag data. Good luck getting that nailed down :)
Edge brokering is something that requires some interesting flexibility in what you do. You need to be able to decouple inbound and outbound events, and you need to be able to throttle or defer forwarding until the system can accept the data.
The issue of data timeliness might be interesting too. You might have some places where you just note the presence of the tag, but not the time, and don't force timely delivery. In other places you might demand timely delivery. The speed of a production line (conveyors and such) might dictate a range of time intervals that you could use to show the performance of the line.
Some locations might forward information to realtime monitoring systems while others don't so that the inbound event rate for the monitoring system is managed explicitly.
I've been building systems with these kinds of attributes, but for gathering different kinds of data, for many years. A 2-way machine running with 1GB of memory has no problem handling 70 events per second that turn into some 350,000 database transactions per hour. 90% of those transactions are 'updates'; the other 10% are accumulating inserts. The broker talks to 3 different Oracle databases and a Sybase database simultaneously, as well as a status monitoring system.
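A quick back-of-the-envelope on those figures, assuming (this is an assumption, not something stated above) that 70 events per second is a sustained rate over the hour rather than a peak:

```python
EVENTS_PER_SEC = 70
TRANSACTIONS_PER_HOUR = 350_000
SECONDS_PER_HOUR = 3600

# Events per hour if the 70 events/s rate were sustained for the full hour.
events_per_hour = EVENTS_PER_SEC * SECONDS_PER_HOUR

# Average database transactions generated per inbound event, and per second.
tx_per_event = TRANSACTIONS_PER_HOUR / events_per_hour
tx_per_sec = TRANSACTIONS_PER_HOUR / SECONDS_PER_HOUR

# The stated 90% / 10% split between updates and accumulating inserts.
updates_per_hour = TRANSACTIONS_PER_HOUR * 90 // 100
inserts_per_hour = TRANSACTIONS_PER_HOUR - updates_per_hour

print(f"events per hour:        {events_per_hour}")
print(f"transactions per event: {tx_per_event:.2f}")
print(f"transactions per sec:   {tx_per_sec:.1f}")
print(f"updates / inserts:      {updates_per_hour} / {inserts_per_hour}")
```

If 70 events per second is actually a peak, the transactions-per-event ratio would be higher than this sketch suggests.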
My system design utilizes a queue and associated pool for each type of outbound transaction. The Oracle systems (and status monitoring system) have 90% of the load going to them. There might be between 25 and 100 simultaneous TNS sessions to each database. The database system needs to be configured to handle this kind of load. We've had issues with systems that were not.
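As an illustration of the queue-plus-pool idea, here is a minimal sketch using Python's standard library. It is not the actual broker described above: the class name, the handler functions and the queue sizes are all invented, and the "database" is just an in-memory list.

```python
import queue
import threading


class OutboundChannel:
    """One bounded queue plus a small worker pool for one outbound transaction type."""

    def __init__(self, name, handler, workers=2, maxsize=1000):
        self.name = name
        self.handler = handler
        self.queue = queue.Queue(maxsize=maxsize)
        for _ in range(workers):
            threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            event = self.queue.get()
            try:
                self.handler(event)  # e.g. one database transaction
            finally:
                self.queue.task_done()

    def submit(self, event, timeout=None):
        # Blocks when the queue is full, which throttles the inbound side.
        self.queue.put(event, timeout=timeout)

    def drain(self):
        # Wait until every queued event has been handled.
        self.queue.join()


# Stand-in for the real databases: collect what each channel handled.
results = {"update": [], "insert": []}
lock = threading.Lock()


def make_handler(kind):
    def handler(event):
        with lock:
            results[kind].append(event)
    return handler


channels = {kind: OutboundChannel(kind, make_handler(kind)) for kind in results}

# Roughly the 90% update / 10% insert mix mentioned earlier.
for i in range(10):
    kind = "insert" if i % 10 == 0 else "update"
    channels[kind].submit({"tag": i})

for channel in channels.values():
    channel.drain()

print(len(results["update"]), "updates,", len(results["insert"]), "inserts")
```

Because each transaction type has its own queue, a slow or stalled destination only backs up its own channel. A real broker would also need error handling and retries in the worker loop.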
If inbound data is decoupled from outbound data, you can imagine that when the outbound data flow is delayed or interrupted, data will accumulate quickly. You need a strategy for timeliness-related data so that it expires quickly and doesn't accumulate when you are not interested in anything more than the current value.
For RFID, if you had zone numbers, the status monitoring system might only care about the current zone. The historical log might need to contain all zone-time combinations.
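A small sketch of that split, with invented names: the buffer feeding the status monitor keeps only the newest zone per tag, so it stays bounded by the number of tags even if delivery stalls, while the historical log keeps every zone-time pair.

```python
class CoalescingBuffer:
    """Pending updates for a consumer that only cares about the current value per tag."""

    def __init__(self):
        self._latest = {}  # tag -> (zone, timestamp)

    def put(self, tag, zone, timestamp):
        previous = self._latest.get(tag)
        # Keep the newest reading; older pending readings for this tag are dropped.
        if previous is None or timestamp >= previous[1]:
            self._latest[tag] = (zone, timestamp)

    def drain(self):
        """Hand over everything pending and start again with an empty buffer."""
        pending, self._latest = self._latest, {}
        return pending

    def __len__(self):
        return len(self._latest)


status_buffer = CoalescingBuffer()  # feeds the status monitoring system
history_log = []                    # feeds the historical log: every zone-time pair

# Hypothetical readings: (tag, zone, unix timestamp).
events = [("tagA", 1, 100), ("tagB", 1, 101), ("tagA", 2, 105), ("tagA", 3, 110)]

for tag, zone, timestamp in events:
    history_log.append((tag, zone, timestamp))
    status_buffer.put(tag, zone, timestamp)

pending = status_buffer.drain()
print(len(history_log), "history rows;", pending)
```

Here four inbound events leave four history rows but only two pending status updates, one per tag.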
There are some numbers for you. I don't know how significant they might be for you, but there they are.
Posted by: Gregg Wonderly at May 25, 2004 09:14 AM
In my field, domain-based OOA, we have a term for things that don't belong in a domain model. We call them 'domain pollution'. You could try adopting the idea and pointing it out when people drag in concepts from other domains. If the domain pollution meme catches on we could all benefit.
Posted by: Neil at May 27, 2004 07:41 AM