Cosmos is Microsoft's internal data storage and query system for analyzing enormous amounts (as in petabytes) of data. As the lead Program Manager for Cosmos I can't say too much about it, but what I can do is take a tour of the information that Microsoft has published about Cosmos. So read on if you are interested in the architecture Microsoft uses to store and query petabytes of data and the technical issues that approach raises.
While the Cosmos service is restricted to internal Microsoft use and therefore isn't typically discussed outside the company, over time Microsoft has authorized quite a bit of information to be published about Cosmos. For example, you could go to Microsoft's job site and do a search for "Cosmos". This might lead you to a job posting like this one. Or you might notice the reference in the job posting to something called Dryad (the job posting even includes a URL that, amongst other things, contains a link to a Google tech talk on Dryad). You might also do an Internet search and find a paper on a language called Scope that was accepted by VLDB '08 (which I can give you a sneak peek of here).
The story this material tells is that Cosmos is a platform for storing massive amounts of data across large numbers of machines and then distributing jobs onto the machines that store the data, so the data can be processed in a fully distributed and efficient fashion. The Scope paper in particular provides an architectural overview of the system and breaks Cosmos into three parts: Storage, Execution Management and Query Language.
Storage
The storage layer is append only, so it's ideal for data such as logs (hmm… I hear those Cosmos guys are part of Search…). Data is stored as files. Files can be very large, so file contents are broken into multi-megabyte blocks called extents. Each extent is then placed on a machine in a Cosmos cluster, so someone who wanted to read an entire 'file' would have to touch a large number of machines in order to collect all the extents that make up that file. The Scope paper also points out that extents are replicated, so more than one machine will have a copy of a particular extent (no bonus for figuring out how many copies are typically kept around for each extent). Finally, the paper points out that the metadata about the files (e.g. which extents make up a file and where those extents are stored) is kept in a standalone cluster that uses a quorum protocol, so members of the metadata cluster can fail and the system will still function properly.
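To make that layout concrete, here is a minimal sketch in Python of how a file could be tracked as a list of extents, each replicated across several machines. The names, the extent size and the replication factor are all invented for illustration; none of this reflects Cosmos' actual data structures.

from dataclasses import dataclass, field
from typing import List

EXTENT_SIZE = 64 * 1024 * 1024   # hypothetical multi-megabyte extent size
REPLICA_COUNT = 3                # assumed replication factor, not the real number

@dataclass
class Extent:
    extent_id: int
    length: int
    replicas: List[str] = field(default_factory=list)  # machines holding a copy

@dataclass
class FileMetadata:
    name: str
    extents: List[Extent] = field(default_factory=list)

    def machines_to_read(self) -> set:
        # Reading the whole "file" means touching one replica of every extent,
        # which can spread across a large number of machines.
        return {ext.replicas[0] for ext in self.extents if ext.replicas}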
For me the interesting question about the storage layer is the metadata. Should there really be a central metadata repository? The answer depends on what one needs from the system. In the case of storing logs, losing data is typically just fine. So long as you only lose, say, 0.1% of the data (and the losses are reasonably randomly distributed), nobody is going to care. If you really and truly need to keep every single bit of data (I can't say how many 9's of reliability the current system delivers, but let's just say it's a lot), then the current system makes a ton of sense. But if you can afford to lose data, do you need a centralized metadata facility? Is the centralized guarantee of data consistency necessary? Seeing systems like Amazon's Dynamo, which use a combination of consistent hashing and gossip protocols to keep their data "reasonably" consistent, I can't help but think Cosmos could be made simpler if that central metadata system were done away with and replaced with a self-organizing distributed system. Of course I could be completely wrong. The right way to figure out the answer is to carefully define exactly what Cosmos' business requirements are and then evaluate the cost of centralized versus decentralized solutions in terms of their ability to meet those business requirements at the lowest cost. That is a process the Cosmos team will be going through in the coming months as we deal with the explosive growth of the system.
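For readers who haven't run into the decentralized alternative, here is a toy Python sketch of consistent hashing: extents are mapped onto a ring of machines, so any node can compute placement on its own without consulting a central metadata service. This is purely illustrative of the Dynamo-style idea, not anything Cosmos does.

import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable hash so every node computes the same placement independently.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

class ConsistentHashRing:
    # Toy ring: each machine appears at several points ("virtual nodes").
    def __init__(self, machines, vnodes=100):
        self._ring = sorted((_hash(f"{m}#{i}"), m)
                            for m in machines for i in range(vnodes))
        self._keys = [h for h, _ in self._ring]
        self._machine_count = len(set(machines))

    def replicas_for(self, extent_id: str, n=3):
        # Walk clockwise from the extent's hash, collecting n distinct machines.
        n = min(n, self._machine_count)
        start = bisect.bisect(self._keys, _hash(extent_id))
        chosen, i = [], start
        while len(chosen) < n:
            machine = self._ring[i % len(self._ring)][1]
            if machine not in chosen:
                chosen.append(machine)
            i += 1
        return chosen

ring = ConsistentHashRing(["machine%02d" % i for i in range(20)])
print(ring.replicas_for("search.log:extent:0007"))  # three machines, no central lookup

The trade-off, of course, is exactly the one discussed above: you give up a single authoritative view of where everything lives in exchange for removing the central component.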
Execution Management
The Dryad website provides so much information on Dryad that I'm surprised Microsoft doesn't just release the source code. The idea behind Dryad is that most interesting distributed queries require multiple iterations of map/reduce, typically involving large numbers of machines running for long periods of time. Dryad's job is to efficiently distribute a job onto the cluster and then manage that job, handling failures along the way.
Jobs are submitted to Dryad as Directed Acyclic Graphs (DAGs). Each vertex in the DAG represents a program to run on a single physical machine, and the links between vertices tell Dryad which vertices need to complete before a particular vertex can start. Dryad acts as the job manager. It figures out which machines contain the data the job wants to run on and schedules vertices on those machines. It tracks dependencies: if vertex D can only run once vertices A, B and C are done, then Dryad will track A, B and C and kick off D as soon as they finish. It also figures out the best places to physically locate vertices; for example, vertex D is going to need to pull in data from A, B and C, so it's advantageous for D to be physically close to A, B and C (whenever possible) to minimize latency and network load. Finally, Dryad handles machine failures. When running huge queries for extended periods of time (notice the vague adjectives I used there), machines can and will fail. Dryad detects these failures and recreates the failed vertices so the job can complete.
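The dependency-tracking part of that description can be sketched in a few lines of Python. To be clear, this is an invented toy, far simpler than the real Dryad job manager; it only shows the "kick off D once A, B and C finish" bookkeeping.

from collections import defaultdict

class ToyJobManager:
    # Kicks off a vertex as soon as every vertex it depends on has finished.
    def __init__(self, edges):
        # edges: list of (upstream, downstream) pairs, e.g. ("A", "D")
        self.downstream = defaultdict(list)
        self.pending = defaultdict(int)        # unfinished inputs per vertex
        vertices = set()
        for up, down in edges:
            self.downstream[up].append(down)
            self.pending[down] += 1
            vertices.update((up, down))
        self.ready = [v for v in vertices if self.pending[v] == 0]

    def vertex_finished(self, vertex):
        # Called when a machine reports its vertex completed (or was re-run
        # after a failure); returns the vertices that just became runnable.
        runnable = []
        for down in self.downstream[vertex]:
            self.pending[down] -= 1
            if self.pending[down] == 0:
                runnable.append(down)
        return runnable

jm = ToyJobManager([("A", "D"), ("B", "D"), ("C", "D")])
print(jm.ready)                    # ['A', 'B', 'C'] in some order
jm.vertex_finished("A"); jm.vertex_finished("B")
print(jm.vertex_finished("C"))     # ['D'] - D starts once A, B and C are done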
Dryad faces a design decision similar to storage's – to centralize or decentralize? In Dryad each job has its own job manager instance that runs the job. This creates a number of interesting problems. For example, how should job instances coordinate in order to make the best use of cluster resources? Should there be a central job manager that knows 'everything' about a cluster and can 'optimally' allocate vertices across the multiple jobs simultaneously running in the cluster? Or should there be multiple job managers that communicate with each other about what is going on in the system? What happens when a job manager fails? Should there really be a central point of failure for jobs? Maybe the job manager needs to be distributed? Perhaps every vertex could simultaneously be a member of the DAG and part of the management of the job? As with storage, it all comes down to business requirements. What qualities do we need from the system and what are the best architectural approaches to get those qualities at the lowest cost? As the system scales we are forced to revisit our assumptions. We don't know the answers, but we are going to have a lot of fun in the coming months trying to figure them out.
Query Language
Scope is a SQL-style scripting language that supports inline C#. In Scope all data is made up of rows, rows are made up of columns, and columns are typed. However, the storage layer doesn't care what data it stores, and that data may or may not be in row format. To bridge this gap, the first step in a Scope script is to provide an extractor (Scope comes with default extractors, and users can write their own in C#) that reads data in from the storage layer and turns it into rows made up of columns with the appropriate types. To give the reader some sense of what a Scope script looks like, here is an example from the Scope paper:
SELECT query, COUNT(*) AS count
FROM "search.log" USING LogExtractor
GROUP BY query
HAVING count > 1000
ORDER BY count DESC;
OUTPUT TO "qcount.result";
Scope's power comes from its ability to express complex queries in a familiar language. This lets programmers describe the complete query they want, and because all the steps of the query are in one place, Scope can analyze the entire query as a unit and make global optimizations so the query runs as quickly as possible.
The Scope paper didn't go into much detail on how exactly query optimization is done, so I can't say much there, but what I can say is that we have barely scratched the surface of optimizing Scope scripts. There are numerous areas ripe for investigation, ranging from straightforward query optimization within the language to optimization in the face of outside knowledge (e.g. how do optimizations change when taking account of things like machine architecture, current system load or data location?). This is an area of active research that we expect to be able to mine for quite some time. There are also two other basic questions we expect to tackle – is Scope done? And is Scope enough? The first asks what changes or enhancements Scope needs to meet our customers' needs. The second asks whether we should support languages other than Scope. We really don't know the answer to either question at the moment, but as with the previous questions we intend to find out.
Conclusion
The previous architecture isn't actually complete. There is at least one more major component that isn't mentioned, as well as major architectural issues I haven't yet seen discussed in the public material. Since I can't expand on what has been published, I won't go into more detail. But what I can say is that Cosmos is a very successful system that is growing at a breakneck pace, both in the number of customers we support and the size of the clusters we run. Every day brings numerous challenges ranging from system design to service management. Working on Cosmos is very much about living in interesting times. So if you would like to live an interesting life, please check out the jobs available on the Cosmos team.
Sounds like COSMOS is very interesting. We've been pushing CCR, DSS and MPI pretty hard in scientific computing. On multi-core you really need to be careful about mapping your data so you get good cache hit rates. Combining multi-core and multi-machine is an interesting challenge.
How do you select a random set of rows from a particular table with a Scope script?
In SQL we have the rand() function, but there is no such thing in Scope. So how do you select random rows?
In Scope there were a couple of ways of doing that if I remember correctly, though it's been many years. One was that you could choose to run the script on less than the complete set of data and be randomly assigned to different machines. Another was to stream the data from all machines and randomly throw away rows. Given how common the random-sampling scenario is, I wouldn't be surprised if they now have a dedicated operator. Heck, they probably always did, but it's just been too long so I don't remember.
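For what it's worth, the "randomly throw away rows" approach is just Bernoulli sampling. Here is a sketch in Python rather than Scope (whatever sampling operator Scope has or had is a separate question I can't answer):

import random

def sample_rows(rows, keep_probability=0.01, seed=None):
    # Keep each row independently with the given probability.
    rng = random.Random(seed)
    for row in rows:
        if rng.random() < keep_probability:
            yield row

# e.g. roughly 1% of the input survives:
# for row in sample_rows(open("search.log"), keep_probability=0.01): ...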
Is the Cosmos DB offered by Azure the same as the internal Cosmos storage/query execution service you are talking about in the article?