**Updated section "Will it work the other way around" on 25.07.2017**
Well, not how the data is stored internally, but rather how CosmosDB seems to handle data that is stored and accessed via the Graph, Table or MongoDB API. Each of the collections/graphs created with the newly available APIs can still be accessed with the "native" DocumentDB SQL API. To date it remains a mystery to me whether the CosmosDB team just uses the "classic" API to provide all these alternate APIs on top of it, or whether some magic happens behind the scenes.
Please note that while accessing CosmosDB Graph/Table/MongoDB data with DocumentDB SQL is quite interesting, it is not something to use in production and probably not supported by Microsoft. Microsoft might change the way this data is stored in CosmosDB at any time, and your code might break.
The first three sections describe the Graph API, Table API and MongoDB API. The fourth section explains how you can change the visual portal experience in Azure, and in the fifth section I try to create documents with the DocumentDB API and query them with the Graph API.
To illustrate this with the Graph API, I created a simple graph by executing a series of commands inside the Graph Explorer. For this article I only use two vertices (one with an additional property) and one edge which connects both vertices (see the upper two vertices in the image):
- g.addV('place').property('name','Elronds Haus').property('FoodRating','Delicious');
- g.V().has('name','Rivendell').addE('path').to(V().has('name','Elronds Haus')).property('weight',1.0).property('level','difficult');
In parallel I use Azure DocumentDB Studio to connect to my graph with the regular DocumentDB SQL API. If we select "Elronds Haus" in Graph Explorer, we can see the vertex ID and the properties of this vertex.
In Azure DocDB Studio we can now issue a query on the collection to reveal the vertex for "Elronds Haus". To reduce complexity I removed internal fields like _ts, _etag, _self, … in the images.
- select * from c where c.id = "ae5668e1-0e29-4412-b3ea-a84b2eb68104"
The id and label of the vertex are stored as normal JSON fields, but the properties are stored as a combination of a _value field and a unique property id field. Interestingly, the edge stores its properties differently and in a way that is easier to query with DocDB SQL.
While we can find all paths with a weight of 1 easily with
- select * from c where c.label = "path" and c.weight = 1
we need to issue a more complex query to find a specific vertex by a property value. I'm not sure why they decided to store these properties as an array, but maybe this is required for some graph functionality I am not aware of yet.
- SELECT * FROM c WHERE c.label = "place" AND c.name._value = "Elronds Haus"
- SELECT v FROM v JOIN c IN v.name WHERE v.label = "place" AND c._value = "Elronds Haus"
The edges themselves can easily be discovered by querying the "_isEdge" field. The vertices linked by an edge are stored in the following fields:
- _sink, _sinkLabel … id and label of the IN-vertex
- _vertexId, _vertexLabel … id and label of the OUT-vertex
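Putting these observations together, the stored documents can be sketched in plain Python. This is only a rough sketch: the property-id values and the edge id are invented for illustration, and the real internal format may change at any time:

```python
# Hypothetical sketch of how a graph vertex and edge look when read
# through the DocumentDB SQL API (property ids and edge id invented).
vertex = {
    "id": "ae5668e1-0e29-4412-b3ea-a84b2eb68104",
    "label": "place",
    # each vertex property is an array of {"_value": ..., "id": ...} entries
    "name": [{"_value": "Elronds Haus", "id": "prop-1"}],
    "FoodRating": [{"_value": "Delicious", "id": "prop-2"}],
}

edge = {
    "id": "edge-1",                  # invented
    "label": "path",
    "_isEdge": True,
    "_vertexId": "vertex-rivendell", # id of the OUT-vertex (invented)
    "_vertexLabel": "place",
    "_sink": vertex["id"],           # id of the IN-vertex
    "_sinkLabel": "place",
    # edge properties are stored flat, hence easy to query
    "weight": 1.0,
    "level": "difficult",
}

# The JOIN query from above, re-enacted in plain Python: find documents
# whose property array contains a matching _value.
def find_by_property(docs, label, prop, value):
    return [d for d in docs
            if d.get("label") == label
            and any(p["_value"] == value for p in d.get(prop, []))]

hits = find_by_property([vertex, edge], "place", "name", "Elronds Haus")
print(hits[0]["id"])  # ae5668e1-0e29-4412-b3ea-a84b2eb68104
```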
In this video Rimma Nehme (@rimmanehme) mentioned at 37:20 that SQL extensions will be available at a later point in time, enabling you to query the graph with SQL instead of Gremlin.
In this case I use the new Table API of CosmosDB. While you keep the same namespace and the same API in .NET, you need to replace the old Storage NuGet package with a new one to make it work.
I created three instances of PersonEntity that derive from TableEntity, as you would expect with the Storage Table API. They store the race of the person as PartitionKey and a combination of first and last name as RowKey. As soon as you create a new table with the CreateIfNotExistsAsync() method, a new database "TablesDB" is created in CosmosDB with a collection named after your table.
Keep in mind that the API will create a new collection for every table! From a cost perspective it might be better to store the various kinds of documents in a single collection!
As soon as we add the entities to the table, we can see them in DocumentDB Studio. Because we used a combination of first and last name as RowKey, we can see that the RowKey represents the "id" field of the CosmosDB entry.
While you can now query more properties than just RowKey and PartitionKey, always use the PartitionKey to avoid costly partition scans! You could do a LINQ query like this:
Now let's load one document in DocumentDB Studio and examine how the data is stored for the Table API. Again I removed all the internal properties like _ts, _rid, …
What instantly catches the eye is the use of the $ sign, which will cause some trouble when constructing DocDB SQL statements, as we will see. Like in the Graph API, we have multiple fields defining a property. I find this naming more sensible, as it reduces the size of the document ("$v" vs. "_value"). The partition key is stored as $pk.
CosmosDB stores the type of each property in its $t field, where 2 = string, 16 = integer and 1 = double.
To query for Bilbo we need to escape the $ character in our query:
- SELECT * FROM p WHERE p['FirstName']['$v'] = 'Bilbo' AND p['$pk'] = 'Hobbit'
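Putting the pieces together, such a Table API entity as seen through the DocumentDB SQL API can be sketched like this. All field values are illustrative (the exact RowKey composition and the Age property are invented); only the $v/$t/$pk shape reflects what DocDB Studio shows:

```python
# Hypothetical sketch of a Table API entity read via the DocumentDB SQL API:
# $v holds the value, $t the type code (2 = string, 16 = integer, 1 = double).
bilbo = {
    "id": "BagginsBilbo",            # RowKey becomes the document id (illustrative)
    "$pk": "Hobbit",                 # PartitionKey is stored as $pk
    "FirstName": {"$t": 2, "$v": "Bilbo"},
    "LastName":  {"$t": 2, "$v": "Baggins"},
    "Age":       {"$t": 16, "$v": 111},   # invented extra property
}

# The escaped SQL query from above, re-enacted in plain Python:
def query(docs, first_name, pk):
    return [d for d in docs
            if d["FirstName"]["$v"] == first_name and d["$pk"] == pk]

print(query([bilbo], "Bilbo", "Hobbit")[0]["id"])  # BagginsBilbo
```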
To query the document with LINQ you need to build the entity like in the image. Then you create a typed LINQ query:
- var queryable2 = client.CreateDocumentQuery&lt;PersonT&gt;(collection.SelfLink, feedOptions).Where(doc => doc.FirstName.v == "Bilbo");
First we create two simple entries with the native MongoDB client. The second document also uses an ISODate type for date/time. You can see that MongoDB also stores the ID as an ObjectId type.
There seem to be some issues with other BSON types though. For example, there is an article mentioning some compatibility issues between Sitecore and the CosmosDB MongoDB API, and I believe it is related to the yet unsupported BSON type BsonInt32. As far as I have seen (I lost the article somewhere on the web), currently only ObjectId and ISODate are supported types in the CosmosDB MongoDB API.
Again, if we examine those two documents in Azure DocumentDB Studio, we can see that the id is stored twice: first as "id" and second as ["_id"]["$oid"], another way to declare the data type of a field. The field of type ISODate is stored as an epoch value.
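A rough sketch of what that looks like, with the ObjectId hex string, the name and the date invented for illustration (BSON dates are milliseconds since the epoch, which matches the epoch values I see):

```python
from datetime import datetime, timezone

# Hypothetical sketch of a MongoDB API document read via the DocumentDB
# SQL API. The id appears twice: as plain "id" and typed as "_id"."$oid".
doc = {
    "id": "59702b45a6cbc43bf4e5e5a1",           # invented ObjectId hex
    "_id": {"$oid": "59702b45a6cbc43bf4e5e5a1"},
    "name": "Frodo",                             # invented
    # an ISODate field is stored as an epoch value (milliseconds)
    "birthday": 1500940800000,
}

# Converting the stored epoch value back into a date:
birthday = datetime.fromtimestamp(doc["birthday"] / 1000, tz=timezone.utc)
print(birthday.isoformat())  # 2017-07-25T00:00:00+00:00
```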
Switching the portal experience
This works with all APIs except MongoDB. The reason for this might be legacy. If you take a look at the ARM templates used to create Graph API, Table API, Mongo API and DocumentDB API accounts, you will notice that while a CosmosDB with Mongo API has its "kind" property set to "MongoDB", all others have it set to "GlobalDocumentDB".
All other APIs rely on the tags collection within the resource definition. So the way to change the Table experience to the Graph experience is to remove the "defaultExperience:Table" tag, add a new "defaultExperience:Graph" tag in the portal and reload the page.
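A stripped-down sketch of the relevant part of such an ARM resource (everything except the "kind" property and the experience tag is omitted):

```json
{
  "type": "Microsoft.DocumentDB/databaseAccounts",
  "name": "demobuilddocs",
  "kind": "GlobalDocumentDB",
  "tags": {
    "defaultExperience": "Graph"
  }
}
```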
Will it work the other way around? **updated 25.07.2017**
Now that we have figured that out, I wonder if I can take a "normal" DocumentDB API collection and fill it with data that looks like what we created with the Graph API, then change the experience to Graph and see if we can access the data via Gremlin.
For that purpose I have set up a brand-new CosmosDB account with DocumentDB API, "demobuilddocs", in the portal. I am again using DocumentDB Studio to create a new collection and add three documents to it (you can download the documents here!).
Expressed in gremlin this would be (25.07.2017: replaced ö,ü with oe and ue):
- g.addV('place').property('name','Gasthaus Zum gruenen Drachen');
- g.V().has('name','Hobbithoehle').addE('path').to(V().has('name','Gasthaus Zum gruenen Drachen')).property('weight',2.0);
In DocumentDB Studio I create a new single-partitioned collection "democol" with 400 RU/s for "demobuilddocs". Then I create the three documents with CTRL+N (Create Document in the context menu of the collection). That's it.
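The three documents mimic the storage format observed in the Graph API section. Roughly like this (the ids and property ids are chosen by me; whether this is enough for full Gremlin functionality is exactly what we are about to test):

```json
[
  {
    "id": "1",
    "label": "place",
    "name": [ { "_value": "Hobbithoehle", "id": "1-prop-1" } ]
  },
  {
    "id": "2",
    "label": "place",
    "name": [ { "_value": "Gasthaus Zum gruenen Drachen", "id": "2-prop-1" } ]
  },
  {
    "id": "3",
    "label": "path",
    "_isEdge": true,
    "_vertexId": "1",
    "_vertexLabel": "place",
    "_sink": "2",
    "_sinkLabel": "place",
    "weight": 2.0
  }
]
```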
Finally we will change the defaultExperience tag for “demobuilddocs” in the portal to “Graph”:
Refresh the portal page and navigate to the Data Explorer (Preview). Et voilà:
Next I try that with GraphExplorer, and it all works fine as well.
Now let's try that with the cloud bash and the Gremlin client. (Spoiler: It will break! No, it won't; it would only break if we used ö, ü, … in the JSON.) First I copy my hobbit.yaml connection configuration to doccb.yaml and edit it with nano to point to the graph URL "demobuilddocs.graphs.azure.com". Please note that GraphExplorer, which uses the .NET CosmosDB SDK, connects to "demobuilddocs.documents.azure.com". Then I add the path to my collection and the primary key as password (I have truncated that for security reasons).
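My doccb.yaml then looks roughly like this. The database name and the serializer settings are assumptions based on the usual TinkerPop remote configuration; the primary key is a placeholder:

```yaml
hosts: [demobuilddocs.graphs.azure.com]
port: 443
username: /dbs/demodb/colls/democol   # path to the collection; database name assumed
password: <primary-key>               # truncated for security reasons
connectionPool: {
  enableSsl: true
}
serializer: {
  className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerV1d0,
  config: { serializeResultToString: true }
}
```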
Now I run my Gremlin client (I installed it from a tarball in my cloud bash) and connect to my graph database:
And let's query for edges to see how that works.
Now when we read the vertices we will get the correct result.
If, however, our JSON contains mutated vowels (umlauts), the decode fails with a DecoderException, complaining about some strange missing close marker for OBJECT:
A .NET implementation like GraphExplorer, by the way, can handle the mutated vowels without any problem. But you might want to watch out for this error in a Java client.
If you examine the exception, you can see that the resulting JSON is missing some closing brackets. This will need some further inspection, but I am closing out my article for today. Stay tuned, … P.S.: If you create a new vertex in the Gremlin console, you can query it with no problems. But if you then replace that document by just changing the value of a property in DocumentDB Studio, you get the same error when you query the vertex with the Gremlin console. Obviously Microsoft is storing more data than meets the eye. On the other hand, it is interesting to see the .NET client SDK work.
Keep digging and have a great weekend
3 thoughts on "Digging into CosmosDB storage"
Azure Cosmos DB team member here:
Thank you for the article, AndiP.
Currently there are some type system differences (JSON has only one number type), and semantics differences (in Gremlin, all entities are multi-valued) which means that we have to decorate the actual content with additional metadata to preserve meaning. We are working on changes in the Azure Cosmos DB service that will improve the experience of cross-API access. With the changes, we will eliminate/change the format so it looks more natural when accessed to and from any API. Therefore (like you correctly guessed in the blog), the formats here are internal and subject to change with upcoming service updates.
We’re happy to provide more details and help anyone interested in performing cross-API data access right away (until these changes are live). Please contact us at AskCosmosDB@microsoft.com for details.
Pingback: Couchbase Server vs CosmosDB | The Couchbase Blog
[…] can still try to access some models through DocumentDB API. that was what gave me some hints of how CosmosDB uses internally a decorated JSON format to store its […]
Microsoft hinted that they are doing more in the background than just decorating the JSON format (indexing optimizations). Therefore, and of course because the "internal representation" might change, one should avoid accessing MongoDB API/Gremlin-based instances with the DocumentDB API.