
Efficiently Reading Multiple Document Types from Cosmos DB (.NET SDK) - From Review

It's astonishing the amount of strange advice given on the internet for integrating and using Cosmos DB.
Note: This article's content was expanded out a little to help with code readability.

Containers, Partition Keys, Point Reads, Queries, Request Units. That’s the life of Cosmos DB and it’s pretty great until you need to know something.

If you do end up wanting to know something specific about Cosmos DB then like every other developer you’re probably going to Google it. Then you’ll find yourself on Stack Overflow reading misguided questions and receiving questionable answers. Then you’ll regret not being able to easily navigate Microsoft’s mostly-excellent, but sometimes-lacking, documentation.

Here I’ll take you through a few key things to know about Cosmos DB and, depending on the Google Overlords doing their job, hopefully answer a few questions that people are searching for.

It’s pretty long, so here’s a TLDR: when you’re using GetItemQueryIterator to read and parse documents from a single Container holding multiple types, don’t use dynamic or JsonConvert.DeserializeObject(). Each result is a JObject, and you can simply use ToObject&lt;T&gt;()!

This was cross-posted on dev.to if you’re in their community and/or want to leave a comment or question.


A little backstory first as to what prompted this article. At [insert our next venture name] we use Cosmos for some parts of our architecture that heavily benefit from NoSQL and efficient operations, including but not limited to auditing, configurations, highly available internal services, event streams and basic key-value storage requirements.

Late last night I was reviewing a critical path in one of our services that uses Cosmos as a backing store and noticed that it had undergone several fundamental changes to how it worked over the last couple of weeks. This code was storing and retrieving multiple different types of documents in a single Container (which is a perfectly valid, if not encouraged, practice).

The code was responsible for reading different document types from a Container in Cosmos and correctly converting them to their actual types.

Here’s what the different versions were:

  1. Only get a single item at a time, with the type defined up front. Simple, effective, fast, strongly typed. This used ReadItemAsync<T> from the Cosmos SDK.
  2. Get all of the documents and return them to the API as object to be serialized back to JSON. This uses a Query Iterator with a query like SELECT * FROM c, with a partition key, as do all of the versions that follow (see the sketch below).
  3. Use dynamic and Newtonsoft’s JsonConvert to deserialize the result of .ToString().
  4. Use dynamic and System.Text.Json’s JsonSerializer to deserialize the result of .ToString().
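
For context, version 1’s point read and the shared read path for versions 2 to 4 look roughly like this (a sketch; the container, partitionKey and document id are assumed to be in scope and are illustrative):

// Version 1: a point read with the type known up front.
Type1 single = (await container.ReadItemAsync<Type1>("doc-id", partitionKey)).Resource;

// Versions 2-4: load every document in the partition as dynamic via a query iterator.
var query = new QueryDefinition("SELECT * FROM c");
var iterator = container.GetItemQueryIterator<dynamic>(query,
    requestOptions: new QueryRequestOptions { PartitionKey = partitionKey });

var dynamicObjects = new List<dynamic>();
while (iterator.HasMoreResults)
{
    dynamicObjects.AddRange(await iterator.ReadNextAsync());
}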

They’re all acceptable ways of doing this yet I wasn’t happy with any of them.

I want to start by saying that I’m a huge fan of using dynamic in only 2 scenarios. The first is to wind up colleagues and exclaim “DYNAMIC ALL THE WAY” just for fun. The second is when you truly don’t have a clue what’s going on, something could literally be anything, and you need to be flexible; generally when communicating with third parties or other languages that aren’t typed, where we don’t really care too much.

When I saw dynamic in actual production code that we’re running, in a path that will be called many times per second, I was a little confused and dug deeper.

Let’s understand where this came from. Here are two notable Stack Overflow questions that are doing what we’re doing:

There are more of these that all basically say the same thing; these are just the first two I found whilst writing.

These answers both do a curious thing: take the dynamic object, check the type property on the document, and then JsonConvert it to the type we want. Fair enough.

But is it? Let’s take a look.

Here I’ve created a benchmark to test out the different methods. There are two documents to test with: one is 1.8KB and the other is 2.1KB. A negligible difference. This is also tested on both .NET 6 and 7 as we use both in production depending on the environment. I ran these on an M1 Pro with 16GB RAM and a tonne of other stuff open, so YMMV.
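
The harness looks roughly like this (a sketch assuming BenchmarkDotNet; the class name and document file names are illustrative, and each [Benchmark] body is one of the loops shown throughout this article):

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using Newtonsoft.Json.Linq;

[MemoryDiagnoser]
[SimpleJob(RuntimeMoniker.Net60)]
[SimpleJob(RuntimeMoniker.Net70)]
public class DocumentReadBenchmarks
{
    // The same parsed documents, held both as dynamic and as JObject.
    public List<dynamic> DynamicObjects = new();
    public List<JObject> Objects = new();

    [GlobalSetup]
    public void Setup()
    {
        foreach (var file in new[] { "type1.json", "type2.json" })
        {
            var parsed = JObject.Parse(File.ReadAllText(file)); // what the Cosmos SDK does internally
            Objects.Add(parsed);
            DynamicObjects.Add(parsed);
        }
    }
}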

Note also that Newtonsoft.Json is the default (de)serializer for the Cosmos SDK v3. v4, which seems like it’s never coming (it hasn’t been touched for years), does use System.Text.Json, but for now we’re stuck.

Here’s the benchmark code.

Type1 t1Out;
Type2 t2Out;
foreach (var obj in DynamicObjects)
{
    // Check the type discriminator via dynamic member access, then round-trip:
    // .ToString() re-serializes the document to JSON, and DeserializeObject
    // parses that string back into the target type.
    if (obj.type == "type_1")
    {
        t1Out = JsonConvert.DeserializeObject<Type1>(obj.ToString());
    }
    else if (obj.type == "type_2")
    {
        t2Out = JsonConvert.DeserializeObject<Type2>(obj.ToString());
    }
}

| Method | Job | Runtime | Mean | Error | StdDev | Gen0 | Gen1 | Allocated |
|--------|-----|---------|-----:|------:|-------:|-----:|-----:|----------:|
| Dynamic_ToString_Deserialize | .NET 6.0 | .NET 6.0 | 33.63 us | 0.087 us | 0.077 us | 21.3623 | - | 43.7 KB |
| Dynamic_ToString_Deserialize | .NET 7.0 | .NET 7.0 | 27.36 us | 0.381 us | 0.356 us | 7.1106 | 0.2441 | 43.7 KB |

Alright. 43.7KB memory, around 30 microseconds.

Let’s see how we improved it a little (this is essentially the latest version I had reviewed) by using System.Text.Json to deserialize it.

New code:

Type1 t1Out;
Type2 t2Out;
foreach (var obj in DynamicObjects)
{
    if (obj.type == "type_1")
    {
        t1Out = JsonSerializer.Deserialize<Type1>(obj.ToString());
    }
    else if (obj.type == "type_2")
    {
        t2Out = JsonSerializer.Deserialize<Type2>(obj.ToString());
    }
}

and the results:

| Method | Job | Runtime | Mean | Error | StdDev | Gen0 | Gen1 | Allocated |
|--------|-----|---------|-----:|------:|-------:|-----:|-----:|----------:|
| Dynamic_ToString_SystemTextJson_Deserialize | .NET 6.0 | .NET 6.0 | 14.96 us | 0.038 us | 0.033 us | 13.4583 | - | 27.48 KB |
| Dynamic_ToString_SystemTextJson_Deserialize | .NET 7.0 | .NET 7.0 | 13.53 us | 0.130 us | 0.109 us | 4.4708 | 0.1373 | 27.48 KB |

Better in every way! It’s faster and allocates around 37% less memory. Winning! We’ve already beaten the Stack Overflow answers.

So that is the code that was left, committed, and would have made it out into production if I wasn’t a pedant. But I am.

We know that Cosmos DB itself, as well as the SDK, doesn’t actually care what types we’re storing up there as long as they meet some criteria (having a partition key is the only real criterion; an ID will be generated if you don’t supply one, and the partition key could even be that ID!), so the object itself is not actually dynamic, it’s something.

> obj.GetType()
JObject

Yup. Something. In v2 of the SDK I believe this was Document, as it’s referenced in some other Stack Overflow answers, though I haven’t personally seen it used so can’t comment.

Let’s work with that instead of using dynamic, which is supposedly the root of all evil in the performance world. In place of DynamicObjects (which is what it says on the tin), Objects is the same collection but typed as JObject. Both were created using JObject.Parse (which is what Cosmos does internally); we’re just typed now.

This gives us a little more type safety. I say a little more because we’re still going to just assume obj["type"] is there and it’s a string. Which is pretty safe.
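
If you can’t make that assumption, a slightly more defensive lookup might look like this (a sketch):

// Defensive variant: only dispatch if "type" exists and is a string.
if (obj.TryGetValue("type", out JToken typeToken) && typeToken.Type == JTokenType.String)
{
    var type = (string)typeToken;
    // ... dispatch on type as in the loop below
}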

Type1 t1Out;
Type2 t2Out;
foreach (var obj in Objects)
{
    var type = (string)obj["type"];
    if (type == "type_1")
    {
        t1Out = JsonSerializer.Deserialize<Type1>(obj.ToString());
        // And the equivalent JsonConvert in another benchmark
    }
    else if (type == "type_2")
    {
        t2Out = JsonSerializer.Deserialize<Type2>(obj.ToString());
    }
}

(and the JsonConvert equivalent for good comparative measure).

| Method | Job | Runtime | Mean | Error | StdDev | Gen0 | Gen1 | Allocated |
|--------|-----|---------|-----:|------:|-------:|-----:|-----:|----------:|
| JObject_Cast_ToString_Deserialize | .NET 6.0 | .NET 6.0 | 34.14 us | 0.288 us | 0.240 us | 21.3623 | - | 43.63 KB |
| JObject_Cast_ToString_SystemTextJson_Deserialize | .NET 6.0 | .NET 6.0 | 15.33 us | 0.304 us | 0.338 us | 13.4277 | - | 27.41 KB |
| JObject_Cast_ToString_Deserialize | .NET 7.0 | .NET 7.0 | 27.39 us | 0.424 us | 0.397 us | 7.1106 | 0.2136 | 43.63 KB |
| JObject_Cast_ToString_SystemTextJson_Deserialize | .NET 7.0 | .NET 7.0 | 13.02 us | 0.254 us | 0.250 us | 4.4708 | 0.1221 | 27.41 KB |

Excellent, we’ve shaved off a few bytes on memory allocation. Done! We have type-safety(ish, kinda, sorta) from using JObject and we shaved off some time.

I’m joking, obviously. Let’s look at what a JObject actually is.

Represents a JSON object.

Helpful.

Anyway, to save a rant: it’s basically a parsed, in-memory representation of the JSON document that went into it.

Does it also hold on to the original document JSON, which ToString() would just return? Of course not; why would it? Let’s see where our allocations are actually coming from.

obj.ToString();

| Method | Job | Runtime | Mean | Error | StdDev | Gen0 | Gen1 | Allocated |
|--------|-----|---------|-----:|------:|-------:|-----:|-----:|----------:|
| JObject_ToString | .NET 6.0 | .NET 6.0 | 9.795 us | 0.0190 us | 0.0159 us | 13.3209 | - | 27.23 KB |
| JObject_ToString | .NET 7.0 | .NET 7.0 | 8.681 us | 0.1725 us | 0.2634 us | 4.4403 | 0.0916 | 27.23 KB |

When we’re calling JObject.ToString() we are reserializing the object back to JSON. In effect, this is what the answers to the Stack Overflow questions are doing:

  1. Let Cosmos load the document blob from the database itself
  2. Let Newtonsoft.Json Deserialize it from JSON to a JObject
  3. Using .ToString(), serialize it back from a JObject to JSON
  4. Deserialize it (using Newtonsoft or in our case, System.Text.Json) back to our type

We produced the same JSON twice in this case, and parsed it twice too, resulting in all those allocations.

We also learned that System.Text.Json is actually super awesome, adding virtually no allocations on top of the .ToString() itself. Newtonsoft.Json, of course, basically doubles it, but we knew it was nowhere near competitive with System.Text.Json for simple document (de)serialization anyway.

With that said, to use System.Text.Json we still have to call this .ToString() method. There has to be a better way!

Well, there is: since a JObject is also a JToken, we have access to ToObject<T>(). We’ll still be living in the Newtonsoft.Json world, but we’re already there anyway.

So here we go:

Type1 t1Out;
Type2 t2Out;
foreach (var obj in Objects)
{
    var type = obj["type"].ToString();
    // or type = (string)obj["type"]
    // or type = obj["type"].Value<string>()
    // it doesn't matter
    if (type == "type_1")
    {
        t1Out = obj.ToObject<Type1>();
    }
    else if (type == "type_2")
    {
        t2Out = obj.ToObject<Type2>();
    }
}

| Method | Job | Runtime | Mean | Error | StdDev | Gen0 | Allocated |
|--------|-----|---------|-----:|------:|-------:|-----:|----------:|
| JObject_Cast_ToObject | .NET 6.0 | .NET 6.0 | 16.36 us | 0.046 us | 0.043 us | 1.6174 | 3.32 KB |
| JObject_ToString_ToObject | .NET 6.0 | .NET 6.0 | 16.22 us | 0.082 us | 0.072 us | 1.6174 | 3.32 KB |
| JObject_Value_ToObject | .NET 6.0 | .NET 6.0 | 16.42 us | 0.051 us | 0.048 us | 1.6174 | 3.32 KB |
| Dynamic_ToObject | .NET 6.0 | .NET 6.0 | 16.80 us | 0.019 us | 0.017 us | 1.6479 | 3.39 KB |
| JObject_Cast_ToObject | .NET 7.0 | .NET 7.0 | 12.52 us | 0.030 us | 0.027 us | 0.5341 | 3.32 KB |
| JObject_ToString_ToObject | .NET 7.0 | .NET 7.0 | 12.03 us | 0.018 us | 0.015 us | 0.5341 | 3.32 KB |
| JObject_Value_ToObject | .NET 7.0 | .NET 7.0 | 11.86 us | 0.016 us | 0.015 us | 0.5341 | 3.32 KB |
| Dynamic_ToObject | .NET 7.0 | .NET 7.0 | 12.33 us | 0.034 us | 0.031 us | 0.5493 | 3.39 KB |

Consistent results, no unnecessary (de)serializing, and it’s about as fast as using System.Text.Json to parse it back (slightly faster on .NET 7, slightly slower on .NET 6). The times are all so close to one another that it doesn’t matter much, but the memory allocations are significant enough to make a difference when we’re calling this code over and over again in quick succession.

For good measure I also threw the dynamic benchmark back in, just calling ToObject<T>(), with similarly negligible differences. But why use dynamic when we know what it is?

So there we have it. Rather than calling .ToString() (serialize) and then deserializing the result, we do far less work and simply map the JObject straight to our target type.

Also dynamic is not actually that evil in this case (though it’s still more evil than being explicit).

There’s actually an even better way to do this (that we ended up implementing), but I’ll save that for a quick follow up in a couple of days once I’ve had chance to do real benchmarks.


A small bonus for any of you that are still here and are eagle-eyed enough to realise that version 1 did a Point Read in Cosmos per type rather than using a query to get back all of the documents. By using Point Reads we would have the strong type from the start and arguably slightly simpler code.

Cosmos DB bills on Request Units (or RUs). Roughly, point reading 1 x 1KB document costs 1 RU, and 1 x 100KB document costs 100 RU. Queries activate the query engine on the Cosmos side and have a base cost of around 2.8 RU.

Let’s say we had 4 document types that we wanted to read, each at roughly 2KB. We can assume that the Point Reads would cost us 2 RU x 4 = 8 RU. Point Reads are super quick, and we generally see them returning in around 8ms, which is astonishing. With that we’re using 8 RU and spending around 32ms loading these 4 documents in from Cosmos, inclusive of the parsing.
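
For reference, that point-read version would look something like this (a sketch; the ids, and the Type3/Type4 types, are illustrative):

// One point read per document type, ~2 RU each at roughly 2KB.
var t1 = (await container.ReadItemAsync<Type1>("type_1_doc", partitionKey)).Resource;
var t2 = (await container.ReadItemAsync<Type2>("type_2_doc", partitionKey)).Resource;
var t3 = (await container.ReadItemAsync<Type3>("type_3_doc", partitionKey)).Resource;
var t4 = (await container.ReadItemAsync<Type4>("type_4_doc", partitionKey)).Resource;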

However, we’re only loading 4 documents, and an important thing to know is that, strangely, query-based RUs don’t scale with document size. This seems bad for Azure’s bottom line, but great for us!

When running SELECT * FROM c and returning those 4 documents, the query costs 2.93 RU and takes 22ms. We’re slashing costs by almost two thirds, and time by about a third.

If you have a similar setup to this, give it a try if you’re not already querying. Just make sure that you specify the partition key for the query, as you don’t want any cross-partition queries going on.

For your reference it’ll look a little something like this:

var query = new QueryDefinition("SELECT * FROM c");
var iterator = container.GetItemQueryIterator<JObject>(query, requestOptions: new QueryRequestOptions
{
    PartitionKey = partitionKey
});
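
And to drain that iterator, mapping each JObject straight to its type as we benchmarked above (Type1/Type2 and the type discriminator values are the illustrative ones from earlier):

var results = new List<object>();
while (iterator.HasMoreResults)
{
    foreach (var obj in await iterator.ReadNextAsync())
    {
        // Dispatch on the type discriminator and map directly; no re-serializing.
        var type = (string)obj["type"];
        results.Add(type switch
        {
            "type_1" => (object)obj.ToObject<Type1>(),
            "type_2" => obj.ToObject<Type2>(),
            _ => obj // unknown type: keep the raw JObject
        });
    }
}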

Happy Cosmosing and stay tuned for the even better way! 🪐

Written by Rudi Visser
