Tag Archives: Best practice or not?

MailAddress class is driving me crazy

I’m not working a lot with emails and email addresses. Loading, parsing and so on. Because of that I’m also using a MailAddress class a lot. And I hate it. The class, even simple, makes ma wanna scream.

Consider simple code like this.

var mail1 = new MailAddress("mail@example.com");
var mail2 = new MailAddress("MAIL@example.com");
Console.WriteLine("Equals?:\t{0}", mail1.Equals(mail2));
Console.WriteLine("Equals?:\t{0}", mail2.Equals(mail1));
Console.WriteLine("GetHashCode1:\t{0}", mail1.GetHashCode());
Console.WriteLine("GetHashCode2:\t{0}", mail2.GetHashCode());

What do you think the output be? Drum roll please…

Equals?:        True
Equals?:        True
GetHashCode1:   582340455
GetHashCode2:   1045083261

Why the hell somebody did the work overriding the Equals and not providing the implementation of GetHashCode that aligns with that?

I think it’s time to implement my own. 8-)

Timestamp property on Windows Azure Table entity

Every entity in Azure Table Storage has PartitionKey, RowKey and Timestamp property. PartitionKey gives the “bucket” where the entity will be stored (entities in one partition can be on more machines, though). And the RowKey is simply the key of that row. Queries where you’re matching both keys are fast. Matches inside partition (you know PartitionKey but only part (or none) of RowKey) are slower. Cross partition matches even slower. You get the idea. But what about the Timestamp?

The Timestamp property is maintained by server and it’s the time the entity was last modified. Simple as it is. Because the property is always there, I thought it would be good to actually use it. But it’s not. This property is not being “indexed” (not even slightly :) ) or anything like that. That means any query including condition on it will be as slow as table scan (or any other property condition). And that’s something very very slow, especially if you have dozens of partitions with millions of records and you’re doing range condition (not mentioning the limit on one batch for result).

And I learned that by my experience. :) Because the secondary indices are not here yet, the only option, if you’re doing really a lot of queries based on date&time/last modified to (ab)use RowKey (or build and maintain secondary index yourself).

You might wonder how to create such secondary index. Easy, like in 1980s. You already have table (BTW did you noticed, how easy is to store something into table, but it’s order of magnitude harder (read slower) to get something from it, except if you’re doing PartitionKey&RowKey exact match?), if the inserts/updates/deletes are not super fast and you’re not fighting for throughput you can compute the key for secondary index directly while doing the operation and store it in another table with both keys from original table. But if you need the operation to be super fast, you can just put the data into queue message (if it’s too big for message (64kB in Jan 2013) you can put the data into blog and store just reference to blob) and use worker role(s) to process the data, compute secondary keys and insert/update/delete into table(s). Boring? Yes. Old school? Yes. But Azure Table Storage isn’t exactly smart storage, but it scales. :)

Let’s hope secondary indices (even stale would be good) will be added in the future.

Do not overdo parallelism and asynchronous

Sometimes it’s easy to little bit overdo the need for having everything asynchronous and parallel. Quite often in last few weeks I’ve seen methods similar to this one.

Console.WriteLine("Starting");
Parallel.For(1, 10, async i =>
{
	await Task.Delay(200);
	Console.WriteLine(i);
});
Console.WriteLine("Finished");

What’s the result? For somebody maybe surprisingly:

Starting
Finished

Why? What we see here is a two pieces “process”. First the Parallel.For. This methods runs the provided method in parallel (for our discussion it doesn’t matter how and what exactly that means) and waits for all methods to complete. The lambda expression we’re providing is asynchronous. And though the async/await simplified the programming a lot, it’s still standing on basic principles around Tasks. And that’s the key for understanding what’s wrong. The async lambda is basically (I’m simplifying here) starting a task to do the work and returning that task so you can eventually (a)wait it to complete. But the Parallel.For care about the method (all of them) returning, not (a)waiting tasks (it’s actually an Action<int> also known as “async void” hence it has no idea about the task inside). And here you have it.

The question that’s left is, how to fix it? :) Probably easiest way is to extract the lambda to method and wait for that task to complete. That will make the method blocking so the Parallel.For is not going to end prematurely.

static void Main(string[] args)
{
	Console.WriteLine("Starting");
	Parallel.For(1, 10, i => Action(i).Wait());
	Console.WriteLine("Finished");
}

static async Task Action(int i)
{
	await Task.Delay(200);
	Console.WriteLine(i);
}

But wait. That’s a little bit crazy, isn’t it? We have tasks running asynchronously and we’re spinning Parallel.For and waiting??? You can actually run the loop starting the asynchronous methods capturing the returned tasks and then use Task.WaitAll or if you want to go really deep async ;) you can use Task.WhenAll and await it.

If you’d like to see something like ForEachAsync (or maybe ForAsync) you can get and inspiration and other interesting notes from this Stephen Toub’s blog post.

Awaiting something in TransactionScope

I like asynchronous programming. The Asynchronous Programming Model and later Tasks with ContinueWith offer great performance especially if no waiting and similar is used.

Although the async/await makes this very simple for 99% of cases, there’s always 1% where you might hit the wall. With callbacks it was little bit obvious, because you wrote the code. Now the compiler is doing the hard work.

If you do something like:

lock (SyncRoot)
{
	await FooBar.DoAsync();
}

You’ll get nice error from compiler saying The 'await' operator cannot be used in the body of a lock statement. And it really makes sense. The lock will very likely provide wrong results under default rewriting work the compiler is doing (and doing it properly for Monitor class needs a lot of knowledge of what you’re trying to achieve. What’s not so clear is that with TransactionScope block (or similar construct) you’re basically doing same stuff, just probably somewhere else in database.

So the compiler is completely OK with:

using (TransactionScope ts = new TransactionScope())
{
	await FooBar.DoAsync();
}

But that’s not what you might had in mind. Consider code like:

static void Main(string[] args)
{
	Test();
}

static async void Test()
{
	Task t = FooAsync();
	Console.WriteLine("Other stuff");
	await t;
}

static async Task FooAsync()
{
	using (Test t = new Test())
	{
		Console.WriteLine("Before");
		await Task.Yield();
		Console.WriteLine("After");
	}
}
class Test : IDisposable
{
	public void Dispose()
	{
		Console.WriteLine("Dispose");
	}
}

The result is correct (and expected, if you look at it closely):

Before
Other stuff
After
Dispose

But considering, that the Other stuff might have some dependency in data being done in that transaction you might get wrong result.

So it’s not always wrong, nor some gotcha in compiler. But think it through before using this construct.

Count for System.Collections.Concurrent.ConcurrentStack<T>

Collections in System.Collections.Concurrent namespace are optimized for access from more (basically) threads. That means no stupid “one-lock-for-everything” approach. Actually these are lock free.

It’s good for performance, but also, if used foolhardily, the performance penalty can be too big. One example can be the ConcurrentStack<T> class. As with a lot of collections, this stack also has a Count property. But because it’s a lock free implementation using linked list the Count isn’t that easy. Actually it’s O(n) and this can be especially bad for big stacks. So use with care. As the remarks for this property recommends, if you need to know just whether it’s empty of not, the IsEmpty is better (though realize, that in time you check it it might be empty, but not on next line where you’re about to perform some action and vice versa).

Use the tools, but also know your tools.

Executing method in intervals – good and bad approaches

In last few days I’ve seen couple of pieces of code with “executing method every x seconds”. And a lot of bad. Not buggy, but superfluously expensive. I’m also here adding a simple loop to run just 10 times.

The first bad one is simply using Thread and Sleep method.

void BadOne1()
{
	Thread t = new Thread(o =>
		{
			for (int i = 0; i < 10; i++)
			{
				Console.WriteLine("BadOne1");
				Thread.Sleep(1000);
			}
		});
	t.Start(null);
}

Problem is the Thread object is very expensive and it’s used only fraction of time. The rest is blocked. Simply wasted resources.

I’ve also seen slightly better version.

void BadOne2()
{
	ThreadPool.QueueUserWorkItem(o =>
		{
			for (int i = 0; i < 10; i++)
			{
				Console.WriteLine("BadOne2");
				Thread.Sleep(1000);
			}
		});
}

This is really just a little bit better. ThreadPool thread is used, but again, the thread is doing nothing a lot of time. Again wasted resources.

Better approach is to use Timer. This way you’re not wasting resources by abusing threads. The callback from Timer is executed when the interval is elapsed (on ThreadPool thread). Rest of the time it’s just sits there waiting for next tick.

void GoodOne1()
{
	int i = 0;
	Timer t = null;
	t = new Timer(o =>
		{
			Console.WriteLine("GoodOne1");
			if (Interlocked.Increment(ref i) == 10)
			{
				// could tick, still
				t.Change(Timeout.InfiniteTimeSpan, Timeout.InfiniteTimeSpan);
				t.Dispose(); // or somewhere else
			}
		}, null, TimeSpan.Zero, TimeSpan.FromSeconds(1));
}

Here I’d like to point to my older post about keeping the reference to Timer.

And also one question that might come. The Timer ticks every interval no matter how long executing the callback took. That might not be what you want. You may want to execute the method with xx seconds delay between (previous examples), not execute the method every xx second (this example). The solution is pretty easy. Initially just schedule one tick and then reschedule it for next interval.

void GoodOne2()
{
	int i = 0;
	Timer t = null;
	t = new Timer(o =>
		{
			Console.WriteLine("GoodOne2");
			if (Interlocked.Increment(ref i) == 10)
			{
				// could tick, still
				t.Change(Timeout.InfiniteTimeSpan, Timeout.InfiniteTimeSpan);
				t.Dispose(); // or somewhere else
			}
			else
			{
				t.Change(TimeSpan.FromSeconds(1), Timeout.InfiniteTimeSpan);
			}
		}, null, TimeSpan.Zero, Timeout.InfiniteTimeSpan);
}

All these are kinda low level. But you can use Reactive Extensions (Rx) to write it in more succinct way, but internally the core idea is same as with Timer.

void GoodOne3()
{
	IDisposable obs = null;
	obs = Observable
		.Timer(TimeSpan.Zero, TimeSpan.FromSeconds(1))
		.Take(10)
		.Subscribe(_ =>
			{
				Console.WriteLine("GoodOne3");
			}, () => obs.Dispose() /* or somewhere else */);
}

No matter what approach you’ll use, please don’t (ab)use threads (manually created or ThreadPool ones). Threads are small little nice sweet creatures that don’t deserve to be treated like this. And. Don’t forget to cleanup resources (i.e. Timer is IDisposable).