Had reason recently to select a random sample of data from a stream of elements. The amount of samples I needed to take was finite but what was unknown was the number of elements in the input stream. I managed to find various implementations of an IEnumerable extension method on the net, but all the ones I found would cache the entire stream before selecting the sample. This clearly isn't scalable as the input stream increases in size.
After a look around I found this blog article that describes how to implement Reservoir sampling without the need to keep all the items. This then allowed me to write a Random Sample extension method to IEnumerable<T> to do what I needed.
public static IEnumerable<T> RandomSample<T>(this IEnumerable<T> rows, int nSamples)
{
Random rnd = new Random();
T[] sample = new T[nSamples];
int nSamplesTaken = 0;
foreach (T item in rows)
{
if (nSamplesTaken < sample.Length)
{
sample[nSamplesTaken] = item;
}
else
{
// As the amount of samples increases the probability
// of including a value gets less..due to the fact
// that it has a greater chance of surviving if it gets
// placed into the sample over earlier selections
if (rnd.Next(nSamplesTaken) < nSamples)
{
sample[rnd.Next(nSamples)] = item;
}
}
nSamplesTaken++;
}
if (nSamplesTaken >= nSamples)
{
return sample;
}
else
{
return sample.Take(nSamplesTaken);
}
}
1 comment:
Hey, I enjoyed your blog post. I've been writing a lot of extension methods lately myself. Sometimes you really have to be thankful to be a C# developer, you know?
Also, I was wondering if you've tried any syntax highlighters? I use one from google code and I have a pretty easy one liner you can add to your template. If you'd like to use it, feel free.
Post a Comment