I am building a website which depends on serving lots of little mp3 files (approx. 10-15KB each) quite quickly. Each file contains a word pronunciation, and 20-30 per user will be downloaded every minute they are using the site. Each user might download 200 a day, and I anticipate 50 simultaneous users. There will be approx. 15,000 separate files eventually.

What would be the best way to store, manage, call and play these files as required? Will I need specialist hosting to deal with all the little files, or will they behave happily in one big folder (using a standard host)? Any delays will ruin the feel.


Update

Having done a bit more searching, I think the problem could be solved with either:

  1. A service like Photobucket but for audio instead, with its own API
  2. Some other sort of 'bucket hosting' solution where you can upload thousands of files at a reasonable cost, and call for them easily

Does anyone know of such a product?

Comments

I totally misread this question as "Serving lots of small fries", like it was McDonald's spam. No I don't eat fast food and have never worked at such a place.

Written by Tesserex

At least to me, this sounds like finding and configuring a server product rather than writing any code, so it seems like it would fit better on ServerFault.

Written by Jerry Coffin

Accepted Answer

If you want (or need) to store the files on disk instead of as BLOBs in a database, there are a couple of things you need to keep in mind.

Many (but not necessarily all) file systems don't work too well with folders containing many files, so you probably don't want to store everything in one big folder - but that doesn't mean you need specialist hosting.

The key is to distribute the files into a folder hierarchy, based on some hash function. As an example, we'll use the MD5 of the filename here, but it's not particularly important which algorithm you use or what data you are hashing, as long as you're consistent and have the data available when you need to locate a file.

In general, the output of a hash function is formatted as a hexadecimal string: for example, the MD5 of "foo.mp3" is 10ebb1120767e9de166e0f5905077cb1.

You can create 16 folders, one for each possible hexadecimal digit, so you have a directory named 0, one named 1, and so on up to f.

In each of those 16 folders, repeat this structure, so you have two levels. (0/0/, 0/1/,... , f/f/)

What you then do is simply to place the file in the folder dictated by its hash. You can use the first character to determine the first folder, and the second character to determine the subfolder. Using that scheme, foo.mp3 would go in 1/0/, bar.mp3 goes in b/6/, and baz.mp3 goes in 1/b/.
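As a minimal sketch of that lookup, assuming Python and its standard hashlib module: the function name shard_path and the levels parameter are hypothetical, chosen only for illustration. It hashes the filename with MD5 and uses the first hex digits of the digest as folder names.

```python
import hashlib
from pathlib import PurePosixPath


def shard_path(filename: str, levels: int = 2) -> PurePosixPath:
    """Return the relative path <h0>/<h1>/.../<filename> for a file.

    h0, h1, ... are the leading hex digits of the MD5 of the filename,
    one digit per folder level.
    """
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    # One single-character folder name per level, taken from the digest.
    folders = list(digest[:levels])
    return PurePosixPath(*folders, filename)


if __name__ == "__main__":
    for name in ("foo.mp3", "bar.mp3", "baz.mp3"):
        print(name, "->", shard_path(name))
```

Because the path is derived only from the filename, the same function can be used both when storing a file and when building the URL or disk path to serve it.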

Since these hash functions are intended to distribute their values evenly, your files will be distributed fairly evenly across these 256 folders, which reduces the number of files in any single folder. Statistically, 15,000 files spread over 256 folders average about 59 per folder, which should be no problem.
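One way to check the spread before committing to a layout is to hash a list of names and count how many land in each folder. The sketch below assumes Python; the function folder_counts and the made-up names word00000.mp3 and so on are placeholders, not the real file list.

```python
import hashlib
from collections import Counter


def folder_counts(names, levels=2):
    """Count how many names fall into each folder prefix of the MD5 digest."""
    return Counter(
        hashlib.md5(n.encode("utf-8")).hexdigest()[:levels] for n in names
    )


# Placeholder names standing in for the real 15,000 pronunciation files.
names = [f"word{i:05d}.mp3" for i in range(15000)]
counts = folder_counts(names)

print("folders used:", len(counts))
print("smallest folder:", min(counts.values()))
print("largest folder:", max(counts.values()))
print("average:", sum(counts.values()) / len(counts))
```

Running the same count against the actual filenames shows whether any folder ends up unusually full.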

If you're unlucky and the hash function you chose ends up clumping too many of your files in one folder anyway, you can extend the hierarchy to more than 2 levels, or you can simply use a different hash function. In both cases, you need to redistribute the files, but you only need to do that once, and it shouldn't be too much trouble to write a script to do it for you.
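A redistribution script along those lines might look like the following sketch. It assumes Python, that all files end in .mp3, and that filenames are unique across the whole tree; the function name redistribute and its parameters are hypothetical. It does not remove folders left empty by the move.

```python
import hashlib
import shutil
from pathlib import Path


def redistribute(root: Path, levels: int = 2) -> int:
    """Move every .mp3 under root to root/<h0>/<h1>/.../<name>.

    Returns the number of files that were moved. Files already in the
    right place are left alone, so the script can be re-run safely.
    """
    moved = 0
    # Build the full list first so moving files does not disturb the walk.
    for src in list(root.rglob("*.mp3")):
        digest = hashlib.md5(src.name.encode("utf-8")).hexdigest()
        dest_dir = root.joinpath(*digest[:levels])
        dest = dest_dir / src.name
        if src == dest:
            continue  # already where the hash says it belongs
        dest_dir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dest))
        moved += 1
    return moved
```

Calling redistribute(Path("audio"), levels=3) on an existing two-level tree would, under these assumptions, move each file one level deeper according to the third hex digit of its hash.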

For managing your files, you will likely want a small database indexing what files you currently have, but this does not necessarily need to be used for anything other than managing them - if you know the name of the file, and you use the filename as input to your hash function, you can just calculate the hash again and find its location that way.

Written by Michael Madsen
The content is written by members of the stackoverflow.com community.
It is licensed under cc-wiki