[PATCH] Better Filename Parsing

shartte
User
Posts: 7
Joined: Sun Aug 27, 2006 8:05 pm

Post by shartte » Sun Aug 27, 2006 8:09 pm

Hi. Since my MP3 collection had several files where the title contained a dash ('-'), I had to improve the filename parsing manually, because the artist always turned up as '**** UNKNOWN FILENAME STRUCTURE***' (or something like that).

The new code would be like this:

(update.php, around line 746)

Code:

$cd = 1;
$temp = DecodeEscapeCharacters($filename[$i]);

// START OF PATCH

// Funny processing stuff
// Check for the number of separators
$separators = substr_count($temp, '-');

$featuring = '';

// At least two parts are required.
if ($separators < 2) {
    $track_artist = '*** UNKNOWN FILENAME FORMAT ***';
    $title        = '(' . $filename[$i] . ')';
} else {
    // If three parts are present, we check if the first one
    // is a track number. Basically we only allow 1 to 3 digits here.
    // Anything above that is probably not a track number.
    if (preg_match('/^(\d{1,3})\s*-\s*(.*)$/', $temp, $matches)) {
        $track_number = $matches[1];
        // Most likely a CD number in front (101 = CD 1 Track 1)
        if ($track_number > 99) {
            $cd = floor($track_number / 100);
        }
        $temp = $matches[2]; // Strip the track number
    }

    preg_match('/^(.+)\s+-\s+(.*)(Feat\.\s+.*|)$/i', $temp, $matches);

    $track_artist = $matches[1];
    $title = $matches[2];
    $featuring = $matches[3];
}

// END OF PATCH

$relative_file = substr($file[$i], strlen($cfg['media_dir']));

$query = mysql_query('SELECT album_id FROM track WHERE album_id = "' . mysql_real_escape_string($album_id) . '" AND BINARY relative_file = "' . mysql_real_escape_string($relative_file) . '"');

The basics are this:
It looks for a track number at the beginning of the filename, which has to be numeric and 1 to 3 digits long (so the largest track number is 999, as before). It then does the CD check and removes the track number from the string (to unify the processing afterwards).
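
For readers following along, here is a minimal sketch of that prefix step in isolation. The helper name `split_track_prefix` and its return shape are made up for illustration and are not part of update.php; the regular expression is the one from the patch.

```php
<?php
// Hypothetical helper (not in update.php): split an optional 1-3 digit
// track prefix from a filename and derive the CD number from it.
function split_track_prefix(string $name): array
{
    $cd = 1;
    $track = null;
    if (preg_match('/^(\d{1,3})\s*-\s*(.*)$/', $name, $m)) {
        $track = (int) $m[1];
        // Three digits are read as CD + track, e.g. 101 = CD 1, track 1.
        // As in the patch, the track number itself is kept unchanged.
        if ($track > 99) {
            $cd = intdiv($track, 100);
        }
        $name = $m[2]; // strip the prefix
    }
    return ['cd' => $cd, 'track' => $track, 'rest' => $name];
}

print_r(split_track_prefix('15 - Static-X - Skinnyman'));
print_r(split_track_prefix('203 - moby - go'));
```

A name without a numeric prefix passes through untouched, with `track` left as `null`.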

Then it basically does everything in one regular expression.

The first part is the artist name which can be _anything_ but it has to be separated by ' - ' from the rest of the string. The rest of the string is then treated as the title until (optionally) 'Feat.' followed by one or more spaces is encountered. The rest is then treated as the "featuring" part.
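
One thing worth checking with a pattern of this shape is how the greedy quantifiers behave. The sketch below runs the posted pattern on two sample names (the first one is made up) and compares it with a variant that uses lazy groups; the lazy variant is an assumption for illustration, not code from the patch.

```php
<?php
$posted = '/^(.+)\s+-\s+(.*)(Feat\.\s+.*|)$/i';
// Assumed variant: lazy artist and title groups, so the optional
// "Feat." part gets a chance to match before the title swallows it.
$lazy   = '/^(.+?)\s+-\s+(.*?)\s*(Feat\.\s+.*|)$/i';

// Made-up sample with a featuring part.
$name = 'moby - go Feat. someone';
preg_match($posted, $name, $g);
preg_match($lazy, $name, $l);

// Greedy (.*) takes the rest of the string, so group 3 can only
// match its empty alternative.
var_export([$g[1], $g[2], $g[3]]);
echo "\n";
var_export([$l[1], $l[2], $l[3]]);
echo "\n";

// With several ' - ' separators, the greedy (.+) splits at the last one.
$multi = 'Kärtsy Hatakka & Kimmo Kajasto - Variations - Max Payne (Cello Base)';
preg_match($posted, $multi, $m);
var_export([$m[1], $m[2]]);
echo "\n";
```

Whether the split at the last separator is the desired one depends on how the collection is named, so it is worth testing against a few real filenames.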

If you want to include this, go ahead (GPL'd anyway). If you have certain requests regarding this code before you include it (e.g. changing the regular expression pattern a bit), go ahead and ask.

cu,
Sebastian

PS: Filenames it does parse correctly include:

15 - Static-X - Skinnyman.mp3 (Previously parsed correctly, I think)
15 - Kärtsy Hatakka & Kimmo Kajasto - Variations - Max Payne (Cello Base).mp3

Especially the last one gave me a lot of trouble since the entire album looks like that.

Post by shartte » Sun Aug 27, 2006 8:31 pm

This is basically the same patch but for album directory names:

Code:

$year  = 'NULL';
$month = 'NULL';
preg_match('/^(?:(\d{4,6})\s*-\s*)?(.+)$/', $album, $matches);
// A 6-digit prefix is read as YYYYMM, a 4-digit prefix as YYYY.
if (strlen($matches[1]) == 6) {
    $year  = substr($matches[1], 0, 4);
    $month = substr($matches[1], -2);
} else if (strlen($matches[1]) == 4) {
    $year = $matches[1];
}
$album = $matches[2];

$temp = explode(', ', $artist_alphabetic);

Tested to work both with and without years.

The path that wasn't parsed correctly without this patch was:

2004 - Gundam SEED DESTINY ED1 Single - Reason [Nami Tamaki]

That now works flawlessly.
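
A minimal sketch of the same logic wrapped in a function, so it can be tried on a few directory names. The function name `parse_album_dir` and the sample names other than the Gundam one are hypothetical; it returns PHP `null` where the snippet above uses the string 'NULL'.

```php
<?php
// Hypothetical wrapper around the posted snippet, for illustration only.
function parse_album_dir(string $dir): array
{
    $result = ['year' => null, 'month' => null, 'album' => $dir];
    if (!preg_match('/^(?:(\d{4,6})\s*-\s*)?(.+)$/', $dir, $m)) {
        return $result; // e.g. an empty name
    }
    // $m[1] is '' when the optional date prefix is absent.
    if (strlen($m[1]) === 6) {
        $result['year']  = substr($m[1], 0, 4);
        $result['month'] = substr($m[1], -2);
    } elseif (strlen($m[1]) === 4) {
        $result['year'] = $m[1];
    }
    $result['album'] = $m[2];
    return $result;
}

print_r(parse_album_dir('2004 - Gundam SEED DESTINY ED1 Single - Reason [Nami Tamaki]'));
print_r(parse_album_dir('200409 - Some Album'));
print_r(parse_album_dir('Some Album'));
```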

wbartels
netjukebox developer
Posts: 797
Joined: Thu Nov 04, 2004 3:12 pm
Location: Netherlands

Post by wbartels » Mon Aug 28, 2006 4:23 pm

Thanks for the help, Shartte,

I will definitely take a look at your code.
If everything is working correctly I will implement it in the next version of netjukebox.
I will post my findings in this thread later.

Willem

Post by shartte » Mon Aug 28, 2006 4:25 pm

Thanks, and go ahead. My next project would be automatic genre assignment (without overwriting existing genres) based on ID3 tags, since the genre is more or less the only field that is somewhat standardized. Is there a specific reason you're not reading the ID3 tags at all right now?

cu,
Sebastian

Post by wbartels » Sat Sep 02, 2006 2:13 pm

shartte wrote:Is there a specific reason you're not reading the id3 tags at all right now?
There are two reasons for not using ID tags:

1. There is no uniform ID tag for an "album artist".
   Such a field is handy for mix/compilation albums, like this:

   Album artist: DJ Tiesto
   Artist: York
   Track: The reachers of civilisation

2. When using ID tags there is not really a reason to split albums into different directories.
   netjukebox creates an id in every album directory for identification.
   Even after renaming an album, the album-related data in the database (internet cover art, album play counter) is kept intact.

Post by shartte » Sat Sep 02, 2006 2:34 pm

Then what about the genre information?

I personally would find it very useful if the software could read the ID3 tags for the genre, then see if the entire album has the same genre and assign it automatically (at least try it) if it isn't assigned yet.
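
The idea could be sketched roughly as below. This only covers the decision step; reading the tags is not shown, and the function name `suggest_album_genre` and its inputs are hypothetical, not part of netjukebox.

```php
<?php
// Hypothetical helper: $trackGenres holds the genre strings read from the
// tags of one album's tracks; $currentGenre is the genre already assigned
// in the database, or null if there is none.
function suggest_album_genre(array $trackGenres, ?string $currentGenre): ?string
{
    if ($currentGenre !== null && $currentGenre !== '') {
        return $currentGenre; // never overwrite an existing assignment
    }
    $genres = array_map('trim', $trackGenres);
    // Every track must carry a genre for the album to count as uniform.
    if ($genres === [] || in_array('', $genres, true)) {
        return null;
    }
    $distinct = array_unique($genres);
    return count($distinct) === 1 ? reset($distinct) : null;
}

var_dump(suggest_album_genre(['Trance', 'Trance'], null));
var_dump(suggest_album_genre(['Trance', 'House'], null));
```

A `null` result means "no suggestion", so the album would simply stay unassigned.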

Post by wbartels » Mon Sep 04, 2006 4:14 pm

shartte wrote:This is basically the same patch but for album directory names:

Thanks, this is working very well.
I have changed the preg_match() call to this:

Code:

preg_match('/^(?:(\d{4,6})\s+-\s+)?(.+)$/', $album, $matches);

So it can handle album names like: 1234-hits
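
A short sketch of the difference between the two patterns on that name. The variable names are only for illustration.

```php
<?php
$loose  = '/^(?:(\d{4,6})\s*-\s*)?(.+)$/'; // original: whitespace optional
$strict = '/^(?:(\d{4,6})\s+-\s+)?(.+)$/'; // changed: whitespace required

// With \s*, the digits are taken as a year and cut off the album name.
preg_match($loose, '1234-hits', $a);
var_export([$a[1], $a[2]]);
echo "\n";

// With \s+, the optional prefix does not match and the name stays whole.
preg_match($strict, '1234-hits', $b);
var_export([$b[1], $b[2]]);
echo "\n";

// A spaced date prefix is still recognised by the stricter pattern.
preg_match($strict, '2004 - Some Album', $c);
var_export([$c[1], $c[2]]);
echo "\n";
```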

Post by shartte » Mon Sep 04, 2006 4:15 pm

Yep, that sounds good.

I don't even remember why I used \s* instead of \s+ ;-)

cu,
Sebastian

Post by wbartels » Mon Sep 04, 2006 4:22 pm

shartte wrote:Hi. Since my MP3 collection had several files where the title contained a dash ('-'), I had to improve the filename parsing manually, because the artist always turned up as '**** UNKNOWN FILENAME STRUCTURE***' (or something like that).

Currently netjukebox can handle filenames like these:

Code:

01 - moby - go
01 - go
moby - go

track - track_artist - title  (compilation album)
track - title                 (album from one artist)
track_artist - title          (singles)

With your regular-expression code, the second and third filename forms do not work.
Do you have a fix for that?

Thanks :D

Post by shartte » Mon Sep 04, 2006 4:25 pm

I'll fix it once I'm home from work!

cu,
Sebastian

Post by wbartels » Mon Sep 04, 2006 4:49 pm

shartte wrote:Then what about the genre information?

I personally would find it very useful if the software could read the ID3 tags for the genre, then see if the entire album has the same genre and assign it automatically (at least try it) if it isn't assigned yet.
The genres in netjukebox have a hierarchical structure.
See: http://forum.lan/viewtopic.php?p=616#616
This is not compatible with the flat structure in most ID tags.

Another reason I do not read the genre ID tag is that most people have a different idea of which genre a specific album belongs to.

Post by shartte » Mon Sep 04, 2006 9:30 pm

I have slightly modified the title parsing and added a fallback to parse titles which don't contain artist information:

Code:

// START OF PATCH

// Funny processing stuff
// Check for the number of separators
$separators = substr_count($temp, '-');

$featuring = '';

// At least two parts are required.
if ($separators < 2) {
    $track_artist = '*** UNKNOWN FILENAME FORMAT ***';
    $title        = '(' . $filename[$i] . ')';
} else {
    // If three parts are present, we check if the first one
    // is a track number. Basically we only allow 1 to 3 digits here.
    // Anything above that is probably not a track number.
    if (preg_match('/^(\d{1,3})\s+-\s+(.*)$/', $temp, $matches)) {
        $track_number = $matches[1];
        // Most likely a CD number in front (101 = CD 1 Track 1)
        if ($track_number > 99) {
            $cd = floor($track_number / 100);
        }
        $temp = $matches[2]; // Strip the track number
    }

    if (preg_match('/^(.+)\s+-\s+(.*)(Feat\.\s+.*|)$/i', $temp, $matches)) {
        $track_artist = $matches[1];
        $title = $matches[2];
        $featuring = $matches[3];
    } else if (preg_match('/^(.*)(Feat\.\s+.*|)$/i', $temp, $matches)) {
        // Fallback: no artist in the filename, only a title
        $title = $matches[1];
        $featuring = $matches[2];
    }
}

// END OF PATCH

It does not automatically use the album artist yet, do you do that anywhere below that code?

Well anyway...

Let me know if it works. I didn't test it yet (too tired :( ).

cu,
Sebastian

PS: I wanted to use the ID3 album tags as a hinting system, not as the definitive answer. And even if you have a hierarchical system, it still has a lot of genres that overlap with the ID3 tag system.

Post by wbartels » Tue Sep 05, 2006 6:49 pm

The above code didn’t work, because "at least two parts are required" means that only one separator is needed.

So:

Code:

if ($separators < 2)

must be changed to:

Code:

if ($separators < 1)

And featuring didn’t work at all.
But your code helped me a lot to come up with this result :D :

Code:

$temp = DecodeEscapeCharacters($filename[$i]);
$cd = 1;
$featuring = '';

if (preg_match('/^(?:\d{2}\s+-\s+|(\d{1})\d{2}\s+-\s+)?(.*)$/', $temp, $matches))
    {
    if ($matches[1] > 1)
        $cd = $matches[1];
    $temp = $matches[2]; // Strip the track number
    }

if (preg_match('/^(.*?)\s+-\s+(.*?)(?:\s+Ft\.\s+(.*)|)$/i', $temp, $matches))
    {
    $track_artist    = $matches[1];
    $title            = $matches[2];
    if (isset($matches[3]))
        $featuring    = $matches[3];
    }
else if (preg_match('/^(.*?)(?:\s+Ft\.\s+(.*)|)$/i', $temp, $matches))
    {
    $track_artist    = $artist;
    $title            = $matches[1];
    if (isset($matches[2]))
        $featuring    = $matches[2];
    }
else
    {
    $track_artist    = '*** UNKNOWN FILENAME FORMAT ***';
    $title            = '(' . $filename[$i] . ')';
    }

If you have any comments, please let me know.
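
To try the result on the three filename forms discussed earlier, the logic can be wrapped in a function, roughly as sketched here. The function name `parse_track_name` and the `$albumArtist` parameter are hypothetical; the parameter stands in for the `$artist` variable that update.php is assumed to provide, and the sample with "Ft." is made up.

```php
<?php
// Hypothetical wrapper around the logic posted above, for illustration only.
function parse_track_name(string $name, string $albumArtist): array
{
    $cd = 1;
    $featuring = '';

    // Strip "NN - " or "CNN - " (C = CD number) from the front.
    if (preg_match('/^(?:\d{2}\s+-\s+|(\d)\d{2}\s+-\s+)?(.*)$/', $name, $m)) {
        if ((int) $m[1] > 1) {
            $cd = (int) $m[1];
        }
        $name = $m[2];
    }

    if (preg_match('/^(.*?)\s+-\s+(.*?)(?:\s+Ft\.\s+(.*)|)$/i', $name, $m)) {
        // "track_artist - title", optionally followed by "Ft. ..."
        $artist = $m[1];
        $title = $m[2];
        $featuring = $m[3] ?? ''; // group 3 is absent without "Ft."
    } elseif (preg_match('/^(.*?)(?:\s+Ft\.\s+(.*)|)$/i', $name, $m)) {
        // No " - " left: title only, so fall back to the album artist.
        $artist = $albumArtist;
        $title = $m[1];
        $featuring = $m[2] ?? '';
    } else {
        $artist = '*** UNKNOWN FILENAME FORMAT ***';
        $title = '(' . $name . ')';
    }

    return ['cd' => $cd, 'artist' => $artist, 'title' => $title, 'featuring' => $featuring];
}

foreach (['01 - moby - go', '01 - go', 'moby - go', '203 - moby - go Ft. someone'] as $sample) {
    print_r(parse_track_name($sample, 'moby'));
}
```

Because the title-only pattern accepts almost any string, the last branch is rarely reached in practice.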
