Well, it kind of sucks when you run into a limit you can't get around when doing a project.
And this is exactly what I've run into.
I need a Project Voldemort installation to be able to continue working on my current project, and I don't have it. I won't be able to get it for at least a few months.
So this sucks.
I guess I'll stop with this project and move on to one where I don't foresee myself being stopped by software and hardware requirements.
This is just an update of what I've been doing.
Wait a second, no it's not. I just told you what problem I ran into.
What I've been working on is a bot that will attempt to download all .torrent files, and just that. I have that part done; however, I need a database to store all of this information. It has to be a database that won't fail if one of the nodes decides to go down (for maintenance or any other reason).
Now, as you can imagine, there are a lot of .torrent files on the internet, meaning that I'll need a lot of database space.
I split the bot up into two parts. There are probably more to come, but I thought that this was logical, seeing as how it's how I would go about downloading all the torrents.
Bot 1- URL
I need URLs, and lots of them. This bot goes out to torrent sites and gets the information URL for each torrent in a category.
It can currently gather about 2000 torrents in a minute, and this is with it being limited.
I ran into one major problem with this bot, but luckily, while being bored at a religious event, I thought up the solution.
The problem was that the bot was using only one MySQL connection shared between all of its threads, and because of this it was trying to execute too many things at once on that one connection.
So, naturally, I gave each thread its own connection. That solves that, and it does so quite nicely.
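Here is a rough sketch of the one-connection-per-thread idea. It is not the bot's actual code: it uses Python's built-in sqlite3 as a stand-in for MySQL so that it runs anywhere, and the table name torrent_urls and all of the function names are made up for illustration.

```python
import sqlite3
import threading

# Each thread sees its own attributes on this object.
_local = threading.local()


def get_connection(db_path):
    """Return the calling thread's connection, opening it on first use."""
    conn = getattr(_local, "conn", None)
    if conn is None:
        # sqlite3 stands in for MySQL here; timeout lets writers wait their turn.
        conn = sqlite3.connect(db_path, timeout=30)
        _local.conn = conn
    return conn


def init_db(db_path):
    """Create the (hypothetical) URL table once, before the workers start."""
    conn = sqlite3.connect(db_path)
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS torrent_urls (url TEXT PRIMARY KEY)"
        )
    conn.close()


def store_urls(db_path, urls):
    """Worker body: insert one batch of URLs using this thread's connection."""
    conn = get_connection(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.executemany(
                "INSERT OR IGNORE INTO torrent_urls (url) VALUES (?)",
                [(url,) for url in urls],
            )
    finally:
        conn.close()
        _local.conn = None


def run_workers(db_path, batches):
    """Start one thread per batch and wait for all of them to finish."""
    threads = [
        threading.Thread(target=store_urls, args=(db_path, batch))
        for batch in batches
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

The point is only that no two threads ever touch the same connection object. With a real MySQL driver the shape would be the same, with a different connect call.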
Bot 2- Download
Well, now that I have the URLs, I need to download the torrents behind them. This bot parses the information page and downloads the .torrent file.
It handles gzip compression. I added a site that gzips its content, and wondered why I got gibberish when I parsed the torrent file using bencode. Hence, I added gzip support.
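A minimal sketch of that kind of gzip handling, assuming the downloaded body is already in memory as bytes. The function name maybe_gunzip is made up, and this may not be how the bot actually detects gzip. It relies on two facts: a gzip stream starts with the bytes 0x1f 0x8b, while a .torrent file is a bencoded dictionary and so starts with the letter d.

```python
import gzip

# Every gzip stream begins with these two magic bytes.
GZIP_MAGIC = b"\x1f\x8b"


def maybe_gunzip(body):
    """Return the body decompressed if it looks gzipped, otherwise unchanged."""
    if body[:2] == GZIP_MAGIC:
        return gzip.decompress(body)
    return body
```

Checking the bytes themselves, rather than trusting a response header, means it doesn't matter whether the site labels its content correctly.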
This bot gets some false positives on 404s. I'll need to do a bit of research into why this happens, but I'm not too concerned about it right now because it gets more than 80% of the .torrent files downloaded.
It downloads the .torrent files into memory and attempts to parse them. With this parsed information, I get the name, the time it was created, the files in the torrent, and all the other things your torrent client would get.
I just need a place to store this all now.
I've been thinking of just outputting .xml files and having another bot insert all of the .xmls into a database.
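Something like the following might do for the XML idea. The element names (torrent, name, created, files, file) are just a layout I picked for the example, not a format I've settled on.

```python
import xml.etree.ElementTree as ET


def torrent_to_xml(name, created, files):
    """Serialize one torrent's details to XML bytes.

    files is a list of (path, length) pairs; created may be None.
    """
    root = ET.Element("torrent")
    ET.SubElement(root, "name").text = name
    if created is not None:
        ET.SubElement(root, "created").text = str(created)
    files_el = ET.SubElement(root, "files")
    for path, length in files:
        file_el = ET.SubElement(files_el, "file", length=str(length))
        file_el.text = path
    return ET.tostring(root, encoding="utf-8")
```

The other bot would then read each file back with ET.fromstring and insert the values into whatever database ends up being used.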
So, as previously stated, I'll wait a few months and then start this project back up again.
Until then, I'll start working on something else.
A link checker in Python.
I've always been fascinated by Link Checkers. Don't even ask me why, because I wouldn't know how to answer you.
Anyways, that is all.