jnosal

On programming and stuff

Don't miss Python articles on reddit


If you've ever wanted to build a personal "must read" list of Python-related articles, or to search through posts published a while ago, here is a simple way to tackle the problem:

Crawl reddit's r/Python:

# -*- coding: utf-8 -*-

from scrapy.contrib.spiders import CrawlSpider
from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector


class RedditItem(Item):
    title = Field()
    url = Field()


class RpythonSpider(CrawlSpider):
    name = "rpython"
    allowed_domains = [
        "reddit.com"
    ]
    start_urls = (
        'http://www.reddit.com/r/Python/new/',
    )
    # keep track of article links and pagination links we've already seen
    crawled = set()
    crawled_prev_next = set()

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        # follow the "next" pagination link, if we haven't visited it yet
        prev_next = hxs.select('//span[@class="nextprev"]//a').\
            select('@href').extract()

        if len(prev_next) > 0:
            url = prev_next[-1]
            if url not in self.crawled_prev_next:
                self.crawled_prev_next.add(url)
                yield self.make_requests_from_url(url)

        # extract the title link of every post on the listing page
        for link in hxs.select('//*[@id="siteTable"]//div//p[1]/a'):
            url = link.select('@href').extract()[0]
            title = link.select('text()').extract()[0]

            # skip duplicates and self-posts that point back to /r/Python/
            if (len(url) and len(title) and url not in self.crawled
                    and '/r/Python/' not in url):
                item = RedditItem()
                item['title'] = title
                item['url'] = url
                self.crawled.add(url)
                yield item

That's a very basic Scrapy spider, which you can invoke, for instance, like this:

scrapy crawl rpython -o file.json -t json

and store all links with titles that lead to external blogs/services. That way you won't miss a single article :-). Hook this script up to a database, add a published_at field, and get notified only when there is new stuff.
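
As a rough illustration of that last idea, here is a minimal sketch of such a database hook as a Scrapy item pipeline. It assumes SQLite; the class name, the reddit.db file and the reddit_items table are just placeholders, not part of the spider above:

# -*- coding: utf-8 -*-

import sqlite3
from datetime import datetime


class SqliteStorePipeline(object):
    """Store scraped items in SQLite, stamping them with a published_at field."""

    def open_spider(self, spider):
        self.conn = sqlite3.connect('reddit.db')
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS reddit_items '
            '(url TEXT PRIMARY KEY, title TEXT, published_at TEXT)'
        )

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        try:
            # stamp the item with the time it was first seen
            self.conn.execute(
                'INSERT INTO reddit_items (url, title, published_at) '
                'VALUES (?, ?, ?)',
                (item['url'], item['title'], datetime.utcnow().isoformat())
            )
            self.conn.commit()
            # a fresh row means a new article - this is where you'd notify yourself
        except sqlite3.IntegrityError:
            # URL already stored, nothing new
            pass
        return item

Register the pipeline in your project's ITEM_PIPELINES setting and plug whatever notification you like (mail, IRC, ...) into the spot marked in process_item.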

Enjoy!
