description
item objects are the regular dicts of python. we can use the following syntax to access the attributes of the class −
>>> item = dmozitem() >>> item['title'] = 'sample title' >>> item['title'] 'sample title'
add the above code to the following example −
import scrapy
from tutorial.items import dmozitem
class myprojectspider(scrapy.spider):
name = "project"
allowed_domains = ["dmoz.org"]
start_urls = [
"http://www.dmoz.org/computers/programming/languages/python/books/",
"http://www.dmoz.org/computers/programming/languages/python/resources/"
]
def parse(self, response):
for sel in response.xpath('//ul/li'):
item = dmozitem()
item['title'] = sel.xpath('a/text()').extract()
item['link'] = sel.xpath('a/@href').extract()
item['desc'] = sel.xpath('text()').extract()
yield item
the output of the above spider will be −
[scrapy] debug: scraped from <200
http://www.dmoz.org/computers/programming/languages/python/books/>
{'desc': [u' - by david mertz; addison wesley. book in progress, full text,
ascii format. asks for feedback. [author website, gnosis software, inc.\n],
'link': [u'http://gnosis.cx/tpip/'],
'title': [u'text processing in python']}
[scrapy] debug: scraped from <200
http://www.dmoz.org/computers/programming/languages/python/books/>
{'desc': [u' - by sean mcgrath; prentice hall ptr, 2000, isbn 0130211192,
has cd-rom. methods to build xml applications fast, python tutorial, dom and
sax, new pyxie open source xml processing library. [prentice hall ptr]\n'],
'link': [u'http://www.informit.com/store/product.aspx?isbn=0130211192'],
'title': [u'xml processing with python']}