Module pyweb :: Class htmlRipper

Class htmlRipper

ParserBase --+    
             |    
    HTMLParser --+
                 |
                htmlRipper


This is a handy class that supports the use of HTML files as templates. It is intended to be used in conjunction with the 'template=' constructor keyword for webwidget objects.

Think of it as a fast 'poor-man's DOM'.

With this class, you can prepare an HTML file with your favourite editor (Mozilla Composer, OpenOffice.org, Emacs/vim, cat, etc.), then pass it to the constructor of this class.

Methods of this class let you search for content in the file, by tag, id or any attributes, and return that content in its original rendered form.

See the method docstrings for more info.
Method Summary
  __init__(self, fileOrStr, **kw)
  __getitem__(self, idx)
This allows a dirty shorthand for extracting content by tag, instance number, id and/or attributes.
  getEntity(self, entity, contentsOnly)
Lower-level method which returns the nth instance of an entity.
  getEntityTag(self, tag, idx, **kw)
Searches for the nth instance of an entity with matching tag name and/or attributes
  getId(self, id)
Renders the entity named 'id' (i.e., the tag with an attribute 'id' set to id), or returns an empty string if no entity is found.
  getItem(self, item)
Renders an item back to its raw HTML. The item can be an index or an entity dict.
  getRange(self, fromidx, toidx)
Renders a range of raw items, as with getItem. This is probably too low-level to be of much use.
  handle_comment(self, data)
  handle_data(self, data)
  handle_decl(self, data)
  handle_endtag(self, tag)
  handle_startendtag(self, tag, attr)
  handle_starttag(self, tag, attrs)
    Inherited from HTMLParser
  check_for_whole_start_tag(self, i)
  clear_cdata_mode(self)
  close(self)
Handle any buffered data.
  error(self, message)
  feed(self, data)
Feed data to the parser.
  get_starttag_text(self)
Return full source of start tag: '<...>'.
  goahead(self, end)
  handle_charref(self, name)
  handle_entityref(self, name)
  handle_pi(self, data)
  parse_comment(self, i, report)
  parse_endtag(self, i)
  parse_pi(self, i)
  parse_starttag(self, i)
  reset(self)
Reset this instance.
  set_cdata_mode(self)
  unescape(self, s)
  unknown_decl(self, data)
    Inherited from ParserBase
  getpos(self)
Return current line number and offset.
  parse_declaration(self, i)
  parse_marked_section(self, i, report)
  updatepos(self, i, j)

Class Variable Summary
    Inherited from HTMLParser
tuple CDATA_CONTENT_ELEMENTS = ('script', 'style')

Method Details

__getitem__(self, idx)
(Indexing operator)

This allows a dirty shorthand for extracting content by tag, instance number, id and/or attributes.

Arguments:
  • idx - can be a string or tuple of args - see examples below
Examples:
 p = htmlRipper("somefile.html")

 s = p['fred']
   gets the entity with an attribute 'id', with value 'fred'

 s = p['-fred']
   gets the entity with an attribute 'id', with value 'fred', but
   without its start/end tags

 s = p['td',]
   gets the first '<td...>' entity

 s = p['td', 2]
   gets the third '<td...>' entity

 s = p['td', 4, {'colspan':3}]
   gets the fifth '<td...>' entity with attr 'colspan' set to 3
Returns:
  • fully rendered text of the retrieved entity if one exists, empty string otherwise
Note:
  • by default, returns the extracted tag and its contents. If the 'tag' is prefixed with a hyphen, then only the contents (not the start/end tags) are returned.
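
The index forms listed above can be pictured as a small dispatch step. The following is only a sketch under stated assumptions, written for Python 3: parse_index is a hypothetical helper invented for illustration and is not part of pyweb, and the real __getitem__ may well handle its arguments differently.

```python
def parse_index(idx):
    """Normalize a subscript into (kind, name, instance, attrs, contents_only).

    Hypothetical helper for illustration only; not pyweb's implementation.
    """
    if isinstance(idx, str):
        # A bare string is treated as an id lookup, e.g. p['fred'].
        contents_only = idx.startswith("-")
        name = idx[1:] if contents_only else idx
        return ("id", name, 0, {}, contents_only)

    # Otherwise expect a tuple: (tag,), (tag, n) or (tag, n, attrs).
    tag = idx[0] if len(idx) > 0 else ""
    instance = idx[1] if len(idx) > 1 else 0
    attrs = dict(idx[2]) if len(idx) > 2 else {}

    # A leading hyphen on the tag asks for the contents only.
    contents_only = tag.startswith("-")
    if contents_only:
        tag = tag[1:]
    return ("tag", tag, instance, attrs, contents_only)


print(parse_index("-fred"))
print(parse_index(("td", 4, {"colspan": 3})))
```

Note that instance numbers count from zero, which is why p['td', 2] refers to the third matching entity.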

getEntity(self, entity, contentsOnly=0)

Lower-level method which returns the nth instance of an entity.

Note that the order of entities is the order in which their tags are closed in the original file.

Returns the entity's text fully rendered.

The optional 'contentsOnly' argument, if true, causes only the *contents* of the entity, and *not* its opening/closing tags, to be returned.
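
The close-order rule means that nested entities are numbered before the entities that enclose them. A minimal sketch of that ordering, assuming Python 3 and the standard library's html.parser rather than pyweb itself (CloseOrderRecorder is a name made up for this illustration):

```python
from html.parser import HTMLParser


class CloseOrderRecorder(HTMLParser):
    """Records tag names in the order their end tags are seen."""

    def __init__(self):
        super().__init__()
        self.closed = []

    def handle_endtag(self, tag):
        self.closed.append(tag)


rec = CloseOrderRecorder()
rec.feed("<table><tr><td>a</td><td>b</td></tr></table>")
rec.close()
print(rec.closed)  # ['td', 'td', 'tr', 'table']
```

Under this ordering the two inner 'td' entities come first and the enclosing 'table' comes last, even though the 'table' tag opens first in the file.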

getEntityTag(self, tag='', idx=0, **kw)

Searches for the nth instance of an entity with a matching tag name and/or attributes.

Arguments:
  • tag - text name of tag to look for, or '' or None to find any tag - default is match any tag
  • idx - which instance of matching item to return, default 0
Keywords:
  • attributes to match
Returns:
  • the fully rendered tag entity, or '' if not found

Note - if the 'tag' argument is prefixed with a hyphen '-', then only the tag's contents are returned - the opening and closing tag are dropped.

Note - access to this method is short-handed in the __getitem__ method, which perversely allows the htmlRipper object to be subscripted. See __getitem__ for more info.
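
The matching rule described for this method (optional tag name, zero-based instance, attribute keywords) might look roughly like the sketch below. It is written for Python 3 and makes several assumptions: find_nth is a hypothetical function, the entity dict layout with 'tag' and 'attrs' keys is invented for the example, and comparing attribute values as strings is a guess at how a numeric value such as colspan=3 could be matched against parsed HTML.

```python
def find_nth(entities, tag="", idx=0, **attrs):
    """Return the idx-th entity matching tag and attrs, or None.

    Hypothetical sketch; 'entities' is a list of dicts shaped like
    {"tag": str, "attrs": dict}, a layout assumed for this example.
    """
    def matches(entity):
        # An empty or None tag matches any tag name.
        if tag and entity["tag"] != tag:
            return False
        # Every requested attribute must be present with an equal value;
        # values are compared as strings because parsed HTML yields strings.
        return all(
            key in entity["attrs"] and str(entity["attrs"][key]) == str(value)
            for key, value in attrs.items()
        )

    found = [entity for entity in entities if matches(entity)]
    return found[idx] if 0 <= idx < len(found) else None


entities = [
    {"tag": "td", "attrs": {"colspan": "3"}},
    {"tag": "p", "attrs": {"id": "fred"}},
    {"tag": "td", "attrs": {}},
    {"tag": "td", "attrs": {"colspan": "3"}},
]
print(find_nth(entities, "td", 1, colspan=3))
```

Unlike getEntityTag, which returns rendered text or '', this sketch returns the matching dict or None so that the selection logic stays separate from rendering.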

getId(self, id)

Renders the entity named 'id' (i.e., the tag with an attribute 'id' set to id), or returns an empty string if no entity is found.

If found, returns the full text of the original entity, including its opening/closing tags and all content.

If the id is prefixed with a hyphen '-', the entity's start/end tags are not included - only the contents.
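
To make the behaviour concrete, here is a rough approximation of an id lookup built on Python 3's standard html.parser. It is a sketch under stated assumptions, not pyweb's code: IdExtractor and get_id are hypothetical names, the closing tag is re-generated rather than copied from the source, and comments, declarations and self-closing target elements are ignored.

```python
from html.parser import HTMLParser


class IdExtractor(HTMLParser):
    """Collects the text of the first element whose id matches."""

    def __init__(self, wanted_id):
        # Keep character references such as &amp; unconverted.
        super().__init__(convert_charrefs=False)
        self.wanted_id = wanted_id
        self.tag = None   # tag name of the matched element
        self.depth = 0    # nesting depth of tags with that same name
        self.done = False
        self.start = ""   # raw start tag of the matched element
        self.inner = []   # pieces found between its start and end tags

    def _capturing(self):
        return self.tag is not None and not self.done

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.tag is None:
            if dict(attrs).get("id") == self.wanted_id:
                self.tag, self.depth = tag, 1
                self.start = self.get_starttag_text()
            return
        if tag == self.tag:
            self.depth += 1
        self.inner.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if self._capturing():
            self.inner.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self._capturing():
            return
        if tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True
                return
        self.inner.append("</%s>" % tag)

    def handle_data(self, data):
        if self._capturing():
            self.inner.append(data)

    def handle_entityref(self, name):
        if self._capturing():
            self.inner.append("&%s;" % name)

    def handle_charref(self, name):
        if self._capturing():
            self.inner.append("&#%s;" % name)


def get_id(html, ident):
    """Return the element with the given id, or '' if there is none."""
    contents_only = ident.startswith("-")
    if contents_only:
        ident = ident[1:]
    parser = IdExtractor(ident)
    parser.feed(html)
    parser.close()
    if not parser.done:
        return ""
    inner = "".join(parser.inner)
    if contents_only:
        return inner
    return parser.start + inner + "</%s>" % parser.tag


page = '<div><p id="fred">Hi <b>there</b> &amp; you</p></div>'
print(get_id(page, "fred"))
print(get_id(page, "-fred"))
```

Counting only tags with the same name as the matched element keeps the sketch tolerant of void elements such as a bare br inside the target, which never produce an end tag.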

getItem(self, item)

Renders an item back to its raw HTML. The item can be an index or an entity dict.

Note that this only returns the opening tag, so it probably won't be of much use to clients.

getRange(self, fromidx, toidx)

Renders a range of raw items, as with getItem. This is probably too low-level to be of much use.

Generated by Epydoc 2.0 on Sat Feb 7 20:08:05 2004 http://epydoc.sf.net