@hudsonsferreira
Created September 12, 2012 16:22
Script to extract the body from content.xml
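import re
import xml.etree.ElementTree as etree  # stdlib successor of the standalone elementtree package


def get_text_element(tree):
    # Return the first element whose tag (namespace URI included) contains
    # 'text' but not 'text-'; returns None if no element matches.
    for i in tree.iter():
        if 'text' in i.tag and 'text-' not in i.tag:
            return i


def return_the_text(element):
    # Serialize to a str; unregistered namespaces get generated prefixes such as ns0, ns1.
    string = etree.tostring(element, encoding='unicode')
    # Keep what lies between the first closing annotation tag and the last closing paragraph tag.
    text = re.search('</ns0:annotation>.*</ns1:p>', string).group(0)
    text = re.sub('</ns0:annotation>', '', text)
    text = re.sub('</ns1:p>', '', text)
    return text


tree = etree.parse('content.xml')
print(return_the_text(get_text_element(tree)))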
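The script above depends on the generated `ns0`/`ns1` prefixes and on an annotation being present in the serialized body. As a hedged alternative, here is a minimal sketch that looks elements up by their namespace-qualified tags instead of running a regex over serialized XML. It is not the author's method: the function name `extract_paragraphs` and the `SAMPLE` document are made up for illustration, and it assumes the body lives in `office:text` with paragraphs in `text:p`. Unlike the original, it returns plain paragraph text and does not strip annotation content.

```python
import xml.etree.ElementTree as etree

# OpenDocument namespace URIs for the office: and text: prefixes.
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def extract_paragraphs(xml_source):
    """Return the text of each text:p paragraph found under office:text.

    Hypothetical helper, not part of the original gist.
    """
    root = etree.fromstring(xml_source)
    # Locate the document body by its fully qualified tag name.
    body = root.find(".//{%s}text" % OFFICE_NS)
    if body is None:
        return []
    # itertext() joins the paragraph's own text with that of nested elements.
    return ["".join(p.itertext()) for p in body.iter("{%s}p" % TEXT_NS)]


# Small made-up stand-in for a real content.xml.
SAMPLE = (
    '<office:document-content'
    ' xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"'
    ' xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">'
    '<office:body><office:text>'
    '<text:p>Hello <text:span>world</text:span></text:p>'
    '<text:p>Second paragraph</text:p>'
    '</office:text></office:body>'
    '</office:document-content>'
)

print(extract_paragraphs(SAMPLE))
```

For a real file, the contents of `content.xml` can be read with `open('content.xml', 'rb').read()` and passed to `extract_paragraphs`.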