Python lxml Tutorial

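Over the past couple of days I needed to process XML, so I looked into the lxml library. This is a summary. When handling XML, the three questions I most wanted answered were:

Question 1: Given an XML file, how do I parse it?
Question 2: After parsing, how do I find and locate a particular tag?
Question 3: Once a tag is located, how do I work with it, for example access its attributes and text content?

This article is organized around these three questions. All the code in it was run successfully under Python 3.5.

Before starting, import the module. The library's commonly used XML features all live in lxml.etree, which can be imported as follows:

    >>> from lxml import etree

The Element class

This section answers Question 3.

Element is the core class for XML processing. An Element object can be thought of as a node in the XML document, and most node handling revolves around this class. This part covers three topics: operations on nodes, operations on node attributes, and operations on the text inside a node.

Node operations

1. Creating an Element object

Call Element directly; the argument is the node name.

    >>> root = etree.Element('root')
    >>> print(root)
    <Element root at 0x2da0708>

2. Getting the node name

Use the tag attribute to get the name of the node.

    >>> print(root.tag)
    root

3. Outputting the XML content

Use the tostring function to output the XML content (more on this later). Its argument is an Element object.

    >>> print(etree.tostring(root))
    b'<root/>'

4. Adding child nodes

Use SubElement to create a child node. The first argument is the parent Element and the second is the name of the child.

    >>> child1 = etree.SubElement(root, 'child1')
    >>> child2 = etree.SubElement(root, 'child2')
    >>> child3 = etree.SubElement(root, 'child3')
    >>> print(etree.tostring(root))
    b'<root><child1/><child2/><child3/></root>'

5. Deleting child nodes

Use the remove method to delete a given node; its argument is an Element object. The clear method removes all child nodes (it also resets the element's attributes and text).

    >>> root.remove(child1)  # delete the given child node
    >>> print(etree.tostring(root))
    b'<root><child2/><child3/></root>'
    >>> root.clear()  # remove all child nodes
    >>> print(etree.tostring(root))
    b'<root/>'

6. Working with child nodes as a list

The child nodes of an Element can be treated as a list. The examples below start again from a root that holds child1, child2 and child3, rebuilt as in step 4.

    >>> child = root[0]  # index access
    >>> print(child.tag)
    child1
    >>> print(len(root))  # number of child nodes
    3
    >>> root.index(child2)  # get the index of a child
    1
    >>> for child in root:  # iteration
    ...     print(child.tag)
    child1
    child2
    child3
    >>> root.insert(0, etree.Element('child0'))  # insertion
    >>> start = root[:1]  # slicing
    >>> end = root[-1:]
    >>> print(start[0].tag)
    child0
    >>> print(end[0].tag)
    child3
    >>> root.append(etree.Element('child4'))  # append at the end
    >>> print(etree.tostring(root))
    b'<root><child0/><child1/><child2/><child3/><child4/></root>'

The two deletion methods described earlier, remove and clear, also behave like their list counterparts.

7. Getting the parent node

Use the getparent method to get the parent node.

    >>> print(child1.getparent().tag)
    root

Attribute operations

Attributes are stored as key-value pairs, just like a dictionary.

1. Creating attributes

Attributes can be created together with the Element object by passing them as keyword arguments, where the keyword is the attribute name and the value is the attribute value:

    >>> root = etree.Element('root', interesting='totally')
    >>> print(etree.tostring(root))
    b'<root interesting="totally"/>'

The set method adds an attribute to an existing Element object. Its two arguments are the attribute name and the attribute value:

    >>> root.set('hello', 'Huhu')
    >>> print(etree.tostring(root))
    b'<root interesting="totally" hello="Huhu"/>'

2. Getting attributes

Reading attributes also works much as it does for a dictionary. The examples speak for themselves:

    >>> # get returns the value of a single attribute
    >>> print(root.get('interesting'))
    totally
    >>> # keys returns all attribute names
    >>> sorted(root.keys())
    ['hello', 'interesting']
    >>> # items returns all name-value pairs
    >>> for name, value in sorted(root.items()):
    ...     print('%s = %r' % (name, value))
    hello = 'Huhu'
    interesting = 'totally'

The attrib property returns all attributes and their values at once as a dict-like object:

    >>> attributes = root.attrib
    >>> print(attributes)
    {'interesting': 'totally', 'hello': 'Huhu'}
    >>> attributes['good'] = 'Bye'  # changes to the dict affect the node
    >>> print(root.get('good'))
    Bye

Text operations

With tags and their attributes covered, only the text inside a tag remains. Text content can be accessed through the text and tail properties or through XPath.

1. The text and tail properties

In the common case, the text of a tag is available through the text property of the Element.

    >>> root = etree.Element('root')
    >>> root.text = 'Hello, World!'
    >>> print(root.text)
    Hello, World!
    >>> print(etree.tostring(root))
    b'<root>Hello, World!</root>'

XML tags generally come in pairs, an opening tag and a closing tag, but formats such as HTML may contain standalone tags, like the <br/> in this snippet:

    <html><body>Text<br/>Tail</body></html>

For this case the Element class provides the tail property, which holds the text that directly follows an element, up to the next tag.

    >>> html = etree.Element('html')
    >>> body = etree.SubElement(html, 'body')
    >>> body.text = 'Text'
    >>> print(etree.tostring(html))
    b'<html><body>Text</body></html>'
    >>> br = etree.SubElement(body, 'br')
    >>> print(etree.tostring(html))
    b'<html><body>Text<br/></body></html>'
    >>> # tail only appends text after the tag
    >>> br.tail = 'Tail'
    >>> print(etree.tostring(br))
    b'<br/>Tail'
    >>> print(etree.tostring(html))
    b'<html><body>Text<br/>Tail</body></html>'
    >>> # passing method='text' to tostring drops the tags and outputs all the text
    >>> print(etree.tostring(html, method='text'))
    b'TextTail'

2. The XPath approach

    >>> # Approach 1: drop the standalone tags and return the text
    >>> print(html.xpath('string()'))
    TextTail
    >>> # Approach 2: return a list, split at the standalone tags
    >>> print(html.xpath('//text()'))
    ['Text', 'Tail']

Each item in the list returned by approach 2 carries information about the node it belongs to and about its text type, as follows:

    >>> texts = html.xpath('//text()')
    >>> print(texts[0])
    Text
    >>> # owning node
    >>> parent = texts[0].getparent()
    >>> print(parent.tag)
    body
    >>> print(texts[1], texts[1].getparent().tag)
    Tail br
    >>> # text type: ordinary text or tail text
    >>> print(texts[0].is_text)
    True
    >>> print(texts[1].is_text)
    False
    >>> print(texts[1].is_tail)
    True

Parsing and output

This section answers Question 1.

This part describes how to parse XML into Element objects and how to output Element objects as XML.

1. Parsing

The three commonly used parsing functions are fromstring, XML and HTML. All of them take a string as their argument. To read from a file on disk instead, etree.parse accepts a file name or file object and returns an ElementTree, whose getroot method gives the root Element.

    >>> xml_data = '<root>data</root>'
    >>> # fromstring
    >>> root1 = etree.fromstring(xml_data)
    >>> print(root1.tag)
    root
    >>> print(etree.tostring(root1))
    b'<root>data</root>'
    >>> # XML, essentially the same as fromstring
    >>> root2 = etree.XML(xml_data)
    >>> print(root2.tag)
    root
    >>> print(etree.tostring(root2))
    b'<root>data</root>'
    >>> # HTML, which adds the html and body tags automatically if they are missing
    >>> root3 = etree.HTML(xml_data)
    >>> print(root3.tag)
    html
    >>> print(etree.tostring(root3))
    b'<html><body><root>data</root></body></html>'

2. Output

Output is simply the tostring function used throughout this article. Two more of its parameters are worth mentioning: xml_declaration, which adds the XML declaration, and encoding, which specifies the encoding.

    >>> root = etree.XML('<root><a><b/></a></root>')
    >>> print(etree.tostring(root))
    b'<root><a><b/></a></root>'
    >>> # XML declaration
    >>> print(etree.tostring(root, xml_declaration=True))
    b"<?xml version='1.0' encoding='ASCII'?>\n<root><a><b/></a></root>"
    >>> # specifying the encoding
    >>> print(etree.tostring(root, encoding='iso-8859-1'))
    b"<?xml version='1.0' encoding='iso-8859-1'?>\n<root><a><b/></a></root>"

ElementPath

This section answers Question 2.

Before discussing ElementPath, the ElementTree class needs introducing. An ElementTree object can be understood as a complete XML tree in which every node is an Element object. ElementPath is the counterpart of XPath for this tree: a small XPath-like path language used to search for and locate Element objects.

Two commonly used methods are introduced here. They cover most search and query needs, and both take a path expression as their argument:

findall(): returns all matching elements as a list
find(): returns the first matching element

    >>> root = etree.XML("<root><a x='123'>aText<b/><c/><b/></a></root>")
    >>> # find the first b tag; b is not a direct child of root, so nothing matches
    >>> print(root.find('b'))
    None
    >>> print(root.find('a').tag)
    a
    >>> # find all b tags; returns a list of Element objects
    >>> [b.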
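The find() and findall() methods described above can be sketched on the same sample tree as follows. This is a minimal illustration rather than the author's original code: the './/b' path, the attribute predicate and the use of findtext() are additions chosen for the example, and the fallback import is an assumption that relies on the standard library's xml.etree.ElementTree offering the same find/findall interface.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, which provides
    # the same find/findall/findtext methods used below.
    import xml.etree.ElementTree as etree

root = etree.XML("<root><a x='123'>aText<b/><c/><b/></a></root>")

# A bare tag name only matches direct children, so 'b' finds nothing here.
direct_b = root.find('b')

# './/b' searches all descendants; find() returns the first match.
first_b = root.find('.//b')

# findall() returns every match as a list of Element objects.
all_b_tags = [b.tag for b in root.findall('.//b')]

# A predicate in square brackets filters on an attribute value.
a = root.find("a[@x='123']")

# findtext() returns the text of the first matching element.
a_text = root.findtext('a')

print(direct_b)      # None
print(first_b.tag)   # b
print(all_b_tags)    # ['b', 'b']
print(a.get('x'))    # 123
print(a_text)        # aText
```

The leading './/' is what makes the search recursive; without it the path is evaluated relative to the direct children of the element the method is called on.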

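To tie the three questions together, here is a short end-to-end sketch: parse, locate, then operate. The library/book document, the id and lang attributes and the variable names are all hypothetical and chosen only for illustration. An in-memory io.BytesIO object stands in for a real file so that the example is self-contained, and the fallback import is again an assumption for environments without lxml.

```python
import io

try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library offers the same calls used here.
    import xml.etree.ElementTree as etree

# Hypothetical sample document standing in for an XML file on disk.
xml_bytes = b"<library><book id='1'>Dune</book><book id='2'>Emma</book></library>"

# Question 1: parse. parse() takes a file name or file-like object
# and returns an ElementTree; getroot() gives the root Element.
tree = etree.parse(io.BytesIO(xml_bytes))
root = tree.getroot()

# Question 2: locate elements with find() and findall().
books = root.findall('book')
second = root.find("book[@id='2']")

# Question 3: read and modify text and attributes.
titles = [book.text for book in books]
second.set('lang', 'en')
second.text = 'Emma (1815)'

result = etree.tostring(root)
print(titles)   # ['Dune', 'Emma']
print(result)
```

With a real file, the io.BytesIO object would simply be replaced by the file's path.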