I'm scraping IMDb and storing the results in a MySQL database. For example, on this page, http://www.imdb.com/title/tt2474024/reviews?start=0, one of the passages is:
Surprisingly, it's not the talented leads that provide the most interest here … it's the story structure. As per the title, the story follows the couple's relationship over a five year period. The opening scene features Cathy reading and reacting to the break-up note left by Jamie. The second scene features Jamie describing his joy when he first falls for Cathy, as they romp in bed. See, Cathy's story goes from the end to the beginning, while Jamie's story goes from the beginning to the end … intersecting only at the marriage proposal in the park. It's a fascinating way to tell a story – not just two perspectives, but also in reverse order of each other!
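When I store this passage I get Warning: Incorrect String value..., and the ellipsis character is what triggers it. Chrome reports the page encoding as Western (windows-1252), and when I fetch the page with requests I see r.encoding=iso-8859-1. My MySQL storage encoding is the default (probably utf-8?). In my code I wrote: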
`content = p.text_content().strip('\n').decode('iso-8859-1').encode('utf8')`
Changing the decode argument to windows-1252 doesn't work either.
I also tried
`content = unicode(p.text_content().strip('\n'))`, but neither attempt works. How can I fix this?
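To make the encoding mismatch concrete, here is a minimal sketch of how the same byte decodes under the two encodings mentioned above. It is written for Python 3 (the snippets in the question are Python 2), and it assumes the ellipsis arrives as the single windows-1252 byte 0x85. The sample bytes and the helper `to_utf8` are hypothetical, for illustration only.

```python
# Hypothetical sample: part of the review as a windows-1252 page would send it.
# In windows-1252 the byte 0x85 is the horizontal ellipsis.
raw = b"interest here \x85 it's the story structure"

# iso-8859-1 maps 0x85 to the control character U+0085, not to an ellipsis.
as_latin1 = raw.decode("iso-8859-1")

# windows-1252 maps 0x85 to U+2026, the ellipsis the reader actually sees.
as_cp1252 = raw.decode("windows-1252")


def to_utf8(raw_bytes, source_encoding="windows-1252"):
    """Decode bytes from the page's encoding and re-encode them as UTF-8."""
    return raw_bytes.decode(source_encoding).encode("utf-8")


utf8_bytes = to_utf8(raw)
```

With requests, the equivalent step would probably be decoding `r.content` explicitly, or setting `r.encoding = "windows-1252"` before reading `r.text`, instead of relying on the iso-8859-1 guess. If the warning persists after that, the character set of the MySQL connection and of the target column may also need checking; that part is an assumption and is not confirmed by anything in the question.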