A program that can search for activity posts, and also for the main posts (主帖) of a given username

Source: nearby, 2022-04-25 14:26:58 (23939 bytes)
This post was edited by nearby on 2022-06-07 04:30:52.

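邻兄 found some time today to update the earlier program, so this Python program can now search for all the main posts (主帖) a user has made in a forum. For example, searching 书香 for 尘凡无忧 finds 169 main posts. 邻兄 has only 108.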

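Why add this feature? Because today I wanted to look at one of my own old posts, but I had already forgotten its title! It was so hard to find that I simply added this feature to the program.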

 

 

 

# Author: nearby, moderator (版主) of 书香之家, March 2022
#
# This program allows you to
# 1. Collect the posts of an activity in a forum (论坛)
# 2. Search for the main posts (主帖) posted by a given username
#
# Usage of this Python program:
# 0. Make sure that you have Internet access and Python 3 with the requests package installed on your computer (or use a cloud environment)!
# 1. Place this file in a folder, say a folder named "wxc".
# 2. Run it and follow the prompts; everything should work fine.
# 3. The result is stored in 'wxc/sxzj-out.html'. You can then copy/paste the source code of 'sxzj-out.html' into your new WXC page. Done!
#
# Note
# 1. You will also be prompted for your forum's name in English letters. You can look this up in your forum's URL.
#    For example, 书香之家 has the URL https://bbs.wenxuecity.com/sxsj/, so its English name is sxsj.
#    Other examples include: 美语世界 is mysj, 文化走廊 is culture, 诗词欣赏 is poetry, etc.
# 2. By default the entries are organized in reverse chronological order.
# Should you need them in chronological order, comment out the statement
# mylist.reverse() by placing # in front of it, like this: #mylist.reverse()
#
#

import requests


notargets = ['跟帖', '输入关键词', '内容查询', 'input name', '当前', '首页', '上一页', '尾页', '下一页']
notargets.append('archive')
# This is how SXZJ (书香之家) works. When 尘凡无忧 starts an activity, she always marks her activity like this.
notargets.append('##活动##')
# notargets.append('汇总')


def isInside(line, notargets_array):
    for t in notargets_array:
        if t in line:
            return True
    return False
# END

# the line looks like <a href="/sxsj/76799.html" target="_blank"><em>春天的畅想</em>】春天属于女人</a>
# I need it to be <a href="https://bbs.wenxuecity.com/sxsj/76799.html" target="_blank"><em>春天的畅想</em>】春天属于女人</a>
def addHttp(line):
    at = line.split('href="')
    line2 = '<a href="https://bbs.wenxuecity.com' + at[1]
    return line2
# END

def processOneFile(target, html, mylist, searchedURL=None):
    # Split the page text on newlines to get a list of lines.
    # (Named "lines" so that the built-in all() is not shadowed.)
    lines = html.text.split('\n')
    length = len(lines)
    i = 0
    if searchedURL is None:
        # Activity search: keep links whose line contains the activity title.
        while i < length:
            line = lines[i]
            # addHttp needs an href, and the author and date lines must exist after the link.
            if (target in line) and ('href="' in line) and (not isInside(line, notargets)) and (i + 2 < length):
                line = addHttp(line)
                print(line)
                i = i + 1
                # Looks like: [书香之家] - <strong>WXCTEATIME</strong>(6987 bytes ), need to be WXCTEATIME only
                parts = lines[i].replace('</strong>', '<strong>').split('<strong>')
                line2 = parts[1] if len(parts) > 1 else ''
                i = i + 1
                line3 = lines[i]
                line += "  " + line2 + "  " + line3
                mylist.append(line)
            i = i + 1
    else:
        # Username search: keep links to main posts whose next line names the user.
        searchedUsername = '<strong><em>' + target + '</em></strong>'
        while i < length:
            line = lines[i]
            if (searchedURL in line) and ('target="_blank"' in line) and (not isInside(line, notargets)) and (i + 2 < length):
                i = i + 1
                line2 = lines[i]
                # Looks like: <strong><em>nearby</em></strong>,
                if searchedUsername in line2:
                    line = addHttp(line)
                    # Date information added on 6-7-2022; it looks like:    <i>2022-04-24</i>
                    i = i + 1
                    line = line + ' ' + lines[i]
                    mylist.append(line)
                    print(line)
            i = i + 1
# END of FUNCTIONS


# ---- main starts here ----

print()
print('# Author: 书香之家版主 nearby, March 2022')
print()

use = 0 # 0 = activity, 1 = username
try:
    use = int(input("Is your search based on activity name or username? If username, enter 1, otherwise enter 0: "))
except ValueError:
    # Only a non-numeric answer lands here; other errors are no longer swallowed.
    print('Wrong input. Assuming you are searching for activity posts (use=0).')
    use = 0
#
if use == 1:
    # Strip first, so that an answer made only of spaces also falls back to the default.
    target = input('What is the username (for example: nearby or 尘凡无忧)?:  ').strip()
    if target == '':
        target = 'nearby'
    pages = 100  # default
    temp = input('How many pages are there when you search for the username in WXC? (If you do not know, just hit ENTER; the default of 100 is assumed): ').strip()
else:
    target = input('What is the title of your activity (活动)?:  ').strip()
    pages = 3  # default, means there are at most 150 entries
    temp = input('How many pages are there when you search for the activity in WXC? (If you do not know, just hit ENTER): ').strip()
if temp != '':
    try:
        pages = int(temp)
    except ValueError:
        print('Not a number. Using the default of ' + str(pages) + ' pages.')

subid = 'sxsj'
temp = input('What is the name of your 论坛 in English? For example, 书香之家 is sxsj, 美语世界 is mysj, 文化走廊 is culture, 诗词欣赏 is poetry: ')
if len(temp) >= 2:
    subid = temp

mylist = []
# this is the output file.
html2 = open('sxzj-out.html', 'w', encoding='utf-8')

useron = ''
if use == 1:
    useron = 'on'

url = 'https://bbs.wenxuecity.com/bbs/archive.php?SubID='+subid+'&pos=bbs&keyword=' + target + '&username=' + useron

f = requests.get(url)
if use == 1:
    searchedURL = '<a href="/' + subid + '/'
    processOneFile(target, f, mylist, searchedURL)
else:
    processOneFile(target, f, mylist)
for i in range(1, pages):
    url = 'https://bbs.wenxuecity.com/bbs/archive.php?page=' + str(i) + '&SubID=' + subid +'&pos=bbs&keyword=' + target + '&username=' + useron
    f = requests.get(url)
    if use == 1:
        searchedURL = '<a href="/' + subid + '/'
        processOneFile(target, f, mylist, searchedURL)
    else:
        processOneFile(target, f, mylist)

if use != 1:
    mylist.reverse()

for li in mylist:
    html2.write('<p>' + li+'\n')
html2.close()

print("\n")
print(str(len(mylist)) + " entries")
print("\n")
print("Please check the file sxzj-out.html. The result is in it! Thanks for using this program. ---- 虎哥 / Nearby / 邻兄")
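
The program above builds its search URLs by joining strings and rewrites links with split(). As a minimal sketch only, and not part of 邻兄's program, the same two steps could be written with the standard library so that a keyword containing characters such as & or # is percent-encoded. The helper names build_search_url and absolutize_links are hypothetical; the query parameter names are copied from the URLs in the program, and their meaning on the server is assumed, not verified. Unlike addHttp, the second helper keeps any text in front of the link and rewrites every href on the line.

```python
import re
from urllib.parse import parse_qs, urlencode, urljoin, urlsplit

# Base address taken from the URLs used in the program above.
BASE = 'https://bbs.wenxuecity.com'


def build_search_url(subid, keyword, by_username=False, page=None):
    """Return an archive search URL with every parameter percent-encoded.

    Hypothetical helper: the parameter names mirror the query string
    that the original program assembles by hand.
    """
    params = {}
    if page is not None:
        params['page'] = page
    params['SubID'] = subid
    params['pos'] = 'bbs'
    params['keyword'] = keyword
    # The original program sends "on" for a username search, otherwise an empty value.
    params['username'] = 'on' if by_username else ''
    return BASE + '/bbs/archive.php?' + urlencode(params)


def absolutize_links(line):
    """Rewrite every href in one HTML line to an absolute URL under BASE."""
    return re.sub(
        r'href="([^"]*)"',
        lambda m: 'href="' + urljoin(BASE + '/', m.group(1)) + '"',
        line,
    )


if __name__ == '__main__':
    print(build_search_url('sxsj', '尘凡无忧', by_username=True))
    print(absolutize_links('<a href="/sxsj/76799.html" target="_blank">春天属于女人</a>'))
```

If this were used, each requests.get(url) call in the program could take the string returned by build_search_url, and absolutize_links could stand in for addHttp where keeping the text around the link is acceptable.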

 

 

All replies:

A newbie passing by -望沙- 04/25/2022 14:47:48

Just tell your senior brother (师兄) when you need it, and he will run the search for you -nearby- 04/25/2022 15:00:44

A bit hard to learn... -lovecat08- 04/25/2022 16:23:18

Python is fairly easy -nearby- 04/25/2022 16:53:13

Kudos for 邻兄's program, it is written like a poem.. -东风再起- 04/25/2022 16:27:39

It does look like one :-) -nearby- 04/25/2022 16:50:00

Kudos! -WXCTEATIME- 04/25/2022 17:23:25

Kudos for being so thoughtful! -applebee3- 04/25/2022 17:40:17

Impressive! Though I can only admire it without understanding it :) -浮云驰- 04/26/2022 02:07:28

An expert! -laopika- 04/26/2022 10:17:44
