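A program that can search for activity posts, and also for the top-level posts (主帖) made by a given username
邻兄 found some time today to update the earlier program, so this Python program can now find every top-level post a user has made in a forum. For example, searching 书香 for 尘凡无忧 finds 169 top-level posts. 邻兄 has only 108.
Why add this feature? Because today I wanted to look at one of my own old posts, but I had already forgotten its title! It was so hard to find that I simply added this feature to the program!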
# Author: nearby, moderator (版主) of 书香之家, March 2022
#
# This program allows you to
# 1. Collect your activity posts in a forum (论坛)
# 2. Search for the top-level posts (主帖) made by a given username
#
# Usage of this Python program:
# 0. Make sure that you have Internet access and Python 3 with the requests package installed on your computer (or use Cloud)!
# 1. Place this file in a folder. Say, in a folder named "wxc"
# 2. Follow the instructions prompted by the program, everything should work fine.
# 3. The result is stored inside 'wxc/sxzj-out.html'. You can then copy/paste the source code of 'sxzj-out.html' into your WXC new page. Done!
#
# Note
# 1. You will also be prompted for your forum's name in alphabets/English. You can look this up in your forum.
#    For example, 书香之家 has the URL https://bbs.wenxuecity.com/sxsj/, so its English name is sxsj.
#    Other examples include: 美语世界 is mysj, 文化走廊 is culture, 诗词欣赏 is poetry, etc.
# 2. By default the entries are organized in reverse chronological order.
#    Should you need them to be placed in chronological order, please do:
#    Comment out the statement: mylist.reverse() by placing # in front of it, like: #mylist.reverse()
#
#
import requests

# Lines containing any of these strings are navigation or search-form markup, not posts.
notargets = ['跟帖', '输入关键词', '内容查询', 'input name', '当前', '首页', '上一页', '尾页', '下一页']
notargets.append('archive')
# This is how SXZJ (书香之家) works. When 尘凡无忧 starts an activity, she always marks her activity like this.
notargets.append('##活动##')
# notargets.append('汇总')


def isInside(line, notargets_array):
    for t in notargets_array:
        if t in line:
            return True
    return False
# END


# the line looks like <a href="/sxsj/76799.html" target="_blank">【<em>春天的畅想</em>】春天属于女人</a>
# I need it to be <a href="https://bbs.wenxuecity.com/sxsj/76799.html" target="_blank">【<em>春天的畅想</em>】春天属于女人</a>
def addHttp(line):
    at = line.split('href="')
    line2 = '<a href="https://bbs.wenxuecity.com' + at[1]
    return line2
# END


def processOneFile(target, html, mylist, searchedURL=None):
    # split the text by newline character to get an array of string
    lines = html.text.split('\n')
    length = len(lines)
    i = 0
    if searchedURL is None:
        # activity search: the title line contains the activity name
        while i < length:
            line = lines[i]
            if (target in line) and (not isInside(line, notargets)):
                line = addHttp(line)
                print(line)
                i = i + 1
                line2 = lines[i]
                # look like: [书香之家] - <strong>WXCTEATIME</strong>(6987 bytes ), need to be WXCTEATIME only
                line2 = line2.replace('</strong>', '<strong>').split('<strong>')[1]
                i = i + 1
                line3 = lines[i]
                line += " " + line2 + " " + line3
                mylist.append(line)
            i = i + 1
    else:
        # username search: the line after the link carries the highlighted username
        searchedUsername = '<strong><em>' + target + '</em></strong>'
        while i < length:
            line = lines[i]
            if (searchedURL in line) and ('target="_blank"' in line) and (not isInside(line, notargets)):
                i = i + 1
                line2 = lines[i]
                # look like: <strong><em>nearby</em></strong>,
                if searchedUsername in line2:
                    line = addHttp(line)
                    # add date information on 6-7-2022, which looks like: <i>2022-04-24</i>
                    i = i + 1
                    line = line + ' ' + lines[i]
                    mylist.append(line)
                    print(line)
            i = i + 1
# END of FUNCTIONS


# ---- main starts here ----
print()
print('# Author: nearby, moderator (版主) of 书香之家, March 2022')
print()

use = 0  # 0 = activity, 1 = username
try:
    use = int(input("Is your search based on activity name or username? If username, enter 1, otherwise enter 0: "))
except ValueError:
    print('Wrong input. Assume you are searching for activity posts. use=0')
    use = 0
#
if use == 1:
    target = input('What is the username (for example: nearby or 尘凡无忧)?: ')
    if target == '':
        target = 'nearby'
    else:
        target = target.strip()
    pages = 100  # default
    temp = input('How many pages are there when you search for the username in WXC? (If you do not know, just Hit ENTER, default 100 is assumed): ')
    if temp != '':
        pages = int(temp)
else:
    target = input('What is the title of your activity (活动)?: ')
    target = target.strip()
    pages = 3  # default, means there are maximum 150 entries
    temp = input('How many pages are there when you search for the activity in WXC? (If you do not know, just Hit ENTER): ')
    if temp != '':
        pages = int(temp)

subid = 'sxsj'
temp = input('What is the name of your forum (论坛) in English? For example, 书香之家 is sxsj, 美语世界 is mysj, 文化走廊 is culture, 诗词欣赏 is poetry: ')
if len(temp) >= 2:
    subid = temp

mylist = []

useron = ''
if use == 1:
    useron = 'on'

# the first result page has no page parameter
url = 'https://bbs.wenxuecity.com/bbs/archive.php?SubID=' + subid + '&pos=bbs&keyword=' + target + '&username=' + useron
f = requests.get(url)
if use == 1:
    searchedURL = '<a href="/' + subid + '/'
    processOneFile(target, f, mylist, searchedURL)
else:
    processOneFile(target, f, mylist)

# the remaining result pages
for i in range(1, pages):
    url = 'https://bbs.wenxuecity.com/bbs/archive.php?page=' + str(i) + '&SubID=' + subid + '&pos=bbs&keyword=' + target + '&username=' + useron
    f = requests.get(url)
    if use == 1:
        searchedURL = '<a href="/' + subid + '/'
        processOneFile(target, f, mylist, searchedURL)
    else:
        processOneFile(target, f, mylist)

if use != 1:
    mylist.reverse()

# this is the output file.
with open('sxzj-out.html', 'w', encoding='utf-8') as html2:
    for li in mylist:
        html2.write('<p>' + li + '\n')

print("\n")
print(str(len(mylist)) + " entries")
print("\n")
print("Please check the file sxzj-out.html. The result is in it! Thanks for using this program. ---- 虎哥 / Nearby / 邻兄")
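
If you want to tinker with the program, here is a small sketch of two of its steps written as standalone helpers: building the search URL for one result page, and turning a relative post link into an absolute one. The function names build_search_url and absolutize_link are made up for this illustration and are not part of the program above. The sketch assumes the site accepts a percent-encoded keyword, which is what urlencode produces (requests normally encodes non-ASCII URLs in the same way), and it returns None for a line without a relative href instead of raising an error.

```python
from urllib.parse import urlencode

BASE = "https://bbs.wenxuecity.com"  # site root used by the program above


def build_search_url(subid, keyword, by_username=False, page=None):
    """Return the archive search URL for one result page (hypothetical helper).

    The parameter order mirrors the URLs assembled in the program above.
    urlencode percent-encodes non-ASCII keywords such as 尘凡无忧.
    """
    params = []
    if page is not None:
        # the first result page is requested without a page parameter
        params.append(("page", str(page)))
    params += [
        ("SubID", subid),
        ("pos", "bbs"),
        ("keyword", keyword),
        ("username", "on" if by_username else ""),
    ]
    return BASE + "/bbs/archive.php?" + urlencode(params)


def absolutize_link(line):
    """Rewrite <a href="/..."> to an absolute link, or return None (hypothetical helper)."""
    marker = 'href="/'
    pos = line.find(marker)
    if pos == -1:
        return None  # no relative link on this line, nothing to rewrite
    # keep everything from the leading slash onward and prepend the site root
    return '<a href="' + BASE + line[pos + len('href="'):]


if __name__ == "__main__":
    print(build_search_url("sxsj", "nearby", by_username=True))
    print(absolutize_link('<a href="/sxsj/76799.html" target="_blank">demo</a>'))
```

Neither helper touches the network, so you can try them without Internet access before wiring them into the loop that calls requests.get.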