How to get a specific chunk of text from a URL of a text file?

for i in range(len(file)):
    a = file.loc[i, "SECFNAME"]
    url = 'https://www.sec.gov/Archives/' + a
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    txt = str(soup)
    text = txt.lower()
    doc_lenght = len(text)

for line in urllib.request.urlopen(url):
    print(line.decode('utf-8'))
    def mdaa(text, doc_lenght):
        if elem in text.find("ITEM 7. MANAGEMENT'S DISCUSSION AND ANALYSIS "):
            print (elem)
        else :
            pass

The linked file has a section called MANAGEMENT’S DISCUSSION AND ANALYSIS, followed by its description, a chunk of text that needs to be scraped. With the code above I can only print the whole document, not that specific part.

This needs to be done for each row of the dataset (file), which holds the URLs. So, in Python, given the URL of a text file, what is the simplest way to access the contents of the file and print them locally line by line without saving a local copy of the text file?

Answer

You can use pandas to read your .xlsx or .csv file and call apply on the SECFNAME column. Use the requests library to fetch the text, which avoids saving a local copy to disk. Then apply a regex built from the same heading text you already pass to find. The caveat is that an ITEM 8 heading must exist, because the regex uses it to mark where the section ends. From there you can print the result to the screen or save it to a file. From what I’ve examined, not all of the linked files have an ITEM 7, which is why some items in the list are None.

import pandas as pd
import requests
import re

URL_PREFIX = "https://www.sec.gov/Archives/"
REGEX = r"\nITEM 7\.\s*MANAGEMENT'S DISCUSSION AND ANALYSIS.*?(?=\nITEM 8\.\s)"

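def get_section(url):
    # URL_PREFIX already ends with "/", so join without an extra slash
    source = requests.get(f'{URL_PREFIX}{url}').text

    # DOTALL lets ".*?" span the line breaks between the two headings
    r = re.findall(REGEX, source, re.M | re.DOTALL)
    if r:
        return ''.join(r)
    return None  # no ITEM 7 section matched

# df is the DataFrame loaded from your .xlsx or .csv file
df['has_ITEM7'] = df.SECFNAME.apply(get_section)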

hasITEM7_list = df['has_ITEM7'].to_list()

Output from hasITEM7_list

['\nITEM 7. MANAGEMENT'S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION AND RESULTS\n        OF OPERATION\n\n\nYEAR ENDED DECEMBER 28, 1997 COMPARED TO THE YEAR ENDED DECEMBER 29, 1996\n\n\n     In November 1996, the Company initiated a major restructuring and growth\nplan designed to substantially reduce its cost structure and grow the business\nin order to restore higher levels of profitability for the Company. By July\n1997, the Company completed the major phases of the restructuring plan. The\n$225.0 million of annualized cost savings anticipated from the restructuring\nresults primarily from the consolidation of administrative functions within the\nCompany, the rationalization
...
...
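
If you only want to print the section line by line for a single URL, without pandas, something like the following sketch might be enough. It is not part of the code above: the helper names (fetch_text, extract_item7, print_section) and the User-Agent string are placeholders I made up for illustration, and it uses urllib from the standard library instead of requests. The pattern is case-sensitive, so it is applied to the original text rather than to text.lower(). The sample string at the bottom is invented so the extraction step can be tried without any network access.

```python
import re
import urllib.request

# Same pattern as in the answer, compiled once.
# It is case-sensitive: run it on the original text, not on text.lower().
ITEM7_RE = re.compile(
    r"\nITEM 7\.\s*MANAGEMENT'S DISCUSSION AND ANALYSIS.*?(?=\nITEM 8\.\s)",
    re.DOTALL,
)


def fetch_text(url, user_agent="example-script contact@example.com"):
    """Download a text file into memory; nothing is written to disk."""
    # Assumption: the server may want a descriptive User-Agent.
    # The default value here is a placeholder, replace it with your own.
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_item7(text):
    """Return the ITEM 7 section, or None when the pattern does not match."""
    match = ITEM7_RE.search(text)
    return match.group(0).strip() if match else None


def print_section(url):
    """Fetch one file and print its ITEM 7 section line by line."""
    section = extract_item7(fetch_text(url))
    if section is None:
        # Happens when ITEM 7 or the closing ITEM 8 heading is missing.
        print(f"No ITEM 7 section found in {url}")
        return
    for line in section.splitlines():
        print(line)


# Invented sample text, only to show the extraction without a download.
sample = (
    "ITEM 6. SELECTED FINANCIAL DATA\n"
    "\nITEM 7. MANAGEMENT'S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION\n"
    "Revenue increased.\n"
    "\nITEM 8. FINANCIAL STATEMENTS\n"
)
print(extract_item7(sample))
```

For a real file you would call print_section(URL_PREFIX + a) for each SECFNAME value. Returning None for files without a match keeps the behaviour in line with the None entries mentioned above.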