Create a CSV by web scraping multiple pages with Python or PySpark

I'm trying web scraping for the first time, and I would like to build a CSV file from a Japanese animation website, with the title, genre, studio, and duration of each anime.

So far I have only managed to collect the titles from the first page, with this code:

import requests
import pandas as pd
from bs4 import BeautifulSoup

res = requests.get("http://www.animeka.com/animes/series/~_1.html")
soup = BeautifulSoup(res.content, "html.parser")
anime_containers = soup.find_all('table', class_='animesindex')

names = []
for container in anime_containers:
    # find_all() returns a list (empty when nothing matches), never None,
    # so test its truthiness instead of comparing against None
    if container.find_all('td', class_='animestxt'):
        names.append(container.a.text)

test_df = pd.DataFrame({'anime': names})
print(test_df)

which prints something like this:

anime
0   "Eikou Naki Tensai-tachi" kara no Monogatari
1                                 "Eiyuu" Kaitai
2                              "Parade" de Satie
3                                        elDLIVE
4                   'n Gewone blou Maandagoggend
5                                    +Tic Neesan
6                          .hack// Terminal Disc
7                           .hack//G.U. Returner
8                            .hack//G.U. Trilogy

I don't know how to gather the genre, studio, and duration, or how to scrape all the other pages without repeating the same code.

This is the source code of the page: view-source:http://www.animeka.com/animes/series/~_1.html

Answer

To iterate over all the pages, you simply need a for loop that changes the page number in the URL, like this:

# the series listing runs from ~_1.html to ~_466.html
for page_no in range(1, 467):
    url = 'http://www.animeka.com/animes/series/~_{}.html'.format(page_no)
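
If you want to be gentle with the server while looping over all 466 pages, you can also reuse a single requests.Session and pause briefly between requests; this is just an optional sketch, and the half-second delay is an arbitrary choice:

import time
import requests

session = requests.Session()  # reuse one connection for all pages

for page_no in range(1, 467):
    url = 'http://www.animeka.com/animes/series/~_{}.html'.format(page_no)
    r = session.get(url)
    r.raise_for_status()  # stop early if a page fails to load
    time.sleep(0.5)       # arbitrary pause to avoid hammering the site
    # ... parse r.text here, as shown below ...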

Now, to get the information you want, you can use this:

import requests
import pandas as pd
from bs4 import BeautifulSoup

titles, studios, genres, durations = [], [], [], []

for page_no in range(1, 467):
    url = 'http://www.animeka.com/animes/series/~_{}.html'.format(page_no)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')

    for table in soup.find_all('table', class_='animesindex'):
        td = table.find_all('td', class_='animestxt')
        # each cell reads "LABEL : value"; split on the first colon only,
        # so titles that themselves contain ':' are not truncated
        titles.append(td[1].text.split(':', 1)[1])
        studios.append(td[3].text.split(':', 1)[1])
        genres.append(td[4].text.split(':', 1)[1])
        durations.append(td[6].text.split(':', 1)[1])

headers = ['Title', 'Studio', 'Genres', 'Duration']
df = pd.DataFrame(dict(zip(headers, [titles, studios, genres, durations])))
print(df)

Partial Output:

                                            Title                   Duration                                             Genres                                             Studio
0    "Eikou Naki Tensai-tachi" kara no Monogatari    TV-S 25 mins (en cours)                          [SPORT] [TRANCHE DE VIE]                      [NHK ENTERPRISE] [J.C. STAFF] 
1                                  "Eiyuu" Kaitai                      1 OAV                                         [COMÉDIE]                                            [ZEXCS] 
2                               "Parade" de Satie             1 FILM 14 mins         [FANTASTIQUE & MYTHE] [COMÉDIE] [MUSICAL]                               [YAMAMURA ANIMATION] 
3                                         elDLIVE               Non spécifié         [ACTION] [COMÉDIE] [ESPACE & SCI-FICTION]                                 [PIERROT CO.,LTD.] 
4                    'n Gewone blou Maandagoggend              1 FILM 3 mins                                           [DRAME]                                            [AUCUN]
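
Finally, to produce the CSV file you asked for, you can write the DataFrame out with to_csv; the filename animes.csv is just an example, and stripping the stray whitespace first is optional cleanup:

# optional cleanup: remove the leading/trailing spaces left by the split
for col in df.columns:
    df[col] = df[col].str.strip()

# 'animes.csv' is an example filename
df.to_csv('animes.csv', index=False, encoding='utf-8')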