I have a 2.5 GB JSON file with 25 columns and about 4 million rows. When I try to filter the JSON with the following script, it takes at least 10 minutes.
import json

product_list = ['Horse', 'Rabit', 'Cow']
year_list = ['2008', '2009', '2010']
country_list = ['USA', 'GERMANY', 'ITALY']

with open('./products/animal_production.json', 'r', encoding='utf8') as r:
    result = r.read()

result = json.loads(result)

for item in result[:]:
    if (not str(item["Year"]) in year_list) or (not item["Name"] in product_list) or (not item["Country"] in country_list):
        result.remove(item)

print(result)
I need to prepare the result in at most 1 minute, so what is your suggestion, or the fastest way to filter the JSON?
Removing from a list inside a loop is slow: each remove is O(n), and doing that n times makes the whole loop O(n^2). Appending to a new list is O(1), so doing it n times is O(n). So you can try this:
[item for item in result if str(item["Year"]) in year_list and item["Name"] in product_list and item["Country"] in country_list]
Filter on the conditions you need and keep only the items that match.
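As a further speed-up, membership tests against a Python list are O(len(list)), while tests against a set are O(1), so converting the three filter lists to sets helps when you run the check 4 million times. Here is a minimal, self-contained sketch of the whole approach; the sample rows below are hypothetical stand-ins for your real file (for the real data you would instead do `result = json.load(open('./products/animal_production.json', encoding='utf8'))`):

```python
import json

# Sets give O(1) membership tests instead of O(n) list scans.
product_set = {'Horse', 'Rabit', 'Cow'}
year_set = {'2008', '2009', '2010'}
country_set = {'USA', 'GERMANY', 'ITALY'}

# Hypothetical sample rows standing in for the real 4-million-row file.
result = [
    {"Year": 2008, "Name": "Horse", "Country": "USA"},
    {"Year": 1999, "Name": "Cow", "Country": "ITALY"},
    {"Year": 2010, "Name": "Dog", "Country": "GERMANY"},
]

# Keep only the rows that match all three conditions; no in-place removal.
filtered = [
    item for item in result
    if str(item["Year"]) in year_set
    and item["Name"] in product_set
    and item["Country"] in country_set
]
print(filtered)
```

Only the first sample row passes all three tests. Note also that `json.load(file_handle)` avoids holding both the raw 2.5 GB string and the parsed objects in memory at once, which `r.read()` followed by `json.loads()` does.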