Web scraping is a powerful technique for extracting and tracking “things of interest” from websites.
Let us see how it works with an example. Here I search for items on an online retailer and fetch the price of each item found.
To do this I wrote a module containing one class with two functions: the “Shopping.FlipKart” module has a “flipkart” class, which provides “finditem” (run a search) and “getDetails” (fetch the data embedded in a product page). Here is the file structure.
.
├── flptest.py
└── Shopping
    └── FlipKart.py
Let's look at the module code first.
cat Shopping/FlipKart.py
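#!/usr/bin/python3
import json

import bs4 as bs
import requests


class flipkart:
    "Flipkart class"
    Agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

    @staticmethod
    def finditem(searchfor):
        # Search sorted by popularity; requests URL-encodes the query string.
        searchbase = "https://www.flipkart.com/search"
        params = {"q": searchfor, "sort": "popularity"}
        try:
            res = requests.get(searchbase, params=params, timeout=10,
                               headers={"User-Agent": flipkart.Agent})
        except requests.RequestException:
            return -1
        soup = bs.BeautifulSoup(res.text, 'html.parser')
        for tag in soup.find_all('script'):
            text = str(tag)
            # Both markers must be present; the original test
            # ("jsonLD" and "ItemList" in ...) only checked the second one.
            if "jsonLD" in text and "ItemList" in text:
                items = text.split('\n')
                # Parse the embedded data as JSON rather than eval(),
                # which would execute page content as Python code.
                itemlist = json.loads(items[1].lstrip())
                return list(itemlist["itemListElement"])
        return -1  # no matching script tag found

    @staticmethod
    def getDetails(itemurl):
        try:
            res = requests.get(itemurl, timeout=10,
                               headers={"User-Agent": flipkart.Agent})
        except requests.RequestException:
            return -1
        soup = bs.BeautifulSoup(res.text, 'html.parser')
        for tag in soup.find_all('script'):
            if "__INITIAL_STATE__" in str(tag):
                itemDetailraw = str(tag).split('\n')
                # Keep only the JSON text after "__ = ", minus the trailing ";".
                itemDetails = itemDetailraw[1].lstrip().split("__ = ")
                return itemDetails[1].rstrip(';')
        return -1  # no matching script tag found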
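To see what the two functions are looking for without touching the network, here is a minimal offline sketch. The HTML snippet, the `ScriptCollector` class and the helper names are all made up for illustration, and it uses the standard library's `html.parser` in place of BeautifulSoup so that it runs with no extra packages. Real Flipkart pages are much larger and their layout may differ.

```python
import json
from html.parser import HTMLParser

# Hypothetical page: a tiny stand-in for the pages the module fetches.
SAMPLE_HTML = """
<html><head>
<script id="jsonLD" type="application/ld+json">
{"@type": "ItemList", "itemListElement": [{"@type": "ListItem", "position": 1, "url": "https://example.com/p/mug-1"}, {"@type": "ListItem", "position": 2, "url": "https://example.com/p/mug-2"}]}
</script>
<script>
window.__INITIAL_STATE__ = {"price": 199, "inStock": true};
</script>
</head><body></body></html>
"""


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> tag."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.scripts[-1] += data


def collect_scripts(html):
    parser = ScriptCollector()
    parser.feed(html)
    parser.close()
    return parser.scripts


def find_item_list(scripts):
    """Return the itemListElement entries of the first ItemList script."""
    for text in scripts:
        if "ItemList" in text:
            return json.loads(text)["itemListElement"]
    return []


def find_initial_state(scripts, marker="__INITIAL_STATE__ = "):
    """Return the object assigned to __INITIAL_STATE__, or None."""
    for text in scripts:
        if marker in text:
            payload = text.split(marker, 1)[1].strip().rstrip(";")
            return json.loads(payload)
    return None


scripts = collect_scripts(SAMPLE_HTML)
urls = [item["url"] for item in find_item_list(scripts)]
state = find_initial_state(scripts)
print(urls)
print(state)
```

The search page yields a list of item URLs, and each product page yields one large JSON object; the main script only has to connect the two.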
Here is the main script, which calls these functions to run the search and get the price of each item found.
cat flptest.py
#!/usr/bin/python3
from Shopping.FlipKart import flipkart
import sys
import json

if len(sys.argv) > 1:
    searchfor = sys.argv[1]
else:
    print("No search object provided")
    sys.exit(1)

myitems = flipkart.finditem(searchfor)
# finditem returns a list on success; anything else means the search failed.
if isinstance(myitems, list):
    for myitem in myitems:
        itemurl = myitem["url"]
        try:
            itemDetails = json.loads(flipkart.getDetails(itemurl))
        except (TypeError, ValueError):
            # getDetails returned no JSON text, or the text did not parse.
            print("Flipkart cannot get item details of", itemurl)
            continue  # skip the price lookup for this item
        # The price sits in one of two places, so try each path in turn.
        try:
            itemPrice = itemDetails["pageDataV4"]["page"]["data"]["10002"][2]["widget"]["data"]["emiDetails"]["metadata"]["price"]
        except (KeyError, IndexError, TypeError):
            try:
                itemPrice = itemDetails["pageDataV4"]["page"]["data"]["10002"][2]["widget"]["data"]["pricing"]["value"]["finalPrice"]["value"]
            except (KeyError, IndexError, TypeError):
                itemPrice = "Cannot get it !!"
        print(itemurl, itemPrice)
else:
    print("Flipkart search down, cannot proceed finding", searchfor)
    sys.exit(1)
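The nested try/except blocks above try one path into the page data and then fall back to a second one. One possible way to express the same idea is a small helper that walks a list of keys and returns a default when any step is missing. The `dig`, `first_price` and `wrap` helpers and the sample dictionaries are hypothetical; only the two key paths come from flptest.py.

```python
def dig(data, path, default=None):
    """Follow path (dict keys and list indexes) through nested data."""
    current = data
    for key in path:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return default
    return current


# Prefix shared by both price paths used in flptest.py.
WIDGET = ["pageDataV4", "page", "data", "10002", 2, "widget", "data"]
PRICE_PATHS = [
    WIDGET + ["emiDetails", "metadata", "price"],
    WIDGET + ["pricing", "value", "finalPrice", "value"],
]


def first_price(details, paths=PRICE_PATHS, default="Cannot get it !!"):
    """Return the value at the first path that resolves, else default.

    Note: a path whose value is None is treated as missing.
    """
    for path in paths:
        price = dig(details, path)
        if price is not None:
            return price
    return default


def wrap(widget_data):
    """Build a made-up details dict with widget_data at the expected place."""
    widgets = [{}, {}, {"widget": {"data": widget_data}}]
    return {"pageDataV4": {"page": {"data": {"10002": widgets}}}}


emi_page = wrap({"emiDetails": {"metadata": {"price": 199}}})
pricing_page = wrap({"pricing": {"value": {"finalPrice": {"value": 254}}}})
empty_page = {}

print(first_price(emi_page))
print(first_price(pricing_page))
print(first_price(empty_page))
```

With this shape, supporting a third page layout would mean adding one more entry to `PRICE_PATHS` instead of another level of nesting.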
Execute it and verify the results:
./flptest.py "Beer mugs"
https://www.flipkart.com/ftafat-inductive-rainbow-color-cup-led-flashing-7-changing-light-pour-water-tea-lighting-cup-easy-battery-replace-glass-mug/p/itmf9dqs93sptkbv?pid=MUGF9DN2HRQ8F9PZ&lid=LSTMUGF9DN2HRQ8F9PZ3LFRUF&marketplace=FLIPKART 199
https://www.flipkart.com/ftafat-pair-inductive-rainbow-color-cup-led-flashing-7-changing-light-lighting-cup-easy-battery-replace-2-cups-glass-mug/p/itmf9dr5pcbfjzpb?pid=MUGF9DN4V9ZHKJQG&lid=LSTMUGF9DN4V9ZHKJQGZUFBKA&marketplace=FLIPKART 254
https://www.flipkart.com/alemkip-inductive-rainbow-color-cup-led-flashing-7-changing-light-pour-water-tea-lighting-cup-easy-battery-replace-glass-250-ml-mug/p/itmff7kfevfd8ntz?pid=MUGFF76WRGTGVX3D&lid=LSTMUGFF76WRGTGVX3DWU3PVI&marketplace=FLIPKART 196
https://www.flipkart.com/red-renge-enterprises-rrbm1-brass-mug/p/itmfhtnn8x8bhw93?pid=MUGFHSQZCW2ZYBKA&lid=LSTMUGFHSQZCW2ZYBKARTPWYG&marketplace=FLIPKART 1850
https://www.flipkart.com/sachdeva-s-traders-color-changing-led-light-glass-mug/p/itmf6z4zpgsgg28z?pid=MUGF6YY6UBP9JZGS&lid=LSTMUGF6YY6UBP9JZGS86FOTS&marketplace=FLIPKART 185
https://www.flipkart.com/lucky-thailand-lg115-glass-set/p/itme9dx8bxtq2eqp?pid=GLSE9DX8ZKGCFAWP&lid=LSTGLSE9DX8ZKGCFAWPRURGMS&marketplace=FLIPKART 246
https://www.flipkart.com/somil-trandy-new-design-stylish-glass-beer-mug-handle-set-2/p/itmeqg8abrwhhgne?pid=GLSEQG8ASNC8P7MM&lid=LSTGLSEQG8ASNC8P7MMH1KNBE&marketplace=FLIPKART 375
https://www.flipkart.com/mega-shine-4-pcs-colour-changing-liquid-activated-lights-multi-purpose-use-cup-300-ml-plastic-mug/p/itmf4s5symnhg499?pid=MUGF4RZ44G3G3ZHK&lid=LSTMUGF4RZ44G3G3ZHKLX4EA5&marketplace=FLIPKART 446
https://www.flipkart.com/caryn-beer-party-led-light-plastic-mug/p/itmee79a6jnhggmk?pid=MUGEE79AQ3CNVECW&lid=LSTMUGEE79AQ3CNVECWYYX5P7&marketplace=FLIPKART&spotlightTagId=BestvalueId_upp 326
https://www.flipkart.com/nvcollections-3d-led-magic-cup-cool-stylish-colorful-design-300ml-plastic-mug/p/itmfa39vh3nhyzex?pid=MUGFA2V2B7E3Y7ZD&lid=LSTMUGFA2V2B7E3Y7ZDKMGWHI&marketplace=FLIPKART 160
Verify the first URL's result in a web browser as well.
Looks good: the price matches what the Python code reported.
Remember, a scraper has to be crafted after carefully looking into the content and source of the target web pages, because it depends on the exact script tags and JSON paths found there. This article is only meant to give an idea of how web scraping works.
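Part of that first look at the page source can be scripted. The sketch below is only an illustration under stated assumptions: the sample page, the `ScriptSurvey` class and the `survey` function are invented, and the markers are the two strings this article searches for. It lists the inline scripts that mention a marker, along with their `id`, `type` and size, so you can decide which one to parse.

```python
from html.parser import HTMLParser

# Made-up page used only to demonstrate the survey; not real Flipkart markup.
SAMPLE_PAGE = """
<html><head>
<script src="/static/app.js"></script>
<script id="jsonLD" type="application/ld+json">{"@type": "ItemList"}</script>
<script>window.__INITIAL_STATE__ = {"price": 199};</script>
</head><body><p>Beer mugs</p></body></html>
"""


class ScriptSurvey(HTMLParser):
    """Record the id, type and text of each <script> tag."""

    def __init__(self):
        super().__init__()
        self.found = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            attrs = dict(attrs)
            self._current = {
                "id": attrs.get("id"),
                "type": attrs.get("type"),
                "text": "",
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "script" and self._current is not None:
            self.found.append(self._current)
            self._current = None


def survey(html, markers=("ItemList", "__INITIAL_STATE__")):
    """Return (id, type, length, matched markers) per matching script."""
    parser = ScriptSurvey()
    parser.feed(html)
    parser.close()
    rows = []
    for script in parser.found:
        hits = [m for m in markers if m in script["text"]]
        if hits:
            rows.append((script["id"], script["type"], len(script["text"]), hits))
    return rows


rows = survey(SAMPLE_PAGE)
for row in rows:
    print(row)
```

Running the same survey against a saved copy of a real page is a quick way to notice when the site has changed and the scraper needs to be adjusted.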