Commit 4cacf669 authored by A. Unique TensorFlower's avatar A. Unique TensorFlower
Browse files

Internal change

PiperOrigin-RevId: 468782444
parent 7fb2bb02
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "0JmF5ohLbPlF"
},
"source": [
"# Pre processing steps of a COCO JSON annotated file "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "uXwwz3PlbUX2"
},
"source": [
"Given a single COCO annotated JSON file, your goal is to pre-process in order to remove noise and manipulate it into a form which is suitable for training a ML model. This script will also check if the annotated images are broken or missing."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "E1SxGZD2bv8E"
},
"source": [
"The COCO annotation file includes the following -\n",
"\n",
"1. Name of the images.\n",
"\n",
"2. Dimensions of the images.\n",
"\n",
"3. Classes in the image category.\n",
"\n",
"4. Name of the super categories of the classes.\n",
"\n",
"5. Area acquired by the segmented pixels in an image.\n",
"\n",
"6. Bounding box co-ordinates.\n",
"\n",
"7. Annotated segmentation coordinates."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "j0v31gxTbweO"
},
"source": [
"There is a lot of noise in the real world annotation file. The images name could be wrong. The images mentioned in an annotation file may not be present in the image folder, which will disrupt the model training procedure. The contents within an annotation file may not match with each other. Even the files present in an image folder may be broken or truncated, which will cause errors while reading image files. Our goal is to eradicate all these problems."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "PyFn96EKb7A-"
},
"source": [
"Our goal is to make sure that all information in the key values corresponds to each other correctly. This notebook will help you achieve this task."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "W6aXxxox0DDa"
},
"source": [
"## Import labels and sample JSON file \n",
"To import total classes for the material, material_form and plastic_type we will import the label files from the waste_identification_ml project from Tensorflow Model Garden.\n",
"We will also import a noisy sample JSON file to illustrate an example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "WluEHMZYm0zM",
"outputId": "b8c4738c-4636-4c56-c6ea-b4e4da92474c",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
" % Total % Received % Xferd Average Speed Time Time Time Current\n",
" Dload Upload Total Spent Left Speed\n",
"\r 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0\r100 3536 100 3536 0 0 29714 0 --:--:-- --:--:-- --:--:-- 29714\n",
" % Total % Received % Xferd Average Speed Time Time Time Current\n",
" Dload Upload Total Spent Left Speed\n",
"\r 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0\r100 2427 100 2427 0 0 18248 0 --:--:-- --:--:-- --:--:-- 18248\n",
" % Total % Received % Xferd Average Speed Time Time Time Current\n",
" Dload Upload Total Spent Left Speed\n",
"\r 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0\r100 3303k 100 3303k 0 0 14.1M 0 --:--:-- --:--:-- --:--:-- 14.2M\n"
]
}
],
"source": [
"%%bash\n",
"curl -O https://raw.githubusercontent.com/tensorflow/models/master/official/projects/waste_identification_ml/pre_processing/config/categories_list_of_dictionaries.py\n",
"\n",
"curl -O https://raw.githubusercontent.com/tensorflow/models/master/official/projects/waste_identification_ml/pre_processing/config/sample_json/dataset.json\n",
"\n",
"mkdir image_folder\n",
"\n",
"curl -o image_folder/image_2.png https://raw.githubusercontent.com/tensorflow/models/master/official/projects/waste_identification_ml/pre_processing/config/sample_images/image_2.png"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "MRhCAFlVcRm0"
},
"source": [
"## Import the required libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Mnxbo8GBcN2O"
},
"outputs": [],
"source": [
"import glob\n",
"import tqdm\n",
"import json\n",
"from PIL import Image\n",
"import subprocess\n",
"import copy\n",
"import os\n",
"from google.colab import files\n",
"from categories_list_of_dictionaries import *"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "f-05VwsL0mCi"
},
"outputs": [],
"source": [
"# reading labels \n",
"\n",
"images_folder_path = 'image_folder/' #@param {type:\"string\"}\n",
"list_of_material = build_material(MATERIAL_LIST,'material-types')\n",
"list_of_material_form = build_material(MATERIAL_FORM_LIST,'material-form-types')\n",
"list_of_plastic_type = build_material(PLASTICS_SUBCATEGORY_LIST,'plastic-types')"
]
},
{
"cell_type": "code",
"source": [
"# common labeling typo errors\n",
"_KNOWN_TYPOS = {\n",
" 'and': '&',\n",
" 'Cassete': 'Cassette',\n",
" 'Toy':'Toys',\n",
" 'Mug-&-Tub':'tub',\n",
" 'Toyss':'toys'\n",
"}\n",
"_KNOWN_TYPOS"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "p7xwZDoc5rZU",
"outputId": "33d060a4-f4aa-475a-e140-30bc222553a3"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"{'and': '&',\n",
" 'Cassete': 'Cassette',\n",
" 'Toy': 'Toys',\n",
" 'Mug-&-Tub': 'tub',\n",
" 'Toyss': 'toys'}"
]
},
"metadata": {},
"execution_count": 4
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "958ZSjT_eG_b"
},
"source": [
"## Utility functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "tGOCdeiucUgq"
},
"outputs": [],
"source": [
"def read_json(file):\n",
" \"\"\"Read any JSON file.\n",
"\n",
" Args:\n",
" file: path to the file\n",
" \"\"\"\n",
" with open(file) as json_file:\n",
" data = json.load(json_file)\n",
" return data\n",
"\n",
"\n",
"def search_dict_value(dic, id):\n",
" \"\"\"Returns the key of the dictionary from its value'\n",
"\n",
" Args:\n",
" dic = Mapping to search by value.\n",
" id = Value to search.\n",
" \"\"\" \n",
" key_list = list(dic.keys())\n",
" val_list = list(dic.values())\n",
" position = val_list.index(id)\n",
" return key_list[position]\n",
"\n",
"\n",
"def delete_truncated_images(folder_path: str) -> None:\n",
" \"\"\"Find and delete truncated images.\n",
"\n",
" Args:\n",
" folder_path: path to the folder where images are saved.\n",
" \"\"\"\n",
" # path to the images folder to read its content\n",
" files = glob.glob(folder_path + '/*')\n",
" print('Total number of files in the folder:', len(files))\n",
"\n",
" num = 0\n",
"\n",
" # read all image files and remove them from the directory in case they are broken\n",
" for file in tqdm.tqdm(files):\n",
" if file.endswith(('.png','.jpg')):\n",
" try:\n",
" img = Image.open(file)\n",
" img.verify()\n",
" except:\n",
" num = num + 1\n",
" subprocess.run(['rm', file])\n",
" print('Broken file name: ' + file)\n",
" if num == 0:\n",
" print('\\nNo broken images found')\n",
" else:\n",
" print('Total number of broken images found:', num)\n",
"\n",
"\n",
"def spelling_correction(dic):\n",
" \"\"\"Correcting some common spelling mistakes.\"\"\"\n",
" for i in dic['categories']:\n",
" for old, new in _KNOWN_TYPOS.items():\n",
" i['name'].replace(old, new)\n",
"\n",
"\n",
"def labeling_correction(dic, num, labels_dict):\n",
" \"\"\"Matching annotated labels with the correct labels and correcting the mistakes.\n",
"\n",
" Mapping the modified labeling ID with the corresponding original ID for alignment\n",
" of categories.\n",
"\n",
" Args:\n",
" dic: JSON file read as a dictionary\n",
" num: keyword position inside the label\n",
" labels_dict: dictionary showing the labels ID of the original categories \n",
" \"\"\"\n",
" incorrect_labels = []\n",
" mapping_list = []\n",
" for i in dic['categories']:\n",
" if i['name'].split('_')[num].lower() in labels_dict.values():\n",
" id = i['id']\n",
" name = i['name'].split('_')[num]\n",
" id_match = search_dict_value(labels_dict, i['name'].split('_')[num].lower())\n",
" mapping_list.append((id, name, id_match))\n",
" else:\n",
" id = i['id']\n",
" incorrect_labels.append(id)\n",
" return mapping_list, incorrect_labels\n",
"\n",
"\n",
"def images_key(dic):\n",
" \"\"\"Align the data within the dictionary in the 'images' key.\n",
" \n",
" The 'image_id' parameter in the 'annotation' key is the same as 'id' in the 'images' key of the dictionary. This function \n",
" will also remove all image data from the 'images' key whose 'id' does not \n",
" match with 'image_id' in the 'annotation' key in the dictionary.\n",
"\n",
" Args:\n",
" dic: where the JSON file is read into\n",
" \"\"\"\n",
" image_ids = set(i['image_id'] for i in dic['annotations'])\n",
" new_images = [i for i in dic['images'] if i['id'] in image_ids]\n",
" return new_images\n",
"\n",
"\n",
"def annotations_key(dic, incorrect_labels, mapping_dict):\n",
" \"\"\"Align the data within the dictionary in the 'annotation' key.\n",
" \n",
" Notice that the 'category_id' in the 'annotation' key is same as 'id' \n",
" in the 'categories' key of the dictionary.\n",
"\n",
" Args:\n",
" dic: where the JSON file is read into\n",
" \"\"\"\n",
" new_annotation = []\n",
"\n",
" for i in dic['annotations']:\n",
" id = i['category_id']\n",
" if id not in incorrect_labels:\n",
" new_id = [i[2] for i in mapping_dict if i[0] == id][0]\n",
" i['category_id'] = new_id\n",
" new_annotation.append(i)\n",
" return new_annotation\n",
"\n",
"\n",
"def annotated_images(folder_path, dic):\n",
" \"\"\"Get images infromation that are mentioned in an annotation file but are not present in an image folder.\n",
"\n",
" Args:\n",
" folder_path: path of an image folder.\n",
" \"\"\"\n",
" # read the file names from the directory \n",
" files = glob.glob(folder_path + '/*')\n",
" files = set(map(os.path.basename, files))\n",
"\n",
" # list of images in an annotation file\n",
" dic['images'] = [i for i in dic['images'] if i['file_name'] in files]\n",
" return dic\n",
"\n",
"\n",
"def image_annotation_key(dic):\n",
" \"\"\"Check if same images are present in both \"images\" key and \"annotations\" key. \n",
"\n",
" List of the image IDs which are in the \"images\" key but NOT in \"annotation\" key.\n",
" Remove information if they are not present in both keys.\n",
"\n",
" Args:\n",
" dic: annotation file read as a dictionary\n",
" \"\"\"\n",
" images_id = [i['id'] for i in dic['images']]\n",
" annotation_id = [i['image_id'] for i in dic['annotations']]\n",
" common_list = set(images_id).intersection(annotation_id)\n",
" dic['images'] = [i for i in dic['images'] if i['id'] in common_list]\n",
" dic['annotations'] = [i for i in dic['annotations'] if i['image_id'] in common_list]\n",
" return dic"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0OoDmNC22ycz"
},
"source": [
"## Find and delete truncated images from the image folder."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "bUUu3F6I20w3",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "21ad1270-0486-4eb8-b4ed-07d7dd99b936"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Total number of files in the folder: 1\n"
]
},
{
"output_type": "stream",
"name": "stderr",
"text": [
"100%|██████████| 1/1 [00:00<00:00, 30.21it/s]"
]
},
{
"output_type": "stream",
"name": "stdout",
"text": [
"\n",
"No broken images found\n"
]
},
{
"output_type": "stream",
"name": "stderr",
"text": [
"\n"
]
}
],
"source": [
"delete_truncated_images(images_folder_path)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "65XuyPBSea7-"
},
"source": [
"## Perform operations on the file\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "l-uMtZK2edPY",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "428298ff-af02-44c2-c06b-7223646a70db"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"dict_keys(['images', 'annotations', 'categories'])\n"
]
}
],
"source": [
"# read json file and it should contain at least the three keys as shown below\n",
"path_to_json = 'dataset.json' #@param {type:\"string\"}\n",
"data = read_json(path_to_json)\n",
"print(data.keys())\n",
"\n",
"# create a copy to compare the results in the end\n",
"data_preprocessing = copy.deepcopy(data)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "G8w7MfDtvDIq",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "4c33d866-e788-4b59-cdb2-b3b929c25f7b"
},
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
"100%|██████████| 6/6 [00:00<00:00, 51463.85it/s]"
]
},
{
"output_type": "stream",
"name": "stdout",
"text": [
"\n",
"Total number of wrong annotated labels are 5\n"
]
},
{
"output_type": "stream",
"name": "stderr",
"text": [
"\n"
]
}
],
"source": [
"# checking labeling mistakes as all annotated labels should have 6 keywords connected by '_' \n",
"num = 0\n",
"for i in tqdm.tqdm(data['categories']):\n",
" if len(i['name'].split('_')) != 6:\n",
" num += 1\n",
"print('\\nTotal number of wrong annotated labels are', num)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "q2jOWegZxPEp",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "0d44c754-aead-422a-c6bc-3fa304acd2b5"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"\n",
"Total number of labels which has less than 6 keywords are 0\n"
]
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[{'id': 0,\n",
" 'name': 'plastics_HDPE_flexible_color_SAchets-&-pouch_pouch',\n",
" 'supercategory': 'plastics_HDPE_flexible_color_SAchets-&-pouch_pouch'},\n",
" {'id': 1,\n",
" 'name': 'Plastics_HDPE_Rigid_Blue_Lid_Bottle-Cap_Na_Na',\n",
" 'supercategory': 'Plastics_HDPE_Rigid_Blue_Lid_Bottle-Cap_Na_Na'},\n",
" {'id': 2,\n",
" 'name': 'Plastics_peTE_Na_Clear_Bottle_Shampoo-Bottle_250Ml_Vlcc',\n",
" 'supercategory': 'Plastics_peTE_Na_Clear_Bottle_Shampoo-Bottle_250Ml_Vlcc'},\n",
" {'id': 3,\n",
" 'name': 'Plastics_na_Rigid_Blue_Bottle_Hair-Oil-Bottle-500Ml_Parachute',\n",
" 'supercategory': 'Plastics_na_Rigid_Blue_Bottle_Hair-Oil-Bottle-500Ml_Parachute'},\n",
" {'id': 4,\n",
" 'name': 'Plastics_HDPE_Rigid_Na_Cosmetic_Comb_Na_Na',\n",
" 'supercategory': 'Plastics_HDPE_Rigid_Na_Cosmetic_Comb_Na_Na'},\n",
" {'id': 5,\n",
" 'name': 'Plastics_PETE_Na_Clear_Bottle_Energy-Drink-Bottle_250Ml_Sting-Energy',\n",
" 'supercategory': 'Plastics_PETE_Na_Clear_Bottle_Energy-Drink-Bottle_250Ml_Sting-Energy'}]"
]
},
"metadata": {},
"execution_count": 9
}
],
"source": [
"# remove category labels which has less than 6 keywords\n",
"categories = []\n",
"num = 0\n",
"for i in data['categories']:\n",
" if len(i['name'].split('_')) >= 6:\n",
" categories.append(i)\n",
" else:\n",
" num += 1\n",
"print('\\nTotal number of labels which has less than 6 keywords are', num)\n",
"data['categories'] = categories\n",
"\n",
"# display categories after removing the labels\n",
"data['categories']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "qup_-ReIz-iv",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "55963fc5-89f9-4767-b10e-b8ca8ea7a65d"
},
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
"100%|██████████| 6/6 [00:00<00:00, 48026.38it/s]\n"
]
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[{'id': 0,\n",
" 'name': 'plastics_HDPE_flexible_color_SAchets-&-pouch_pouch',\n",
" 'supercategory': 'plastics_HDPE_flexible_color_SAchets-&-pouch_pouch'},\n",
" {'id': 1,\n",
" 'name': 'Plastics_HDPE_Rigid_Blue_Lid_Bottle-Cap-Na-Na',\n",
" 'supercategory': 'Plastics_HDPE_Rigid_Blue_Lid_Bottle-Cap_Na_Na'},\n",
" {'id': 2,\n",
" 'name': 'Plastics_peTE_Na_Clear_Bottle_Shampoo-Bottle-250Ml-Vlcc',\n",
" 'supercategory': 'Plastics_peTE_Na_Clear_Bottle_Shampoo-Bottle_250Ml_Vlcc'},\n",
" {'id': 3,\n",
" 'name': 'Plastics_na_Rigid_Blue_Bottle_Hair-Oil-Bottle-500Ml-Parachute',\n",
" 'supercategory': 'Plastics_na_Rigid_Blue_Bottle_Hair-Oil-Bottle-500Ml_Parachute'},\n",
" {'id': 4,\n",
" 'name': 'Plastics_HDPE_Rigid_Na_Cosmetic_Comb-Na-Na',\n",
" 'supercategory': 'Plastics_HDPE_Rigid_Na_Cosmetic_Comb_Na_Na'},\n",
" {'id': 5,\n",
" 'name': 'Plastics_PETE_Na_Clear_Bottle_Energy-Drink-Bottle-250Ml-Sting-Energy',\n",
" 'supercategory': 'Plastics_PETE_Na_Clear_Bottle_Energy-Drink-Bottle_250Ml_Sting-Energy'}]"
]
},
"metadata": {},
"execution_count": 10
}
],
"source": [
"# According to the collected data it was found that most issues occurs from the\n",
"# 6th keyword which are the sub category of the material form.\n",
"\n",
"for i in tqdm.tqdm(data['categories']):\n",
" l1 = i['name'].split('_')[:5]\n",
" l2 = i['name'].split('_')[5:]\n",
" l1.append('-'.join(l2))\n",
" i['name'] = '_'.join(l1)\n",
"\n",
"# display categories after making corrections\n",
"data['categories']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Ro8KNGaGFv7k",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "2dcd1d5b-2387-49be-e1ab-45b796169e4b"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Dictionary characteristics before processing :\n",
"images: 2 categories: 6 annotations: 6\n",
"\n",
"Dictionary characteristics after processing of material_type_annotation :\n",
"images: 1 categories: 10 annotations: 5\n",
"\n",
"Dictionary characteristics after processing of material_form_type_annotation :\n",
"images: 1 categories: 34 annotations: 5\n",
"\n",
"Dictionary characteristics after processing of plastic_type_annotation :\n",
"images: 1 categories: 9 annotations: 4\n"
]
}
],
"source": [
"print('Dictionary characteristics before processing :')\n",
"print('images:',len(data_preprocessing['images']),'categories:', len(data_preprocessing['categories']),'annotations:',len(data_preprocessing['annotations']))\n",
"\n",
"list_of_categories = [(list_of_material,0,'material_type_annotation.json'),\\\n",
" (list_of_material_form,4,'material_form_type_annotation.json'),\\\n",
" (list_of_plastic_type,1,'plastic_type_annotation.json')]\n",
"\n",
"for m in list_of_categories:\n",
"\n",
" data_processing = copy.deepcopy(data)\n",
"\n",
" # create a dict showing TDs corresponding to the labels & convert all words\n",
" # to lower case in order to eliminate case sensitive issues\n",
" labels_dict = dict([(i['id'], i['name'].lower()) for i in m[0]])\n",
"\n",
" # correcting grammatical errors\n",
" spelling_correction(data_processing)\n",
"\n",
" # create a mapping table to map each label to the right label structure.\n",
" # find the incorrect labels.\n",
" mapping_dict, incorrect_labels = labeling_correction(data_processing, m[1], labels_dict) \n",
"\n",
" # change the 'categories' key\n",
" data_processing['categories'] = m[0]\n",
"\n",
" # change the 'annotation' key\n",
" data_processing['annotations'] = annotations_key(data_processing, incorrect_labels, mapping_dict)\n",
"\n",
" # change the 'images' key\n",
" data_processing['images'] = images_key(data_processing)\n",
"\n",
" # remove data from the 'images' key not present in the image folder\n",
" data_processing = annotated_images(images_folder_path, data_processing)\n",
"\n",
" # align 'images' and 'annotations' key\n",
" data_processing = image_annotation_key(data_processing)\n",
"\n",
" # write to a new JSON file\n",
" with open(m[2], 'w') as opened_file:\n",
" opened_file.write(json.dumps(data_processing, indent=4))\n",
"\n",
" print('\\nDictionary characteristics after processing of', m[2].replace('.json','') ,':')\n",
" print('images:',len(data_processing['images']),'categories:', len(data_processing['categories']),'annotations:',len(data_processing['annotations'])) "
]
},
{
"cell_type": "code",
"source": [
"# View the final JSON file\n",
"try:\n",
" files.view(m[2]) # use files.download to download the file\n",
"except ImportError:\n",
" pass"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 17
},
"id": "nC6XzQYL15Ki",
"outputId": "ea8e0a7d-79bc-4321-8a63-4855afc190e1"
},
"execution_count": 13,
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
"<IPython.core.display.Javascript object>"
],
"application/javascript": [
"\n",
" ((filepath) => {{\n",
" if (!google.colab.kernel.accessAllowed) {{\n",
" return;\n",
" }}\n",
" google.colab.files.view(filepath);\n",
" }})(\"/content/plastic_type_annotation.json\")"
]
},
"metadata": {}
}
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [],
"name": "json_preparation.ipynb",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
\ No newline at end of file
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment