Add supplementary new files for version 0.2.6

Commit fe851fbc authored Mar 24, 2024 by zhouxiang
20 changed files
--- a/docs/zh_cn/_static/image/lmdeploy-logo.svg
+++ b/docs/zh_cn/_static/image/lmdeploy-logo.svg
+<svg width="724" height="169" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" xml:space="preserve" overflow="hidden"><defs><clipPath id="clip0"><rect x="290" y="255" width="724" height="169"/></clipPath><linearGradient x1="515.209" y1="187.434" x2="675.945" y2="480.272" gradientUnits="userSpaceOnUse" spreadMethod="pad" id="fill1"><stop offset="0" stop-color="#9C8BFE"/><stop offset="1" stop-color="#2B50FF"/></linearGradient><linearGradient x1="366.983" y1="280.208" x2="358.966" y2="161.282" gradientUnits="userSpaceOnUse" spreadMethod="pad" id="fill2"><stop offset="0" stop-color="#E3AFFE"/><stop offset="1" stop-color="#2B50FF"/></linearGradient><linearGradient x1="339.833" y1="251.78" x2="336.655" y2="198.744" gradientUnits="userSpaceOnUse" spreadMethod="pad" id="fill3"><stop offset="0" stop-color="#748DFA"/><stop offset="1" stop-color="#C1B8FF"/></linearGradient><linearGradient x1="366.61" y1="199.406" x2="331.082" y2="291.3" gradientUnits="userSpaceOnUse" spreadMethod="pad" id="fill4"><stop offset="0" stop-color="#DBABFE"/><stop offset="1" stop-color="#C8F2FF"/></linearGradient><linearGradient x1="369.17" y1="198.557" x2="335.983" y2="245.993" gradientUnits="userSpaceOnUse" spreadMethod="pad" id="stroke5"><stop offset="0" stop-color="#FFFFFF"/><stop offset="0.46875" stop-color="#FFFFFF" stop-opacity="0"/><stop offset="1" stop-color="#FFFFFF" stop-opacity="0"/></linearGradient><linearGradient x1="378.752" y1="221.569" x2="411.083" y2="175.73" gradientUnits="userSpaceOnUse" spreadMethod="pad" id="stroke6"><stop offset="0" stop-color="#FFFFFF"/><stop offset="0.46875" stop-color="#FFFFFF" stop-opacity="0"/><stop offset="1" stop-color="#FFFFFF" stop-opacity="0"/></linearGradient><linearGradient x1="405.519" y1="173.592" x2="409.26" y2="222.227" gradientUnits="userSpaceOnUse" spreadMethod="pad" id="fill7"><stop offset="0" stop-color="#DBABFE"/><stop offset="1" stop-color="#B1E8FA"/></linearGradient><linearGradient x1="356.715" y1="253.912" 
x2="350.448" y2="271.193" gradientUnits="userSpaceOnUse" spreadMethod="pad" id="stroke8"><stop offset="0" stop-color="#AA5FE6" stop-opacity="0"/><stop offset="1" stop-color="#2E75FE"/></linearGradient><linearGradient x1="350.864" y1="235.329" x2="339.765" y2="259.744" gradientUnits="userSpaceOnUse" spreadMethod="pad" id="stroke9"><stop offset="0" stop-color="#AA5FE6" stop-opacity="0"/><stop offset="1" stop-color="#2E75FE"/></linearGradient><linearGradient x1="353.774" y1="211.139" x2="340.952" y2="235.597" gradientUnits="userSpaceOnUse" spreadMethod="pad" id="stroke10"><stop offset="0" stop-color="#AA5FE6" stop-opacity="0"/><stop offset="1" stop-color="#2E75FE"/></linearGradient></defs><g clip-path="url(#clip0)" transform="translate(-290 -255)"><path d="M0 0 1280.24 0 1280.24 463.908 0 463.908Z" fill="none" transform="matrix(1 0 0 1.00081 -0.255482 128.069)"/><path d="M589.722 261.071 569.151 213.627C567.428 209.675 565.705 205.513 563.982 201.142 563.034 198.739 562.087 196.272 561.14 193.743L560.908 193.122 560.765 192.739 560.606 192.309 560.117 192.309 560.127 192.486 560.156 193.058 560.166 193.275C560.704 203.942 560.972 213.736 560.972 222.652L560.972 261.071 551.023 261.071 551.023 181.62 565.367 181.62 584.594 226.572C586.654 231.396 588.911 237.144 591.365 243.812L591.858 245.158 592.163 245.158 592.21 245.03 592.302 244.777 592.408 244.486C595.227 236.778 597.568 230.803 599.427 226.572L618.654 181.62 632.998 181.62 632.998 261.071 623.049 261.071 623.049 222.652C623.049 214.228 623.299 204.812 623.8 194.403L623.855 193.272 623.866 193.058 623.894 192.486 623.904 192.309 623.415 192.309 623.146 193.032 623.114 193.119C620.214 200.895 617.465 207.736 614.869 213.627L594.3 261.071 589.722 261.071ZM718.209 234.053C719.389 229.975 719.979 225.622 719.979 220.99 719.979 216.398 719.389 212.121 718.209 208.164 717.07 204.165 715.32 200.582 712.96 197.415 710.64 194.209 707.69 191.439 704.109 189.102 700.569 186.766 696.418 184.945 691.658 183.639 688.93 
182.886 685.981 182.371 682.807 182.095 681.362 181.951 679.864 181.839 678.31 181.761 676.45 181.667 674.509 181.62 672.491 181.62L653.813 181.62 653.813 261.071 672.308 261.071C674.622 261.071 676.833 261.017 678.941 260.909L679.099 260.902 679.955 260.853C680.864 260.798 681.754 260.732 682.623 260.656 685.797 260.339 688.748 259.806 691.474 259.053 696.235 257.747 700.386 255.905 703.926 253.53 707.507 251.156 710.477 248.344 712.838 245.098 715.239 241.813 717.029 238.132 718.209 234.053ZM704.354 202.403C707.772 207.154 709.481 213.349 709.481 220.99 709.481 228.708 707.772 235.044 704.354 239.992 700.935 244.94 696.032 248.404 689.643 250.383 688.628 250.695 687.564 250.967 686.451 251.198 684.886 251.521 683.223 251.765 681.464 251.927 678.452 252.205 675.054 252.342 671.27 252.342L663.762 252.342 663.762 190.349 671.27 190.349C675.054 190.349 678.452 190.487 681.464 190.764 684.474 191.042 687.202 191.557 689.643 192.309 696.032 194.288 700.935 197.652 704.354 202.403ZM842.824 232.331C842.824 229.046 842.458 226.097 841.724 223.484 841.036 220.871 839.974 218.654 838.552 216.833 837.13 214.973 835.379 213.548 833.299 212.557 832.611 212.23 831.893 211.955 831.129 211.736 830.875 211.663 830.621 211.595 830.359 211.534 829.042 211.227 827.62 211.074 826.101 211.074 824.148 211.074 822.353 211.331 820.729 211.845 819.143 212.359 817.698 213.013 816.396 213.805 815.095 214.596 813.972 215.447 813.037 216.358 812.102 217.269 811.346 218.08 810.777 218.793L810.777 248.305C812.775 250.285 815.027 251.848 817.556 252.995 818.813 253.56 820.115 253.986 821.47 254.274 822.001 254.388 822.547 254.48 823.101 254.55 823.976 254.662 824.867 254.719 825.779 254.719L825.794 254.719C826.752 254.719 827.74 254.628 828.757 254.447 829.64 254.289 830.546 254.062 831.474 253.768 833.464 253.095 835.297 251.927 836.966 250.264 838.672 248.561 840.078 246.266 841.178 243.377 842.271 240.486 842.824 236.804 842.824 232.331ZM852.417 237.663C852.215 239.443 851.908 241.11 851.489 
242.663 850.718 245.672 849.678 248.305 848.376 250.561 847.074 252.817 845.548 254.719 843.797 256.262 842.794 257.129 841.769 257.902 840.722 258.584 839.944 259.087 839.165 259.54 838.365 259.943 836.494 260.853 834.586 261.507 832.633 261.902 830.718 262.338 828.907 262.556 827.201 262.556 823.654 262.556 820.587 262.002 817.983 260.893 815.701 259.889 813.628 258.476 811.75 256.654 811.608 256.511 811.466 256.369 811.323 256.223 811.196 256.098 811.077 255.974 810.964 255.847L810.777 255.847 810.777 288.684 801.319 288.684 801.319 204.659 810.471 204.659 810.471 211.074 810.658 211.074C810.83 210.813 811.017 210.556 811.219 210.302 811.638 209.779 812.124 209.263 812.67 208.757 813.523 208.005 814.646 207.213 816.03 206.382 816.703 205.97 817.437 205.586 818.222 205.226 819.105 204.826 820.063 204.459 821.095 204.125 821.904 203.869 822.734 203.664 823.595 203.512 824.859 203.288 826.184 203.175 827.56 203.175 830.823 203.175 833.95 203.71 836.966 204.778 840.018 205.847 842.705 207.55 845.024 209.886 847.344 212.221 849.192 215.23 850.576 218.911 851.998 222.553 852.709 226.987 852.709 232.213 852.709 234.135 852.611 235.951 852.417 237.663ZM935.995 232.925C935.995 229.442 935.546 226.354 934.656 223.662 933.803 220.931 932.583 218.633 930.989 216.773 929.448 214.873 927.592 213.449 925.437 212.498 923.282 211.508 920.918 211.014 918.359 211.014 915.793 211.014 913.436 211.508 911.273 212.498 909.118 213.449 907.248 214.873 905.661 216.773 904.794 217.818 904.03 219.001 903.364 220.32 902.848 221.35 902.392 222.464 902.003 223.662 901.142 226.354 900.716 229.442 900.716 232.925 900.716 236.369 901.142 239.457 902.003 242.189 902.893 244.88 904.135 247.157 905.721 249.018 907.308 250.878 909.178 252.303 911.341 253.293 913.496 254.244 915.853 254.719 918.419 254.719 920.985 254.719 923.32 254.244 925.437 253.293 926.979 252.588 928.356 251.661 929.583 250.514 930.077 250.051 930.548 249.553 930.989 249.018 932.583 247.157 933.803 244.88 934.656 242.189 935.546 
239.457 935.995 236.369 935.995 232.925ZM945.887 232.925C945.887 237.359 945.236 241.396 943.934 245.039 942.632 248.681 940.776 251.809 938.382 254.422 936.018 256.994 933.152 258.993 929.77 260.418 926.395 261.844 922.609 262.556 918.419 262.556 914.102 262.556 910.241 261.844 906.821 260.418 904.921 259.617 903.177 258.633 901.584 257.469 901.411 257.339 901.232 257.209 901.06 257.074 900.02 256.27 899.055 255.386 898.157 254.422 895.792 251.809 893.982 248.681 892.717 245.039 892.014 242.991 891.505 240.816 891.191 238.517L891.153 238.216C890.936 236.52 890.831 234.756 890.831 232.925 890.831 228.451 891.482 224.394 892.784 220.752 893.518 218.704 894.415 216.818 895.478 215.096L895.538 215.007C896.353 213.7 897.266 212.488 898.276 211.37 900.678 208.757 903.566 206.738 906.941 205.313 910.36 203.888 914.169 203.175 918.359 203.175 922.632 203.175 926.477 203.888 929.897 205.313 933.309 206.738 936.205 208.757 938.562 211.37 940.919 213.983 942.729 217.111 943.994 220.752 944.458 222.092 944.839 223.489 945.131 224.939L945.191 225.231C945.655 227.641 945.887 230.206 945.887 232.925ZM976.587 259.943 964.196 288.684 973.602 288.684 1009.79 204.659 999.842 204.659 981.593 248.899 981.226 248.899 963.224 204.659 953.519 204.659 976.587 259.943ZM787.896 235.787C785.352 249.711 773.073 260.247 758.79 260.247 742.478 260.247 729.206 246.852 729.206 230.387 729.206 220.419 734.071 211.575 741.531 206.149 746.423 202.538 752.452 200.404 758.962 200.404 771.248 200.404 782.067 207.869 786.527 219.398L788.24 223.81 788.046 223.884 788.068 223.934 742.402 241.925C742.482 242.04 742.561 242.152 742.643 242.264L742.821 242.506C743.147 242.943 743.49 243.366 743.849 243.774 747.55 247.979 752.961 250.636 758.962 250.636 763.361 250.636 767.477 249.173 770.836 246.695 775.326 243.309 778.416 238.091 778.91 232.132L779.067 232.146C779.067 232.138 779.067 232.132 779.067 232.126L779.074 232.008 788.442 232.811C788.367 233.768 788.24 234.712 788.068 235.642L787.896 
235.787ZM747.047 213.806C742.027 217.517 738.759 223.504 738.759 230.246 738.759 230.51 738.765 230.775 738.776 231.039L738.782 231.207 738.787 231.304C738.801 231.567 738.82 231.83 738.843 232.09L738.86 232.267 738.901 232.641 738.949 233.037 775.356 218.692C771.637 213.294 765.531 209.998 758.775 209.998 754.405 210.001 750.357 211.413 747.047 213.806ZM535.763 252.342 535.763 261.071 485.955 261.071 485.955 181.62 495.904 181.62 495.904 252.342 535.763 252.342ZM865.743 175.088 875.201 175.088 875.201 261.071 865.743 261.071 865.743 175.088Z" fill="none" fill-rule="evenodd" transform="matrix(1 0 0 1.00081 -0.255482 128.069)"/><path d="M589.722 261.071 569.151 213.627C567.428 209.675 565.705 205.513 563.982 201.142 563.034 198.739 562.087 196.272 561.14 193.743L560.908 193.122 560.765 192.739 560.606 192.309 560.117 192.309 560.127 192.486 560.156 193.058 560.166 193.275C560.704 203.942 560.972 213.736 560.972 222.652L560.972 261.071 551.023 261.071 551.023 181.62 565.367 181.62 584.594 226.572C586.654 231.396 588.911 237.144 591.365 243.812L591.858 245.158 592.163 245.158 592.21 245.03 592.302 244.777 592.408 244.486C595.227 236.778 597.568 230.803 599.427 226.572L618.654 181.62 632.998 181.62 632.998 261.071 623.049 261.071 623.049 222.652C623.049 214.228 623.299 204.812 623.8 194.403L623.855 193.272 623.866 193.058 623.894 192.486 623.904 192.309 623.415 192.309 623.146 193.032 623.114 193.119C620.214 200.895 617.465 207.736 614.869 213.627L594.3 261.071 589.722 261.071ZM718.209 234.053C719.389 229.975 719.979 225.622 719.979 220.99 719.979 216.398 719.389 212.121 718.209 208.164 717.07 204.165 715.32 200.582 712.96 197.415 710.64 194.209 707.69 191.439 704.109 189.102 700.569 186.766 696.418 184.945 691.658 183.639 688.93 182.886 685.981 182.371 682.807 182.095 681.362 181.951 679.864 181.839 678.31 181.761 676.45 181.667 674.509 181.62 672.491 181.62L653.813 181.62 653.813 261.071 672.308 261.071C674.622 261.071 676.833 261.017 678.941 260.909L679.099 260.902 
679.955 260.853C680.864 260.798 681.754 260.732 682.623 260.656 685.797 260.339 688.748 259.806 691.474 259.053 696.235 257.747 700.386 255.905 703.926 253.53 707.507 251.156 710.477 248.344 712.838 245.098 715.239 241.813 717.029 238.132 718.209 234.053ZM704.354 202.403C707.772 207.154 709.481 213.349 709.481 220.99 709.481 228.708 707.772 235.044 704.354 239.992 700.935 244.94 696.032 248.404 689.643 250.383 688.628 250.695 687.564 250.967 686.451 251.198 684.886 251.521 683.223 251.765 681.464 251.927 678.452 252.205 675.054 252.342 671.27 252.342L663.762 252.342 663.762 190.349 671.27 190.349C675.054 190.349 678.452 190.487 681.464 190.764 684.474 191.042 687.202 191.557 689.643 192.309 696.032 194.288 700.935 197.652 704.354 202.403ZM842.824 232.331C842.824 229.046 842.458 226.097 841.724 223.484 841.036 220.871 839.974 218.654 838.552 216.833 837.13 214.973 835.379 213.548 833.299 212.557 832.611 212.23 831.893 211.955 831.129 211.736 830.875 211.663 830.621 211.595 830.359 211.534 829.042 211.227 827.62 211.074 826.101 211.074 824.148 211.074 822.353 211.331 820.729 211.845 819.143 212.359 817.698 213.013 816.396 213.805 815.095 214.596 813.972 215.447 813.037 216.358 812.102 217.269 811.346 218.08 810.777 218.793L810.777 248.305C812.775 250.285 815.027 251.848 817.556 252.995 818.813 253.56 820.115 253.986 821.47 254.274 822.001 254.388 822.547 254.48 823.101 254.55 823.976 254.662 824.867 254.719 825.779 254.719L825.794 254.719C826.752 254.719 827.74 254.628 828.757 254.447 829.64 254.289 830.546 254.062 831.474 253.768 833.464 253.095 835.297 251.927 836.966 250.264 838.672 248.561 840.078 246.266 841.178 243.377 842.271 240.486 842.824 236.804 842.824 232.331ZM852.417 237.663C852.215 239.443 851.908 241.11 851.489 242.663 850.718 245.672 849.678 248.305 848.376 250.561 847.074 252.817 845.548 254.719 843.797 256.262 842.794 257.129 841.769 257.902 840.722 258.584 839.944 259.087 839.165 259.54 838.365 259.943 836.494 260.853 834.586 261.507 832.633 
261.902 830.718 262.338 828.907 262.556 827.201 262.556 823.654 262.556 820.587 262.002 817.983 260.893 815.701 259.889 813.628 258.476 811.75 256.654 811.608 256.511 811.466 256.369 811.323 256.223 811.196 256.098 811.077 255.974 810.964 255.847L810.777 255.847 810.777 288.684 801.319 288.684 801.319 204.659 810.471 204.659 810.471 211.074 810.658 211.074C810.83 210.813 811.017 210.556 811.219 210.302 811.638 209.779 812.124 209.263 812.67 208.757 813.523 208.005 814.646 207.213 816.03 206.382 816.703 205.97 817.437 205.586 818.222 205.226 819.105 204.826 820.063 204.459 821.095 204.125 821.904 203.869 822.734 203.664 823.595 203.512 824.859 203.288 826.184 203.175 827.56 203.175 830.823 203.175 833.95 203.71 836.966 204.778 840.018 205.847 842.705 207.55 845.024 209.886 847.344 212.221 849.192 215.23 850.576 218.911 851.998 222.553 852.709 226.987 852.709 232.213 852.709 234.135 852.611 235.951 852.417 237.663ZM935.995 232.925C935.995 229.442 935.546 226.354 934.656 223.662 933.803 220.931 932.583 218.633 930.989 216.773 929.448 214.873 927.592 213.449 925.437 212.498 923.282 211.508 920.918 211.014 918.359 211.014 915.793 211.014 913.436 211.508 911.273 212.498 909.118 213.449 907.248 214.873 905.661 216.773 904.794 217.818 904.03 219.001 903.364 220.32 902.848 221.35 902.392 222.464 902.003 223.662 901.142 226.354 900.716 229.442 900.716 232.925 900.716 236.369 901.142 239.457 902.003 242.189 902.893 244.88 904.135 247.157 905.721 249.018 907.308 250.878 909.178 252.303 911.341 253.293 913.496 254.244 915.853 254.719 918.419 254.719 920.985 254.719 923.32 254.244 925.437 253.293 926.979 252.588 928.356 251.661 929.583 250.514 930.077 250.051 930.548 249.553 930.989 249.018 932.583 247.157 933.803 244.88 934.656 242.189 935.546 239.457 935.995 236.369 935.995 232.925ZM945.887 232.925C945.887 237.359 945.236 241.396 943.934 245.039 942.632 248.681 940.776 251.809 938.382 254.422 936.018 256.994 933.152 258.993 929.77 260.418 926.395 261.844 922.609 262.556 
918.419 262.556 914.102 262.556 910.241 261.844 906.821 260.418 904.921 259.617 903.177 258.633 901.584 257.469 901.411 257.339 901.232 257.209 901.06 257.074 900.02 256.27 899.055 255.386 898.157 254.422 895.792 251.809 893.982 248.681 892.717 245.039 892.014 242.991 891.505 240.816 891.191 238.517L891.153 238.216C890.936 236.52 890.831 234.756 890.831 232.925 890.831 228.451 891.482 224.394 892.784 220.752 893.518 218.704 894.415 216.818 895.478 215.096L895.538 215.007C896.353 213.7 897.266 212.488 898.276 211.37 900.678 208.757 903.566 206.738 906.941 205.313 910.36 203.888 914.169 203.175 918.359 203.175 922.632 203.175 926.477 203.888 929.897 205.313 933.309 206.738 936.205 208.757 938.562 211.37 940.919 213.983 942.729 217.111 943.994 220.752 944.458 222.092 944.839 223.489 945.131 224.939L945.191 225.231C945.655 227.641 945.887 230.206 945.887 232.925ZM976.587 259.943 964.196 288.684 973.602 288.684 1009.79 204.659 999.842 204.659 981.593 248.899 981.226 248.899 963.224 204.659 953.519 204.659 976.587 259.943ZM787.896 235.787C785.352 249.711 773.073 260.247 758.79 260.247 742.478 260.247 729.206 246.852 729.206 230.387 729.206 220.419 734.071 211.575 741.531 206.149 746.423 202.538 752.452 200.404 758.962 200.404 771.248 200.404 782.067 207.869 786.527 219.398L788.24 223.81 788.046 223.884 788.068 223.934 742.402 241.925C742.482 242.04 742.561 242.152 742.643 242.264L742.821 242.506C743.147 242.943 743.49 243.366 743.849 243.774 747.55 247.979 752.961 250.636 758.962 250.636 763.361 250.636 767.477 249.173 770.836 246.695 775.326 243.309 778.416 238.091 778.91 232.132L779.067 232.146C779.067 232.138 779.067 232.132 779.067 232.126L779.074 232.008 788.442 232.811C788.367 233.768 788.24 234.712 788.068 235.642L787.896 235.787ZM747.047 213.806C742.027 217.517 738.759 223.504 738.759 230.246 738.759 230.51 738.765 230.775 738.776 231.039L738.782 231.207 738.787 231.304C738.801 231.567 738.82 231.83 738.843 232.09L738.86 232.267 738.901 232.641 738.949 233.037 
775.356 218.692C771.637 213.294 765.531 209.998 758.775 209.998 754.405 210.001 750.357 211.413 747.047 213.806ZM535.763 252.342 535.763 261.071 485.955 261.071 485.955 181.62 495.904 181.62 495.904 252.342 535.763 252.342ZM865.743 175.088 875.201 175.088 875.201 261.071 865.743 261.071 865.743 175.088Z" fill="url(#fill1)" fill-rule="evenodd" transform="matrix(1 0 0 1.00081 -0.255482 128.069)"/><path d="M417.928 210.759 332.03 292.638 356.588 212.584 329.253 211.565 415.752 129.412 390.657 209.626 417.928 210.759Z" fill="url(#fill2)" fill-rule="evenodd" transform="matrix(1 0 0 1.00081 -0.255482 128.069)"/><path d="M352.974 215.897 331.46 292.898 370.665 200.078C370.913 199.492 370.362 198.884 369.754 199.072L328.536 211.86 352.35 214.954C352.802 215.013 353.097 215.459 352.974 215.897Z" fill="url(#fill3)" transform="matrix(1 0 0 1.00081 -0.255482 128.069)"/><path d="M352.974 215.897 331.46 292.898 370.665 200.078C370.913 199.492 370.362 198.884 369.754 199.072L328.536 211.86 352.35 214.954C352.802 215.013 353.097 215.459 352.974 215.897Z" fill="url(#fill4)" transform="matrix(1 0 0 1.00081 -0.255482 128.069)"/><path d="M352.974 215.897 331.46 292.898 370.665 200.078C370.913 199.492 370.362 198.884 369.754 199.072L328.536 211.86 352.35 214.954C352.802 215.013 353.097 215.459 352.974 215.897Z" stroke="url(#stroke5)" stroke-width="0.748239" fill="none" transform="matrix(1 0 0 1.00081 -0.255482 128.069)"/><path d="M394.247 202.173 415.328 129.974 377.297 220.145C377.057 220.715 377.573 221.314 378.172 221.161L417.509 211.1 394.716 203.089C394.342 202.957 394.135 202.554 394.247 202.173Z" stroke="url(#stroke6)" stroke-width="0.748239" fill="url(#fill7)" transform="matrix(1 0 0 1.00081 -0.255482 128.069)"/><path d="M400.69 240.126C415.788 244.356 425.536 251.018 425.453 258.426 425.315 270.82 397.71 280.608 363.797 280.288 329.883 279.969 302.503 269.662 302.641 257.268 302.735 248.864 315.458 241.657 334.215 237.989" stroke="url(#stroke8)" stroke-width="5.23768" 
fill="none" transform="matrix(1 0 0 1.00081 -0.255482 128.069)"/><path d="M403.693 233.437C417.578 241.42 425.394 250.68 423.145 258.396 419.383 271.306 388.87 275.007 354.995 266.662 321.119 258.317 296.707 241.086 300.47 228.176 303.021 219.421 317.873 214.902 337.734 215.501" stroke="url(#stroke9)" stroke-width="5.23768" fill="none" transform="matrix(1 0 0 1.00081 -0.255482 128.069)"/><path d="M403.498 232.586C414.855 243.273 420.115 253.555 416.138 259.89 409.483 270.487 379.485 266.019 349.137 249.91 318.787 233.801 299.58 212.151 306.236 201.553 310.748 194.367 325.995 194.108 344.807 199.71" stroke="url(#stroke10)" stroke-width="5.23768" fill="none" transform="matrix(1 0 0 1.00081 -0.255482 128.069)"/></g></svg>
--- a/docs/zh_cn/advance/chat_template.md
+++ b/docs/zh_cn/advance/chat_template.md
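+# Customized chat template
+The effect of the applied chat template can be observed by setting the log level to `INFO`.
+LMDeploy supports two ways of adding a chat template:
+- One way is to reuse an existing chat template by providing a JSON file like the following.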
+  ```json
+  {
+      "model_name": "your awesome chat template name",
+      "system": "<|im_start|>system\n",
+      "meta_instruction": "You are a robot developed by LMDeploy.",
+      "eosys": "<|im_end|>\n",
+      "user": "<|im_start|>user\n",
+      "eoh": "<|im_end|>\n",
+      "assistant": "<|im_start|>assistant\n",
+      "eoa": "<|im_end|>",
+      "separator": "\n",
+      "capability": "chat",
+      "stop_words": ["<|im_end|>"]
+  }
+  ```
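+  `model_name` is a required field. It can be either the name of an LMDeploy built-in chat template (listed by `lmdeploy list`) or a new name. All other fields are optional.
+  When `model_name` is the name of a built-in chat template, the non-null fields in the JSON file override the corresponding attributes of the original chat template.
+  When `model_name` is a new name, `BaseChatTemplate` is registered directly as a new chat template. Its definition can be found in [BaseChatTemplate](https://github.com/InternLM/lmdeploy/blob/24bd4b9ab6a15b3952e62bcfc72eaba03bce9dcb/lmdeploy/model.py#L113-L188).
+  The new chat template is concatenated in the following form.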
+  ```
+  {system}{meta_instruction}{eosys}{user}{user_content}{eoh}{assistant}{assistant_content}{eoa}{separator}{user}...
+  ```
+  When using the CLI tools, pass the custom chat template via `--chat-template`, for example:
+  ```shell
+  lmdeploy serve api_server internlm/internlm2-chat-7b --chat-template ${JSON_FILE}
+  ```
+  It can also be passed in through the Python API, for example:
+  ```python
+  from lmdeploy import ChatTemplateConfig, serve
+  serve('internlm/internlm2-chat-7b',
+        chat_template_config=ChatTemplateConfig.from_json('${JSON_FILE}'))
+  ```
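+- The other way is to define a Python chat template class, following the existing LMDeploy chat templates, and use it directly once it is registered. The advantages are a high degree of customization and strong controllability.
+  Below is an example of registering an LMDeploy chat template: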
+  ```python
+  from lmdeploy.model import MODELS, BaseChatTemplate
+  @MODELS.register_module(name='customized_model')
+  class CustomizedModel(BaseChatTemplate):
+      """A customized chat template."""
+      def __init__(self,
+                   system='<|im_start|>system\n',
+                   meta_instruction='You are a robot developed by LMDeploy.',
+                   user='<|im_start|>user\n',
+                   assistant='<|im_start|>assistant\n',
+                   eosys='<|im_end|>\n',
+                   eoh='<|im_end|>\n',
+                   eoa='<|im_end|>',
+                   separator='\n',
+                   stop_words=['<|im_end|>', '<|action_end|>']):
+          super().__init__(system=system,
+                           meta_instruction=meta_instruction,
+                           eosys=eosys,
+                           user=user,
+                           eoh=eoh,
+                           assistant=assistant,
+                           eoa=eoa,
+                           separator=separator,
+                           stop_words=stop_words)
+  from lmdeploy import ChatTemplateConfig, pipeline
+  messages = [{'role': 'user', 'content': 'who are you?'}]
+  pipe = pipeline('internlm/internlm2-chat-7b',
+                  chat_template_config=ChatTemplateConfig('customized_model'))
+  for response in pipe.stream_infer(messages):
+      print(response.text, end='')
+  ```
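+  In this example, we register an LMDeploy chat template whose meta instruction states that the model was developed by LMDeploy, so when the user asks who the model is, the model answers that it was created by LMDeploy.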
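The concatenation pattern given for the JSON template can be made concrete with a small sketch. This is not LMDeploy's implementation: `TemplateSketch` and `render` are hypothetical names introduced only for illustration, and the field values are copied from the JSON example. It simply joins the fields in the documented order and leaves the prompt open on the assistant prefix.

```python
from dataclasses import dataclass


@dataclass
class TemplateSketch:
    """Hypothetical stand-in holding the fields of the JSON example."""
    system: str = '<|im_start|>system\n'
    meta_instruction: str = 'You are a robot developed by LMDeploy.'
    eosys: str = '<|im_end|>\n'
    user: str = '<|im_start|>user\n'
    eoh: str = '<|im_end|>\n'
    assistant: str = '<|im_start|>assistant\n'
    eoa: str = '<|im_end|>'
    separator: str = '\n'


def render(template, messages):
    """Join the fields in the order of the documented concatenation pattern."""
    prompt = f'{template.system}{template.meta_instruction}{template.eosys}'
    for message in messages:
        if message['role'] == 'user':
            prompt += f"{template.user}{message['content']}{template.eoh}"
        elif message['role'] == 'assistant':
            prompt += (f"{template.assistant}{message['content']}"
                       f'{template.eoa}{template.separator}')
    # Leave the prompt open on the assistant prefix so the model answers next.
    if messages and messages[-1]['role'] == 'user':
        prompt += template.assistant
    return prompt


messages = [
    {'role': 'user', 'content': 'who are you?'},
    {'role': 'assistant', 'content': 'I am a robot.'},
    {'role': 'user', 'content': 'hi'},
]
prompt = render(TemplateSketch(), messages)
print(prompt)
```

Printing the rendered prompt next to the `INFO` log output is one way to check that a custom template produces the text you expect.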
--- a/docs/zh_cn/advance/debug_turbomind.md
+++ b/docs/zh_cn/advance/debug_turbomind.md
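+# How to debug Turbomind
+Turbomind is implemented in C++, which is not as easy to debug as Python. This document provides basic methods for debugging Turbomind.
+## Prerequisite
+First, complete the local build by following the [build instructions](../build.md).
+## Configure the Python debug environment
+Since many large companies currently use CentOS 7 in their online production environments, we take CentOS 7 as the example to illustrate the configuration process.
+### Obtain the `glibc` and `python3` versions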
+```bash
+rpm -qa | grep glibc
+rpm -qa | grep python3
+```
+The output is similar to the following:
+```
+[username@hostname workdir]# rpm -qa | grep glibc
+glibc-2.17-325.el7_9.x86_64
+glibc-common-2.17-325.el7_9.x86_64
+glibc-headers-2.17-325.el7_9.x86_64
+glibc-devel-2.17-325.el7_9.x86_64
+[username@hostname workdir]# rpm -qa | grep python3
+python3-pip-9.0.3-8.el7.noarch
+python3-rpm-macros-3-34.el7.noarch
+python3-rpm-generators-6-2.el7.noarch
+python3-setuptools-39.2.0-10.el7.noarch
+python3-3.6.8-21.el7_9.x86_64
+python3-devel-3.6.8-21.el7_9.x86_64
+python3.6.4-sre-1.el6.x86_64
+```
+From the output above, the `glibc` version is `2.17-325.el7_9.x86_64` and the `python3` version is `3.6.8-21.el7_9.x86_64`.
+### Download and install the `debuginfo` packages
+Download `glibc-debuginfo-common-2.17-325.el7.x86_64.rpm`, `glibc-debuginfo-2.17-325.el7.x86_64.rpm`, and `python3-debuginfo-3.6.8-21.el7.x86_64.rpm` from http://debuginfo.centos.org/7/x86_64.
+```bash
+rpm -ivh glibc-debuginfo-common-2.17-325.el7.x86_64.rpm
+rpm -ivh glibc-debuginfo-2.17-325.el7.x86_64.rpm
+rpm -ivh python3-debuginfo-3.6.8-21.el7.x86_64.rpm
+```
+### Upgrade GDB
+```bash
+sudo yum install devtoolset-10 -y
+echo "source scl_source enable devtoolset-10" >> ~/.bashrc
+source ~/.bashrc
+```
+### Verification
+```bash
+gdb python3
+```
+The output is similar to the following:
+```
+[username@hostname workdir]# gdb python3
+GNU gdb (GDB) Red Hat Enterprise Linux 9.2-10.el7
+Copyright (C) 2020 Free Software Foundation, Inc.
+License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
+This is free software: you are free to change and redistribute it.
+There is NO WARRANTY, to the extent permitted by law.
+Type "show copying" and "show warranty" for details.
+This GDB was configured as "x86_64-redhat-linux-gnu".
+Type "show configuration" for configuration details.
+For bug reporting instructions, please see:
+<http://www.gnu.org/software/gdb/bugs/>.
+Find the GDB manual and other documentation resources online at:
+   <http://www.gnu.org/software/gdb/documentation/>.
+For help, type "help".
+Type "apropos word" to search for commands related to "word"...
+Reading symbols from python3...
+(gdb)
+```
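+If `Reading symbols from python3` is displayed, the configuration is successful.
+For other operating systems, please refer to [DebuggingWithGdb](https://wiki.python.org/moin/DebuggingWithGdb).
+## Set up symbolic links
+With the symbolic links in place, there is no need to reinstall locally with `pip` after every build.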
+```bash
+# Change directory to lmdeploy, e.g.
+cd /workdir/lmdeploy
+# The build artifacts are in the build folder,
+# so create a symbolic link to lib
+cd lmdeploy && ln -s ../build/lib . && cd ..
+# (Optional) Create a symbolic link to compile_commands.json so that clangd can build its index
+ln -s build/compile_commands.json .
+```
+## Start debugging
+````bash
+# Launch the API server under gdb, e.g.
+gdb --args python3 -m lmdeploy serve api_server /workdir/Llama-2-13b-chat-hf
+# Set the lmdeploy folder path in gdb
+Reading symbols from python3...
+(gdb) set directories /workdir/lmdeploy
+# Set a breakpoint using a relative path, e.g.
+(gdb) b src/turbomind/models/llama/BlockManager.cc:104
+# When the following prompt appears
+# ```
+# No source file named src/turbomind/models/llama/BlockManager.cc.
+# Make breakpoint pending on future shared library load? (y or [n])
+# ```
+# type y and press Enter
+# Run
+(gdb) r
+# (Optional) Send requests with https://github.com/InternLM/lmdeploy/blob/main/benchmark/profile_restful_api.py
+python3 profile_restful_api.py --server_addr 127.0.0.1:23333 --tokenizer_path /workdir/Llama-2-13b-chat-hf --dataset /workdir/ShareGPT_V3_unfiltered_cleaned_split.json --concurrency 1 --num_prompts 1
+````
+## Using GDB
+Refer to [GDB Execution Commands](https://lldb.llvm.org/use/map.html) (the GDB to LLDB command map) for the commands used while debugging.
--- a/docs/zh_cn/advance/long_context.md
+++ b/docs/zh_cn/advance/long_context.md
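+# Context length extrapolation
+Long text extrapolation refers to the ability of an LLM to handle, at inference time, data longer than its training text. The TurboMind engine currently supports [LlamaDynamicNTKScalingRotaryEmbedding](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L178), aligned with the HuggingFace implementation.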
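To give a feel for what dynamic NTK scaling does, the sketch below reproduces the base-rescaling rule used by HuggingFace's `LlamaDynamicNTKScalingRotaryEmbedding`: once the sequence exceeds the trained context length, the RoPE base is enlarged so that the rotary frequencies stretch over the longer sequence. The function names and the numbers in the usage lines (`max_position_embeddings=32768`, `dim=128`) are assumptions chosen for illustration, and TurboMind's kernels may differ in detail.

```python
def dynamic_ntk_base(base, dim, seq_len, max_position_embeddings,
                     scaling_factor):
    """Return the RoPE base rescaled for seq_len (dynamic NTK rule)."""
    if seq_len <= max_position_embeddings:
        # Within the trained context length the base is left unchanged.
        return base
    scale = (scaling_factor * seq_len / max_position_embeddings) \
        - (scaling_factor - 1)
    return base * scale ** (dim / (dim - 2))


def inverse_frequencies(base, dim):
    """Rotary inverse frequencies for a head dimension `dim`."""
    return [1.0 / (base ** (i / dim)) for i in range(0, dim, 2)]


# Assumed values, for illustration only.
short_base = dynamic_ntk_base(10000.0, 128, 4096, 32768, 2.0)
long_base = dynamic_ntk_base(10000.0, 128, 160000, 32768, 2.0)
print(short_base, long_base)
print(inverse_frequencies(long_base, 128)[:4])
```

A larger `rope_scaling_factor` makes the base grow faster with sequence length, which is why the value is paired with a longer `session_len`.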
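+## Usage
+To load a HuggingFace-format model directly, enable extrapolation by modifying the `TurbomindEngineConfig` parameters: set `session_len` to the desired extrapolated length and `rope_scaling_factor` to a value no less than 1.0.
+Taking InternLM2 as an example, long-context inference can be enabled as follows: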
+```python
+from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig
+backend_config = TurbomindEngineConfig(rope_scaling_factor=2.0, session_len=160000)
+pipe = pipeline('internlm/internlm2-chat-7b', backend_config=backend_config)
+prompt = 'Use a long prompt to replace this sentence'
+gen_config = GenerationConfig(top_p=0.8,
+                              top_k=40,
+                              temperature=0.8,
+                              max_new_tokens=1024)
+response = pipe(prompt, gen_config=gen_config)
+print(response)
+```
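+## Evaluation
+We evaluate the long-context inference ability of LMDeploy in several ways: the [passkey retrieval experiment](#passkey-retrieval), the needle-in-a-haystack experiment, and perplexity computation.
+### Passkey Retrieval
+Run the following code to test how many times the model succeeds or fails at finding the special key hidden in a long text.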
+```python
+import numpy as np
+from lmdeploy import pipeline
+from lmdeploy import TurbomindEngineConfig
+session_len = 160000
+backend_config = TurbomindEngineConfig(rope_scaling_factor=2.0, session_len=session_len)
+pipe = pipeline('internlm/internlm2-chat-7b', backend_config=backend_config)
+def passkey_retrieval(session_len, n_round=5):
+    # create long context input
+    tok = pipe.tokenizer
+    task_description = 'There is an important info hidden inside a lot of irrelevant text. Find it and memorize them. I will quiz you about the important information there.'
+    garbage = 'The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again.'
+    for _ in range(n_round):
+        # number of garbage repetitions that fit, leaving about 1000 tokens of headroom
+        n_times = (session_len - 1000) // len(tok.encode(garbage))
+        # hide the pass key at a random position inside the garbage text
+        n_garbage_prefix = np.random.randint(0, n_times)
+        n_garbage_suffix = n_times - n_garbage_prefix
+        garbage_prefix = ' '.join([garbage] * n_garbage_prefix)
+        garbage_suffix = ' '.join([garbage] * n_garbage_suffix)
+        pass_key = np.random.randint(1, 50000)
+        information_line = f'The pass key is {pass_key}. Remember it. {pass_key} is the pass key.'  # noqa: E501
+        final_question = 'What is the pass key? The pass key is'
+        lines = [
+            task_description,
+            garbage_prefix,
+            information_line,
+            garbage_suffix,
+            final_question,
+        ]
+        # inference
+        prompt = ' '.join(lines)
+        response = pipe([prompt])
+        # print the expected key next to the model output for comparison
+        print(pass_key, response)
+passkey_retrieval(session_len, 5)
+```
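The script above prints each pass key next to the raw response, so the success and failure counts still have to be tallied by hand. A minimal scoring sketch follows. The helper names are hypothetical, and it assumes the generated text is available as a string (for example through a `text` attribute on each response, as in the streaming example of the chat template document) and that the first integer in the text is the model's answer.

```python
import re


def is_pass_key_found(pass_key, response_text):
    """Return True when the first integer in the response equals pass_key."""
    match = re.search(r'\d+', response_text)
    return match is not None and int(match.group()) == pass_key


def count_hits(records):
    """Count successes and failures over (pass_key, response_text) pairs."""
    hits = sum(is_pass_key_found(key, text) for key, text in records)
    return hits, len(records) - hits


# Made-up responses, for illustration only.
records = [
    (12345, ' 12345.'),
    (777, 'I do not know.'),
    (42, 'The pass key is 24.'),
]
hits, misses = count_hits(records)
print(f'success: {hits}, failure: {misses}')
```

Matching the first integer is a deliberately strict choice; a looser check could accept the key anywhere in the response.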
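+### Needle in a Haystack
+OpenCompass can be used for this evaluation. For specific usage, refer to the [documentation](https://github.com/open-compass/opencompass/blob/main/docs/zh_cn/advanced_guides/needleinahaystack_eval.md).
+### Perplexity
+The following shows how to compute perplexity with LMDeploy.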
+```python
+from datasets import load_dataset
+from lmdeploy import TurbomindEngineConfig
+from lmdeploy.turbomind import TurboMind
+import numpy as np
+# load model and tokenizer
+engine_config = TurbomindEngineConfig(rope_scaling_factor=2.0, session_len=160000)
+engine = TurboMind.from_pretrained('internlm/internlm2-chat-7b', engine_config)
+tokenizer = engine.tokenizer
+generator = engine.create_instance()
+# get perplexity
+text = 'The grass is green. The sky is blue. The sun is yellow'
+input_ids = tokenizer.encode(text)
+loss = generator.get_ppl(input_ids)[0]
+ppl = np.exp(loss)
+```
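The last line of the snippet, `np.exp(loss)`, suggests that the returned loss is the average negative log-likelihood per token. Under that assumption, the relation between token probabilities and perplexity can be sketched in plain Python; `perplexity` is a hypothetical helper, not an LMDeploy function.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over the tokens."""
    loss = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(loss)


# A model that assigns probability 1/4 to every token is as uncertain as a
# uniform choice among four options, so its perplexity is about 4.
uniform = [math.log(1 / 4)] * 10
print(perplexity(uniform))

# More confident predictions give a lower perplexity.
confident = [math.log(0.9)] * 10
print(perplexity(confident))
```

Lower values mean the model finds the text less surprising, which is why perplexity on long inputs is a useful check after enabling extrapolation.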
--- a/docs/zh_cn/advance/pytorch_new_model.md
+++ b/docs/zh_cn/advance/pytorch_new_model.md
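+# Supporting new models in lmdeploy.pytorch
+lmdeploy.pytorch is designed to simplify adding new models and developing prototypes. New-model support relies on a patch mechanism that modifies and extends the original model, so that the original implementation is reused as much as possible and the workload is reduced.
+## Model support
+We use the llama implementation in transformers to walk through the model support process.
+Before starting, we need to understand the model inputs. The inputs of lmdeploy.pytorch differ slightly from those of a standard transformers model, mainly in the following ways:
+1. Because continuous batching is supported, the `input_ids` of a batch are concatenated into one long 1-D sequence, which is then `unsqueeze(0)`-ed so that the input has the same number of dimensions as in transformers. This kind of input does not affect the computation of modules such as MLP and RMSNorm.
+2. Because paged attention is supported, `past_key_value` no longer has its original size. It is a set of cache blocks of shape `[num_blocks, block_size, num_heads, head_dim]`, where `num_blocks` is the total number of blocks, determined by the available GPU memory, and `block_size` is a preset block size. This change affects the computation of LlamaModel and LlamaAttention, so the implementations of both modules have to be modified.
+3. Because of the input changes above, the model needs some extra inputs for inference, such as the start position and length of each sequence in the batch and the block table of the kv cache. These inputs are not in the parameter list of the module's forward, so we maintain a context to obtain them.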
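The first and third points can be pictured with plain Python lists. This is a minimal sketch, not the lmdeploy implementation: `flatten_batch` is a hypothetical helper, and the returned names merely mirror the `q_start_loc` and `q_seq_length` fields used later in this document.

```python
from typing import List, Tuple


def flatten_batch(
        batch: List[List[int]]) -> Tuple[List[List[int]], List[int], List[int]]:
    """Concatenate variable-length sequences into one row, with no padding."""
    flat: List[int] = []
    start_loc: List[int] = []
    seq_length: List[int] = []
    for seq in batch:
        start_loc.append(len(flat))   # where this sequence begins in the flat row
        seq_length.append(len(seq))   # how many tokens it contributes
        flat.extend(seq)
    # Wrap in an outer list to mimic unsqueeze(0): shape [1, total_tokens].
    return [flat], start_loc, seq_length


input_ids, q_start_loc, q_seq_length = flatten_batch([[1, 2, 3], [4, 5], [6]])
print(input_ids)     # [[1, 2, 3, 4, 5, 6]]
print(q_start_loc)   # [0, 3, 5]
print(q_seq_length)  # [3, 2, 1]
```

Token-wise modules can process the flat row directly, while attention needs the start locations and lengths to know where each sequence begins and ends.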
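+The input changes above affect LlamaModel and LlamaAttention. Let us first implement the new LlamaModel. It is a simplification of the original implementation: we removed many checks to avoid assertion failures caused by the changed inputs, and kept only the minimal code: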
+```python
+# lmdeploy/pytorch/models/llama.py
+class LlamaModel(nn.Module):
+    def forward(
+        self,
+        input_ids: torch.LongTensor = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[List[torch.FloatTensor]] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, BaseModelOutputWithPast]:
+        """Rewrite implementation of LlamaModel.forward."""
+        inputs_embeds = self.embed_tokens(input_ids)
+        hidden_states = inputs_embeds
+        # decoder layers
+        for idx, decoder_layer in enumerate(self.layers):
+            past_key_value = past_key_values[idx]
+            layer_outputs = decoder_layer(
+                hidden_states,
+                attention_mask=attention_mask,
+                position_ids=position_ids,
+                past_key_value=past_key_value,
+                output_attentions=output_attentions,
+                use_cache=use_cache,
+            )
+            hidden_states = layer_outputs[0]
+        hidden_states = self.norm(hidden_states)
+        return BaseModelOutputWithPast(
+            last_hidden_state=hidden_states,
+            past_key_values=past_key_values,
+            hidden_states=None,
+            attentions=None,
+        )
+```
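+Next comes the rewrite of the LlamaAttention module. It performs the following operations in order:
+1. qkv proj
+2. rotary embedding
+3. fill the kv cache
+4. MHA computation
+5. o proj
+The continuous batching and kv cache changes have a relatively large impact on this module.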
+```python
+# lmdeploy/pytorch/models/llama.py
+from lmdeploy.pytorch.kernels import apply_rotary_pos_emb, fill_kv_cache, paged_attention_fwd
+class LlamaAttention(nn.Module):
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_value: Optional[Tuple[torch.Tensor]] = None,
+        output_attentions: bool = False,
+        use_cache: bool = False,
+    ) -> Tuple[torch.Tensor, Optional[torch.Tensor],
+               Optional[Tuple[torch.Tensor]]]:
+        """Rewrite of LlamaAttention.forward."""
+        context = self.context.context
+        history_lengths = context.history_lengths
+        position_ids_1d = context.position_ids_1d
+        block_offsets = context.block_offsets
+        # qkv proj
+        query_states = q_proj(hidden_states)
+        key_states = k_proj(hidden_states)
+        value_states = v_proj(hidden_states)
+        query_states = query_states.view(-1, num_heads, head_dim)
+        key_states = key_states.view(-1, num_kv_heads, head_dim)
+        value_states = value_states.view(-1, num_kv_heads, head_dim)
+        # rotary embedding
+        max_seq_len = position_ids.size(-1)
+        kv_seq_len = max_seq_len + max(history_lengths)
+        if kv_seq_len >= self.rotary_emb.max_seq_len_cached:
+            cos, sin = self.rotary_emb(value_states,
+                                        seq_len=kv_seq_len + 128)
+        query_states, key_states = apply_rotary_pos_emb(
+            query_states,
+            key_states,
+            self.rotary_emb.cos_cached,
+            self.rotary_emb.sin_cached,
+            position_ids,
+            position_ids_1d,
+            q_embed=query_states,
+            k_embed=key_states)
+        # fill kv cache
+        kv_seq_length = context.kv_seq_length
+        q_seq_length = context.q_seq_length
+        q_start_loc = context.q_start_loc
+        fill_kv_cache(key_states,
+                      value_states,
+                      past_key_value[0],
+                      past_key_value[1],
+                      q_start_loc,
+                      q_seq_length,
+                      block_offsets=block_offsets,
+                      history_lengths=history_lengths,
+                      context=context)
+        # attention
+        attn_output = query_states
+        block_size = past_key_value[0].size(1)
+        paged_attention_fwd(
+            query_states,
+            past_key_value[0],
+            past_key_value[1],
+            attn_output,
+            block_offsets,
+            q_start_loc=q_start_loc,
+            q_seqlens=q_seq_length,
+            kv_seqlens=kv_seq_length,
+            max_seqlen=max_seq_len,
+        )
+        hidden_size = num_heads * head_dim
+        attn_output = attn_output.reshape(*hidden_states.shape[:-1], hidden_size)
+        # o proj
+        attn_output = o_proj(attn_output)
+        return attn_output, None, past_key_value
+```
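+A few things in the code above are worth noting. The first is the context object. We need parameters such as history_lengths and block_offsets to assist the computation, and they cannot be passed in through the model's forward function. We therefore maintain a context object that stores almost every input parameter that might be needed, so that it can be shared across modules. The context object is accessed through `self.context.context`; for its structure, see [context structure](#context-结构).
+The other is the custom kernels. Because the input format has changed, the original LlamaAttention implementation no longer applies. To guarantee inference speed and correctness, we implemented many custom triton kernels in lmdeploy.pytorch.kernels. The module above uses `apply_rotary_pos_emb`, `fill_kv_cache` and `paged_attention_fwd`, which implement the rotary embedding, the kv cache filling and the attention computation respectively.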
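To make the role of the block table concrete, here is a minimal pure-Python sketch of writing new entries into a paged cache. It illustrates only the addressing scheme; it is not the triton `fill_kv_cache` kernel, and `fill_paged_cache` with its arguments is hypothetical.

```python
from typing import List


def fill_paged_cache(cache: List[List[object]], block_table: List[int],
                     history_len: int, new_tokens: List[object]) -> None:
    """Write new_tokens into the paged cache, starting at position history_len."""
    block_size = len(cache[0])
    for i, tok in enumerate(new_tokens):
        pos = history_len + i                      # logical position in the sequence
        block_id = block_table[pos // block_size]  # physical block holding it
        cache[block_id][pos % block_size] = tok    # slot inside that block


# 4 physical blocks of 2 slots each; the sequence owns blocks 3 and 1, in that order.
cache = [[None] * 2 for _ in range(4)]
fill_paged_cache(cache, block_table=[3, 1], history_len=0,
                 new_tokens=['k0', 'k1', 'k2'])
print(cache)  # [[None, None], ['k2', None], [None, None], ['k0', 'k1']]
```

The blocks of one sequence need not be adjacent in memory, which is why attention has to read them through the block table as well.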
+With the two modules above in place, they still have to be registered in `lmdeploy/pytorch/models/module_map.py` to map the original modules to the patched modules:
+```python
+# lmdeploy/pytorch/models/module_map.py
+MODULE_MAP.update({
+    'transformers.models.llama.modeling_llama.LlamaAttention':
+    'lmdeploy.pytorch.models.llama.LlamaAttention',
+    'transformers.models.llama.modeling_llama.LlamaModel':
+    'lmdeploy.pytorch.models.llama.LlamaModel'
+})
+```
+Once registered, the Engine patches these two modules with the new implementations at startup and then carries out the subsequent deployment tasks.
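The idea of a qualname-to-qualname registry can be sketched with the standard library alone. This is an illustration under stated assumptions, not how lmdeploy resolves its map: `TOY_MODULE_MAP`, `resolve` and `find_rewrite` are hypothetical names, and standard-library classes stand in for model modules.

```python
import importlib
from collections import OrderedDict

# Hypothetical registry: original qualname -> replacement qualname.
TOY_MODULE_MAP = {
    'collections.OrderedDict': 'collections.Counter',
}


def resolve(qualname: str):
    """Import and return the object named by a dotted qualname."""
    module_name, _, attr = qualname.rpartition('.')
    return getattr(importlib.import_module(module_name), attr)


def find_rewrite(obj):
    """Return the replacement class registered for obj's type, or None."""
    cls = type(obj)
    key = f'{cls.__module__}.{cls.__qualname__}'
    target = TOY_MODULE_MAP.get(key)
    return resolve(target) if target is not None else None


print(find_rewrite(OrderedDict()))  # <class 'collections.Counter'>
print(find_rewrite([]))             # None
```

Keeping the registry as strings means the replacement is only imported when a matching module is actually found.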
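+## Tensor parallelism support
+To support tensor parallelism (TP), the model weights have to be partitioned. Let us add TP support to the Llama model integrated above.
+The modules in Llama that are involved in tensor parallelism are the qkvo proj in LlamaAttention and the gate, up and down proj in LlamaMLP. Of these, o_proj and down_proj are split row-wise, and the rest are split column-wise. We can implement the `_distribute_partition_fn` function in the corresponding modules: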
+```python
+# lmdeploy/pytorch/models/llama.py
+from ..dist_utils import (colwise_parallelize_linear_fn,
+                          rowwise_parallelize_linear_fn)
+class LlamaAttention(nn.Module):
+    @classmethod
+    def _distribute_partition_fn(cls, mod_name: str, mod: nn.Module,
+                                 device_mesh: DeviceMesh):
+        """Distribution partition callback."""
+        if mod_name in ['q_proj', 'k_proj', 'v_proj']:
+            colwise_parallelize_linear_fn(mod,
+                                          device_mesh=device_mesh,
+                                          to_local=True)
+        elif mod_name in ['o_proj']:
+            rowwise_parallelize_linear_fn(mod,
+                                          device_mesh=device_mesh,
+                                          to_local=True)
+class LlamaMLP(nn.Module):
+    @classmethod
+    def _distribute_partition_fn(cls, mod_name: str, mod: nn.Module,
+                                 device_mesh: DeviceMesh):
+        """Distribution partition callback."""
+        if mod_name in ['gate_proj', 'up_proj']:
+            colwise_parallelize_linear_fn(mod,
+                                          device_mesh=device_mesh,
+                                          to_local=True)
+        elif mod_name in ['down_proj']:
+            rowwise_parallelize_linear_fn(mod,
+                                          device_mesh=device_mesh,
+                                          to_local=True)
+```
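+`_distribute_partition_fn` is called while the model weights are being loaded, and the corresponding weights are distributed to the corresponding devices in the specified layout.
+With weights partitioned this way, the results of o_proj and down_proj have to go through an all_reduce to be correct. The all_reduce can be placed in the model's forward function, or, alternatively, a `_distribute_output_fn` function can be added: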
+```python
+# lmdeploy/pytorch/models/llama.py
+import torch.distributed as dist
+class LlamaAttention(nn.Module):
+    @classmethod
+    def _distribute_output_fn(cls, outputs, device_mesh: DeviceMesh):
+        """Distribution output hook."""
+        dist.all_reduce(outputs[0])
+        return outputs
+class LlamaMLP(nn.Module):
+    @classmethod
+    def _distribute_output_fn(cls, outputs, device_mesh: DeviceMesh):
+        """Distribution output hook."""
+        dist.all_reduce(outputs)
+        return outputs
+```
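Why a column-wise split followed by a row-wise split needs exactly one all_reduce can be checked with small matrices. The sketch below is pure Python and uses the `y = x @ W` convention with `W` of shape `[in, out]`, so "column-wise" means splitting the output features. All helper names are hypothetical, and the summation stands in for `dist.all_reduce`.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def split_cols(w: Matrix, parts: int) -> List[Matrix]:
    step = len(w[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def split_rows(w: Matrix, parts: int) -> List[Matrix]:
    step = len(w) // parts
    return [w[i * step:(i + 1) * step] for i in range(parts)]


def all_reduce_sum(partials: List[Matrix]) -> Matrix:
    """Element-wise sum of the per-rank partial results."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


x = [[1.0, 2.0]]                       # [1, 2] input
w_up = [[1.0, 0.0, 2.0, 1.0],
        [0.0, 1.0, 1.0, 3.0]]          # [2, 4], split by columns
w_down = [[1.0], [2.0], [3.0], [4.0]]  # [4, 1], split by rows

reference = matmul(matmul(x, w_up), w_down)
# Each "rank" holds one column shard of w_up and the matching row shard of w_down.
partials = [matmul(matmul(x, wu), wd)
            for wu, wd in zip(split_cols(w_up, 2), split_rows(w_down, 2))]
print(reference, all_reduce_sum(partials))  # [[45.0]] [[45.0]]
```

Each rank's partial result is incomplete on its own (5.0 and 40.0 here); only their sum matches the unpartitioned computation.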
+Finally, do not forget to register LlamaMLP in the module map as well:
+```python
+# lmdeploy/pytorch/models/module_map.py
+MODULE_MAP.update({
+    'transformers.models.llama.modeling_llama.LlamaMLP':
+    'lmdeploy.pytorch.models.llama.LlamaMLP'
+})
+```
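+This makes it possible to take advantage of multiple GPUs and deploy larger models.
+## Module debugging
+When the model output is not as expected, we want to debug a specific module to determine whether the rewrite is correct. `lmdeploy.pytorch` provides some tools to help with accuracy alignment. We again use the `LlamaAttention` module above as the example.
+First, we obtain an instance of the submodule to debug through the transformers API: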
+```python
+import torch
+from transformers import AutoModelForCausalLM
+# get module
+model_path = 'meta-llama/Llama-2-7b-chat-hf'
+dtype = torch.float16
+model = AutoModelForCausalLM.from_pretrained(model_path).to(torch.float16).cuda()
+self_attn = model.model.layers[0].self_attn
+```
+Then, the `ModuleIOExtractor` tool can be used to generate a set of inputs and outputs for this module:
+```python
+from lmdeploy.pytorch.tools.make_inputs import ModuleIOExtractor
+# extract module input/output
+input_ids = torch.tensor([[1, 2, 3, 4, 5]]).cuda()
+extractor = ModuleIOExtractor(model, self_attn)
+attn_args, attn_kwargs, attn_output = extractor.extract(input_ids)
+```
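+The inputs of the rewritten module differ slightly from those of the original module, mainly in three ways:
+1. The model needs some special inputs and outputs. They are passed in as a `StepContext`, which can be generated with `make_step_context`.
+2. Data such as `input_ids` and `hidden_states` are made continuous, which can be done with `continuous_tensor`.
+3. Because of paged caching, `past_key_value` has to be paged.
+For these reasons, the extracted inputs have to be processed: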
+```python
+from lmdeploy.pytorch.tools.make_inputs import make_step_context
+from lmdeploy.pytorch.tools.layout_convert import continuous_tensor
+# create patched input/output
+context = make_step_context(input_ids,
+                            kv_cache_dtype=dtype,
+                            num_key_value_heads=32)
+seq_length = context.q_seq_length
+attn_kwargs['hidden_states'] = continuous_tensor(
+    attn_kwargs['hidden_states'],
+    seq_length)
+attn_kwargs['past_key_value'] = context.kv_caches[0]
+```
+The rewrite can then be applied and the results compared for correctness. (Note that the outputs must also be made continuous before comparison.)
+```python
+from lmdeploy.pytorch.models import patch
+# patch and test
+patched_self_attn = patch(self_attn, extra_args=['context'])
+with torch.inference_mode():
+    patched_output = patched_self_attn.patched_forward(*attn_args,
+                                                       **attn_kwargs,
+                                                       context=context)
+torch.testing.assert_close(patched_output[0],
+                            continuous_tensor(attn_output[0], seq_length))
+```
+The rewritten module can be debugged this way until the accuracy meets expectations.
+## Appendix
+### context 结构
+```python
+@dataclass
+class StepContext:
+    """context of Model.
+    """
+    inputs: ModelInputs
+    block_offsets: torch.LongTensor
+    position_ids: torch.LongTensor
+    position_ids_1d: torch.LongTensor
+    q_start_loc: torch.LongTensor
+    history_lengths: torch.LongTensor
+    seq_length: torch.LongTensor
+    max_seq_length: int
+    kv_seq_length: torch.LongTensor
+    kv_caches: List
+    is_decoding: bool
+    world_size: int = 1
+    json_config: Dict = None
+    local_adapter_ids: torch.LongTensor = None
+    global_adapter_ids: torch.LongTensor = None
+    adapter_offsets: torch.LongTensor = None
+    max_rank: int = 0
+```
+### FAQ
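+- **How do I access the module before patching?**
+Sometimes we only want to add hook code before and after a function rather than copy a large function body. The module before patching can be accessed through `self.origin_mod`.
+- **How do I register a model that is not an official transformers model?**
+The implementation of some models may be added as remote code, and such modules cannot be located by their full qualname. lmdeploy.pytorch supports registering them with an abbreviated module name: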
+```python
+MODULE_MAP.update({
+    'modeling_internlm.InternLMAttention':
+    'lmdeploy.pytorch.models.internlm.PatchedInternLMAttention',
+})
+```
+> \[!NOTE\]
+>
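+> Abbreviated names have lower priority. Where possible, registering with the full qualname is still encouraged.
+- **What if modules share a name but have different implementations?**
+The currently recommended approach is to map the same name to a single implementation, and then decide inside that implementation which variant to use based on the module's intrinsic parameters. Taking baichuan2 7b/13b as an example: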
+```python
+class BaichuanModel(nn.Module):
+    def forward(self, ...):
+        if self.config.num_hidden_layers == 32:
+            return forward_7b(...)
+        else:
+            return forward_default(...)
+```
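+- **What if I want to initialize a module before inference?**
+Implement the module's `_update_model_fn` function. It is called after all of the module's weights have been loaded and the TP weight partitioning is complete: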
+```python
+class LlamaAttention:
+    def _update_model_fn(self):
+        # ADD YOUR CODE HERE
+        pass
+```
--- a/docs/zh_cn/api/pipeline.rst
+++ b/docs/zh_cn/api/pipeline.rst
+Inference pipeline
+==================
+.. currentmodule:: lmdeploy
+pipeline
+--------
+.. autofunction:: pipeline
+serving
+--------
+.. autofunction:: serve
+.. autofunction:: client
+PytorchEngineConfig
+-------------------
+.. autoclass:: PytorchEngineConfig
+TurbomindEngineConfig
+---------------------
+.. autoclass:: TurbomindEngineConfig
+GenerationConfig
+----------------
+.. autoclass:: GenerationConfig
+ChatTemplateConfig
+------------------
+.. autoclass:: ChatTemplateConfig
--- a/docs/zh_cn/benchmark/evaluate_with_opencompass.md
+++ b/docs/zh_cn/benchmark/evaluate_with_opencompass.md
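+# How to evaluate LLMs with OpenCompass
+LMDeploy provides the TurboMind inference engine to accelerate LLM inference, and its inference accuracy can be evaluated with OpenCompass.
+## Preparation
+This section sets up the evaluation environment.
+### Install lmdeploy
+Install LMDeploy with pip (python 3.8+), or [install from source](../build.md)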
+```shell
+pip install lmdeploy
+```
+### Install OpenCompass
+Run the following script to install OpenCompass from source. For other installation methods, see [installation](https://opencompass.readthedocs.io/en/latest/get_started/installation.html).
+```shell
+git clone https://github.com/open-compass/opencompass.git
+cd opencompass
+pip install -e .
+```
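+For a quick overview of basic OpenCompass operations, see the [Quick Start](https://opencompass.readthedocs.io/en/latest/get_started/quick_start.html#)
+### Download datasets
+OpenCompass provides several versions of its datasets. Here we download the following version: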
+```shell
+# switch to the OpenCompass root directory
+cd opencompass
+wget https://github.com/open-compass/opencompass/releases/download/0.1.8.rc1/OpenCompassData-core-20231110.zip
+unzip OpenCompassData-core-20231110.zip
+```
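+## Prepare the evaluation config
+OpenCompass uses OpenMMLab-style config files to manage models and datasets, so an evaluation can be started quickly by adding a simple config. OpenCompass already supports evaluating LLMs accelerated by the TurboMind inference engine through the python API.
+### Dataset config
+In the OpenCompass root directory, prepare the evaluation config file `$OPENCOMPASS_DIR/configs/eval_lmdeploy.py`.
+At the top of the config file, import the following OpenCompass-supported `datasets` and the `summarizer` that formats the evaluation results.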
+```python
+from mmengine.config import read_base
+with read_base():
+    # choose a list of datasets
+    from .datasets.mmlu.mmlu_gen_a484b3 import mmlu_datasets
+    from .datasets.ceval.ceval_gen_5f30c7 import ceval_datasets
+    from .datasets.SuperGLUE_WiC.SuperGLUE_WiC_gen_d06864 import WiC_datasets
+    from .datasets.SuperGLUE_WSC.SuperGLUE_WSC_gen_7902a7 import WSC_datasets
+    from .datasets.triviaqa.triviaqa_gen_2121ce import triviaqa_datasets
+    from .datasets.gsm8k.gsm8k_gen_1d7fe4 import gsm8k_datasets
+    from .datasets.race.race_gen_69ee4f import race_datasets
+    from .datasets.crowspairs.crowspairs_gen_381af0 import crowspairs_datasets
+    # and output the results in a chosen format
+    from .summarizers.medium import summarizer
+datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
+```
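The last line of the config gathers every imported `*_datasets` list into a single flat list by summing lists onto an empty list. A minimal sketch with hypothetical stand-in data (the dictionary below replaces `locals()` and is not real OpenCompass content):

```python
# Hypothetical stand-ins for the imported *_datasets lists.
namespace = {
    'mmlu_datasets': [{'abbr': 'mmlu_a'}, {'abbr': 'mmlu_b'}],
    'gsm8k_datasets': [{'abbr': 'gsm8k'}],
    'summarizer': {'type': 'medium'},  # skipped: name does not end with '_datasets'
}

# sum(iterable_of_lists, []) concatenates the lists in iteration order.
datasets = sum((v for k, v in namespace.items() if k.endswith('_datasets')), [])
print([d['abbr'] for d in datasets])  # ['mmlu_a', 'mmlu_b', 'gsm8k']
```

Adding or removing a dataset therefore only requires changing the import lines.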
+### Model config
+This section shows how to add a model config to the evaluation config file. Here are a few examples:
+`````{tabs}
+````{tab} internlm-20b
+```python
+from opencompass.models.turbomind import TurboMindModel
+internlm_20b = dict(
+        type=TurboMindModel,
+        abbr='internlm-20b-turbomind',
+        path="internlm/internlm-20b",  # this path should be same as in huggingface
+        engine_config=dict(session_len=2048,
+                           max_batch_size=8,
+                           rope_scaling_factor=1.0),
+        gen_config=dict(top_k=1, top_p=0.8,
+                        temperature=1.0,
+                        max_new_tokens=100),
+        max_out_len=100,
+        max_seq_len=2048,
+        batch_size=8,
+        concurrency=8,
+        run_cfg=dict(num_gpus=1, num_procs=1),
+    )
+models = [internlm_20b]
+```
+````
+````{tab} internlm-chat-20b
+For chat models, `meta_template` has to be specified in the config file, and it must match the settings used in training. See [meta_template](https://opencompass.readthedocs.io/en/latest/prompt/meta_template.html) for an introduction.
+```python
+from opencompass.models.turbomind import TurboMindModel
+internlm_meta_template = dict(round=[
+    dict(role='HUMAN', begin='<|User|>:', end='\n'),
+    dict(role='BOT', begin='<|Bot|>:', end='<eoa>\n', generate=True),
+],
+                              eos_token_id=103028)
+internlm_chat_20b = dict(
+    type=TurboMindModel,
+    abbr='internlm-chat-20b-turbomind',
+    path='internlm/internlm-chat-20b',
+    engine_config=dict(session_len=2048,
+                       max_batch_size=8,
+                       rope_scaling_factor=1.0),
+    gen_config=dict(top_k=1,
+                    top_p=0.8,
+                    temperature=1.0,
+                    max_new_tokens=100),
+    max_out_len=100,
+    max_seq_len=2048,
+    batch_size=8,
+    concurrency=8,
+    meta_template=internlm_meta_template,
+    run_cfg=dict(num_gpus=1, num_procs=1),
+    end_str='<eoa>'
+)
+models = [internlm_chat_20b]
+```
+````
+`````
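+**Note**
+- To pass more parameters through the `engine_config` and `gen_config` fields of the evaluation config file, see [TurbomindEngineConfig](https://lmdeploy.readthedocs.io/zh-cn/latest/inference/pipeline.html#turbomindengineconfig) and [EngineGenerationConfig](https://lmdeploy.readthedocs.io/zh-cn/latest/inference/pipeline.html#generationconfig)
+## Run the evaluation
+Once the evaluation config file is written, run the `run.py` script in the OpenCompass root directory and specify a working directory to start the evaluation.
+For more parameters of the evaluation script, see [running evaluations](https://opencompass.readthedocs.io/zh-cn/latest/user_guides/experimentation.html#id1)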
+```shell
+# in the root directory of opencompass
+python3 run.py configs/eval_lmdeploy.py --work-dir ./workdir
+```
--- a/docs/zh_cn/get_started.md
+++ b/docs/zh_cn/get_started.md
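+# Quick start
+LMDeploy offers quick installation, model quantization, offline batch inference, online inference serving and more. Each feature takes only a few lines of code or a few commands.
+## Installation
+Install LMDeploy with pip (python 3.8+), or [install from source](./build.md)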
+```shell
+pip install lmdeploy
+```
+The prebuilt LMDeploy package is compiled against CUDA 11.8 by default. To install LMDeploy under CUDA 12+, run the following commands:
+```shell
+export LMDEPLOY_VERSION=0.2.0
+export PYTHON_VERSION=38
+pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl
+```
+## Offline batch inference
+```python
+import lmdeploy
+pipe = lmdeploy.pipeline("internlm/internlm-chat-7b")
+response = pipe(["Hi, pls intro yourself", "Shanghai is"])
+print(response)
+```
+For detailed pipeline usage, see [here](./inference/pipeline.md)
+## Inference serving
+LMDeploy provides several ways to deploy a model inference service:
+- [Serve an OpenAI-compatible API](https://lmdeploy.readthedocs.io/zh-cn/latest/serving/api_server.html)
+- [Serve with docker](https://lmdeploy.readthedocs.io/zh-cn/latest/serving/api_server.html#docker)
+- [Serve with gradio](https://lmdeploy.readthedocs.io/zh-cn/latest/serving/gradio.html)
+## Model quantization
+- [INT4 weight quantization](quantization/w4a16.md)
+- [K/V quantization](quantization/kv_int8.md)
+- [W8A8 quantization](quantization/w8a8.md)
+## Useful tools
+The LMDeploy CLI provides the following convenient tools for quickly trying out a model in conversation.
+### Interactive chat in the console
+```shell
+lmdeploy chat turbomind internlm/internlm-chat-7b
+```
+### Interactive chat in a WebUI
+LMDeploy provides an online chat demo built with gradio.
+```shell
+# install dependencies
+pip install lmdeploy[serve]
+# launch
+lmdeploy serve gradio internlm/internlm-chat-7b
+```
+![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)
--- a/docs/zh_cn/inference/load_hf.md
+++ b/docs/zh_cn/inference/load_hf.md
+# Loading huggingface models directly
+Starting from v0.1.0, TurboMind can read weights in Huggingface format directly.
+## Supported model types
+Currently, TurboMind supports loading three types of models:
+1. Models on huggingface.co quantized by lmdeploy, such as [llama2-70b-4bit](https://huggingface.co/lmdeploy/llama2-chat-70b-4bit), [internlm-chat-20b-4bit](https://huggingface.co/internlm/internlm-chat-20b-4bit)
+2. Other LM models on huggingface.co, such as Qwen/Qwen-7B-Chat
+3. Models converted with the `lmdeploy convert` command (compatible with the old format)
+## Usage
+### 1) Models quantized by lmdeploy
+TurboMind can directly load models quantized by `lmdeploy.lite`, such as [llama2-70b-4bit](https://huggingface.co/lmdeploy/llama2-chat-70b-4bit), [internlm-chat-20b-4bit](https://huggingface.co/internlm/internlm-chat-20b-4bit).
+```shell
+repo_id=internlm/internlm-chat-20b-4bit
+model_name=internlm-chat-20b
+# or
+# repo_id=/path/to/downloaded_model
+# Inference by TurboMind
+lmdeploy chat turbomind $repo_id --model-name $model_name
+# Serving with gradio
+lmdeploy serve gradio $repo_id --model-name $model_name
+# Serving with Restful API
+lmdeploy serve api_server $repo_id --model-name $model_name --tp 1
+```
+### 2) Other LM models
+Other LM models include, for example, Qwen/Qwen-7B-Chat and baichuan-inc/Baichuan2-7B-Chat. The models supported by LMDeploy can be listed with `lmdeploy list`.
+```shell
+repo_id=Qwen/Qwen-7B-Chat
+model_name=qwen-7b
+# or
+# repo_id=/path/to/Qwen-7B-Chat/local_path
+# Inference by TurboMind
+lmdeploy chat turbomind $repo_id --model-name $model_name
+# Serving with gradio
+lmdeploy serve gradio $repo_id --model-name $model_name
+# Serving with Restful API
+lmdeploy serve api_server $repo_id --model-name $model_name --tp 1
+```
+### 3) Models converted with the `lmdeploy convert` command
+Usage is the same as before:
+```shell
+# Convert a model
+lmdeploy convert $MODEL_NAME /path/to/model --dst-path ./workspace
+# Inference by TurboMind
+lmdeploy chat turbomind ./workspace
+# Serving with gradio
+lmdeploy serve gradio ./workspace
+# Serving with Restful API
+lmdeploy serve api_server ./workspace --tp 1
+```
--- a/docs/zh_cn/inference/pipeline.md
+++ b/docs/zh_cn/inference/pipeline.md
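+# LLM offline inference pipeline
+This document shows the basic usage of pipeline through a few examples.
+For the detailed pipeline API reference, see [here](https://lmdeploy.readthedocs.io/zh-cn/latest/api/pipeline.html)
+## Usage
+- **An example with default parameters:**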
+```python
+from lmdeploy import pipeline
+pipe = pipeline('internlm/internlm2-chat-7b')
+response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
+print(response)
+```
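+In this example, the pipeline by default reserves a certain proportion of GPU memory for the k/v generated during inference. The proportion is controlled by the `TurbomindEngineConfig.cache_max_entry_count` parameter.
+The strategy for setting the k/v cache proportion changed during LMDeploy's development. The changes are as follows:
+1. `v0.2.0 <= lmdeploy <= v0.2.1`
+   The default ratio is 0.5, meaning 50% of **total GPU memory** is allocated to the k/v cache. For a 7B model, OOM occurs if GPU memory is less than 40G. When OOM occurs, lower the k/v cache ratio as appropriate, as shown below: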
+   ```python
+   from lmdeploy import pipeline, TurbomindEngineConfig
+   # lower the k/v cache share to 20% of total GPU memory
+   backend_config = TurbomindEngineConfig(cache_max_entry_count=0.2)
+   pipe = pipeline('internlm/internlm2-chat-7b',
+                   backend_config=backend_config)
+   response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
+   print(response)
+   ```
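+2. `lmdeploy > v0.2.1`
+   The allocation strategy changed to reserving space for the k/v cache as a proportion of **free GPU memory**. The default ratio became 0.8. If OOM occurs, reduce the ratio as appropriate, in the same way as above, to lower the memory used by the k/v cache.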
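The difference between the two policies is simple arithmetic. The sketch below uses a hypothetical helper and hypothetical memory figures; it only multiplies the ratio by the relevant base and does not query a real GPU.

```python
def kv_cache_budget_gib(total_gib: float, free_gib: float, ratio: float,
                        based_on_free: bool) -> float:
    """Return the memory reserved for the k/v cache under either policy."""
    base = free_gib if based_on_free else total_gib
    return base * ratio


# Hypothetical 24 GiB GPU with 10 GiB still free after the weights are loaded.
old = kv_cache_budget_gib(24, 10, ratio=0.5, based_on_free=False)  # v0.2.0 to v0.2.1
new = kv_cache_budget_gib(24, 10, ratio=0.8, based_on_free=True)   # later versions
print(old, new)
```

With these figures the older policy asks for 12 GiB while only 10 GiB is free, which is the OOM situation described above; the newer policy asks for 8 GiB and always stays within the free memory.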
+- **How to set tp:**
+```python
+from lmdeploy import pipeline, TurbomindEngineConfig
+backend_config = TurbomindEngineConfig(tp=2)
+pipe = pipeline('internlm/internlm2-chat-7b',
+                backend_config=backend_config)
+response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
+print(response)
+```
+- **How to set sampling parameters:**
+```python
+from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig
+backend_config = TurbomindEngineConfig(tp=2)
+gen_config = GenerationConfig(top_p=0.8,
+                              top_k=40,
+                              temperature=0.8,
+                              max_new_tokens=1024)
+pipe = pipeline('internlm/internlm2-chat-7b',
+                backend_config=backend_config)
+response = pipe(['Hi, pls intro yourself', 'Shanghai is'],
+                gen_config=gen_config)
+print(response)
+```
+- **How to use OpenAI-format input:**
+```python
+from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig
+backend_config = TurbomindEngineConfig(tp=2)
+gen_config = GenerationConfig(top_p=0.8,
+                              top_k=40,
+                              temperature=0.8,
+                              max_new_tokens=1024)
+pipe = pipeline('internlm/internlm2-chat-7b',
+                backend_config=backend_config)
+prompts = [[{
+    'role': 'user',
+    'content': 'Hi, pls intro yourself'
+}], [{
+    'role': 'user',
+    'content': 'Shanghai is'
+}]]
+response = pipe(prompts,
+                gen_config=gen_config)
+print(response)
+```
+- **Streaming the results:**
+```python
+from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig
+backend_config = TurbomindEngineConfig(tp=2)
+gen_config = GenerationConfig(top_p=0.8,
+                              top_k=40,
+                              temperature=0.8,
+                              max_new_tokens=1024)
+pipe = pipeline('internlm/internlm2-chat-7b',
+                backend_config=backend_config)
+prompts = [[{
+    'role': 'user',
+    'content': 'Hi, pls intro yourself'
+}], [{
+    'role': 'user',
+    'content': 'Shanghai is'
+}]]
+for item in pipe.stream_infer(prompts, gen_config=gen_config):
+    print(item)
+```
+- **Using the pytorch backend**
+triton has to be installed first:
+```shell
+pip install "triton>=2.1.0"
+```
+```python
+from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig
+backend_config = PytorchEngineConfig(session_len=2048)
+gen_config = GenerationConfig(top_p=0.8,
+                              top_k=40,
+                              temperature=0.8,
+                              max_new_tokens=1024)
+pipe = pipeline('internlm/internlm2-chat-7b',
+                backend_config=backend_config)
+prompts = [[{
+    'role': 'user',
+    'content': 'Hi, pls intro yourself'
+}], [{
+    'role': 'user',
+    'content': 'Shanghai is'
+}]]
+response = pipe(prompts, gen_config=gen_config)
+print(response)
+```
+## FAQs
+- **RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase**.
+  If you hit this error when using tp>1 with the pytorch backend, make sure the python script has the following entry point:
+  ```python
+  if __name__ == '__main__':
+  ```
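+  Generally, in a multi-threaded or multi-process context, initialization code may need to run only once. `if __name__ == '__main__':` ensures that such initialization code runs only in the main program, instead of being repeated in every newly created process or thread.
+- For custom chat templates, see [chat_template.md](../advance/chat_template.md)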
--- a/docs/zh_cn/inference/pytorch.md
+++ b/docs/zh_cn/inference/pytorch.md
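+# lmdeploy.pytorch architecture
+`lmdeploy.pytorch` is one of the inference backends provided by LMDeploy. Compared with turbomind, which focuses on performance, lmdeploy.pytorch trades a small amount of performance for an LLM inference implementation that is easier to develop and extend.
+## Design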
+![pytorch arch](https://github.com/grimoire/lmdeploy/blob/media/lmdeploy_pytorch_arch.png?raw=true)
+## API
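+lmdeploy.pytorch can share the same serving interfaces as turbomind. These interfaces interact with lmdeploy.pytorch through Engine and EngineInstance.
+EngineInstance initiates inference requests. It packs a request into a specific format and sends it to the Engine, which is how streaming inference is implemented. The inference interface of EngineInstance is thread-safe: callers can start their own EngineInstance in different threads, and the Engine automatically batches requests according to the current resources and pending requests.
+Engine receives and executes inference requests. It contains the following components to do so:
+- ModelAgent is responsible for model loading, cache management and tensor parallelism management.
+- Scheduler is responsible for session management and for allocating the resources needed by sequences and lora adapters.
+- RequestManager is responsible for sending and receiving requests, and is how the Engine interacts with EngineInstance.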
+## Engine
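+To handle asynchronous inference requests, the Engine maintains a thread after startup that loops over the following operations:
+1. Read requests through the RequestManager and handle them by type.
+2. The Scheduler decides which requests can be processed, along with the caches and adapters they need.
+3. The ModelAgent allocates resources for the inputs based on the information from step 2, then runs inference with the patched model.
+4. The Scheduler updates the request states according to the inference results.
+5. The RequestManager returns the outputs to the sender (EngineInstance), and the loop goes back to step 1.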
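The loop can be pictured with a toy, single-threaded stand-in. This is only a sketch of the control flow under simplifying assumptions: `ToyEngine` and its fields are hypothetical, every step emits one placeholder token per running sequence, and no real model, cache or thread is involved.

```python
from collections import deque


class ToyEngine:
    """Toy loop: receive -> schedule -> forward -> update -> respond."""

    def __init__(self, max_batch_size: int = 2):
        self.inbox = deque()   # incoming side of the request manager
        self.running = []      # sequences admitted by the scheduler
        self.outbox = []       # outgoing side of the request manager
        self.max_batch_size = max_batch_size

    def add_request(self, session_id: int, num_tokens: int) -> None:
        self.inbox.append({'id': session_id, 'remaining': num_tokens,
                           'output': []})

    def step(self) -> None:
        # Steps 1-2: read requests and admit as many as the batch can hold.
        while self.inbox and len(self.running) < self.max_batch_size:
            self.running.append(self.inbox.popleft())
        # Step 3: "forward" emits one placeholder token per running sequence.
        for seq in self.running:
            seq['output'].append(f"tok{len(seq['output'])}")
            seq['remaining'] -= 1
        # Steps 4-5: update states and hand finished sequences back.
        self.outbox.extend(s for s in self.running if s['remaining'] == 0)
        self.running = [s for s in self.running if s['remaining'] > 0]


engine = ToyEngine()
for sid, n in [(0, 1), (1, 4), (2, 2)]:
    engine.add_request(sid, n)
while engine.inbox or engine.running:
    engine.step()
print([s['id'] for s in engine.outbox])  # [0, 2, 1]
```

Request 2 joins as soon as request 0 finishes and completes before the longer request 1, which is the effect of batching at every step rather than per request.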
+Below we introduce several important components used in the steps above.
+### Scheduler
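+During LLM inference, the historical attention keys and values are usually cached to avoid recomputing them in later steps. If multi-batch inference is performed in this setting, the kv has to be heavily padded because sequences differ in length, which wastes a lot of GPU memory and caps the model's concurrent inference capacity.
+[vLLM](https://docs.vllm.ai) proposed a paging strategy that allocates cache for keys and values in units of page blocks, which avoids the memory wasted by padding. The Scheduler in lmdeploy.pytorch follows the same design: it allocates the required resources according to request length and evicts resources that are temporarily unused, so that storage is used efficiently.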
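The saving from paging can be estimated by counting token slots. The sketch below is a back-of-the-envelope comparison with hypothetical sequence lengths and a hypothetical block size of 64; it counts slots only and ignores heads, layers and data types.

```python
import math
from typing import List


def padded_slots(seq_lens: List[int]) -> int:
    """Token slots when every sequence is padded to the longest one."""
    return max(seq_lens) * len(seq_lens)


def paged_slots(seq_lens: List[int], block_size: int) -> int:
    """Token slots when each sequence gets just enough fixed-size blocks."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


seq_lens = [1000, 120, 30]         # hypothetical sequence lengths in one batch
print(padded_slots(seq_lens))      # 3000
print(paged_slots(seq_lens, 64))   # 1216
```

With paging, the waste per sequence is bounded by one partially filled block instead of growing with the longest sequence in the batch.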
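+lmdeploy.pytorch also supports [S-LoRA](https://github.com/S-LoRA/S-LoRA), a scheme for serving multiple adapters on a single model. LoRA inference usually merges the adapter into the model weights, so using several adapters at once sharply increases GPU memory usage. S-LoRA does not merge adapters; it uses unified paging to swap in the required adapters dynamically at inference time, which greatly reduces the memory cost of using adapters. The Scheduler implements the related functionality so that users can use their own adapters more easily.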
+### ModelAgent
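+lmdeploy.pytorch supports Tensor Parallelism (TP). Different TP settings affect model construction, weight processing and cache allocation. ModelAgent encapsulates these details so that the Engine does not have to deal with them.
+ModelAgent has two important components:
+1. patched_model is the updated transformer model. It adds support for various features, including higher-performance submodule implementations, TP, quantization and so on.
+2. cache_engine is the cache allocation and swapping module. It receives swap requests from the scheduler and performs host-device memory swapping, adapter loading and related work.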
+## Patching
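+To lower the barrier to integrating models, we implemented a simple patch mechanism that makes it easy to replace implementations.
+Taking LlamaAttention.forward of the Llama model as an example, we can write a new forward implementation: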
+```python
+class CustomLlamaAttention(nn.Module):
+    def forward(self, ...):
+        # custom forward
+```
+Then register the module mapping in `lmdeploy.pytorch.models.module_map`:
+```python
+MODULE_MAP.update({
+'transformers.models.llama.modeling_llama.LlamaAttention':
+'qualname.to.CustomLlamaAttention'})
+```
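+The patched model then uses the new forward implementation. Features such as TP and quantization also rely on the patch mechanism. See [Supporting new models in lmdeploy.pytorch](../advance/pytorch_new_model.md) for more details.
+## Features
+- **Continuous Batching**: Because input sequences differ in length, batching usually requires padding the inputs. Padding increases the amount of computation in later operations, which hurts speed, and also greatly increases GPU memory usage. Following the approach of many other mature frameworks, lmdeploy.pytorch uses continuous batching to lay the inputs out contiguously and avoid the extra resource usage.
+- **Tensor Parallelism**: A large model may need far more memory than a single GPU has. To support inference for such models, we implemented tensor parallelism: the model weights are distributed across devices and each GPU handles part of the computation, which reduces per-GPU memory usage and makes full use of the compute of multiple GPUs.
+- **S-LoRA**: LoRA adapters help fine-tune large models with limited GPU memory, and S-LoRA makes it possible to use multiple LoRA weights at once within limited GPU memory, extending the model's capabilities.
+- **Quantization**: Quantization further reduces GPU memory usage and improves inference performance. lmdeploy.pytorch adds support for w8a8 model quantization; see [w8a8](../quantization/w8a8.md) for details.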
--- a/docs/zh_cn/inference/turbomind.md
+++ b/docs/zh_cn/inference/turbomind.md
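+# TurboMind framework
+TurboMind is an efficient inference engine for LLMs, built on NVIDIA's [FasterTransformer](https://github.com/NVIDIA/FasterTransformer). Its main features include support for LLaMa-architecture models, a persistent batch inference mode, and an extensible KV cache manager.
+## TurboMind architecture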
+```
+  +--------------------+
+  |        API         |
+  +--------------------+
+          |    ^
+  request |    | stream callback
+          v    |
+  +--------------------+   fetch   +-------------------+
+  |  Persistent Batch  | <-------> |  KV Cache Manager |
+  +--------------------+   update  +-------------------+
+             ^
+             |
+             v
++------------------------+
+|  LLaMa inference impl  |
++------------------------+
+| FT kernels & utilities |
++------------------------+
+```
+## Persistent Batch
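+You may have seen this mechanism under another name in other projects: `continuous batching`. While developing this feature, we modeled the inference of a conversational LLM as a continuously running batch whose lifetime spans the entire serving process, hence the name `persistent batch`. In brief, it works as follows:
+- N batch slots are prepared in advance.
+- When slots are free, requests join the batch. Once all tokens of a request have been generated, its batch slot is released immediately and can accept a new request.
+- **When a sequence hits the cache (see below), its history tokens do not have to be decoded in every round, so its token generation starts immediately**.
+- The whole batch grows and shrinks automatically to avoid unnecessary computation.
+## KV cache manager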
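+TurboMind's [KV cache manager](https://github.com/InternLM/lmdeploy/blob/main/src/turbomind/models/llama/SequenceManager.h) is a memory-pool-like object with LRU built in, so the whole manager can be viewed as a **cache of KV caches**. It works roughly as follows:
+- KV caches are allocated by the manager. The manager reserves space according to the preconfigured number of slots. Each slot corresponds to the KV cache needed by one sequence. The size of the allocated memory chunks is configurable, allowing pre-allocation, on-demand allocation, or something in between.
+- When a new request arrives but the pool has no free slot, the manager evicts the least recently used sequence according to LRU and gives its slot to the new request.
+- A sequence obtaining a slot is similar to a cache hit: its historical KV in the cache is returned directly, with no need for context decoding.
+- Evicted sequences are not deleted entirely. They are converted into their most compact form, such as token IDs. When the same sequence id is requested later (a _cache-miss_), these token IDs are decoded by the FMHA-backed context decoder and turned back into KV cache.
+- Eviction and conversion are handled automatically inside TurboMind, so they are transparent to users. __From the user's perspective, a system using TurboMind appears to have access to unlimited device memory__.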
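The eviction and rebuild behaviour can be sketched with an `OrderedDict`. This is a toy model under stated assumptions, not the TurboMind `SequenceManager`: `ToySequenceCache` is hypothetical, and the "KV" is a list of placeholder tuples rebuilt from token ids in place of real context decoding.

```python
from collections import OrderedDict


class ToySequenceCache:
    """LRU pool of KV slots; evicted sequences keep only their token ids."""

    def __init__(self, num_slots: int):
        self.num_slots = num_slots
        self.slots = OrderedDict()  # seq_id -> cached "KV", least recent first
        self.evicted = {}           # seq_id -> token ids kept after eviction

    def acquire(self, seq_id: int, token_ids=()):
        """Return (kv, hit). On a miss the KV is rebuilt from stored token ids."""
        if seq_id in self.slots:
            self.slots.move_to_end(seq_id)  # mark as most recently used
            return self.slots[seq_id], True
        if len(self.slots) >= self.num_slots:
            old_id, old_kv = self.slots.popitem(last=False)  # evict the LRU entry
            self.evicted[old_id] = [tok for tok, _ in old_kv]
        history = self.evicted.pop(seq_id, []) + list(token_ids)
        kv = [(tok, f'kv({tok})') for tok in history]  # stand-in for decoding
        self.slots[seq_id] = kv
        return kv, False


cache = ToySequenceCache(num_slots=2)
cache.acquire(1, [10, 11])
cache.acquire(2, [20])
cache.acquire(3, [30])        # pool is full: sequence 1 is evicted
kv, hit = cache.acquire(1)    # miss: rebuilt from the saved token ids
print(hit, [tok for tok, _ in kv])  # False [10, 11]
```

The caller sees the same history either way; a miss only costs the extra work of rebuilding the KV entries.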
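+## LLaMa implementation in TurboMind
+Our implementation of the LLaMa family is modified from the GPT-NeoX model in FasterTransformer. Besides basic refactoring and modifications for the LLaMa family, we made several improvements to enable high-performance inference of conversational models, the most important of which are:
+- Fast text decoding in multi-round conversations. We replaced the attention implementation in the context decoder with a [cutlass](https://github.com/NVIDIA/cutlass)-based FMHA implementation, which supports mismatched Q/K lengths.
+- We added indirect buffer pointers to both the context FMHA and the generation FMHA to support non-contiguous KV caches within a batch.
+- To support concurrent inference with persistent batch, we designed a new synchronization mechanism to coordinate worker threads in tensor parallel mode.
+- We implemented INT8 KV cache, which lowers memory overhead and increases batch size and system throughput. This is very useful in practice, because the KV cache consumes more memory and memory bandwidth than the weights and other activations.
+- We resolved the NCCL hang that occurs when multiple model instances run in TP mode within a single process. NCCL APIs are now guarded by host-side synchronization barriers.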
+## API
+TurboMind's Python API supports streaming output and tensor parallel mode.
+TurboMind also inherits FasterTransformer's ability to register as a [Triton Inference Server](https://github.com/triton-inference-server/server) backend. However, to support concurrent requests in persistent batch, we no longer use sequence batching or dynamic batching as FasterTransformer does. Instead, TurboMind records and manages the state of request sequences itself.
+## Differences between TurboMind and FasterTransformer
+Apart from the features described above, TurboMind differs from FasterTransformer in quite a few other ways. For example, many FasterTransformer features were removed in TurboMind, including prefix prompts, beam search, context embedding, sparse GEMM operations, and support for models with other architectures such as GPT or T5.
+## FAQ
+### Support for Huggingface models
+For historical reasons, TurboMind's weight layout is based on the [official LLaMa implementation](https://github.com/facebookresearch/llama), and the two differ only by a transpose. The Huggingface implementation, however, uses [another layout](https://github.com/huggingface/transformers/blob/45025d92f815675e483f32812caa28cce3a960e7/src/transformers/models/llama/convert_llama_weights_to_hf.py#L123C76-L123C76). We handle the differences between the two layouts in `W_q` and `W_k` in [deploy.py](https://github.com/InternLM/lmdeploy/blob/ff4648a1d09e5aec74cf70efef35bfaeeac552e0/lmdeploy/serve/turbomind/deploy.py#L398); see there for details.
--- a/docs/zh_cn/inference/turbomind_config.md
+++ b/docs/zh_cn/inference/turbomind_config.md
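+# TurboMind configuration
+TurboMind is the inference engine of LMDeploy. Before using it to run an LLM, the input model must be converted into a TurboMind model. Besides the model weights, the TurboMind model folder contains several other files. The most important one is the configuration file `triton_models/weights/config.ini`, which is closely tied to inference performance.
+If you are using LMDeploy 0.0.x, see the [TurboMind 1.0 configuration](#turbomind-10-配置) section for the relevant settings. If you are using LMDeploy 0.1.x, read [TurboMind 2.0 configuration](#turbomind-20-配置) for the details.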
+## TurboMind 2.0 配置
+Taking the `llama-2-7b-chat` model as an example, its `config.ini` in TurboMind 2.0 looks like this:
+```toml
+[llama]
+model_name = llama2
+tensor_para_size = 1
+head_num = 32
+kv_head_num = 32
+vocab_size = 32000
+num_layer = 32
+inter_size = 11008
+norm_eps = 1e-06
+attn_bias = 0
+start_id = 1
+end_id = 2
+session_len = 4104
+weight_type = fp16
+rotary_embedding = 128
+rope_theta = 10000.0
+size_per_head = 128
+group_size = 0
+max_batch_size = 64
+max_context_token_num = 1
+step_length = 1
+cache_max_entry_count = 0.5
+cache_block_seq_len = 128
+cache_chunk_size = 1
+use_context_fmha = 1
+quant_policy = 0
+max_position_embeddings = 2048
+rope_scaling_factor = 0.0
+use_logn_attn = 0
+```
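+These parameters consist of model attributes and inference parameters. The model attributes include the number of layers, the number of heads, dimensions and so on. They **must not be modified**: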
+```toml
+model_name = llama2
+head_num = 32
+kv_head_num = 32
+vocab_size = 32000
+num_layer = 32
+inter_size = 11008
+norm_eps = 1e-06
+attn_bias = 0
+start_id = 1
+end_id = 2
+rotary_embedding = 128
+rope_theta = 10000.0
+size_per_head = 128
+```
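+Compared with the TurboMind 1.0 config, the model attributes in the TurboMind 2.0 config are unchanged, but the inference parameters are different.
+The following sections focus on the inference parameters.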
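To make the split between model attributes and inference parameters concrete, here is a minimal sketch that reads a shortened `config.ini` with Python's standard `configparser`. The helper name `split_config`, the constant `MODEL_KEYS` and the abbreviated config text are assumptions made for illustration; they are not part of LMDeploy.

```python
import configparser

# Shortened copy of the config shown above (illustrative only).
CONFIG_TEXT = """
[llama]
model_name = llama2
head_num = 32
kv_head_num = 32
num_layer = 32
size_per_head = 128
max_batch_size = 64
cache_max_entry_count = 0.5
cache_block_seq_len = 128
"""

# Keys listed as model attributes in this document; they must not be edited.
MODEL_KEYS = {
    'model_name', 'head_num', 'kv_head_num', 'vocab_size', 'num_layer',
    'inter_size', 'norm_eps', 'attn_bias', 'start_id', 'end_id',
    'rotary_embedding', 'rope_theta', 'size_per_head',
}


def split_config(text):
    """Split the [llama] section into (model attributes, other parameters)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser['llama']
    model = {k: v for k, v in section.items() if k in MODEL_KEYS}
    others = {k: v for k, v in section.items() if k not in MODEL_KEYS}
    return model, others


model_attrs, infer_params = split_config(CONFIG_TEXT)
print(sorted(infer_params))
```

Note that `configparser` returns every value as a string, so numeric settings need an explicit conversion before use.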
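+### Data type
+The parameters related to data type are `weight_type` and `group_size`. They **must not be modified**.
+`weight_type` is the data type of the weights. Currently `fp16` and `int4` are supported, where `int4` means 4-bit weights. When `weight_type` is `int4`, `group_size` is the group size used when quantizing the weights with `awq`. The prebuilt LMDeploy packages currently use `group_size = 128`.
+### Batch size
+The maximum batch size is still set through `max_batch_size`. Its default value has changed from 32 to 64.
+In TurboMind 2.0, `max_batch_size` is independent of `cache_max_entry_count`.
+### k/v cache size
+`cache_block_seq_len` and `cache_max_entry_count` control the memory used by the k/v cache.
+TurboMind 2.0 implements Paged Attention and manages the k/v cache in blocks.
+`cache_block_seq_len` is the number of tokens one k/v block can hold; the default is 128. TurboMind computes the memory size of a k/v block with the following formula: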
+```
+cache_block_seq_len * num_layer * kv_head_num * size_per_head * 2 * sizeof(kv_data_type)
+```
+For the llama2-7b model, when k/v are stored as half, one k/v block takes `128 * 32 * 32 * 128 * 2 * sizeof(half) = 64MB`.
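The block-size formula can be checked with a few lines of Python. The function name `kv_block_bytes` is hypothetical; the arithmetic is exactly the formula above, with `sizeof(half)` taken as 2 bytes.

```python
def kv_block_bytes(cache_block_seq_len, num_layer, kv_head_num,
                   size_per_head, dtype_bytes):
    """Bytes of one k/v block; the factor 2 covers keys and values."""
    return (cache_block_seq_len * num_layer * kv_head_num *
            size_per_head * 2 * dtype_bytes)


# llama2-7b values from the config above, k/v stored as half (2 bytes).
block = kv_block_bytes(128, 32, 32, 128, 2)
print(block / 2**20)  # 64.0 (MiB)
```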
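+The meaning of `cache_max_entry_count` depends on its value:
+- When it is a decimal in (0, 1), `cache_max_entry_count` is the fraction of GPU memory used by k/v blocks. For example, an A100-80G has 80G of memory, so `cache_max_entry_count = 0.5` means the k/v blocks use 80 * 0.5 = 40G in total.
+- When the lmdeploy version is greater than 0.2.1, `cache_max_entry_count` is the fraction of **free** memory used for k/v blocks, and its default value is `0.8`. For example, when TurboMind loads a 13b model on an A100-80G GPU, the k/v blocks use `(80 - 26) * 0.8 = 43.2G`, that is, 80% of the remaining 54G.
+- When it is an integer greater than 1, it is the number of k/v blocks.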
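As a rough illustration of the rules above, the following sketch turns a `cache_max_entry_count` value into a memory budget and a block count. The function `kv_cache_budget` and its arguments are hypothetical, the ratio case follows the "fraction of free memory" interpretation, and rounding the block count down is an assumption rather than documented behaviour.

```python
def kv_cache_budget(cache_max_entry_count, free_mem_gib, block_mib):
    """Return (k/v budget in GiB, number of k/v blocks).

    A value in (0, 1) is treated as a fraction of free memory;
    an integer greater than 1 is treated as a block count.
    """
    if 0 < cache_max_entry_count < 1:
        budget_gib = free_mem_gib * cache_max_entry_count
        # Assumption: partial blocks are discarded.
        return budget_gib, int(budget_gib * 1024 // block_mib)
    if cache_max_entry_count > 1 and float(cache_max_entry_count).is_integer():
        count = int(cache_max_entry_count)
        return count * block_mib / 1024, count
    raise ValueError('expected a ratio in (0, 1) or an integer > 1')


# 13b model on an A100-80G: about 54G free, 64 MiB blocks assumed.
budget, blocks = kv_cache_budget(0.8, 80 - 26, 64)
print(round(budget, 1), blocks)
```

The 64 MiB block size used here is the llama2-7b figure from this document and is only a placeholder for a 13b model, whose blocks would be larger.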
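+`cache_chunk_size` is the number of k/v cache blocks allocated each time new blocks are needed. Its meaning depends on its value:
+- When it is an integer greater than 0, `cache_chunk_size` k/v cache blocks are allocated.
+- When it is -1, `cache_max_entry_count` k/v cache blocks are allocated.
+- When it is 0, `sqrt(cache_max_entry_count)` k/v cache blocks are allocated.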
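The three cases above map to a small decision function. This is a sketch only: `blocks_per_allocation` is a hypothetical helper, `max_block_count` is assumed to be `cache_max_entry_count` already resolved to a number of blocks, and rounding the square root down is an assumption.

```python
import math


def blocks_per_allocation(cache_chunk_size, max_block_count):
    """Number of k/v blocks allocated each time more blocks are needed."""
    if cache_chunk_size > 0:
        return cache_chunk_size
    if cache_chunk_size == -1:
        return max_block_count
    if cache_chunk_size == 0:
        # Assumption: the square root is rounded down to an integer.
        return math.isqrt(max_block_count)
    raise ValueError('cache_chunk_size must be -1, 0 or a positive integer')


for chunk in (1, -1, 0):
    print(chunk, blocks_per_allocation(chunk, 691))
```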
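+### kv int8 switch
+`quant_policy` is the switch for KV-int8 inference. See the [kv int8](../quantization/kv_int8.md) deployment guide for usage.
+### Long-context extrapolation switch
+By default `rope_scaling_factor = 0`, which provides no extrapolation. Setting it to 1.0 enables RoPE Dynamic NTK, which supports long-text inference.
+For the principle behind Dynamic NTK, see: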
+1. https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases
+2. https://kexue.fm/archives/9675
+Setting `use_logn_attn = 1` enables [LogN attention scaling](https://kexue.fm/archives/8823).
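For intuition, the sketch below shows one common formulation of Dynamic NTK: the RoPE base is enlarged once the sequence grows beyond `max_position_embeddings`. The formula is the one used by the dynamic NTK rotary embedding in Hugging Face transformers; whether TurboMind's kernels compute exactly the same expression is an assumption, and the function name `dynamic_ntk_base` is hypothetical.

```python
def dynamic_ntk_base(rope_theta, seq_len, max_position_embeddings,
                     rotary_dim, scaling_factor=1.0):
    """Return the RoPE base to use for a sequence of length seq_len."""
    if seq_len <= max_position_embeddings:
        return rope_theta  # within the trained window: base unchanged
    ratio = (scaling_factor * seq_len / max_position_embeddings
             - (scaling_factor - 1))
    return rope_theta * ratio ** (rotary_dim / (rotary_dim - 2))


# Values from the config above: rope_theta = 10000, window 2048, dim 128.
print(dynamic_ntk_base(10000.0, 2048, 2048, 128))
print(dynamic_ntk_base(10000.0, 4096, 2048, 128))
```

Doubling the sequence length slightly more than doubles the base, which stretches the rotary wavelengths instead of extrapolating positions beyond the trained range.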
+## TurboMind 1.0 配置
+Taking the `llama-2-7b-chat` model as an example, its `config.ini` in TurboMind 1.0 looks like this:
+```toml
+[llama]
+model_name = llama2
+tensor_para_size = 1
+head_num = 32
+kv_head_num = 32
+vocab_size = 32000
+num_layer = 32
+inter_size = 11008
+norm_eps = 1e-06
+attn_bias = 0
+start_id = 1
+end_id = 2
+session_len = 4104
+weight_type = fp16
+rotary_embedding = 128
+rope_theta = 10000.0
+size_per_head = 128
+group_size = 0
+max_batch_size = 32
+max_context_token_num = 4
+step_length = 1
+cache_max_entry_count = 48
+cache_chunk_size = 1
+use_context_fmha = 1
+quant_policy = 0
+max_position_embeddings = 2048
+use_dynamic_ntk = 0
+use_logn_attn = 0
+```
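+These parameters consist of model attributes and inference parameters. The model attributes include the number of layers, the number of heads, dimensions and so on. They **must not be modified**: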
+```toml
+model_name = llama2
+head_num = 32
+kv_head_num = 32
+vocab_size = 32000
+num_layer = 32
+inter_size = 11008
+norm_eps = 1e-06
+attn_bias = 0
+start_id = 1
+end_id = 2
+rotary_embedding = 128
+rope_theta = 10000.0
+size_per_head = 128
+```
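+The following sections focus on the inference parameters.
+### Data type
+The parameters related to data type are `weight_type` and `group_size`. They **must not be modified**.
+`weight_type` is the data type of the weights. Currently `fp16` and `int4` are supported, where `int4` means 4-bit weights. When `weight_type` is `int4`, `group_size` is the group size used when quantizing the weights with `awq`. The prebuilt LMDeploy packages currently use `group_size = 128`.
+### Batch size
+`max_batch_size` sets the maximum batch size during inference. In general, a larger batch gives higher throughput. Make sure that `max_batch_size <= cache_max_entry_count`.
+### k/v cache size
+TurboMind allocates k/v cache memory according to `session_len`, `cache_chunk_size` and `cache_max_entry_count`.
+- `session_len` is the maximum length of a sequence, that is, the size of the context window.
+- `cache_chunk_size` is the number of sequences whose k/v cache is allocated each time new conversation sequences are added.
+- `cache_max_entry_count` is the maximum number of conversation sequences that can be cached.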
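To get a feeling for how these three parameters translate into memory, the following back-of-the-envelope sketch estimates the k/v cache size for the 1.0 config above. It assumes the per-token cost is `num_layer * kv_head_num * size_per_head * 2 * sizeof(half)`, the same factors that appear in the TurboMind 2.0 block formula; the function names are hypothetical and the real allocator may add overhead.

```python
def kv_bytes_per_token(num_layer, kv_head_num, size_per_head, dtype_bytes=2):
    """Bytes of k/v cache per token; the factor 2 covers keys and values."""
    return num_layer * kv_head_num * size_per_head * 2 * dtype_bytes


def kv_cache_mib(session_len, entries, **model):
    """MiB needed to cache `entries` sequences of `session_len` tokens."""
    return session_len * entries * kv_bytes_per_token(**model) / 2**20


model = dict(num_layer=32, kv_head_num=32, size_per_head=128)
per_sequence = kv_cache_mib(4104, 1, **model)        # one cached sequence
per_chunk = kv_cache_mib(4104, 1, **model)           # cache_chunk_size = 1
upper_bound = kv_cache_mib(4104, 48, **model)        # cache_max_entry_count = 48
print(per_sequence, per_chunk, upper_bound)
```

Under these assumptions one sequence of 4104 tokens costs about 2 GiB, so the configured maximum of 48 cached sequences is an upper bound that is only approached as chunks are allocated on demand.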
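+### kv int8 switch
+To enable 8-bit k/v inference, the parameters `quant_policy` and `use_context_fmha` need to be changed. See the [kv int8](../quantization/kv_int8.md) deployment guide for details.
+### Long-context extrapolation switch
+Setting `use_dynamic_ntk = 1` enables the RoPE Dynamic NTK option, which supports long-text inference.
+For the principle behind Dynamic NTK, see: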
+1. https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases
+2. https://kexue.fm/archives/9675
+Setting `use_logn_attn = 1` enables [LogN attention scaling](https://kexue.fm/archives/8823).
--- a/docs/zh_cn/inference/vl_pipeline.md
+++ b/docs/zh_cn/inference/vl_pipeline.md
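+# VLM offline inference pipeline
+LMDeploy abstracts the complex inference process of vision-language models (VLM) into an easy-to-use pipeline. Its usage is similar to the large language model (LLM) inference [pipeline](./pipeline.md).
+The VLM pipeline currently supports the following models: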
+- [Qwen-VL-Chat](https://huggingface.co/Qwen/Qwen-VL-Chat)
+- LLaVA series: [v1.5](https://huggingface.co/collections/liuhaotian/llava-15-653aac15d994e992e2677a7e), [v1.6](https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2)
+- [Yi-VL](https://huggingface.co/01-ai/Yi-VL-6B)
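+We sincerely invite the community to add support for more VLM models to LMDeploy.
+This article uses the [liuhaotian/llava-v1.6-vicuna-7b](https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b) model as an example to show how the VLM pipeline is used. It starts with the most basic usage and then shows how to unlock more advanced features step by step by adjusting engine parameters and generation settings, such as tensor parallelism, context window size, random sampling, and chat template customization.
+In addition, we provide practical inference examples for scenarios such as multiple images and batched prompts.
+## "Hello, world" example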
+```python
+from lmdeploy import pipeline
+from lmdeploy.vl import load_image
+pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b')
+image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
+response = pipe(('describe this image', image))
+print(response)
+```
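+If an `ImportError` occurs when running this example, install the required dependencies as prompted.
+In the example above, the prompt is a (prompt, image) tuple. Besides this structure, the pipeline also accepts prompts in the openai format: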
+```python
+from lmdeploy import pipeline
+pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b')
+prompts = [
+    {
+        'role': 'user',
+        'content': [
+            {'type': 'text', 'text': 'describe this image'},
+            {'type': 'image_url', 'image_url': {'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg'}}
+        ]
+    }
+]
+response = pipe(prompts)
+print(response)
+```
+### Multi-GPU parallelism
+Setting the engine parameter `tp` enables multi-GPU (tensor) parallelism:
+```python
+from lmdeploy import pipeline, TurbomindEngineConfig
+from lmdeploy.vl import load_image
+pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b',
+                backend_config=TurbomindEngineConfig(tp=2))
+image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
+response = pipe(('describe this image', image))
+print(response)
+```
+### 设置上下文长度
+When creating the pipeline, set the engine parameter `session_len` to customize the maximum length of the context window:
+```python
+from lmdeploy import pipeline, TurbomindEngineConfig
+from lmdeploy.vl import load_image
+pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b',
+                backend_config=TurbomindEngineConfig(session_len=8192))
+image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
+response = pipe(('describe this image', image))
+print(response)
+```
+### Sampling parameters
+Pass a `GenerationConfig` to change the default sampling parameters of the pipeline's generation interface.
+```python
+from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig
+from lmdeploy.vl import load_image
+pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b',
+                backend_config=TurbomindEngineConfig(tp=2, session_len=8192))
+gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.6)
+image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
+response = pipe(('describe this image', image), gen_config=gen_config)
+print(response)
+```
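+### Chat template
+During inference, LMDeploy matches a built-in chat template based on the model path and applies it to the input prompt. However, for vision-language models such as [llava-v1.5-7b](https://huggingface.co/liuhaotian/llava-v1.5-7b), the chat template is vicuna, and this name cannot be derived from the model path, so the user has to specify it explicitly: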
+```python
+from lmdeploy import pipeline, ChatTemplateConfig
+from lmdeploy.vl import load_image
+pipe = pipeline('liuhaotian/llava-v1.5-7b',
+                chat_template_config=ChatTemplateConfig(model_name='vicuna'))
+image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
+response = pipe(('describe this image', image))
+print(response)
+```
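+For how to customize a chat template, see [here](../advance/chat_template.md).
+## Multi-image inference
+For multiple images, simply put them in a list at inference time. More images mean more input tokens, so you usually need to [increase the context length](#设置上下文长度).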
+```python
+from lmdeploy import pipeline, TurbomindEngineConfig
+from lmdeploy.vl import load_image
+pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b',
+                backend_config=TurbomindEngineConfig(session_len=8192))
+image_urls=[
+    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
+    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
+]
+images = [load_image(img_url) for img_url in image_urls]
+response = pipe(('describe these images', images))
+print(response)
+```
+## Batched prompts
+Running inference on a batch of prompts is simple: put them in a list.
+```python
+from lmdeploy import pipeline, TurbomindEngineConfig
+from lmdeploy.vl import load_image
+pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b',
+                backend_config=TurbomindEngineConfig(session_len=8192))
+image_urls=[
+    "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
+    "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg"
+]
+prompts = [('describe this image', load_image(img_url)) for img_url in image_urls]
+response = pipe(prompts)
+print(response)
+```
+## Multi-turn conversation
+There are two ways to hold a multi-turn conversation with the pipeline: construct the messages in the openai format, or use the `pipeline.chat` interface.
+```python
+from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
+from lmdeploy.vl import load_image
+pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b',
+                backend_config=TurbomindEngineConfig(session_len=8192))
+image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
+gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.6)
+sess = pipe.chat(('describe this image', image), gen_config=gen_config)
+print(sess.response.text)
+sess = pipe.chat('What is the woman doing?', session=sess, gen_config=gen_config)
+print(sess.response.text)
+```
--- a/docs/zh_cn/quantization/kv_int8.md
+++ b/docs/zh_cn/quantization/kv_int8.md
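+# KV cache quantization and test results
+For a LLaMa-7B fp16 model with a maximum length of 2048, the server needs about 1030MB of GPU memory to store the kv_cache for every concurrent session it creates. Even on an A100 80G, the number of users that can be served is very limited.
+To reduce runtime GPU memory, we implemented PTQ quantization of the kv cache, using the following formulas: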
+```bash
+zp = (min+max) / 2
+scale = (max-min) / 255
+quant: q = round( (f-zp) / scale)
+dequant: f = q * scale + zp
+```
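The formulas above can be tried out in plain Python. This is a minimal sketch: the function names are hypothetical, and clamping `q` to the int8 range [-128, 127] is an assumption added here, since the extreme values map to ±127.5 before rounding and could otherwise land on 128.

```python
def kv_quant_params(values):
    """Compute (zp, scale) from the observed min and max (requires max > min)."""
    lo, hi = min(values), max(values)
    zp = (lo + hi) / 2
    scale = (hi - lo) / 255
    return zp, scale


def quantize(f, zp, scale):
    q = round((f - zp) / scale)
    # Assumption: clamp to the int8 range so the result fits in 8 bits.
    return max(-128, min(127, q))


def dequantize(q, zp, scale):
    return q * scale + zp


values = [-1.0, 0.0, 0.5, 3.0]
zp, scale = kv_quant_params(values)
for v in values:
    q = quantize(v, zp, scale)
    print(v, q, dequantize(q, zp, scale))
```

Each value is recovered to within one quantization step (`scale`), which is the error budget the accuracy tests below are measuring.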
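+## How to enable KV cache INT8
+### **Step 1**
+Run the following command to obtain the quantization parameters and save them to the original HF model directory: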
+```bash
+# get minmax
+export HF_MODEL=internlm/internlm-chat-7b
+lmdeploy lite calibrate \
+  $HF_MODEL \
+  --calib-dataset 'ptb' \
+  --calib-samples 128 \
+  --calib-seqlen 2048 \
+  --work-dir $HF_MODEL
+```
+### **Step 2**
+Test the chat quality. Note that the option `--quant-policy 4` must be added to enable KV cache int8 mode.
+```bash
+lmdeploy chat turbomind $HF_MODEL --model-format hf --quant-policy 4
+```
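+## GPU memory test
+The test subject is the [internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) model.
+Test procedure:
+1. Convert the model with `deploy.py` and change the maximum concurrency in the `workspace` configuration; adjust the number of requests in `llama_config.ini`.
+2. Build and run `bin/llama_triton_example` to measure the GPU memory of the fp16 version at different batch sizes.
+3. Enable quantization and run `bin/llama_triton_example` again to measure the GPU memory of the int8 version at different batch sizes.
+The table below compares the GPU memory of the two versions: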
+| batch_size | fp16 memory(MiB) | int8 memory(MiB) | diff(MiB) |
+| :--------: | :--------------: | :--------------: | :-------: |
+|     8      |      22337       |      18241       |   -4096   |
+|     16     |      30593       |      22369       |   -8224   |
+|     32     |      47073       |      30625       |  -16448   |
+|     48     |      63553       |      38881       |  -24672   |
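+For comparison with direct weight quantization (such as [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa/)), we estimated how memory grows under the two approaches for a 7B model. Part of the data comes from [llama.cpp](https://github.com/ggerganov/llama.cpp).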
+![](../../../resources/batch_memory.png)
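+As shown, the fp16 version needs about 1030MB of GPU memory per concurrent session, so quantizing the kv_cache significantly slows the growth of runtime GPU memory.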
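The per-session figure can be derived directly from the memory table above by dividing the memory growth by the growth in batch size. The variable and function names in this sketch are illustrative only.

```python
# (batch_size, fp16 memory in MiB, int8 memory in MiB) from the table above.
rows = [
    (8, 22337, 18241),
    (16, 30593, 22369),
    (32, 47073, 30625),
    (48, 63553, 38881),
]


def mib_per_session(rows, column):
    """Memory growth per additional concurrent session (column 1=fp16, 2=int8)."""
    first, last = rows[0], rows[-1]
    return (last[column] - first[column]) / (last[0] - first[0])


fp16_per_session = mib_per_session(rows, 1)
int8_per_session = mib_per_session(rows, 2)
print(fp16_per_session, int8_per_session)  # 1030.4 516.0
```

The fp16 slope matches the roughly 1030MB per session quoted in this document, and the int8 slope is about half of it.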
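+## Accuracy test
+The test subject is the [internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) instruction model.
+The results below are for the `kCacheKVInt8` method, with PTQ performed using only 128 samples randomly selected from the c4 dataset. Accuracy was measured with [opencompass](https://github.com/InternLM/opencompass) both before and after quantization.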
+|     task      |     dataset     |    metric     | int8  | fp16  | diff  |
+| :-----------: | :-------------: | :-----------: | :---: | :---: | :---: |
+|   Language    |   winogrande    |   accuracy    | 60.77 | 61.48 | -0.71 |
+|   Knowledge   |       nq        |     score     | 2.69  | 2.60  | +0.09 |
+|   Reasoning   |      gsm8k      |   accuracy    | 33.28 | 34.72 | -1.44 |
+|   Reasoning   |       bbh       | naive_average | 20.12 | 20.51 | -0.39 |
+| Understanding | openbookqa_fact |   accuracy    | 82.40 | 82.20 | +0.20 |
+| Understanding |   eprstmt-dev   |   accuracy    | 90.62 | 88.75 | +1.87 |
+|    Safety     |   crows_pairs   |   accuracy    | 32.56 | 31.43 | +1.13 |
+Note that `kCacheKVInt8` and `WeightInt4` can be enabled at the same time. See [w4a16](./w4a16.md) to enable `WeightInt4`, then test the chat quality:
+```shell
+lmdeploy chat turbomind ./internlm-chat-7b-4bit --model-format awq --quant-policy 4
+```
--- a/docs/zh_cn/quantization/w4a16.md
+++ b/docs/zh_cn/quantization/w4a16.md
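+# INT4 model quantization and deployment
+LMDeploy uses the AWQ algorithm to quantize model weights to 4 bits. The TurboMind inference engine provides highly efficient 4-bit CUDA kernels whose performance is more than 2.4 times that of FP16. The following NVIDIA GPUs are supported:
+- Turing (sm75): 20 series, T4
+- Ampere (sm80, sm86): 30 series, A10, A16, A30, A100
+- Ada Lovelace (sm89): 40 series
+Before quantizing and deploying, make sure lmdeploy is installed.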
+```shell
+pip install lmdeploy[all]
+```
+This article consists of the following sections:
+<!-- toc -->
+- [模型量化](#模型量化)
+- [模型评测](#模型评测)
+- [模型推理](#模型推理)
+- [推理性能](#推理性能)
+- [推理服务](#推理服务)
+<!-- tocstop -->
+## 模型量化
+A single command is enough to quantize the model. When quantization finishes, the weight files are stored under `$WORK_DIR`.
+```shell
+export HF_MODEL=internlm/internlm-chat-7b
+export WORK_DIR=internlm/internlm-chat-7b-4bit
+lmdeploy lite auto_awq \
+   $HF_MODEL \
+  --calib-dataset 'ptb' \
+  --calib-samples 128 \
+  --calib-seqlen 2048 \
+  --w-bits 4 \
+  --w-group-size 128 \
+  --work-dir $WORK_DIR
+```
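+In most cases the optional arguments of the command above can be omitted and the defaults used. For example, to quantize the [internlm/internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) model, the command can be simplified to: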
+```shell
+lmdeploy lite auto_awq internlm/internlm-chat-7b --work-dir internlm-chat-7b-4bit
+```
+```{note}
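+We recommend that the --work-dir argument contain the model name, as in the example above. That way, the chat template does not need to be specified at inference time, because the inference interface uses fuzzy matching to pick the chat template closest to --work-dir.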
+```
+The quantized model can be checked quickly with a few tools.
+For example, chat with the model directly in the console:
+```shell
+lmdeploy chat turbomind ./internlm-chat-7b-4bit --model-format awq
+```
+Or start the gradio service:
+```shell
+lmdeploy serve gradio ./internlm-chat-7b-4bit --server-name {ip_addr} --server-port {port} --model-format awq
+```
+Then open http://{ip_addr}:{port} in a browser to chat online.
+## 模型评测
+We use [OpenCompass](https://opencompass.readthedocs.io/zh-cn/latest/index.html) to evaluate the capabilities of the quantized model along various dimensions.
+## 模型推理
+Offline inference with the quantized model takes only a few lines of code:
+```python
+from lmdeploy import pipeline, TurbomindEngineConfig
+engine_config = TurbomindEngineConfig(model_format='awq')
+pipe = pipeline("./internlm-chat-7b-4bit", backend_config=engine_config)
+response = pipe(["Hi, pls intro yourself", "Shanghai is"])
+print(response)
+```
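+For a detailed introduction to the pipeline, see [here](../inference/pipeline.md).
+Besides locally quantized models, LMDeploy can also run 4-bit AWQ-quantized models hosted on the huggingface hub directly, such as the models in the [lmdeploy space](https://huggingface.co/lmdeploy) and the [TheBloke space](https://huggingface.co/TheBloke).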
+```python
+# Run a model from the lmdeploy space
+from lmdeploy import pipeline, TurbomindEngineConfig
+pipe = pipeline("lmdeploy/llama2-chat-70b-4bit",
+                backend_config=TurbomindEngineConfig(model_format='awq', tp=4))
+response = pipe(["Hi, pls intro yourself", "Shanghai is"])
+print(response)
+# Run a model from the TheBloke space (try whether codellama works)
+from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
+pipe = pipeline("TheBloke/LLaMA2-13B-Tiefighter-AWQ",
+                backend_config=TurbomindEngineConfig(model_format='awq'),
+                chat_template_config=ChatTemplateConfig(model_name='llama2')
+                )
+response = pipe(["Hi, pls intro yourself", "Shanghai is"])
+print(response)
+```
+## 推理性能
+On an NVIDIA GeForce RTX 4090, we used [profile_generation.py](https://github.com/InternLM/lmdeploy/blob/main/benchmark/profile_generation.py) to measure the token generation speed of the 4-bit Llama-2-7B-chat and Llama-2-13B-chat models. The test configuration was batch size = 1 and (prompt_tokens, completion_tokens) = (1, 512).
+| model            | llm-awq | mlc-llm | turbomind |
+| ---------------- | ------- | ------- | --------- |
+| Llama-2-7B-chat  | 112.9   | 159.4   | 206.4     |
+| Llama-2-13B-chat | N/A     | 90.7    | 115.8     |
+## 推理服务
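+LMDeploy's `api_server` wraps a model as a service with a single command. The RESTful API it exposes is compatible with the openai interface. The following example starts the service: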
+```shell
+lmdeploy serve api_server ./internlm-chat-7b-4bit --backend turbomind --model-format awq
+```
+The default port of the service is 23333. Once the server is running, you can chat with it from a terminal through `api_client`:
+```shell
+lmdeploy serve api_client http://0.0.0.0:23333
+```
+You can also browse and try out the `api_server` endpoints online through Swagger UI at `http://0.0.0.0:23333`, or read the [documentation](../serving/api_server.md) for the definition and usage of each endpoint.
--- a/docs/zh_cn/quantization/w8a8.md
+++ b/docs/zh_cn/quantization/w8a8.md
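+# W8A8 LLM model deployment
+LMDeploy supports quantizing neural network models to 8-bit integers and running inference on them.
+Before running inference, make sure lmdeploy and openai/triton are installed correctly. They can be installed with: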
+```shell
+pip install lmdeploy
+pip install 'triton>=2.1.0'
+```
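+## Inference with 8-bit weight models
+For 8-bit weight model inference, you can download an already quantized 8-bit model directly from the LMDeploy [model zoo](https://huggingface.co/lmdeploy). Taking the 8-bit Internlm-chat-7B model as an example, it can be downloaded from the model zoo as follows: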
+```shell
+git-lfs install
+# coming soon
+git clone https://huggingface.co/lmdeploy/internlm-chat-7b-w8
+```
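+Alternatively, follow the ["8bit weight quantization"](#8bit-权重量化) section to manually quantize the original 16-bit weights to 8 bits and save them to the `internlm-chat-7b-w8` directory with the following command: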
+```shell
+lmdeploy lite smooth_quant internlm/internlm-chat-7b --work-dir ./internlm-chat-7b-w8
+```
+Then run the following command to chat with the model in the terminal:
+```shell
+lmdeploy chat torch ./internlm-chat-7b-w8
+```
+## Launching the gradio service
+Coming soon...
+## Inference speed
+Coming soon...
+## 8bit 权重量化
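+8-bit weight quantization involves the following three steps:
+1. **Weight smoothing**: the weights of the language model are first smoothed so that they are easier to quantize.
+2. **Module replacement**: the `RMSNorm` and `nn.Linear` modules in the original model's `DecoderLayer` are replaced with the `QRMSNorm` and `QLinear` modules. These quantized modules are defined in `lmdeploy/pytorch/models/q_modules.py`.
+3. **Saving the quantized model**: once the replacements are done, the new quantized model can be saved.
+All three steps are implemented in the `lmdeploy/lite/apis/smooth_quant.py` script. For example, the following command produces the quantized weights of the Internlm-chat-7B model: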
+```shell
+lmdeploy lite smooth_quant internlm/internlm-chat-7b --work-dir ./internlm-chat-7b-w8
+```
+After saving, you can instantiate the quantized model by calling the `from_pretrained` interface.
--- a/docs/zh_cn/serving/api_server.md
+++ b/docs/zh_cn/serving/api_server.md
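+# Serving an LLM with an OpenAI-compatible server
+This article describes how to deploy a single model on a single machine with multiple GPUs as a service compatible with the openai interface, and how to use the service APIs. For convenience, we call this service `api_server`. For serving multiple models in parallel, see [Request Distributor Server](./proxy_server.md).
+We first introduce two ways to start the service, so you can pick the one that fits your scenario.
+Next, we focus on the definition of the service's RESTful API and how to use it, and show how to try out the service through Swagger UI and the LMDeploy CLI tools.
+Finally, we demonstrate how to connect the service to a WebUI, which you can use as a reference to build a simple demo.
+## Launching the service
+Taking the [internlm2-chat-7b](https://huggingface.co/internlm/internlm2-chat-7b) model on the huggingface hub as an example, you can choose either of the following ways to start the inference service.
+### Option 1: using the lmdeploy CLI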
+```shell
+lmdeploy serve api_server internlm/internlm2-chat-7b --server-port 23333
+```
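+The arguments of api_server can be listed with `lmdeploy serve api_server -h`.
+For example, `--tp` sets tensor parallelism, `--session-len` sets the maximum context window length for inference, and `--cache-max-entry-count` adjusts the fraction of memory used by the k/v cache.
+### Option 2: using docker
+With the official LMDeploy [image](https://hub.docker.com/r/openmmlab/lmdeploy/tags), you can run an OpenAI-compatible service. Here is an example: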
+```shell
+docker run --runtime nvidia --gpus all \
+    -v ~/.cache/huggingface:/root/.cache/huggingface \
+    --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
+    -p 23333:23333 \
+    --ipc=host \
+    openmmlab/lmdeploy:latest \
+    lmdeploy serve api_server internlm/internlm2-chat-7b
+```
+In this example, the arguments of `lmdeploy serve api_server` are the same as in option 1.
+## RESTful API
+LMDeploy's RESTful API is compatible with the following three OpenAI endpoints:
+- /v1/chat/completions
+- /v1/models
+- /v1/completions
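+In addition, LMDeploy defines `/v1/chat/interactive` to support interactive inference. Unlike `/v1/chat/completions`, interactive inference does not require the client to pass the conversation history, because the history is cached on the server.
+This approach performs very well for multi-turn inference over long sequences.
+After the service starts, open http://0.0.0.0:23333 in a browser to view the detailed API description through Swagger UI. You can also operate the endpoints directly on the page to try out each one, as shown below.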
+![swagger_ui](https://github.com/InternLM/lmdeploy/assets/4560679/b891dd90-3ffa-4333-92b2-fb29dffa1459)
+You can also use the CLI tool that ships with LMDeploy to verify the service from the console.
+```shell
+# api_server_url is the URL printed when api_server starts, e.g. http://localhost:23333
+lmdeploy serve api_client ${api_server_url}
+```
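+To integrate the service into your own project or product, we recommend the following approaches:
+### Using the openai package
+The following code uses the `v1/chat/completions` service through the openai package. Install the package first with `pip install openai`.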
+```python
+from openai import OpenAI
+client = OpenAI(
+    api_key='YOUR_API_KEY',
+    base_url="http://0.0.0.0:23333/v1"
+)
+model_name = client.models.list().data[0].id
+response = client.chat.completions.create(
+    model=model_name,
+    messages=[
+        {"role": "system", "content": "You are a helpful assistant."},
+        {"role": "user", "content": "provide three suggestions about time management"},
+    ],
+    temperature=0.8,
+    top_p=0.8
+)
+print(response)
+```
+Other openai endpoints can be called in the same way. For details, see the official openai [documentation](https://platform.openai.com/docs/guides/text-generation).
+### Using the lmdeploy `APIClient`
+To use the `/v1/chat/completions` endpoint, try the following code:
+```python
+from lmdeploy.serve.openai.api_client import APIClient
+api_client = APIClient(f'http://{server_ip}:{server_port}')
+model_name = api_client.available_models[0]
+messages = [{"role": "user", "content": "Say this is a test!"}]
+for item in api_client.chat_completions_v1(model=model_name, messages=messages):
+    print(item)
+```
+To use the `/v1/completions` endpoint, try:
+```python
+from lmdeploy.serve.openai.api_client import APIClient
+api_client = APIClient(f'http://{server_ip}:{server_port}')
+model_name = api_client.available_models[0]
+for item in api_client.completions_v1(model=model_name, prompt='hi'):
+    print(item)
+```
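+The `/v1/chat/interactive` endpoint is disabled by default. To use it, set `interactive_mode = True`; otherwise it falls back to the openai-style behavior.
+In interactive inference, every conversation sequence must have a unique id, and all requests that belong to the same conversation must use that id. This id corresponds to `session_id` in the API.
+For example, if a conversation consists of 10 rounds of requests, every request must carry the same `session_id`.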
+```python
+from lmdeploy.serve.openai.api_client import APIClient
+api_client = APIClient(f'http://{server_ip}:{server_port}')
+messages = [
+    "hi, what's your name?",
+    "who developed you?",
+    "Tell me more about your developers",
+    "Summarize the information we've talked so far"
+]
+for message in messages:
+    for item in api_client.chat_interactive_v1(prompt=message,
+                                               session_id=1,
+                                               interactive_mode=True,
+                                               stream=False):
+        print(item)
+```
+### Using Java/Golang/Rust
+The code generator [openapi-generator-cli](https://github.com/OpenAPITools/openapi-generator-cli) can convert `http://{server_ip}:{server_port}/openapi.json` into a Java/Rust/Golang client.
+Here is an example:
+```shell
+$ docker run -it --rm -v ${PWD}:/local openapitools/openapi-generator-cli generate -i /local/openapi.json -g rust -o /local/rust
+$ ls rust/*
+rust/Cargo.toml  rust/git_push.sh  rust/README.md
+rust/docs:
+ChatCompletionRequest.md  EmbeddingsRequest.md  HttpValidationError.md  LocationInner.md  Prompt.md
+DefaultApi.md             GenerateRequest.md    Input.md                Messages.md       ValidationError.md
+rust/src:
+apis  lib.rs  models
+```
+### Using cURL
+cURL can also be used to inspect the API output.
+- List models with `v1/models`
+```bash
+curl http://{server_ip}:{server_port}/v1/models
+```
+- Chat with `v1/chat/completions`
+```bash
+curl http://{server_ip}:{server_port}/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "internlm-chat-7b",
+    "messages": [{"role": "user", "content": "Hello! How are you?"}]
+  }'
+```
+- Text completion with `v1/completions`
+```shell
+curl http://{server_ip}:{server_port}/v1/completions \
+  -H 'Content-Type: application/json' \
+  -d '{
+  "model": "llama",
+  "prompt": "two steps to build a house:"
+}'
+```
+- Interactive chat with `v1/chat/interactive`
+```bash
+curl http://{server_ip}:{server_port}/v1/chat/interactive \
+  -H "Content-Type: application/json" \
+  -d '{
+    "prompt": "Hello! How are you?",
+    "session_id": 1,
+    "interactive_mode": true
+  }'
+```
+## Connecting a WebUI
+LMDeploy offers two ways to connect a WebUI to api_server: gradio and [OpenAOE](https://github.com/InternLM/OpenAOE).
+### Option 1: gradio
+```shell
+# api_server_url is the URL produced by api_server, e.g. http://localhost:23333
+# server_name and server_port are the address at which the gradio UI is served
+# Example: lmdeploy serve gradio http://localhost:23333 --server-name localhost --server-port 6006
+lmdeploy serve gradio ${api_server_url} --server-name ${gradio_ui_ip} --server-port ${gradio_ui_port}
+```
+### Option 2: OpenAOE
+```shell
+pip install -U openaoe
+openaoe -f /path/to/your/config-template.yaml
+```
+For details, see the [deployment guide](https://github.com/InternLM/OpenAOE/blob/main/docs/tech-report/model_serving_by_lmdeploy/model_serving_by_lmdeploy.md).
+## FAQ
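+1. If the returned finish reason is `"finish_reason":"length"`, the session length exceeded the maximum. To change the maximum session length, set the `--session-len` option when launching `api_server`.
+2. If the server runs out of GPU memory (OOM), reduce `cache_max_entry_count` in the `backend_config` used to launch the service.
+3. If requests with the same `session_id` sent to `/v1/chat/interactive` return an empty string and a negative `tokens` value, the `session_id` state is probably corrupted. Turn interactive mode off and then on again.
+4. The `/v1/chat/interactive` API supports multi-turn conversation, but it is disabled by default. The `messages` or `prompt` argument can be either a plain string containing a single user question or a conversation history.
+5. Regarding stop words, only strings that encode to a single index are supported. Moreover, several different indices may decode to text containing the stop word. When there are too many such indices, only the index produced by the tokenizer's encoding is used. If you need a stop word that encodes to multiple indices, consider doing string matching on the streaming client side and breaking out of the streaming loop once a match is found.
+6. To customize the chat template, see [chat_template.md](../advance/chat_template.md).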
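The client-side workaround for multi-index stop words mentioned in the FAQ can be sketched as follows. The helper `stream_until_stop` is hypothetical and works on any iterable of text chunks; how the chunks are obtained from a streaming response is left out, and the stop string `<|end|>` is only an example.

```python
def stream_until_stop(chunks, stop):
    """Accumulate streamed text and cut it at the first occurrence of `stop`.

    Matching is done on the accumulated text, so a stop string that is
    split across two chunks is still detected.
    """
    text = ''
    for chunk in chunks:
        text += chunk
        idx = text.find(stop)
        if idx != -1:
            return text[:idx]  # leave the loop as soon as the stop string appears
    return text


# Simulated stream in which the stop string spans two chunks.
fake_stream = iter(['Hel', 'lo <|', 'end|> ignored'])
print(repr(stream_until_stop(fake_stream, '<|end|>')))  # 'Hello '
```

Returning early stops consuming the iterator, which is what lets the client drop the connection instead of reading the rest of the generation.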
--- a/docs/zh_cn/serving/api_server_vl.md
+++ b/docs/zh_cn/serving/api_server_vl.md
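+# Serving a VLM with an OpenAI-compatible server
+This article describes how to deploy a single VL model on a single machine with multiple GPUs as a service compatible with the openai interface, and how to use the service APIs. For convenience, we call this service `api_server`. For serving multiple models in parallel, see [Request Distributor Server](./proxy_server.md).
+We first introduce two ways to start the service, so you can pick the one that fits your scenario.
+Next, we focus on the definition of the service's RESTful API and how to use it, and show how to try out the service through Swagger UI and the LMDeploy CLI tools.
+Finally, we demonstrate how to connect the service to a WebUI, which you can use as a reference to build a simple demo.
+## Launching the service
+Taking the [llava-v1.6-vicuna-7b](https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b) model on the huggingface hub as an example, you can choose either of the following ways to start the inference service.
+### Option 1: using the lmdeploy CLI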
+```shell
+lmdeploy serve api_server liuhaotian/llava-v1.6-vicuna-7b --server-port 23333
+```
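+The arguments of api_server can be listed with `lmdeploy serve api_server -h`.
+For example, `--tp` sets tensor parallelism, `--session-len` sets the maximum context window length for inference, and `--cache-max-entry-count` adjusts the fraction of memory used by the k/v cache.
+### Option 2: using docker
+With the official LMDeploy [image](https://hub.docker.com/r/openmmlab/lmdeploy/tags), you can run an OpenAI-compatible service. Here is an example: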
+```shell
+docker run --runtime nvidia --gpus all \
+    -v ~/.cache/huggingface:/root/.cache/huggingface \
+    --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
+    -p 23333:23333 \
+    --ipc=host \
+    openmmlab/lmdeploy:latest \
+    lmdeploy serve api_server liuhaotian/llava-v1.6-vicuna-7b
+```
+In this example, the arguments of `lmdeploy serve api_server` are the same as in option 1.
+## RESTful API
+LMDeploy's RESTful API is compatible with the following three OpenAI endpoints:
+- /v1/chat/completions
+- /v1/models
+- /v1/completions
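+The endpoint used for image interaction is `/v1/chat/completions`, the same as in OpenAI.
+After the service starts, open http://0.0.0.0:23333 in a browser to view the detailed API description through Swagger UI. You can also operate the endpoints directly on the page to try out each one, as shown below.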
+![swagger_ui](https://github.com/InternLM/lmdeploy/assets/4560679/b891dd90-3ffa-4333-92b2-fb29dffa1459)
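+To integrate the service into your own project or product, we recommend the following approaches:
+### Using the openai package
+The following code uses the `v1/chat/completions` service through the openai package. Install the package first with `pip install openai`.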
+```python
+from openai import OpenAI
+client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
+model_name = client.models.list().data[0].id
+response = client.chat.completions.create(
+    model=model_name,
+    messages=[{
+        'role':
+        'user',
+        'content': [{
+            'type': 'text',
+            'text': 'Describe the image please',
+        }, {
+            'type': 'image_url',
+            'image_url': {
+                'url':
+                'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg',
+            },
+        }],
+    }],
+    temperature=0.8,
+    top_p=0.8)
+print(response)
+```
+### Using the lmdeploy `APIClient`
+To use the `/v1/chat/completions` endpoint, try the following code:
+```python
+from lmdeploy.serve.openai.api_client import APIClient
+api_client = APIClient('http://0.0.0.0:23333')
+model_name = api_client.available_models[0]
+messages = [{
+    'role':
+    'user',
+    'content': [{
+        'type': 'text',
+        'text': 'Describe the image please',
+    }, {
+        'type': 'image_url',
+        'image_url': {
+            'url':
+            'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg',
+        },
+    }]
+}]
+for item in api_client.chat_completions_v1(model=model_name,
+                                           messages=messages):
+    print(item)
+```
+### Using Java/Golang/Rust
+The code generator [openapi-generator-cli](https://github.com/OpenAPITools/openapi-generator-cli) can convert `http://{server_ip}:{server_port}/openapi.json` into a Java/Rust/Golang client.
+Here is an example:
+```shell
+$ docker run -it --rm -v ${PWD}:/local openapitools/openapi-generator-cli generate -i /local/openapi.json -g rust -o /local/rust
+$ ls rust/*
+rust/Cargo.toml  rust/git_push.sh  rust/README.md
+rust/docs:
+ChatCompletionRequest.md  EmbeddingsRequest.md  HttpValidationError.md  LocationInner.md  Prompt.md
+DefaultApi.md             GenerateRequest.md    Input.md                Messages.md       ValidationError.md
+rust/src:
+apis  lib.rs  models
+```
--- a/docs/zh_cn/serving/gradio.md
+++ b/docs/zh_cn/serving/gradio.md
+# Deploying a gradio service
+Starting a gradio service for an LLM with LMDeploy and chatting with the model in a WebUI is very simple; one command is enough.
+```shell
+pip install lmdeploy[serve]
+lmdeploy serve gradio {model_path}
+```
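+Replace `{model_path}` in the command above with a model id on the huggingface hub, such as internlm/internlm2-chat-7b, or with the local path of a model.
+For the full list of arguments, run `lmdeploy serve gradio --help`.
+## Creating a huggingface demo
+To create an online demo of a model on huggingface, follow the steps below.
+### Step 1: create a space
+First, register a huggingface account. After registration, click the avatar in the upper right corner and choose New Space.
+Pick the configuration you need following the huggingface guide. When you finish, you get an empty demo.
+### Step 2: write the demo entry point app.py
+Taking the `internlm/internlm2-chat-7b` model as an example, fill in `app.py` in the space with: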
+```python
+from lmdeploy.serve.gradio.turbomind_coupled import run_local
+from lmdeploy.messages import TurbomindEngineConfig
+backend_config = TurbomindEngineConfig(max_batch_size=8)
+model_path = 'internlm/internlm2-chat-7b'
+run_local(model_path, backend_config=backend_config, server_name="huggingface-space")
+```
+Create a `requirements.txt` text file listing the following package:
+```
+lmdeploy
+```
+## FAQs
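+- ZeroGPU compatibility. ZeroGPU does not work with the LMDeploy TurboMind engine. Choose a regular GPU, or change the backend_config in the code above to `PytorchEngineConfig` to use ZeroGPU.
+- gradio version. Versions above 4.0.0 are not supported at the moment. You can pin the version in `app.py`, for example: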
+  ```python
+  import os
+  os.system("pip uninstall -y gradio")
+  os.system("pip install gradio==3.43.0")
+  ```